2012年度バイオスタティスティクス基礎論テキスト3

Introduction to Biostatistics 2012, Text 3

西田洋巳Hiromi Nishida

講義・実習日:2012年5月9日（水）

[3-1] 第1種の過誤と第2種の過誤Type I and type II errors

第1種: 帰無仮説が正しいときにそれを棄却

第2種: 帰無仮説が間違っているときにそれを採択

par(mfrow=c(2, 2))

a <- c()

for(i in 1:1000) a[i] <- t.test(runif(10), runif(10))$p.value

length(a[a<0.05])

同じ母集団（ここでは一様分布）から10のサンプリングを独立に行い、それらのt検定を1000回行った結果についてp値<0.05の数を数える

Count the number of p < 0.05

hist(a)

サンプルサイズを1000にした場合には

for(i in 1:1000) a[i] <- t.test(runif(1000), runif(1000))$p.value

length(a[a<0.05])

hist(a)

母集団が異なる（ここでは一様分布と正規分布）の場合には

for(i in 1:1000) a[i] <- t.test(runif(10), rnorm(10))$p.value

length(a[a<0.05])

hist(a)

異なる母集団でサンプルサイズを1000にした場合には

for(i in 1:1000) a[i] <- t.test(runif(1000), rnorm(1000))$p.value

length(a[a<0.05])

hist(a)

[3-2]検出力Statistical power

有意水準（危険率）が真の仮説を棄却する確率を示し、その値を小さくすればするほど、偽の仮説が棄却されない場合が増す。検出力とは偽の仮説が棄却される確率を言う。通常、危険率Significance level=0.05、検出力=0.8となる標本の大きさが必要となる

Significance level is the probability that the test rejects the null hypothesis when it is true. On the other hand, statistical power is the probability that the test rejects the null hypothesis when it is false.

例えば、検出力0.8、有意水準0.05、標準偏差2の分布において0.5の差を検定するために必要な標本の大きさは

power.t.test(power=0.8, sig.level=0.05, sd=2, delta=0.5)

により253が求まる

標本の大きさ 200、有意水準0.05、標準偏差2の分布において0.5の差を検定するときの検出力は

power.t.test(n=200, sig.level=0.05, sd=2, delta=0.5)

により0.703が求まる

割合の比較:成功率60％から75％に上昇したことを80％の検出力で検定するための数を求める場合は

power.prop.test(power=0.8, p1=0.6, p2=0.75)

によりサンプルサイズ152が求まる

他にpower.anova.testなど

[3-3] 対応のあるデータPaired data

例えば、9人が試験Aおよび試験Bを受け、それぞれ100点満点で評価され、その結果は下記の通りであった

Nine persons took the examinations A and B. The scores are following.

Persons #1, #2, #3, #4, #5, #6, #7, #8, #9

Scores of A: 11, 32, 23, 44, 35, 26, 37, 38, 19

Scores of B: 20, 31, 32, 33, 44, 35, 46, 47, 28

試験AとBの難易度には差がないと言えるか検定する

試験の結果はそれぞれ対応Paired (Dependent)しており独立Unpaired (Independent)ではないので

x <- c(11, 32, 23, 44, 35, 26, 37, 38, 19)

y <- c(20, 31, 32, 33, 44, 35, 46, 47, 28)

t.test(x, y, var.equal=T); wilcox.test(x, y)とするのは間違い

t.test(x, y, paired=T); wilcox.test(x, y, paired=T)

としなければならない

なお、これらは

t.test(x - y); wilcox.test(x - y)

に等しい

1標本の検定においては、デフォルトでμ = 0となっているため

[3-4] 回帰と相関Regression and correlation

データyをデータxの関数として表したいとき、線形回帰を用いて二つの変数の関係を示すExpress the relationships between the two variables

線形回帰モデルLinear regression modelとして、y[i] = a + bx[i] + residual[i]

とした場合、残差平方和Sum of residual squared = ∑(y[i] – a – bx[i])²を最小にするa、bを求める。このa、bをy切片、傾きとする直線を回帰直線 Linear regression lineと呼ぶ

lm(y~x)

線形モデルLinear model

y, dependent variable

x, independent variables

summary(lm(y~x))

plot(x,y)

abline(lm(y~x))

fitted(lm(y~x))

は当てはめ値Fitted valuesを示す

resid(lm(y~x))

は残差Residualを示す

実測値Measured valuesと当てはめ値を直線で結び、残差を示した図は

segments(x, fitted(lm(y~x)), x, y)

残差の分布が正規分布に従っているかどうかの目安として

qqnorm(resid(lm(y~x)))

参考までに

∑(y[i] - a - bx[i])²

= ∑y[i]² + ∑a² + ∑b²x[i]² - 2∑ay[i] + 2∑abx[i] - 2∑bx[i]y[i]

= ∑y[i]² + na² + b²∑x[i]² - 2a∑y[i] + 2ab∑x[i] - 2b∑x[i]y[i]

aで偏微分して

2na - 2∑y[i] + 2b∑x[i] = 0

よって

a = (∑y[i] - b∑x[i])/n

同様にbで偏微分して

2b∑x[i]² + 2a∑x[i] - 2∑x[i]y[i] = 0

よって

b∑x[i]² + (∑x[i]∑y[i] - b∑x[i]∑x[i])/n – ∑x[i]y[i] = 0

b(∑x[i]² – ∑x[i]∑x[i]/n) + ∑x[i]∑y[i]/n - ∑x[i]y[i] = 0

以上より

b = (∑x[i]y[i] – ∑x[i]y[i]/n)/(∑x[i]² – ∑x[i]∑x[i]/n)

a = ∑y[i]/n – (Sxy/Sxx)∑x[i]/n

Sxy = ∑x[i]y[i] – ∑x[i]y[i]/n

Sxx = ∑x[i]² – ∑x[i]∑x[i]/n

[3-5] 相関係数Correlation coefficient

ピアソンKarl Pearson (1857-1936)の積率相関係数

Pearson product-moment correlation coefficient

r = ∑((x[i]-xの平均)(y[i]-yの平均))/sqrt(∑(x[i]-xの平均)²・∑(y[i]-yの平均)²)

スピアマンCharles Edward Spearman (1863-1945)の順位相関係数

Spearman rank correlation coefficientではx[i]、y[i]がそれぞれにおける順位となる

ケンドールMaurice George Kendall (1907-1983)の順位相関係数

Kendall rank correlation coefficient

(x1 - x2)＊(y1 - y2) < 0

(x1 - x3)＊(y1 - y3) > 0

n個のデータから二つを取り出す組み合わせはn(n-1)/2通りあり、それらの関係は正または負となる。正のときに1、負のときに-1を与え、総和をn(n-1)/2で割ると、そのときの値(τ)は-1≦τ≦1となる。このときのτをケンドールの順位相関係数という

相関係数に関する検定Test for association between paired samples

帰無仮説: 相関なし(相関係数が0)

Null hypothesis: correlation coefficient = 0

ピアソンの積率相関係数の有意性検定

cor.test(x, y, method="pearson")

method="p"でも可、default

検定統計量Statistic: r×sqrt(n-2)/sqrt(1-r²)

自由度Degree of freedom n-2のt分布

スピアマンの順位相関係数の有意性検定

cor.test(x, y, method="spearman")

method="s"でも可

ケンドールの順位相関係数の有意性検定

cor.test(x, y, method="kendall")

method="k"でも可

cor(x, y)*sd(y)/sd(x) = 回帰直線の傾きSlope of the regression line

また、回帰直線は主成分分析Principal components analysisの第一主成分とは異なる

[3-3]のx、yで確認

prcomp(data.frame(x, y))

summary(prcomp(data.frame(x, y)))

[3-6] 実習ヌクレオソーム位置の保存度比較Comparison between nucleosome position profiles

出芽酵母ヒストンアセチル化酵素Elp3がヌクレオソームの位置に与える影響を調べるため、その遺伝子欠失株のヌクレオソームマップをコントロールと比較する。本実習では第一染色体の遺伝子におけるプロモータ領域（翻訳開始上流1000塩基領域）およびボディ領域（翻訳開始から終結までの領域）のヌクレオソームマップを比較する。

In order to elucidate the influence of the histone acetyltransferase Elp3 on positions of nucleosomes, compare two nulceosome maps of control and the ELP3 deletion mutant of Saccharomyces cerevisiae. In this practice, we compare the nucleosome maps in the gene promoter (region from 1 kb upstream of the translational start site) and in the gene body (region from the translational start site to the end) of genes on the chromosome 1.

1) データのダウンロードDownload data and executable statements

2) データのRへの読み取りRead the data into R

a <- read.table("cont1.txt")

A <- read.table("mut1.txt")

chr1p <- read.table("chr1p.txt")

3) 比較する範囲（塩基位置）を決めるDetermine the region on the chromosome

x <- chr1p[1, 1]

y <- chr1p[1, 2]

4) データを抽出するExtract the data in the region

a1 <- a[a[,1]>=x & a[,2]<=y, ]

5) データを整理するArrange the extracted data

a2 <- data.frame(a1[,1]-x+1, a1[,2]-x+1)

a3 <- y-x+1

b <- numeric(nrow(a2)*a3)

for(i in 1:nrow(a2)) b[seq(a3*i+a2[i,1]-a3, a3*i+a2[i,2]-a3)] <- 1

b1 <- matrix(b, nrow=nrow(a2), byrow=T)

b2 <- apply(b1, 2, sum)

6) プロファイルを確認するConfirm the data

plot(b2, type="h")

7) 4)と同じことを変異株でも行うDo the same operation for the mut1

A1 <- A[A[,1]>=x & A[,2]<=y, ]

A2 <- data.frame(A1[,1]-x+1, A1[,2]-x+1)

B <- numeric(nrow(A2)*a3)

for(i in 1:nrow(A2)) B[seq(a3*i+A2[i,1]-a3, a3*i+A2[i,2]-a3)] <- 1

B1 <- matrix(B, nrow=nrow(A2), byrow=T)

B2 <- apply(B1, 2, sum)

8) コントロールと変異株のプロファイルを比較するCompare between the control and the mutant

plot(b2, type="l", ylim=c(0,50), xlab="", ylab="")

par(new=T)

plot(B2, type="l", ylim=c(0, 50), col="blue", xlab="", ylab="")

9) 相関係数を計算するCalculate the correlation coefficient

cor(b2, B2)

10) Identify nucleosome free region and exclude the region. [各領域のヌクレオソームの数をカウント]

source("R_count.txt")

11) 10)の結果において数がゼロの領域を調べる

D[D[,1]==0 | D[,2]==0 | D[,3]==0 | D[,4]==0,]

12) ゼロの領域を除いた領域だけを対象とする

chr1p <- chr1p[c(1:15, 17:34, 36:38, 41:45, 47:66, 68:78, 80:94),]

chr1b <- chr1b[c(1:15, 17:34, 36:38, 41:45, 47:66, 68:78, 80:94),]

13) 3)から9)の操作をすべての遺伝子について行うExecute R_Chr1_pro.txt

source("R_Chr1_pro.txt")

14) 相関係数の分布の様子を見るDistribution of the correlation coefficients

hist(datapro)

boxplot(datapro)

15) 13)と同様にしてdatabodを作成するExecute R_Chr1_bod.txt

source("R_Chr1_bod.txt")

16) 各遺伝子のプロモータ領域およびボディ領域の相関係数を比較するCompare the correlation coefficients between datapro and databod

DATA <- data.frame(datapro, databod)

boxplot(DATA)

matplot(DATA, type="b", ylab="Correlation", xlab="gene")

matplot(t(DATA), type="l", xaxt="n", xlab="Region", ylab="Correlation"); axis(1, c(1, 2), c("Promoter", "Body"))

nrow(DATA[DATA[,1] > DATA[,2], ])

nrow(DATA[DATA[,1] < DATA[,2], ])

[3-7] 実習ゲノムシグニチャー比較Comparison of genome signatures

Data: gs.txt（連続する 2 塩基の各ゲノム DNA における出現頻度データ）

Ecoli, Escherichia coli; Bsublitis, Bacillus subtilis; Mgenitalium, Mycoplasma genitalium; Sgriseus, Streptomyces griseus

1) データのダウンロードDownload gs.txt

2) データのRへの読み込みRead the two data into R

gs <- read.table("gs.txt")

3) EcoliとBsubtilisのゲノムシグニチャーの相関係数を出す

cor(gs$Ecoli, gs$Bsubtilis)

4) すべての生物間のゲノムシグニチャーの相関係数を出す

cor(gs)

[3-8] 階層的クラスタリングHierarchical clustering

先の[3-3]の数値変化をデータフレーム化

x <- c(11, 32, 23, 44, 35, 26, 37, 38, 19)

y <- c(20, 31, 32, 33, 44, 35, 46, 47, 28)

d <- data.frame(x, y)

plot(d)

各ポイント間の距離を求める

dist(d)

Distance matrix computed by using the Euclidean distance measure to compute the distances between the rows of the data matrix

距離データに基づく階層的クラスタリング

plot(hclust(dist(d)))

Hierarchical clustering

散布図とクラスタリングを比較すると

par(mfrow=c(1, 2))

plot(d, pch=c("1", "2", "3", "4", "5", "6", "7", "8", "9")); plot(hclust(dist(d)))

[3-7]に基づく次の2つのクラスタリングを比較

plot(hclust(dist(gs)))

plot(hclust(dist(t(gs))))

heatmap(as.matrix(gs))

平均連結法Unweighted pair-group method using arithmetic averages

非階層的クラスタリング

data <- kmeans(as.matrix(data.frame(x, y)), 3)

plot(data.frame(x, y), col=data$cluster)

points(data$centers, col=1:3, pch=4)

[3-9] コルモゴロフ・スミルノフ検定Kolmogorov-Smirnov test

累積相対度数分布比較による検定

Test based on cumulative relative frequency distribution comparison

Andrey Nikolaevich Kolmogorov (1903–1987)

Nikolay Vasilyevich Smirnov (1900–1966)

1)累積確率分布Cumulative probability distribution

x <- seq(-5, 5, 0.1)

y1 <- dnorm(x)

確率密度関数値Probability density function

y2 <- pnorm(x)

累積確率分布関数値

par(mfrow=c(2, 2))

plot(x, y1); plot(x, y2)

Y1 <- dunif(x, -5, 5)

Y2 <- punif(x, -5, 5)

plot(x, Y1); plot(x, Y2)

2)コルモゴロフ・スミルノフ検定Kolmogorov-Smirnov test

u <- runif(50)

50 random numbers from a uniform distribution (in the range from 0 to 1)

mean(u)

Nearly 0.5

n <- rnorm(50, 0.5)

50 random numbers from a normal distribution (mean=0.5, sd=1)

mean(n)

Nearly 0.5

par(mfrow=c(2,1))

hist(u); hist(n)

Different distributions

var.test(u, n)

Different variances

t.test(u, n)

t test

wilcox.test(u, n)

Wilcoxon rank sum test

ks.test(u, n)

Kolmogorov-Smirnov test