オーム情報解析

Omics Analysis 2010/06/15 Transcriptome analysis

オーム情報解析 2010/06/15－トランスクリプトームデータ解析実習－

1. Introduction

We use a free statistical software R and Active Perl in this practice.

1. はじめに

本実習では、統計解析フリーソフトRとActive Perlを用います。
実習用ノートPCには「R version 2.X.X」が予めインストールされています。自分のWindows PCへのRのインストール法はこちらを参考にしてください。

2. Gene expression data

We analyze a public gene expression matrix consisting of 36 normal human tissues and 22,283 clones (genes).[Journal site][PubMed(15950434)][GEO ID (GSE2361)].
Please create a folder named "20100615" on your Desktop and download following six files (data.txt, annotation.txt, template.txt, hash.pl, kadai1_command.txt, and kadai1_template.txt) to the folder.

2. 実習で用いるマイクロアレイデータ

ヒトの様々な正常組織から得られた遺伝子発現プロファイルデータ（22,283 clones×36 tissues）。
Original paper：Ge et al., Genomics. 86:127-141, 2005.[Journal site][PubMed(15950434)][GEO ID (GSE2361)]
デスクトップに「20100615」という名前のフォルダを作成し、そのフォルダの中に以下の6つのファイルをダウンロードしておいてください。

Data file：data.txt
Annotation file：annotation.txt
Template file：template.txt
Perl program：hash.pl
For exercise 1 (command file)：kadai1_command.txt
For exercise 1 (template file)：kadai1_template.txt

3. We will use a pattern matching (or template matching) method for the identification of tissue-specific genes in this practice.

Schematic illustration of the template matching is here (Pavlidis and Noble, 2001). For example, you can get a subset of genes expressed highly in the "third" tissue using a template pattern such as (0,0,1,0,0,...). A similarity measure (specifically, Pearson correlation coefficient) between gene i in the dataset and the template pattern, S(gene i), is calculated. After the calculation, we get 22,283 correlation coefficients, i.e., S(gene 1), S(gene 2), ..., S(gene 22283). A high value, S(gene i), for gene i indicates high similarity to the template that you used. Accordingly, you can get any type of tissue-specific genes (single-tissue specific high expression pattern, three-tissue specific low expression pattern, and so on) by selecting top-ranked genes to the template pattern you defined.

3. 実習で行うマイクロアレイ解析手法

パターンマッチング法。
任意の発現パターンに似た発現パターンを示す遺伝子群を抽出する方法であり、幅広く利用されています：

癌と正常サンプル間などの二群間で発現の異なる遺伝子群の同定（Shimoji et al., BBRC, 2004）
任意の組織で特異的に発現している遺伝子（組織特異的発現遺伝子、組織選択的発現遺伝子）群の同定（Pavlidis and Noble, Genome Biol., 2001）
出芽酵母の細胞周期に関連する遺伝子群の同定（Johansson et al., Bioinformatics, 2003）
他にも、パターンマッチング法の一種であるCosinor法（sine or cosineカーブをテンプレートパターンとし、似たパターンを示す遺伝子群を同定）なども概日（サーカディアン）リズム関連遺伝子の探索に用いられています。

実習では、「任意の組織で特異的に発現している遺伝子の同定」を行います。

4. Practice (Pattern matching using R)

Open R and change the working (or current) directory to "20100615" you created.
Prepare a file (template.txt) including template pattern for detecting heart-specific genes. In this practice, you don't have to prepare the file because you already have the file (template.txt) in "20100615" folder.

Reading two files (data.txt and template.txt) with the following commands (It's OK by just doing "copy and paste".).

#------    Start    ------
library(genefilter)                       #Open "genefilter" library to use a function, genefinder, in the library for pattern-matching
data_tmp <- read.table("data.txt", header=TRUE, row.names=1)#Reading gene expression data in the file (data.txt) and we use the data as "data_tmp"
data <- as.matrix(data_tmp)               #Exchanging data format (from "data.frame") to "matrix" for object "data_tmp" using the function "as.matrix" and we use the "data_tmp" after formatting change as "data"
template_tmp <- read.table("template.txt")#Reading template data in the file (template.txt) and we use the template data as "template_tmp"
template <- template_tmp[,2]              #Extracting the information of the second column in the "template_tmp" (i.e., template.txt) and we will use the information as "template"
template                                  #Checking the information of "template". The gray color commands indicate "not essential."

#------     End     ------

Extracting the top 10 genes having highest similarities to the template pattern.

#------    Start    ------
dim(data)                          #Showing the numbers of rows and coloumns of an object "data"
tmp <- rbind(data, template)       #Combining two objects ("data" and "template") by row. The "template" is added below "data" and we use the combined one as "tmp". You can refer the detailed usage of the function, rbind, by typing "?rbind".
dim(tmp)                           #Showing the numbers of rows and coloumns of an object "tmp". You can see the row number of "tmp" becomes the row number of "data" plus 1. This is because one row is added for "data".
template_posi <- which(rownames(tmp) == "template")#We use a row number whose row name is "template" as template_posi. Practically, the "template_posi" can be replaced by "nrow(tmp)".
gene_num <- 10                     #Setting the number of top-ranked genes as 10.
closeg <- genefinder(tmp, template_posi, gene_num, method="correlation")#Extracting the top "gene_num" genes from the gene expression matrix "tmp" having highest similarities to the template pattern defined as that in the "template_posi"th line (or row) in the matrix "tmp". We use the results as "closeg"
closeg[[1]]$indices                #Row numbers for the top-ranked genes
rownames(tmp[closeg[[1]]$indices,])#Gene IDs for the genes. The underline part indicates those expression data.
closeg[[1]]$dists                  #Similarity values defined as "1 - correlation coefficient". These values can be changed if the corresponding parameter in the "genefinder" function is changed. For example, method="euclidean". For details, type "?genefinder".

#------     End     ------

Save the expression data for the top-ranked genes as the output file name of "top10_heart.txt".

#------    Start    ------
tmp2 <- cbind(rownames(tmp[closeg[[1]]$indices,]), tmp[closeg[[1]]$indices,])#Two objects (Gene IDs and the expression data for the "gene_num" top-ranked genes) are combined by column and we use the combined one as "tmp2".
write.table(tmp2, "top10_heart.txt", sep = "\t", quote=F, row.names=F)
#You may think "write.table(tmp[closeg[[1]]$indices,], "top10_heart_hoge.txt", sep = "\t", row.names=T)" is more simple compared to the above two commands.
#It is of course OK. But... Open the "top10_heart_hoge.txt" using EXCEL and compare it with "top10_heart.txt". I believe you can understand why I recommend above somewhat complicated commands.

#------     End     ------

If you would like to save only the IDs, type as follows:

#------    Start    ------
write.table(rownames(tmp[closeg[[1]]$indices,]), "top10_heart_ID.txt", quote=F, row.names=F, col.names=F)

#------     End     ------

Making pseudo-color image

#------    Start    ------
heatmap(as.matrix(tmp[closeg[[1]]$indices,]), Rowv =NA, Colv=NA, scale="row", col = heat.colors(20), main="Best 10", xlab="Tissue", ylab="Clone ID", margin=c(8,10))

#------     End     ------

4. Rでパターンマッチング

例題として、22,283遺伝子の中から「心臓（heart）」で特異的に発現している遺伝子上位10個をリストアップする作業を行います。作業の大まかな流れは以下の通りです：
① 目的組織（この場合、心臓）で1、それ以外の組織で0とした「テンプレートパターン」を用意
②22,283行（遺伝子）×36列（組織）からなる遺伝子発現行列の一行一行について、「テンプレートパターン」との類似度を計算
③指定した数（この場合、10個）の遺伝子IDなどを類似度の高い順にリストアップ

Rを起動し、「ファイル」－「ディレクトリの変更」を選択し、デスクトップ上に作成した20100615に作業ディレクトリを変更する。
目的組織（心臓）で1、それ以外の組織で0とした「テンプレートパターン」を含んだファイル（template.txt）を予め作成。
注意点：22,283行（遺伝子）×36列（組織）からなる遺伝子発現行列を含んだファイル(data.txt)中の組織の並びと、template.txt中の組織の並びは同じにしておく必要があります。

22,283遺伝子の中から心臓で特異的に発現している遺伝子の上位10個をリストアップします。
まず、以下をコピー＆ペーストして、パターンマッチングを行って候補遺伝子群を抽出したいファイル（data.txt）と「テンプレートパターン」を含んだファイル（template.txt）を読み込みます。

#------    ここから    ------
library(genefilter)                       #パッケージの読み込み
data_tmp <- read.table("data.txt", header=TRUE, row.names=1)#data.txtファイルを読み込んでdata_tmpに格納（これをそのまま使うわけではないので"_tmp"を付加している）
data <- as.matrix(data_tmp)               #as.matrixの意味は、「データの型を"行列として(as matrix)"dataに格納せよ」です。（read.tableで読み込んで得られたdata_tmpのデータの型は"データフレーム"なので、そのままではこの場合は使用不可なのでやる必要があります）
template_tmp <- read.table("template.txt")#template.txtファイルを読み込んでtemplateに格納
template <- template_tmp[,2]              #テンプレート情報（2列目）のみ抽出し、templateに格納
template                                  #テンプレート情報の確認

#------    ここまで    ------

genefinder()関数を使ってテンプレートパターンに最も似た発現パターンをもつ上位10個を抽出します。この際、類似度は1－相関係数（method=”correlation”）で評価することにします。

#------    ここから    ------
dim(data)                          #行列dataの行数と列数を表示
tmp <- rbind(data, template)       #templateというテンプレートパターンを行列dataの後に追加
dim(tmp)                           #行列dataにテンプレートパターンが追加された行列tmpの行数と列数を表示
template_posi <- which(rownames(tmp) == "template")#行のラベル情報が"template"の行番号をtemplate_posiとして格納
gene_num <- 10                     #上位10個という情報をgene_numに格納
closeg <- genefinder(tmp, template_posi, gene_num, method="correlation")#行列tmpの中で、template_posi行目の遺伝子発現プロファイルと近い上位gene_num個の情報をclosegに格納
closeg[[1]]$indices                #上位10個の行番号を表示
rownames(tmp[closeg[[1]]$indices,])#上位10個の遺伝子ID（AFFX_ID）を表示。underline部分が上位10個の遺伝子発現データに相当する。
closeg[[1]]$dists                  #上位10個の類似度を表示

#------    ここまで    ------

参考：一連の計算はExcelでもできます。計算を行ったExcelファイルはここに置いてあります。

上位10個の遺伝子発現プロファイル情報を取り出し、ファイル（ファイル名：top10_heart.txt）に保存します：

#------    ここから    ------
tmp2 <- cbind(rownames(tmp[closeg[[1]]$indices,]), tmp[closeg[[1]]$indices,])#上位10個の遺伝子ID（AFFX_ID）情報とその遺伝子発現情報を列方向で結合（column方向でbind）し、結果をtmp2に格納
write.table(tmp2, "top10_heart.txt", sep = "\t", quote=F, row.names=F)  #結果をファイル（ファイル名：top10_heart.txt）に保存。

#------    ここまで    ------

遺伝子ID情報のみ取り出してファイル（ファイル名：top10_heart_ID.txt）に保存したい場合には以下のようにします：

#------    ここから    ------
write.table(rownames(tmp[closeg[[1]]$indices,]), "top10_heart_ID.txt",#遺伝子IDに関する情報のみ、top10_heart_ID.txt
quote=F, row.names=F, col.names=F)                                    #というファイル名で保存したい場合。

#------    ここまで    ------

上位10個の遺伝子発現データの入ったファイルtop10_heart.txtを読み込み、pseudo-color imageを作成します。

#------    ここから    ------
data <- read.table("top10_heart.txt", header=TRUE, row.names=1)                #ファイルの読み込み
heatmap(as.matrix(data), Rowv =NA, Colv=NA, scale="row", col = heat.colors(20),#ヒートマップ（pseudo-color image）の作成
main="Best 10", xlab="Tissue", ylab="Clone ID", margin=c(8,10))                #(見づらいので二行に分けているだけ)

###  「data」と「tmp[closeg[[1]]$indices,]」は同じなので以下でも可（参考まで）  ###
heatmap(as.matrix(tmp[closeg[[1]]$indices,]), Rowv =NA, Colv=NA,#ヒートマップ（pseudo-color image）の作成
scale="row", col = heat.colors(20), main="Best 10",             #ヒートマップ（pseudo-color image）の作成
xlab="Tissue", ylab="Clone ID", margin=c(8,10))                 #ヒートマップ（pseudo-color image）の作成

#------    ここまで    ------

pseudo-color imageのpdf

参考1：得られたpseudo-color imageを保存したい場合には、R Graphics画面をアクティブにして「ファイル」－「別名で保存」とすることで、任意の形式で保存することができます。
参考2：ここでは、暖色の20段階のみで作図してみましたが、寒色（cm.colors）や虹色（rainbow）など様々な色を任意の階調（12段階や256段階など）で指定することができます（色調に関する参考URL）。

5. Extracting annotation information for the top-ranked genes using Perl

Open "Command Prompt"
Enter "perl -v" and confirm the installation of Active Perl
Enter "cd " and drag the "20100615" folder to the GUI of "Command prompt" (Sorry for the poor explanation m(_ _)m)
Enter "dir" and confirm the current directory is set to the "20100615" folder
Enter "perl hash.pl annotation.txt top10_heart_ID.txt > output.txt" for extracting annotation information of the top-ranked genes
Open the output file (output.txt) using EXCEL and confirm the validity

For example, the top-ranked clone ID was "219942_at". You can find the GenBank accesion number, NM_021223, from the output file (i.e., output.txt).
NCBI UniGene may be useful for the validation of NM_021223. By searching NM_021223 on NCBI UniGene, you can see the restricted expression of the gene.

5. Perlを用いてアノテーション情報を抽出

アノテーションファイル(annotation.txt)から、得られた心臓特異的遺伝子のIDに対応するアノテーション情報を抽出し考察します。

コマンドプロンプトを起動
「perl -v」と打ち込んでActive Perlがインストールされているか確認
「cd 」と打ち込み、リターンキーを押さずにデスクトップ上にある"20100615"フォルダを選択してコマンドプロンプト画面上にドラッグ＆ドロップ
「dir」と打ち込み"20100615"フォルダの中身が見えていることを確認（見られないヒトは手を挙げてスタッフを呼んでください）
「perl hash.pl annotation.txt top10_heart_ID.txt > output.txt」と打ち込んで（コピペで可）10個の心臓特異的遺伝子IDに対応するアノテーション情報のみ抽出したoutput.txtファイルを出力結果として得る
output.txtファイルをWordPadなどで開き、心臓でよく発現していることが知られているmyosinやtroponinが含まれているかなどを確認。

考察例：心臓特異的発現パターン検出用テンプレートに最も近いパターンをもつクローンは、”219942_at”であり、output.txt中でそのGenBank Accessionを調べてみると”NM_021223”であった。NCBI UniGeneでNM_021223を検索すると「Restricted Expression: heart normal fetus」などと書いてあったので妥当な結果と判断された。

EXERCISE 1
Extract top 20 genes specifically expressed in (1) one tissue of interest (of course except for "heart") OR (2) one or more tissues of interest and discuss the result shortly. Use these files (kadai1_command.txt and kadai1_template.txt; The word "kadai" indicates "EXERCISE".) to tackle it. Practically, you have to change only the first two lines in "kadai1_command.txt". Save the pseudo-color image for the tissue-specific expression patterns of the top 20 genes as "kadai1.pdf". Your report consists three files (kadai1_command.txt, kadai1_template.txt, and kadai1.pdf)

You may extract annotation information for the top 20 genes using Perl. In this case, all what you have to do is to type the following command in "Command Prompt":
"perl hash.pl annotation.txt top20_ID.txt > output_top20.txt"
You will easily get the annotation information only for the top-ranked genes.

EXERCISE 2
You learned some uses (or applications) of microarray technology from Dr. Katsuma the last time. Summarize one use (or application) for microarrays.

E-mail To: kadota@iu.a.u-tokyo.ac.jp
Subject: Omics Analysis report
Deadline: Jun 25 2010

【課題1】
どちらか興味ある課題を行ってください。いずれも上位20個をリストアップし、簡単な結果の考察を行ってください。
実際の解析は、一連のコマンドをまとめたファイル（提出物1に相当）を用意しておいて、その中の必要な部分のみ（最初の二行のみ）変更して解析を行います。

自分の興味ある一組織（「心臓」以外）で特異的に発現している遺伝子
複数組織で選択的に高発現している遺伝子

提出物1：しかるべきところを変更したコマンドファイル（kadai1_command.txt）。ファイルの編集は「WordPad」で行ってください。
提出物2：しかるべきところを変更したテンプレートファイル（kadai1_template.txt）。ファイルの編集は「WordPad」で行ってください。
提出物3：得られたpseudo-color imageのPDFファイル（kadai1.pdf）

ヒント：Perlを用いて上位20個のアノテーション情報を抽出して考察することになると思います。この際、例えば以下のコマンドは「"hash.pl"という"perl"プログラムを用いて、"annotation.txt"ファイルの中から"top20_ID.txt"で指定したIDリストを含む行全てを"output_top20.txt"というファイル名で保存せよ」という意味です：
「perl hash.pl annotation.txt top20_ID.txt > output_top20.txt」

【課題2】
マイクロアレイは単に遺伝子発現量の比較だけではなく、いろいろ応用的に利用されている。マイクロアレイの応用的な利用法に関して一つ例をあげ、概説しなさい。

【課題3 (optional)】
（私を含む多くのユーザーは）(Rで)マイクロアレイデータ解析を参考にして実際のマイクロアレイ解析を行います。このウェブページ中の任意のサンプルデータを用いて、自分にとって今後役立ちそうな項目を一つ選んで解析せよ。

【課題1】に相当する項目は解析 | 似た発現パターンを持つ遺伝子の同定
（サンプル間）クラスタリングは解析 | クラスタリング | 階層的 | hclust
（組織特異的発現遺伝子検出のための別の方法は解析 | 発現変動遺伝子 | 組織特異的（選択的）発現遺伝子 | ROKU
二群間比較を行うためのWAD法は解析 | 発現変動遺伝子 | 二群間 | 対応なし | WAD (Kadota_2008)

【課題4 (optional)】
Rで、次世代シーケンサーなどから得られる大量塩基配列のデータ解析を行うことも可能（時間はかかるものの...）です。 Rで(塩基)配列解析ページ中のSequence logos (Schneider and Stephens, 1990)のサンプルプログラムを実行して、Rでも配列解析が可能であることを実感するとともに、どのような結果が得られているか考察せよ。
注意：講義用ノートPCには、Sequence logoの実行に必要なライブラリがインストールされていないため、これをやりたい方はR Console画面上で、予め以下を実行しておいてください。

#------    ここから    ------
source("http://www.bioconductor.org/biocLite.R")               #おまじない
biocLite("Biostrings")                                         #Biostringsパッケージのインストール
biocLite("seqLogo")                                            #seqLogoパッケージのインストール

#------    ここまで    ------

課題提出先(To.)：kadota@iu.a.u-tokyo.ac.jp
Subject: オーム情報解析課題(Omics Analysis report)
提出締切り(Deadline)：Jun 25 2010

Last updated on Jun 14 2010