• 设为首页
  • 点击收藏
  • 手机版
    手机扫一扫访问
    迪恩网络手机版
  • 关注官方公众号
    微信扫一扫关注
    公众号

[protocol]GOenrichmentanalysis

原作者: [db:作者] 来自: [db:来源] 收藏 邀请

[protocol]GO enrichment analysis

 

   

背景:

   什么是富集分析,自己可以百度。我到目前也没发现一个比较通俗易懂的介绍。直接理解为一种统计学方法就可以了。 用于查看显著性。

   富集分析有很多种,最常见的是GO富集分析。也有pathway富集分析。[pathway的我目前不会啊 ::>_<:: ]

 

工具:

   也有很多种,我这里主要是用Ontologizer (links:http://compbio.charite.de/contao/index.php/cmdlineOntologizer.html)

 

需要的文件:

  1.gene_ontology_edit.obo (这个类似GO的库文件, links:http://www.geneontology.org/GO.downloads.ontology.shtml)

 

  2.whole_gene_list (这是所研究物种中,包含的所有gene_list)

      通过一下脚本可以从CDS文件中提取gene_list

        grep "^>" sequence.fasta | tr -d ">"> sequenceHeader.list

 

  3. sample_1_gene_list (其中一个点的gene_list)

 

  4.最后一个也是最关键的。就是要有一个Association file .这个通过一下脚本生成

       4.1需要2个文件 和一个脚本

           文件1:GO_ids 文件 格式如下

gene136870 GO:0006950,GO:0005737
gene143400 0006468,GO:0005737,GO:0004674,GO:0006914,GO:0016021,GO:0015031,GO:0004714
gene144020 GO:0006941,GO:0005524,GO:0003774,GO:0005516,GO:0005737,GO:0005863
gene000620 GO:0006350,GO:0006355,GO:0007275,GO:0030528
gene144530 GO:0016020

     文件2: GOIDTermIndexFile 这个可以下载到 links:http://www.geneontology.org/GO.downloads.files.shtml

     脚本1:

   

import sys

#input the gene annotation file (inFile) and your GOID/terms index file (goFile)
#inFile is your custom annotations
#goFile is the GO IDs to GO categories index you download from the GO website inFile = open(sys.argv[1],'r') goFile = open(sys.argv[2],'r') #parse the GOID/terms index and store it in the dictionary, goHash goHash = {} for line in goFile: #skip lines that start with '!' character as they are comment headers if line[0] != "!": data = line.strip().split('\t')
#skip obsolete terms
if data[-1] != 'obs':
for info in data:
if info[0:3] == "GO:":
#create dictionary of term aspects goHash[data[0]] = data[-1] #Here are some columns that the GAF format wants. #Since Ontologizer doesn't care about this, we can just make it up DB = 'yourOrganism' DBRef = 'PMID:0000000' DBEvi = 'ISO' DBObjectType = 'gene' DBTaxon = 'taxon:79327' DBDate = '23022011' DBAssignedBy = 'PFAM' #potential obselete goids that you have in your annotation potentialObs = []
#if you specified to not print out obsolete goids, then print out the .gaf if len(sys.argv) == 3: print '!gaf-version: 2.0' #Loop through the GO annotation file and generate assocation lines. for line in inFile: data = line.strip().split('\t') #if gene has go annotations if len(data) > 1: #gid is the gene id, goIDs is an array of assocated GO ids gid = data[0] goIDs = data[1].split(',') #second column of the .gaf file that Ontologizer doesn't care about DBID = "db." gid #third column is the name of the gene which Ontologizer does use DBObjSym = gid #for each GO ID in the line, print out an association line for goID in goIDs: if goHash.has_key(goID): DBAspect = goHash[goID] DBObjName = 'myOrganism' DBID outArray = [DB,DBID,DBObjSym,'0',goID,DBRef,DBEvi,'0',DBAspect,'0','0',DBObjectType,DBTaxon,DBDate,DBAssignedBy]
#only print out the .gaf file if you didn't specify to print out obsolete goids. if len(sys.argv) == 3: print '\t'.join(outArray) else: potentialObs.append(goID)
#if there is a 4th argument, print out the potential obsolete list if len(sys.argv) == 4: print '\n'.join(set(potentialObs))
   执行:
python myScript.py GO.customAnnotationFile GOIDTermIndexFile > myOrganism.gaf

 

 

以上4个文件都有了 可以执行Ontologizer了 参考:http://compbio.charite.de/contao/index.php/cmdlineOntologizer.html

 

java -jar Ontologizer.jar -a myOrganism.gaf -g gene_ontology_edit.obo  -s moso_target_gene.txt -p sequnecHeader.list -c Parent-Child-Union -m Westfall-Young-Single-Step -d 0.05 -r 1000

 

 

 

   5.这个软件是可以画图的。但是需要下载软件。下载步骤如下: (as root )

        yum list available 'graphviz*'
        yum install 'graphviz*'
dot -Tpng view-4hourSMinduced-Parent-Child-Westfall-Young-Single-Step.dot -oExample.png
 
 
还可以参考 http://blog.nextgenetics.net/?e=5
 


 

Most gene enrichment websites out there only allow you to find enrichments for popular model organisms using pre-established gene ontology annotations. I ran into this problem early on during my phd when confronted with having to generate enrichment data on Schmidtea mediterranea.

 

Software and inputs

To get custom enrichments, I used Ontologizer. The inputs for ontologizer are:

  • Gene Ontology OBO file
    A flatfile containing information on all the current GO terms and their relationship to other terms. You can get it here.
  • List of all gene ids
    If you have a .fasta file of all your genes, you can use this command to extract the gene id information:
    grep "^>" sequence.fasta | tr -d ">" > sequenceHeader.list
    
  • List of gene ids you want to find enrichment for
    The subset of all your genes that you want the enrichment for, ie. differentially expressed gene. Ontologizer can take in an entire folder of gene id lists if you set a directory as input. 
  • Association file for your organism
    This is a GO Annotation File (.gaf). The specifications is detailed here. Fortunately, Ontologizer doesn't really care about most of the database information columns in the .gaf file, so you can fake most of it. To generate this file, you will have to first generate GO annotations for your genes through software like InterproScan or Blast2Go. You will also need a tab delimited index of GO IDs to GO category which you can get here.

 

Generating a gene association file

Assuming your GO annotations are formatted in two columns where first column is the gene id and second column is a comma separated list of your GO ids:

gene136870 GO:0006950,GO:0005737
gene143400 GO:0016020,GO:0005524,GO:0006468,GO:0005737,GO:0004674,GO:0006914,GO:0016021,GO:0015031,GO:0004714
gene144020 GO:0003779,GO:0006941,GO:0005524,GO:0003774,GO:0005516,GO:0005737,GO:0005863
gene000620 GO:0005634,GO:0003677,GO:0030154,GO:0006350,GO:0006355,GO:0007275,GO:0030528
gene144530 GO:0016020

Here is the python script to generate the association file with comments:

import sys

#input the gene annotation file (inFile) and your GOID/terms index file (goFile)
#inFile is your custom annotations
#goFile is the GO IDs to GO categories index you download from the GO website inFile = open(sys.argv[1],'r') goFile = open(sys.argv[2],'r') #parse the GOID/terms index and store it in the dictionary, goHash goHash = {} for line in goFile: #skip lines that start with '!' character as they are comment headers if line[0] != "!": data = line.strip().split('\t')
#skip obsolete terms
if data[-1] != 'obs':
for info in data:
if info[0:3] == "GO:":
#create dictionary of term aspects goHash[data[0]] = data[-1] #Here are some columns that the GAF format wants. #Since Ontologizer doesn't care about this, we can just make it up DB = 'yourOrganism' DBRef = 'PMID:0000000' DBEvi = 'ISO' DBObjectType = 'gene' DBTaxon = 'taxon:79327' DBDate = '23022011' DBAssignedBy = 'PFAM' #potential obselete goids that you have in your annotation potentialObs = []
#if you specified to not print out obsolete goids, then print out the .gaf if len(sys.argv) == 3: print '!gaf-version: 2.0' #Loop through the GO annotation file and generate assocation lines. for line in inFile: data = line.strip().split('\t') #if gene has go annotations if len(data) > 1: #gid is the gene id, goIDs is an array of assocated GO ids gid = data[0] goIDs = data[1].split(',') #second column of the .gaf file that Ontologizer doesn't care about DBID = "db." gid #third column is the name of the gene which Ontologizer does use DBObjSym = gid #for each GO ID in the line, print out an association line for goID in goIDs: if goHash.has_key(goID): DBAspect = goHash[goID] DBObjName = 'myOrganism' DBID outArray = [DB,DBID,DBObjSym,'0',goID,DBRef,DBEvi,'0',DBAspect,'0','0',DBObjectType,DBTaxon,DBDate,DBAssignedBy]
#only print out the .gaf file if you didn't specify to print out obsolete goids. if len(sys.argv) == 3: print '\t'.join(outArray) else: potentialObs.append(goID)
#if there is a 4th argument, print out the potential obsolete list if len(sys.argv) == 4: print '\n'.join(set(potentialObs))

 

Putting it all together

Make sure you have your custom GO annotation file generated in the format shown above and the GOID to term aspect index file from here (text file version of 'Terms, IDs, secondary IDs, obsoletes'). Save this script and use it by:

python myScript.py GO.customAnnotationFile GOIDTermIndexFile > myOrganism.gaf

If you want to check if any of your annotation is potentially deprecated or obsolete, use this command:

python myScript.py GO.customAnnotationFile GOIDTermIndexFile y > potentialDeprecated.goids

Now that you have the Gene Assocation File, you can use Ontologizer by:

java -jar Ontologizer.jar -a myOrganism.gaf -g gene_ontology.1_2.obo -o out -p myOrganism.ids -s in

The parameters used are:

  • a - the gaf file
  • g - the obo file
  • o - the output directory
  • p - the list of all gene ids
  • s - the input directory/file of gene subsets 

The enrichments with all relevent statistics will be generated per input gene list in a tab delimited file.


鲜花

握手

雷人

路过

鸡蛋
该文章已有0人参与评论

请发表评论

全部评论

专题导读
上一篇:
【Linux】【Go】Centos7安装go1.13环境发布时间:2022-07-10
下一篇:
Go如何发送广播包发布时间:2022-07-10
热门推荐
热门话题
阅读排行榜

扫描微信二维码

查看手机版网站

随时了解更新最新资讯

139-2527-9053

在线客服(服务时间 9:00~18:00)

在线QQ客服
地址:深圳市南山区西丽大学城创智工业园
电邮:jeky_zhao#qq.com
移动电话:139-2527-9053

Powered by 互联科技 X3.4© 2001-2213 极客世界.|Sitemap