GO term annotation of protein sequences, and problems encountered
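#=============================== Version 1 ===============================================
Three ways to use InterProScan
InterProScan predicts protein function by searching protein domain and functional-site databases. InterPro, developed by EBI, is a non-redundant database that integrates protein families, domains, and functional sites. InterProScan combines several of the most widely used member databases and is applied to proteins of unknown function to produce InterPro and GO annotations.
Three methods of InterPro annotation are introduced below: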
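III. Local InterProScan annotation
3.1 Installing and configuring a local InterProScan
3.1.1 Download the following 5 files from ftp://ftp.ebi.ac.uk/pub/databases/interpro/iprscan:
- RELEASE/latest/iprscan_v4.8.tar.gz
- BIN/4.x/iprscan_bin4.x_[PLATFORM].tar.gz
- DATA/iprscan_DATA_[LATESTDATAVERSION].tar.gz
- DATA/iprscan_PTHR_DATA_[LATESTDATAVERSION].tar.gz
- DATA/iprscan_MATCH_DATA_[LATESTDATAVERSION].tar.gz
3.1.2 Extract the 5 files into one directory, then run the Config.pl script found there to configure InterProScan.
3.1.3 If you choose the local web setup during configuration, also edit the configuration file of your local www server so that the web version can run locally.
3.2 Using the local InterProScan
3.2.1 Running iprscan from the command line: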
$bin/iprscan -cli -iprlookup -goterms -format xml -i test.fasta -o test.out
# help
http://www.chenlianfu.com/?tag=iprscan
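This setup needs two Perl modules, XML::Parser and XML::Parser::Expat. The latter must be installed first, then the former. Because they are built on C code, some system packages have to be installed beforehand, otherwise the build stops with:
Expat must be installed prior to building XML::Parser and I can't find it in the standard library directories. Install 'expat-devel' (or 'libexpat1-dev') package
Tip: with root or sudo privileges, run yum install expat-devel, or apt-get install libexpat1-dev (the package name and version depend on your distribution).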
#============================================== Version 2 =============================================
Source: https://github.com/ebi-pf-team/interproscan/wiki
Step 1: Environment setup
Software requirements:
- 64-bit Linux
- Perl (default on most Linux distributions)
- Python 2.7.x only
- Oracle's Java JDK/JRE version 8 (required by InterProScan 5.17-56.0 onwards). Earlier InterProScan release versions required Java 6 (version 6u4 and above) or Java 7.
- Environment variables set
- $JAVA_HOME should point to the location of the JVM
- $JAVA_HOME/bin should be added to the $PATH
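The two environment variable requirements above can be checked before running anything. The following is a minimal sketch, not part of InterProScan: the function name check_java_env is hypothetical, and the JDK path in the comments is only an assumed example location.

```shell
#!/usr/bin/env bash
# Sketch only: check_java_env is a hypothetical helper, not part of InterProScan.
# It reports whether JAVA_HOME is set, contains an executable bin/java,
# and whether $JAVA_HOME/bin is one of the PATH entries.
#
# Typical setup lines for ~/.bashrc (the JDK path is an assumed example):
#   export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
#   export PATH="$JAVA_HOME/bin:$PATH"

check_java_env() {
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set"
        return 1
    fi
    if [ ! -x "$JAVA_HOME/bin/java" ]; then
        echo "no executable java in $JAVA_HOME/bin"
        return 1
    fi
    # Surround PATH with colons so the first and last entries match too.
    case ":$PATH:" in
        *":$JAVA_HOME/bin:"*)
            echo "ok"
            ;;
        *)
            echo "$JAVA_HOME/bin is not on PATH"
            return 1
            ;;
    esac
}
```

After sourcing the file, calling check_java_env prints ok when both requirements hold, and otherwise prints which one is missing. It does not check the Java version itself; java -version shows that.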
Step 2: Download
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.27-66.0/interproscan-5.27-66.0-64-bit.tar.gz
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.27-66.0/interproscan-5.27-66.0-64-bit.tar.gz.md5
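md5sum -c interproscan-5.27-66.0-64-bit.tar.gz.md5 (run before extracting, with xxx.tar.gz and xxx.tar.gz.md5 in the same directory, to check the integrity of the download)
tar -pxvzf interproscan-5.27-66.0-64-bit.tar.gz (-p preserves file permissions; -v only lists files as they are extracted and is better dropped)
(After extraction there is a data directory. The PANTHER data downloaded later should be extracted into it, since that is the default path in the configuration file. If you put the data elsewhere, update the path in the configuration.)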
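The checksum and extraction steps above can be chained so that nothing is unpacked from a corrupted download. This is a small sketch under stated assumptions: verify_and_extract is a hypothetical name, the .md5 file is assumed to sit next to the archive, and GNU md5sum and tar are assumed to be available.

```shell
#!/usr/bin/env bash
# Sketch only: verify_and_extract is a hypothetical helper name.
# It expects <archive>.md5 next to <archive>, as in the steps above.
# The function body runs in a subshell, so the cd does not affect the caller.

verify_and_extract() (
    # md5sum -c looks up the file name listed in the .md5 file relative
    # to the current directory, so work in the directory of the archive.
    cd "$(dirname "$1")" || exit 1
    name=$(basename "$1")
    # Extract only when the checksum matches; -p keeps file permissions.
    md5sum -c "$name.md5" >/dev/null && tar -pxzf "$name"
)
```

For example, verify_and_extract interproscan-5.27-66.0-64-bit.tar.gz unpacks into the directory that holds the archive, and returns a non-zero status without extracting if the checksum does not match.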
Step 3: Test run
./interproscan.sh -i test_proteins.fasta -f tsv
./interproscan.sh -i test_proteins.fasta -cpu 8 -f GFF3 -goterms -iprlookup -t p -T 20171127tmp
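# Parameters: -i input file; -cpu number of cores; -f output format; -goterms -iprlookup add GO and InterPro annotation; -t sequence type; -T name of the temporary file directory
Tips:
TSV stands for tab-separated values.
CSV stands for comma-separated values.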
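Since the goal of these notes is GO annotation, the TSV output can be reduced to one protein/GO pair per line. The sketch below assumes the InterProScan 5 TSV layout, with the protein accession in column 1 and the GO terms in column 14 joined by "|" when -goterms is used. The function name extract_go_terms is hypothetical, and the column layout should be checked against the documentation of your release.

```shell
#!/usr/bin/env bash
# Sketch only: extract_go_terms is a hypothetical helper name.
# Assumes the InterProScan 5 TSV layout: column 1 = protein accession,
# column 14 = GO terms joined by "|" (present only when -goterms was used).

extract_go_terms() {
    awk -F '\t' '
        NF >= 14 && $14 != "" && $14 != "-" {
            n = split($14, terms, "[|]")
            for (i = 1; i <= n; i++) {
                key = $1 "\t" terms[i]
                # Print each protein/GO pair once, in first-seen order.
                if (!(key in seen)) {
                    seen[key] = 1
                    print key
                }
            }
        }
    ' "$1"
}
```

A typical call would be extract_go_terms test_proteins.fasta.tsv > protein2go.tsv, assuming the run was made with -f tsv -goterms and the default output file name. Rows without a GO column, such as hits with no InterPro entry, are skipped.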
#============================= Detailed parameters ========================================
27/11/2017 14:41:35:049 Welcome to InterProScan-5.27-66.0
usage: java -XX:+UseParallelGC -XX:ParallelGCThreads=2 -XX:+AggressiveOpts -XX:+UseFastAccessorMethods -Xms128M -Xmx2048M -jar interproscan-5.jar

Please give us your feedback by sending an email to
[email protected]

 -appl,--applications <ANALYSES>
     Optional, comma separated list of analyses. If this option is not set, ALL analyses will be run.
 -b,--output-file-base <OUTPUT-FILE-BASE>
     Optional, base output filename (relative or absolute path). Note that this option, the --output-dir (-d) option and the --outfile (-o) option are mutually exclusive. The appropriate file extension for the output format(s) will be appended automatically. By default the input file path/name will be used.
 -cpu,--cpu <CPU>
     Optional, number of cores for inteproscan.
 -d,--output-dir <OUTPUT-DIR>
     Optional, output directory. Note that this option, the --outfile (-o) option and the --output-file-base (-b) option are mutually exclusive. The output filename(s) are the same as the input filename, with the appropriate file extension(s) for the output format(s) appended automatically.
 -dp,--disable-precalc
     Optional. Disables use of the precalculated match lookup service. All match calculations will be run locally.
 -dra,--disable-residue-annot
     Optional, excludes sites from the XML, JSON output
 -f,--formats <OUTPUT-FORMATS>
     Optional, case-insensitive, comma separated list of output formats. Supported formats are TSV, XML, JSON, GFF3, HTML and SVG. Default for protein sequences are TSV, XML and GFF3, or for nucleotide sequences GFF3 and XML.
 -goterms,--goterms
     Optional, switch on lookup of corresponding Gene Ontology annotation (IMPLIES -iprlookup option)
 -help,--help
     Optional, display help information
 -i,--input <INPUT-FILE-PATH>
     Optional, path to fasta file that should be loaded on Master startup. Alternatively, in CONVERT mode, the InterProScan 5 XML file to convert.
 -iprlookup,--iprlookup
     Also include lookup of corresponding InterPro annotation in the TSV and GFF3 output formats.
 -ms,--minsize <MINIMUM-SIZE>
     Optional, minimum nucleotide size of ORF to report. Will only be considered if n is specified as a sequence type. Please be aware of the fact that if you specify a too short value it might be that the analysis takes a very long time!
 -o,--outfile <EXPLICIT_OUTPUT_FILENAME>
     Optional explicit output file name (relative or absolute path). Note that this option, the --output-dir (-d) option and the --output-file-base (-b) option are mutually exclusive. If this option is given, you MUST specify a single output format using the -f option. The output file name will not be modified. Note that specifying an output file name using this option OVERWRITES ANY EXISTING FILE.
 -pa,--pathways
     Optional, switch on lookup of corresponding Pathway annotation (IMPLIES -iprlookup option)
 -t,--seqtype <SEQUENCE-TYPE>
     Optional, the type of the input sequences (dna/rna (n) or protein (p)). The default sequence type is protein.
 -T,--tempdir <TEMP-DIR>
     Optional, specify temporary file directory (relative or absolute path). The default location is temp/.
 -version,--version
     Optional, display version number
 -vtsv,--output-tsv-version
     Optional, includes a TSV version file along with any TSV output (when TSV output requested)

Copyright © EMBL European Bioinformatics Institute, Hinxton, Cambridge, UK. (http://www.ebi.ac.uk)
The InterProScan software itself is provided under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0.html).
Third party components (e.g. member database binaries and models) are subject to separate licensing - please see the individual member database websites for details.

Available analyses:
    TIGRFAM (15.0) : TIGRFAMs are protein families based on Hidden Markov Models or HMMs
    SFLD (3) : SFLDs are protein families based on Hidden Markov Models or HMMs
    SUPERFAMILY (1.75) : SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.
    PANTHER (12.0) : The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System is a unique resource that classifies genes by their functions, using published scientific experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence.
    Gene3D (4.1.0) : Structural assignment for whole genes and genomes using the CATH domain structure database
    Hamap (2017_10) : High-quality Automated and Manual Annotation of Microbial Proteomes
    Coils (2.2.1) : Prediction of Coiled Coil Regions in Proteins
    ProSiteProfiles (2017_09) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them
    SMART (7.1) : SMART allows the identification and analysis of domain architectures based on Hidden Markov Models or HMMs
    CDD (3.16) : Prediction of CDD domains in Proteins
    PRINTS (42.0) : A fingerprint is a group of conserved motifs used to characterise a protein family
    ProSitePatterns (2017_09) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them
    Pfam (31.0) : A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs)
    ProDom (2006.1) : ProDom is a comprehensive set of protein domain families automatically generated from the UniProt Knowledge Database.
    MobiDBLite (1.0) : Prediction of disordered domains Regions in Proteins
    PIRSF (3.02) : The PIRSF concept is being used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships.

Deactivated analyses:
    Phobius (1.01) : Analysis Phobius is deactivated, because the resources expected at the following paths do not exist: bin/phobius/1.01/phobius.pl
    SignalP_EUK (4.1) : Analysis SignalP_EUK is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
    SignalP_GRAM_POSITIVE (4.1) : Analysis SignalP_GRAM_POSITIVE is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
    TMHMM (2.0c) : Analysis TMHMM is deactivated, because the resources expected at the following paths do not exist: bin/tmhmm/2.0c/decodeanhmm, data/tmhmm/2.0c/TMHMM2.0c.model
    SignalP_GRAM_NEGATIVE (4.1) : Analysis SignalP_GRAM_NEGATIVE is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp