[Bioperl-l] HMMER3 to GFF3
Doug Hoen
douglas.hoen at gmail.com
Thu Aug 12 05:59:37 UTC 2010
Hi,
I am trying to convert HMMER3 (hmmscan) output files into GFF3 files.
Based on previous advice (see the thread, "How to store results of
searches of translated DNA in SeqFeature::Store database of the
original DNA?"), I have installed bioperl-live for its new HMMER3
parsing capabilities (in SearchIO) and am trying to use
bp_search2gff.pl to do the file conversion.
The hmmscan was done on translated chromosome sequences with conserved
domain models. I want to get the GFF 'start' and 'end' columns to be
based on these coordinates, not those of the models. To do this (with
my files), it seems I need to use the option "--type hit". However,
this changes the "Target" sequence name from the model name to
chromosome name, and the model name does not appear anywhere in the
output (see below).
Could someone please confirm whether the results are incorrect and, if
so, perhaps suggest a fix? It may well be that this problem is due to
the unusual way I am using hmmscan, rather than a problem with HMMER3
parsing...?
Many thanks,
-- Doug
========================================================
Here's what it looks like if I do *not* use the "--type hit" option.
(RVT_2 is a conserved domain name. I need this in the output.)
COMMAND:
------------------
bp_search2gff.pl -i ../chr1-tesigsv2.hmmscan -o chr1-tesigsv2-hmmscan-
original-locations-v2.gff3 --format hmmer3 --source HMMER3 --version 3
--component
OUTPUT:
------------------
==> chr1-tesigsv2-hmmscan-original-locations-v2.gff3 <==
##gff-version 3
Chr1_1 chromosome Component 1 10142557 . . 1 sequence=Chr1_1
Chr1_1 HMMER3 similarity 1 245 307.3 . 0 Target=Sequence:RVT_2 1898330
1898579
Chr1_1 HMMER3 similarity 1 244 329.5 . 0 Target=Sequence:RVT_2 2573551
2573796
Chr1_1 HMMER3 similarity 1 245 308.8 . 0 Target=Sequence:RVT_2 3159685
3159930
Chr1_1 HMMER3 similarity 1 102 108.2 . 0 Target=Sequence:RVT_2 3438684
3438791
Chr1_1 HMMER3 similarity 2 245 277.2 . 0 Target=Sequence:RVT_2 3566642
3566891
Chr1_1 HMMER3 similarity 13 213 251.4 . 0 Target=Sequence:RVT_2
4251160 4251373
Chr1_1 HMMER3 similarity 1 244 310.6 . 0 Target=Sequence:RVT_2 4252791
4253036
Chr1_1 HMMER3 similarity 6 99 94.2 . 0 Target=Sequence:RVT_2 4271555
4271653
========================================================
And here's what it looks like if I *do* use the "--type hit" option.
The coordinates look good but the model name has disappeared (and the
Target=Sequence seems wrong).
COMMAND:
------------------
bp_search2gff.pl -i ../chr1-tesigsv2.hmmscan -o chr1-tesigsv2-hmmscan-
original-locations-v3.gff3 --format hmmer3 --type hit --source HMMER3
--version 3 --component
OUTPUT:
------------------
==> chr1-tesigsv2-hmmscan-original-locations-v3.gff3 <==
##gff-version 3
RVT_2 HMMER3 similarity 1898330 1898579 307.3 . 0
Target=Sequence:Chr1_1 1 245
RVT_2 HMMER3 similarity 2573551 2573796 329.5 . 0
Target=Sequence:Chr1_1 1 244
RVT_2 HMMER3 similarity 3159685 3159930 308.8 . 0
Target=Sequence:Chr1_1 1 245
RVT_2 HMMER3 similarity 3438684 3438791 108.2 . 0
Target=Sequence:Chr1_1 1 102
RVT_2 HMMER3 similarity 3566642 3566891 277.2 . 0
Target=Sequence:Chr1_1 2 245
RVT_2 HMMER3 similarity 4251160 4251373 251.4 . 0
Target=Sequence:Chr1_1 13 213
RVT_2 HMMER3 similarity 4252791 4253036 310.6 . 0
Target=Sequence:Chr1_1 1 244
RVT_2 HMMER3 similarity 4271555 4271653 94.2 . 0
Target=Sequence:Chr1_1 6 99
RVT_2 HMMER3 similarity 4481232 4481477 281.5 . 0
Target=Sequence:Chr1_1 2 245
========================================================
And here's what the input HMMER3 result file looks like:
==> ../chr1-tesigsv2.hmmscan <==
# hmmscan :: search sequence(s) against a profile database
# HMMER 3.0rc1 (February 2010); http://hmmer.org/
# Copyright (C) 2010 Howard Hughes Medical Institute.
# Freely distributed under the GNU General Public License (GPLv3).
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
- -
# query sequence file: [...]/whole_chromosomes/translated/
chr1.pep
# target HMM database: [...]/signatures/Pfam-A.hmm
# output directed to file: chr1-tesigsv2.hmmscan
# model-specific thresholding: TC cutoffs
# Max sensitivity mode: on [all heuristic filters off]
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
- -
Query: Chr1_1 [L=10142557]
Description: CHROMOSOME dumped from ADB: Jun/20/09 14:53; last
updated: 2009-02-02
Scores for complete sequence (score includes all domains):
--- full sequence --- --- best 1 domain --- -#dom-
E-value score bias E-value score bias exp N
Model Description
------- ------ ----- ------- ------ ----- ---- --
-------- -----------
0 3971.3 17.7 2.6e-101 329.5 0.6 19.4 17
RVT_2 Reverse transcriptase (RNA-dependent DNA pol
0 3040.7 23.0 1e-206 678.6 0.1 12.2 10
ATHILA ATHILA ORF-1 family
0 1681.9 79.1 1.9e-46 149.9 0.4 28.0 21
RVT_1 Reverse transcriptase (RNA-dependent DNA pol
0 1446.9 27.4 3.6e-95 309.1 0.2 7.6 5
Transposase_21 Transposase family tnp2
0 1168.4 50.3 1.4e-29 94.4 0.3 21.5 18
rve Integrase core domain
9.1e-300 960.0 69.0 3.1e-20 64.0 0.0 28.8 20
Retrotrans_gag Retrotransposon gag protein
1.5e-180 577.0 31.6 1.6e-29 93.1 1.5 9.5 8
Transposase_23 TNP1/EN/SPM transposase
4.4e-143 456.9 82.8 4.8e-18 56.4 0.1 12.9 11
MuDR MuDR family transposase
3.8e-116 371.4 19.6 1.2e-18 58.9 0.0 13.7 7
MULE MULE transposase domain
7.1e-106 344.1 5.6 2.7e-97 316.0 0.0 3.6 1
Plant_tran Plant transposon protein
9.2e-85 275.4 22.9 5.4e-60 194.4 0.3 6.4 3
Peptidase_C48 Ulp1 protease family, C-terminal catalytic d
1.8e-77 249.8 24.8 4.4e-28 89.8 0.1 10.8 3
Transposase_24 Plant transposase (Ptta/En/Spm family)
2.8e-47 150.1 1.2 5.5e-23 72.3 0.2 3.7 2
hATC hAT family dimerisation domain
5.7e-28 89.4 3.6 4.7e-13 41.1 0.0 6.5 1
RVP_2 Retroviral aspartyl protease
1e-16 53.3 0.0 4.4e-07 22.1 0.0 6.8 1
RnaseH RNase H
1.5e-08 25.3 2.4 0.00016 12.1 0.0 4.9 0
Transposase_mut Transposase, Mutator family
Domain annotation for each model (and alignments):
>> RVT_2 Reverse transcriptase (RNA-dependent DNA polymerase)
# score bias c-Evalue i-Evalue hmmfrom hmm to alifrom
ali to envfrom env to acc
--- ------ ----- --------- --------- ------- ------- -------
------- ------- ------- ----
1 ! 307.3 0.0 5.3e-95 1.5e-94 1 245 [. 1898330
1898578 .. 1898330 1898579 .. 0.99
2 ! 329.5 0.6 8.9e-102 2.6e-101 1 244 [. 2573551
2573794 .. 2573551 2573796 .. 0.99
3 ! 308.8 0.0 1.8e-95 5.2e-95 1 245 [. 3159685
3159929 .. 3159685 3159930 .. 0.99
4 ! 108.2 0.1 3.4e-34 9.7e-34 1 102 [. 3438684
3438785 .. 3438684 3438791 .. 0.96
5 ! 277.2 0.0 8.1e-86 2.3e-85 2 245 .. 3566643
3566890 .. 3566642 3566891 .. 0.99
6 ! 251.4 0.0 6.2e-78 1.8e-77 13 213 .. 4251164
4251364 .. 4251160 4251373 .. 0.97
7 ! 310.6 0.0 5.1e-96 1.5e-95 1 244 [. 4252791
4253034 .. 4252791 4253036 .. 0.99
8 ! 94.2 0.1 6.1e-30 1.8e-29 6 99 .. 4271560
4271653 .. 4271555 4271653 .. 0.97
9 ! 281.5 0.9 3.9e-87 1.1e-86 2 245 .. 4481233
4481476 .. 4481232 4481477 .. 0.98
10 ! 248.2 0.0 5.9e-77 1.7e-76 1 190 [. 4521040
4521233 .. 4521040 4521237 .. 0.97
11 ! 314.6 0.1 3.2e-97 9.2e-97 1 244 [. 4652456
4652702 .. 4652456 4652704 .. 0.98
12 ! 40.7 0.0 1.3e-13 3.7e-13 2 92 .. 5219607
5219697 .. 5219606 5219701 .. 0.90
13 ! 221.0 0.0 1.2e-68 3.4e-68 2 245 .. 5241015
5241258 .. 5241014 5241259 .. 0.95
14 ! 81.2 0.0 5.6e-26 1.6e-25 2 115 .. 5501957
5502070 .. 5501956 5502080 .. 0.92
15 ! 272.4 0.0 2.3e-84 6.7e-84 30 245 .. 6483057
6483271 .. 6483050 6483272 .. 0.98
16 ! 178.5 0.0 1.2e-55 3.3e-55 81 244 .. 7250563
7250726 .. 7250552 7250728 .. 0.96
17 ! 313.7 0.0 5.9e-97 1.7e-96 2 245 .. 7707124
7707367 .. 7707123 7707368 .. 0.99
Alignments for each domain:
== domain 1 score: 307.3 bits; conditional E-value: 5.3e-95
RVT_2 1
nktwelvelpkgkkviglkWvfklKlnedgeierykARlVakGftqkegidyeetfspvvklesirlllalaaekkleleqlDvktaFLngelee
95
n tw +++lp gkk++g+kWv+k+Kln+dg++erykARlVakG+tq+eg+dy
+tfspv+kl++++ll+a+aa+k+++l+qlD+++aFLng+l+e
Chr1_1 1898330
NGTWVVCSLPVGKKAVGCKWVYKIKLNADGSLERYKARLVAKGYTQTEGLDYVDTFSPVAKLTTVKLLIAVAAAKGWSLSQLDISNAFLNGSLDE
1898424
68*********************************************************************************************
PP
RVT_2 96
evYvkqpeGfedkkk....enkvckLkkslYgLkqapraWyeklsevllklgfkkseadkclfvkkkeeeliivllYVDDlliagsskelieelk
186
e+Y++ p+G++ ++ +n vc+LkkslYgLkqa+r+Wy k+se l++lgf+
+s+ d++lf++k++++ ++vl+YVDD++ia+s +++ e l
Chr1_1 1898425
EIYMTLPPGYSPRQGdsfpPNAVCRLKKSLYGLKQASRQWYLKFSESLKALGFTQSSGDHTLFTRKSKNSYMAVLVYVDDIIIASSCDRETELLR
1898519
***********998889999***************************************************************************
PP
RVT_2 187
eeLkkefemkdlgelkyfLgleierkeegillsqekyvkkllkkfkmedakpvstplea 245
++L+++ +++dlg+l+yfLglei+r+++gi+++q+ky+ +ll+++++ +k++s
+p+e+
Chr1_1 1898520
DALQRSSKLRDLGTLRYFLGLEIARNTDGISICQRKYTLELLAETGLLGCKSSSVPMEP 1898578
*********************************************************97 PP
== domain 2 score: 329.5 bits; conditional E-value: 8.9e-102
RVT_2 1
nktwelvelpkgkkviglkWvfklKlnedgeierykARlVakGftqkegidyeetfspvvklesirlllalaaekkleleqlDvktaFLngelee
95
n+twel++lp+g+k+ig+kWv+k K+n++ge+erykARlVakG++q++gidy+e
+f+pv++le++rl+++laa++k++++q+D k aFLng++ee
Chr1_1 2573551
NDTWELTSLPNGHKAIGVKWVYKAKKNSKGEVERYKARLVAKGYSQRAGIDYDEVFAPVARLETVRLIISLAAQNKWKIHQMDFKLAFLNGDFEE
2573645
79*********************************************************************************************
PP
RVT_2 96
evYvkqpeGfedkkkenkvckLkkslYgLkqapraWyeklsevllklgfkkseadkclfvkkkeeeliivllYVDDlliagsskelieelkeeLk
190
evY++qp+G+ +k++e+kv++Lkk+lYgLkqapraW++++++++++++f k+ +
+++l++k ++e+++i +lYVDDl+++g++ ++ ee+k+e++
Chr1_1 2573646
EVYIEQPQGYIVKGEEDKVLRLKKALYGLKQAPRAWNTRIDKYFKEKDFIKCPYEHALYIKIQKEDILIACLYVDDLIFTGNNPSMFEEFKKEMT
2573740
***********************************************************************************************
PP
RVT_2 191
kefemkdlgelkyfLgleierkeegillsqekyvkkllkkfkmedakpvstple 244
kefem+d+g ++y+Lg+e+++++++i+++qe y+k++lkkfkm+d++pv tp
+e
Chr1_1 2573741
KEFEMTDIGLMSYYLGIEVKQEDNRIFITQEGYAKEVLKKFKMDDSNPVCTPME 2573794
****************************************************97 PP
More information about the Bioperl-l
mailing list