[Bioperl-l] Bio::Search results

Barry Moore bmoore at genetics.utah.edu
Fri Sep 2 14:43:27 EDT 2005


Sean,

Am I guessing correctly that you are using bioperl 1.4?  You should not
have to have header information at the top of each result.  The problem
you are having is that the parser can read single result or multi result
files.  I didn't write this parser, so I won't pretend to know it well,
but when you are only getting the last result it is parsing your multi
result file as a single result file and everything but the last result
is being overwritten rather than pushed to an array (I think).  I
introduced a problem in the header when I copied the data out of out of
your e-mail, and fixing that made it work.  I just tried it with bioperl
1.4, and I get the same problem again (with a good header).  Diffing the
1.4 and 1.5 versions shows a lot of changes.  If you're on 1.4 try
upgrading to 1.5 or live from CVS, and see if that solves your problem.
The two attached files are your data, and your script as I have been
running them.  They should work as is with 1.5.

Barry

-----Original Message-----
From: Sean O'Keeffe [mailto:limericksean at gmail.com] 
Sent: Friday, September 02, 2005 3:32 AM
To: Barry Moore
Cc: Bioperl List
Subject: Re: [Bioperl-l] Bio::Search results

I have those four lines (all contained on separate lines) but it didn't
work . 

I ran "gawk '{if(/^\/\/$/) {print "//\n\nHMMER 2.3.2 (Oct 2003)"} else
{print}}' my_file > whatever" on the pfam file to include a HMMER
header after the // separator for each result.

It now works fine. I still don't understand why I need to include this
header to parse the file. Although from memory don't Blast reports all
include Blast header lines to separate next results?
By the way I think that its the mailer that seems to be altering the
text in these scripts (don't know where the "=3D" came from - it's not
in my original). I'm running a standard linux (suse 9.0) using nedit
as an editor.
Sean.

On 9/2/05, Barry Moore <bmoore at genetics.utah.edu> wrote:
> Sean I see that my mailer messed up the line formatting also.  What I
> meant was you should have the top four lines like this:
> 
> #Line 1: hmmpfam - search one or more sequences against HMM database
> #Line 2: HMMER 2.3.2(Oct 2003)
> #Line 3: Copyright (C) 1992-2003 HHMI/Washington University School of
> #Line 4: Medicine Freely distributed under the GNU General Public
> License (GPL)
> 
> If you have that, and fix the hsp to hit error mentioned it works for
> me.
> 
> Barry
> 
> -----Original Message-----
> From: bioperl-l-bounces at portal.open-bio.org
> [mailto:bioperl-l-bounces at portal.open-bio.org] On Behalf Of Barry
Moore
> Sent: Thursday, September 01, 2005 11:34 AM
> To: Bioperl List
> Subject: RE: [Bioperl-l] Bio::Search results
> 
> Sean-
> 
> 
> 
> The top of you input hmmpfam file has these four lines:
> 
> 
> 
> hmmpfam - search one or more sequences against HMM database HMMER
2.3.2
> (Oct 2003) Copyright (C) 1992-2003 HHMI/Washington University School
of
> Medicine Freely distributed under the GNU General Public License (GPL)
> 
> 
> 
> When I copied that out of you e-mail and into emacs on my computer I
got
> it all as one line like this:
> 
> 
> 
> hmmpfam - search one or more sequences against HMM database HMMER
2.3.2
> (Oct 2003) Copyright (C) 1992-2003 HHMI/Washington University School
of
> Medicine Freely distributed under the GNU General Public License (GPL)
> 
> 
> 
> The parser needs to see a line beginning with HMMER to trigger setting
> an internal variable called _reporttype (among other things).  Since
> this never happened with a single line header I got the same results
you
> did.  Your header lines look OK in the e-mail you sent me, but
assuming
> you have the same problem with you header lines that I did, fixing
that
> should lead you to your next problem...
> 
> 
> 
> This line of your code has an error:
> 
> 
> 
> next unless ($hsp->name =~ /^ig|^lrr|^fn3|^egf|^tsp|^psi/i);
> 
> 
> 
> HSP objects don't have names but their associated hit objects do, so
> this makes your script work as intended:
> 
> 
> 
> next unless ($hit->name =~ /^ig|^lrr|^fn3|^egf|^tsp|^psi/i);
> 
> 
> 
> I wonder if the header line problems could be the result of cross
> platform cutting and pasting (or just cross platform file
> incompatibilities).  I noticed also that in you script that you sent
the
> '=' turned into '=3D'.  What computer platforms do you use in your
work?
> 
> 
> 
> HTH
> 
> 
> 
> Barry
> 
> 
> 
> -----Original Message-----
> 
> From: Sean O'Keeffe [mailto:limericksean at gmail.com]
> 
> Sent: Tuesday, August 30, 2005 9:28 AM
> 
> To: Barry Moore
> 
> Cc: bioperl-l at portal.open-bio.org
> 
> Subject: Re: [Bioperl-l] Bio::Search results
> 
> 
> 
> Hi Barry, thanks for the reply. Below is a snippet of the file (I
> generated it with hmmpfam using the alignment flag set to -A 0, to
> remove alignments - this shouldn't affect the parsing of the file) :
> 
> 
> 
> hmmpfam - search one or more sequences against HMM database HMMER
2.3.2
> (Oct 2003) Copyright (C) 1992-2003 HHMI/Washington University School
of
> Medicine Freely distributed under the GNU General Public License (GPL)
> 
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-
> 
> HMM file:                 /usr/local/lib/pfam-tm
> 
> Sequence file:            Mus_musculus.NCBIM34.jul.pep.fa-short
> 
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> 
> 
> 
> Query sequence: ENSMUSP00000089702
> 
> Accession:      [none]
> 
> Description:    [none]
> 
> 
> 
> Scores for sequence family classification (score includes all
domains):
> 
> Model    Description                                    Score
E-value
> N
> 
> -------- -----------                                    -----
-------
> ---
> 
>       [no hits above thresholds]
> 
> 
> 
> Parsed for domains:
> 
> Model    Domain  seq-f seq-t    hmm-f hmm-t      score  E-value
> 
> -------- ------- ----- -----    ----- -----      -----  -------
> 
>       [no hits above thresholds]
> 
> //
> 
> 
> 
> Query sequence: ENSMUSP00000089701
> 
> Accession:      [none]
> 
> Description:    [none]
> 
> 
> 
> Scores for sequence family classification (score includes all
domains):
> 
> Model    Description                                    Score
E-value
> N
> 
> -------- -----------                                    -----
-------
> ---
> 
>       [no hits above thresholds]
> 
> 
> 
> Parsed for domains:
> 
> Model    Domain  seq-f seq-t    hmm-f hmm-t      score  E-value
> 
> -------- ------- ----- -----    ----- -----      -----  -------
> 
>       [no hits above thresholds]
> 
> //
> 
> 
> 
> Query sequence: ENSMUSP00000020094
> 
> Accession:      [none]
> 
> Description:    [none]
> 
> 
> 
> Scores for sequence family classification (score includes all
domains):
> 
> Model    Description                                    Score
E-value
> N
> 
> -------- -----------                                    -----
-------
> ---
> 
> LRR_1    Leucine Rich Repeat                             55.0
3.7e-15
> 6
> 
> LRRNT    Leucine rich repeat N-terminal domain           30.1
1.1e-07
> 1
> 
> 
> 
> Parsed for domains:
> 
> Model    Domain  seq-f seq-t    hmm-f hmm-t      score  E-value
> 
> -------- ------- ----- -----    ----- -----      -----  -------
> 
> LRRNT      1/1     117   142 ..     1    34 []    30.1  1.1e-07
> 
> LRR_1      1/6     168   191 ..     1    25 []    14.7   0.0049
> 
> LRR_1      2/6     192   210 ..     1    25 []     9.1      0.2
> 
> LRR_1      3/6     212   237 ..     1    25 []     8.4     0.26
> 
> LRR_1      4/6     238   257 ..     1    25 []    10.3      0.1
> 
> LRR_1      5/6     259   282 ..     1    25 []    10.1     0.12
> 
> LRR_1      6/6     290   314 ..     1    25 []     2.4        2
> 
> //
> 
> 
> 
> Query sequence: ENSMUSP00000074175
> 
> Accession:      [none]
> 
> Description:    [none]
> 
> 
> 
> Scores for sequence family classification (score includes all
domains):
> 
> Model    Description                                    Score
E-value
> N
> 
> -------- -----------                                    -----
-------
> ---
> 
> CUB      CUB domain                                     232.5
1.3e-68
> 2
> 
> Trypsin  Trypsin                                        206.0
1.2e-60
> 1
> 
> Sushi    Sushi domain (SCR repeat)                       88.7
2.6e-25
> 2
> 
> EGF_CA   Calcium binding EGF domain                      29.4
1.8e-07
> 1
> 
> 
> 
> Parsed for domains:
> 
> Model    Domain  seq-f seq-t    hmm-f hmm-t      score  E-value
> 
> -------- ------- ----- -----    ----- -----      -----  -------
> 
> CUB        1/2      16   137 ..     1   116 []    70.0  1.1e-19
> 
> EGF_CA     1/1     141   188 ..     1    55 []    29.4  1.8e-07
> 
> CUB        2/2     192   301 ..     1   116 []   162.5  1.5e-47
> 
> Sushi      1/2     308   370 ..     1    62 []    47.5  6.5e-13
> 
> Sushi      2/2     375   446 ..     1    62 []    41.2  5.1e-11
> 
> Trypsin    1/1     463   698 ..     1   259 []   206.0  1.2e-60
> 
> //
> 
> 
> 
> 
> 
> Cheers,
> 
> Sean.
> 
> 
> 
> 
> 
> 
> 
> On 8/30/05, Barry Moore <bmoore at genetics.utah.edu> wrote:
> 
> > Sean,
> 
> >
> 
> > Don't see anything obviously wrong.  If you want to send your input
> 
> > file, I'll try to recreate the problem.
> 
> >
> 
> > Barry
> 
> >
> 
> > -----Original Message-----
> 
> > From: bioperl-l-bounces at portal.open-bio.org
> 
> > [mailto:bioperl-l-bounces at portal.open-bio.org] On Behalf Of Sean
> 
> > O'Keeffe
> 
> > Sent: Tuesday, August 30, 2005 2:55 AM
> 
> > To: bioperl-l at portal.open-bio.org
> 
> > Subject: [Bioperl-l] Bio::Search results
> 
> >
> 
> > Hi,
> 
> > The following code snippet is something I use to extract information
> 
> > from hmmer result files:
> 
> >
> 
> > use Bio::SearchIO;
> 
> >
> 
> > my $in =3D new Bio::SearchIO( -format =3D> 'hmmer',  -file =3D>
> 
> > $ARGV[0] ); while(my $result =3D $in->next_result) {
> 
> >   print $result->query_name(),
"\n",$result->query_description(),"\n";
> 
> >   while (my $hit =3D $result->next_hit) {
> 
> >     while(my $hsp =3D $hit->next_domain) {
> 
> >       next unless ($hsp->name =3D~ /^ig|^lrr|^fn3|^egf|^tsp|^psi/i);
> 
> >       print $hsp->start(),"\t",$hsp->end(),"\t",$hsp->evalue(),"\n";
> 
> >     }
> 
> >   }
> 
> > }
> 
> >
> 
> > The input file is generated by hmmpfam and is given at the command
> 
> > line. I use it to scan for specific domain names e.g ig, fn3 lrr
etc.
> 
> > This code works for the first loop and then ends so I get the name
and
> 
> 
> > description (no hsp values as their are none for this result):
> 
> >
> 
> > ENSMUSP00000065602=20
> 
> > pep:novel supercontig::NT_085813:405:1510:-1 gene:ENSMUSG00000054059
> 
> > transcript:ENSMUST00000066517
> 
> >
> 
> > My question is why does the loop end after one instance.
Incidentally
> 
> > the outputted  name and description above are the last ones in the
> 
> > hmmer file (maybe the file is read from the back??? - don't know if
> 
> > this means anything). Any thoughts would be appreciated. Thanks,
> 
> > Sean.
> 
> >
> 
> > _______________________________________________
> 
> > Bioperl-l mailing list
> 
> > Bioperl-l at portal.open-bio.org
> 
> > http://portal.open-bio.org/mailman/listinfo/bioperl-l
> 
> >
> 
> 
> 
> 
> 
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pfam.out
Type: application/octet-stream
Size: 2758 bytes
Desc: pfam.out
Url : http://portal.open-bio.org/pipermail/bioperl-l/attachments/20050902/fca07e14/pfam-0001.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bp_pfam.pl
Type: application/octet-stream
Size: 473 bytes
Desc: bp_pfam.pl
Url : http://portal.open-bio.org/pipermail/bioperl-l/attachments/20050902/fca07e14/bp_pfam-0001.obj


More information about the Bioperl-l mailing list