[Bioperl-l] Re: bug #949

Jason Stajich jason@chg.mc.duke.edu
Thu, 14 Jun 2001 10:20:51 -0400 (EDT)


On Thu, 14 Jun 2001, Roger Hall wrote:

> I am not sure what the bug is - do I need to
>     1. leave the test in place comparing frame sign [+-] against the subject
> strand directional sign and change the warning to say 'subject' instead of
> 'query', or

>     2. change the test to compare against the query strand directional sign?
>

The current behavior was that the frame sign (+/-) was dependent on the
query strand.  I don't think it is.

Let's walk through a blastx and tblastn report and make sure we agree on  
everything 

For this specific blastx HSP:

>gi|728837|sp|P39194|ALU7_HUMAN ALU SUBFAMILY SQ SEQUENCE
            CONTAMINATION WARNING ENTRY
          Length = 593

 Score =  161 bits (404), Expect = 8e-39
 Identities = 73/97 (75%), Positives = 76/97 (78%)
 Frame = -1

Query: 2996 FFLRQSFTLVTQAGVQWHDLSSLQPLPPRFKGFSSLSLPISWDYRRLPPCLANFCIFHKD  2817
            FFLR+SF LV QAGVQW DL SLQP PP FK FS LSLP SWDYRR PP  ANFCIF +D
Sbjct: 299  FFLRRSFALVAQAGVQWRDLGSLQPPPPGFKRFSCLSLPSSWDYRRPPPRPANFCIFSRD 358

Query: 2816 GVLPCWPGWS*TPDLR*YAHFGIPKCWDYRREPPCPA 2706
            GV PCWPGWS TPDLR     G+PKCWDYRREPP PA
Sbjct: 359  GVSPCWPGWSRTPDLRXSTRLGLPKCWDYRREPPRPA 395

query start = 2706
query end   = 2996
query strand= -1

sbjct start = 299
sbjct end   = 395
sbjct strand= 1

Frame = -1


But for the next HSP of the same Sbjct
 Score =  156 bits (391), Expect = 3e-37
 Identities = 76/97 (78%), Positives = 83/97 (85%)
 Frame = +3

Query: 2706 GRARWLTPVIPALWDAKVGISSEVRSSRPAWPTWQNSVFMKNTKISQAWWQAPVIPANWE 2885
            GRARWLTPVIPALW+A+ G S EVRSSRPAWPTW N V  KNTKIS+AWW+APVIPA  E
Sbjct: 1    GRARWLTPVIPALWEAEAGGSPEVRSSRPAWPTWXNPVSTKNTKISRAWWRAPVIPATRE 60

Query: 2886 AEAGESLESRRQRLQ*AEIVPLHSSLGDKSKTLSQKK 2996
            AEAGESLE  R+RLQ AEI PLHSSLG+KS+T SQKK
Sbjct: 61   AEAGESLEPGRRRLQXAEIAPLHSSLGNKSETPSQKK 97

query start = 2706
query end   = 2996
query strand= -1

sbjct start = 1
sbjct end   = 97
sbjct strand= 1

Frame = +3

 
> I have verified that the frame and start/end pair are parsed correctly, and
> that the start/end pair is correctly used to determine the strand direction.
> 
> I am puzzled by the line:
>      $frame = $2 - 1;
> which comes after the match:
>     $frame !~ /^([+-])?([1-3])/
> This reduces the frame domain to [0-2] for every frame assigned. (and I
> don't understand why - string index?  :).

frames in blast are 1-3, in GFF they are 0-2, we follow GFF Bioperl.

> 
> Even though the warning is made to STDERR, I had no trouble calling various
> methods (score, bits, percent, etc.) on my HSP objects (including verifying
> that the frame domain was now [0-2] - did I mention my puzzlement?). Perhaps
> the 'Devil of a time' is the warnings...?

I believe that is true.

> I compared the test data (sample.blx as provided by the issue originator) to
> their respective results by hand (about a dozen Sbjcts and 150 HSPs), and it
> was obvious that the subject strand sign is properly the opposite of the
> frame sign when the warnings are made. 

> It was equally obvious that the
> *query* strand sign was *equal* to the frame sign.
> 

Not sure this is true if you look at the second HSP for the first
Sbjct hit in the report.  I *think* this should be true for tblastn
reports however which is why that assertion/warning was written into HSP.

> As I said - I'm not sure what the problem is - although I am very familiar
> with BPLite now. :}
> 
> aka: Whaddaya want it/me to do?
> 

We should probably disable the test/warnings in BPlite::HSP unless we
want to autodetect blastx reports and just supress warnings on those
type of reports. 

I believe I am interpreting everything correctly here.

I've committed a shortened version of the submitted blastx report as well as 
added the test to BPlite.t so we can be sure and see these msgs.

> Thanks (and sorry for being so green)!
> 
> Roger
> 
> 
> Code Details:
> The frame value is properly parsed in Sbjct::nextHSP(), although it does not
> enforce a domain of [1-3], but rather allows any numerical value.
> 
>     elsif( $_ =~ /^\s*Frame\s*=\s*([-\+]\d+)/ ) {
>  $frame = $1;
>     }
> 
> The query, homology, and subject lines are parsed, and the start and end
> positions are gathered (subject line match shown).
> 
>   for(my $i=0;$i<@hspline;$i+=3) {
>     $hspline[$i+2] =~ /^Sbjct:\s+(\d+)\s*([\D\S]+)\s+(\d+)/;
>     $sl = $2; $sb = $1 unless $sb; $se = $3;
>     push @SL, $sl;
>   }
>   $sl = join("", @SL);
> 
> Similar code is used for the query and homology lines.
> 
> Once parsed, the new HSP object is created using the frame, start/end
> positions, etc.
> 
> After rearranging args, HSP::new() determines the start/end pair vector for
> the strand param to a new Similarity object.
> 
>     if ($se > $sb) {  # normal subject
>  $self->subject( Bio::SeqFeature::Similarity->new
>    (-start=>$sb, -end=>$se, -strand=>1,
>     -source=>"BLAST" ) ) }
>     else {   # reverse subject: start bigger than end
>  $self->subject( Bio::SeqFeature::Similarity->new
>    (-start=>$se, -end=>$sb, -strand=>-1,
>     -source=>"BLAST" ) ) }
> 
> Late in new(), frame() is called, where the normal frame domain is enforced.
> When the param to frame() is defined and matches /^([+-])?([1-3])/, the
> frame value is compared with the subject strand. If the signs are unequal,
> the 'did not match' warning is produced.
> 
> Then the absolute value of frame is reduced by one.   (?)
> 
> 
> 
> ----- Original Message -----
> From: "Jason Stajich" <jason@chg.mc.duke.edu>
> To: "Roger Hall" <roger@iosea.com>
> Cc: "Hilmar Lapp" <hilmarl@yahoo.com>; "Ewan Birney" <birney@ebi.ac.uk>
> Sent: Wednesday, June 13, 2001 1:18 PM
> Subject: bug #949
> 
> 
> > I'm starting to look at bug 949 although I'm not particularly excited
> > about it...
> >
> > Let me know if you
> > a) claim it
> > b) have ideas
> >
> 
> 

Jason Stajich
jason@chg.mc.duke.edu
Center for Human Genetics
Duke University Medical Center 
http://www.chg.duke.edu/