[BioPython] NCBIStandalone.iterator hangs
Peter
biopython at maubp.freeserve.co.uk
Mon Oct 16 15:39:17 UTC 2006
David Toomey wrote:
> Thanks for the help Peter
>
> The attached file has the output from two queries, problem.txt and
> works.txt, when run manually from the command line
Excellent - that looks like everything I asked for.
> I also edited the NCBIStandalone module to add a print statement to
> Iterator.next() and then ran the same two files using the script.
> If you compare the two reports for problem.txt you can see on which line of
> the report the iterator is hanging. I have had a look at this and can't see
> anything about the line that is unusual?
>
> The last line outputted by the script is
> Query: 266 LAAHIDQYDIDAMTGIRATDIEKTDEAIKVTLENGAVLESKTVIIATGAGWRKLNIPGEE 325
>
>
> And the manual report continues with
> I + + ++ + VL
> Sbjct: 68 GIMSIPTLILFKGGE-PVKQLIGYQPKEQLEAQLADVL- 125
>
Your script is hanging during the second alignment in the results for:
Nadph Dehydrogenase 1 - Clostridium beijerinckii (Clostridium MP)
This alignment does look "funny" to me, but it's not the first "funny"
alignment in the results.
I suspect you have found a problem with NCBI blast, or perhaps have a
malformed database (especially given the errors you mention below).
However, there are similar odd pairwise alignments before this one in
the results which BioPython has apparently coped with... so it is a
little odd that BioPython gets stuck at this particular point.
As to why I think there is a problem:
Notice that the Query sequence continues for several lines (up to Query
504) while the Sbjct sequence is blank (up to Sbjct 313) except for a
single lone gap character at position 246.
I would have expected the match to finish at about Query 303 / Match
103. Very odd.
In addition, notice that the header information is inconsistent:
Score = 149 bits (376), Expect = 7e-037
Identities = 0/309 (0%), Positives = 0/309 (0%), Gaps = 15/309 (4%)
Even looking at just the second set of 60 characters (quoted above) we
have three identical matches (I, V and L) and five close matches.
In all I would say there were five identical matches (A, S, I, V, L)
and a further nine close matches. So the identities score should be
5/length, and the positives 14/length.
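
The hand count above can be reproduced with a small sketch. This is not
part of BioPython; count_midline is a hypothetical helper name, and it
assumes the usual BLASTP convention that a letter on the match line marks
an identical residue and a "+" marks a close (conservative) match. It is
written for current Python rather than the version in use at the time.

```python
def count_midline(midline):
    """Count identities and positives on a BLASTP match line.

    Assumption: letters mark identical residues, '+' marks close
    (conservative) matches, and spaces mark everything else.
    """
    identities = sum(1 for ch in midline if ch.isalpha())
    close = midline.count("+")
    # BLAST's "Positives" figure includes the identities.
    return identities, identities + close


# The match line quoted above. Its original spacing was lost in the
# mail, which does not affect the counts.
identities, positives = count_midline("I + + ++ + VL")
print(identities, positives)
```

For the quoted line this gives three identities and eight positives
(three identical plus five close), which already contradicts the 0/309
figures in the header.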
I would also say the alignment length is either 297 (based on the length
of the gapped query shown) or 99+1 (based on the length of the gapped
subject sequence shown). Even allowing for my quick counts being out by
plus or minus one, I can't see where the stated length of 309 comes from.
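
A rough sketch of how this kind of header check could be automated is
below. It is only an illustration, not BioPython code: parse_stats and
length_is_unexplained are hypothetical names, the regular expression
assumes the statistics line looks like the one quoted above, and the
lengths 297 and 100 are my quick hand counts rather than values read
from the report. It is written for current Python.

```python
import re

# Assumed layout of the BLASTP statistics line; the Gaps field is
# treated as optional.
STATS_RE = re.compile(
    r"Identities = (\d+)/(\d+) \(\d+%\), "
    r"Positives = (\d+)/(\d+) \(\d+%\)"
    r"(?:, Gaps = (\d+)/(\d+) \(\d+%\))?"
)


def parse_stats(line):
    """Extract the counts and stated alignment length from a header line."""
    match = STATS_RE.search(line)
    if match is None:
        raise ValueError("no BLASTP statistics found in %r" % (line,))
    identities, length, positives, _, gaps, _ = match.groups()
    return {
        "identities": int(identities),
        "positives": int(positives),
        "gaps": int(gaps) if gaps is not None else 0,
        "length": int(length),
    }


def length_is_unexplained(stats, observed_lengths, tolerance=1):
    """True if no observed length is within tolerance of the stated one."""
    return all(abs(stats["length"] - n) > tolerance for n in observed_lengths)


stats = parse_stats(
    "Identities = 0/309 (0%), Positives = 0/309 (0%), Gaps = 15/309 (4%)"
)
# 297 and 100 are the hand-counted gapped query and subject lengths.
print(stats["length"], length_is_unexplained(stats, [297, 100]))
```

With a tolerance of one, neither hand-counted length accounts for the
stated 309, so the check flags this alignment.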
>
> Even though problem.txt generates a valid report when run manually it does
> output a load of errors of the type below, but I am not sure how this would
> cause the script to stop at the line above.
>
> [NULL_Caption] ERROR: ncbiapi [000.000] AHPF_STAAC: SeqPortNew: lcl|EXPT02286 stop(365) >= len(329)
> [NULL_Caption] ERROR: ncbiapi [000.000] AHPF_STAAC: SeqPortNew: lcl|EXPT02286 stop(336) >= len(329)
> [NULL_Caption] ERROR: ncbiapi [000.000] AHPF_STAAC: SeqPortNew: lcl|EXPT02286 start(337) >= len(329)
> [NULL_Caption] ERROR: ncbiapi [000.000] AHPF_STAAC: SeqPortNew: lcl|EXPT02286 start(338) >= len(329)
> [NULL_Caption] ERROR: ncbiapi [000.000] AHPF_STAAC: SeqPortNew: lcl|EXPT02113 start(284) >= len(149)
>
>
> If it is easier for you I can certainly raise a bug, I just wanted to be
> sure it wasn't anything silly that I was doing before I did this.
>
Have a look over the output yourself, and see if you agree with me.
I assume you get exactly the same results from running Blast on both
Linux and Windows.
I see you are using standalone BLASTP 2.2.13 [Nov-27-2005], so one thing
you could try is updating your copy of Blast.
I would also double check how you created/installed the database.
I think BioPython is going wrong because it's been given "funny" input.
It may be possible for us to improve that, but even so, I wouldn't trust
those blast results.
Good luck
Peter