[BioPython] Problem parsing Blast XML output from different sources
Michiel de Hoon
mdehoon at c2b2.columbia.edu
Sun Oct 8 04:51:09 UTC 2006
Hi Steffi,
I am trying to replicate this problem with Blast. Where did you get the
pat database? I searched for it with Google, but there seems to be more
than one Blast database called pat.
--Michiel.
Steffi Gebauer-Jung wrote:
> Hello,
>
> I don't know what local databases you have available for testing.
> The discrepancy between xml and 'pairwise text' output should be seen
> for every Plus/Minus Hsp created by local Blastn (local server or
> standalone blastall from command line, I use version 2.2.14)
>
> I tried several combinations, one is M38240 vs. pat database,
> the hsp hit was BD298385.
> Here are the interesting output snippets:
>
> >dbj|BD298385.1|
> <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=92136243&dopt=GenBank>
> CLEAN SYNTHETIC VECTORS, PLASMIDS, TRANSGENIC PLANTS AND PLANT PARTS
> CONTAINING THEM, AND METHODS FOR OBTAINING THEM
> Length = 14108
>
> Score = 125 bits (63), Expect = 1e-25
> Identities = 63/63 (100%)
> Strand = Plus / Minus
>
>
> Query: 727  aatgaagactaatctttttctctttctcatcttttcacttctcctatcattatcctcggc 786
>             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
> Sbjct: 8332 aatgaagactaatctttttctctttctcatcttttcacttctcctatcattatcctcggc 8273
>
> Query: 787  cga 789
>             |||
> Sbjct: 8272 cga 8270
>
> =====================================================
> <Hit>
> <Hit_num>15</Hit_num>
> <Hit_id>gi|92136243|dbj|BD298385.1|</Hit_id>
> <Hit_def>CLEAN SYNTHETIC VECTORS, PLASMIDS, TRANSGENIC PLANTS
> AND PLANT PARTS CONTAINING THEM, AND METHODS FOR OBTAINING THEM</Hit_def>
> <Hit_accession>BD298385</Hit_accession>
> <Hit_len>14108</Hit_len>
> <Hit_hsps>
> <Hsp>
> <Hsp_num>1</Hsp_num>
> <Hsp_bit-score>125.381</Hsp_bit-score>
> <Hsp_score>63</Hsp_score>
> <Hsp_evalue>9.63859e-26</Hsp_evalue>
> <Hsp_query-from>789</Hsp_query-from>
> <Hsp_query-to>727</Hsp_query-to>
> <Hsp_hit-from>8270</Hsp_hit-from>
> <Hsp_hit-to>8332</Hsp_hit-to>
> <Hsp_query-frame>1</Hsp_query-frame>
> <Hsp_hit-frame>-1</Hsp_hit-frame>
> <Hsp_identity>63</Hsp_identity>
> <Hsp_positive>63</Hsp_positive>
> <Hsp_align-len>63</Hsp_align-len>
> <Hsp_qseq>TCGGCCGAGGATAATGATAGGAGAAGTGAAAAGATGAGAAAGAGAAAAAGATTAGTCTTCATT</Hsp_qseq>
> <Hsp_hseq>TCGGCCGAGGATAATGATAGGAGAAGTGAAAAGATGAGAAAGAGAAAAAGATTAGTCTTCATT</Hsp_hseq>
> <Hsp_midline>|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||</Hsp_midline>
> </Hsp>
> </Hit_hsps>
> </Hit>
>
> Thanks, Steffi
>
> Michiel Jan Laurens de Hoon wrote:
>
>> Which sequence are you running blast on?
>> I'd like to try this on our local blast installation.
>>
>> --Michiel.
>>
>> Steffi Gebauer-Jung wrote:
>>
>>> Hello,
>>>
>>> because the output of blastall 2.2.14 could not be parsed by the
>>> Bio.Blast.NCBIStandalone parser,
>>> I tried to switch to the recommended Bio.Blast.NCBIXML parser.
>>>
>>> In doing so, I found that the xml output of the locally installed
>>> standalone blastall (2.2.14)
>>> differs from the web xml output.
>>>
>>> For BlastN hsps on Plus/Minus strands, the xml gives
>>> query_frame/hit_frame 1 / -1 as usual.
>>> But the query and hit positions and sequences are switched in direction
>>> (they would match frames -1/1).
>>>
>>> As the Bio.Blast.Record returned by the NCBIXML parser only gives
>>> frames, sequences
>>> and start positions, it is not possible (without knowing the source of
>>> the xml file)
>>> to be sure of finding the right data.
>>>
>>> This is clearly a problem of Blast.
>>> But because of the missing end positions in the returned record object
>>> it becomes a problem for users of the parser too.
>>>
>>> Could somebody try to confirm the different behaviour of the xml
>>> blast output
>>> with his/her own examples/installation?
>>>
>>> Thanks, Steffi
>
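
The orientation mismatch described in this thread can be worked around
on the parsing side. The following is a minimal sketch, not part of
Biopython: it reads a single <Hsp> element with the standard library
and, when Hsp_query-from is greater than Hsp_query-to, reverse-complements
both sequences and swaps the coordinates so that the query runs in
ascending order, as in the pairwise text output. The names
normalize_hsp and reverse_complement and the keys of the returned
dictionary are hypothetical. The assumption that query-from > query-to
reliably identifies the affected blastall output rests only on the
single example quoted above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <Hsp> element quoted in this thread.
SAMPLE_HSP = """
<Hsp>
  <Hsp_query-from>789</Hsp_query-from>
  <Hsp_query-to>727</Hsp_query-to>
  <Hsp_hit-from>8270</Hsp_hit-from>
  <Hsp_hit-to>8332</Hsp_hit-to>
  <Hsp_query-frame>1</Hsp_query-frame>
  <Hsp_hit-frame>-1</Hsp_hit-frame>
  <Hsp_qseq>TCGGCCGAGGATAATGATAGGAGAAGTGAAAAGATGAGAAAGAGAAAAAGATTAGTCTTCATT</Hsp_qseq>
  <Hsp_hseq>TCGGCCGAGGATAATGATAGGAGAAGTGAAAAGATGAGAAAGAGAAAAAGATTAGTCTTCATT</Hsp_hseq>
  <Hsp_midline>|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||</Hsp_midline>
</Hsp>
"""

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a nucleotide string; gaps ('-') pass through."""
    return seq.translate(_COMPLEMENT)[::-1]


def normalize_hsp(hsp_element):
    """Return the HSP as a dict with the query in ascending order.

    Hypothetical helper: if the query coordinates are descending, both
    aligned sequences are reverse-complemented, the midline is reversed,
    and the from/to coordinates of query and hit are swapped.
    """
    def text(tag):
        return hsp_element.findtext(tag).strip()

    hsp = {
        "query_from": int(text("Hsp_query-from")),
        "query_to": int(text("Hsp_query-to")),
        "hit_from": int(text("Hsp_hit-from")),
        "hit_to": int(text("Hsp_hit-to")),
        "qseq": text("Hsp_qseq"),
        "hseq": text("Hsp_hseq"),
        "midline": text("Hsp_midline"),
    }
    if hsp["query_from"] > hsp["query_to"]:
        hsp["query_from"], hsp["query_to"] = hsp["query_to"], hsp["query_from"]
        hsp["hit_from"], hsp["hit_to"] = hsp["hit_to"], hsp["hit_from"]
        hsp["qseq"] = reverse_complement(hsp["qseq"])
        hsp["hseq"] = reverse_complement(hsp["hseq"])
        hsp["midline"] = hsp["midline"][::-1]
    return hsp


if __name__ == "__main__":
    hsp = normalize_hsp(ET.fromstring(SAMPLE_HSP))
    print(hsp["query_from"], hsp["query_to"], hsp["hit_from"], hsp["hit_to"])
    print(hsp["qseq"].lower())
```

For the example above, the normalized record has the query running
from 727 to 789 and the hit from 8332 to 8270, which corresponds to the
Plus / Minus alignment in the pairwise text output.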