[Biopython] Parsing large blast files
Peter Cock
p.j.a.cock at googlemail.com
Tue Apr 28 13:36:37 UTC 2009
On Tue, Apr 28, 2009 at 2:00 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>> NCBIStandalone.Iterator() is the old semi-obsolete plain
>> text parser - it won't parse the XML output, hence the
>> "Invalid header" error. Maybe the tutorial
>> (or the error message) could be clearer.
>
> I think part of the problem is the organization of the code in Bio.Blast,
> which seems to have grown historically. Bio.Blast.NCBIStandalone
> contains blastall, blastpgp, and rpsblast, which makes sense, but also
> BlastParser and PsiBlastParser, which are not necessarily connected
> to standalone Blast. Bio.Blast.ParseBlastTable contains the parser for
> blastpgp output. Bio.Blast.NCBIWWW contains qblast, but also the
> parser for Blast HTML output, though qblast does not necessarily
> generate output in HTML format.

I presumed that initially the standalone tools only produced plain text,
and the website (qblast) only produced HTML - hence the use of
Bio.Blast.NCBIStandalone for both command line wrappers AND the
plain text parser, and Bio.Blast.NCBIWWW for both the qblast function
AND the HTML parser.

> The usage of this module may be more understandable if all functions
> were accessible from Bio.Blast directly in a fashion more consistent
> with current Biopython. Bio.Blast would then have the following functions:
>
> read(handle, format='xml')
> parse(handle, format='xml')
> blastall
> blastpgp
> rpsblast
> qblast
>
> with most of the actual code hiding in Bio.Blast.NCBIStandalone etcetera.
>
> Any objections, comments?

I do like the idea of moving/importing the qblast function directly
under Bio.Blast, and perhaps removing Bio.Blast.NCBIXML later on.

For read/parse functions, we should probably call the format
"blastxml" to match BioPerl. Would you continue to support the plain
text output here? Also, something to keep in mind is that there may be
non-NCBI variants of BLAST with their own formats as well.
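
To make the read/parse idea concrete, here is a rough sketch of how a
format name such as "blastxml" could select a parser, leaving room for
plain text or non-NCBI formats to be registered later. This is not
existing Biopython code: the registry `_PARSERS`, the helper
`_parse_blastxml`, and the (query, hits) tuples it yields are all
hypothetical, and the toy parser only pulls out the query and hit
descriptions from NCBI-style BLAST XML.

```python
import xml.etree.ElementTree as ET


def _parse_blastxml(handle):
    """Yield (query_def, [hit_def, ...]) per <Iteration> (toy parser)."""
    # iterparse streams the file, so large outputs need not fit in memory.
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "Iteration":
            query = elem.findtext("Iteration_query-def")
            hits = [hit.findtext("Hit_def") for hit in elem.iter("Hit")]
            yield query, hits
            elem.clear()  # free the subtree we have finished with


# Hypothetical registry: other formats (plain text, non-NCBI variants)
# would add their own entries here.
_PARSERS = {"blastxml": _parse_blastxml}


def parse(handle, format="blastxml"):
    """Return an iterator over all records in the handle."""
    try:
        parser = _PARSERS[format]
    except KeyError:
        raise ValueError("Unknown BLAST format: %r" % format)
    return parser(handle)


def read(handle, format="blastxml"):
    """Return the single record in the handle, or raise ValueError."""
    records = parse(handle, format)
    try:
        first = next(records)
    except StopIteration:
        raise ValueError("No records found in handle")
    if next(records, None) is not None:
        raise ValueError("More than one record found in handle")
    return first
```

An unknown format name fails straight away with a ValueError naming the
format, rather than with a parser-specific message such as "Invalid
header".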

Rather than continuing to encourage the use of blastall, blastpgp and
rpsblast, I would rather bring Bio.Blast.Applications up to date, and
then declare the three "helper" functions obsolete. They are very
limiting in how the command line is invoked - you can't choose the
exact call used (e.g. subprocess options) or what you want back (e.g.
you may not care about the handles). For example, getting BLAST to
write its output to a file is confusingly difficult right now using
these functions. Error handling with them is also awkward.
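
As an illustration of the kind of control I mean, here is a minimal
sketch that builds a blastall command line, sends the output straight
to a file, and turns a non-zero exit status into an exception. The
function names `build_blastall_cmd` and `run_to_file` are made up for
this example and are not part of Biopython; it assumes the legacy
blastall options -p, -d, -i and -m 7 (XML output), and the database and
query file names in the usage note are placeholders.

```python
import subprocess


def build_blastall_cmd(program, database, query_file, exe="blastall"):
    """Return the argument list for a blastall run with XML output."""
    # -p program, -d database, -i query file, -m 7 requests XML output
    return [exe, "-p", program, "-d", database, "-i", query_file, "-m", "7"]


def run_to_file(cmd, out_path):
    """Run cmd, writing its stdout to out_path; raise on failure."""
    with open(out_path, "wb") as out_handle:
        # stdout goes directly to the file, so no handle is passed back
        result = subprocess.run(cmd, stdout=out_handle,
                                stderr=subprocess.PIPE)
    if result.returncode != 0:
        message = result.stderr.decode(errors="replace").strip()
        raise RuntimeError("Command failed with exit status %d: %s"
                           % (result.returncode, message))
    return out_path
```

With BLAST installed, something like
run_to_file(build_blastall_cmd("blastn", "my_db", "query.fasta"), "out.xml")
would leave the XML in out.xml ready for parsing, and the caller keeps
full control over the subprocess call.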
Peter