[Biopython-dev] Fwd: [Open-bio-l] Proposed BLAST XML Changes

Tue Mar 18 06:58:06 EDT 2014

On Tue, Mar 18, 2014 at 10:33 AM, Wibowo Arindrarto
<w.arindrarto at gmail.com> wrote:
> On Tue, Mar 18, 2014 at 11:17 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> On Tue, Mar 18, 2014 at 9:52 AM, Wibowo Arindrarto
>> <w.arindrarto at gmail.com> wrote:
>>> Hi Peter, everyone,
>>>
>>> Thanks for the heads up. If implemented as it is, the updates will
>>> change our underlying SearchIO model (aside from the blast-xml parser
>>> itself), by allowing a Hit retrieval using multiple different keys.
>>
>> Could you clarify what you mean by multiple keys here?
>
> Currently, we can retrieve hits from a query using its ID, aside from
> its numeric index. With their proposed changes to the Hit element
> here: ftp://ftp.ncbi.nlm.nih.gov/blast/documents/NEWXML/ProposedBLASTXMLChanges.pdf,
> it means that a given Hit can now be annotated with more than one ID.

But this happens already in the current output from merged entries
in databases like NR - we effectively use the first alternative ID as
the hit ID. See for example the nasty &gt; separated entries in
the legacy BLAST XML's <Hit_def> tag where only the first ID
appears in the <Hit_id> tag:

http://blastedbio.blogspot.co.uk/2012/05/blast-tabular-missing-descriptions.html

See also the new optional fields in the tabular output which
explicitly list all the aliases for the merged record (e.g. sallseqid).
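
To make the merged-entry case concrete, here is a minimal sketch of
pulling the alternative IDs back out of a legacy <Hit_def> value. It
assumes the XML has already been decoded (so &gt; is a plain ">"),
that entries are separated by " >", and that each extra entry starts
with its ID followed by a space. The function name and the example
IDs are invented for illustration and are not real database records.

```python
def split_merged_hit(hit_id, hit_def):
    """Return (id, description) pairs for a merged legacy BLAST XML hit.

    Assumes entries in hit_def are separated by " >" and that every
    entry after the first begins with its own ID.
    """
    first_desc, *others = hit_def.split(" >")
    # The first description belongs to the ID given in <Hit_id>.
    pairs = [(hit_id, first_desc.strip())]
    for entry in others:
        # Everything up to the first space is the alternative ID.
        alt_id, _, alt_desc = entry.partition(" ")
        pairs.append((alt_id, alt_desc.strip()))
    return pairs


# Made-up example values, not taken from a real BLAST run.
example = split_merged_hit(
    "gi|1|ref|AAA_0001.1|",
    "hypothetical protein A [Species one] "
    ">gi|2|ref|BBB_0002.1| hypothetical protein B [Species two]",
)
```

Only the first pair corresponds to what currently ends up as the hit
ID; the rest are the aliases that are otherwise hidden inside the
description.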

> Ideally, this should also be reflected in the QueryResult object: a
> hit item should be retrievable using any of the IDs it has.
>
> This will also affect membership checking on the QueryResult object.

This looks like something we should review anyway, regardless
of the new BLAST XML format.
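
As a rough illustration of what "retrievable using any of the IDs"
could mean, here is a toy container. It is not the SearchIO
QueryResult implementation; the class name AliasedHits and its add()
method are hypothetical. It only shows one way to keep hits in order
while letting lookup and membership checks accept any alias.

```python
class AliasedHits:
    """Toy container: hits keep insertion order and any alias finds them."""

    def __init__(self):
        self._hits = []         # hits in insertion order
        self._index_by_id = {}  # every known ID -> position in self._hits

    def add(self, hit, ids):
        # Refuse IDs that already point at an earlier hit.
        for hit_id in ids:
            if hit_id in self._index_by_id:
                raise ValueError("duplicate hit ID: %r" % (hit_id,))
        position = len(self._hits)
        self._hits.append(hit)
        for hit_id in ids:
            self._index_by_id[hit_id] = position

    def __getitem__(self, key):
        # Integer keys are positions; anything else is treated as an ID.
        if isinstance(key, int):
            return self._hits[key]
        return self._hits[self._index_by_id[key]]

    def __contains__(self, key):
        return key in self._index_by_id

    def __len__(self):
        return len(self._hits)


hits = AliasedHits()
hits.add("hit-A", ["id1", "id1-alias"])
hits.add("hit-B", ["id2"])
```

With this layout the primary ID and its aliases all resolve to the
same stored object, and len() still counts hits rather than IDs.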

>>> I have a feeling it will be difficult to jam all the new changes into
>>> a backwards-compatible parser. One way to make it transparent to users
>>> is to use the underlying DTD to do validation before parsing (for the
>>> two BLAST DTDs, use the one which the file can be validated against).
>>> However, this comes at a price. Since the standard library-bundled
>>> elementtree doesn't seem to support validation, we have to use another
>>> library (lxml is my choice). This means adding a 3rd party dependency
>>> which requires compiling (lxml is also partly written in C).
>>
>> We can probably tell by sniffing the first few lines... but how
>> to do that without using a handle seek to rewind may be
>> tricky (desirable to support parsing streams, e.g. stdin).
>
> Ah yes. We have a rewindable file seek object in Bio.File, don't we
> :)? I'll have to play around with some real datasets first, I think.

Yes, the UndoHandle in Bio.File might be the best solution
here for auto-detection. But two explicit formats is probably better.
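
For what it is worth, the sniffing idea could look roughly like the
sketch below. It uses a small stdlib-only wrapper as a stand-in for
the peek/undo behaviour, rather than Bio.File.UndoHandle itself, so
it works on streams that cannot seek. The marker "<BlastOutput2" for
the new format is an assumption (the proposal may settle on a
different root element), and PeekableHandle and sniff_blast_xml are
hypothetical names.

```python
import io


class PeekableHandle:
    """Buffer lines read while sniffing so nothing is lost (no seek needed)."""

    def __init__(self, handle):
        self._handle = handle
        self._saved = []  # lines already pulled from the handle

    def peek_lines(self, count):
        # Read ahead until we hold `count` lines or reach end of input.
        while len(self._saved) < count:
            line = self._handle.readline()
            if not line:
                break
            self._saved.append(line)
        return list(self._saved[:count])

    def readline(self):
        if self._saved:
            return self._saved.pop(0)
        return self._handle.readline()

    def read(self):
        # Hand back the buffered lines first, then the rest of the stream.
        saved = "".join(self._saved)
        self._saved = []
        return saved + self._handle.read()


def sniff_blast_xml(handle, new_marker="<BlastOutput2", max_lines=5):
    """Guess the format name from the first few lines.

    new_marker is an assumed marker for the new format.
    """
    head = "".join(handle.peek_lines(max_lines))
    return "blast-xml2" if new_marker in head else "blast-xml"


LEGACY_TEXT = (
    '<?xml version="1.0"?>\n'
    "<BlastOutput>\n"
    "<BlastOutput_program>blastp</BlastOutput_program>\n"
)
NEW_TEXT = '<?xml version="1.0"?>\n<BlastOutput2>\n'

legacy_handle = PeekableHandle(io.StringIO(LEGACY_TEXT))
legacy_format = sniff_blast_xml(legacy_handle)
legacy_replayed = legacy_handle.read()  # full content is still available

new_format = sniff_blast_xml(PeekableHandle(io.StringIO(NEW_TEXT)))
```

The point is only that the parser downstream still sees the complete
input after the sniff. It does not change the argument that two
explicit format names are simpler.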

> The other thing we should take into account is the Xinclude tag. Would
> we want to make it possible to query *either* the single query XML
> results or the master Xinclude document (point 2 of the proposed
> change)? Or should we restrict our parser only to the single query
> files?

I think restricting to single files is reasonable... assuming BLAST
will still have the option of producing a big multi-query XML?
Probably we should ask the NCBI about that...

I would hope the Bio.SearchIO.index_db(...) approach could
be used on a collection of little XML files, one for each query.

>>> The other option is to introduce a new format name (e.g.
>>> 'blast-xml2'), which makes the user responsible for knowing which
>>> BLAST XML he/she is parsing. It feels more explicit this way, so I am
>>> leaning towards this option, despite 'blast-xml2' not sounding very
>>> nice to me ;).
>>>
>>> Any other thoughts?
>>>
>>> Best,
>>> Bow
>>
>> I agree for the SearchIO interface, two format names makes
>> sense - unless there is a neat way to auto-detect this on input.
>>
>> Using "blast-xml2" would work, or maybe something like
>> "blast-xml-2014" (too long?).
>>
>> We could even go for "blast-xml-old" and "blast-xml" perhaps?
>
> Hmm..'blast-xml-old', may make it difficult to adapt for future XML
> schema changes. How about renaming the current parser to
> 'blast-xml-legacy', and the new one to just 'blast-xml'?

A possible downside of 'blast-xml-legacy' over 'blast-xml-old'
is that it may be confused with the move from the "legacy" BLAST
in C to the current BLAST+ in C++ (which happened well before
this XML format change).

Peter