[Biopython-dev] Fwd: [Open-bio-l] Proposed BLAST XML Changes

Tue Mar 18 11:48:56 UTC 2014

On Tue, Mar 18, 2014 at 11:58 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Tue, Mar 18, 2014 at 10:33 AM, Wibowo Arindrarto
> <w.arindrarto at gmail.com> wrote:
>> On Tue, Mar 18, 2014 at 11:17 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>> On Tue, Mar 18, 2014 at 9:52 AM, Wibowo Arindrarto
>>> <w.arindrarto at gmail.com> wrote:
>>>> Hi Peter, everyone,
>>>>
>>>> Thanks for the heads up. If implemented as it is, the updates will
>>>> change our underlying SearchIO model (aside from the blast-xml parser
>>>> itself), by allowing a Hit retrieval using multiple different keys.
>>>
>>> Could you clarify what you mean by multiple keys here?
>>
>> Currently, we can retrieve hits from a query using its ID, aside from
>> its numeric index. With their proposed changes to the Hit element
>> here: ftp://ftp.ncbi.nlm.nih.gov/blast/documents/NEWXML/ProposedBLASTXMLChanges.pdf,
>> it means that a given Hit can now be annotated with more than one ID.
>
> But this happens already in the current output from merged entries
> in databases like NR - we effectively use the first alternative ID as
> the hit ID. See for example the nasty &gt; separated entries in
> the legacy BLAST XML's <Hit_def> tag where only the first ID
> appears in the <Hit_id> tag:
>
> http://blastedbio.blogspot.co.uk/2012/05/blast-tabular-missing-descriptions.html
>
> See also the new optional fields in the tabular output which
> explicitly list all the aliases for the merged record (e.g. sallseqid).

In the BLAST outputs, yes. However, there's no explicit support yet in
SearchIO for this. Currently we only parse whatever is in <Hit_id> as
the ID and <Hit_def> as the description. If the <Hit_id> value is
separated by semicolons or contains more than one ID, the current parser
does not try to split it into multiple IDs. Instead it takes the whole
string as the ID.

Also, in the blast tabular format, even though sallseqid is parsed,
it's merely stored as an attribute of the hit object, not something
that can be used to retrieve Hits from the QueryResult object.
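
To make the idea concrete, here is a rough sketch of what "retrievable
by any of its IDs" could look like. This is not SearchIO code:
AliasedHits and ids_from_legacy_hit are hypothetical names, and I am
assuming that the merged entries in <Hit_def> are separated by " >"
once the XML has been unescaped, with the alternative ID being the
first token of each entry.

```python
def ids_from_legacy_hit(hit_id, hit_def):
    """Collect the primary ID plus any alternative IDs found in hit_def.

    Assumption: merged entries are separated by " >" and each extra
    entry starts with its own ID followed by whitespace.
    """
    ids = [hit_id]
    for entry in hit_def.split(" >")[1:]:
        parts = entry.split(None, 1)
        if parts and parts[0] not in ids:
            ids.append(parts[0])
    return ids


class AliasedHits(object):
    """Toy container in which one hit is reachable under several IDs."""

    def __init__(self):
        self._hits = []    # hits in insertion order
        self._by_id = {}   # any known ID -> position in self._hits

    def add(self, hit, ids):
        # Refuse IDs that already point at another hit.
        for hit_id in ids:
            if hit_id in self._by_id:
                raise ValueError("duplicate hit ID: %r" % (hit_id,))
        pos = len(self._hits)
        self._hits.append(hit)
        for hit_id in ids:
            self._by_id[hit_id] = pos

    def __getitem__(self, key):
        # Integer keys are positions, anything else is treated as an ID.
        if isinstance(key, int):
            return self._hits[key]
        return self._hits[self._by_id[key]]

    def __contains__(self, key):
        return key in self._by_id

    def __len__(self):
        return len(self._hits)


hit_def = "protein A [Org one] >gi|2|ref|NP_2.1| protein A [Org two]"
ids = ids_from_legacy_hit("gi|1|ref|NP_1.1|", hit_def)

hits = AliasedHits()
hits.add({"description": "protein A"}, ids)
```

With this, hits["gi|1|ref|NP_1.1|"], hits["gi|2|ref|NP_2.1|"] and
hits[0] all return the same object, and membership checks work for
either ID while len(hits) still counts the hit once.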

>> Ideally, this should also be reflected in the QueryResult object: a
>> hit item should be retrievable using any of the IDs it has.
>>
>> This will also affect membership checking on the QueryResult object.
>
> This looks like something we should review anyway, regardless
> of the new BLAST XML format.

Of course :).

>>>> I have a feeling it will be difficult to jam all the new changes into
>>>> a backwards-compatible parser. One way to make it transparent to users
>>>> is to use the underlying DTD to do validation before parsing (for the
>>>> two BLAST DTDs, use the one which the file can be validated against).
>>>> However, this comes at a price. Since the standard library-bundled
>>>> elementtree doesn't seem to support validation, we have to use another
>>>> library (lxml is my choice). This means adding a third-party dependency
>>>> which requires compiling (lxml is also partly written in C).
>>>
>>> We can probably tell by sniffing the first few lines... but how
>>> to do that without using a handle seek to rewind may be
>>> tricky (desirable to support parsing streams, e.g. stdin).
>>
>> Ah yes. We have a rewindable file seek object in Bio.File, don't we
>> :)? I'll have to play around with some real datasets first, I think.
>
> Yes, the UndoHandle in Bio.File might be the best solution
> here for auto-detection. But two explicit formats is probably better.
>
>> The other thing we should take into account is the Xinclude tag. Would
>> we want to make it possible to query *either* the single query XML
>> results or the master Xinclude document (point 2 of the proposed
>> change)? Or should we restrict our parser only to the single query
>> files?
>
> I think single files is a reasonable restriction... assuming BLAST
> will still have the option of producing a big multi-query XML?
> Probably we should ask the NCBI about that...

In a way, the Xinclude file is the file containing multi-query XML. I
have a feeling that if Xinclude is proposed, producing multi-output
BLAST XML files will not be an option anymore (otherwise it seems
redundant). But yes, NCBI should have more info about this.
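
If we ever did want to read the master document, the standard library
can at least expand XInclude references. A minimal sketch, assuming
nothing about the real NCBI layout: the element names (Master, Result,
QueryId) and the file names are placeholders I made up, and the files
live in a dict so the example is self-contained.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Placeholder master document; only the xi:include mechanism is real.
MASTER = """<Master xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="query_1.xml"/>
  <xi:include href="query_2.xml"/>
</Master>"""

# Stand-in for the per-query files on disk.
FILES = {
    "query_1.xml": "<Result><QueryId>q1</QueryId></Result>",
    "query_2.xml": "<Result><QueryId>q2</QueryId></Result>",
}


def loader(href, parse, encoding=None):
    """Resolve an xi:include href from the in-memory FILES dict."""
    if parse != "xml":
        raise ValueError("only XML includes are handled in this sketch")
    return ET.fromstring(FILES[href])


root = ET.fromstring(MASTER)
# Replaces every xi:include element in place with the loaded element.
ElementInclude.include(root, loader)

query_ids = [result.findtext("QueryId") for result in root.iter("Result")]
```

Note this loads every included file into memory at once, which is
another argument for handling the single-query files individually.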

> I would hope the Bio.SearchIO.index_db(...) approach could
> be used on a collection of little XML files, one for each query.
>
>>>> The other option is to introduce a new format name (e.g.
>>>> 'blast-xml2'), which makes the user responsible for knowing which
>>>> BLAST XML he/she is parsing. It feels more explicit this way, so I am
>>>> leaning towards this option, despite 'blast-xml2' not sounding very
>>>> nice to me ;).
>>>>
>>>> Any other thoughts?
>>>>
>>>> Best,
>>>> Bow
>>>
>>> I agree for the SearchIO interface, two format names makes
>>> sense - unless there is a neat way to auto-detect this on input.
>>>
>>> Using "blast-xml2" would work, or maybe something like
>>> "blast-xml-2014" (too long?).
>>>
>>> We could even go for "blast-xml-old" and "blast-xml" perhaps?
>>
>> Hmm... 'blast-xml-old' may make it difficult to adapt for future XML
>> schema changes. How about renaming the current parser to
>> 'blast-xml-legacy', and the new one to just 'blast-xml'?
>
> A possible downside of 'blast-xml-legacy' over 'blast-xml-old'
> is this may be confused with the "legacy" BLAST in C to the
> current BLAST+ in C++ move (which happened well before
> this XML format change).

Hmm. In this case then I am leaning to 'blast-xml2', I think. It's the
shortest and most future-proof (subsequent changes to the XML format
could be denoted as 'blast-xml3'). But it does make it slightly
inconsistent with the names we have for HMMER (i.e. 'hmmer2-text' is
for HMMER version 2 text output, 'hmmer3-text' is for HMMER version 3
text output).
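
On the auto-detection idea quoted above: sniffing without a seek is
doable by reading a small chunk and then pushing it back in front of
the stream, which is roughly the job Bio.File.UndoHandle would do for
us. Below is a standard-library-only sketch, not a proposal for the
actual implementation. PushbackReader and sniff_blast_xml are
hypothetical names, and the "<BlastXML2" marker for the new format is
an assumption; "<BlastOutput>" is the root element of the current
format.

```python
import io
import xml.etree.ElementTree as ET

# (format name, byte marker) pairs. The second marker is an assumption.
MARKERS = [
    ("blast-xml", b"<BlastOutput>"),
    ("blast-xml2", b"<BlastXML2"),
]


class PushbackReader(object):
    """Serve already-consumed bytes first, then the rest of the stream."""

    def __init__(self, head, rest):
        self._head = head
        self._rest = rest

    def read(self, size=-1):
        if size is None or size < 0:
            data, self._head = self._head + self._rest.read(), b""
            return data
        data, self._head = self._head[:size], self._head[size:]
        if len(data) < size:
            data += self._rest.read(size - len(data))
        return data


def sniff_blast_xml(stream, markers=MARKERS, size=2048):
    """Return (format name or None, readable object) without seeking."""
    head = stream.read(size)
    fmt = None
    for name, marker in markers:
        if marker in head:
            fmt = name
            break
    return fmt, PushbackReader(head, stream)


# Usage with an in-memory stand-in for a binary stdin.
legacy = io.BytesIO(b'<?xml version="1.0"?>\n<BlastOutput>\n</BlastOutput>\n')
fmt, handle = sniff_blast_xml(legacy)
root = ET.parse(handle).getroot()
```

The weak point is the chunk size: if the marker sits beyond the first
`size` bytes the sniffer returns None, so explicit format names remain
the safer interface.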

Cheers,
Bow