[Biopython-dev] Fwd: [Open-bio-l] Proposed BLAST XML Changes

Wed Apr 2 17:54:21 EDT 2014

Dear all,

Regarding the update on querying SearchIO hits with multiple keys,
I've just added some (not so) small changes to the submodule here:
https://github.com/biopython/biopython/pull/307.

The proposed changes affect our current SearchIO core object model, so
I suppose it's better if more than one person takes a look at this
first :).
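
For anyone skimming the thread below, here is a minimal sketch of the
multiple-key idea. It is not taken from the pull request: the class name
MultiKeyHits, its add() method, and the alt_ids parameter are all
hypothetical, and plain strings stand in for Hit objects.

```python
from collections import OrderedDict


class MultiKeyHits:
    """Hypothetical container: a hit is reachable by its primary ID,
    by any of its alternative IDs, or by its numeric index."""

    def __init__(self):
        self._hits = OrderedDict()  # primary ID -> hit, in insertion order
        self._alias = {}            # alternative ID -> primary ID

    def add(self, hit_id, hit, alt_ids=()):
        # Refuse IDs that would make lookups ambiguous.
        if hit_id in self or any(alt in self for alt in alt_ids):
            raise ValueError("duplicate hit ID")
        self._hits[hit_id] = hit
        for alt in alt_ids:
            self._alias[alt] = hit_id

    def __contains__(self, key):
        # Membership checks see primary and alternative IDs alike.
        return key in self._hits or key in self._alias

    def __getitem__(self, key):
        if isinstance(key, int):
            return list(self._hits.values())[key]
        # Map an alternative ID to its primary ID, then look up the hit.
        return self._hits[self._alias.get(key, key)]

    def __len__(self):
        # Aliases do not count as extra hits.
        return len(self._hits)
```

The point of the sketch is that retrieval and membership checking change
together, while the number of hits stays the same regardless of how many
IDs each one carries.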

Cheers,
Bow

On Tue, Mar 18, 2014 at 12:48 PM, Wibowo Arindrarto
<w.arindrarto at gmail.com> wrote:
> On Tue, Mar 18, 2014 at 11:58 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> On Tue, Mar 18, 2014 at 10:33 AM, Wibowo Arindrarto
>> <w.arindrarto at gmail.com> wrote:
>>> On Tue, Mar 18, 2014 at 11:17 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>>> On Tue, Mar 18, 2014 at 9:52 AM, Wibowo Arindrarto
>>>> <w.arindrarto at gmail.com> wrote:
>>>>> Hi Peter, everyone,
>>>>>
>>>>> Thanks for the heads up. If implemented as is, the updates will
>>>>> change our underlying SearchIO model (aside from the blast-xml parser
>>>>> itself), by allowing Hit retrieval using multiple different keys.
>>>>
>>>> Could you clarify what you mean by multiple keys here?
>>>
>>> Currently, we can retrieve hits from a query using its ID, aside from
>>> its numeric index. With their proposed changes to the Hit element
>>> here: ftp://ftp.ncbi.nlm.nih.gov/blast/documents/NEWXML/ProposedBLASTXMLChanges.pdf,
>>> it means that a given Hit can now be annotated with more than one ID.
>>
>> But this happens already in the current output from merged entries
>> in databases like NR - we effectively use the first alternative ID as
>> the hit ID. See for example the nasty &gt; separated entries in
>> the legacy BLAST XML's <Hit_def> tag where only the first ID
>> appears in the <Hit_id> tag:
>>
>> http://blastedbio.blogspot.co.uk/2012/05/blast-tabular-missing-descriptions.html
>>
>> See also the new optional fields in the tabular output which
>> explicitly list all the aliases for the merged record (e.g. sallseqid).
>
> In the BLAST outputs, yes. However, there's no explicit support yet in
> SearchIO for this. Currently we only parse whatever is in <Hit_id> as
> the ID and <Hit_def> as the description. If the <Hit_id> value is
> separated by semicolons / holds more than one ID, the current parser
> does not try to split it into multiple IDs. Instead it takes the whole
> string as the ID.
>
> Also, in the blast tabular format, even though sallseqid is parsed,
> it's merely stored as an attribute of the hit object, not something
> that can be used to retrieve Hits from the QueryResult object.
>
>>> Ideally, this should also be reflected in the QueryResult object: a
>>> hit item should be retrievable using any of the IDs it has.
>>>
>>> This will also affect membership checking on the QueryResult object.
>>
>> This looks like something we should review anyway, regardless
>> of the new BLAST XML format.
>
> Of course :).
>
>>>>> I have a feeling it will be difficult to jam all the new changes into
>>>>> a backwards-compatible parser. One way to make it transparent to users
>>>>> is to use the underlying DTD to do validation before parsing (for the
>>>>> two BLAST DTDs, use the one which the file can be validated against).
>>>>> However, this comes at a price. Since the standard library-bundled
>>>>> elementtree doesn't seem to support validation, we have to use another
>>>>> library (lxml is my choice). This means adding a 3rd party dependency
>>>>> which requires compiling (lxml is also partly written in C).
>>>>
>>>> We can probably tell by sniffing the first few lines... but how
>>>> to do that without using a handle seek to rewind may be
>>>> tricky (desirable to support parsing streams, e.g. stdin).
>>>
>>> Ah yes. We have a rewindable file seek object in Bio.File, don't we
>>> :)? I'll have to play around with some real datasets first, I think.
>>
>> Yes, the UndoHandle in Bio.File might be the best solution
>> here for auto-detection. But two explicit formats is probably better.
>>
>>> The other thing we should take into account is the Xinclude tag. Would
>>> we want to make it possible to query *either* the single query XML
>>> results or the master Xinclude document (point 2 of the proposed
>>> change)? Or should we restrict our parser only to the single query
>>> files?
>>
>> I think single files is a reasonable restriction... assuming BLAST
>> will still have the option of producing a big multi-query XML?
>> Probably we should ask the NCBI about that...
>
> In a way, the Xinclude file is the file containing multi-query XML. I
> have a feeling that if Xinclude is proposed, producing multi-output
> BLAST XML files will not be an option anymore (otherwise it seems
> redundant). But yes, NCBI should have more info about this.
>
>> I would hope the Bio.SearchIO.index_db(...) approach could
>> be used on a collection of little XML files, one for each query.
>>
>>>>> The other option is to introduce a new format name (e.g.
>>>>> 'blast-xml2'), which makes the user responsible for knowing which
>>>>> BLAST XML he/she is parsing. It feels more explicit this way, so I am
>>>>> leaning towards this option, despite 'blast-xml2' not sounding very
>>>>> nice to me ;).
>>>>>
>>>>> Any other thoughts?
>>>>>
>>>>> Best,
>>>>> Bow
>>>>
>>>> I agree for the SearchIO interface, two format names makes
>>>> sense - unless there is a neat way to auto-detect this on input.
>>>>
>>>> Using "blast-xml2" would work, or maybe something like
>>>> "blast-xml-2014" (too long?).
>>>>
>>>> We could even go for "blast-xml-old" and "blast-xml" perhaps?
>>>
>>> Hmm... 'blast-xml-old' may make it difficult to adapt for future XML
>>> schema changes. How about renaming the current parser to
>>> 'blast-xml-legacy', and the new one to just 'blast-xml'?
>>
>> A possible downside of 'blast-xml-legacy' over 'blast-xml-old'
>> is this may be confused with the "legacy" BLAST in C to the
>> current BLAST+ in C++ move (which happened well before
>> this XML format change).
>
> Hmm. In this case then I am leaning towards 'blast-xml2', I think. It's the
> shortest and most future-proof (subsequent changes to the XML format
> could be denoted as 'blast-xml3'). But it does make it slightly
> inconsistent with the names we have for HMMER (i.e. 'hmmer2-text' is
> for HMMER version 2 text output, 'hmmer3-text' is for HMMER version 3
> text output).
>
> Cheers,
> Bow
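
On the auto-detection point raised earlier in the thread (sniffing the
first few lines without a handle seek), a rough stdlib-only sketch of the
idea follows. The function name sniff_blast_xml is hypothetical, and the
marker used for the new format is an assumption for illustration only,
since the proposal was not final. Bio.File.UndoHandle would serve the
same purpose while keeping a file-like interface; this version returns a
plain line iterator instead.

```python
import io
import itertools

LEGACY_MARKER = "<BlastOutput>"  # root element of the current BLAST XML
NEW_MARKER = "<BlastOutput2"     # ASSUMPTION: placeholder root for the new format


def sniff_blast_xml(handle, max_lines=10):
    """Guess the format from the first lines of a stream without seeking.

    Returns (format_name, lines), where lines replays the consumed head
    followed by the rest of the stream, so nothing is lost for the parser.
    """
    head = list(itertools.islice(handle, max_lines))
    text = "".join(head)
    if NEW_MARKER in text:
        fmt = "blast-xml2"
    elif LEGACY_MARKER in text:
        fmt = "blast-xml"
    else:
        fmt = None  # unknown: the caller should ask for an explicit format
    return fmt, itertools.chain(head, handle)


# Works on non-seekable streams such as sys.stdin; StringIO is used here
# only to keep the example self-contained.
stream = io.StringIO('<?xml version="1.0"?>\n<BlastOutput>\n</BlastOutput>\n')
fmt, lines = sniff_blast_xml(stream)
print(fmt)
```

Even with something like this available, the explicit format name still
seems the safer default, with sniffing as an optional convenience.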