[Biopython-dev] [Bug 3069] More robust feature parser for GenBank/EMBL records

bugzilla-daemon at portal.open-bio.org
Fri Apr 30 15:18:00 UTC 2010


http://bugzilla.open-bio.org/show_bug.cgi?id=3069





------- Comment #8 from laserson at mit.edu  2010-04-30 11:17 EST -------
(In reply to comment #7)

> You're in a much better position to assess this - but could you ask them about
> this anyway? They may at least clarify how they bend the EMBL specification.

I am waiting to hear from them about all the ways their format departs from the
EMBL spec, but I am not confident that even they are sure.  Part of the problem
is that the database was started over 20 years ago, so some older records may
not have been updated properly.

> Do they have a preferred file format (e.g. XML)?

They only have a text file in their "EMBL" format.  See here for all their
download options:
http://imgt.cines.fr/textes/IMGTdownloads.html


> How I would try this would be to write a new scanner subclassing the EMBL
> scanner in Bio/GenBank/Scanner.py (which probably only needs to override the
> feature parsing), and then new functions in Bio/SeqIO/InsdcIO.py to call it
> (matching the GenBank and EMBL functions), and define a new format name
> (maybe "embl-imgt") in the dictionary in Bio/SeqIO/__init__.py and done.

Done.  I will upload the patch shortly.  The code only reads the IMGT info.  It
does not write it.  I can work on that as well, if you think it's prudent that
every readable format should also be writable.
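
For reference, the layout suggested above could be sketched roughly as follows.
This is only an illustration with stand-in names: ToyEmblScanner,
ToyImgtScanner, FORMAT_TO_ITERATOR and the column numbers are assumptions made
for the example, not the code in the patch and not the real Biopython classes.

```python
class ToyEmblScanner:
    """Hypothetical stand-in for an EMBL scanner (not Bio.GenBank.Scanner)."""

    def parse_features(self, lines):
        # Assumed standard EMBL layout: key at column 5, location at column 21.
        return self._split(lines, location_column=21)

    @staticmethod
    def _split(lines, location_column):
        # Keep only FT lines that start a feature (non-blank at column 5).
        return [
            (line[5:location_column].strip(), line[location_column:].strip())
            for line in lines
            if line.startswith("FT") and line[5:6].strip()
        ]


class ToyImgtScanner(ToyEmblScanner):
    """Overrides only the feature parsing; everything else is inherited."""

    def parse_features(self, lines):
        # Assumed IMGT-style layout: longer keys, location at column 25.
        return self._split(lines, location_column=25)


def toy_embl_iterator(lines):
    return ToyEmblScanner().parse_features(lines)


def toy_imgt_iterator(lines):
    return ToyImgtScanner().parse_features(lines)


# Hypothetical format registry, playing the role of the dictionary
# mentioned for Bio/SeqIO/__init__.py.
FORMAT_TO_ITERATOR = {
    "embl": toy_embl_iterator,
    "embl-imgt": toy_imgt_iterator,
}


def parse(lines, fmt):
    """Look up the reader for a format name and run it."""
    return FORMAT_TO_ITERATOR[fmt](lines)
```

The point of the sketch is just that the new format is one subclass plus one
dictionary entry; callers pick it by name.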

> However, if the only out-of-specification thing in the IMGT EMBL files is the
> feature indentation and long feature keys, maybe your original request to make
> the EMBL parser more tolerant is the best route.

I think that would actually be a headache, unless you want to rewrite the EMBL
parser the way that I wrote the IMGT parser.  The only thing that needed
changing was the handling of the header lines: once the parser finds an FH
line, it uses the position of the "Location..." string to determine how far
the qualifiers are indented.
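
To illustrate the idea (a minimal sketch, not the patch itself): the function
names, the default of 21 and the sample record below are assumptions made for
this example.

```python
def location_column(lines, default=21):
    """Return the column where locations/qualifiers start.

    Looks for the first FH line containing "Location" and uses the
    position of that string; falls back to `default` (an assumed
    standard-EMBL value) if no such line is found.
    """
    for line in lines:
        if line.startswith("FH") and "Location" in line:
            return line.index("Location")
    return default


def split_feature_line(line, column):
    """Split an FT line into (key, location) at the detected column."""
    return line[5:column].strip(), line[column:].strip()


# Made-up record with the location pushed out to column 25.
record = [
    "FH   " + "Key".ljust(20) + "Location/Qualifiers",
    "FH",
    "FT   " + "V-REGION".ljust(20) + "1..300",
]

col = location_column(record)
key, location = split_feature_line(record[2], col)
```

Because the indentation is read from the FH line rather than hard-coded, the
same code copes with both the standard and the wider layout.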

> Thinking ahead would you also want to be able to write out IMGT variant EMBL
> files?
> 

I personally don't need this functionality, but I am willing to write it to
complement the IMGT parser that I wrote.

