[Biopython-dev] [Bug 3069] More robust feature parser for GenBank/EMBL records
bugzilla-daemon at portal.open-bio.org
Wed May 12 23:41:03 UTC 2010
http://bugzilla.open-bio.org/show_bug.cgi?id=3069
------- Comment #11 from laserson at mit.edu 2010-05-12 19:41 EST -------
Hi Peter,
Sorry for my short hiatus; see my responses below.
(In reply to comment #10)
> Could you retest as "embl" format with the trunk? I would expect some warnings
> from these over indented features in IMGT, and we can certainly remove the
> warning if we decide not to introduce a separate IMGT format variant.
I still get the LocationParserErrors for many records.
Also note that the SeqIO.index function doesn't treat the IMGT headers
correctly, so it's not possible to access any of the records from the index it
creates (this was also addressed in my patch where I subclassed an independent
IMGT parser).
>
> http://github.com/biopython/biopython/commit/e6ba962dd60fe585baa1237445d33f67d47dd57f
>
> This change takes a slightly different approach from your work on GitHub, but
> is quite similar to your two-line patch. It should still work with
> another odd form:
>
> FH   Key             Location/Qualifiers
> FT   L-V-D-J-C-SEQUEN1..1151
> FT                   /db_xref="taxon:32630"
> FT                   /organism="synthetic construct"
> FT   5'UTR           1..37
> ...
I still couldn't get the current master branch 'embl' format to work. But
hardcoding the alternate indentation did work, even in the cases where the
feature key is right up against the location qualifier.
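As a rough sketch of why a hardcoded column copes with a key that runs straight into the location (this is not the Biopython scanner code; the function name `split_feature_line` and the default offset are assumptions for illustration), the line can be split by position rather than by whitespace:

```python
def split_feature_line(line, location_column=21):
    """Split an 'FT' line into (key, location) at a fixed column.

    location_column is the 0-based offset where the location starts.
    21 corresponds to the standard EMBL layout; the value to use for
    IMGT files is an assumption the caller has to supply.
    """
    if not line.startswith("FT"):
        raise ValueError("not a feature table line: %r" % line)
    # Everything between the 'FT' tag and the location column is the key.
    key = line[2:location_column].strip()
    location = line[location_column:].strip()
    return key, location


# Key and location touch, but the column split still separates them.
print(split_feature_line("FT   L-V-D-J-C-SEQUEN1..1151"))
```

A qualifier continuation line comes back with an empty key, so the caller can tell it apart from the start of a new feature.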
>
> In the above example (generated by Biopython itself), the strict EMBL column
> limits have been obeyed but the feature key has been truncated to just
> L-V-D-J-C-SEQUEN rather than L-V-D-J-C-SEQUENCE. This is a related query -
> when asked to output such a feature as EMBL or GenBank format, should we raise
> an exception here? We could add a warning instead, and either leave the code
> as is, or output this:
>
> FH   Key             Location/Qualifiers
> FT   L-V-D-J-C-SEQUE 1..1151
> FT                   /db_xref="taxon:32630"
> FT                   /organism="synthetic construct"
> FT   5'UTR           1..37
> ...
>
I think we should probably output all IMGT records using the increased
indentation. This way there will be no ambiguity and no information loss. If
you want to manually "convert" to standard EMBL format, I think the truncation
makes sense as you proposed it, and we could issue a warning about lost
information.
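To make the truncate-and-warn option concrete, here is a minimal sketch. It is not the existing writer code; `format_feature_line` and its `key_width` parameter are hypothetical, with 15 characters assumed for strict EMBL keys:

```python
import warnings


def format_feature_line(key, location, key_width=15):
    """Return one 'FT' line, truncating an over-long key with a warning."""
    if len(key) > key_width:
        truncated = key[:key_width]
        warnings.warn(
            "Feature key %r truncated to %r; information was lost"
            % (key, truncated)
        )
        key = truncated
    # Pad the key, then leave one blank column before the location.
    return "FT   " + key.ljust(key_width) + " " + location


print(format_feature_line("L-V-D-J-C-SEQUENCE", "1..1151"))
```

Passing a wider `key_width` keeps the full key and simply moves the location to the right, which is the no-information-loss behaviour suggested for IMGT output.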
> > > Thinking ahead would you also want to be able to write out IMGT variant
> > > EMBL files?
> >
> > I personally don't need this functionality, but I am willing to write it to
> > complement the IMGT parser that I wrote.
>
> If we go down the route of formalising IMGT as an EMBL variant with a different
> feature indent, it should just be a trivial subclass of the existing EMBL
> writer object but with the indentation constant changed.
>
Agreed.
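The shape of such a subclass might look like the toy below. These are stand-in classes, not the real Biopython writer; the names `SketchEmblWriter` and `SketchImgtWriter` and the IMGT indent of 25 are assumptions for illustration only:

```python
class SketchEmblWriter:
    """Toy writer: every column is derived from one class constant."""

    QUALIFIER_INDENT = 21  # 0-based column of locations and qualifiers

    def feature_lines(self, key, location, qualifiers):
        # 'FT' plus three spaces, then the key, then one blank column.
        key_width = self.QUALIFIER_INDENT - 6
        # Keys longer than key_width are not handled in this sketch.
        lines = ["FT   " + key.ljust(key_width) + " " + location]
        for name, value in qualifiers.items():
            indent = " " * (self.QUALIFIER_INDENT - 2)
            lines.append('FT%s/%s="%s"' % (indent, name, value))
        return lines


class SketchImgtWriter(SketchEmblWriter):
    """Same logic, wider feature table (the indent value is assumed)."""

    QUALIFIER_INDENT = 25


for line in SketchImgtWriter().feature_lines(
    "L-V-D-J-C-SEQUENCE", "1..1151", {"organism": "synthetic construct"}
):
    print(line)
```

Because nothing else in the base class hardcodes a column, the variant needs no method overrides.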
> Note there are other problems in the IMGT data, including locations like
> "1..428>" and "<1..328>" where the greater than should be BEFORE the location
> (but we could probably cope with this all the same), and just "1." where half
> the location is missing (which we can't really do much with other than treat
> it as simply "1" instead?).
I have already notified IMGT regarding the ">" problem, though it seems they
will be slow to change it. It's a very simple fix to the flatfile, and I
did it manually with regular expressions. My preference is that we do NOT
support the backwards notation, as it's clearly wrong. We'll have them fix it.
In the meantime, I can post my Python script that corrects it somewhere
(maybe as a gist on GitHub) and we can just point people to it in a warning if
they are using the IMGT parser.
Regarding the 1. problem, I have not yet told the IMGT people, but I will do so
shortly.
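For reference, a regular-expression fix for the misplaced ">" could look roughly like this. It is an illustrative sketch, not the script mentioned above; the function name `fix_trailing_gt` is hypothetical, and it is meant to be applied to location strings only, not to whole records:

```python
import re

# An end position written as '..428>' instead of '..>428'.
_TRAILING_GT = re.compile(r"\.\.(\d+)>")


def fix_trailing_gt(location):
    """Move a '>' that follows the end position to just before it."""
    return _TRAILING_GT.sub(r"..>\1", location)


for bad in ("1..428>", "<1..328>", "1..>428"):
    print(bad, "becomes", fix_trailing_gt(bad))
```

Locations that already use the correct order are left untouched, so running it twice is harmless. The "1." case is deliberately not handled here.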