[Biojava-l] EMBL/Genbank parse/write improvements
Matthew Pocock
mrp@sanger.ac.uk
Tue, 20 Mar 2001 12:25:02 +0000
Keith James wrote:
> Hi all,
>
> I've made a few home-improvements in FeatureTableParser and
> SeqFormatTools (which contains the static method for stringifying
> EMBL/Genbank Locations).
>
> Locations of the form
>
> (123.456)..789
> 123..(456.789)
> (123.345)..(456.678)
>
> (plus combinations like <123..(456.789), (123.456)..>789)
>
> should now be supported by both readSequence and writeSequence.
>
Keith, this is sooo cool. Thanks.
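For anyone following along, here is a minimal sketch of what handling these forms might look like. It is not the code in FeatureTableParser or SeqFormatTools; the class FuzzyRangeSketch and its fields are hypothetical names made up for illustration, and it only covers the range forms listed above. Anything else (a lone fuzzy point, 123^456, a lone <123) is rejected with an exception.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical value class for illustration; not the BioJava Location API. */
final class FuzzyRangeSketch {
    final int startMin, startMax, endMin, endMax;
    final boolean openStart, openEnd;

    // An endpoint is either a fuzzy point "(123.456)" or an exact position,
    // optionally marked unbounded ("<123" at the start, ">789" at the end).
    private static final String START = "(?:\\((\\d+)\\.(\\d+)\\)|(<?)(\\d+))";
    private static final String END = "(?:\\((\\d+)\\.(\\d+)\\)|(>?)(\\d+))";
    private static final Pattern RANGE = Pattern.compile(START + "\\.\\." + END);

    private FuzzyRangeSketch(int startMin, int startMax, int endMin, int endMax,
                             boolean openStart, boolean openEnd) {
        this.startMin = startMin;
        this.startMax = startMax;
        this.endMin = endMin;
        this.endMax = endMax;
        this.openStart = openStart;
        this.openEnd = openEnd;
    }

    /** Parses a range-style location; throws on any form not listed above. */
    static FuzzyRangeSketch parse(String text) {
        Matcher m = RANGE.matcher(text.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unsupported location: " + text);
        }
        boolean fuzzyStart = m.group(1) != null;
        boolean fuzzyEnd = m.group(5) != null;
        int startMin = Integer.parseInt(fuzzyStart ? m.group(1) : m.group(4));
        int startMax = Integer.parseInt(fuzzyStart ? m.group(2) : m.group(4));
        int endMin = Integer.parseInt(fuzzyEnd ? m.group(5) : m.group(8));
        int endMax = Integer.parseInt(fuzzyEnd ? m.group(6) : m.group(8));
        boolean openStart = !fuzzyStart && "<".equals(m.group(3));
        boolean openEnd = !fuzzyEnd && ">".equals(m.group(7));
        return new FuzzyRangeSketch(startMin, startMax, endMin, endMax, openStart, openEnd);
    }

    /** Writes the location back out in feature-table syntax. */
    String format() {
        return point(startMin, startMax, openStart ? "<" : "")
                + ".." + point(endMin, endMax, openEnd ? ">" : "");
    }

    // A point whose min and max coincide is written as an exact position.
    private static String point(int min, int max, String marker) {
        return min == max ? marker + min : "(" + min + "." + max + ")";
    }
}
```

With this sketch, FuzzyRangeSketch.parse("<123..(456.789)").format() gives back the same string, while parse("123^456") throws IllegalArgumentException.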
> Unsupported are fuzzy points of the form (123.456), 'between residue'
> locations like 123^456, remote locations like AL123456:(123...456) and
> unbounded ranges which only have a single point within the entry
> e.g. <123, >123 or <>123.
>
Keith, do you want to write FuzzyPoint, or shall I?
> When these are encountered an Exception is thrown, then caught by some
> code that Greg (I think) put in, resulting in a message to
> System.err, rather than instant flaming death. However, I think the
> Exceptions and sensible (documented) Feature repair/recovery options
> need some work.
>
Yes. This *should* go away once FuzzyPoint is in.
> Locations like <123 are a bit odd, because they are really ranges, but
> exist as points in the entry. So are they best represented as
> FuzzyRange, or FuzzyPoint (along with (123.456))?
>
> There is still a deficiency in the parser, as it makes no attempt
> to interpret feature types e.g. CDS, gene etc. Therefore a gene still
> ends up having its exons represented by a CompoundLocation on one
> strand, rather than a set of sub-features, each with their own strand
> information.
>
So - my take on this is that we slot an extra 'feature interpretation'
layer into the listener pipeline that builds objects from our genomics
package. You can then choose to get CDSs out as compound locations using
the raw pipeline, or the full genomics-compliant model using the modified
one.
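To make the idea concrete, a rough sketch of such a layer follows. None of these names exist in BioJava: FeatureSink, CollectingSink and InterpretingSink are hypothetical stand-ins for the real listener interfaces, and the rule used here (split a multi-block CDS into one "exon" feature per block) is only an assumption chosen to show the shape of the thing, not a proposal for the actual heuristic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical event sink; stands in for whatever listener the parser drives. */
interface FeatureSink {
    // Each block is a two-element array {start, end}; strand is +1 or -1.
    void feature(String type, int strand, List<int[]> blocks);
}

/** Raw end of the pipeline: records every feature exactly as delivered. */
final class CollectingSink implements FeatureSink {
    final List<String> seen = new ArrayList<>();

    @Override
    public void feature(String type, int strand, List<int[]> blocks) {
        StringBuilder sb = new StringBuilder(type).append(strand >= 0 ? "+" : "-");
        for (int[] block : blocks) {
            sb.append(' ').append(block[0]).append("..").append(block[1]);
        }
        seen.add(sb.toString());
    }
}

/** Optional interpretation layer slotted in front of another sink. */
final class InterpretingSink implements FeatureSink {
    private final FeatureSink next;

    InterpretingSink(FeatureSink next) {
        this.next = next;
    }

    @Override
    public void feature(String type, int strand, List<int[]> blocks) {
        if ("CDS".equals(type) && blocks.size() > 1) {
            // Assumed rule for illustration: one sub-feature per block,
            // each carrying its own strand.
            for (int[] block : blocks) {
                next.feature("exon", strand, Collections.singletonList(block));
            }
        } else {
            // Everything else passes through untouched.
            next.feature(type, strand, blocks);
        }
    }
}
```

Pointing the parser at a CollectingSink directly gives the raw view (one CDS with a compound location); wrapping it as new InterpretingSink(new CollectingSink()) gives the interpreted view without touching the parser.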
> In the short term I'm going to add some code to store feature
> information not fully preserved by the parser in the feature's
> annotation bundle. It should therefore be possible to post-process
> Feature(s) with a type (CDS, gene, repeat, exon) specific heuristic,
> rather than burden the parser with decision-making code.
>
> At the moment this stuff may well break with some of the more scary
> EMBL entries.
>
> Keith
Have you run the parser over a complete EMBL database file yet? This is
the acid test.
Again, thanks for all this.
Matthew