[Biojava-l] Emblparsing
Thomas Down
td2@sanger.ac.uk
Tue, 5 Dec 2000 11:04:06 +0000
On Tue, Dec 05, 2000 at 11:54:27AM +0100, Kristina Engdahl wrote:
> Hello everyone!
> I have recently started to work with BioJava with mostly satisfying
> results.
> At the moment I'm trying to parse an Embl flatfile to get all the
> features. It works fine and I get features such as repeat regions and
> misc_features etc. However, I would like to be able to retrieve the
> individual "exons" that are specified in the CDS feature. Like this:
>
> FT   CDS             join(<5642..5793,10804..10976,12496..12656,14136..14266,
> FT                   14403..14532,16852..16987,17821..17959,18068..18122,
> FT                   19456..19570,23623..23753,25885..26053,29102..29240,
> FT                   32621..32738,33595..33771)
Hi...
Our current parsing behaviour (with either the old EmblParser or the
new EmblLikeParser-EmblProcessor combination) builds that feature
table entry into a single BioJava feature. All the information in the
location part of the entry is preserved in the BioJava Location, so
you can retrieve the individual exons using the location's
blockIterator() method. I hope this does what you want.
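To make the idea of "blocks" concrete, here is a minimal stand-alone
sketch in plain Java. It does not use BioJava at all: the class
ExonBlocks and the method parseBlocks are hypothetical names invented
for this illustration, and the parsing is a simple regular expression
over the location string. It only handles start..end spans (with an
optional < or > marker) and ignores complement(), remote references
and single-base positions. With BioJava itself you would instead walk
the feature location's blockIterator(), as described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of BioJava: splits an EMBL join(...)
// location string into its contiguous start..end blocks.
class ExonBlocks {
    // One "start..end" span; a leading '<' or '>' marks a partial end
    // and is skipped here.
    private static final Pattern SPAN =
        Pattern.compile("[<>]?(\\d+)\\.\\.[<>]?(\\d+)");

    // Returns one {min, max} pair per span, in the order they appear.
    static List<int[]> parseBlocks(String location) {
        List<int[]> blocks = new ArrayList<>();
        Matcher m = SPAN.matcher(location);
        while (m.find()) {
            int min = Integer.parseInt(m.group(1));
            int max = Integer.parseInt(m.group(2));
            blocks.add(new int[] { min, max });
        }
        return blocks;
    }

    public static void main(String[] args) {
        // The CDS location from the question, joined onto one line.
        String cds = "join(<5642..5793,10804..10976,12496..12656,14136..14266,"
                   + "14403..14532,16852..16987,17821..17959,18068..18122,"
                   + "19456..19570,23623..23753,25885..26053,29102..29240,"
                   + "32621..32738,33595..33771)";
        int n = 1;
        for (int[] block : parseBlocks(cds)) {
            System.out.println("exon " + n++ + ": " + block[0] + ".." + block[1]);
        }
    }
}
```

Each pair here corresponds to one contiguous block of the location,
which is the same unit that blockIterator() hands back one at a time.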
In the longer term (or as soon as anyone feels like coding it up...):
In the current CVS development tree, we now have a larger set
of feature interfaces, including special interfaces for representing
genes, transcripts, exons, etc. It would be good if, in future,
we could have a more sophisticated EmblProcessor which recognises
genes in EMBL feature tables and builds more appropriate feature
objects. Since the newio changes landed a couple of weeks back,
all the infrastructure is there to allow something like this to
be plugged in, but we don't have any code yet.
Good luck,
Thomas.
--
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
-- Terry Pratchett