[Biojava-l] Remote Locations
Thomas Down
td2@sanger.ac.uk
Tue, 6 Feb 2001 11:52:42 +0000
On Mon, Feb 05, 2001 at 02:36:58PM -0500, Cox, Greg wrote:
> I plugged some new data into the genbank and embl parsers, and there's a
> slight problem. A location like "join(L41624.1:2858..5660,1..419)" is valid
> and refers to a different sequence, L41624. I've coded up a new location
> type, RemoteLocation to handle this case, but I want some feedback before
> committing it.
This is a really horrible issue... It comes from the fact that
99% of the time we want to deal with EMBL/GENBANK/whatever
as simple files, but in reality you need to look at the database
as a whole.
> I've attached my code, but the big problem I see is that
> RemoteLocation implements Location, and contains a Location. I've dealt
> with this recursive inheritance before and not enjoyed the experience. The
> other option, inheriting from a concrete location, begs the question of
> which one.
I'm afraid I'm with Matthew on this one. BioJava Locations represent
sets of points within some coordinate system. EMBL locations,
which can include joins between two separate coordinate systems,
are a much more complicated case -- Features feel like a far more
appropriate place to keep this semantically rich information.
The `nice' way to handle this case is to assemble all the sequences
involved into a single coordinate system, and build features there.
As an example, I've been working on a BioJava bridge for the Ensembl
database. In their gene model, exons are always stored in the
coordinate system of the working-draft raw contigs. You then get
transcripts which are simply sets of these exons. In BioJava,
we try to create Transcript features on the raw contigs whenever
possible, but if a transcript spans two or more of these contigs
we create a feature on the assembled sequence instead. It's been
a bit awkward to code efficiently, but does work very cleanly and
seems to be behaving itself in practice.
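The placement rule described above might be sketched roughly as follows. This is not the actual Ensembl bridge code: the class, the "assembly" identifier, and the offset map are all hypothetical names made up for illustration, and the sketch assumes every contig sits in forward orientation at a known offset within the assembly.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not part of BioJava or the Ensembl bridge. */
final class TranscriptPlacementSketch {
    /** Assumed identifier for the assembled sequence. */
    static final String ASSEMBLY = "assembly";

    /** An exon in the coordinate system of one raw contig. */
    static final class Exon {
        final String contigId; final int start; final int end;
        Exon(String contigId, int start, int end) {
            this.contigId = contigId; this.start = start; this.end = end;
        }
    }

    /** Where the transcript feature ends up, with its blocks as {start, end} pairs. */
    static final class Placement {
        final String sequenceId; final List<int[]> blocks;
        Placement(String sequenceId, List<int[]> blocks) {
            this.sequenceId = sequenceId; this.blocks = blocks;
        }
    }

    /**
     * Keep the transcript on its contig when all exons share one contig;
     * otherwise shift every exon into assembly coordinates.
     * contigOffsets maps contig id to the amount added to a contig
     * coordinate to get an assembly coordinate (an assumption of this sketch).
     */
    static Placement place(List<Exon> exons, Map<String, Integer> contigOffsets) {
        if (exons.isEmpty()) {
            throw new IllegalArgumentException("A transcript needs at least one exon");
        }
        String first = exons.get(0).contigId;
        boolean singleContig = true;
        for (Exon e : exons) {
            if (!e.contigId.equals(first)) { singleContig = false; }
        }
        List<int[]> blocks = new ArrayList<int[]>();
        for (Exon e : exons) {
            int shift = 0;
            if (!singleContig) {
                Integer offset = contigOffsets.get(e.contigId);
                if (offset == null) {
                    throw new IllegalArgumentException("No offset for contig " + e.contigId);
                }
                shift = offset.intValue();
            }
            blocks.add(new int[] { e.start + shift, e.end + shift });
        }
        return new Placement(singleContig ? first : ASSEMBLY, blocks);
    }

    public static void main(String[] args) {
        Map<String, Integer> offsets = new HashMap<String, Integer>();
        offsets.put("c1", 0);
        offsets.put("c2", 1000);
        List<Exon> exons = new ArrayList<Exon>();
        exons.add(new Exon("c1", 100, 200));
        exons.add(new Exon("c2", 10, 20));
        System.out.println(place(exons, offsets).sequenceId);
    }
}
```

A real implementation would also have to deal with contig orientation and overlaps between contigs, which this sketch ignores.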
Below I've suggested a possible roadmap for dealing with this issue.
How does this fit with your requirements?
For 1.1:
  - Add a boolean property on EmblProcessor (and GenbankProcessor) which
    defines the behaviour on seeing a remote location. The
    options are:
      + Throw an exception, like we do at the moment (but
        hopefully with a rather clearer message).
      + Parse the location entry, including remote parts.
        Construct a BioJava location covering all the local
        parts, then add an Annotation bundle property to the
        feature giving the full EMBL (Genbank) location.
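To make the two options concrete, here is a minimal sketch of the parsing step. It is an illustration only: RemoteLocationSketch, Span, Parsed and the allowRemote flag are hypothetical names, not existing BioJava classes or EmblProcessor properties, and the pattern only understands plain "start..end" spans with an optional "ACCESSION.VERSION:" prefix (no complement(), no fuzzy ends).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch, not part of BioJava. */
final class RemoteLocationSketch {
    /** One span; accession is null when the span lies on the entry being parsed. */
    static final class Span {
        final String accession; final int start; final int end;
        Span(String accession, int start, int end) {
            this.accession = accession; this.start = start; this.end = end;
        }
        boolean isRemote() { return accession != null; }
    }

    /** Local spans for building the location, plus the text to keep as an annotation. */
    static final class Parsed {
        final List<Span> local = new ArrayList<Span>();
        final List<Span> remote = new ArrayList<Span>();
        final String fullLocation;
        Parsed(String fullLocation) { this.fullLocation = fullLocation; }
    }

    // Optional "ACCESSION.VERSION:" prefix, then "start..end".
    private static final Pattern SPAN = Pattern.compile(
        "(?:([A-Za-z0-9_]+(?:\\.\\d+)?):)?(\\d+)\\.\\.(\\d+)");

    /**
     * allowRemote == false: throw on the first remote span (the strict option).
     * allowRemote == true: collect local and remote spans separately and
     * keep the original location text so it can be attached to the feature.
     */
    static Parsed parse(String text, boolean allowRemote) {
        Parsed result = new Parsed(text);
        Matcher m = SPAN.matcher(text);
        while (m.find()) {
            Span span = new Span(m.group(1),
                                 Integer.parseInt(m.group(2)),
                                 Integer.parseInt(m.group(3)));
            if (!span.isRemote()) {
                result.local.add(span);
            } else if (allowRemote) {
                result.remote.add(span);
            } else {
                throw new IllegalArgumentException(
                    "Location '" + text + "' refers to remote entry " + span.accession);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Parsed p = parse("join(L41624.1:2858..5660,1..419)", true);
        System.out.println(p.local.size() + " local, " + p.remote.size() + " remote");
    }
}
```

In the tolerant mode, the caller would build the feature's location from the local spans only and store fullLocation as a property in the feature's annotation bundle.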
For early 1.2 development:
  - Write special SequenceDB implementations for EMBL and GENBANK,
    which offer all the single-entry sequences, but which can also
    construct assemblies when we need to represent remote locations.
    This should also make these databases really usable resources
    in BioJava. There should be a simple interface (system properties?)
    for defining where the data comes from -- we should be able to
    support local files, web interfaces (SRS?), the EBI CORBA service,
    and probably some others.
    It should be possible to hide a lot of this behind naming
    and directory services.
How does this sound?
Thomas.