[Biojava-l] File parsing in BJ3
Mark Schreiber
markjschreiber at gmail.com
Tue Oct 21 03:16:51 UTC 2008
So if I want to build a BioSQL loader from Genbank then would the
classes (or their wrappers) in the BioSQL Entity package need to
implement Thing? Would Maven have an issue with that or would it just
create a dependency on core? (You can tell I've never used Maven,
right?)
From a design point of view should Thing be an interface or an
Annotation? The reason I ask is that it doesn't define any methods so
it is more of a tag than an interface.
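To make the trade-off concrete, here is a minimal sketch of the two options side by side. Thing is shown with no methods, as described above; the annotation name ThingTag, the two entry classes and MarkerDemo are invented purely for illustration and are not part of BJ3.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Option A: marker interface with no methods, as described in the thread.
interface Thing {}

// Option B: a hypothetical annotation equivalent (the name ThingTag is made up).
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface ThingTag {}

// Hypothetical classes, not the real BioSQL entity package.
class InterfaceTaggedEntry implements Thing {}

@ThingTag
class AnnotationTaggedEntry {}

class MarkerDemo {
    // With the interface, the compiler rejects anything that is not a Thing.
    static <T extends Thing> String describe(T thing) {
        return thing.getClass().getSimpleName();
    }

    // With the annotation, the tag can only be checked reflectively at run time.
    static boolean isTagged(Object o) {
        return o.getClass().isAnnotationPresent(ThingTag.class);
    }

    public static void main(String[] args) {
        System.out.println(describe(new InterfaceTaggedEntry()));
        System.out.println(isTagged(new AnnotationTaggedEntry()));
        System.out.println(isTagged(new InterfaceTaggedEntry()));
    }
}
```

The practical difference is that a marker interface can appear in generic bounds and method signatures, while an annotation cannot, so an interface is probably the better fit if the framework wants to type its parameters on Thing.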
Anyway, my understanding is that I would use a Genbank parser (or
write one), write an EntityReceiver interface (probably more than one
given the number of entities in BioSQL), and implement an EntityBuilder
(again possibly more than one) that implements EntityReceiver and
builds Entity beans from the messages it receives. In this case I
probably wouldn't provide a writer, as JPA would be writing the beans
to the database. Would this be how you imagine it?
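As a rough sketch of what I mean, assuming a single entity and a toy reader: every name below (BioentryBean, BioentryReceiver, BioentryBuilder, ToyGenbankReader) is hypothetical, the accession values are made up, and the reader only recognises the ACCESSION and DEFINITION keywords and the "//" record terminator of the Genbank flat-file format.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a BioSQL entity bean; a real one would carry JPA mappings.
class BioentryBean {
    String accession;
    String description;
}

// Hypothetical receiver: one callback per piece of data the parser can report.
interface BioentryReceiver {
    void startEntry();
    void accession(String accession);
    void description(String description);
    void endEntry();
}

// Builder that assembles beans from the messages it receives.
class BioentryBuilder implements BioentryReceiver {
    private final List<BioentryBean> finished = new ArrayList<>();
    private BioentryBean current;

    public void startEntry() { current = new BioentryBean(); }
    public void accession(String accession) { current.accession = accession; }
    public void description(String description) { current.description = description; }
    // A JPA-backed loader would persist the bean at this point instead of collecting it.
    public void endEntry() { finished.add(current); current = null; }

    List<BioentryBean> beans() { return finished; }
}

class ToyGenbankReader {
    // Toy reader: pushes data to the receiver line by line and ignores everything else.
    static void read(String text, BioentryReceiver receiver) {
        boolean open = false;
        for (String line : text.split("\n")) {
            if (line.startsWith("//")) {
                if (open) { receiver.endEntry(); open = false; }
                continue;
            }
            if (!open && !line.trim().isEmpty()) { receiver.startEntry(); open = true; }
            if (line.startsWith("ACCESSION")) {
                receiver.accession(line.substring("ACCESSION".length()).trim());
            } else if (line.startsWith("DEFINITION")) {
                receiver.description(line.substring("DEFINITION".length()).trim());
            }
        }
        if (open) receiver.endEntry();
    }

    public static void main(String[] args) {
        BioentryBuilder builder = new BioentryBuilder();
        read("ACCESSION   X00001\nDEFINITION  toy record\n//\n", builder);
        System.out.println(builder.beans().size() + " bean(s) built");
    }
}
```

No writer appears here because, in this scenario, persistence would be handled by JPA once the beans exist.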
- Mark
On Tue, Oct 21, 2008 at 1:52 AM, Richard Holland
<holland at eaglegenomics.com> wrote:
> (From now on I will only be posting these development messages to
> biojava-dev, which is the intended purpose of that list. Those of you who
> wish to keep track of things but are currently only subscribed to biojava-l
> should also subscribe to biojava-dev in order to keep up to date.)
>
> As promised, I've committed a new package in the biojava-core module that
> should help understand how to do file parsing and conversion and writing in
> the new BJ3 modules. Here's an example of how to use it to write a Genbank
> parser (note no parsers actually exist yet!):
>
> 1. Design yourself a Genbank class which implements the interface Thing and
> can fully represent all the data that might possibly occur inside a Genbank
> file.
>
> 2. Write an interface called GenbankReceiver, which extends ThingReceiver
> and defines all the methods you might need in order to construct a Genbank
> object in an asynchronous fashion.
>
> 3. Write a GenbankBuilder class which implements GenbankReceiver and
> ThingBuilder. Its job is to receive data via method calls, use that data to
> construct a Genbank object, then provide that object on demand.
>
> 4. Write a GenbankWriter class which implements GenbankReceiver and
> ThingWriter. Its job is similar to GenbankBuilder, but instead of
> constructing new Genbank objects, it writes Genbank records to file that
> reflect the data it receives.
>
> 5. Write a GenbankReader class which implements ThingReader. It can read
> GenbankFiles and output the data to the methods of the ThingReceiver
> provided to it, which in this case could be anything which implements the
> interface GenbankReceiver.
>
> 6. Write a GenbankEmitter class which implements ThingEmitter. It takes a
> Genbank object and will fire off data from it to the provided ThingReceiver
> (a GenbankReceiver instance) as if the Genbank object was being read from a
> file or some other source.
>
> That's it! OK so it's a minimum of 6 classes instead of the original 1 or 2,
> but the additional steps are necessary for flexibility in converting between
> formats.
>
> Now to use it (you'll probably want a GenbankTools class to wrap these steps
> up for user-friendliness, including various options for opening files,
> etc.):
>
> 1. To read a file - instantiate ThingParser with your GenbankReader as the
> reader, and GenbankBuilder as the receiver. Use the iterator methods on
> ThingParser to get the objects out.
>
> 2. To write a file - instantiate ThingParser with a GenbankEmitter wrapping
> your Genbank object, and a GenbankWriter as the receiver. Use the parseAll()
> method on the ThingParser to dump the whole lot to your chosen output.
>
> The clever bit comes when you want to convert between files. Imagine you've
> done all the above for Genbank, and you've also done it for FASTA. How to
> convert between them? What you need to do is this:
>
> 1. Implement all the classes for both Genbank and FASTA.
>
> 2. Write a GenbankFASTAConverter class that implements ThingConverter<FASTA>
> and GenbankReceiver, and will internally convert the data received and pass
> it on out to the receiver provided, which will be a FASTAReceiver instance.
>
> 3. Write a FASTAGenbankConverter class that operates in exactly the opposite
> way, implementing ThingConverter<Genbank> and FASTAReceiver.
>
> Then to convert you use ThingParser again:
>
> 1. From FASTA file to Genbank object: Instantiate ThingParser with a
> FASTAReader reader, a GenbankBuilder receiver, and add a
> FASTAGenbankConverter instance to the converter chain. Use the iterator to
> get your Genbank objects out of your FASTA file.
>
> 2. From FASTA file to Genbank file: Same as option 1, but provide a
> GenbankWriter instead and use parseAll() instead of the iterator methods.
>
> 3. From FASTA object to Genbank object: Same as option 1, but provide a
> FASTAEmitter wrapping your FASTA object as the reader instead.
>
> 4. From FASTA object to Genbank file: Same as option 1, but swap both the
> reader and the receiver as per options 2 and 3.
>
> 5/6/7/8. From Genbank * to FASTA * - same as 1,2,3,4 but swap all mentions
> of FASTA and Genbank, and use GenbankFASTAConverter instead.
>
> One last and very important feature of this approach is that if you discover
> that nobody has written the appropriate converter for your chosen pair of
> formats A and C, but converters do exist to map A to some other format B and
> that other format B on to C, then you can just put the two converters A-B and
> B-C into the ThingParser chain and it'll work perfectly.
>
> Enjoy!
>
> cheers,
> Richard
>
> --
> Richard Holland, BSc MBCS
> Finance Director, Eagle Genomics Ltd
> M: +44 7500 438846 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
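For reference, the reader, converter and builder arrangement described in the quoted message can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the types below are cut-down stand-ins written for this example and do not use the actual ThingParser, ThingReader or ThingConverter interfaces from biojava-core, whose signatures are not shown in the thread. The mapping of the FASTA header to an accession and a definition is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in receivers; method names are invented for this sketch.
interface FastaReceiver {
    void header(String header);
    void sequence(String residues);
    void endRecord();
}

interface GenbankReceiver {
    void accession(String accession);
    void definition(String definition);
    void sequence(String residues);
    void endRecord();
}

// Converter: receives FASTA messages and forwards Genbank messages to the next receiver.
class FastaToGenbankConverter implements FastaReceiver {
    private final GenbankReceiver next;
    FastaToGenbankConverter(GenbankReceiver next) { this.next = next; }

    public void header(String header) {
        // Assumption: first token is the accession, the remainder is the definition.
        String[] parts = header.split("\\s+", 2);
        next.accession(parts[0]);
        next.definition(parts.length > 1 ? parts[1] : "");
    }
    public void sequence(String residues) { next.sequence(residues); }
    public void endRecord() { next.endRecord(); }
}

class GenbankRecord {
    String accession;
    String definition;
    final StringBuilder sequence = new StringBuilder();
}

// Builder: turns the messages it receives into objects, available on demand.
class GenbankBuilder implements GenbankReceiver {
    private final List<GenbankRecord> records = new ArrayList<>();
    private GenbankRecord current = new GenbankRecord();

    public void accession(String accession) { current.accession = accession; }
    public void definition(String definition) { current.definition = definition; }
    public void sequence(String residues) { current.sequence.append(residues); }
    public void endRecord() { records.add(current); current = new GenbankRecord(); }

    List<GenbankRecord> records() { return records; }
}

class ToyFastaReader {
    // Reader: knows the file format, but not what happens to the data downstream.
    static void read(String text, FastaReceiver receiver) {
        boolean open = false;
        for (String line : text.split("\n")) {
            if (line.startsWith(">")) {
                if (open) receiver.endRecord();
                receiver.header(line.substring(1).trim());
                open = true;
            } else if (open && !line.trim().isEmpty()) {
                receiver.sequence(line.trim());
            }
        }
        if (open) receiver.endRecord();
    }

    public static void main(String[] args) {
        GenbankBuilder builder = new GenbankBuilder();
        read(">seq1 toy example\nACGT\n", new FastaToGenbankConverter(builder));
        System.out.println(builder.records().get(0).accession);
    }
}
```

Because each converter is itself a receiver that forwards to another receiver, chaining an A-B converter into a B-C converter follows the same pattern: the second converter is simply passed as the "next" receiver of the first.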