[Biojava-dev] [Biojava-l] File parsing in BJ3
Mark Schreiber
markjschreiber at gmail.com
Tue Oct 21 11:24:13 UTC 2008
Depending on what you want them for, isMachineGenerated() and
isManuallyCurated() would possibly be better as annotations
(@MachineGenerated, @ManuallyCurated). This is true metadata.
If Java had had annotations in version 1.1, Serializable would
probably also be an annotation. I would agree with the idea that
ThingBuilder etc. should be typed on extends Serializable.
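
A minimal sketch of the annotation idea, offered as an illustration rather than as BJ3 code. The annotation names come from this thread; SerializableBuilder, PredictedGene and AnnotationSketch are hypothetical names invented for the example. It shows the metadata being read reflectively at runtime, and a builder bounded directly by Serializable.

```java
import java.io.Serializable;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical metadata annotations; RUNTIME retention makes them visible to reflection.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface MachineGenerated {}

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface ManuallyCurated {}

// Hypothetical builder contract bounded by Serializable instead of a Thing marker.
interface SerializableBuilder<T extends Serializable> {
    T build();
}

// Illustrative record type, tagged as machine generated.
@MachineGenerated
class PredictedGene implements Serializable {
    private static final long serialVersionUID = 1L;
    final String name;
    PredictedGene(String name) { this.name = name; }
}

class AnnotationSketch {
    // The tag is looked up at runtime; the compiler cannot require its presence.
    static boolean isMachineGenerated(Object o) {
        return o.getClass().isAnnotationPresent(MachineGenerated.class);
    }

    static boolean isManuallyCurated(Object o) {
        return o.getClass().isAnnotationPresent(ManuallyCurated.class);
    }

    public static void main(String[] args) {
        SerializableBuilder<PredictedGene> builder = () -> new PredictedGene("geneA");
        PredictedGene gene = builder.build();
        System.out.println(isMachineGenerated(gene)); // prints true
        System.out.println(isManuallyCurated(gene));  // prints false
    }
}
```

The trade-off is that an annotation cannot appear in a generic bound, so a builder cannot be restricted at compile time to types carrying a given tag.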
- Mark
On Tue, Oct 21, 2008 at 7:14 PM, Richard Holland
<dicknetherlands at gmail.com> wrote:
> For now, yes it's empty. But I can envisage situations where it might be
> nice to have Thing implement some common methods (e.g. isMachineGenerated(),
> isManuallyCurated(), etc.). I'd rather have it there now to be a placeholder
> for future expansion, than have to re-engineer everything should we identify
> a need for common functions in future.
>
> You'll see that Thing already extends Serializable, implying that all Things
> must be able to persist to an object backing store. Serializable itself is
> also an empty interface!
>
> Also I like the idea of having Thing, not Object, as a kind of marker of
> intention. To me it makes it clearer when reading code to avoid Object
> wherever possible. Thing may not be any more clever than Object, but it
> immediately declares an intention when reading code as to what kind of
> Object should be expected.
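
A minimal sketch of the marker-interface approach described above, assuming Thing is an empty interface extending Serializable as stated. The getThing() method, the Fasta class and MarkerSketch are hypothetical names for illustration only. It shows the builder bound declaring intent at compile time, and that any Thing can be serialized.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Assumed shape: an empty marker interface, as described in the thread.
interface Thing extends Serializable {}

// Assumed shape of a builder bound to Thing rather than Object.
interface ThingBuilder<T extends Thing> {
    T getThing();
}

// Illustrative record type.
class Fasta implements Thing {
    private static final long serialVersionUID = 1L;
    final String header;
    final String sequence;
    Fasta(String header, String sequence) {
        this.header = header;
        this.sequence = sequence;
    }
}

class MarkerSketch {
    // Accepts any Thing; serialization is available because Thing extends Serializable.
    static <T extends Thing> T roundTrip(T thing) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(thing);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                @SuppressWarnings("unchecked")
                T copy = (T) in.readObject();
                return copy;
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }

    public static void main(String[] args) {
        ThingBuilder<Fasta> builder = () -> new Fasta("seq1", "ACGT");
        // ThingBuilder<String> would not compile, because String is not a Thing.
        Fasta copy = roundTrip(builder.getThing());
        System.out.println(copy.header + " " + copy.sequence); // prints seq1 ACGT
    }
}
```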
>
>
> 2008/10/21 Mark Schreiber <markjschreiber at gmail.com>
>>
>> Is there any need for Thing at all? Can't a builder be typed to produce
>> something that extends Object?
>>
>> If Thing provides no behaviour contract or meta-information then why
>> does it exist?
>>
>> - Mark
>>
>> On Tue, Oct 21, 2008 at 4:49 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
>> > Depends on what you want to program. If you want to have a collection of
>> > objects which are Things & perform a common action on them then
>> > annotations are not the way forward.
>> >
>> > If you want to have some kind of meta-programming occurring & need a
>> > class to be multiple things then annotations are right. There is
>> > currently no way to enforce compile time dependencies on annotations &
>> > my thinking is that this is right. Annotations should be meta data or
>> > provide a way to alter a class in a non-invasive way (think Web Service
>> > annotations creating WS Servers & Clients without any alteration of the
>> > class).
>> >
>> > Andy
>> >
>> > Richard Holland wrote:
>> >> Spot on.
>> >>
>> >> Annotation/interface.... I think Annotation is probably better as you
>> >> suggest, but I'd have to look into that. Not sure how it works with
>> >> collections and generics. If it does turn out to be a better bet, I'll
>> >> change it over.
>> >>
>> >> With the BioSQL dependencies, take a look at the pom.xml file inside the
>> >> biojava-dna module. It declares a dependency on biojava-core. If you want
>> >> to add dependencies to external JARs, take a look at biojava-biosql's
>> >> pom.xml to see how it depends on javax.persistence. (The easiest way to
>> >> add these is via an IDE such as NetBeans, which is what I'm using at the
>> >> moment.)
>> >>
>> >> cheers,
>> >> Richard
>> >>
>> >> 2008/10/21 Mark Schreiber <markjschreiber at gmail.com>
>> >>
>> >>> So if I want to build a BioSQL loader from Genbank then would the
>> >>> classes (or their wrappers) in the BioSQL Entity package need to
>> >>> implement Thing? Would Maven have an issue with that or would it just
>> >>> create a dependency on core? (You can tell I've never used Maven,
>> >>> right?)
>> >>>
>> >>> From a design point of view should Thing be an interface or an
>> >>> Annotation? The reason I ask is that it doesn't define any methods so
>> >>> it is more of a tag than an interface.
>> >>>
>> >>> Anyway, my understanding is that I would use a Genbank parser (or
>> >>> write one), write an EntityReceiver interface (probably more than one
>> >>> given the number of entities in BioSQL), and implement an EntityBuilder
>> >>> (again possibly more than one) that implements EntityReceiver and
>> >>> builds Entity beans from the messages it receives. In this case I
>> >>> probably wouldn't provide a writer as JPA would be writing the beans to
>> >>> the database. Would this be how you imagine it?
>> >>>
>> >>> - Mark
>> >>>
>> >>>
>> >>> On Tue, Oct 21, 2008 at 1:52 AM, Richard Holland
>> >>> <holland at eaglegenomics.com> wrote:
>> >>>> (From now on I will only be posting these development messages to
>> >>>> biojava-dev, which is the intended purpose of that list. Those of you
>> >>>> who wish to keep track of things but are currently only subscribed to
>> >>>> biojava-l should also subscribe to biojava-dev in order to keep up to
>> >>>> date.)
>> >>>>
>> >>>> As promised, I've committed a new package in the biojava-core module
>> >>>> that should help understand how to do file parsing and conversion and
>> >>>> writing in the new BJ3 modules. Here's an example of how to use it to
>> >>>> write a Genbank parser (note no parsers actually exist yet!):
>> >>>>
>> >>>> 1. Design yourself a Genbank class which implements the interface Thing
>> >>>> and can fully represent all the data that might possibly occur inside a
>> >>>> Genbank file.
>> >>>>
>> >>>> 2. Write an interface called GenbankReceiver, which extends
>> >>>> ThingReceiver and defines all the methods you might need in order to
>> >>>> construct a Genbank object in an asynchronous fashion.
>> >>>>
>> >>>> 3. Write a GenbankBuilder class which implements GenbankReceiver and
>> >>>> ThingBuilder. Its job is to receive data via method calls, use that data
>> >>>> to construct a Genbank object, then provide that object on demand.
>> >>>>
>> >>>> 4. Write a GenbankWriter class which implements GenbankReceiver and
>> >>>> ThingWriter. Its job is similar to GenbankBuilder's, but instead of
>> >>>> constructing new Genbank objects, it writes Genbank records to file that
>> >>>> reflect the data it receives.
>> >>>>
>> >>>> 5. Write a GenbankReader class which implements ThingReader. It can read
>> >>>> Genbank files and output the data to the methods of the ThingReceiver
>> >>>> provided to it, which in this case could be anything which implements
>> >>>> the interface GenbankReceiver.
>> >>>>
>> >>>> 6. Write a GenbankEmitter class which implements ThingEmitter. It takes a
>> >>>> Genbank object and will fire off data from it to the provided
>> >>>> ThingReceiver (a GenbankReceiver instance) as if the Genbank object were
>> >>>> being read from a file or some other source.
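
The receiver, builder and writer roles from the steps above can be sketched as follows. This is a rough illustration under stated assumptions: every method name here (startThing, finishThing, setAccession, appendSequence, getThing) is hypothetical, the record is reduced to two fields, the accession is made up, and the writer's text layout is only Genbank-like, not the real flat-file format.

```java
import java.io.Serializable;

// Hypothetical stand-ins; the committed BJ3 interfaces may declare different methods.
interface Thing extends Serializable {}
interface ThingReceiver { void startThing(); void finishThing(); }
interface ThingBuilder<T extends Thing> { T getThing(); }

// Step 1: a drastically reduced Genbank record.
class Genbank implements Thing {
    private static final long serialVersionUID = 1L;
    final String accession;
    final String sequence;
    Genbank(String accession, String sequence) {
        this.accession = accession;
        this.sequence = sequence;
    }
}

// Step 2: the events a Genbank source can fire.
interface GenbankReceiver extends ThingReceiver {
    void setAccession(String accession);
    void appendSequence(String residues);
}

// Step 3: collects events into an object and hands it over on demand.
class GenbankBuilder implements GenbankReceiver, ThingBuilder<Genbank> {
    private String accession;
    private StringBuilder sequence = new StringBuilder();
    private Genbank built;
    public void startThing() { accession = null; sequence = new StringBuilder(); built = null; }
    public void setAccession(String accession) { this.accession = accession; }
    public void appendSequence(String residues) { sequence.append(residues); }
    public void finishThing() { built = new Genbank(accession, sequence.toString()); }
    public Genbank getThing() { return built; }
}

// Step 4: renders the same events as text instead of building an object.
class GenbankWriter implements GenbankReceiver {
    final StringBuilder out = new StringBuilder();
    public void startThing() {}
    public void setAccession(String accession) {
        out.append("ACCESSION   ").append(accession).append("\nORIGIN\n");
    }
    public void appendSequence(String residues) { out.append(residues); }
    public void finishThing() { out.append("\n//\n"); }
}

class ReceiverSketch {
    // Stands in for a reader or emitter (steps 5 and 6): fires one record's events.
    static void fireExample(GenbankReceiver receiver) {
        receiver.startThing();
        receiver.setAccession("X00001"); // made-up accession for illustration
        receiver.appendSequence("acgt");
        receiver.appendSequence("ttga");
        receiver.finishThing();
    }

    public static void main(String[] args) {
        GenbankBuilder builder = new GenbankBuilder();
        GenbankWriter writer = new GenbankWriter();
        fireExample(builder);
        fireExample(writer);
        System.out.println(builder.getThing().sequence); // prints acgtttga
        System.out.print(writer.out);
    }
}
```

Because the source only ever talks to GenbankReceiver, it does not need to know whether the events end up in an object or in a file.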
>> >>>>
>> >>>> That's it! OK so it's a minimum of 6 classes instead of the original 1
>> >>>> or 2, but the additional steps are necessary for flexibility in
>> >>>> converting between formats.
>> >>>>
>> >>>> Now to use it (you'll probably want a GenbankTools class to wrap these
>> >>>> steps up for user-friendliness, including various options for opening
>> >>>> files, etc.):
>> >>>>
>> >>>> 1. To read a file - instantiate ThingParser with your GenbankReader as
>> >>>> the reader, and GenbankBuilder as the receiver. Use the iterator methods
>> >>>> on ThingParser to get the objects out.
>> >>>>
>> >>>> 2. To write a file - instantiate ThingParser with a GenbankEmitter
>> >>>> wrapping your Genbank object, and a GenbankWriter as the receiver. Use
>> >>>> the parseAll() method on the ThingParser to dump the whole lot to your
>> >>>> chosen output.
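
The two usage patterns above might look roughly like this. It is a sketch under assumptions, not the committed ThingParser: hasNext(), parseNext() and readNext() are hypothetical method names, the emitter is modelled simply as another ThingReader, and a toy tab-separated Seq format stands in for Genbank. Input lines are assumed to be well formed.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// All types below are hypothetical stand-ins, not the committed BJ3 interfaces.
interface Thing extends Serializable {}
interface ThingReceiver { void startThing(); void finishThing(); }
interface ThingReader<R extends ThingReceiver> { boolean hasNext(); void readNext(R receiver); }

class Seq implements Thing {
    private static final long serialVersionUID = 1L;
    final String id;
    final String residues;
    Seq(String id, String residues) { this.id = id; this.residues = residues; }
}
interface SeqReceiver extends ThingReceiver { void setId(String id); void setResidues(String residues); }

// Receiver that assembles objects (the builder role).
class SeqBuilder implements SeqReceiver {
    private String id;
    private String residues;
    private Seq built;
    public void startThing() { id = null; residues = null; built = null; }
    public void setId(String id) { this.id = id; }
    public void setResidues(String residues) { this.residues = residues; }
    public void finishThing() { built = new Seq(id, residues); }
    Seq getThing() { return built; }
}

// Receiver that renders "id<TAB>residues" lines (the writer role).
class SeqWriter implements SeqReceiver {
    final StringBuilder out = new StringBuilder();
    public void startThing() {}
    public void setId(String id) { out.append(id).append('\t'); }
    public void setResidues(String residues) { out.append(residues); }
    public void finishThing() { out.append('\n'); }
}

// Source that parses text lines (the reader role).
class SeqLineReader implements ThingReader<SeqReceiver> {
    private final Iterator<String> lines;
    SeqLineReader(List<String> lines) { this.lines = lines.iterator(); }
    public boolean hasNext() { return lines.hasNext(); }
    public void readNext(SeqReceiver r) {
        String[] f = lines.next().split("\t", 2);
        r.startThing(); r.setId(f[0]); r.setResidues(f[1]); r.finishThing();
    }
}

// Source that replays in-memory objects (the emitter role).
class SeqEmitter implements ThingReader<SeqReceiver> {
    private final Iterator<Seq> seqs;
    SeqEmitter(List<Seq> seqs) { this.seqs = seqs.iterator(); }
    public boolean hasNext() { return seqs.hasNext(); }
    public void readNext(SeqReceiver r) {
        Seq s = seqs.next();
        r.startThing(); r.setId(s.id); r.setResidues(s.residues); r.finishThing();
    }
}

// Pairs one source with one receiver.
class ThingParser<R extends ThingReceiver> {
    private final ThingReader<R> reader;
    private final R receiver;
    ThingParser(ThingReader<R> reader, R receiver) { this.reader = reader; this.receiver = receiver; }
    boolean hasNext() { return reader.hasNext(); }
    void parseNext() { reader.readNext(receiver); }
    void parseAll() { while (hasNext()) parseNext(); }
}

class ParserUsageSketch {
    // Pattern 1: reader plus builder, pulling objects out one at a time.
    static List<Seq> readAll(List<String> lines) {
        SeqBuilder builder = new SeqBuilder();
        ThingParser<SeqReceiver> parser = new ThingParser<>(new SeqLineReader(lines), builder);
        List<Seq> result = new ArrayList<>();
        while (parser.hasNext()) { parser.parseNext(); result.add(builder.getThing()); }
        return result;
    }

    // Pattern 2: emitter plus writer, dumping everything with parseAll().
    static String writeAll(List<Seq> seqs) {
        SeqWriter writer = new SeqWriter();
        new ThingParser<SeqReceiver>(new SeqEmitter(seqs), writer).parseAll();
        return writer.out.toString();
    }

    public static void main(String[] args) {
        List<Seq> seqs = readAll(Arrays.asList("s1\tACGT", "s2\tGGCC"));
        System.out.print(writeAll(seqs));
    }
}
```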
>> >>>>
>> >>>> The clever bit comes when you want to convert between files. Imagine
>> >>>> you've done all the above for Genbank, and you've also done it for
>> >>>> FASTA. How to convert between them? What you need to do is this:
>> >>>>
>> >>>> 1. Implement all the classes for both Genbank and FASTA.
>> >>>>
>> >>>> 2. Write a GenbankFASTAConverter class that implements
>> >>>> ThingConverter<FASTA> and GenbankReceiver, and will internally convert
>> >>>> the data received and pass it on out to the receiver provided, which
>> >>>> will be a FASTAReceiver instance.
>> >>>>
>> >>>> 3. Write a FASTAGenbankConverter class that operates in exactly the
>> >>>> opposite way, implementing ThingConverter<Genbank> and FASTAReceiver.
>> >>>>
>> >>>> Then to convert you use ThingParser again:
>> >>>>
>> >>>> 1. From FASTA file to Genbank object: Instantiate ThingParser with a
>> >>>> FASTAReader reader, a GenbankBuilder receiver, and add a
>> >>>> FASTAGenbankConverter instance to the converter chain. Use the iterator
>> >>>> to get your Genbank objects out of your FASTA file.
>> >>>>
>> >>>> 2. From FASTA file to Genbank file: Same as option 1, but provide a
>> >>>> GenbankWriter instead and use parseAll() instead of the iterator
>> >>>> methods.
>> >>>>
>> >>>> 3. From FASTA object to Genbank object: Same as option 1, but provide a
>> >>>> FASTAEmitter wrapping your FASTA object as the reader instead.
>> >>>>
>> >>>> 4. From FASTA object to Genbank file: Same as option 1, but swap both
>> >>>> the reader and the receiver as per options 2 and 3.
>> >>>>
>> >>>> 5/6/7/8. From Genbank * to FASTA * - same as 1,2,3,4 but swap all
>> >>>> mentions of FASTA and Genbank, and use GenbankFASTAConverter instead.
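
A converter of the kind described above could be sketched like this. The class names GenbankFASTAConverter, GenbankReceiver, FASTAReceiver and ThingConverter come from the post; all method names, FASTATextWriter and the sample data are hypothetical. One deliberate simplification: ThingConverter is typed here on the downstream receiver, whereas the post types it on the target Thing (ThingConverter<FASTA>).

```java
// Hypothetical stand-ins; method names are assumptions, not the BJ3 API.
interface ThingReceiver { void startThing(); void finishThing(); }

interface GenbankReceiver extends ThingReceiver {
    void setAccession(String accession);
    void setDefinition(String definition);
    void appendSequence(String residues);
}

interface FASTAReceiver extends ThingReceiver {
    void setHeader(String header);
    void appendSequence(String residues);
}

// Simplification: parameterised by the downstream receiver type.
interface ThingConverter<R extends ThingReceiver> { void setReceiver(R downstream); }

// Looks like a Genbank receiver upstream and drives a FASTA receiver downstream.
class GenbankFASTAConverter implements GenbankReceiver, ThingConverter<FASTAReceiver> {
    private FASTAReceiver downstream;
    private String accession = "";
    private String definition = "";
    private StringBuilder sequence = new StringBuilder();

    public void setReceiver(FASTAReceiver downstream) { this.downstream = downstream; }
    public void startThing() { accession = ""; definition = ""; sequence = new StringBuilder(); }
    public void setAccession(String accession) { this.accession = accession; }
    public void setDefinition(String definition) { this.definition = definition; }
    public void appendSequence(String residues) { sequence.append(residues); }

    // Buffer until the record is complete, because the FASTA header needs both fields.
    public void finishThing() {
        downstream.startThing();
        downstream.setHeader((accession + " " + definition).trim());
        downstream.appendSequence(sequence.toString());
        downstream.finishThing();
    }
}

// Minimal FASTA-style text sink for the demonstration.
class FASTATextWriter implements FASTAReceiver {
    final StringBuilder out = new StringBuilder();
    public void startThing() {}
    public void setHeader(String header) { out.append('>').append(header).append('\n'); }
    public void appendSequence(String residues) { out.append(residues); }
    public void finishThing() { out.append('\n'); }
}

class ConverterSketch {
    // Fires made-up Genbank events at the converter and returns the FASTA text.
    static String convertExample() {
        FASTATextWriter writer = new FASTATextWriter();
        GenbankFASTAConverter converter = new GenbankFASTAConverter();
        converter.setReceiver(writer);
        converter.startThing();
        converter.setAccession("X00001");
        converter.setDefinition("example record");
        converter.appendSequence("acgt");
        converter.appendSequence("ttga");
        converter.finishThing();
        return writer.out.toString();
    }

    public static void main(String[] args) {
        System.out.print(convertExample());
    }
}
```

Swapping FASTATextWriter for a builder-style FASTAReceiver would yield objects instead of text without touching the converter.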
>> >>>>
>> >>>> One last and very important feature of this approach is that if you
>> >>>> discover that nobody has written the appropriate converter for your
>> >>>> chosen pair of formats A and C, but converters do exist to map A to some
>> >>>> other format B and that other format B on to C, then you can just put
>> >>>> the two converters A-B and B-C into the ThingParser chain and it'll work
>> >>>> perfectly.
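
Chaining two converters can be illustrated with three toy formats. Everything in this sketch is hypothetical (AReceiver, BReceiver, CReceiver, AToB, BToC, CCollector and their methods); here the chain is built by nesting constructors rather than through ThingParser, whose chaining API is not shown in this thread.

```java
// Three toy formats, each with its own receiver contract (all hypothetical).
interface AReceiver { void accession(String acc); void sequence(String seq); void end(); }
interface BReceiver { void id(String id); void residues(String residues); void end(); }
interface CReceiver { void header(String header); void body(String body); void end(); }

// A-to-B converter: receives format A events, forwards format B events.
class AToB implements AReceiver {
    private final BReceiver next;
    AToB(BReceiver next) { this.next = next; }
    public void accession(String acc) { next.id(acc); }
    public void sequence(String seq) { next.residues(seq); }
    public void end() { next.end(); }
}

// B-to-C converter: receives format B events, forwards format C events.
class BToC implements BReceiver {
    private final CReceiver next;
    BToC(CReceiver next) { this.next = next; }
    public void id(String id) { next.header(">" + id); }
    public void residues(String residues) { next.body(residues.toUpperCase()); }
    public void end() { next.end(); }
}

// Final receiver: collects format C output as text.
class CCollector implements CReceiver {
    final StringBuilder text = new StringBuilder();
    public void header(String header) { text.append(header).append('\n'); }
    public void body(String body) { text.append(body).append('\n'); }
    public void end() {}
}

class ChainSketch {
    // No direct A-to-C converter exists; A-to-B and B-to-C are composed instead.
    static String convert(String accession, String sequence) {
        CCollector sink = new CCollector();
        AReceiver chain = new AToB(new BToC(sink));
        chain.accession(accession);
        chain.sequence(sequence);
        chain.end();
        return sink.text.toString();
    }

    public static void main(String[] args) {
        System.out.print(convert("X1", "acgt")); // prints >X1 then ACGT on separate lines
    }
}
```

A chain like this can only carry what the intermediate format B is able to represent, so data that B has no event for is dropped on the way from A to C.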
>> >>>>
>> >>>> Enjoy!
>> >>>>
>> >>>> cheers,
>> >>>> Richard
>> >>>>
>> >>>> --
>> >>>> Richard Holland, BSc MBCS
>> >>>> Finance Director, Eagle Genomics Ltd
>> >>>> M: +44 7500 438846 | E: holland at eaglegenomics.com
>> >>>> http://www.eaglegenomics.com/
>> >>>> _______________________________________________
>> >>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>> >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>> >>>>
>> >>
>> >>
>> >>
>> >
>
>
>
> --
> Richard Holland, BSc MBCS
> Finance Director, Eagle Genomics Ltd
> M: +44 7500 438846 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/
>