[BioSQL-l] Genbank loading time

Wed Jan 28 18:57:25 UTC 2009

On Jan 28, 2009, at 12:18 PM, Peter wrote:

>>> You could re-invent the wheel, and write yet another
>>> GenBank/EMBL/Swiss parser in standalone perl for use within
>>> load_seqdatabase.pl but I really don't see any point to this.
>>> Reusing the BioPerl parser seems most sensible, especially given that
>>> bioperl-db is an extension to bioperl in the first place - and the
>>> BioPerl parsers already exist and are well tested.
>>>
>>> Peter
>>
>> My point is, instead of first mapping record data to a specific
>> object/class then mapping the object data to the database, bypass the
>> object completely and generically map relevant data directly in the
>> database according to the BioSQL schema.
>>
>> If anything this may force some consistency between the various Bio*
>> languages.
>>
>> chris
>
> Ah - so rather than using BioPerl/Biopython/BioJava to import your
> sequence files into a BioSQL database, you'd like BioSQL to come with
> its own script that does the job?  It would "solve" any
> inconsistencies for getting files of data into the database if this
> were the only sanctioned way to add records to the database.
> However, there are a number of downsides - in addition to the
> considerable extra effort needed to write and support another set of
> parsers just for BioSQL (without reusing BioPerl/Biopython/BioJava).
>
> What about BioPerl/Biopython/BioJava users who have sequence-record
> objects in memory they want to record in the database?  These could
> have been loaded from GenBank files originally and then manipulated
> (e.g. adding additional crude annotation from running BLAST).  How
> would they get them into the database - write them to a GenBank file
> and then invoke the project-neutral BioSQL-provided script?

No, one would use the same adaptors as before (bioperl-db for BioPerl,  
for instance).

> I think each project needs their own ORM bindings for loading data
> both into and out of the database.  Improving any inconsistencies in how
> each ends up storing sequence files (e.g. GenBank files) can be worked
> on gradually.
>
> [Perhaps I have read more into your comment than you intended - if I
> have got the wrong end of the stick, please clarify - thanks]
>
> Still, a project-neutral BioSQL-bundled script (not depending on any
> of BioPerl/Biopython/BioJava) for importing a GenBank file into a
> database could serve as a "reference implementation" (the role I
> currently assign to BioPerl's load_seqdatabase.pl).  And if this
> proves faster than load_seqdatabase.pl that's a nice bonus.
>
> Peter

That's what I'm thinking, essentially; something that is Bio*-neutral  
that can be tested against.  And it should be faster at least from the  
standpoint of not having to generate tons of objects.
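
A minimal sketch of the idea, under loud assumptions: it uses Python with
SQLite, a cut-down subset of three BioSQL tables (biodatabase, bioentry,
biosequence; the real schema has more columns and many more tables), and a
hypothetical reader, parse_genbank_minimal, that only understands the LOCUS,
DEFINITION, VERSION and ORIGIN lines.  Fields go from the flat file straight
into rows, with no sequence objects in between.  It is not a real loader.

```python
import sqlite3

# Assumption: simplified subset of the BioSQL schema, for illustration only.
SCHEMA = """
CREATE TABLE biodatabase (
    biodatabase_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    biodatabase_id INTEGER NOT NULL REFERENCES biodatabase(biodatabase_id),
    name TEXT NOT NULL,
    accession TEXT NOT NULL,
    version INTEGER NOT NULL,
    description TEXT
);
CREATE TABLE biosequence (
    bioentry_id INTEGER PRIMARY KEY REFERENCES bioentry(bioentry_id),
    length INTEGER,
    alphabet TEXT,
    seq TEXT
);
"""

def parse_genbank_minimal(text):
    """Hypothetical reader: yields one plain dict per record.

    Ignores features, references, and multi-line DEFINITION fields.
    """
    rec, in_origin, chunks = {}, False, []
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            rec = {"name": line.split()[1], "description": "",
                   "accession": "", "version": 0}
            in_origin, chunks = False, []
        elif line.startswith("DEFINITION"):
            rec["description"] = line[12:].strip()
        elif line.startswith("VERSION"):
            # e.g. "VERSION     ABC123.2" -> accession "ABC123", version 2
            acc, _, ver = line.split()[1].partition(".")
            rec["accession"], rec["version"] = acc, int(ver or 0)
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            if "name" in rec:
                rec["seq"] = "".join(chunks).upper()
                yield rec
            rec, in_origin = {}, False
        elif in_origin:
            # Drop the leading position number, keep the residue blocks.
            chunks.append("".join(line.split()[1:]))

def load(conn, dbname, text, alphabet="dna"):
    """Write parsed fields directly into the tables; returns record count."""
    cur = conn.cursor()
    cur.execute("INSERT OR IGNORE INTO biodatabase (name) VALUES (?)", (dbname,))
    cur.execute("SELECT biodatabase_id FROM biodatabase WHERE name = ?", (dbname,))
    db_id = cur.fetchone()[0]
    count = 0
    for rec in parse_genbank_minimal(text):
        cur.execute(
            "INSERT INTO bioentry (biodatabase_id, name, accession, version,"
            " description) VALUES (?, ?, ?, ?, ?)",
            (db_id, rec["name"], rec["accession"], rec["version"],
             rec["description"]))
        entry_id = cur.lastrowid
        cur.execute(
            "INSERT INTO biosequence (bioentry_id, length, alphabet, seq)"
            " VALUES (?, ?, ?, ?)",
            (entry_id, len(rec["seq"]), alphabet, rec["seq"]))
        count += 1
    conn.commit()
    return count

# Made-up record, not real data.
SAMPLE = """\
LOCUS       TEST0001                  12 bp    DNA     linear   SYN 28-JAN-2009
DEFINITION  Made-up example record.
ACCESSION   TEST0001
VERSION     TEST0001.1
ORIGIN
        1 acgtacgtac gt
//
"""

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    print(load(conn, "demo", SAMPLE), "record(s) loaded")
```

Because the result is just rows in the schema, each Bio* project could load
the same file through its own bindings and compare the tables against it.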

It's icing if it evolves past the point of a simple reference  
implementation into something that is useful as a fast BioSQL loader.

chris