[BioSQL-l] Genbank loading time

Hilmar Lapp hlapp at gmx.net
Wed Jan 28 05:09:04 UTC 2009


The loader for BioPerl is load_seqdatabase.pl, which is part of
bioperl-db. On machines that were current 3-4 years ago, I saw load
speeds of between 5 and 15 sequences per second for richly annotated
sequences (human/mouse RefSeqs).
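
To turn those rates into a rough wall-clock estimate, here is a
minimal sketch. It assumes the rate stays constant over the whole
load, which is a simplification; the function names and the sequence
counts you feed it are illustrative, not measurements.

```python
def estimated_hours(num_sequences, seqs_per_second):
    """Hours needed to load num_sequences at a constant rate."""
    if seqs_per_second <= 0:
        raise ValueError("seqs_per_second must be positive")
    return num_sequences / seqs_per_second / 3600.0


def estimate_range(num_sequences, slow=5.0, fast=15.0):
    """Return (best_case_hours, worst_case_hours).

    The defaults are the 5 and 15 sequences per second quoted above
    for richly annotated RefSeq entries.
    """
    return (estimated_hours(num_sequences, fast),
            estimated_hours(num_sequences, slow))
```

For example, estimate_range(54000) gives (1.0, 3.0): one hour at 15
sequences per second, three hours at 5.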

If you are talking about all of GenBank, the vast majority of that will
be ESTs and sequencing reads (do you really want to load those?),
which are typically sparsely annotated, if at all, and so should load
faster. mRNA and cDNA sequences will be closer to the above range.

I have never loaded all of GenBank into a database (and I'm not sure
why anyone would want to do this), so I don't have a total loading
time to compare against.

Finally, several instances of load_seqdatabase.pl can be nicely run in  
parallel on multi-core machines.
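
A minimal sketch of one way to do that from a driver script, assuming
the input is already split into several files. The round-robin
partitioning and the helper names are my own illustration, not part of
bioperl-db, and base_cmd is a placeholder to which you would add your
own connection and format options.

```python
import subprocess


def partition(files, workers):
    """Round-robin split of files into at most `workers` non-empty groups."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    groups = [files[i::workers] for i in range(workers)]
    return [group for group in groups if group]


def run_parallel(base_cmd, files, workers):
    """Start one process per group, wait for all, return their exit codes."""
    procs = [subprocess.Popen(list(base_cmd) + group)
             for group in partition(files, workers)]
    return [proc.wait() for proc in procs]


# Hypothetical usage (file names are placeholders; add your own options
# to the base command):
#   run_parallel(["perl", "load_seqdatabase.pl"],
#                ["part1.gb", "part2.gb", "part3.gb"], workers=2)
```

Each instance receives its own list of files on the command line, so
the number of workers should not exceed the number of cores you want
to keep busy.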

	-hilmar

On Jan 27, 2009, at 5:57 PM, Richard Holland wrote:

> It would depend on the toolkit you use. BioWarehouse is a complete API,
> whereas BioSQL is just a schema and the way in which it is populated
> (and therefore how long that takes) depends on your toolkit.
>
> Currently I'm aware of loaders existing for BioJava, BioPerl, and
> possibly also BioPython. However, each of them loads the same data in
> subtly different ways, so they can't be directly compared in terms of
> which one is faster than the other.
>
> I vaguely remember seeing some performance figures for the
> BioJava/Genbank/BioSQL combination somewhere, but it's been a while!
> I'm not sure where they were documented though - I certainly haven't
> got them written down anywhere. Mark Schreiber might know as he
> definitely did some testing of this - Mark, can you remember what the
> figures were for BioJava?
>
> As for BioPerl/BioPython/etc., I expect their respective project
> authors will respond to this thread accordingly with the figures from
> their own domains!
>
> cheers,
> Richard
>
> gwu wrote:
>> Hi Everyone,
>>
>> I recently visited the BioWarehouse web site, and the document shows
>> that loading the whole of Genbank into their database takes the data
>> loader 68 hours for MySQL and 27.5 hours for Oracle. So I wonder
>> whether a similar test has been done with BioSQL?
>>
>> Gang Wu
>> _______________________________________________
>> BioSQL-l mailing list
>> BioSQL-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biosql-l
>>
>
> -- 
> Richard Holland, BSc MBCS
> Finance Director, Eagle Genomics Ltd
> M: +44 7500 438846 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================





