[Biopython] Parsing GB seq files with BioPython into BioSQL

Shyam Saladi saladi at caltech.edu
Tue Mar 26 14:22:06 UTC 2013


Hi,

Thanks for the quick response. Here's the code:

from Bio import SeqIO
from BioSQL import BioSeqDatabase
import sys

server = BioSeqDatabase.open_database(...)  # connection details elided
db = server["microbial"]

handle = open(sys.argv[1], "rU")

count = db.load(SeqIO.parse(handle, "genbank"))
print "Loaded %i records" % count
server.commit()

Since each microbial genome, with its annotations, comes in a single GenBank
file, I guess it's processed as one record with many annotations for genes
and proteins.

We are using BioSQL running on MySQL (on a different machine). Are there
any tips on configuration here? Upon further thought, I think the commit
step might actually be the issue.
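One thing I could try (just a sketch, untested against our setup): since
db.load() accepts any iterable of SeqRecords and returns the number loaded,
a small helper could feed it one batch at a time and commit after each
batch, so finished records can be garbage-collected instead of piling up
until the final commit. (load_in_batches is a made-up name, not a BioSQL
function.)

```python
import itertools

def load_in_batches(records, db, server, batch_size=1):
    """Load records into a BioSQL database in small batches,
    committing after each batch so already-written records can be
    garbage-collected rather than accumulating in memory."""
    count = 0
    records = iter(records)
    while True:
        batch = list(itertools.islice(records, batch_size))
        if not batch:
            break
        count += db.load(batch)  # db.load() takes any iterable of records
        server.commit()          # commit per batch to bound memory use
    return count
```

Called as load_in_batches(SeqIO.parse(handle, "genbank"), db, server), this
would replace the single db.load(...) plus one big commit at the end.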

Thanks,
Shyam


On Tue, Mar 26, 2013 at 9:50 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Tue, Mar 26, 2013 at 1:08 PM, Shyam Saladi <saladi at caltech.edu> wrote:
> > Hi,
> >
> > I am parsing genbank genome files for microbial genomes and loading the
> > sequence and annotations into a BioSQL database.
> >
> > The program I have is quite simple (the same as given online at
> > http://biopython.org/wiki/BioSQL#Loading_Sequences_into_a_database).
> >
> > The issue is that each record, when loaded into memory, is huge. Some
> > genomes take up the entire 32 GB RAM + 32 GB swap.
> >
> > Does anyone have suggestions on how to make this process more efficient?
>
> Could you show us your code and/or give some examples where
> you find a single microbial genome is taking that much RAM -
> it does seem more likely there is something else happening,
> like keeping old records in memory (possibly as simple as
> failing to commit the data to the database regularly). Which
> database are you using? How are you doing the commits?
>
> Thanks,
>
> Peter
>
>
