[Biopython] Parsing GB seq files with BioPython into BioSQL

Peter Cock p.j.a.cock at googlemail.com
Tue Mar 26 14:36:06 UTC 2013


On Tue, Mar 26, 2013 at 2:22 PM, Shyam Saladi <saladi at caltech.edu> wrote:
> Hi,
>
> Thanks for the quick response. Here's the code:
>
> server = BioSeqDatabase.open_database( ...
> db = server["microbial"]
>
> handle = open(sys.argv[1], "rU")
>
> count = db.load(SeqIO.parse(handle, "genbank"))
> print "Loaded %i records" % count
> server.commit()
>
> Since each microbial genome with its annotations comes in a single
> GenBank file, I guess it's processed as one record with many
> annotations for genes and proteins.
>
> We are using BioSQL running on MySQL (on a different machine). Are there any
> tips on configuration here? Upon further thought, I think the commit step
> might actually be the issue.
>
> Thanks,
> Shyam

How many records are in the file, and can you confirm there are no
memory issues just parsing it? e.g.

from Bio import SeqIO

# Count the records without loading anything into BioSQL:
count = 0
for record in SeqIO.parse(handle, "genbank"):
    count += 1
print count

My guess is that you're not using auto-commit, so the database
itself (or possibly the Python MySQL layer?) is caching all the
changes (until the explicit commit is made). This could be a lot
of data!
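
If you want to try the auto-commit route, here is a minimal sketch,
assuming the MySQLdb driver and that BioSQL's adaptor keeps the raw
connection as server.adaptor.conn (true in current BioSQL, but this
is driver-specific, so check your setup):

from BioSQL import BioSeqDatabase

# Hypothetical connection details - substitute your own:
server = BioSeqDatabase.open_database(driver="MySQLdb", user="me",
                                      passwd="secret", host="dbhost",
                                      db="bioseqdb")
# MySQLdb connections have an autocommit switch, and BioSQL's
# adaptor keeps the raw connection object around:
server.adaptor.conn.autocommit(True)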

Either try turning on auto-commit (as sketched above), or use a
batched loading approach. Most simply, you could commit after each
record:

count = 0
for record in SeqIO.parse(handle, "genbank"):
    # Load one record at a time, committing as we go so the
    # pending transaction never grows large:
    assert 1 == db.load([record])
    server.commit()
    count += 1
print "Loaded %i records" % count

Peter


