[Biojava-l] Load Genbank files takes ages
Florian Mittag
florian.mittag at uni-tuebingen.de
Thu Jul 16 15:38:25 UTC 2009
Hi all!
We are trying to load Genbank files into our bioseqdb database using BioJava.
I pieced the code together from tutorials and previous posts on this mailing
list. My problems:
1) It eats huge amounts of memory, so I had to increase the heap size to 2GB.
2) Loading the first two files works great, but the third one ran for two
hours without completing. Here is my code:
--- snip ---
// loop over all downloaded *.gbk files starting with the highest number
System.out.println("Updating chromosome " + chrNo[j] + " ...");
BufferedReader fileIn = new BufferedReader(new FileReader(localFile));
tx = session.beginTransaction();
GenbankFormat gf = new GenbankFormat();
SimpleRichSequenceBuilder listener = new SimpleRichSequenceBuilder();
RichSequence seq = null;
gf.readRichSequence(fileIn, dnaTokenization, listener, nsGenbank);
seq = listener.makeRichSequence();
if( seq != null ) {
    // check, if a sequence with this identifier is already in the DB
    Query q = session.createQuery(
        "select be from BioEntry as be where identifier=:identifier");
    q.setString("identifier",seq.getIdentifier());
    List entries = q.list();
    for( Object o : entries ) {
        // delete the old sequence in the DB
        BioEntry oldSeq = (BioEntry)o;
        session.delete("BioEntry", oldSeq);
    }
    tx.commit();
    tx = session.beginTransaction();
    session.save("Sequence", seq);
    System.out.println("Chromosome " + chrNo[j] + " was updated.\n");
} else {
    System.out.println("Chromosome " + chrNo[j] + " was NOT updated.\n");
}
tx.commit();
--- snap ---
This is the generated output:
---snip ---
Jul 16, 2009 4:33:53 PM - FINE: Starting update of chromosome 001807
Updating chromosome 001807 ...
Chromosome 001807 was updated.
Jul 16, 2009 4:33:55 PM - FINE: Starting update of chromosome 000024
Updating chromosome 000024 ...
Chromosome 000024 was updated.
Jul 16, 2009 4:35:27 PM - FINE: Starting update of chromosome 000023
Updating chromosome 000023 ...
--- snap ---
The files for this are downloaded from Genbank and the file sizes are:
NC_001807.gbk 58.4 KB
NC_000024.gbk 70.8 MB
NC_000023.gbk 190.1 MB
So I don't see why loading a 70.8 MB file took less than 2 minutes while a
190.1 MB file hasn't completed after 2 hours. During this time, the CPU load
is almost 100% and there is no significant network or hard disk activity.
When I paused the program (I'm using Eclipse) to see where all the processing
power was going, I ended up with the following stack trace (sorry for the
unreadable format):
CharacterTokenization.tokenizeSymbolList(SymbolList) line: 214
AlphabetManager$WellKnownTokenizationWrapper.tokenizeSymbolList(SymbolList) line: 1460
SimpleSymbolList(AbstractSymbolList).seqString() line: 102
BioSQLRichSequenceHandler(DummyRichSequenceHandler).seqString(RichSequence) line: 115
BioSQLRichSequenceHandler.seqString(RichSequence) line: 155
SimpleRichSequence(ThinRichSequence).seqString() line: 203
SimpleRichSequence.getStringSequence() line: 77
GeneratedMethodAccessor132.invoke(Object, Object[]) line: not available
DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25
Method.invoke(Object, Object...) line: 597
BasicPropertyAccessor$BasicGetter.get(Object) line: 145
PojoEntityTuplizer(AbstractEntityTuplizer).getPropertyValues(Object) line: 249
PojoEntityTuplizer.getPropertyValues(Object) line: 244
JoinedSubclassEntityPersister(AbstractEntityPersister).getPropertyValues(Object, EntityMode) line: 3567
DefaultFlushEntityEventListener.getValues(Object, EntityEntry, EntityMode, boolean, SessionImplementor) line: 167
DefaultFlushEntityEventListener.onFlushEntity(FlushEntityEvent) line: 120
DefaultAutoFlushEventListener(AbstractFlushingEventListener).flushEntities(FlushEvent) line: 196
DefaultAutoFlushEventListener(AbstractFlushingEventListener).flushEverythingToExecutions(FlushEvent) line: 76
DefaultAutoFlushEventListener.onAutoFlush(AutoFlushEvent) line: 35
SessionImpl.autoFlushIfRequired(Set) line: 970
SessionImpl.list(String, QueryParameters) line: 1115
QueryImpl.list() line: 79
QueryImpl(AbstractQueryImpl).uniqueResult() line: 811
GeneratedMethodAccessor38.invoke(Object, Object[]) line: not available
DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25
Method.invoke(Object, Object...) line: 597
BioSQLRichObjectBuilder.buildObject(Class, List) line: 133
RichObjectFactory.getObject(Class, Object[]) line: 107
GenbankFormat.readRichSequence(BufferedReader, SymbolTokenization, RichSeqIOListener, Namespace) line: 450
UpdateDB_Main.updateChromosome() line: 542
Now we go to GenbankFormat.readRichSequence(). It hangs at about line 450, the
line where it loads a CrossRef object, so I added debug output:
--- snip ---
// parameter on old feature
if (key.equals("db_xref")) {
    Matcher m = dbxp.matcher(val);
    if (m.matches()) {
        String dbname = m.group(1);
        String raccession = m.group(2);
        if (dbname.equalsIgnoreCase("taxon")) {
            [...]
        } else {
            try {
                long starttime = System.currentTimeMillis();
                CrossRef cr = (CrossRef)RichObjectFactory.getObject(
                        SimpleCrossRef.class,
                        new Object[] {dbname, raccession, new Integer(0)});
                long duration = System.currentTimeMillis() - starttime;
                if( duration > 100 ) {
                    System.out.println("dbname: " + dbname + ", raccession: " + raccession);
                    System.out.println(" took " + duration + "ms");
                }
                RankedCrossRef rcr = new SimpleRankedCrossRef(cr, ++rcrossrefCount);
                rlistener.getCurrentFeature().addRankedCrossRef(rcr);
--- snap ---
This produces the following output:
--- snip ---
dbname: GeneID, raccession: 677739
took 3291ms
dbname: HGNC, raccession: 31847
took 2427ms
dbname: GeneID, raccession: 55344
took 2932ms
dbname: HGNC, raccession: 23148
took 2339ms
dbname: GI, raccession: 94158612
took 2418ms
dbname: GI, raccession: 8922995
took 2920ms
[...]
--- snap ---
These are all /db_xref properties of the NC_000023.gbk file. Digging deeper,
it looks like for every CrossRef object loaded, the whole BioEntry object is
built and the sequence parsed. But remember, this only happens on chromosome
23, not on 24, which has /db_xref entries, too.
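A toy model of what the stack trace suggests, as a sketch only: each lookup
through RichObjectFactory.getObject() runs a query, the query triggers an
auto-flush, and the flush reads getStringSequence() on a sequence the session
is tracking, so the cost of every lookup grows with the sequence length.
ToyEntity, ToySession and simulate() are made-up names for illustration, not
BioJava or Hibernate classes, and which sequence is actually being re-read is
an assumption here.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: none of these classes are Hibernate or BioJava types.
class AutoFlushSketch {

    /** Stand-in for an entity whose mapped getter is expensive. */
    static class ToyEntity {
        final int sequenceLength;
        long symbolsTokenized = 0;

        ToyEntity(int sequenceLength) {
            this.sequenceLength = sequenceLength;
        }

        /** Pretends to tokenize the whole sequence on every call. */
        String getStringSequence() {
            symbolsTokenized += sequenceLength;
            return "";
        }
    }

    /** Stand-in for a session that dirty-checks tracked entities before a query. */
    static class ToySession {
        private final List<ToyEntity> tracked = new ArrayList<>();
        private final boolean autoFlush;

        ToySession(boolean autoFlush) {
            this.autoFlush = autoFlush;
        }

        void attach(ToyEntity entity) {
            tracked.add(entity);
        }

        /** One query, e.g. one cross-reference lookup. */
        void query() {
            if (autoFlush) {
                // Assumed behaviour: the flush reads every mapped property.
                for (ToyEntity entity : tracked) {
                    entity.getStringSequence();
                }
            }
        }
    }

    /** Returns the total number of symbols tokenized across all lookups. */
    static long simulate(int sequenceLength, int lookups, boolean autoFlush) {
        ToyEntity sequence = new ToyEntity(sequenceLength);
        ToySession session = new ToySession(autoFlush);
        session.attach(sequence);
        for (int i = 0; i < lookups; i++) {
            session.query();
        }
        return sequence.symbolsTokenized;
    }

    public static void main(String[] args) {
        System.out.println(simulate(1000, 50, true));  // prints 50000
        System.out.println(simulate(1000, 50, false)); // prints 0
    }
}
```

With a 1000-symbol sequence and 50 lookups, the flushing variant tokenizes
50000 symbols and the non-flushing one none. That would fit a roughly constant
cost of a few seconds per /db_xref, but it does not by itself explain why
chromosome 24 is unaffected.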
I have already spent some time on this, but I can't figure out what the cause
could be.
Thanks
Florian