[Biojava-l] New To BioJava.org, right in the question.
Dunarel Badescu
dunarel at gmx.net
Sat Nov 18 02:48:41 UTC 2006
Hello,
My name is Dunarel Badescu. I am a student in the Graduate Diploma in Bioinformatics at UQAM University in Montreal, Quebec, Canada.
Currently I am using BioJavaX and BioSql for my session project.
I have parsed NCBI GenBank files using RichSequences.
I then inserted the sequences into the database.
Several problems arise:
1) A bug in the code pulled from CVS:
In the class BioSQLRichObjectBuilder I had to add some code so that the program finds the right constructors:
    // Get the results
    Object result = this.uniqueResult.invoke(query, null);
    // Return the found object, if found
    if (result != null) return result;
    // Create, persist and return the new object otherwise
    else {
        /*NEW CODE*/
        if (SimpleDocRef.class.isAssignableFrom(clazz)) {
            // convert String to List constructor
            ourParamsList.set(0,
                DocRefAuthor.Tools.parseAuthorString((String) ourParamsList.get(0)));
        }
        // Load the class
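The kind of constructor lookup failure described in point 1 can be reproduced in isolation. The following is only a minimal sketch with plain Java reflection: the class Ref and the helper parseAuthors are hypothetical stand-ins invented for the example, not the BioJavaX classes, and it assumes the builder looks constructors up by the runtime types of its parameters. It shows that a lookup with a String first parameter fails when the only constructor takes a List, and that converting the parameter first makes it succeed.

```java
import java.lang.reflect.Constructor;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ConstructorLookupSketch {

    /** Hypothetical class whose only constructor takes a List of authors. */
    public static class Ref {
        public final List<String> authors;
        public final String location;

        public Ref(List<String> authors, String location) {
            this.authors = authors;
            this.location = location;
        }
    }

    /** Hypothetical parser: splits "A, B, C" into a list of trimmed names. */
    public static List<String> parseAuthors(String s) {
        List<String> out = new ArrayList<>();
        for (String part : s.split(",")) {
            String name = part.trim();
            if (!name.isEmpty()) out.add(name);
        }
        return out;
    }

    /** True if clazz has a public constructor with exactly these parameter types. */
    public static boolean hasConstructor(Class<?> clazz, Class<?>... types) {
        try {
            clazz.getConstructor(types);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    /** Converts a String first parameter to a List, then builds a Ref reflectively. */
    public static Ref build(List<Object> params) throws ReflectiveOperationException {
        if (params.get(0) instanceof String) {
            params.set(0, parseAuthors((String) params.get(0)));
        }
        Constructor<Ref> c = Ref.class.getConstructor(List.class, String.class);
        return c.newInstance(params.toArray());
    }

    public static void main(String[] args) throws Exception {
        System.out.println(hasConstructor(Ref.class, String.class, String.class));
        System.out.println(hasConstructor(Ref.class, List.class, String.class));
        Ref r = build(new ArrayList<Object>(Arrays.asList("Smith J., Doe A.", "J Biol 1:1")));
        System.out.println(r.authors);
    }
}
```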
2) Many memory problems: after inserting about 800 sequences the program slows down extremely, so performance is badly degraded.
I suspected a Hibernate cache problem and tried to turn caching off by setting some parameters:
<property name="hibernate.jdbc.batch_size">20</property>
<property name="hibernate.cache.enabled">false</property>
<property name="hibernate.cache.use_query_cache">false</property>
<property name="hibernate.cache.use_second_level_cache">false</property>
<property name="hibernate.connection.aggressive_release">false</property>
<property name="cache.provider_class">org.hibernate.cache.NoCacheProvider</property>
<property name="cache.use_query_cache">false</property>
<property name="cache.use_minimal_puts">false</property>
<property name="max_fetch_depth">3</property>
but with little benefit.
Then I observed a small performance gain by using:
session.save("Sequence",rs); // persist the sequence
session.getTransaction().commit();
session.flush();
session.evict(rs);
The session.evict(rs);
Any attempt to deallocate memory by closing the session or the session factory either generates errors right away or generates them on reopening.
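A common variation on the flush-and-evict idea is to flush and clear the whole session every N saves, so that the session never holds more than N objects. The sketch below only illustrates that loop. Hibernate's Session does have flush() and clear(), but the PersistenceSession interface and the CountingSession fake here are hypothetical stand-ins so that the example runs without Hibernate or a database, and the batch size of 20 is simply taken from the hibernate.jdbc.batch_size setting above. Whether clearing the session is safe together with BioJavaX's own object handling is not something this sketch checks.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchInsertSketch {

    /** Hypothetical stand-in for the three Session calls used in the loop. */
    public interface PersistenceSession {
        void save(Object entity);
        void flush();
        void clear();
    }

    /** Saves all items, flushing and clearing every batchSize saves. Returns the flush count. */
    public static int saveInBatches(PersistenceSession session, List<?> items, int batchSize) {
        if (batchSize < 1) throw new IllegalArgumentException("batchSize must be >= 1");
        int flushes = 0;
        int pending = 0;
        for (Object item : items) {
            session.save(item);
            pending++;
            if (pending == batchSize) {
                session.flush();  // write pending inserts
                session.clear();  // drop references held by the session
                flushes++;
                pending = 0;
            }
        }
        if (pending > 0) {        // write the last, partial batch
            session.flush();
            session.clear();
            flushes++;
        }
        return flushes;
    }

    /** In-memory fake that records how many objects were held at once. */
    public static class CountingSession implements PersistenceSession {
        private final List<Object> held = new ArrayList<>();
        private int unflushed = 0;
        public int written = 0;
        public int maxHeld = 0;

        public void save(Object entity) {
            held.add(entity);
            unflushed++;
            maxHeld = Math.max(maxHeld, held.size());
        }

        public void flush() {
            written += unflushed;
            unflushed = 0;
        }

        public void clear() {
            held.clear();
            unflushed = 0; // anything not flushed before clear() is discarded
        }
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 45; i++) items.add(i);
        CountingSession session = new CountingSession();
        int flushes = saveInBatches(session, items, 20);
        System.out.println("flushes=" + flushes + " written=" + session.written
                + " maxHeld=" + session.maxHeld);
    }
}
```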
So as a last resort I split the original file of approx. 130 MB, containing one taxon from NCBI, into 38 files of 1000 sequences each, and wrote a DOS batch script that runs the program from the command line once for each file.
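For reference, the splitting step can be done along these lines. This is a minimal sketch, not the tool actually used: it relies only on the fact that each record in a GenBank flat file ends with a line containing "//", and the class name, the output file names (chunk_001.gb and so on) and the command-line arguments are made up for the example.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class GenBankSplitter {

    /**
     * Splits a GenBank flat file into files of at most recordsPerChunk records.
     * Returns the number of chunk files written.
     */
    public static int split(Path input, Path outputDir, int recordsPerChunk) throws IOException {
        if (recordsPerChunk < 1) throw new IllegalArgumentException("recordsPerChunk must be >= 1");
        Files.createDirectories(outputDir);
        int chunkCount = 0;
        int recordsInChunk = 0;
        BufferedWriter out = null;
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.ISO_8859_1)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (out == null) {
                    if (line.trim().isEmpty()) continue; // skip blank lines between records
                    chunkCount++;
                    Path chunk = outputDir.resolve(String.format("chunk_%03d.gb", chunkCount));
                    out = Files.newBufferedWriter(chunk, StandardCharsets.ISO_8859_1);
                    recordsInChunk = 0;
                }
                out.write(line);
                out.newLine();
                if (line.trim().equals("//")) { // "//" terminates a GenBank record
                    recordsInChunk++;
                    if (recordsInChunk == recordsPerChunk) {
                        out.close();
                        out = null;
                    }
                }
            }
        } finally {
            if (out != null) out.close();
        }
        return chunkCount;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: GenBankSplitter <input.gb> <outputDir>");
            return;
        }
        int n = split(Paths.get(args[0]), Paths.get(args[1]), 1000);
        System.out.println("Wrote " + n + " chunk files");
    }
}
```

Each chunk file can then be loaded by a separate run of the program, which is what the batch script does.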
So that way it works but:
3) Inserting rows sometimes generates exceptions in the references table.
After looking at it more closely, I found that disabling the unique constraint on dbxref_id in the references table solves all the remaining problems.
The comment about it in the original script is:
-- No two references can reference the same reference database entry
-- (dbxref_id). This is where the MEDLINE id goes: PUBMED:123456.
and the modification is:
--UNIQUE ( dbxref_id ) ,
Note that the script I used for creating the BioSQL schema is version 1.29 from CVS, the most recent I found.
Also, to run the script on PostgreSQL 8.1.5 I had to modify each CREATE TABLE statement by adding WITH OIDS at the end, since 8.1.5 no longer creates OIDs by default.
There must be a more elegant approach to all these problems, isn't there?
At least for the constraint: is it meant to be there or not?
I wish you all the best, and thank you for your work, which is most useful and a scarce resource.