From ap3 at sanger.ac.uk Sun Apr 13 14:02:41 2008
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Sun, 13 Apr 2008 19:02:41 +0100
Subject: [Biojava-l] biojava 1.6 released
Message-ID: <0A060667-C24C-4D41-8D10-ED1D449A5F62@sanger.ac.uk>

Biojava 1.6 has been released and is available from
http://biojava.org/wiki/BioJava:Download

Biojava 1.6 offers more functionality and stability than the previous
official releases. BioJava now depends on Java 1.5+. We highly recommend
that you upgrade as soon as possible.

In detail, the phylo package org.biojavax.bio.phylo was improved and
expanded by our GSOC'07 student Boh-Yun Lee. It now contains fully
functional Nexus and Phylip parsers, and tools for calculating UPGMA and
Neighbour Joining, Jukes-Cantor and Kimura Two Parameter, and MP. It uses
JGraphT to represent parsed trees.

The PDB file parser was improved by Jules Jacobsen to deal better with
PDB header records.

Andreas Draeger provided several patches for improving the Genetic
Algorithm modules.

Additionally, this release contains numerous bug fixes and documentation
improvements.

Thanks to the entire biojava community for making this possible!

Happy Biojava-ing,

Andreas

-----------------------------------------------------------------------

Andreas Prlic
Wellcome Trust Sanger Institute
Hinxton, Cambridge CB10 1SA, UK
+44 (0) 1223 49 6891

-----------------------------------------------------------------------

--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

From debrown at unity.ncsu.edu Wed Apr 16 07:57:44 2008
From: debrown at unity.ncsu.edu (Doug Brown)
Date: Wed, 16 Apr 2008 07:57:44 -0400
Subject: [Biojava-l] loading multiple records for same organism and peristance in BioSQL
Message-ID: <4805E9B8.6000709@unity.ncsu.edu>

Greetings,

I am happily climbing the learning curve for Biojava-live, Biojavax,
and BioSQL. I believe that I am using the latest releases, Biojava 1.6
and BioSQL 1.0, in that I have performed the installation within the
past week.

I am attempting to load, via Biojavax, multiple GenBank records for the
same organism (a whole genome's worth of annotations) and to save those
into a BioSQL database via Biojavax's Hibernate persistence mechanism.
Loading a second GenBank file (same organism, different sequence) croaks
with the error: SEVERE: Duplicate entry 'genbankBiosqlRich' for key 2
...... could not insert: [Namespace]. FYI two sample GenBank records
are CH476760.gb and CH476761.gb and were obtained directly from GenBank.

Never having used Hibernate or its type of database abstraction before,
I think that I am properly handling the transaction semantics. Either I
am violating unspoken presumptions of the persistence paradigm or the
behavior of RichSequence.IOTools.readGenbankDNA is not what I expected.
I had presumed that the above routine would use the established
RichObjectFactory to obtain new or extant objects and then populate
those objects with values from the sequence file. This only seems to
happen when I load multiple sequences from a single file. Multi-file
operations fail dismally.

What is the proper way of using Biojava to load up a database with records?

In advance, thank you all for the traffic on this list, it has been
quite helpful in bringing me up to speed.

Regards,
Doug Brown

Here is the relevant [hacked] subroutine:

/**
 * This works for GenBank files containing multiple sequences.
 * Original concept from:
 * http://portal.open-bio.org/pipermail/biojava-l/2007-April/005824.html
 * It fails on inserting existing record(s) - does not replace...
 * This causes grief when loading multiple files...
 */
public void loadNSave( Session session, File fileName)
{
    boolean localSession = (session == null);
    Transaction tx = null;

    try
    {
        System.out.println( "*********** Loading "+fileName+"...");
        BufferedReader br = new BufferedReader( new FileReader( fileName) );

        if ( session == null) // create a local session
        {
            session = sessionFactory.openSession();
            RichObjectFactory.connectToBioSQL(session);
        }

        // load the objects. I expect this to use the established factory.
        RichSequenceIterator rsi = RichSequence.IOTools.readGenbankDNA(
                br, new SimpleNamespace( "genbankBiosqlRich") );

        while ( rsi.hasNext() )
        {
            tx = session.beginTransaction(); // Hibernate requires transactions.

            System.out.println( "*********** Loading next sequence...");
            // ??should automatically fetch existing objects from the database...
            RichSequence sequence = rsi.nextRichSequence();
            System.out.println( "loaded sequence "+sequence.getAccession()
                    +", identifier: "+ sequence.getIdentifier());

            try
            {
                System.out.println( "*********** saving...");

                // synchronize in-memory representation w/ the database
                // HUGE amounts of time spent doing selects on keys - really slows things down!!
                session.saveOrUpdate( "Sequence", sequence );
                tx.commit(); // save to database - does an automatic flush
                // batch operations overwhelm the cache - clear it out!
                session.flush(); // force in-memory to disk.
                session.clear(); // clean out cache.
            }
            catch (HibernateException ex)
            {
                tx.rollback(); // discard the sequence and all its annotations
                ex.printStackTrace();
            }
        }
    }
    catch (FileNotFoundException ex)
    {
        ex.printStackTrace();
    }
    catch ( BioException bex)
    {
        bex.printStackTrace();
    }
    finally
    {
        if ( localSession)
        {
            session.flush(); // force in-memory to disk.
            session.close(); // only for local sessions
        }
    }
}

The following is a sample stack dump:

org.hibernate.exception.ConstraintViolationException: could not insert: [Namespace]
    at org.hibernate.exception.SQLStateConverter.convert(SQLStateConverter.java:71)
    at org.hibernate.exception.JDBCExceptionHelper.convert(JDBCExceptionHelper.java:43)
    at org.hibernate.id.insert.AbstractReturningDelegate.performInsert(AbstractReturningDelegate.java:40)
    at org.hibernate.persister.entity.AbstractEntityPersister.insert(AbstractEntityPersister.java:2163)
    at org.hibernate.persister.entity.AbstractEntityPersister.insert(AbstractEntityPersister.java:2643)
    at org.hibernate.action.EntityIdentityInsertAction.execute(EntityIdentityInsertAction.java:51)
    at org.hibernate.engine.ActionQueue.execute(ActionQueue.java:279)
    at org.hibernate.event.def.AbstractSaveEventListener.performSaveOrReplicate(AbstractSaveEventListener.java:298)
    at org.hibernate.event.def.AbstractSaveEventListener.performSave(AbstractSaveEventListener.java:181)
    at org.hibernate.event.def.AbstractSaveEventListener.saveWithGeneratedId(AbstractSaveEventListener.java:107)
    at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.saveWithGeneratedOrRequestedId(DefaultSaveOrUpdateEventListener.java:187)
    at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.entityIsTransient(DefaultSaveOrUpdateEventListener.java:172)
    at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.performSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:94)
    at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.onSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:70)
    at org.hibernate.impl.SessionImpl.fireSaveOrUpdate(SessionImpl.java:507)
    at org.hibernate.impl.SessionImpl.saveOrUpdate(SessionImpl.java:499)
    at org.hibernate.engine.CascadingAction$5.cascade(CascadingAction.java:218)
    at org.hibernate.engine.Cascade.cascadeToOne(Cascade.java:268)
    at org.hibernate.engine.Cascade.cascadeAssociation(Cascade.java:216)
    at org.hibernate.engine.Cascade.cascadeProperty(Cascade.java:169)
    at org.hibernate.engine.Cascade.cascade(Cascade.java:130)
    at org.hibernate.event.def.AbstractSaveEventListener.cascadeBeforeSave(AbstractSaveEventListener.java:431)
    at org.hibernate.event.def.AbstractSaveEventListener.performSaveOrReplicate(AbstractSaveEventListener.java:265)
    at org.hibernate.event.def.AbstractSaveEventListener.performSave(AbstractSaveEventListener.java:181)
    at org.hibernate.event.def.AbstractSaveEventListener.saveWithGeneratedId(AbstractSaveEventListener.java:107)
    at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.saveWithGeneratedOrRequestedId(DefaultSaveOrUpdateEventListener.java:187)
    at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.entityIsTransient(DefaultSaveOrUpdateEventListener.java:172)
    at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.performSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:94)
    at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.onSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:70)
    at org.hibernate.impl.SessionImpl.fireSaveOrUpdate(SessionImpl.java:507)
    at org.hibernate.impl.SessionImpl.saveOrUpdate(SessionImpl.java:499)
    at bioinformatics.biojava.BriefLoader.loadNSave(BriefLoader.java:108)
    at bioinformatics.biojava.BriefLoader.main(BriefLoader.java:72)
Caused by: java.sql.SQLException: Duplicate entry 'genbankBiosqlRich' for key 2
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2975)
    at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1600)
    at com.mysql.jdbc.ServerPreparedStatement.serverExecute(ServerPreparedStatement.java:1125)
    at com.mysql.jdbc.ServerPreparedStatement.executeInternal(ServerPreparedStatement.java:677)
    at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1357)
    at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1274)
    at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1259)
    at org.hibernate.id.IdentityGenerator$GetGeneratedKeysDelegate.executeAndExtract(IdentityGenerator.java:73)
    at org.hibernate.id.insert.AbstractReturningDelegate.performInsert(AbstractReturningDelegate.java:33)

--
Doug Brown - Bioinformatics
Fungal Genomics Laboratory
Center for Integrated Fungal Research
North Carolina State University
Campus Box 7251, Raleigh, NC 27695-7251
https://www.fungalgenomics.ncsu.edu/~debrown/
Tel: (919) 513-0394, Fax (919) 513-0024
e-mail: doug_brown at ncsu.edu

From dreher at molgen.mpg.de Mon Apr 21 08:51:52 2008
From: dreher at molgen.mpg.de (Felix Dreher)
Date: Mon, 21 Apr 2008 14:51:52 +0200
Subject: [Biojava-l] mailing list archives
Message-ID: <480C8DE8.5040805@molgen.mpg.de>

Hello all,

is there a possibility to query the biojava mailing-list archives?
(the link provided on the biojava-homepage doesn't work:
http://search.open-bio.org/cgi-bin/mail-search.cgi)

Best regards,
Felix

From markjschreiber at gmail.com Mon Apr 21 21:27:51 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Tue, 22 Apr 2008 09:27:51 +0800
Subject: [Biojava-l] loading multiple records for same organism and peristance in BioSQL
In-Reply-To: <4805E9B8.6000709@unity.ncsu.edu>
References: <4805E9B8.6000709@unity.ncsu.edu>
Message-ID: <93b45ca50804211827i42924c9rb1d5d2f85c25a1ca@mail.gmail.com>

Hi Doug -

Has anyone provided a solution for this yet?

I haven't used the Hibernate bindings to BioSQL lately (I'm actually
working on a JPA binding with EntityBeans) but when I did they worked
well. However, I have seen this type of error before. Clearly the entry
'genbankBiosqlRich' is being duplicated somewhere where it should be unique.
This looks unusual because you call saveOrUpdate, which should be able to
figure this out unless the way Hibernate determines equality is not the
same as the way BioJava does.

What happens if you try to save them one at a time (in sequential runs of
your program)? From the look of the stack trace you might see the same
error.

Also, it might pay to look at the biojavax documentation at
http://biojava.org/wiki/BioJava:BioJavaXDocs#BioSQL_and_Hibernate

Any Hibernate experts able to offer an opinion here?

- Mark

On Wed, Apr 16, 2008 at 7:57 PM, Doug Brown wrote:
> Greetings,
> I am happily climbing the learning curve for Biojava-live, Biojavax,
> and BioSQL. I believe that I am using the latest releases, Biojava 1.6
> and BioSQL 1.0, in that I have performed the installation within the
> past week.
>
> I am attempting to load, via Biojavax, multiple genbank records for the
> same organism (a whole genome's worth of annotations) and to save those
> into a BioSQL database via Biojavax's Hibernate persistence mechanism.
> Loading a second genbank file (same organism, different sequence) croaks
> with the error: SEVERE: Duplicate entry 'genbankBiosqlRich' for key 2
> ...... could not insert: [Namespace]. FYI two sample genbank records
> are CH476760.gb and CH476761.gb and were obtained directly from genbank.
>
> Never having used Hibernate before nor its type of database abstraction,
> I think that I am properly handling the transaction semantics. Either I
> am violating unspoken presumptions of the persistence paradigm or the
> behavior of RichSequence.IOTools.readGenbankDNA is not what I expected.
> I had presumed that the above routine would use the established
> RichObjectFactory to obtain new or extant objects and then populate
> those objects with values from the sequence file. This only seems to
> happen when I load multiple sequences from a single file. Multi file
> operations fail dismally.
>
> What is the proper way of using Biojava to load up a database with records?
>
> In advance, thank you all for the traffic on this list, it has been
> quite helpful in bringing me up to speed.
>
> Regards,
> Doug Brown
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From markjschreiber at gmail.com Mon Apr 21 21:39:39 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Tue, 22 Apr 2008 09:39:39 +0800
Subject: [Biojava-l] Serching mailing list archives
Message-ID: <93b45ca50804211839lddf0a7je47c337c5250c85d@mail.gmail.com>

Dear Felix -

The URL has been updated. Please see below.

- Mark

---------- Forwarded message ----------
From: Mauricio Herrera Cuadra via RT
Date: Tue, Apr 22, 2008 at 9:20 AM
Subject: [O|B|F Helpdesk #507] Fwd: [Biojava-l] mailing list archives
To: markjschreiber at gmail.com
Cc: chris at bioteam.net, heikki at sanbi.ac.za, hlapp at gmx.net, jason at bioperl.org

OBF Search engine URL has been moved to:

http://search.open-bio.org

I've updated the link in the BioJava wiki
(http://biojava.org/wiki/BioJava:MailingLists) with the new URL. Please
let the requestor know about the update.

Regards,
Mauricio.

From markjschreiber at gmail.com Tue Apr 22 02:40:35 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Tue, 22 Apr 2008 14:40:35 +0800
Subject: [Biojava-l] loading multiple records for same organism and peristance in BioSQL
In-Reply-To: <480D7B55.3070708@uni-tuebingen.de>
References: <4805E9B8.6000709@unity.ncsu.edu> <93b45ca50804211827i42924c9rb1d5d2f85c25a1ca@mail.gmail.com> <480D7B55.3070708@uni-tuebingen.de>
Message-ID: <93b45ca50804212340s708df36ek73cdf3337b957316@mail.gmail.com>

Hi -

Can someone update the docs on biojava.org to reflect this requirement?

Thanks,

- Mark

On Tue, Apr 22, 2008 at 1:44 PM, Andreas Dräger wrote:
> Hi Doug,
>
> We also had the same problem. The solution is simple. You always construct a
> new SimpleNamespace, each time with the same name. Your code will work if
> you do one of the following:
> 1. You can load the namespace with the name from the database and set this
> namespace to the parser.
> 2. You can use the default namespace from the RichObjectFactory or
> 3. Just use the parser method, which does not require any namespaces - this
> method actually uses the default namespace (so three is actually equal to
> two).
> This should help.
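
To make the cause concrete, here is a minimal sketch of why a freshly constructed namespace collides on the second file while a looked-up one does not. It is a toy model only, using no BioJava or Hibernate classes: the names NamespaceLookupSketch, Ns and NsTable are hypothetical, and the map merely stands in for a table with a unique constraint on the namespace name. In BioJavaX itself, options 1 and 2 above should, as far as I recall from the BioJavaX docs, correspond to RichObjectFactory.getObject(SimpleNamespace.class, new Object[]{"genbankBiosqlRich"}) and RichObjectFactory.getDefaultNamespace(), with the result passed to readGenbankDNA in place of new SimpleNamespace(...); please check that against the documentation before relying on it.

```java
import java.util.HashMap;
import java.util.Map;

public class NamespaceLookupSketch {

    /** Toy stand-in for a namespace row; NOT BioJava's SimpleNamespace. */
    static final class Ns {
        final String name;
        Integer id; // stays null until the object has been "saved"

        Ns(String name) {
            this.name = name;
        }
    }

    /** Toy table with a unique constraint on the name column. */
    static final class NsTable {
        private final Map<String, Ns> rowsByName = new HashMap<String, Ns>();
        private int nextId = 1;

        /** Mimics a save: an object without an id is treated as new and inserted. */
        void saveOrUpdate(Ns ns) {
            if (ns.id != null) {
                return; // already stored, nothing to insert
            }
            if (rowsByName.containsKey(ns.name)) {
                throw new IllegalStateException("Duplicate entry '" + ns.name + "'");
            }
            ns.id = Integer.valueOf(nextId++);
            rowsByName.put(ns.name, ns);
        }

        /** Get-or-create lookup: reuse the stored row if the name is already known. */
        Ns getOrCreate(String name) {
            Ns existing = rowsByName.get(name);
            if (existing != null) {
                return existing;
            }
            Ns created = new Ns(name);
            saveOrUpdate(created);
            return created;
        }
    }

    public static void main(String[] args) {
        NsTable table = new NsTable();

        // First file: a fresh object is inserted without trouble.
        table.saveOrUpdate(new Ns("genbankBiosqlRich"));

        // Second file: another fresh object with the same name collides.
        try {
            table.saveOrUpdate(new Ns("genbankBiosqlRich"));
        } catch (IllegalStateException e) {
            System.out.println("second insert failed: " + e.getMessage());
        }

        // Looking the namespace up first returns the stored object instead.
        Ns reused = table.getOrCreate("genbankBiosqlRich");
        table.saveOrUpdate(reused); // no insert, no error
        System.out.println("reused id: " + reused.id);
    }
}
```

The point of the sketch is only the difference between the two paths: constructing a new object per file always looks unsaved, whereas a lookup hands back the object that already carries its database identity.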
>
> Cheers
> Andreas
>

From debrown at unity.ncsu.edu Tue Apr 22 08:22:16 2008
From: debrown at unity.ncsu.edu (Doug Brown)
Date: Tue, 22 Apr 2008 08:22:16 -0400
Subject: [Biojava-l] mailing list archives
In-Reply-To: <480C8DE8.5040805@molgen.mpg.de>
References: <480C8DE8.5040805@molgen.mpg.de>
Message-ID: <480DD878.2000100@unity.ncsu.edu>

Hi Felix,

In addition to the http://search.open-bio.org link mentioned by Mauricio
Herrera Cuadra, you could use Google directly with search expressions
similar to:

"[biojava-l]" site:portal.open-bio.org
"[biojava-dev]" site:portal.open-bio.org

Of course, you need to add any additional search terms to limit the
results. In general see: http://www.google.com/advanced_search

Regards,
Doug

Felix Dreher wrote:
> Hello all,
>
> is there a possibility to query the biojava mailing-list archives?
> (the link provided on the biojava-homepage doesn't work:
> http://search.open-bio.org/cgi-bin/mail-search.cgi)
>
> Best regards,
> Felix
>
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

--
Doug Brown - Bioinformatics
Fungal Genomics Laboratory
Center for Integrated Fungal Research
North Carolina State University
Campus Box 7251, Raleigh, NC 27695-7251
https://www.fungalgenomics.ncsu.edu/~debrown/
Tel: (919) 513-0394, Fax (919) 513-0024
e-mail: doug_brown at ncsu.edu

From markjschreiber at gmail.com Wed Apr 23 03:22:14 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 23 Apr 2008 15:22:14 +0800
Subject: [Biojava-l] Suspicious Headers
Message-ID: <93b45ca50804230022j223ed4eetf219e21d35cd51dc@mail.gmail.com>

Hi -

If you have ever tried posting to the list and had your email bounced
back with a message complaining about a suspicious header, chances are
your email has been bounced by our spam filter.

There are generally 2 reasons why this might happen:

1. Your email is HTML; mail to the list must be text only.
2. Your email has an attachment, possibly a .vcf file or an image in
   your email signature (company logo or similar).

If you keep it plain text it should get through.

- Mark

From mail at florianschatz.de Wed Apr 23 09:49:05 2008
From: mail at florianschatz.de (Florian Schatz)
Date: Wed, 23 Apr 2008 15:49:05 +0200
Subject: [Biojava-l] Extract non-gene regions
Message-ID:

Hello,

I am new to biojava and worked a lot with it in the last few weeks. I
hope this is the right place for questions, if not please tell me.

I want to get the nucleotide sequence outside the genes of a GenBank
file. So everything that is not marked by a 'gene' feature.
Unfortunately, there is no subtract or exclude function for the Location
class. Any hints?

Btw: union() of location worked fine for extracting nucleotides of the
genes only.

Best,
Florian

From markjschreiber at gmail.com Wed Apr 23 22:29:12 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Thu, 24 Apr 2008 10:29:12 +0800
Subject: [Biojava-l] Extract non-gene regions
In-Reply-To:
References:
Message-ID: <93b45ca50804231929h7a7c31aaudccbc5325d6be5f3@mail.gmail.com>

Hi Florian -

There are at least two approaches. You are on the right track with
making a union of all gene locations. The compound location that
results from the Union will contain all the nucleotides that are
coding. You can then iterate through each nucleotide in the genome and
find out if the union contains the nucleotide. If it doesn't then it
is non coding. This is surprisingly rapid as the comparisons are
simple. The pseudo code would be something like...

//initialize this by making a union of all locations of CDS or Gene Features.
RichLocation coding;

// read from file or database
RichSequence genome;

//you might need to be a bit more sophisticated for a circular genome
for(int i = 1; i <= genome.length(); i++){
    if( ! genome.contains(i){
        //you have a non-coding nucleotide.
    }
}

The other approach is to use the blockIterator() method of the
compound location that results from the union of coding sequences.
This will output each contiguous chunk of coding sequence. If you know
the length of the sequence then you can rapidly figure out the
intervening pieces.

For example, if the block iterator tells you that [10..50], [90..100],
[350..380] are coding and you know the genome is of length 400 then
you can quickly derive [1..9], [51..89], [101..349] and [381..400] are
non-coding. Again it is more complicated for circular sequences and
more complex if you consider the opposite strand of a gene (the gene
shadow) to be non-coding. Unfortunately there is no convenience method
to do this but if you code something up it would be great to put it in
the cookbook so others can re-use it.

- Mark

You could actually make point locations of all the non-coding
nucleotides and then merge the whole lot at the end into a compound
location of non-coding

On Wed, Apr 23, 2008 at 9:49 PM, Florian Schatz wrote:
> Hello,
>
> I am new to biojava and worked a lot with it in the last few weeks. I hope
> this is the right place for questions, if not please tell me.
>
> I want to get the nucleotide sequence outside the genes of a GenBank file.
> So everything that is not marked by a 'gene' feature. Unfortunately, there
> is no subtract or exclude function for the Location class. Any hints?
>
> Btw: union() of location worked fine for extracting nucleotides of the genes
> only.
>
> Best,
> Florian
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From heuermh at acm.org Thu Apr 24 00:09:46 2008
From: heuermh at acm.org (Michael Heuer)
Date: Thu, 24 Apr 2008 00:09:46 -0400 (EDT)
Subject: [Biojava-l] Extract non-gene regions
In-Reply-To: <93b45ca50804231929h7a7c31aaudccbc5325d6be5f3@mail.gmail.com>
Message-ID:

On Thu, 24 Apr 2008, Mark Schreiber wrote:

> Hi Florian -
>
> There are at least two approaches. You are on the right track with
> making a union of all gene locations. The compound location that
> results from the Union will contain all the nucleotides that are
> coding. You can then iterate through each nucleotide in the genome and
> find out if the union contains the nucleotide. If it doesn't then it
> is non coding. This is surprisingly rapid as the comparisons are
> simple. The pseudo code would be something like...
>
> RichLocation coding; //initialize this by making a union of all
> locations of CDS or Gene Features.
>
> RichSequence genome; // read from file or database
>
> for(int i = 1; i <= genome.length(); i++){ //you might need to be a
> bit more sophisticated for a circular genome
>    if( ! genome.contains(i){
>        //you have a non-coding nucleotide.
>    }
> }

typo?

if (!coding.contains(i)) {
    // you have a non-coding nucleotide.
}

> The other approach is to use the blockIterator() method of the
> compound location that results from the union of coding sequences.
> This will output each contiguous chunk of coding sequence. If you know
> the length of the sequence then you can rapidly figure out the
> intervening pieces.
>
> For example, if the block iterator tells you that [10..50], [90..100],
> [350..380] are coding and you know the genome is of length 400 then
> you can quickly derive [1..9], [51..89], [101..349] and [381..400] are
> non-coding.
> Again it is more complicated for circular sequences and
> more complex if you consider the opposite strand of a gene (the gene
> shadow) to be non-coding. Unfortunately there is no convenience method
> to do this but if you code something up it would be great to put it in
> the cookbook so others can re-use it.
>
> - Mark
>
> You could actually make point locations of all the non-coding
> nucleotides and then merge the whole lot at the end into a compound
> location of non-coding
>
> On Wed, Apr 23, 2008 at 9:49 PM, Florian Schatz wrote:
> > Hello,
> >
> > I am new to biojava and worked a lot with it in the last few weeks. I hope
> > this is the right place for questions, if not please tell me.
> >
> > I want to get the nucleotide sequence outside the genes of a GenBank file.
> > So everything that is not marked by a 'gene' feature. Unfortunately, there
> > is no subtract or exclude function for the Location class. Any hints?
> >
> > Btw: union() of location worked fine for extracting nucleotides of the genes
> > only.
> >
> > Best,
> > Florian
> > _______________________________________________
> > Biojava-l mailing list - Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From andreas.draeger at uni-tuebingen.de Thu Apr 24 03:26:31 2008
From: andreas.draeger at uni-tuebingen.de (Andreas Dräger)
Date: Thu, 24 Apr 2008 09:26:31 +0200
Subject: [Biojava-l] Problem while parsing GenBank-like files and persiting them using Hibernate
Message-ID: <48103627.90505@uni-tuebingen.de>

Dear all,

Recently I downloaded some GenBank-like files from the Ensembl web site
(http://www.ensembl.org/index.html) and noticed that the format used
on this site diverges slightly from what one gets from NCBI. In
particular, the ACCESSION number is not valid according to the pattern
matcher in class org.biojavax.bio.seq.io.GenbankFormat, so the files
cannot be parsed using RichSequence.IOTools. This issue has been
discussed on this list before, but the solution then was to use files
from NCBI instead of those from Ensembl.

However, the files from Ensembl are important because they contain
additional annotation not provided by NCBI, for instance the feature
"exon". The old parsers from the biojava.seq.io package are able to
read the files from this site. The Sequence objects can be enriched
afterwards and written to another GenBank file. However, this again
results in a file that cannot be stored in a BioSQL database using
Hibernate, because of the invalid accession number.

The next problem is that even the old parsers do not treat this "rich"
information from the Ensembl files properly. The feature "exon" becomes
"any" when the sequence is enriched and written to a new GenBank file.
Hence the benefit of the Ensembl annotation is lost during parsing and
conversion. By the way, Ensembl also offers Embl-like files and other
formats, which have the same problems as mentioned above.

On the other hand, no matter which parser in BioJavaX I look up in the
API documentation, I can always find a corresponding "Term" class, which
states that this class "Implements some ...-specific terms", where the
dots stand for the format in question, such as UniProt, GenBank, Embl
and so forth. None of these Term classes provides any setters or add
methods that would allow defining a new term like "exon". The structure
of the parsers seems very sophisticated to me, and it is not easy to
extend the parsers or term classes for one's own purposes.

Therefore, I would like to ask the following questions:

1. Is there a way to read in files downloaded from Ensembl using only
   the designated BioJavaX classes?
2. How can I extend the terms so that not only "SOME X-specific terms"
   are included, but some more? And how do I tell the parser to use and
   apply these terms? Or more generally, can I somehow read in an
   ontology (for instance the GO), persist it in BioSQL and make use of
   the terms contained therein?
3. How can I persist a sequence from Ensembl within a BioSQL database
   using Hibernate even though they use different accession numbers?

I am grateful for any answers.

Cheers
Andreas

From mail at florianschatz.de Thu Apr 24 08:09:24 2008
From: mail at florianschatz.de (Florian Schatz)
Date: Thu, 24 Apr 2008 14:09:24 +0200
Subject: [Biojava-l] Extract non-gene regions
In-Reply-To: <93b45ca50804231929h7a7c31aaudccbc5325d6be5f3@mail.gmail.com>
References: <93b45ca50804231929h7a7c31aaudccbc5325d6be5f3@mail.gmail.com>
Message-ID:

Hello,

I tried that, but it is as slow as a version operating on Strings.
However, I created a Cookbook entry:
http://biojava.org/wiki/BioJava:Cookbook:Sequence:ExtractGeneRegions

Is there a better way to get a Sequence from a SymbolList than:

Sequence newsequence = DNATools.createDNASequence(symbolL.seqString(), "New Sequence");

Best,
Florian

On 24.04.2008 at 04:29, Mark Schreiber wrote:

> Hi Florian -
>
> There are at least two approaches. You are on the right track with
> making a union of all gene locations. The compound location that
> results from the Union will contain all the nucleotides that are
> coding. You can then iterate through each nucleotide in the genome and
> find out if the union contains the nucleotide. If it doesn't then it
> is non coding. This is surprisingly rapid as the comparisons are
> simple. The pseudo code would be something like...
>
> RichLocation coding; //initialize this by making a union of all
> locations of CDS or Gene Features.
>
> RichSequence genome; // read from file or database
>
> for(int i = 1; i <= genome.length(); i++){ //you might need to be a
> bit more sophisticated for a circular genome
>    if( !
>        genome.contains(i){
>        //you have a non-coding nucleotide.
>    }
> }
>
> The other approach is to use the blockIterator() method of the
> compound location that results from the union of coding sequences.
> This will output each contiguous chunk of coding sequence. If you know
> the length of the sequence then you can rapidly figure out the
> intervening pieces.
>
> For example, if the block iterator tells you that [10..50], [90..100],
> [350..380] are coding and you know the genome is of length 400 then
> you can quickly derive [1..9], [51..89], [101..349] and [381..400] are
> non-coding. Again it is more complicated for circular sequences and
> more complex if you consider the opposite strand of a gene (the gene
> shadow) to be non-coding. Unfortunately there is no convenience method
> to do this but if you code something up it would be great to put it in
> the cookbook so others can re-use it.
>
> - Mark
>
> You could actually make point locations of all the non-coding
> nucleotides and then merge the whole lot at the end into a compound
> location of non-coding
>
> On Wed, Apr 23, 2008 at 9:49 PM, Florian Schatz wrote:
>> Hello,
>>
>> I am new to biojava and worked a lot with it in the last few weeks.
>> I hope this is the right place for questions, if not please tell me.
>>
>> I want to get the nucleotide sequence outside the genes of a
>> GenBank file. So everything that is not marked by a 'gene' feature.
>> Unfortunately, there is no subtract or exclude function for the
>> Location class. Any hints?
>>
>> Btw: union() of location worked fine for extracting nucleotides of
>> the genes only.
>>
>> Best,
>> Florian
>> _______________________________________________
>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>

From markjschreiber at gmail.com Thu Apr 24 08:47:59 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Thu, 24 Apr 2008 20:47:59 +0800
Subject: [Biojava-l] Extract non-gene regions
In-Reply-To:
References: <93b45ca50804231929h7a7c31aaudccbc5325d6be5f3@mail.gmail.com>
Message-ID: <93b45ca50804240547gefb750fh493a01a8d0edbe22@mail.gmail.com>

Hi -

While Sequences and SymbolLists offer many advantages over Strings or
character arrays, speed is not one of them.

You can create a Sequence using the SequenceFactory implementations,
which are much more efficient than converting to Strings and back to
symbols again, which is a very expensive operation. From memory,
SimpleRichSequence may even have a constructor that takes a SymbolList
and a name. There should be no need to convert to a String and back.
Also, do you need a Sequence when a SymbolList may contain all the
information you need?

Finally, the Edit operations you use in your wiki example will cause
quite a big performance hit; your comment seems to allude to this. It
would be better to collect all the non-coding points (i), compile them
into a compound location, and then extract the SymbolList for that
location all in one go.

- Mark

On Thu, Apr 24, 2008 at 8:09 PM, Florian Schatz wrote:
> Hello,
>
> I tried that, but it is as slow as a version operating on Strings.
> However, I created a Cookbook entry:
> http://biojava.org/wiki/BioJava:Cookbook:Sequence:ExtractGeneRegions
>
> Is there a better way to get a Sequence from a SymbolList than:
>
> Sequence newsequence = DNATools.createDNASequence(symbolL.seqString(), "New Sequence");
>
> Best,
> Florian
>
> On 24.04.2008 at 04:29, Mark Schreiber wrote:
>
> > Hi Florian -
> >
> > There are at least two approaches.
> > You are on the right track with
> > making a union of all gene locations. The compound location that
> > results from the Union will contain all the nucleotides that are
> > coding. You can then iterate through each nucleotide in the genome and
> > find out if the union contains the nucleotide. If it doesn't then it
> > is non coding. This is surprisingly rapid as the comparisons are
> > simple. The pseudo code would be something like...
> >
> > RichLocation coding; //initialize this by making a union of all
> > locations of CDS or Gene Features.
> >
> > RichSequence genome; // read from file or database
> >
> > for(int i = 1; i <= genome.length(); i++){ //you might need to be a
> > bit more sophisticated for a circular genome
> >    if( ! genome.contains(i){
> >        //you have a non-coding nucleotide.
> >    }
> > }
> >
> > The other approach is to use the blockIterator() method of the
> > compound location that results from the union of coding sequences.
> > This will output each contiguous chunk of coding sequence. If you know
> > the length of the sequence then you can rapidly figure out the
> > intervening pieces.
> >
> > For example, if the block iterator tells you that [10..50], [90..100],
> > [350..380] are coding and you know the genome is of length 400 then
> > you can quickly derive [1..9], [51..89], [101..349] and [381..400] are
> > non-coding. Again it is more complicated for circular sequences and
> > more complex if you consider the opposite strand of a gene (the gene
> > shadow) to be non-coding. Unfortunately there is no convenience method
> > to do this but if you code something up it would be great to put it in
> > the cookbook so others can re-use it.
> >
> > - Mark
> >
> > You could actually make point locations of all the non-coding
> > nucleotides and then merge the whole lot at the end into a compound
> > location of non-coding
> >
> > On Wed, Apr 23, 2008 at 9:49 PM, Florian Schatz wrote:
> >
> > > Hello,
> > >
> > > I am new to biojava and worked a lot with it in the last few weeks. I hope
> > > this is the right place for questions, if not please tell me.
> > >
> > > I want to get the nucleotide sequence outside the genes of a GenBank file.
> > > So everything that is not marked by a 'gene' feature. Unfortunately, there
> > > is no subtract or exclude function for the Location class. Any hints?
> > >
> > > Btw: union() of location worked fine for extracting nucleotides of the genes
> > > only.
> > >
> > > Best,
> > > Florian
> > > _______________________________________________
> > > Biojava-l mailing list - Biojava-l at lists.open-bio.org
> > > http://lists.open-bio.org/mailman/listinfo/biojava-l
> > >
> >
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From hlapp at gmx.net Thu Apr 24 18:25:29 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 24 Apr 2008 18:25:29 -0400
Subject: [Biojava-l] Problem while parsing GenBank-like files and persiting them using Hibernate
In-Reply-To: <48103627.90505@uni-tuebingen.de>
References: <48103627.90505@uni-tuebingen.de>
Message-ID: <59C8D987-F6F8-4E83-B70D-127D38DCC0C9@gmx.net>

Hi Andreas,

On Apr 24, 2008, at 3:26 AM, Andreas Dräger wrote:
> Or more generally, can I somehow read in an ontology (for instance
> the GO), persist it in BioSQL and make use of the terms contained
> therein?

In principle, you absolutely can. As for whether Biojava lets you do
this, I do not know, but I would suppose yes. (BioPerl has a script
load_ontology.pl in its Bioperl-db package that does this.)
Your other questions all seem Biojava-specific, so I'll leave them to the list and Mark & Richard. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From andreas.draeger at uni-tuebingen.de Fri Apr 18 03:13:52 2008 From: andreas.draeger at uni-tuebingen.de (=?ISO-8859-1?Q?Andreas_Dr=E4ger?=) Date: Fri, 18 Apr 2008 07:13:52 -0000 Subject: [Biojava-l] Problem while parsing GenBank-like files and persiting them using Hibernate Message-ID: <480844DB.6070808@uni-tuebingen.de> Dear all, Recently I downloaded some GenBank-like files from the Ensembl web site (http://www.ensembl.org/index.html) and recognized that the format used on this site slightly diverges from what one gets from NCBI. Especially the ACCESSION number is not valid according to the pattern matcher in class org.biojavax.bio.seq.io.GenbankFormat and the files can thus not be parsed using the RichSequence.IOTools. This issue has already been discussed in this list before, but the solution was not to use files from Ensemble, but those from NCBI instead. However, the reason why the files from Ensembl are so important, is that they contain additional annotation, not provided by NCBI. For instance the feature "exon". The old parsers from the biojava.seq.io package are able to read in the files from this site. The Sequence objects can be enriched afterwards and be written to another genbank file. However, this again results in a file, which cannot be stored in a BioSQL database using Hibernate caused by the invalid accession number. The next problem is that even the old parsers do not treat this "rich" information from the Ensembl files properly. The feature "exon" becomes "any" when the sequence is enriched and written to a new GenBank file. Hence the benefit from the Ensembl annotation gets lost during paring and conversion. 
By the way, Ensembl also offers to write Embl-like files or other formats with the same problems as mentioned above. On the other hand, no matter which parser in BioJavaX I look up within the API documentation, I can always find a corresponding "Term" class, which states that this class "Implements some ...-specific terms", where the dots stand for the considered format like UniProt, GenBank, Embl and so forth. None of these Term classes provides any setters or add-methods, which would allow to define a new term like "exon". The structure of the parsers seems to me to be very sophisticated and it is not very easy to extend the parsers or term classes for own purposes. Therefore, I would like to ask the following questions: 1. Is there a way to read in files downloaded from Ensembl using only the designated BioJavaX classes? 2. How can I extend the terms so that not only "SOME X-specific terms" are included, but some more? And how do I tell the parser to use and apply these terms? Or more generally, can I somehow read in an ontology (for instance the GO), persist it in BioSQL and make use of the terms contained therein? 3. How can I persist a sequence from Ensembl within a BioSQL database using Hibernate even though they use different accession numbers? I am grateful for any answers. Cheers Andreas -------------- next part -------------- A non-text attachment was scrubbed... 
Name: andreas.draeger.vcf Type: text/x-vcard Size: 509 bytes Desc: not available URL: From andreas.draeger at uni-tuebingen.de Tue Apr 22 01:45:02 2008 From: andreas.draeger at uni-tuebingen.de (=?ISO-8859-1?Q?Andreas_Dr=E4ger?=) Date: Tue, 22 Apr 2008 05:45:02 -0000 Subject: [Biojava-l] loading multiple records for same organism and peristance in BioSQL In-Reply-To: <93b45ca50804211827i42924c9rb1d5d2f85c25a1ca@mail.gmail.com> References: <4805E9B8.6000709@unity.ncsu.edu> <93b45ca50804211827i42924c9rb1d5d2f85c25a1ca@mail.gmail.com> Message-ID: <480D7B55.3070708@uni-tuebingen.de> Hi Doug, We also had the same problem. The solution is simple. You always construct a new SimpleNamespace, each time with the same name. Your code will work if you do one of the following: 1. You can load the namespace with the name from the database and set this namespace to the parser. 2. You can use the default namespace from the RichObjectFactory or 3. Just use the parser method, which does not require any namespaces - this method actually uses the default namespace (so three is actually equal to two). This should help. Cheers Andreas -------------- next part -------------- A non-text attachment was scrubbed... Name: andreas.draeger.vcf Type: text/x-vcard Size: 509 bytes Desc: not available URL: From andreas.draeger at uni-tuebingen.de Tue Apr 22 02:43:29 2008 From: andreas.draeger at uni-tuebingen.de (=?ISO-8859-1?Q?Andreas_Dr=E4ger?=) Date: Tue, 22 Apr 2008 06:43:29 -0000 Subject: [Biojava-l] Problem while parsing GenBank-like files and persiting them using Hibernate Message-ID: <480D859F.6080003@uni-tuebingen.de> Dear all, Recently I downloaded some GenBank-like files from the Ensembl web site (http://www.ensembl.org/index.html) and recognized that the format used on this site slightly diverges from what one gets from NCBI. 
Especially the ACCESSION number is not valid according to the pattern matcher in class org.biojavax.bio.seq.io.GenbankFormat and the files can thus not be parsed using the RichSequence.IOTools. This issue has already been discussed in this list before, but the solution was not to use files from Ensemble, but those from NCBI instead. However, the reason why the files from Ensembl are so important, is that they contain additional annotation, not provided by NCBI. For instance the feature "exon". The old parsers from the biojava.seq.io package are able to read in the files from this site. The Sequence objects can be enriched afterwards and be written to another genbank file. However, this again results in a file, which cannot be stored in a BioSQL database using Hibernate caused by the invalid accession number. The next problem is that even the old parsers do not treat this "rich" information from the Ensembl files properly. The feature "exon" becomes "any" when the sequence is enriched and written to a new GenBank file. Hence the benefit from the Ensembl annotation gets lost during paring and conversion. By the way, Ensembl also offers to write Embl-like files or other formats with the same problems as mentioned above. On the other hand, no matter which parser in BioJavaX I look up within the API documentation, I can always find a corresponding "Term" class, which states that this class "Implements some ...-specific terms", where the dots stand for the considered format like UniProt, GenBank, Embl and so forth. None of these Term classes provides any setters or add-methods, which would allow to define a new term like "exon". The structure of the parsers seems to me to be very sophisticated and it is not very easy to extend the parsers or term classes for own purposes. Therefore, I would like to ask the following questions: 1. Is there a way to read in files downloaded from Ensembl using only the designated BioJavaX classes? 2. 
How can I extend the terms so that not only "SOME X-specific terms" are included, but some more? And how do I tell the parser to use and apply these terms? Or more generally, can I somehow read in an ontology (for instance the GO), persist it in BioSQL and make use of the terms contained therein? 3. How can I persist a sequence from Ensembl within a BioSQL database using Hibernate even though they use different accession numbers? I am grateful for any answers. Cheers Andreas -------------- next part -------------- A non-text attachment was scrubbed... Name: andreas.draeger.vcf Type: text/x-vcard Size: 509 bytes Desc: not available URL: From budhaditya21 at yahoo.co.in Tue Apr 29 01:29:23 2008 From: budhaditya21 at yahoo.co.in (arunabha banerjee) Date: Tue, 29 Apr 2008 05:29:23 -0000 Subject: [Biojava-l] problems installing biojava on Windows XP professional Message-ID: <406135.22320.qm@web94608.mail.in2.yahoo.com> An HTML attachment was scrubbed... URL: From ap3 at sanger.ac.uk Sun Apr 13 18:02:41 2008 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Sun, 13 Apr 2008 19:02:41 +0100 Subject: [Biojava-l] biojava 1.6 released Message-ID: <0A060667-C24C-4D41-8D10-ED1D449A5F62@sanger.ac.uk> Biojava 1.6 has been released and is available from http:// biojava.org/wiki/BioJava:Download Biojava 1.6 offers more functionality and stability over the previous official releases. BioJava now depends on Java 1.5+. We highly recommend you to upgrade as soon as possible. In detail, the phylo package org.biojavax.bio.phylo was improved and expanded by our GSOC'07 student Boh-Yun Lee. It now contains fully- functional Nexus and Phylip parsers, and tools for calculating UPGMA and Neighbour Joining, Jukes-Kantor and Kimura Two Parameter, and MP. It uses JGraphT to represent parsed trees. The PDB file parser was improved by Jules Jacobsen for better dealing with PDB header records. Andreas Draeger provided several patches for improving the Genetic Algorithm modules. 
Additionally this release contains numerous bug fixes and documentation improvements. Thanks to the entire biojava community for making this possible! Happy Biojava-ing, Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 ----------------------------------------------------------------------- -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From debrown at unity.ncsu.edu Wed Apr 16 11:57:44 2008 From: debrown at unity.ncsu.edu (Doug Brown) Date: Wed, 16 Apr 2008 07:57:44 -0400 Subject: [Biojava-l] loading multiple records for same organism and peristance in BioSQL Message-ID: <4805E9B8.6000709@unity.ncsu.edu> Greetings, I am happily climbing the learning curve for Biojava-live, Biojavax, and BioSQL. I believe that I am using the latest releases, Biojava 1.6 and BioSQL 1.0, in that I have performed the installation within the past week. I am attempting to load, via Biojavax, multiple genbank records for the same organism (a whole genome's worth of annotations) and to save those into a BioSQL database via Biojavax's Hibernate persistence mechanism. Loading a second genbank file (same organism, different sequence) croaks with the error: SEVERE: Duplicate entry 'genbankBiosqlRich' for key 2 ...... could not insert: [Namespace]. FYI two sample genbank records are CH476760.gb and CH476761.gb and were obtained directly from genbank. Never having used Hibernate before nor its type of database abstraction, I think that I am properly handling the transaction semantics. Either I am violating unspoken presumptions of the persistence paradigm or the behavior of RichSequence.IOTools.readGenbankDNA is not what I expected. 
I had presumed that the above routine would use the established RichObjectFactory to obtain new or extant objects and then populate those objects with values from the sequence file. This only seems to happen when I load multiple sequences from a single file. Multi file operations fail dismally. What is the proper way of using Biojava to load up a database with records? In advance, thank you all for the traffic on this list, it has been quite helpful in bringing me up to speed. Regards, Doug Brown Here is the relevant [hacked] subroutine: /** * This works for genbank files containing multiple sequences. * Originaly concept from: http://portal.open-bio.org/pipermail/biojava-l/2007-April/005824.html * It fails on inserting existant record(s) - does not replace... * This causes grief when loading multiple files... */ public void loadNSave( Session session, File fileName) { boolean localSession = (session == null); Transaction tx = null; try { System.out.println( "*********** Loading "+fileName+"..."); BufferedReader br = new BufferedReader( new FileReader( fileName) ); if ( session == null) // create a local session { session = sessionFactory.openSession(); RichObjectFactory.connectToBioSQL(session); } // load the objects. I expect this to use the established factory. RichSequenceIterator rsi = RichSequence.IOTools.readGenbankDNA( br, new SimpleNamespace( "genbankBiosqlRich") ); while ( rsi.hasNext() ) tx = session.beginTransaction(); // Hibernate requires transactions. System.out.println( "*********** Loading next sequence..."); // ??should automatically fetch existing objects from the database... RichSequence sequence = rsi.nextRichSequence(); System.out.println( "loaded sequence "+sequence.getAccession()+", identifier: "+ sequence.getIdentifier()); try { System.out.println( "*********** saving..."); // synchronize in-memory representation w/ the database // HUGE amounts of time spent doing selects on keys - really slows things down!! 
                session.saveOrUpdate( "Sequence", sequence );
                tx.commit(); // save to database - does an automatic flush
                // batch operations overwhelm the cache - clear it out!
                session.flush(); // force in-memory to disk.
                session.clear(); // clean out cache.
            }
            catch (HibernateException ex)
            {
                tx.rollback(); // discard the sequence and all its annotations
                ex.printStackTrace();
            }
        }
    }
    catch (FileNotFoundException ex)
    {
        ex.printStackTrace();
    }
    catch ( BioException bex)
    {
        bex.printStackTrace();
    }
    finally
    {
        if ( localSession)
        {
            session.flush(); // force in-memory to disk.
            session.close(); // only for local sessions
        }
    }
}

and the following is a sample stack dump:

org.hibernate.exception.ConstraintViolationException: could not insert: [Namespace]
    at org.hibernate.exception.SQLStateConverter.convert(SQLStateConverter.java:71)
    at org.hibernate.exception.JDBCExceptionHelper.convert(JDBCExceptionHelper.java:43)
    at org.hibernate.id.insert.AbstractReturningDelegate.performInsert(AbstractReturningDelegate.java:40)
    at org.hibernate.persister.entity.AbstractEntityPersister.insert(AbstractEntityPersister.java:2163)
    at org.hibernate.persister.entity.AbstractEntityPersister.insert(AbstractEntityPersister.java:2643)
    at org.hibernate.action.EntityIdentityInsertAction.execute(EntityIdentityInsertAction.java:51)
    at org.hibernate.engine.ActionQueue.execute(ActionQueue.java:279)
    at org.hibernate.event.def.AbstractSaveEventListener.performSaveOrReplicate(AbstractSaveEventListener.java:298)
    at org.hibernate.event.def.AbstractSaveEventListener.performSave(AbstractSaveEventListener.java:181)
    at org.hibernate.event.def.AbstractSaveEventListener.saveWithGeneratedId(AbstractSaveEventListener.java:107)
    at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.saveWithGeneratedOrRequestedId(DefaultSaveOrUpdateEventListener.java:187)
    at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.entityIsTransient(DefaultSaveOrUpdateEventListener.java:172)
    at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.performSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:94)
    at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.onSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:70)
    at org.hibernate.impl.SessionImpl.fireSaveOrUpdate(SessionImpl.java:507)
    at org.hibernate.impl.SessionImpl.saveOrUpdate(SessionImpl.java:499)
    at org.hibernate.engine.CascadingAction$5.cascade(CascadingAction.java:218)
    at org.hibernate.engine.Cascade.cascadeToOne(Cascade.java:268)
    at org.hibernate.engine.Cascade.cascadeAssociation(Cascade.java:216)
    at org.hibernate.engine.Cascade.cascadeProperty(Cascade.java:169)
    at org.hibernate.engine.Cascade.cascade(Cascade.java:130)
    at org.hibernate.event.def.AbstractSaveEventListener.cascadeBeforeSave(AbstractSaveEventListener.java:431)
    at org.hibernate.event.def.AbstractSaveEventListener.performSaveOrReplicate(AbstractSaveEventListener.java:265)
    at org.hibernate.event.def.AbstractSaveEventListener.performSave(AbstractSaveEventListener.java:181)
    at org.hibernate.event.def.AbstractSaveEventListener.saveWithGeneratedId(AbstractSaveEventListener.java:107)
    at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.saveWithGeneratedOrRequestedId(DefaultSaveOrUpdateEventListener.java:187)
    at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.entityIsTransient(DefaultSaveOrUpdateEventListener.java:172)
    at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.performSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:94)
    at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.onSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:70)
    at org.hibernate.impl.SessionImpl.fireSaveOrUpdate(SessionImpl.java:507)
    at org.hibernate.impl.SessionImpl.saveOrUpdate(SessionImpl.java:499)
    at bioinformatics.biojava.BriefLoader.loadNSave(BriefLoader.java:108)
    at bioinformatics.biojava.BriefLoader.main(BriefLoader.java:72)
Caused by: java.sql.SQLException: Duplicate entry 'genbankBiosqlRich' for key 2
    at
com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2975) at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1600) at com.mysql.jdbc.ServerPreparedStatement.serverExecute(ServerPreparedStatement.java:1125) at com.mysql.jdbc.ServerPreparedStatement.executeInternal(ServerPreparedStatement.java:677) at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1357) at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1274) at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1259) at org.hibernate.id.IdentityGenerator$GetGeneratedKeysDelegate.executeAndExtract(IdentityGenerator.java:73) at org.hibernate.id.insert.AbstractReturningDelegate.performInsert(AbstractReturningDelegate.java:33) -- Doug Brown - Bioinformatics Fungal Genomics Laboratory Center for Integrated Fungal Research North Carolina State University Campus Box 7251, Raleigh, NC 27695-7251 https://www.fungalgenomics.ncsu.edu/~debrown/ Tel: (919) 513-0394, Fax (919) 513-0024 e-mail: doug_brown at ncsu.edu
From dreher at molgen.mpg.de Mon Apr 21 12:51:52 2008 From: dreher at molgen.mpg.de (Felix Dreher) Date: Mon, 21 Apr 2008 14:51:52 +0200 Subject: [Biojava-l] mailing list archives Message-ID: <480C8DE8.5040805@molgen.mpg.de> Hello all, is there a possibility to query the biojava mailing-list archives? (the link provided on the biojava-homepage doesn't work: http://search.open-bio.org/cgi-bin/mail-search.cgi) Best regards, Felix From markjschreiber at gmail.com Tue Apr 22 01:27:51 2008 From: markjschreiber at gmail.com (Mark Schreiber) Date: Tue, 22 Apr 2008 09:27:51 +0800 Subject: [Biojava-l] loading multiple records for same organism and peristance in BioSQL In-Reply-To: <4805E9B8.6000709@unity.ncsu.edu> References: <4805E9B8.6000709@unity.ncsu.edu> Message-ID: <93b45ca50804211827i42924c9rb1d5d2f85c25a1ca@mail.gmail.com> Hi Doug - Has anyone provided a solution for this yet? I haven't used the Hibernate bindings to BioSQL (I'm actually working on a JPA binding with EntityBeans) but when I did they worked well. However, I have seen this type of error before. Clearly the entry 'genbankBiosqlRich' is being duplicated somewhere where it should be unique. This looks unusual because you call saveOrUpdate which should be able to figure this out unless the way Hibernate determines equality is not the same as the way BioJava does. What happens if you try to save them one at a time (in sequential runs of your program)? From the look of the stack trace you might see the same error.
Also, it might pay to look at the biojavax documentation on http://biojava.org/wiki/BioJava:BioJavaXDocs#BioSQL_and_Hibernate. Any Hibernate experts able to offer an opinion here?? - Mark On Wed, Apr 16, 2008 at 7:57 PM, Doug Brown wrote: > Greetings, > I am happily climbing the learning curve for Biojava-live, Biojavax, > and BioSQL. I believe that I am using the latest releases, Biojava 1.6 > and BioSQL 1.0, in that I have performed the installation within the > past week. > > I am attempting to load, via Biojavax, multiple genbank records for the > same organism (a whole genome's worth of annotations) and to save those > into a BioSQL database via Biojavax's Hibernate persistence mechanism. > Loading a second genbank file (same organism, different sequence) croaks > with the error: SEVERE: Duplicate entry 'genbankBiosqlRich' for key 2 > ...... could not insert: [Namespace]. FYI two sample genbank records > are CH476760.gb and CH476761.gb and were obtained directly from genbank. > > Never having used Hibernate before nor its type of database abstraction, > I think that I am properly handling the transaction semantics. Either I > am violating unspoken presumptions of the persistence paradigm or the > behavior of RichSequence.IOTools.readGenbankDNA is not what I expected. > I had presumed that the above routine would use the established > RichObjectFactory to obtain new or extant objects and then populate > those objects with values from the sequence file. This only seems to > happen when I load multiple sequences from a single file. Multi file > operations fail dismally. > > What is the proper way of using Biojava to load up a database with records? > > In advance, thank you all for the traffic on this list, it has been > quite helpful in bringing me up to speed. > > Regards, > Doug Brown > > Here is the relevant [hacked] subroutine: > /** > * This works for genbank files containing multiple sequences. 
> * Originaly concept from: > http://portal.open-bio.org/pipermail/biojava-l/2007-April/005824.html > * It fails on inserting existant record(s) - does not replace... > * This causes grief when loading multiple files... > */ > public void loadNSave( Session session, File fileName) > { > boolean localSession = (session == null); > Transaction tx = null; > > try > { > System.out.println( "*********** Loading "+fileName+"..."); > BufferedReader br = new BufferedReader( new FileReader( fileName) ); > > if ( session == null) // create a local session > { > session = sessionFactory.openSession(); > RichObjectFactory.connectToBioSQL(session); > } > > // load the objects. I expect this to use the established factory. > RichSequenceIterator rsi = RichSequence.IOTools.readGenbankDNA( > br, new > SimpleNamespace( "genbankBiosqlRich") ); > > while ( rsi.hasNext() ) > tx = session.beginTransaction(); // Hibernate requires transactions. > > System.out.println( "*********** Loading next sequence..."); > // ??should automatically fetch existing objects from the > database... > RichSequence sequence = rsi.nextRichSequence(); > System.out.println( "loaded sequence > "+sequence.getAccession()+", identifier: "+ sequence.getIdentifier()); > > try > { > System.out.println( "*********** saving..."); > > // synchronize in-memory representation w/ the database > // HUGE amounts of time spent doing selects on keys - really > slows things down!! > session.saveOrUpdate( "Sequence", sequence ); > tx.commit(); // save to database - does an automatic flush > // batch operations overwhelm the cache - clear it out! > session.flush(); // force in-memory to disk. > session.clear(); // clean out cache. 
> } > catch (HibernateException ex) > { > tx.rollback(); // discard the sequence and all its annotations > ex.printStackTrace(); > } > } > } > catch (FileNotFoundException ex) > { > ex.printStackTrace(); > } > catch ( BioException bex) > { > bex.printStackTrace(); > } > finally > { > if ( localSession) > { > session.flush(); // force in-memory to disk. > session.close(); // only for local sessions > } > } > } > > and the following following is a sample stack dump: > > org.hibernate.exception.ConstraintViolationException: could not insert: > [Namespace] > at > org.hibernate.exception.SQLStateConverter.convert(SQLStateConverter.java:71) > > at > org.hibernate.exception.JDBCExceptionHelper.convert(JDBCExceptionHelper.java:43) > > at > org.hibernate.id.insert.AbstractReturningDelegate.performInsert(AbstractReturningDelegate.java:40) > > at > org.hibernate.persister.entity.AbstractEntityPersister.insert(AbstractEntityPersister.java:2163) > > at > org.hibernate.persister.entity.AbstractEntityPersister.insert(AbstractEntityPersister.java:2643) > > at > org.hibernate.action.EntityIdentityInsertAction.execute(EntityIdentityInsertAction.java:51) > > at org.hibernate.engine.ActionQueue.execute(ActionQueue.java:279) > at > org.hibernate.event.def.AbstractSaveEventListener.performSaveOrReplicate(AbstractSaveEventListener.java:298) > > at > org.hibernate.event.def.AbstractSaveEventListener.performSave(AbstractSaveEventListener.java:181) > > at > org.hibernate.event.def.AbstractSaveEventListener.saveWithGeneratedId(AbstractSaveEventListener.java:107) > > at > org.hibernate.event.def.DefaultSaveOrUpdateEventListener.saveWithGeneratedOrRequestedId(DefaultSaveOrUpdateEventLi > > stener.java:187) > at > org.hibernate.event.def.DefaultSaveOrUpdateEventListener.entityIsTransient(DefaultSaveOrUpdateEventListener.java:1 > > 72) > at > org.hibernate.event.def.DefaultSaveOrUpdateEventListener.performSaveOrUpdate(DefaultSaveOrUpdateEventListener.java > > :94) > at > 
org.hibernate.event.def.DefaultSaveOrUpdateEventListener.onSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:70) > > at > org.hibernate.impl.SessionImpl.fireSaveOrUpdate(SessionImpl.java:507) > at org.hibernate.impl.SessionImpl.saveOrUpdate(SessionImpl.java:499) > at > org.hibernate.engine.CascadingAction$5.cascade(CascadingAction.java:218) > at org.hibernate.engine.Cascade.cascadeToOne(Cascade.java:268) > at org.hibernate.engine.Cascade.cascadeAssociation(Cascade.java:216) > at org.hibernate.engine.Cascade.cascadeProperty(Cascade.java:169) > at org.hibernate.engine.Cascade.cascade(Cascade.java:130) > at > org.hibernate.event.def.AbstractSaveEventListener.cascadeBeforeSave(AbstractSaveEventListener.java:431) > > at > org.hibernate.event.def.AbstractSaveEventListener.performSaveOrReplicate(AbstractSaveEventListener.java:265) > > at > org.hibernate.event.def.AbstractSaveEventListener.performSave(AbstractSaveEventListener.java:181) > > at > org.hibernate.event.def.AbstractSaveEventListener.saveWithGeneratedId(AbstractSaveEventListener.java:107) > > at > org.hibernate.event.def.DefaultSaveOrUpdateEventListener.saveWithGeneratedOrRequestedId(DefaultSaveOrUpdateEventLi > > stener.java:187) > at > org.hibernate.event.def.DefaultSaveOrUpdateEventListener.entityIsTransient(DefaultSaveOrUpdateEventListener.java:1 > > 72) > at > org.hibernate.event.def.DefaultSaveOrUpdateEventListener.performSaveOrUpdate(DefaultSaveOrUpdateEventListener.java > > :94) > at > org.hibernate.event.def.DefaultSaveOrUpdateEventListener.onSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:70) > > at > org.hibernate.impl.SessionImpl.fireSaveOrUpdate(SessionImpl.java:507) > at org.hibernate.impl.SessionImpl.saveOrUpdate(SessionImpl.java:499) > at > bioinformatics.biojava.BriefLoader.loadNSave(BriefLoader.java:108) > at bioinformatics.biojava.BriefLoader.main(BriefLoader.java:72) > Caused by: java.sql.SQLException: Duplicate entry 'genbankBiosqlRich' > for key 2 > at 
com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2975) > at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1600) > at > com.mysql.jdbc.ServerPreparedStatement.serverExecute(ServerPreparedStatement.java:1125) > > at > com.mysql.jdbc.ServerPreparedStatement.executeInternal(ServerPreparedStatement.java:677) > > at > com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1357) > at > com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1274) > at > com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1259) > at > org.hibernate.id.IdentityGenerator$GetGeneratedKeysDelegate.executeAndExtract(IdentityGenerator.java:73) > > at > org.hibernate.id.insert.AbstractReturningDelegate.performInsert(AbstractReturningDelegate.java:33) > > > > -- > Doug Brown - Bioinformatics > Fungal Genomics Laboratory > Center for Integrated Fungal Research > North Carolina State University > Campus Box 7251, Raleigh, NC 27695-7251 > https://www.fungalgenomics.ncsu.edu/~debrown/ > Tel: (919) 513-0394, Fax (919) 513-0024 > e-mail: doug_brown at ncsu.edu > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > From markjschreiber at gmail.com Tue Apr 22 01:39:39 2008 From: markjschreiber at gmail.com (Mark Schreiber) Date: Tue, 22 Apr 2008 09:39:39 +0800 Subject: [Biojava-l] Serching mailing list archives Message-ID: <93b45ca50804211839lddf0a7je47c337c5250c85d@mail.gmail.com> Dear Felix - The URL has been updated. Please see below. 
- Mark

---------- Forwarded message ----------
From: Mauricio Herrera Cuadra via RT
Date: Tue, Apr 22, 2008 at 9:20 AM
Subject: [O|B|F Helpdesk #507] Fwd: [Biojava-l] mailing list archives
To: markjschreiber at gmail.com
Cc: chris at bioteam.net, heikki at sanbi.ac.za, hlapp at gmx.net, jason at bioperl.org

OBF Search engine URL has been moved to: http://search.open-bio.org

I've updated the link in the BioJava wiki (http://biojava.org/wiki/BioJava:MailingLists) with the new URL. Please let the requestor know about the update.

Regards,
Mauricio.

From markjschreiber at gmail.com Tue Apr 22 06:40:35 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Tue, 22 Apr 2008 14:40:35 +0800
Subject: [Biojava-l] loading multiple records for same organism and peristance in BioSQL
In-Reply-To: <480D7B55.3070708@uni-tuebingen.de>
References: <4805E9B8.6000709@unity.ncsu.edu> <93b45ca50804211827i42924c9rb1d5d2f85c25a1ca@mail.gmail.com> <480D7B55.3070708@uni-tuebingen.de>
Message-ID: <93b45ca50804212340s708df36ek73cdf3337b957316@mail.gmail.com>

Hi -

Can someone update the docs on biojava.org to reflect this requirement?

Thanks,
- Mark

On Tue, Apr 22, 2008 at 1:44 PM, Andreas Dräger wrote:
> Hi Doug,
>
> We also had the same problem. The solution is simple. You always construct a
> new SimpleNamespace, each time with the same name. Your code will work if
> you do one of the following:
> 1. You can load the namespace with the name from the database and set this
> namespace to the parser.
> 2. You can use the default namespace from the RichObjectFactory or
> 3. Just use the parser method, which does not require any namespaces - this
> method actually uses the default namespace (so three is actually equal to
> two).
> This should help.
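
The reason the three options above help can be shown with a toy model. This is a minimal sketch in plain Java with no BioJava or Hibernate dependency; `NamespaceTable`, `insertNew` and `getOrCreate` are hypothetical names that only mimic a table with a unique constraint on the namespace name, and are not part of the BioJavaX API. A freshly constructed object for an existing name forces a second insert, whereas looking the object up first reuses the stored row.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a namespace table with a UNIQUE constraint on name. */
final class NamespaceTable {
    private final Map<String, Integer> idsByName = new HashMap<>();
    private int nextId = 1;

    /** Mimics persisting a brand-new (transient) object: always an INSERT. */
    int insertNew(String name) {
        if (idsByName.containsKey(name)) {
            throw new IllegalStateException("Duplicate entry '" + name + "'");
        }
        idsByName.put(name, nextId);
        return nextId++;
    }

    /** Mimics loading the existing row first and inserting only when it is absent. */
    int getOrCreate(String name) {
        Integer existing = idsByName.get(name);
        return existing != null ? existing : insertNew(name);
    }

    public static void main(String[] args) {
        NamespaceTable table = new NamespaceTable();
        int first = table.getOrCreate("genbankBiosqlRich");   // first file: row is created
        int second = table.getOrCreate("genbankBiosqlRich");  // second file: row is reused
        System.out.println(first == second);                  // prints "true"
        try {
            table.insertNew("genbankBiosqlRich");              // fresh object, same name
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());                // prints "Duplicate entry 'genbankBiosqlRich'"
        }
    }
}
```

In BioJavaX terms, the get-or-create path presumably corresponds to asking the RichObjectFactory for the namespace (or relying on its default namespace) instead of calling `new SimpleNamespace(...)` for every file.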
> > Cheers > Andreas > From debrown at unity.ncsu.edu Tue Apr 22 12:22:16 2008 From: debrown at unity.ncsu.edu (Doug Brown) Date: Tue, 22 Apr 2008 08:22:16 -0400 Subject: [Biojava-l] mailing list archives In-Reply-To: <480C8DE8.5040805@molgen.mpg.de> References: <480C8DE8.5040805@molgen.mpg.de> Message-ID: <480DD878.2000100@unity.ncsu.edu> Hi Felix, In addition to the http://search.open-bio.org link mentioned by Mauricio Herrera Cuadra, you could use Google directly with search expressions similar to: "[biojava-l]" site:portal.open-bio.org "[biojava-dev]" site:portal.open-bio.org Of course, you need to add on any additional search terms to limit the results. In general see: http://www.google.com/advanced_search Regards, Doug Felix Dreher wrote: > Hello all, > > is there a possibility to query the biojava mailing-list archives? > (the link provided on the biojava-homepage doesn't work: > http://search.open-bio.org/cgi-bin/mail-search.cgi) > > Best regards, > Felix > > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l -- Doug Brown - Bioinformatics Fungal Genomics Laboratory Center for Integrated Fungal Research North Carolina State University Campus Box 7251, Raleigh, NC 27695-7251 https://www.fungalgenomics.ncsu.edu/~debrown/ Tel: (919) 513-0394, Fax (919) 513-0024 e-mail: doug_brown at ncsu.edu From markjschreiber at gmail.com Wed Apr 23 07:22:14 2008 From: markjschreiber at gmail.com (Mark Schreiber) Date: Wed, 23 Apr 2008 15:22:14 +0800 Subject: [Biojava-l] Suspicious Headers Message-ID: <93b45ca50804230022j223ed4eetf219e21d35cd51dc@mail.gmail.com> Hi - If you have ever tried posting to the list and had your email bounced back with a message complaining about a suspicious header chances are your email has been bounced by our spam filter. There are generally 2 reasons why this might happen 1. Your email is HTML, mail to the list must be text only. 2. 
Your email has an attachment, possibly a .vcf file or an image in your email signature (company logo or similar).

If you keep it plain text it should get through.

- Mark

From mail at florianschatz.de Wed Apr 23 13:49:05 2008
From: mail at florianschatz.de (Florian Schatz)
Date: Wed, 23 Apr 2008 15:49:05 +0200
Subject: [Biojava-l] Extract non-gene regions
Message-ID:

Hello,

I am new to BioJava and have worked a lot with it in the last few weeks. I hope this is the right place for questions; if not, please tell me.

I want to get the nucleotide sequence outside the genes of a GenBank file, that is, everything that is not marked by a 'gene' feature. Unfortunately, there is no subtract or exclude function for the Location class. Any hints?

Btw: union() of Location worked fine for extracting the nucleotides of the genes only.

Best,
Florian

From markjschreiber at gmail.com Thu Apr 24 02:29:12 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Thu, 24 Apr 2008 10:29:12 +0800
Subject: [Biojava-l] Extract non-gene regions
In-Reply-To:
References:
Message-ID: <93b45ca50804231929h7a7c31aaudccbc5325d6be5f3@mail.gmail.com>

Hi Florian -

There are at least two approaches. You are on the right track with making a union of all gene locations. The compound location that results from the Union will contain all the nucleotides that are coding. You can then iterate through each nucleotide in the genome and find out if the union contains the nucleotide. If it doesn't then it is non coding. This is surprisingly rapid as the comparisons are simple. The pseudo code would be something like...

RichLocation coding; //initialize this by making a union of all locations of CDS or Gene Features.

RichSequence genome; // read from file or database

for(int i = 1; i <= genome.length(); i++){ //you might need to be a bit more sophisticated for a circular genome
    if( ! genome.contains(i)){
        //you have a non-coding nucleotide.
    }
}

The other approach is to use the blockIterator() method of the compound location that results from the union of coding sequences. This will output each contiguous chunk of coding sequence. If you know the length of the sequence then you can rapidly figure out the intervening pieces.

For example, if the block iterator tells you that [10..50], [90..100], [350..380] are coding and you know the genome is of length 400 then you can quickly derive [1..9], [51..89], [101..349] and [381..400] are non-coding. Again it is more complicated for circular sequences and more complex if you consider the opposite strand of a gene (the gene shadow) to be non-coding. Unfortunately there is no convenience method to do this but if you code something up it would be great to put it in the cookbook so others can re-use it.

- Mark

You could actually make point locations of all the non-coding nucleotides and then merge the whole lot at the end into a compound location of non-coding regions.

On Wed, Apr 23, 2008 at 9:49 PM, Florian Schatz wrote:
> Hello,
>
> I am new to biojava and worked a lot with in the last few weeks. I hope
> this is the right place for questions, if not please tell me.
>
> I want to get the nucleotid sequence outside the genes of a genebank file.
> So everything that is not marked by a 'gene' feature. Unfortunately, there
> is no sustract or exclude function for the Location class. Any hints?
>
> Btw: union() of location worked fine for extracting nucleotids of the genes
> only.
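
The block-based approach described above can be sketched in plain Java. This is only an illustration under stated assumptions: it works on plain `int` pairs rather than BioJava Location objects, it assumes a linear sequence on a single strand, and the class and method names (`NonCodingGaps`, `gaps`, `format`) are made up for the example. In BioJava the blocks would presumably be read from the union's blockIterator().

```java
import java.util.ArrayList;
import java.util.List;

/** Computes the gaps between sorted coding blocks (1-based, inclusive coordinates). */
final class NonCodingGaps {

    /**
     * @param blocks coding blocks as {start, end} pairs, sorted by start
     * @param length total sequence length
     * @return the non-coding intervals as {start, end} pairs
     */
    static List<int[]> gaps(int[][] blocks, int length) {
        List<int[]> result = new ArrayList<>();
        int next = 1; // first position not yet covered by a coding block
        for (int[] block : blocks) {
            if (block[0] > next) {
                result.add(new int[] {next, block[0] - 1});
            }
            next = Math.max(next, block[1] + 1);
        }
        if (next <= length) {
            result.add(new int[] {next, length}); // tail after the last block
        }
        return result;
    }

    /** Renders intervals in the [start..end] notation used in the message above. */
    static String format(List<int[]> intervals) {
        StringBuilder sb = new StringBuilder();
        for (int[] iv : intervals) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append('[').append(iv[0]).append("..").append(iv[1]).append(']');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[][] coding = {{10, 50}, {90, 100}, {350, 380}};
        System.out.println(format(gaps(coding, 400)));
        // prints "[1..9], [51..89], [101..349], [381..400]"
    }
}
```

A circular genome would additionally need the last and first gaps joined across the origin, which this sketch does not attempt.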
> > Best, > Florian > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > From heuermh at acm.org Thu Apr 24 04:09:46 2008 From: heuermh at acm.org (Michael Heuer) Date: Thu, 24 Apr 2008 00:09:46 -0400 (EDT) Subject: [Biojava-l] Extract non-gene regions In-Reply-To: <93b45ca50804231929h7a7c31aaudccbc5325d6be5f3@mail.gmail.com> Message-ID: On Thu, 24 Apr 2008, Mark Schreiber wrote: > Hi Florian - > > There are at least two approaches. You are on the right track with > making a union of all gene locations. The compound location that > results from the Union will contain all the nucleotides that are > coding. You can then iterate through each nucleotide in the genome and > find out if the union contains the nucleotide. If it doesn't then it > is non coding. This is surprisingly rapid as the comparisons are > simple. The pseudo code would be something like... > > RichLocation coding; //initialize this by making a union of all > locations of CDS or Gene Features. > > RichSequence genome; // read from file or database > > for(int i = 1; i <= genome.lenght(); i++){ //you might need to be a > bit more sophisticated for a circular genome > if( ! genome.contains(i){ > //you have a non-coding nucleotide. > } > } typo? if (!coding.contains(i)) { // you have a non-coding nucleotide. } > The other approach is to use the blockIterator() method of the > compound location that results from the union of coding sequences. > This will output each contiguous chunk of coding sequence. If you know > the length of the sequence then you can rapidly figure out the > intervening pieces. > > For example, if the block iterator tells you that [10..50], [90..100], > [350..380] are coding and you know the genome is of length 400 then > you can quickly derive [1..9], [51..89], [101..349] and [381..400] are > non-coding. 
Again it is more complicated for circular sequences and > more complex if you consider the opposite strand of a gene (the gene > shadow) to be non-coding. Unfortunately there is no convenience method > to do this but if you code something up it would be great to put it in > the cookbook so others can re-use it. > > - Mark > > You could actually make point locations of all the non-coding > nucleotides and then merge the whole lot at the end into a compound > location of non-coding > > On Wed, Apr 23, 2008 at 9:49 PM, Florian Schatz wrote: > > Hello, > > > > I am new to biojava and worked a lot with in the last few weeks. I hope > > this is the right place for questions, if not please tell me. > > > > I want to get the nucleotid sequence outside the genes of a genebank file. > > So everything that is not marked by a 'gene' feature. Unfortunately, there > > is no sustract or exclude function for the Location class. Any hints? > > > > Btw: union() of location worked fine for extracting nucleotids of the genes > > only. > > > > Best, > > Florian > > _______________________________________________ > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > From andreas.draeger at uni-tuebingen.de Thu Apr 24 07:26:31 2008 From: andreas.draeger at uni-tuebingen.de (=?ISO-8859-1?Q?Andreas_Dr=E4ger?=) Date: Thu, 24 Apr 2008 09:26:31 +0200 Subject: [Biojava-l] Problem while parsing GenBank-like files and persiting them using Hibernate Message-ID: <48103627.90505@uni-tuebingen.de> Dear all, Recently I downloaded some GenBank-like files from the Ensembl web site (http://www.ensembl.org/index.html) and recognized that the format used on this site slightly diverges from what one gets from NCBI. 
In particular, the ACCESSION number is not valid according to the pattern matcher in class org.biojavax.bio.seq.io.GenbankFormat, so the files cannot be parsed using RichSequence.IOTools. This issue has been discussed on this list before, but the solution then was to use files from NCBI instead of those from Ensembl. However, the files from Ensembl are important because they contain additional annotation not provided by NCBI, for instance the feature "exon". The old parsers from the biojava.seq.io package are able to read the files from this site. The Sequence objects can be enriched afterwards and written to another GenBank file. However, this again results in a file that cannot be stored in a BioSQL database using Hibernate, because of the invalid accession number. The next problem is that even the old parsers do not treat this "rich" information from the Ensembl files properly. The feature "exon" becomes "any" when the sequence is enriched and written to a new GenBank file. Hence the benefit of the Ensembl annotation gets lost during parsing and conversion. By the way, Ensembl also offers Embl-like files and other formats, which have the same problems as mentioned above. On the other hand, no matter which parser in BioJavaX I look up in the API documentation, I can always find a corresponding "Term" class, whose documentation states that it "Implements some ...-specific terms", where the dots stand for the format in question, such as UniProt, GenBank, Embl and so forth. None of these Term classes provides any setters or add-methods that would allow one to define a new term like "exon". The structure of the parsers seems very sophisticated to me, and it is not easy to extend the parsers or term classes for one's own purposes. Therefore, I would like to ask the following questions: 1. Is there a way to read in files downloaded from Ensembl using only the designated BioJavaX classes? 2.
How can I extend the terms so that more than just "SOME X-specific terms" are included? And how do I tell the parser to use and apply these terms? Or more generally, can I somehow read in an ontology (for instance the GO), persist it in BioSQL and make use of the terms contained therein? 3. How can I persist a sequence from Ensembl within a BioSQL database using Hibernate even though Ensembl uses different accession numbers? I am grateful for any answers. Cheers Andreas From mail at florianschatz.de Thu Apr 24 12:09:24 2008 From: mail at florianschatz.de (Florian Schatz) Date: Thu, 24 Apr 2008 14:09:24 +0200 Subject: [Biojava-l] Extract non-gene regions In-Reply-To: <93b45ca50804231929h7a7c31aaudccbc5325d6be5f3@mail.gmail.com> References: <93b45ca50804231929h7a7c31aaudccbc5325d6be5f3@mail.gmail.com> Message-ID: Hello, I tried that, but it is as slow as a version operating on Strings. However, I created a Cookbook entry: http://biojava.org/wiki/BioJava:Cookbook:Sequence:ExtractGeneRegions Is there a better way to get a Sequence from a SymbolList than: Sequence newsequence = DNATools.createDNASequence(symbolL.seqString(), "New Sequence"); Best, Florian On 24.04.2008 at 04:29, Mark Schreiber wrote: > Hi Florian - > > There are at least two approaches. You are on the right track with > making a union of all gene locations. The compound location that > results from the Union will contain all the nucleotides that are > coding. You can then iterate through each nucleotide in the genome and > find out if the union contains the nucleotide. If it doesn't then it > is non coding. This is surprisingly rapid as the comparisons are > simple. The pseudo code would be something like... > > RichLocation coding; //initialize this by making a union of all > locations of CDS or Gene Features. > > RichSequence genome; // read from file or database > > for(int i = 1; i <= genome.lenght(); i++){ //you might need to be a > bit more sophisticated for a circular genome > if( ! 
genome.contains(i){ > //you have a non-coding nucleotide. > } > } > > The other approach is to use the blockIterator() method of the > compound location that results from the union of coding sequences. > This will output each contiguous chunk of coding sequence. If you know > the length of the sequence then you can rapidly figure out the > intervening pieces. > > For example, if the block iterator tells you that [10..50], [90..100], > [350..380] are coding and you know the genome is of length 400 then > you can quickly derive [1..9], [51..89], [101..349] and [381..400] are > non-coding. Again it is more complicated for circular sequences and > more complex if you consider the opposite strand of a gene (the gene > shadow) to be non-coding. Unfortunately there is no convenience method > to do this but if you code something up it would be great to put it in > the cookbook so others can re-use it. > > - Mark > > You could actually make point locations of all the non-coding > nucleotides and then merge the whole lot at the end into a compound > location of non-coding > > On Wed, Apr 23, 2008 at 9:49 PM, Florian Schatz > wrote: >> Hello, >> >> I am new to biojava and worked a lot with in the last few weeks. >> I hope >> this is the right place for questions, if not please tell me. >> >> I want to get the nucleotid sequence outside the genes of a >> genebank file. >> So everything that is not marked by a 'gene' feature. >> Unfortunately, there >> is no sustract or exclude function for the Location class. Any hints? >> >> Btw: union() of location worked fine for extracting nucleotids of >> the genes >> only. 
>> >> Best, >> Florian >> _______________________________________________ >> Biojava-l mailing list - Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l >> From markjschreiber at gmail.com Thu Apr 24 12:47:59 2008 From: markjschreiber at gmail.com (Mark Schreiber) Date: Thu, 24 Apr 2008 20:47:59 +0800 Subject: [Biojava-l] Extract non-gene regions In-Reply-To: References: <93b45ca50804231929h7a7c31aaudccbc5325d6be5f3@mail.gmail.com> Message-ID: <93b45ca50804240547gefb750fh493a01a8d0edbe22@mail.gmail.com> Hi - While Sequences and SymbolLists offer many advantages over Strings or character arrays, speed is not one of them. You can create a Sequence using the SequenceFactory implementations, which are much more efficient than converting to Strings and back to symbols again. That round trip is a very expensive operation. From memory, SimpleRichSequence may even have a constructor that takes a SymbolList and a name. There should be no need to convert to a String and back. Also, do you need a Sequence when a SymbolList may contain all the information you need? Finally, the Edit operations you use in your wiki example will cause quite a big performance hit; your comment seems to allude to this. It would be better to collect all the non-coding points (i), compile them into a compound location, and then extract the SymbolList for that location all in one go. - Mark On Thu, Apr 24, 2008 at 8:09 PM, Florian Schatz wrote: > Hello, > > I tried that, but is as slow as a version operating on Strings.. however, I > created a Cookbook entry: > http://biojava.org/wiki/BioJava:Cookbook:Sequence:ExtractGeneRegions > > Is there a better way to get a Sequence from a SybolList than: > > Sequence newsequence = DNATools.createDNASequence(symbolL.seqString(), "New > Sequence"); > > > Best, > Florian > > On 24.04.2008 at 04:29, Mark Schreiber wrote: > > > Hi Florian - > > > > > > > > > > There are at least two approaches.
You are on the right track with > > making a union of all gene locations. The compound location that > > results from the Union will contain all the nucleotides that are > > coding. You can then iterate through each nucleotide in the genome and > > find out if the union contains the nucleotide. If it doesn't then it > > is non coding. This is surprisingly rapid as the comparisons are > > simple. The pseudo code would be something like... > > > > RichLocation coding; //initialize this by making a union of all > > locations of CDS or Gene Features. > > > > RichSequence genome; // read from file or database > > > > for(int i = 1; i <= genome.lenght(); i++){ //you might need to be a > > bit more sophisticated for a circular genome > > if( ! genome.contains(i){ > > //you have a non-coding nucleotide. > > } > > } > > > > The other approach is to use the blockIterator() method of the > > compound location that results from the union of coding sequences. > > This will output each contiguous chunk of coding sequence. If you know > > the length of the sequence then you can rapidly figure out the > > intervening pieces. > > > > For example, if the block iterator tells you that [10..50], [90..100], > > [350..380] are coding and you know the genome is of length 400 then > > you can quickly derive [1..9], [51..89], [101..349] and [381..400] are > > non-coding. Again it is more complicated for circular sequences and > > more complex if you consider the opposite strand of a gene (the gene > > shadow) to be non-coding. Unfortunately there is no convenience method > > to do this but if you code something up it would be great to put it in > > the cookbook so others can re-use it. 
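For the block-based approach quoted above, the non-coding pieces can be derived with plain interval arithmetic. The following is only a minimal sketch in plain Java, assuming a linear (non-circular) sequence, 1-based inclusive coordinates, and coding blocks that are already sorted by start position. The class name NonCodingGaps and the method complement are made up for illustration and are not part of BioJava; with BioJava the blocks would presumably be read from blockIterator() on the merged location, one getMin()/getMax() pair per block.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NonCodingGaps {

    /**
     * Returns the 1-based, inclusive intervals of a linear sequence that are
     * not covered by any coding block. Each block is an int[] {min, max};
     * blocks are assumed to be sorted by min.
     */
    public static List<int[]> complement(List<int[]> codingBlocks, int sequenceLength) {
        List<int[]> gaps = new ArrayList<>();
        int nextUncovered = 1; // first position not yet known to be coding
        for (int[] block : codingBlocks) {
            if (block[0] > nextUncovered) {
                // everything between the previous block and this one is non-coding
                gaps.add(new int[] {nextUncovered, block[0] - 1});
            }
            nextUncovered = Math.max(nextUncovered, block[1] + 1);
        }
        if (nextUncovered <= sequenceLength) {
            // trailing piece after the last coding block
            gaps.add(new int[] {nextUncovered, sequenceLength});
        }
        return gaps;
    }

    public static void main(String[] args) {
        // the blocks from the example in the thread, genome length 400
        List<int[]> coding = Arrays.asList(
                new int[] {10, 50}, new int[] {90, 100}, new int[] {350, 380});
        for (int[] gap : complement(coding, 400)) {
            System.out.println("[" + gap[0] + ".." + gap[1] + "]");
        }
    }
}
```

This touches each block once instead of testing every nucleotide, so the cost depends on the number of features rather than on the genome length. Circular genomes and strand-aware definitions of non-coding are not handled here.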
> > > > - Mark > > > > You could actually make point locations of all the non-coding > > nucleotides and then merge the whole lot at the end into a compound > > location of non-coding > > > > On Wed, Apr 23, 2008 at 9:49 PM, Florian Schatz > wrote: > > > > > Hello, > > > > > > I am new to biojava and worked a lot with in the last few weeks. I hope > > > this is the right place for questions, if not please tell me. > > > > > > I want to get the nucleotid sequence outside the genes of a genebank > file. > > > So everything that is not marked by a 'gene' feature. Unfortunately, > there > > > is no sustract or exclude function for the Location class. Any hints? > > > > > > Btw: union() of location worked fine for extracting nucleotids of the > genes > > > only. > > > > > > Best, > > > Florian > > > _______________________________________________ > > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > > > > > > > > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > From hlapp at gmx.net Thu Apr 24 22:25:29 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 24 Apr 2008 18:25:29 -0400 Subject: [Biojava-l] Problem while parsing GenBank-like files and persiting them using Hibernate In-Reply-To: <48103627.90505@uni-tuebingen.de> References: <48103627.90505@uni-tuebingen.de> Message-ID: <59C8D987-F6F8-4E83-B70D-127D38DCC0C9@gmx.net> Hi Andreas, On Apr 24, 2008, at 3:26 AM, Andreas Dräger wrote: > Or more generally, can I somehow read in an ontology (for instance > the GO), persist it in BioSQL and make use of the terms contained > therein? In principle, you absolutely can. As for whether Biojava lets you do this, I do not know, but I would suppose so. (BioPerl has a script load_ontology.pl in its Bioperl-db package that does this.)
Your other questions all seem Biojava-specific, so I'll leave them to the list and Mark & Richard. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From andreas.draeger at uni-tuebingen.de Tue Apr 22 05:45:02 2008 From: andreas.draeger at uni-tuebingen.de (Andreas Dräger) Date: Tue, 22 Apr 2008 05:45:02 -0000 Subject: [Biojava-l] loading multiple records for same organism and peristance in BioSQL In-Reply-To: <93b45ca50804211827i42924c9rb1d5d2f85c25a1ca@mail.gmail.com> References: <4805E9B8.6000709@unity.ncsu.edu> <93b45ca50804211827i42924c9rb1d5d2f85c25a1ca@mail.gmail.com> Message-ID: <480D7B55.3070708@uni-tuebingen.de> Hi Doug, We also had the same problem. The solution is simple. You always construct a new SimpleNamespace, each time with the same name. Your code will work if you do one of the following: 1. You can load the namespace with the name from the database and set this namespace to the parser. 2. You can use the default namespace from the RichObjectFactory or 3. Just use the parser method, which does not require any namespaces - this method actually uses the default namespace (so three is actually equal to two). This should help. Cheers Andreas From budhaditya21 at yahoo.co.in Tue Apr 29 05:29:23 2008 From: budhaditya21 at yahoo.co.in (arunabha banerjee) Date: Tue, 29 Apr 2008 05:29:23 -0000 Subject: [Biojava-l] problems installing biojava on Windows XP professional Message-ID: <406135.22320.qm@web94608.mail.in2.yahoo.com> An HTML attachment was scrubbed... URL: