From ap3 at sanger.ac.uk  Sun Apr 13 14:02:41 2008
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Sun, 13 Apr 2008 19:02:41 +0100
Subject: [Biojava-l] biojava 1.6 released
Message-ID: <0A060667-C24C-4D41-8D10-ED1D449A5F62@sanger.ac.uk>


BioJava 1.6 has been released and is available from
http://biojava.org/wiki/BioJava:Download

BioJava 1.6 offers more functionality and stability than the previous
official releases. BioJava now requires Java 1.5 or later. We highly
recommend that you upgrade as soon as possible.

In detail, the phylo package org.biojavax.bio.phylo was improved and
expanded by our GSOC'07 student Boh-Yun Lee. It now contains fully
functional Nexus and Phylip parsers, and tools for calculating UPGMA
and Neighbour Joining, Jukes-Cantor and Kimura Two Parameter, and MP.
It uses JGraphT to represent parsed trees.

The PDB file parser was improved by Jules Jacobsen to handle PDB
header records better. Andreas Draeger provided several patches that
improve the Genetic Algorithm modules. Additionally, this release
contains numerous bug fixes and documentation improvements.

Thanks to the entire biojava community for making this possible!

Happy Biojava-ing,

Andreas

-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                               Hinxton, Cambridge CB10 1SA, UK
                               +44 (0) 1223 49 6891

-----------------------------------------------------------------------


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 

From debrown at unity.ncsu.edu  Wed Apr 16 07:57:44 2008
From: debrown at unity.ncsu.edu (Doug Brown)
Date: Wed, 16 Apr 2008 07:57:44 -0400
Subject: [Biojava-l] loading multiple records for same organism and
	persistence in BioSQL
Message-ID: <4805E9B8.6000709@unity.ncsu.edu>

Greetings,
I am happily climbing the learning curve for Biojava-live, Biojavax,
and BioSQL. I believe that I am using the latest releases, Biojava 1.6
and BioSQL 1.0, in that I have performed the installation within the
past week.

I am attempting to load, via Biojavax, multiple genbank records for the
same organism (a whole genome's worth of annotations) and to save those
into a BioSQL database via Biojavax's Hibernate persistence mechanism.
Loading a second genbank file (same organism, different sequence) croaks
with the error: SEVERE: Duplicate entry 'genbankBiosqlRich' for key 2
...... could not insert: [Namespace]. FYI two sample genbank records
are  CH476760.gb and CH476761.gb and were obtained directly from genbank.

Never having used Hibernate before nor its type of database abstraction,
I think that I am properly handling the transaction semantics. Either I
am violating unspoken presumptions of the persistence paradigm or the
behavior of RichSequence.IOTools.readGenbankDNA is not what I expected.
I had presumed that the above routine would use the established
RichObjectFactory to obtain new or extant objects and then populate
those objects with values from the sequence file. This only seems to
happen when I load multiple sequences from a single file. Multi file
operations fail dismally.

What is the proper way of using Biojava to load up a database with records?

In advance, thank you all for the traffic on this list, it has been
quite helpful in bringing me up to speed.

Regards,
Doug Brown

Here is the relevant [hacked] subroutine:
  /**
   * This works for genbank files containing multiple sequences.
   * Original concept from:
   * http://portal.open-bio.org/pipermail/biojava-l/2007-April/005824.html
   * It fails on inserting existing record(s) - does not replace...
   * This causes grief when loading multiple files...
   */
  public void loadNSave( Session session, File fileName)
    {
    boolean localSession = (session == null);
    Transaction tx = null;

    try
      {
      System.out.println( "*********** Loading "+fileName+"...");
      BufferedReader br = new BufferedReader( new FileReader(  fileName) );

      if ( session == null)  // create a local session
        {
        session = sessionFactory.openSession();
        RichObjectFactory.connectToBioSQL(session);
        }

      // load the objects. I expect this to use the established factory.
      RichSequenceIterator rsi = RichSequence.IOTools.readGenbankDNA(
          br, new SimpleNamespace( "genbankBiosqlRich") );

      while ( rsi.hasNext() )
        {
        tx = session.beginTransaction(); // Hibernate requires transactions.

        System.out.println( "*********** Loading next sequence...");
        // ??should automatically fetch existing objects from the database...
        RichSequence sequence = rsi.nextRichSequence();
        System.out.println( "loaded sequence "+sequence.getAccession()
            +", identifier: "+ sequence.getIdentifier());

        try
          {
          System.out.println( "*********** saving...");

          // synchronize in-memory representation w/ the database
          // HUGE amounts of time spent doing selects on keys - really
          // slows things down!!
          session.saveOrUpdate( "Sequence", sequence );
          tx.commit();    // save to database - does an automatic flush
          // batch operations overwhelm the cache - clear it out!
          session.flush();  // force in-memory to disk.
          session.clear();  // clean out cache.
          }
        catch (HibernateException ex)
          {
          tx.rollback();   // discard the sequence and all its annotations
          ex.printStackTrace();
          }
        }
      }
    catch (FileNotFoundException ex)
      {
      ex.printStackTrace();
      }
    catch ( BioException bex)
      {
      bex.printStackTrace();
      }
    finally
      {
      // session is still null if the file could not be opened
      if ( localSession && session != null)
        {
        session.flush();  // force in-memory to disk.
        session.close();  // only for local sessions
        }
      }
    }

and the following is a sample stack dump:

org.hibernate.exception.ConstraintViolationException: could not insert: [Namespace]
       at org.hibernate.exception.SQLStateConverter.convert(SQLStateConverter.java:71)
       at org.hibernate.exception.JDBCExceptionHelper.convert(JDBCExceptionHelper.java:43)
       at org.hibernate.id.insert.AbstractReturningDelegate.performInsert(AbstractReturningDelegate.java:40)
       at org.hibernate.persister.entity.AbstractEntityPersister.insert(AbstractEntityPersister.java:2163)
       at org.hibernate.persister.entity.AbstractEntityPersister.insert(AbstractEntityPersister.java:2643)
       at org.hibernate.action.EntityIdentityInsertAction.execute(EntityIdentityInsertAction.java:51)
       at org.hibernate.engine.ActionQueue.execute(ActionQueue.java:279)
       at org.hibernate.event.def.AbstractSaveEventListener.performSaveOrReplicate(AbstractSaveEventListener.java:298)
       at org.hibernate.event.def.AbstractSaveEventListener.performSave(AbstractSaveEventListener.java:181)
       at org.hibernate.event.def.AbstractSaveEventListener.saveWithGeneratedId(AbstractSaveEventListener.java:107)
       at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.saveWithGeneratedOrRequestedId(DefaultSaveOrUpdateEventListener.java:187)
       at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.entityIsTransient(DefaultSaveOrUpdateEventListener.java:172)
       at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.performSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:94)
       at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.onSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:70)
       at org.hibernate.impl.SessionImpl.fireSaveOrUpdate(SessionImpl.java:507)
       at org.hibernate.impl.SessionImpl.saveOrUpdate(SessionImpl.java:499)
       at org.hibernate.engine.CascadingAction$5.cascade(CascadingAction.java:218)
       at org.hibernate.engine.Cascade.cascadeToOne(Cascade.java:268)
       at org.hibernate.engine.Cascade.cascadeAssociation(Cascade.java:216)
       at org.hibernate.engine.Cascade.cascadeProperty(Cascade.java:169)
       at org.hibernate.engine.Cascade.cascade(Cascade.java:130)
       at org.hibernate.event.def.AbstractSaveEventListener.cascadeBeforeSave(AbstractSaveEventListener.java:431)
       at org.hibernate.event.def.AbstractSaveEventListener.performSaveOrReplicate(AbstractSaveEventListener.java:265)
       at org.hibernate.event.def.AbstractSaveEventListener.performSave(AbstractSaveEventListener.java:181)
       at org.hibernate.event.def.AbstractSaveEventListener.saveWithGeneratedId(AbstractSaveEventListener.java:107)
       at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.saveWithGeneratedOrRequestedId(DefaultSaveOrUpdateEventListener.java:187)
       at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.entityIsTransient(DefaultSaveOrUpdateEventListener.java:172)
       at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.performSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:94)
       at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.onSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:70)
       at org.hibernate.impl.SessionImpl.fireSaveOrUpdate(SessionImpl.java:507)
       at org.hibernate.impl.SessionImpl.saveOrUpdate(SessionImpl.java:499)
       at bioinformatics.biojava.BriefLoader.loadNSave(BriefLoader.java:108)
       at bioinformatics.biojava.BriefLoader.main(BriefLoader.java:72)
Caused by: java.sql.SQLException: Duplicate entry 'genbankBiosqlRich' for key 2
       at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2975)
       at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1600)
       at com.mysql.jdbc.ServerPreparedStatement.serverExecute(ServerPreparedStatement.java:1125)
       at com.mysql.jdbc.ServerPreparedStatement.executeInternal(ServerPreparedStatement.java:677)
       at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1357)
       at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1274)
       at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1259)
       at org.hibernate.id.IdentityGenerator$GetGeneratedKeysDelegate.executeAndExtract(IdentityGenerator.java:73)
       at org.hibernate.id.insert.AbstractReturningDelegate.performInsert(AbstractReturningDelegate.java:33)

-- 
Doug Brown - Bioinformatics
Fungal Genomics Laboratory
Center for Integrated Fungal Research
North Carolina State University
Campus Box 7251, Raleigh, NC 27695-7251
https://www.fungalgenomics.ncsu.edu/~debrown/
Tel: (919) 513-0394, Fax (919) 513-0024
e-mail: doug_brown at ncsu.edu

From dreher at molgen.mpg.de  Mon Apr 21 08:51:52 2008
From: dreher at molgen.mpg.de (Felix Dreher)
Date: Mon, 21 Apr 2008 14:51:52 +0200
Subject: [Biojava-l] mailing list archives
Message-ID: <480C8DE8.5040805@molgen.mpg.de>

Hello all,

Is there a way to search the BioJava mailing-list archives?
(The link provided on the BioJava homepage doesn't work:
http://search.open-bio.org/cgi-bin/mail-search.cgi)

Best regards,
Felix


From markjschreiber at gmail.com  Mon Apr 21 21:27:51 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Tue, 22 Apr 2008 09:27:51 +0800
Subject: [Biojava-l] loading multiple records for same organism and
	persistence in BioSQL
In-Reply-To: <4805E9B8.6000709@unity.ncsu.edu>
References: <4805E9B8.6000709@unity.ncsu.edu>
Message-ID: <93b45ca50804211827i42924c9rb1d5d2f85c25a1ca@mail.gmail.com>

Hi Doug -

Has anyone provided a solution for this yet?

I haven't used the Hibernate bindings to BioSQL recently (I'm actually
working on a JPA binding with EntityBeans), but when I did they worked well.
However, I have seen this type of error before. Clearly the entry
'genbankBiosqlRich' is being duplicated where it should be
unique. This looks unusual because you call saveOrUpdate, which should
be able to figure this out unless the way Hibernate determines
equality is not the same as the way BioJava does.
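
A minimal sketch of how such a mismatch can arise, assuming (as the
entityIsTransient frames in the stack trace suggest) that an object
without a database identifier is treated as new and inserted, whatever
its name. ToySession and ToyNamespace are hypothetical stand-ins for
illustration, not Hibernate or BioJava classes:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: none of these types belong to Hibernate or BioJava.
class SaveOrUpdateSketch {

    static final class ToyNamespace {
        Integer id;          // stays null until a row has been inserted
        final String name;   // the column that carries a unique key

        ToyNamespace(String name) {
            this.name = name;
        }
    }

    static final class ToySession {
        private final Map<String, Integer> uniqueNames = new HashMap<>();
        private int nextId = 1;

        // Decides by identifier, not by name: no id means "transient",
        // so the object is inserted even if the name is already stored.
        String saveOrUpdate(ToyNamespace ns) {
            if (ns.id == null) {
                if (uniqueNames.containsKey(ns.name)) {
                    throw new IllegalStateException(
                        "Duplicate entry '" + ns.name + "' for unique key");
                }
                ns.id = nextId++;
                uniqueNames.put(ns.name, ns.id);
                return "INSERT";
            }
            return "UPDATE";
        }
    }

    public static void main(String[] args) {
        ToySession session = new ToySession();

        // First file: a fresh object with this name is inserted.
        ToyNamespace first = new ToyNamespace("genbankBiosqlRich");
        System.out.println(session.saveOrUpdate(first));

        // Second file: another fresh object with the same name also has
        // no id, so an insert is attempted and the unique key rejects it.
        try {
            session.saveOrUpdate(new ToyNamespace("genbankBiosqlRich"));
        } catch (IllegalStateException ex) {
            System.out.println(ex.getMessage());
        }
    }
}
```

In this model the name never enters the insert-or-update decision, which
would explain why two objects that BioJava considers equal can still
produce two inserts.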

What happens if you try to save them one at a time (in sequential runs
of your program)? From the look of the stack trace you might see the
same error.

Also, it might pay to look at the biojavax documentation on
http://biojava.org/wiki/BioJava:BioJavaXDocs#BioSQL_and_Hibernate.

Any Hibernate experts able to offer an opinion here??
- Mark


From markjschreiber at gmail.com  Mon Apr 21 21:39:39 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Tue, 22 Apr 2008 09:39:39 +0800
Subject: [Biojava-l] Searching mailing list archives
Message-ID: <93b45ca50804211839lddf0a7je47c337c5250c85d@mail.gmail.com>

Dear Felix -

The URL has been updated. Please see below.

- Mark

---------- Forwarded message ----------
From: Mauricio Herrera Cuadra via RT <support at helpdesk.open-bio.org>
Date: Tue, Apr 22, 2008 at 9:20 AM
Subject: [O|B|F Helpdesk #507] Fwd: [Biojava-l] mailing list archives
To: markjschreiber at gmail.com
Cc: chris at bioteam.net, heikki at sanbi.ac.za, hlapp at gmx.net, jason at bioperl.org


OBF Search engine URL has been moved to: http://search.open-bio.org
I've updated the link in the BioJava wiki
(http://biojava.org/wiki/BioJava:MailingLists) with the new URL.

Please let the requestor know about the update.

Regards,
Mauricio.

From markjschreiber at gmail.com  Tue Apr 22 02:40:35 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Tue, 22 Apr 2008 14:40:35 +0800
Subject: [Biojava-l] loading multiple records for same organism and
	persistence in BioSQL
In-Reply-To: <480D7B55.3070708@uni-tuebingen.de>
References: <4805E9B8.6000709@unity.ncsu.edu>
	<93b45ca50804211827i42924c9rb1d5d2f85c25a1ca@mail.gmail.com>
	<480D7B55.3070708@uni-tuebingen.de>
Message-ID: <93b45ca50804212340s708df36ek73cdf3337b957316@mail.gmail.com>

Hi -

Can someone update the docs on biojava.org to reflect this requirement?

Thanks,

- Mark

On Tue, Apr 22, 2008 at 1:44 PM, Andreas Dräger
<andreas.draeger at uni-tuebingen.de> wrote:
> Hi Doug,
>
> We also had the same problem. The solution is simple. You always construct a
> new SimpleNamespace, each time with the same name. Your code will work if
> you do one of the following:
> 1. You can load the namespace with the name from the database and set this
> namespace to the parser.
> 2. You can use the default namespace from the RichObjectFactory or
> 3. Just use the parser method, which does not require any namespaces - this
> method actually uses the default namespace (so three is actually equal to
> two).
> This should help.
>
> Cheers
> Andreas
>


From debrown at unity.ncsu.edu  Tue Apr 22 08:22:16 2008
From: debrown at unity.ncsu.edu (Doug Brown)
Date: Tue, 22 Apr 2008 08:22:16 -0400
Subject: [Biojava-l] mailing list archives
In-Reply-To: <480C8DE8.5040805@molgen.mpg.de>
References: <480C8DE8.5040805@molgen.mpg.de>
Message-ID: <480DD878.2000100@unity.ncsu.edu>

Hi Felix,

In addition to the http://search.open-bio.org link mentioned by
Mauricio Herrera Cuadra, you could use Google directly with search 
expressions similar to:

    "[biojava-l]" site:portal.open-bio.org
    "[biojava-dev]" site:portal.open-bio.org

Of course, you need to add on any additional search terms to limit the 
results.
In general see: http://www.google.com/advanced_search

Regards,
Doug


Felix Dreher wrote:
> Hello all,
>
> is there a possibility to query the biojava mailing-list archives?
> (the link provided on the biojava-homepage doesn't work:
> http://search.open-bio.org/cgi-bin/mail-search.cgi)
>
> Best regards,
> Felix

-- 
Doug Brown - Bioinformatics
Fungal Genomics Laboratory
Center for Integrated Fungal Research
North Carolina State University
Campus Box 7251, Raleigh, NC 27695-7251
https://www.fungalgenomics.ncsu.edu/~debrown/
Tel: (919) 513-0394, Fax (919) 513-0024
e-mail: doug_brown at ncsu.edu


From markjschreiber at gmail.com  Wed Apr 23 03:22:14 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 23 Apr 2008 15:22:14 +0800
Subject: [Biojava-l] Suspicious Headers
Message-ID: <93b45ca50804230022j223ed4eetf219e21d35cd51dc@mail.gmail.com>

Hi -

If you have ever tried posting to the list and had your email bounced
back with a message complaining about a suspicious header, chances are
your email has been bounced by our spam filter.  There are generally 2
reasons why this might happen:

1. Your email is HTML; mail to the list must be text only.

2. Your email has an attachment, possibly a .vcf file or an image in
your email signature (company logo or similar).

If you keep it plain text it should get through.

- Mark

From mail at florianschatz.de  Wed Apr 23 09:49:05 2008
From: mail at florianschatz.de (Florian Schatz)
Date: Wed, 23 Apr 2008 15:49:05 +0200
Subject: [Biojava-l] Extract non-gene regions
Message-ID: <B8FE6090-06D9-4082-9529-3AB4680D05A8@florianschatz.de>

Hello,

I am new to biojava and have worked a lot with it in the last few weeks.
I hope this is the right place for questions; if not, please tell me.

I want to get the nucleotide sequence outside the genes of a GenBank
file, i.e. everything that is not marked by a 'gene' feature.
Unfortunately, there is no subtract or exclude function for the
Location class. Any hints?

Btw: union() of locations worked fine for extracting the nucleotides of
the genes only.

Best,
Florian

From markjschreiber at gmail.com  Wed Apr 23 22:29:12 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Thu, 24 Apr 2008 10:29:12 +0800
Subject: [Biojava-l] Extract non-gene regions
In-Reply-To: <B8FE6090-06D9-4082-9529-3AB4680D05A8@florianschatz.de>
References: <B8FE6090-06D9-4082-9529-3AB4680D05A8@florianschatz.de>
Message-ID: <93b45ca50804231929h7a7c31aaudccbc5325d6be5f3@mail.gmail.com>

Hi Florian -

There are at least two approaches. You are on the right track with
making a union of all gene locations.  The compound location that
results from the Union will contain all the nucleotides that are
coding. You can then iterate through each nucleotide in the genome and
find out if the union contains the nucleotide. If it doesn't then it
is non coding.  This is surprisingly rapid as the comparisons are
simple.  The pseudo code would be something like...

RichLocation coding; // initialize this by making a union of all
                     // locations of CDS or Gene features

RichSequence genome; // read from file or database

// you might need to be a bit more sophisticated for a circular genome
for (int i = 1; i <= genome.length(); i++) {
    if (!coding.contains(i)) {
        // you have a non-coding nucleotide
    }
}

The other approach is to use the blockIterator() method of the
compound location that results from the union of coding sequences.
This will output each contiguous chunk of coding sequence. If you know
the length of the sequence then you can rapidly figure out the
intervening pieces.

For example, if the block iterator tells you that [10..50], [90..100],
[350..380] are coding and you know the genome is of length 400 then
you can quickly derive [1..9], [51..89], [101..349] and [381..400] are
non-coding.  Again it is more complicated for circular sequences and
more complex if you consider the opposite strand of a gene (the gene
shadow) to be non-coding. Unfortunately there is no convenience method
to do this but if you code something up it would be great to put it in
the cookbook so others can re-use it.
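
The block arithmetic above can be sketched in plain Java. This is only
an illustration under stated assumptions: the class and method names
(NonCodingBlocks, complement) are hypothetical and not part of BioJava,
blocks are represented as {start, end} int pairs rather than Location
objects, the blocks are assumed sorted and non-overlapping, and the
sequence is assumed linear.

```java
import java.util.ArrayList;
import java.util.List;

public class NonCodingBlocks {

    /**
     * Returns the gaps between coding blocks as {start, end} pairs.
     * Coordinates are 1-based and inclusive. Assumes the blocks are
     * sorted by start, do not overlap, and that the sequence is linear
     * (no wrap-around handling for circular genomes).
     */
    public static List<int[]> complement(List<int[]> codingBlocks, int sequenceLength) {
        List<int[]> nonCoding = new ArrayList<>();
        int next = 1; // first position not yet covered by a coding block
        for (int[] block : codingBlocks) {
            if (block[0] > next) {
                nonCoding.add(new int[] {next, block[0] - 1});
            }
            next = Math.max(next, block[1] + 1);
        }
        if (next <= sequenceLength) {
            nonCoding.add(new int[] {next, sequenceLength});
        }
        return nonCoding;
    }

    public static void main(String[] args) {
        List<int[]> coding = new ArrayList<>();
        coding.add(new int[] {10, 50});
        coding.add(new int[] {90, 100});
        coding.add(new int[] {350, 380});
        // Prints [1..9], [51..89], [101..349] and [381..400], one per line.
        for (int[] gap : complement(coding, 400)) {
            System.out.println("[" + gap[0] + ".." + gap[1] + "]");
        }
    }
}
```

With real BioJava objects the {start, end} pairs would come from the
block iterator of the union location instead of being typed in by hand.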

- Mark

You could actually make point locations of all the non-coding
nucleotides and then merge the whole lot at the end into a compound
location of non-coding positions.

On Wed, Apr 23, 2008 at 9:49 PM, Florian Schatz <mail at florianschatz.de> wrote:
> Hello,
>
>  I am new to biojava and worked a lot with in the last few weeks. I hope
> this is the right place for questions, if not please tell me.
>
>  I want to get the nucleotid sequence outside the genes of a genebank file.
> So everything that is not marked by a 'gene' feature.  Unfortunately, there
> is no sustract or exclude function for the Location class. Any hints?
>
>  Btw: union() of location worked fine for extracting nucleotids of the genes
> only.
>
>  Best,
>  Florian

From heuermh at acm.org  Thu Apr 24 00:09:46 2008
From: heuermh at acm.org (Michael Heuer)
Date: Thu, 24 Apr 2008 00:09:46 -0400 (EDT)
Subject: [Biojava-l] Extract non-gene regions
In-Reply-To: <93b45ca50804231929h7a7c31aaudccbc5325d6be5f3@mail.gmail.com>
Message-ID: <Pine.GSO.4.44.0804240007470.25626-100000@shell3.shore.net>

On Thu, 24 Apr 2008, Mark Schreiber wrote:

> Hi Florian -
>
> There are at least two approaches. You are on the right track with
> making a union of all gene locations.  The compound location that
> results from the Union will contain all the nucleotides that are
> coding. You can then iterate through each nucleotide in the genome and
> find out if the union contains the nucleotide. If it doesn't then it
> is non coding.  This is surprisingly rapid as the comparisons are
> simple.  The pseudo code would be something like...
>
> RichLocation coding; //initialize this by making a union of all
> locations of CDS or Gene Features.
>
> RichSequence genome; // read from file or database
>
> for(int i = 1; i <= genome.lenght(); i++){  //you might need to be a
> bit more sophisticated for a circular genome
>     if( ! genome.contains(i){
>          //you have a non-coding nucleotide.
>     }
> }

typo?

  if (!coding.contains(i)) {
    // you have a non-coding nucleotide.
  }


> The other approach is to use the blockIterator() method of the
> compound location that results from the union of coding sequences.
> This will output each contiguous chunk of coding sequence. If you know
> the length of the sequence then you can rapidly figure out the
> intervening pieces.
>
> For example, if the block iterator tells you that [10..50], [90..100],
> [350..380] are coding and you know the genome is of length 400 then
> you can quickly derive [1..9], [51..89], [101..349] and [381..400] are
> non-coding.  Again it is more complicated for circular sequences and
> more complex if you consider the opposite strand of a gene (the gene
> shadow) to be non-coding. Unfortunately there is no convenience method
> to do this but if you code something up it would be great to put it in
> the cookbook so others can re-use it.
>
> - Mark
>
> You could actually make point locations of all the non-coding
> nucleotides and then merge the whole lot at the end into a compound
> location of non-coding
>
> On Wed, Apr 23, 2008 at 9:49 PM, Florian Schatz <mail at florianschatz.de> wrote:
> > Hello,
> >
> >  I am new to biojava and worked a lot with in the last few weeks. I hope
> > this is the right place for questions, if not please tell me.
> >
> >  I want to get the nucleotid sequence outside the genes of a genebank file.
> > So everything that is not marked by a 'gene' feature.  Unfortunately, there
> > is no sustract or exclude function for the Location class. Any hints?
> >
> >  Btw: union() of location worked fine for extracting nucleotids of the genes
> > only.
> >
> >  Best,
> >  Florian


From andreas.draeger at uni-tuebingen.de  Thu Apr 24 03:26:31 2008
From: andreas.draeger at uni-tuebingen.de (=?ISO-8859-1?Q?Andreas_Dr=E4ger?=)
Date: Thu, 24 Apr 2008 09:26:31 +0200
Subject: [Biojava-l] Problem while parsing GenBank-like files and persiting
 them using Hibernate
Message-ID: <48103627.90505@uni-tuebingen.de>

Dear all,

Recently I downloaded some GenBank-like files from the Ensembl web site
(http://www.ensembl.org/index.html) and noticed that the format used
on this site diverges slightly from what one gets from NCBI.
In particular, the ACCESSION number is not valid according to the pattern
matcher in class org.biojavax.bio.seq.io.GenbankFormat, and the files
therefore cannot be parsed using RichSequence.IOTools.

This issue has been discussed on this list before, but the suggested
solution was to use files from NCBI rather than from Ensembl.
However, the files from Ensembl are important because they contain
additional annotation not provided by NCBI, for instance the feature
"exon".

The old parsers from the org.biojava.bio.seq.io package are able to read
the files from this site. The Sequence objects can be enriched afterwards
and written to another GenBank file. However, this again results in a
file that cannot be stored in a BioSQL database using Hibernate, because
of the invalid accession number. The next problem is that even the old
parsers do not treat this "rich" information from the Ensembl files
properly: the feature "exon" becomes "any" when the sequence is enriched
and written to a new GenBank file. Hence the benefit of the Ensembl
annotation gets lost during parsing and conversion. By the way, Ensembl
also offers EMBL-like files and other formats, which have the same
problems as mentioned above.

On the other hand, no matter which BioJavaX parser I look up in
the API documentation, I always find a corresponding "Term" class,
which states that it "Implements some ...-specific terms", where
the dots stand for the format in question, such as UniProt, GenBank,
EMBL and so forth. None of these Term classes provides any setters or
add-methods that would allow one to define a new term like "exon". The
structure of the parsers seems very sophisticated to me, and it is
not easy to extend the parsers or Term classes for one's own purposes.

Therefore, I would like to ask the following questions:
1. Is there a way to read in files downloaded from Ensembl using only
the designated BioJavaX classes?
2. How can I extend the terms so that not only "SOME X-specific terms"
are included, but some more? And how do I tell the parser to use and
apply these terms? Or more generally, can I somehow read in an ontology
(for instance the GO), persist it in BioSQL and make use of the terms
contained therein?
3. How can I persist a sequence from Ensembl in a BioSQL database
using Hibernate even though Ensembl uses different accession numbers?
I am grateful for any answers.

Cheers
Andreas

From mail at florianschatz.de  Thu Apr 24 08:09:24 2008
From: mail at florianschatz.de (Florian Schatz)
Date: Thu, 24 Apr 2008 14:09:24 +0200
Subject: [Biojava-l] Extract non-gene regions
In-Reply-To: <93b45ca50804231929h7a7c31aaudccbc5325d6be5f3@mail.gmail.com>
References: <B8FE6090-06D9-4082-9529-3AB4680D05A8@florianschatz.de>
	<93b45ca50804231929h7a7c31aaudccbc5325d6be5f3@mail.gmail.com>
Message-ID: <BD0D4981-C9D8-42DD-9C76-77C0C8113FE5@florianschatz.de>

Hello,

I tried that, but it is as slow as a version operating on Strings.
However, I created a Cookbook entry:
http://biojava.org/wiki/BioJava:Cookbook:Sequence:ExtractGeneRegions

Is there a better way to get a Sequence from a SymbolList than:

Sequence newsequence = DNATools.createDNASequence(symbolL.seqString(),
                                                  "New Sequence");

Best,
Florian

On 24.04.2008 at 04:29, Mark Schreiber wrote:
> Hi Florian -
>
> There are at least two approaches. You are on the right track with
> making a union of all gene locations.  The compound location that
> results from the Union will contain all the nucleotides that are
> coding. You can then iterate through each nucleotide in the genome and
> find out if the union contains the nucleotide. If it doesn't then it
> is non coding.  This is surprisingly rapid as the comparisons are
> simple.  The pseudo code would be something like...
>
> RichLocation coding; //initialize this by making a union of all
> locations of CDS or Gene Features.
>
> RichSequence genome; // read from file or database
>
> for(int i = 1; i <= genome.lenght(); i++){  //you might need to be a
> bit more sophisticated for a circular genome
>     if( ! genome.contains(i){
>          //you have a non-coding nucleotide.
>     }
> }
>
> The other approach is to use the blockIterator() method of the
> compound location that results from the union of coding sequences.
> This will output each contiguous chunk of coding sequence. If you know
> the length of the sequence then you can rapidly figure out the
> intervening pieces.
>
> For example, if the block iterator tells you that [10..50], [90..100],
> [350..380] are coding and you know the genome is of length 400 then
> you can quickly derive [1..9], [51..89], [101..349] and [381..400] are
> non-coding.  Again it is more complicated for circular sequences and
> more complex if you consider the opposite strand of a gene (the gene
> shadow) to be non-coding. Unfortunately there is no convenience method
> to do this but if you code something up it would be great to put it in
> the cookbook so others can re-use it.
>
> - Mark
>
> You could actually make point locations of all the non-coding
> nucleotides and then merge the whole lot at the end into a compound
> location of non-coding
>
> On Wed, Apr 23, 2008 at 9:49 PM, Florian Schatz  
> <mail at florianschatz.de> wrote:
>> Hello,
>>
>>  I am new to biojava and worked a lot with in the last few weeks.  
>> I hope
>> this is the right place for questions, if not please tell me.
>>
>>  I want to get the nucleotid sequence outside the genes of a  
>> genebank file.
>> So everything that is not marked by a 'gene' feature.   
>> Unfortunately, there
>> is no sustract or exclude function for the Location class. Any hints?
>>
>>  Btw: union() of location worked fine for extracting nucleotids of  
>> the genes
>> only.
>>
>>  Best,
>>  Florian


From markjschreiber at gmail.com  Thu Apr 24 08:47:59 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Thu, 24 Apr 2008 20:47:59 +0800
Subject: [Biojava-l] Extract non-gene regions
In-Reply-To: <BD0D4981-C9D8-42DD-9C76-77C0C8113FE5@florianschatz.de>
References: <B8FE6090-06D9-4082-9529-3AB4680D05A8@florianschatz.de>
	<93b45ca50804231929h7a7c31aaudccbc5325d6be5f3@mail.gmail.com>
	<BD0D4981-C9D8-42DD-9C76-77C0C8113FE5@florianschatz.de>
Message-ID: <93b45ca50804240547gefb750fh493a01a8d0edbe22@mail.gmail.com>

Hi -

While Sequences and SymbolLists offer many advantages over Strings or
character arrays, speed is not one of them.

Converting to a String and back to symbols again is a very expensive
operation. You can instead create a Sequence using the SequenceFactory
implementations, which are much more efficient. From memory,
SimpleRichSequence may even have a constructor that takes a SymbolList
and a name. Either way, there should be no need to convert to a String
and back.

Also, do you need a Sequence when a SymbolList may contain all the
information you need?

Finally, the Edit operations you use in your wiki example will cause
quite a big performance hit; your comment seems to allude to this. It
would be better to collect all the non-coding points (i), compile
them into a compound location, and then extract the SymbolList for that
location all in one go.
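
A rough sketch of that last suggestion, using only the Java standard
library. It rests on several assumptions: a java.util.BitSet stands in
for the coding location, a plain String stands in for the SymbolList,
and the names NonCodingRuns, codingMask, nonCodingRuns and extract are
made up for this illustration rather than taken from BioJava. The point
is only the shape of the approach: mark the coding positions once, merge
consecutive non-coding points into runs, then extract everything in a
single pass instead of editing the sequence repeatedly.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class NonCodingRuns {

    /** Marks every position covered by a coding block (1-based, inclusive). */
    public static BitSet codingMask(List<int[]> codingBlocks, int sequenceLength) {
        BitSet mask = new BitSet(sequenceLength + 1);
        for (int[] block : codingBlocks) {
            mask.set(block[0], block[1] + 1); // the upper bound of set(from, to) is exclusive
        }
        return mask;
    }

    /** Merges consecutive non-coding positions into {start, end} runs. */
    public static List<int[]> nonCodingRuns(BitSet coding, int sequenceLength) {
        List<int[]> runs = new ArrayList<>();
        int start = coding.nextClearBit(1);
        while (start <= sequenceLength) {
            int nextCoding = coding.nextSetBit(start); // -1 if no coding position follows
            int end = (nextCoding == -1) ? sequenceLength
                                         : Math.min(nextCoding - 1, sequenceLength);
            runs.add(new int[] {start, end});
            start = (nextCoding == -1) ? sequenceLength + 1
                                       : coding.nextClearBit(nextCoding);
        }
        return runs;
    }

    /** Concatenates all runs in one pass; the String stands in for a SymbolList. */
    public static String extract(String sequence, List<int[]> runs) {
        StringBuilder out = new StringBuilder();
        for (int[] run : runs) {
            out.append(sequence, run[0] - 1, run[1]); // 1-based inclusive to 0-based exclusive
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String genome = "AAACCCGGGTTT";
        List<int[]> coding = new ArrayList<>();
        coding.add(new int[] {4, 6});
        coding.add(new int[] {10, 12});
        BitSet mask = codingMask(coding, genome.length());
        System.out.println(extract(genome, nonCodingRuns(mask, genome.length())));
    }
}
```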

- Mark

On Thu, Apr 24, 2008 at 8:09 PM, Florian Schatz <mail at florianschatz.de> wrote:
> Hello,
>
>  I tried that, but is as slow as a version operating on Strings.. however, I
> created a Cookbook entry:
>  http://biojava.org/wiki/BioJava:Cookbook:Sequence:ExtractGeneRegions
>
>  Is there a better way to get a Sequence from a SybolList than:
>
>  Sequence newsequence = DNATools.createDNASequence(symbolL.seqString(), "New
> Sequence");
>
>
>  Best,
>  Florian
>
>  On 24.04.2008 at 04:29, Mark Schreiber wrote:
>
> > Hi Florian -
> >
> >
> >
> >
> > There are at least two approaches. You are on the right track with
> > making a union of all gene locations.  The compound location that
> > results from the Union will contain all the nucleotides that are
> > coding. You can then iterate through each nucleotide in the genome and
> > find out if the union contains the nucleotide. If it doesn't then it
> > is non coding.  This is surprisingly rapid as the comparisons are
> > simple.  The pseudo code would be something like...
> >
> > RichLocation coding; //initialize this by making a union of all
> > locations of CDS or Gene Features.
> >
> > RichSequence genome; // read from file or database
> >
> > for(int i = 1; i <= genome.lenght(); i++){  //you might need to be a
> > bit more sophisticated for a circular genome
> >    if( ! genome.contains(i){
> >         //you have a non-coding nucleotide.
> >    }
> > }
> >
> > The other approach is to use the blockIterator() method of the
> > compound location that results from the union of coding sequences.
> > This will output each contiguous chunk of coding sequence. If you know
> > the length of the sequence then you can rapidly figure out the
> > intervening pieces.
> >
> > For example, if the block iterator tells you that [10..50], [90..100],
> > [350..380] are coding and you know the genome is of length 400 then
> > you can quickly derive [1..9], [51..89], [101..349] and [381..400] are
> > non-coding.  Again it is more complicated for circular sequences and
> > more complex if you consider the opposite strand of a gene (the gene
> > shadow) to be non-coding. Unfortunately there is no convenience method
> > to do this but if you code something up it would be great to put it in
> > the cookbook so others can re-use it.
> >
> > - Mark
> >
> > You could actually make point locations of all the non-coding
> > nucleotides and then merge the whole lot at the end into a compound
> > location of non-coding
> >
> > On Wed, Apr 23, 2008 at 9:49 PM, Florian Schatz <mail at florianschatz.de>
> wrote:
> >
> > > Hello,
> > >
> > >  I am new to biojava and worked a lot with in the last few weeks. I hope
> > > this is the right place for questions, if not please tell me.
> > >
> > >  I want to get the nucleotid sequence outside the genes of a genebank
> file.
> > > So everything that is not marked by a 'gene' feature.  Unfortunately,
> there
> > > is no sustract or exclude function for the Location class. Any hints?
> > >
> > >  Btw: union() of location worked fine for extracting nucleotids of the
> genes
> > > only.
> > >
> > >  Best,
> > >  Florian

From hlapp at gmx.net  Thu Apr 24 18:25:29 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 24 Apr 2008 18:25:29 -0400
Subject: [Biojava-l] Problem while parsing GenBank-like files and
	persiting them using Hibernate
In-Reply-To: <48103627.90505@uni-tuebingen.de>
References: <48103627.90505@uni-tuebingen.de>
Message-ID: <59C8D987-F6F8-4E83-B70D-127D38DCC0C9@gmx.net>

Hi Andreas,

On Apr 24, 2008, at 3:26 AM, Andreas Dräger wrote:
> Or more generally, can I somehow read in an ontology (for instance  
> the GO), persist it in BioSQL and make use of the terms contained  
> therein?


In principle, you absolutely can. Whether Biojava lets you do this I do
not know, but I would suppose so. (BioPerl has a script, load_ontology.pl,
in its Bioperl-db package that does this.)

Your other questions all seem Biojava-specific, so I'll leave them to  
the list and Mark & Richard.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From andreas.draeger at uni-tuebingen.de  Tue Apr 22 01:45:02 2008
From: andreas.draeger at uni-tuebingen.de (=?ISO-8859-1?Q?Andreas_Dr=E4ger?=)
Date: Tue, 22 Apr 2008 05:45:02 -0000
Subject: [Biojava-l] loading multiple records for same organism
 and	peristance in BioSQL
In-Reply-To: <93b45ca50804211827i42924c9rb1d5d2f85c25a1ca@mail.gmail.com>
References: <4805E9B8.6000709@unity.ncsu.edu>
	<93b45ca50804211827i42924c9rb1d5d2f85c25a1ca@mail.gmail.com>
Message-ID: <480D7B55.3070708@uni-tuebingen.de>

Hi Doug,

We had the same problem. The cause is simple: you always
construct a new SimpleNamespace, each time with the same name. Your code
will work if you do one of the following:
1. Load the namespace with that name from the database and pass
this namespace to the parser.
2. Use the default namespace from the RichObjectFactory.
3. Use the parser method that does not require a namespace;
this method actually uses the default namespace (so option three is
effectively the same as option two).
This should help.
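
To illustrate why a fresh namespace object per file fails while a
looked-up one works, here is a simplified stand-in in plain Java. It is
not BioJava or Hibernate code: NamespaceRegistry, insertNew and
getOrCreate are hypothetical names, and the HashMap merely imitates a
namespace table with a unique constraint on the name. In BioJavaX the
lookup role would presumably be played by the RichObjectFactory, as
described in the options above.

```java
import java.util.HashMap;
import java.util.Map;

public class NamespaceRegistry {

    /** Hypothetical stand-in for a persisted namespace; not a BioJava class. */
    public static final class Namespace {
        final String name;

        Namespace(String name) {
            this.name = name;
        }
    }

    // Imitates the namespace table, which allows each name only once.
    private final Map<String, Namespace> table = new HashMap<>();

    /** Imitates persisting a freshly constructed namespace: a repeated name fails. */
    public void insertNew(String name) {
        if (table.containsKey(name)) {
            throw new IllegalStateException("Duplicate entry '" + name + "'");
        }
        table.put(name, new Namespace(name));
    }

    /** Imitates loading the existing namespace by name, creating it only if absent. */
    public Namespace getOrCreate(String name) {
        Namespace ns = table.get(name);
        if (ns == null) {
            ns = new Namespace(name);
            table.put(name, ns);
        }
        return ns;
    }

    public static void main(String[] args) {
        NamespaceRegistry registry = new NamespaceRegistry();
        // First file: the namespace does not exist yet, so it is created.
        Namespace first = registry.getOrCreate("genbankBiosqlRich");
        // Second file: the same object is reused, so nothing is inserted twice.
        Namespace second = registry.getOrCreate("genbankBiosqlRich");
        System.out.println(first == second);
    }
}
```

Calling insertNew twice with the same name reproduces the failure mode;
calling getOrCreate twice does not.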

Cheers
Andreas

From budhaditya21 at yahoo.co.in  Tue Apr 29 01:29:23 2008
From: budhaditya21 at yahoo.co.in (arunabha banerjee)
Date: Tue, 29 Apr 2008 05:29:23 -0000
Subject: [Biojava-l] problems installing biojava on Windows XP professional
Message-ID: <406135.22320.qm@web94608.mail.in2.yahoo.com>

An HTML attachment was scrubbed...
URL: <http://lists.open-bio.org/pipermail/biojava-l/attachments/20080429/32d4f319/attachment-0001.html>

From debrown at unity.ncsu.edu  Wed Apr 16 11:57:44 2008
From: debrown at unity.ncsu.edu (Doug Brown)
Date: Wed, 16 Apr 2008 07:57:44 -0400
Subject: [Biojava-l] loading multiple records for same organism and
	peristance in BioSQL
Message-ID: <4805E9B8.6000709@unity.ncsu.edu>

Greetings,
I am happily climbing the learning curve for Biojava-live, Biojavax,
and BioSQL. I believe that I am using the latest releases, Biojava 1.6
and BioSQL 1.0, in that I have performed the installation within the
past week.

I am attempting to load, via Biojavax, multiple genbank records for the
same organism (a whole genome's worth of annotations) and to save those
into a BioSQL database via Biojavax's Hibernate persistence mechanism.
Loading a second genbank file (same organism, different sequence) croaks
with the error: SEVERE: Duplicate entry 'genbankBiosqlRich' for key 2
...... could not insert: [Namespace]. FYI, two sample GenBank records
are CH476760.gb and CH476761.gb, both obtained directly from GenBank.

Never having used Hibernate before nor its type of database abstraction,
I think that I am properly handling the transaction semantics. Either I
am violating unspoken presumptions of the persistence paradigm or the
behavior of RichSequence.IOTools.readGenbankDNA is not what I expected.
I had presumed that the above routine would use the established
RichObjectFactory to obtain new or extant objects and then populate
those objects with values from the sequence file. This only seems to
happen when I load multiple sequences from a single file. Multi file
operations fail dismally.

What is the proper way of using Biojava to load up a database with records?

In advance, thank you all for the traffic on this list, it has been
quite helpful in bringing me up to speed.

Regards,
Doug Brown

Here is the relevant [hacked] subroutine:
  // imports needed at the top of the enclosing class file
  import java.io.BufferedReader;
  import java.io.File;
  import java.io.FileNotFoundException;
  import java.io.FileReader;
  import org.biojava.bio.BioException;
  import org.biojavax.RichObjectFactory;
  import org.biojavax.SimpleNamespace;
  import org.biojavax.bio.seq.RichSequence;
  import org.biojavax.bio.seq.RichSequenceIterator;
  import org.hibernate.HibernateException;
  import org.hibernate.Session;
  import org.hibernate.Transaction;

  /**
   * This works for genbank files containing multiple sequences.
   * Original concept from:
   * http://portal.open-bio.org/pipermail/biojava-l/2007-April/005824.html
   * It fails on inserting existing record(s) - does not replace...
   * This causes grief when loading multiple files...
   */
  public void loadNSave( Session session, File fileName)
    {
    boolean localSession = (session == null);
    Transaction tx = null;

    try
      {
      System.out.println( "*********** Loading "+fileName+"...");
      BufferedReader br = new BufferedReader( new FileReader( fileName) );

      if ( session == null)  // create a local session
        {
        session = sessionFactory.openSession();
        RichObjectFactory.connectToBioSQL(session);
        }

      // load the objects. I expect this to use the established factory.
      RichSequenceIterator rsi = RichSequence.IOTools.readGenbankDNA(
          br, new SimpleNamespace( "genbankBiosqlRich") );

      while ( rsi.hasNext() )
        {
        tx = session.beginTransaction(); // Hibernate requires transactions.

        System.out.println( "*********** Loading next sequence...");
        // ??should automatically fetch existing objects from the database...
        RichSequence sequence = rsi.nextRichSequence();
        System.out.println( "loaded sequence "+sequence.getAccession()
            +", identifier: "+ sequence.getIdentifier());

        try
          {
          System.out.println( "*********** saving...");

          // synchronize in-memory representation w/ the database
          // HUGE amounts of time spent doing selects on keys - really
          // slows things down!!
          session.saveOrUpdate( "Sequence", sequence );
          tx.commit();    // save to database - does an automatic flush
          // batch operations overwhelm the cache - clear it out!
          session.flush();  // force in-memory to disk.
          session.clear();  // clean out cache.
          }
        catch (HibernateException ex)
          {
          tx.rollback();   // discard the sequence and all its annotations
          ex.printStackTrace();
          }
        }
      }
    catch (FileNotFoundException ex)
      {
      ex.printStackTrace();
      }
    catch ( BioException bex)
      {
      bex.printStackTrace();
      }
    finally
      {
      // session is still null if the file could not be opened
      if ( localSession && session != null)
        {
        session.flush();  // force in-memory to disk.
        session.close();  // only for local sessions
        }
      }
    }

and the following is a sample stack dump:

org.hibernate.exception.ConstraintViolationException: could not insert: [Namespace]
       at org.hibernate.exception.SQLStateConverter.convert(SQLStateConverter.java:71)
       at org.hibernate.exception.JDBCExceptionHelper.convert(JDBCExceptionHelper.java:43)
       at org.hibernate.id.insert.AbstractReturningDelegate.performInsert(AbstractReturningDelegate.java:40)
       at org.hibernate.persister.entity.AbstractEntityPersister.insert(AbstractEntityPersister.java:2163)
       at org.hibernate.persister.entity.AbstractEntityPersister.insert(AbstractEntityPersister.java:2643)
       at org.hibernate.action.EntityIdentityInsertAction.execute(EntityIdentityInsertAction.java:51)
       at org.hibernate.engine.ActionQueue.execute(ActionQueue.java:279)
       at org.hibernate.event.def.AbstractSaveEventListener.performSaveOrReplicate(AbstractSaveEventListener.java:298)
       at org.hibernate.event.def.AbstractSaveEventListener.performSave(AbstractSaveEventListener.java:181)
       at org.hibernate.event.def.AbstractSaveEventListener.saveWithGeneratedId(AbstractSaveEventListener.java:107)
       at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.saveWithGeneratedOrRequestedId(DefaultSaveOrUpdateEventListener.java:187)
       at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.entityIsTransient(DefaultSaveOrUpdateEventListener.java:172)
       at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.performSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:94)
       at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.onSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:70)
       at org.hibernate.impl.SessionImpl.fireSaveOrUpdate(SessionImpl.java:507)
       at org.hibernate.impl.SessionImpl.saveOrUpdate(SessionImpl.java:499)
       at org.hibernate.engine.CascadingAction$5.cascade(CascadingAction.java:218)
       at org.hibernate.engine.Cascade.cascadeToOne(Cascade.java:268)
       at org.hibernate.engine.Cascade.cascadeAssociation(Cascade.java:216)
       at org.hibernate.engine.Cascade.cascadeProperty(Cascade.java:169)
       at org.hibernate.engine.Cascade.cascade(Cascade.java:130)
       at org.hibernate.event.def.AbstractSaveEventListener.cascadeBeforeSave(AbstractSaveEventListener.java:431)
       at org.hibernate.event.def.AbstractSaveEventListener.performSaveOrReplicate(AbstractSaveEventListener.java:265)
       at org.hibernate.event.def.AbstractSaveEventListener.performSave(AbstractSaveEventListener.java:181)
       at org.hibernate.event.def.AbstractSaveEventListener.saveWithGeneratedId(AbstractSaveEventListener.java:107)
       at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.saveWithGeneratedOrRequestedId(DefaultSaveOrUpdateEventListener.java:187)
       at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.entityIsTransient(DefaultSaveOrUpdateEventListener.java:172)
       at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.performSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:94)
       at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.onSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:70)
       at org.hibernate.impl.SessionImpl.fireSaveOrUpdate(SessionImpl.java:507)
       at org.hibernate.impl.SessionImpl.saveOrUpdate(SessionImpl.java:499)
       at bioinformatics.biojava.BriefLoader.loadNSave(BriefLoader.java:108)
       at bioinformatics.biojava.BriefLoader.main(BriefLoader.java:72)
Caused by: java.sql.SQLException: Duplicate entry 'genbankBiosqlRich' for key 2
       at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2975)
       at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1600)
       at com.mysql.jdbc.ServerPreparedStatement.serverExecute(ServerPreparedStatement.java:1125)
       at com.mysql.jdbc.ServerPreparedStatement.executeInternal(ServerPreparedStatement.java:677)
       at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1357)
       at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1274)
       at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1259)
       at org.hibernate.id.IdentityGenerator$GetGeneratedKeysDelegate.executeAndExtract(IdentityGenerator.java:73)
       at org.hibernate.id.insert.AbstractReturningDelegate.performInsert(AbstractReturningDelegate.java:33)


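In a trace like the one above, the informative line is the innermost "Caused by:" entry, here the MySQL unique-key violation. A minimal sketch of pulling that root cause out programmatically follows. It uses only the standard Throwable.getCause() chain; the class name RootCauseDemo and the plain RuntimeException standing in for Hibernate's ConstraintViolationException are assumptions made for illustration.

```java
import java.sql.SQLException;

class RootCauseDemo {
    /** Follows getCause() links until the innermost exception is reached. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-ins for the two exceptions in the trace; the wrapper type is hypothetical.
        SQLException db = new SQLException("Duplicate entry 'genbankBiosqlRich' for key 2");
        RuntimeException wrapper = new RuntimeException("could not insert: [Namespace]", db);
        // Prints the message of the innermost exception, i.e. the database error.
        System.out.println(rootCause(wrapper).getMessage());
    }
}
```

Logging the root cause alongside the wrapper message makes it easier to see which table and which value violated the constraint.
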
-- 
Doug Brown - Bioinformatics
Fungal Genomics Laboratory
Center for Integrated Fungal Research
North Carolina State University
Campus Box 7251, Raleigh, NC 27695-7251
https://www.fungalgenomics.ncsu.edu/~debrown/
Tel: (919) 513-0394, Fax (919) 513-0024
e-mail: doug_brown at ncsu.edu


From dreher at molgen.mpg.de  Mon Apr 21 12:51:52 2008
From: dreher at molgen.mpg.de (Felix Dreher)
Date: Mon, 21 Apr 2008 14:51:52 +0200
Subject: [Biojava-l] mailing list archives
Message-ID: <480C8DE8.5040805@molgen.mpg.de>

Hello all,

is there a possibility to query the biojava mailing-list archives?
(the link provided on the biojava-homepage doesn't work:
http://search.open-bio.org/cgi-bin/mail-search.cgi)

Best regards,
Felix


From markjschreiber at gmail.com  Tue Apr 22 01:27:51 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Tue, 22 Apr 2008 09:27:51 +0800
Subject: [Biojava-l] loading multiple records for same organism and
	peristance in BioSQL
In-Reply-To: <4805E9B8.6000709@unity.ncsu.edu>
References: <4805E9B8.6000709@unity.ncsu.edu>
Message-ID: <93b45ca50804211827i42924c9rb1d5d2f85c25a1ca@mail.gmail.com>

Hi Doug -

Has anyone provided a solution for this yet?

I haven't used the Hibernate bindings to BioSQL recently (I'm actually
working on a JPA binding with EntityBeans), but when I did they worked
well. However, I have seen this type of error before. Clearly the entry
'genbankBiosqlRich' is being duplicated somewhere it should be unique.
This looks unusual because you call saveOrUpdate, which should be able
to figure this out unless the way Hibernate determines equality is not
the same as the way BioJava does.
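
The entityIsTransient frames in Doug's stack trace suggest that the Namespace object was treated as new, so an INSERT was attempted. The following toy model sketches that decision under the assumption that a save-or-update call looks at whether the object already carries an identifier, not at equals(). The classes Ns and NsTable are hypothetical stand-ins, not Hibernate or BioJava types.

```java
import java.util.HashMap;
import java.util.Map;

class SaveOrUpdateSketch {
    /** Hypothetical entity: id stays null until the row has been inserted. */
    static class Ns {
        Integer id;
        final String name;
        Ns(String name) { this.name = name; }
    }

    /** Toy table with a unique constraint on name. */
    static class NsTable {
        private final Map<String, Integer> idByName = new HashMap<>();
        private int nextId = 1;

        /** Returns "insert" or "update"; throws if an insert would duplicate a name. */
        String saveOrUpdate(Ns ns) {
            if (ns.id == null) {                 // no identifier: treated as new
                if (idByName.containsKey(ns.name)) {
                    throw new IllegalStateException("Duplicate entry '" + ns.name + "'");
                }
                ns.id = nextId++;
                idByName.put(ns.name, ns.id);
                return "insert";
            }
            return "update";                     // identifier present: treated as existing
        }
    }

    public static void main(String[] args) {
        NsTable table = new NsTable();
        Ns first = new Ns("genbankBiosqlRich");
        System.out.println(table.saveOrUpdate(first));   // first call inserts
        System.out.println(table.saveOrUpdate(first));   // same object now has an id
        try {
            // A fresh object with the same name has no id, so an insert is attempted.
            table.saveOrUpdate(new Ns("genbankBiosqlRich"));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model two objects with the same name are indistinguishable to the caller, yet only the one that already has an identifier avoids the duplicate insert.
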

What happens if you try to save them one at a time (in sequential runs
of your program)? From the look of the stack trace you might see the
same error.

Also, it might pay to look at the biojavax documentation on
http://biojava.org/wiki/BioJava:BioJavaXDocs#BioSQL_and_Hibernate.

Any Hibernate experts able to offer an opinion here??
- Mark



From markjschreiber at gmail.com  Tue Apr 22 01:39:39 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Tue, 22 Apr 2008 09:39:39 +0800
Subject: [Biojava-l] Searching mailing list archives
Message-ID: <93b45ca50804211839lddf0a7je47c337c5250c85d@mail.gmail.com>

Dear Felix -

The URL has been updated. Please see below.

- Mark

---------- Forwarded message ----------
From: Mauricio Herrera Cuadra via RT <support at helpdesk.open-bio.org>
Date: Tue, Apr 22, 2008 at 9:20 AM
Subject: [O|B|F Helpdesk #507] Fwd: [Biojava-l] mailing list archives
To: markjschreiber at gmail.com
Cc: chris at bioteam.net, heikki at sanbi.ac.za, hlapp at gmx.net, jason at bioperl.org


OBF Search engine URL has been moved to: http://search.open-bio.org
I've updated the link in the BioJava wiki
(http://biojava.org/wiki/BioJava:MailingLists) with the new URL.

Please let the requestor know about the update.

Regards,
Mauricio.


From markjschreiber at gmail.com  Tue Apr 22 06:40:35 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Tue, 22 Apr 2008 14:40:35 +0800
Subject: [Biojava-l] loading multiple records for same organism and
	peristance in BioSQL
In-Reply-To: <480D7B55.3070708@uni-tuebingen.de>
References: <4805E9B8.6000709@unity.ncsu.edu>
	<93b45ca50804211827i42924c9rb1d5d2f85c25a1ca@mail.gmail.com>
	<480D7B55.3070708@uni-tuebingen.de>
Message-ID: <93b45ca50804212340s708df36ek73cdf3337b957316@mail.gmail.com>

Hi -

Can someone update the docs on biojava.org to reflect this requirement?

Thanks,

- Mark

On Tue, Apr 22, 2008 at 1:44 PM, Andreas Dräger
<andreas.draeger at uni-tuebingen.de> wrote:
> Hi Doug,
>
> We had the same problem, and the solution is simple. Your code constructs
> a new SimpleNamespace on every call, each time with the same name. It will
> work if you do one of the following:
> 1. Load the namespace with that name from the database and pass it to the
> parser.
> 2. Use the default namespace from the RichObjectFactory.
> 3. Use the parser method that does not take a namespace. This method uses
> the default namespace internally (so option 3 is effectively the same as
> option 2).
> This should help.
>
> Cheers
> Andreas
>

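The advice quoted above comes down to looking up the existing namespace before creating one. A minimal sketch of that get-or-create pattern follows. Ns and NsFactory are hypothetical stand-ins written for illustration and are not BioJava classes. In BioJavaX itself, the calls that appear to play this role are RichObjectFactory.getDefaultNamespace() and RichObjectFactory.getObject(SimpleNamespace.class, new Object[]{"genbankBiosqlRich"}); please check the BioJavaX documentation before relying on either.

```java
import java.util.HashMap;
import java.util.Map;

class NamespaceLookupSketch {
    /** Hypothetical stand-in for a namespace object. */
    static class Ns {
        final String name;
        Ns(String name) { this.name = name; }
    }

    /** Hypothetical factory: returns the existing instance for a name, creating it only once. */
    static class NsFactory {
        private final Map<String, Ns> byName = new HashMap<>();
        private int created = 0;

        Ns getOrCreate(String name) {
            Ns existing = byName.get(name);
            if (existing == null) {
                existing = new Ns(name);
                byName.put(name, existing);
                created++;
            }
            return existing;
        }

        int createdCount() { return created; }
    }

    public static void main(String[] args) {
        NsFactory factory = new NsFactory();
        // One lookup per file, as in a loader that is called once per GenBank file.
        Ns forFirstFile = factory.getOrCreate("genbankBiosqlRich");
        Ns forSecondFile = factory.getOrCreate("genbankBiosqlRich");
        System.out.println(forFirstFile == forSecondFile);  // both files share one object
        System.out.println(factory.createdCount());         // number of objects ever created
    }
}
```

Because every file receives the same namespace object, only one row for that name would ever need to be inserted.
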

From debrown at unity.ncsu.edu  Tue Apr 22 12:22:16 2008
From: debrown at unity.ncsu.edu (Doug Brown)
Date: Tue, 22 Apr 2008 08:22:16 -0400
Subject: [Biojava-l] mailing list archives
In-Reply-To: <480C8DE8.5040805@molgen.mpg.de>
References: <480C8DE8.5040805@molgen.mpg.de>
Message-ID: <480DD878.2000100@unity.ncsu.edu>

Hi Felix,

In addition to the http://search.open-bio.org link mentioned by
Mauricio Herrera Cuadra, you could use Google directly with search 
expressions similar to:

    "[biojava-l]" site:portal.open-bio.org
    "[biojava-dev]" site:portal.open-bio.org

Of course, you need to add on any additional search terms to limit the 
results.
In general see: http://www.google.com/advanced_search

Regards,
Doug


Felix Dreher wrote:
> Hello all,
>
> is there a possibility to query the biojava mailing-list archives?
> (the link provided on the biojava-homepage doesn't work:
> http://search.open-bio.org/cgi-bin/mail-search.cgi)
>
> Best regards,
> Felix

-- 
Doug Brown - Bioinformatics
Fungal Genomics Laboratory
Center for Integrated Fungal Research
North Carolina State University
Campus Box 7251, Raleigh, NC 27695-7251
https://www.fungalgenomics.ncsu.edu/~debrown/
Tel: (919) 513-0394, Fax (919) 513-0024
e-mail: doug_brown at ncsu.edu


From markjschreiber at gmail.com  Wed Apr 23 07:22:14 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 23 Apr 2008 15:22:14 +0800
Subject: [Biojava-l] Suspicious Headers
Message-ID: <93b45ca50804230022j223ed4eetf219e21d35cd51dc@mail.gmail.com>

Hi -

If you have ever tried posting to the list and had your email bounced
back with a message complaining about a suspicious header, chances are
that it was rejected by our spam filter.  There are generally two
reasons why this might happen:

1. Your email is HTML; mail to the list must be text only.

2. Your email has an attachment, possibly a .vcf file or an image in
your email signature (company logo or similar).

If you keep it plain text it should get through.

- Mark


From mail at florianschatz.de  Wed Apr 23 13:49:05 2008
From: mail at florianschatz.de (Florian Schatz)
Date: Wed, 23 Apr 2008 15:49:05 +0200
Subject: [Biojava-l] Extract non-gene regions
Message-ID: <B8FE6090-06D9-4082-9529-3AB4680D05A8@florianschatz.de>

Hello,

I am new to BioJava and have worked with it a lot over the last few
weeks. I hope this is the right place for questions; if not, please
tell me.

I want to get the nucleotide sequence outside the genes of a GenBank
file, that is, everything that is not marked by a 'gene' feature.
Unfortunately, there is no subtract or exclude function in the
Location class. Any hints?

Btw: union() of locations worked fine for extracting the nucleotides
of the genes only.

Best,
Florian


From markjschreiber at gmail.com  Thu Apr 24 02:29:12 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Thu, 24 Apr 2008 10:29:12 +0800
Subject: [Biojava-l] Extract non-gene regions
In-Reply-To: <B8FE6090-06D9-4082-9529-3AB4680D05A8@florianschatz.de>
References: <B8FE6090-06D9-4082-9529-3AB4680D05A8@florianschatz.de>
Message-ID: <93b45ca50804231929h7a7c31aaudccbc5325d6be5f3@mail.gmail.com>

Hi Florian -

There are at least two approaches. You are on the right track with
making a union of all gene locations.  The compound location that
results from the Union will contain all the nucleotides that are
coding. You can then iterate through each nucleotide in the genome and
find out if the union contains the nucleotide. If it doesn't then it
is non coding.  This is surprisingly rapid as the comparisons are
simple.  The pseudo code would be something like...

// Initialize this as the union of the locations of all CDS or gene features.
RichLocation coding;

// Read from a file or database.
RichSequence genome;

// A circular genome may need more sophisticated handling.
for (int i = 1; i <= genome.length(); i++) {
    if (!coding.contains(i)) {
        // Position i is a non-coding nucleotide.
    }
}

The other approach is to use the blockIterator() method of the
compound location that results from the union of coding sequences.
This will output each contiguous chunk of coding sequence. If you know
the length of the sequence then you can rapidly figure out the
intervening pieces.

For example, if the block iterator tells you that [10..50], [90..100],
[350..380] are coding and you know the genome is of length 400 then
you can quickly derive [1..9], [51..89], [101..349] and [381..400] are
non-coding.  Again it is more complicated for circular sequences and
more complex if you consider the opposite strand of a gene (the gene
shadow) to be non-coding. Unfortunately there is no convenience method
to do this but if you code something up it would be great to put it in
the cookbook so others can re-use it.
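
A minimal sketch of this second approach, for a linear sequence only.
It is plain Java rather than BioJava: the blocks are assumed to be
1-based, inclusive {start, end} pairs sorted by start (as one might
collect them from blockIterator()), and the class and method names are
hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class NonCodingGaps {

    /**
     * Returns the stretches not covered by any coding block.
     * Assumes 1-based inclusive {start, end} blocks sorted by start,
     * on a linear (non-circular) sequence of the given length.
     */
    public static List<int[]> gaps(List<int[]> codingBlocks, int sequenceLength) {
        List<int[]> nonCoding = new ArrayList<int[]>();
        int next = 1; // first position not yet accounted for
        for (int[] block : codingBlocks) {
            if (block[0] > next) {
                nonCoding.add(new int[] {next, block[0] - 1});
            }
            // Math.max keeps the scan correct if blocks happen to overlap.
            next = Math.max(next, block[1] + 1);
        }
        if (next <= sequenceLength) {
            nonCoding.add(new int[] {next, sequenceLength});
        }
        return nonCoding;
    }

    public static void main(String[] args) {
        List<int[]> coding = new ArrayList<int[]>();
        coding.add(new int[] {10, 50});
        coding.add(new int[] {90, 100});
        coding.add(new int[] {350, 380});
        for (int[] gap : gaps(coding, 400)) {
            System.out.println("[" + gap[0] + ".." + gap[1] + "]");
        }
    }
}
```

With the blocks from the example above and a length of 400, main
prints [1..9], [51..89], [101..349] and [381..400].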

- Mark

You could actually make point locations of all the non-coding
nucleotides and then merge the whole lot at the end into a compound
location of the non-coding regions.
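
The point-by-point scan, with adjacent non-coding points merged into
runs as suggested here, could look roughly like the following. This is
a stand-alone sketch in plain Java: a java.util.BitSet stands in for
the union of coding locations, and the class and method names are
hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class NonCodingScan {

    /** Marks every position covered by a 1-based inclusive coding block. */
    public static BitSet codingMask(int[][] codingBlocks, int sequenceLength) {
        BitSet mask = new BitSet(sequenceLength + 1);
        for (int[] block : codingBlocks) {
            mask.set(block[0], block[1] + 1); // BitSet ranges are end-exclusive
        }
        return mask;
    }

    /** Scans 1..sequenceLength and merges adjacent non-coding points into runs. */
    public static List<int[]> nonCodingRuns(BitSet coding, int sequenceLength) {
        List<int[]> runs = new ArrayList<int[]>();
        int runStart = -1; // -1 means no run is currently open
        for (int i = 1; i <= sequenceLength; i++) {
            if (!coding.get(i)) {
                if (runStart < 0) {
                    runStart = i; // open a new non-coding run
                }
            } else if (runStart >= 0) {
                runs.add(new int[] {runStart, i - 1}); // close the run
                runStart = -1;
            }
        }
        if (runStart >= 0) {
            runs.add(new int[] {runStart, sequenceLength});
        }
        return runs;
    }

    public static void main(String[] args) {
        int[][] blocks = {{10, 50}, {90, 100}, {350, 380}};
        BitSet coding = codingMask(blocks, 400);
        for (int[] run : nonCodingRuns(coding, 400)) {
            System.out.println("[" + run[0] + ".." + run[1] + "]");
        }
    }
}
```

Because the mask is position-based, overlapping or unsorted coding
blocks need no special handling in this version.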



From heuermh at acm.org  Thu Apr 24 04:09:46 2008
From: heuermh at acm.org (Michael Heuer)
Date: Thu, 24 Apr 2008 00:09:46 -0400 (EDT)
Subject: [Biojava-l] Extract non-gene regions
In-Reply-To: <93b45ca50804231929h7a7c31aaudccbc5325d6be5f3@mail.gmail.com>
Message-ID: <Pine.GSO.4.44.0804240007470.25626-100000@shell3.shore.net>

On Thu, 24 Apr 2008, Mark Schreiber wrote:

> Hi Florian -
>
> There are at least two approaches. You are on the right track with
> making a union of all gene locations.  The compound location that
> results from the Union will contain all the nucleotides that are
> coding. You can then iterate through each nucleotide in the genome and
> find out if the union contains the nucleotide. If it doesn't then it
> is non coding.  This is surprisingly rapid as the comparisons are
> simple.  The pseudo code would be something like...
>
> RichLocation coding; //initialize this by making a union of all
> locations of CDS or Gene Features.
>
> RichSequence genome; // read from file or database
>
> for(int i = 1; i <= genome.lenght(); i++){  //you might need to be a
> bit more sophisticated for a circular genome
>     if( ! genome.contains(i){
>          //you have a non-coding nucleotide.
>     }
> }

typo?

  if (!coding.contains(i)) {
    // you have a non-coding nucleotide.
  }




From andreas.draeger at uni-tuebingen.de  Thu Apr 24 07:26:31 2008
From: andreas.draeger at uni-tuebingen.de (=?ISO-8859-1?Q?Andreas_Dr=E4ger?=)
Date: Thu, 24 Apr 2008 09:26:31 +0200
Subject: [Biojava-l] Problem while parsing GenBank-like files and persiting
 them using Hibernate
Message-ID: <48103627.90505@uni-tuebingen.de>

Dear all,

Recently I downloaded some GenBank-like files from the Ensembl web site
(http://www.ensembl.org/index.html) and noticed that the format used
on this site diverges slightly from what one gets from NCBI.
In particular, the ACCESSION number is not valid according to the
pattern matcher in class org.biojavax.bio.seq.io.GenbankFormat, so the
files cannot be parsed using RichSequence.IOTools.

This issue has been discussed on this list before, but the solution
then was to use files from NCBI instead of those from Ensembl.
However, the files from Ensembl are important because they contain
additional annotation not provided by NCBI, for instance the feature
"exon".

The old parsers from the org.biojava.bio.seq.io package are able to
read the files from this site. The Sequence objects can be enriched
afterwards and written to another GenBank file. However, this again
results in a file that cannot be stored in a BioSQL database using
Hibernate, because of the invalid accession number. The next problem
is that even the old parsers do not treat this "rich" information from
the Ensembl files properly. The feature "exon" becomes "any" when the
sequence is enriched and written to a new GenBank file. Hence the
benefit of the Ensembl annotation gets lost during parsing and
conversion. By the way, Ensembl also offers EMBL-like files and other
formats, with the same problems as mentioned above.

On the other hand, no matter which parser in BioJavaX I look up in
the API documentation, I always find a corresponding "Term" class
which states that it "Implements some ...-specific terms", where
the dots stand for the format in question, such as UniProt, GenBank,
EMBL and so forth. None of these Term classes provides any setters or
add methods that would allow defining a new term like "exon". The
structure of the parsers seems very sophisticated to me, and it is
not easy to extend the parsers or term classes for one's own purposes.

Therefore, I would like to ask the following questions:
1. Is there a way to read in files downloaded from Ensembl using only
the designated BioJavaX classes?
2. How can I extend the terms so that not only "SOME X-specific terms"
are included, but some more? And how do I tell the parser to use and
apply these terms? Or more generally, can I somehow read in an ontology
(for instance the GO), persist it in BioSQL and make use of the terms
contained therein?
3. How can I persist a sequence from Ensembl in a BioSQL database
using Hibernate even though Ensembl uses different accession numbers?

I would be grateful for any answers.

Cheers
Andreas


From mail at florianschatz.de  Thu Apr 24 12:09:24 2008
From: mail at florianschatz.de (Florian Schatz)
Date: Thu, 24 Apr 2008 14:09:24 +0200
Subject: [Biojava-l] Extract non-gene regions
In-Reply-To: <93b45ca50804231929h7a7c31aaudccbc5325d6be5f3@mail.gmail.com>
References: <B8FE6090-06D9-4082-9529-3AB4680D05A8@florianschatz.de>
	<93b45ca50804231929h7a7c31aaudccbc5325d6be5f3@mail.gmail.com>
Message-ID: <BD0D4981-C9D8-42DD-9C76-77C0C8113FE5@florianschatz.de>

Hello,

I tried that, but it is as slow as a version operating on Strings.
However, I created a Cookbook entry:
http://biojava.org/wiki/BioJava:Cookbook:Sequence:ExtractGeneRegions

Is there a better way to get a Sequence from a SymbolList than:

Sequence newsequence = DNATools.createDNASequence(symbolL.seqString(),
    "New Sequence");

Best,
Florian



From markjschreiber at gmail.com  Thu Apr 24 12:47:59 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Thu, 24 Apr 2008 20:47:59 +0800
Subject: [Biojava-l] Extract non-gene regions
In-Reply-To: <BD0D4981-C9D8-42DD-9C76-77C0C8113FE5@florianschatz.de>
References: <B8FE6090-06D9-4082-9529-3AB4680D05A8@florianschatz.de>
	<93b45ca50804231929h7a7c31aaudccbc5325d6be5f3@mail.gmail.com>
	<BD0D4981-C9D8-42DD-9C76-77C0C8113FE5@florianschatz.de>
Message-ID: <93b45ca50804240547gefb750fh493a01a8d0edbe22@mail.gmail.com>

Hi -

While Sequences and SymbolLists offer many advantages over Strings or
character arrays, speed is not one of them.

You can create a Sequence using the SequenceFactory implementations,
which are much more efficient than converting to a String and back to
symbols again; that round trip is a very expensive operation.  From
memory, SimpleRichSequence may even have a constructor that takes a
SymbolList and a name. There should be no need to convert to a String
and back.

Also, do you need a Sequence when a SymbolList may contain all the
information you need?

Finally, the Edit operations you use in your wiki example will cause
quite a big performance hit, as your comment seems to acknowledge. It
would be better to collect all the non-coding points (i) and compile
them into a compound location and then extract the SymbolList for that
location all in one go.
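
As a rough illustration of "all in one go", the sketch below copies
each non-coding range exactly once into a single buffer instead of
editing the sequence repeatedly. It is plain Java with a String
standing in for the SymbolList purely to keep the example
self-contained; the class name, method name and the 1-based inclusive
range layout are assumptions, not BioJava API.

```java
public class ExtractInOneGo {

    /**
     * Concatenates the given 1-based inclusive ranges of the genome,
     * touching each selected position exactly once.
     */
    public static String extract(String genome, int[][] ranges) {
        int total = 0;
        for (int[] range : ranges) {
            total += range[1] - range[0] + 1;
        }
        // Size the buffer up front so it never has to grow.
        StringBuilder out = new StringBuilder(total);
        for (int[] range : ranges) {
            // Convert 1-based inclusive to 0-based, end-exclusive.
            out.append(genome, range[0] - 1, range[1]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String genome = "AACCGGTTAA";
        int[][] nonCoding = {{1, 2}, {7, 8}};
        System.out.println(extract(genome, nonCoding)); // prints AATT
    }
}
```

The cost is proportional to the number of positions copied, whereas
removing pieces one edit at a time typically reshuffles the remaining
sequence on every edit.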

- Mark



From hlapp at gmx.net  Thu Apr 24 22:25:29 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 24 Apr 2008 18:25:29 -0400
Subject: [Biojava-l] Problem while parsing GenBank-like files and
	persiting them using Hibernate
In-Reply-To: <48103627.90505@uni-tuebingen.de>
References: <48103627.90505@uni-tuebingen.de>
Message-ID: <59C8D987-F6F8-4E83-B70D-127D38DCC0C9@gmx.net>

Hi Andreas,

On Apr 24, 2008, at 3:26 AM, Andreas Dräger wrote:
> Or more generally, can I somehow read in an ontology (for instance  
> the GO), persist it in BioSQL and make use of the terms contained  
> therein?


In principle, you absolutely can. As for whether Biojava lets you do
this, I do not know, but I would suppose yes. (BioPerl has a script,
load_ontology.pl, in its Bioperl-db package that does this.)

Your other questions all seem Biojava-specific, so I'll leave them to  
the list and Mark & Richard.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From andreas.draeger at uni-tuebingen.de  Tue Apr 22 05:45:02 2008
From: andreas.draeger at uni-tuebingen.de (=?ISO-8859-1?Q?Andreas_Dr=E4ger?=)
Date: Tue, 22 Apr 2008 05:45:02 -0000
Subject: [Biojava-l] loading multiple records for same organism
 and	peristance in BioSQL
In-Reply-To: <93b45ca50804211827i42924c9rb1d5d2f85c25a1ca@mail.gmail.com>
References: <4805E9B8.6000709@unity.ncsu.edu>
	<93b45ca50804211827i42924c9rb1d5d2f85c25a1ca@mail.gmail.com>
Message-ID: <480D7B55.3070708@uni-tuebingen.de>

Hi Doug,

We also had the same problem. The solution is simple. The problem is
that you construct a new SimpleNamespace each time, always with the
same name. Your code will work if you do one of the following:
1. You can load the namespace with the name from the database and set 
this namespace to the parser.
2. You can use the default namespace from the RichObjectFactory or
3. Just use the parser method, which does not require any namespaces - 
this method actually uses the default namespace (so three is actually 
equal to two).
This should help.
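
The underlying idea, one shared object per namespace name instead of a
fresh object per record, can be sketched in plain Java as follows.
This is not BioJava code: NamespaceRegistry and its nested Namespace
class are hypothetical stand-ins that only illustrate why looking the
namespace up (from the database or a factory) differs from
constructing a new one each time.

```java
import java.util.HashMap;
import java.util.Map;

public class NamespaceRegistry {

    /** Hypothetical stand-in for a persistent namespace object. */
    public static final class Namespace {
        private final String name;

        public Namespace(String name) {
            this.name = name;
        }

        public String getName() {
            return name;
        }
    }

    private final Map<String, Namespace> byName = new HashMap<String, Namespace>();

    /** Returns the one shared instance for this name, creating it on first use. */
    public synchronized Namespace get(String name) {
        Namespace ns = byName.get(name);
        if (ns == null) {
            ns = new Namespace(name);
            byName.put(name, ns);
        }
        return ns;
    }

    public static void main(String[] args) {
        NamespaceRegistry registry = new NamespaceRegistry();
        Namespace first = registry.get("myNamespace");
        Namespace second = registry.get("myNamespace");
        System.out.println(first == second); // true: one object per name

        // Two separately constructed objects with the same name are distinct.
        System.out.println(new Namespace("myNamespace") == new Namespace("myNamespace")); // false
    }
}
```

Every record parsed with a namespace obtained this way refers to the
same object, so a persistence layer sees one namespace rather than
several competing ones with an identical name.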

Cheers
Andreas

From budhaditya21 at yahoo.co.in  Tue Apr 29 05:29:23 2008
From: budhaditya21 at yahoo.co.in (arunabha banerjee)
Date: Tue, 29 Apr 2008 05:29:23 -0000
Subject: [Biojava-l] problems installing biojava on Windows XP professional
Message-ID: <406135.22320.qm@web94608.mail.in2.yahoo.com>

An HTML attachment was scrubbed...
URL: <http://lists.open-bio.org/pipermail/biojava-l/attachments/20080429/32d4f319/attachment-0002.html>