From gongwuming at gmail.com  Mon Jan  3 06:45:39 2005
From: gongwuming at gmail.com (Wuming Gong)
Date: Mon Jan  3 06:42:46 2005
Subject: [Biojava-l] Is there a BioJava wrapper for Consensus
Message-ID: <24d6fd0505010303451496711f@mail.gmail.com>

Hi List, 

I wonder whether there is already a BioJava class for the standalone
version of Consensus (Hertz, G. Z. and Stormo G. D., 1999,
Bioinformatics, 15, 563-577). If not, I want to write one. Could you
please recommend some tutorials or documents for writing such a wrapper
in BioJava?

Thanks.

Wuming
From mark.schreiber at group.novartis.com  Thu Jan  6 21:43:13 2005
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Thu Jan  6 21:39:53 2005
Subject: [Biojava-l] Exception not being caught.
Message-ID: <OF9F1D6B6A.62AEBDD9-ON48256F82.000E62B8-48256F82.000EF1D1@EU.novartis.net>

I tend to agree. This probably shouldn't be an error. It should be 
possible to recover from it. Having said that changing it might be a bit 
of a headache. You don't need to catch Errors but if we make it an 
Exception then old code will be invalid because you do need to catch 
exceptions.

A possible way around would be to make a RuntimeException which you don't 
need to catch but then you are pretty much back to the situation of 
catching a Throwable so there is no real advantage. Also, 
RuntimeExceptions are pretty much reserved for things that happen due to 
bad programming such as NullPointerExceptions.

As a general rule BioJava shouldn't use Errors unless something truly bad 
happens, such as not being able to locate a critical resource like 
AlphabetManager.xml. That sort of thing would be very hard to recover from 
and should be an error. Someone passing some crap sequence to a parser 
shouldn't be an error.

- Mark
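
A minimal sketch of the Error versus Exception distinction discussed in this
thread. DemoError and DemoException are made-up stand-ins (they are not BioJava
classes); DemoError plays the role of something like BioError and DemoException
the role of a checked exception such as BioException.

```java
// Hypothetical stand-in for an unchecked Error such as BioError.
final class DemoError extends Error {
    DemoError(String message) { super(message); }
}

// Hypothetical stand-in for a checked exception such as BioException.
final class DemoException extends Exception {
    DemoException(String message) { super(message); }
}

final class ErrorVersusException {
    /** Throws an Error; the compiler does not force callers to handle it. */
    static void failWithError() {
        throw new DemoError("bad input reported as an Error");
    }

    /** Throws a checked exception; callers must catch it or declare it. */
    static void failWithException() throws DemoException {
        throw new DemoException("bad input reported as an Exception");
    }

    /** Reports which catch clause ended up handling the failure. */
    static String handle(boolean useError) {
        try {
            try {
                if (useError) {
                    failWithError();
                } else {
                    failWithException();
                }
                return "no failure";
            } catch (Exception e) {
                // An Error is not an Exception, so it never reaches this clause.
                return "caught as Exception";
            }
        } catch (Throwable t) {
            // Only a Throwable (or Error) clause sees the Error.
            return "caught as Throwable";
        }
    }

    public static void main(String[] args) {
        System.out.println(handle(true));
        System.out.println(handle(false));
    }
}
```

The nested try blocks only exist to show which clause matches; real code would
normally have a single handler.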


"Richard HOLLAND" <hollandr@gis.a-star.edu.sg>
Sent by: biojava-l-bounces@portal.open-bio.org
12/29/2004 01:14 PM

 
        To:     "Michael Heuer" <heuermh@acm.org>
        cc:     biojava-l@biojava.org, (bcc: Mark Schreiber/GP/Novartis)
        Subject:        RE: [Biojava-l] Exception not being caught.


Hi,

I resolved the exception by adding a trim() call to truncate whitespace
from the ends of the sequence before passing it to DNATools. There were
some weird trailing blank symbols, \0, \r, \n and the like. Fair enough
that it threw a wobbly when it encountered these. However, I am sure
there are situations where you would want to safely know whether a
sequence contained invalid characters (eg. if accepting free-text
sequence information via a web interface). In this case, you would want
to catch the exception in the usual manner.
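
A rough sketch of the clean-then-check approach described above, written
without any BioJava calls. The class name DnaInputCheck and the accepted
character set "acgtn" are assumptions made for illustration; the real BioJava
DNA alphabet may accept other symbols.

```java
final class DnaInputCheck {
    // Assumption: only unambiguous bases plus 'n' are accepted in this sketch.
    private static final String ALLOWED = "acgtn";

    /** String.trim() drops leading and trailing characters up to U+0020,
     *  which covers \0, \r, \n, tabs and spaces. */
    static String clean(String raw) {
        return raw.trim();
    }

    /** Returns the index of the first character outside ALLOWED, or -1 if none. */
    static int firstInvalidIndex(String seq) {
        for (int i = 0; i < seq.length(); i++) {
            char c = Character.toLowerCase(seq.charAt(i));
            if (ALLOWED.indexOf(c) < 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String raw = "acgtacgt\r\n\0";
        String cleaned = clean(raw);
        int bad = firstInvalidIndex(cleaned);
        System.out.println(bad < 0
                ? "sequence looks valid"
                : "invalid character at index " + bad);
    }
}
```

Checking free-text input this way before it reaches the parser lets a web
front end report the offending position instead of failing outright.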

Should this particular BioError not be a plain normal BioException that
people could catch easily?

cheers,
Richard

Richard Holland
Bioinformatics Specialist
GIS extension 8199 
 
---------------------------------------------
This email is confidential and may be privileged. If you are not the
intended recipient, please delete it and notify us immediately. Please
do not copy or use it for any purpose, or disclose its content to any
other person. Thank you.
---------------------------------------------


> -----Original Message-----
> From: Michael Heuer [mailto:heuermh@shell3.shore.net] On 
> Behalf Of Michael Heuer
> Sent: Wednesday, December 29, 2004 12:25 PM
> To: Richard HOLLAND
> Cc: biojava-l@biojava.org
> Subject: Re: [Biojava-l] Exception not being caught.
> 
> 
> 
> On Wed, 29 Dec 2004, Richard HOLLAND wrote:
> 
> > I am getting an exception thrown by my code that never seems to get
> > caught. I am not sure if this is because of BioJava or because of a lack
> > of understanding of Exceptions on my part? The exception causes the
> > program to grind to an immediate halt. My method throws the general
> > Exception class, but the exception thrown by BioJava seems to escape
> > that detail and treats it as though my method were not handling
> > exceptions at all. I would expect the calling method which wraps the
> > call in a try{}catch{Exception e} statement to catch it? But apparently
> > not? Why not?!!
> >
> > The method in BioJava I am using is DNATools.createDNASequence.
> >
> > Here is the exception:
> >
> > Exception in thread "main" org.biojava.bio.BioError: Something has gone badly wrong with DNA
> >         at org.biojava.bio.seq.DNATools.createDNA(DNATools.java:158)
> 
> Unfortunately BioError is not an exception, it is an Error.
> 
> I believe you can catch them with
> 
> try
> {
>   // ...
> }
> catch (Throwable t)
> {
>   // ...
> }
> 
> but you probably shouldn't be.  From the BioError javadoc:
> 
> For developers:
> Throw this when something has gone wrong and in general people should
> not be handling it.
> 
> 
> 
> >         at org.biojava.bio.seq.DNATools.createDNASequence(DNATools.java:176)
> >         at gis.aads.pipeline.LibraryFastaBuilder.run(LibraryFastaBuilder.java:234)
> >         at gis.pipeline.Main.main(Main.java:125)
> > Caused by: org.biojava.bio.symbol.IllegalSymbolException: This tokenization doesn't contain character: ''
> >         at org.biojava.bio.seq.io.CharacterTokenization.parseTokenChar(CharacterTokenization.java:175)
> 
> This is the real problem:  the parser doesn't know what to do with the
> character ''.  I don't know exactly what that means, but does the string
> you pull out of the database clob look reasonable?
> 
>    michael
> 
> 
> > And here is the method that calls it (or bits of it anyhow, and an
> > example calling method):
> >
> >     public void doTheThing() {
> >               MyClass otherClass = new MyClass();
> >        try {
> >                  int rc = otherClass.run();
> >                  System.out.println("rc was "+rc);
> >        } catch (Exception e) {
> >           System.out.println("oops!");
> >        }
> >     }
> >
> >     public int run() throws Exception {
> > ....
> >         // For each library, get all trimmed seqs.
> >         for (String lib : libs) {
> >             log.info("Processing library "+lib);
> > ....
> >             // Get the sequences.
> >             seqq.execute(lib);
> >             rs = seqq.results();
> >
> >             // Log info.
> >             log.info("Processing fasta.");
> >             while (rs.next()) {
> >                 // Get details.
> >                 String seqID = rs.getString(1);
> >                 char direction = UserSampleID.getDirection(seqID);
> >                 Clob seqclob = rs.getClob(2);
> >                 String seqstr = seqclob.getSubString((long)1,(int)seqclob.length());
> >                 if (seqstr.length()<minLength) continue;
> >
> >                 // Create the sequence and format it into fasta.
> >                 Sequence seq = DNATools.createDNASequence(seqstr, seqID);
> >                 ByteArrayOutputStream baos = new ByteArrayOutputStream();
> >                 SeqIOTools.writeFasta(baos,seq);
> >                 baos.flush();
> >
> >                 // For each seq, if reverse, add to reverse temp file.
> >                 // Else, add to forward temp file.
> >                 switch (direction) {
> >                     case 'R':
> >                         reverseWriter.write(baos.toString());
> >                         break;
> >                     case 'F':
> >                         forwardWriter.write(baos.toString());
> >                         break;
> >                     default:
> >                         log.warning("Unknown direction "+direction+" received for sequence "+seqID);
> >                         rc = PipelineApp.FAILURE;
> >                         continue;
> >                 }
> > ....
> >             }
> > ....
> >         }
> > ....
> >     }
> >
> >
> > I understand that the exception is thrown because of an invalid
> > sequence, but I don't understand why it isn't being caught.
> >
> >
> > Richard Holland
> > Bioinformatics Specialist
> > GIS extension 8199
> >
> 
> 

_______________________________________________
Biojava-l mailing list  -  Biojava-l@biojava.org
http://biojava.org/mailman/listinfo/biojava-l


From mark.schreiber at group.novartis.com  Thu Jan  6 21:50:17 2005
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Thu Jan  6 21:46:52 2005
Subject: [Biojava-l] CVS
Message-ID: <OF96810C93.92CC8C9F-ON48256F82.000F9286-48256F82.000F9787@EU.novartis.net>

Take a look at:

http://cvs.biojava.org/


Felipe Albrecht <felipe.albrecht@gmail.com>
Sent by: biojava-l-bounces@portal.open-bio.org
12/29/2004 02:06 PM
Please respond to Felipe Albrecht

 
        To:     biojava-l@biojava.org
        cc:     (bcc: Mark Schreiber/GP/Novartis)
        Subject:        [Biojava-l] CVS


Hello, 

I'm a computer science student and I intend to study biomolecular sciences
with bioinformatics.
I have good skills with the Java platform, and I love to code and to study
biomolecular sciences.

I want to work with BioJava and, for a while, to understand BioJava better,
I am thinking of helping with BioJava's upgrade from Java 1.4 to 1.5.
For this, I need to know what the CVS project directory is. Is the
repository " 
:pserver:cvs@cvs.open-bio.org:/home/repository/biojava" ?

Also, which project contains the most current (and unstable) 
code?

Thanks,

Felipe Albrecht


From mark.schreiber at group.novartis.com  Thu Jan  6 21:54:11 2005
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Thu Jan  6 21:50:43 2005
Subject: [Biojava-l] Blast SAX parser output
Message-ID: <OF5591523D.BB6F058B-ON48256F82.000FCCD3-48256F82.000FF2F1@EU.novartis.net>

Use a different SearchContentHandler (or override BlastHitSummaryWriter) 
to write to a file instead of STDOUT. Or you could redirect STDOUT to a 
file (are you allowed to do that with a cron job??)
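
A small sketch of the second suggestion, redirecting STDOUT from inside the
program using only the standard library. StdoutCapture and runQuietly are
made-up names; the Runnable stands in for whatever call produces the unwanted
output (for example the parse call). This sketch collects the text in memory,
and wrapping a FileOutputStream in the PrintStream instead would send it to a
file.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

final class StdoutCapture {
    /**
     * Runs the task with System.out pointed at an in-memory buffer, restores
     * the original stream afterwards, and returns whatever was printed.
     */
    static String runQuietly(Runnable task) {
        PrintStream original = System.out;
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream replacement = new PrintStream(buffer, true);
        try {
            System.setOut(replacement);
            task.run();
        } finally {
            // Always put the real stream back, even if the task throws.
            replacement.flush();
            System.setOut(original);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        String captured = runQuietly(() -> System.out.println("obj=score 317"));
        System.out.println("captured " + captured.length() + " characters");
    }
}
```

System.setOut affects the whole JVM, so this only suits a single-threaded job
such as a cron-driven batch run.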


"Richard HOLLAND" <hollandr@gis.a-star.edu.sg>
Sent by: biojava-l-bounces@portal.open-bio.org
12/30/2004 11:54 AM

 
        To:     <biojava-l@biojava.org>
        cc:     (bcc: Mark Schreiber/GP/Novartis)
        Subject:        [Biojava-l] Blast SAX parser output


Is there any way to stop the blast parser code from outputting progress? I 
get lots of the following and it's clogging up my unix mailbox as the job 
is run through cron:

obj=score               317
obj=expectValue         7e-86
obj=numberOfIdentities          158
obj=alignmentSize               160
obj=percentageIdentity          98
obj=numberOfPositives           159
obj=numberOfPositives           159
obj=queryFrame          plus2
obj=querySequenceStart          29
obj=querySequenceEnd            508
obj=querySequence DKHWMPVTKLGRLVKDMKIKSLEEIYLFSLPIKESEIIDFFLGASLKD
EVLKIMPVQKQTRAGQRTRFKAFVAIGDYNGHVGLGVKCSKEVATAIRGAIILAKLSIVPVRRGYWGNKIGKPHTVPCKV
TGRCGSVLVRLIPAPRGTGIVSAPVPKKLLMM
obj=subjectSequenceStart                31
obj=subjectSequenceEnd          190
obj=subjectSequence DKEWIPVTKLGRLVKDMKIKSLEEIYLFSLPIKESEIIDFFLGASLKD
EVLKIMPVQKQTRAGQRTRFKAFVAIGDYNGHVGLGVKCSKEVATAIRGAIILAKLSIVPVRRGYWGNKIGKPHTVPCKV
TGRCGSVLVRLIPAPRGTGIVSAPVPKKLLMM
....

The code producing this is:

            File parsedBlast = safe.tempfile();
            SearchContentHandler handler = new BlastHitSummaryWriter(
                new BufferedWriter(new FileWriter(parsedBlast)));
            SeqSimilarityAdapter adapter = new SeqSimilarityAdapter();
            adapter.setSearchContentHandler(handler);
            BlastLikeSAXParser breader = new BlastLikeSAXParser();
            breader.setModeLazy();
            InputSource is = new InputSource(new FileReader(blast));
            breader.setContentHandler(adapter);
            breader.parse(is);

cheers,
Richard

Richard Holland
Bioinformatics Specialist
GIS extension 8199 
 




From wim.glassee at ua.ac.be  Fri Jan  7 10:01:34 2005
From: wim.glassee at ua.ac.be (Wim Glassee)
Date: Fri Jan  7 09:58:14 2005
Subject: [Biojava-l] BioJava 1.5
Message-ID: <41DEA44E.7080706@ua.ac.be>

Hi all,

anybody have any idea if and when a biojava 1.5 is coming? A 
java.sun.com article stated late 2004, early 2005.
I would personally be interested in a build on top of the 1.5 codebase. 
Partly because of the internal xml library.

Thanks,

Wim


From mark.schreiber at group.novartis.com  Sun Jan  9 20:03:05 2005
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Sun Jan  9 19:59:36 2005
Subject: [Biojava-l] BioJava 1.5
Message-ID: <OFB232BF3A.A3804D85-ON48256F85.00056719-48256F85.0005C6C5@EU.novartis.net>

I'd like to see biojava 1.4 come out first!!

Seriously though, I'm assuming you mean a version that uses java 1.5 (java 
5). I think Matthew Pocock is working on something, I think it's called 
BioJava2 and represents a major redesign (in which case it probably 
shouldn't be called biojava but that's just semantics). Not too sure what 
the website is.

Having said all that, I think there is no reason why we cannot start using 
java 1.5 in the standard old biojava (probably after biojava 1.4 is 
finalised). As long as most of the community is happy with that proposal.

- Mark




From ghhu at info.biology.mcmaster.ca  Mon Jan 10 10:58:59 2005
From: ghhu at info.biology.mcmaster.ca (Guanhong Hu)
Date: Mon Jan 10 10:55:25 2005
Subject: [Biojava-l] can use biojava in JApplet?
Message-ID: <Pine.LNX.4.44.0501101052150.19365-100000@info0>

Hi,
I am new to biojava. I developed a small java application using some 
biojava modules. It runs O.K., but when I convert this application to a 
JApplet, it compiles O.K. but fails when I test it with appletviewer, which 
gives the following error message:
java.lang.NoClassDefFoundError: org/biojava/bio/symbol/IllegalSymbolException
        at java.lang.Class.getDeclaredConstructors0(Native Method)
        at java.lang.Class.privateGetDeclaredConstructors(Class.java:1610)
        at java.lang.Class.getConstructor0(Class.java:1922)
        at java.lang.Class.newInstance0(Class.java:278)
        at java.lang.Class.newInstance(Class.java:261)
        at sun.applet.AppletPanel.createApplet(AppletPanel.java:617)
        at sun.applet.AppletPanel.runLoader(AppletPanel.java:546)
        at sun.applet.AppletPanel.run(AppletPanel.java:298)
        at java.lang.Thread.run(Thread.java:534)
and the applet is not initialized.

I don't know what's wrong. Can biojava be used in a JApplet?

Any reply is greatly appreciated.

Thanks

Guanhong 

From hollandr at gis.a-star.edu.sg  Mon Jan 10 20:07:29 2005
From: hollandr at gis.a-star.edu.sg (Richard HOLLAND)
Date: Mon Jan 10 20:05:10 2005
Subject: [Biojava-l] can use biojava in JApplet?
Message-ID: <6D9E9B9DF347EF4385F6271C64FB8D560140EC63@BIONIC.biopolis.one-north.com>

Sounds like a classpath problem. You need to include the biojava.jar
file either in the same folder as the applet, or in the global java
resources (something like /usr/java/lib/ext) on the server the applet is
hosted on. Just using individual class files will not work as there are
many interdependencies.
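
One way to confirm a classpath problem like this is to ask the class loader
directly. The sketch below is only a diagnostic aid; ClasspathProbe is a
made-up helper, and the class name in main is the one from the stack trace in
the original report.

```java
final class ClasspathProbe {
    /** Returns true if the named class can be found by this class's loader. */
    static boolean isLoadable(String className) {
        try {
            // 'false' means: locate the class but do not run its static initialisers.
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "org.biojava.bio.symbol.IllegalSymbolException";
        System.out.println(name + " loadable: " + isLoadable(name));
    }
}
```

If this reports false inside the applet but true in the standalone
application, the jar is not visible to the applet's class loader.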

Richard Holland
Bioinformatics Specialist
GIS extension 8199   
 



From e.willighagen at science.ru.nl  Tue Jan 11 01:52:31 2005
From: e.willighagen at science.ru.nl (Egon Willighagen)
Date: Tue Jan 11 01:49:10 2005
Subject: [Biojava-l] 3D viewing?
Message-ID: <200501110752.31210.e.willighagen@science.ru.nl>


Hi all,

I've been monitoring this list for a few weeks now, but am a bit puzzled by 
the project's nature. There does not seem to be that much activity... how 
actively is it used and developed?

Anyway, what I wanted to ask is this. The website news mentions 
org.biojava.bio.structure for holding 3D coordinates (though it's missing 
from the JavaDoc on the website), so I was wondering what you use for 3D 
display, and would like to discuss the option of using Jmol [1] for 3D 
rendering of protein.

Egon

1. http://www.jmol.org/

-- 
e.willighagen@science.ru.nl
PhD student on Molecular Representation in Chemometrics
Radboud University Nijmegen
http://www.cac.science.ru.nl/people/egonw/
GPG: 1024D/D6336BA6
From ap3 at sanger.ac.uk  Tue Jan 11 04:46:29 2005
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Tue Jan 11 04:42:25 2005
Subject: [Biojava-l] 3D viewing?
In-Reply-To: <200501110752.31210.e.willighagen@science.ru.nl>
References: <200501110752.31210.e.willighagen@science.ru.nl>
Message-ID: <AB592808-63B5-11D9-84AA-001124313E58@sanger.ac.uk>

Hi Egon!

I am the person who contributed the biojava structure classes. As it 
turns out, I have been a very happy Jmol user for quite a while now!  
:-)

Using both Biojava and Jmol I am working on a 3D - DAS (distributed 
annotation system) client to visualize annotations of proteins in both 
sequence and structure.

http://www.efamily.org.uk/software/dasclients/spice/

To interact between Biojava and Jmol I use the Biojava code to create a 
PDB file and use this file as an input to Jmol.
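
To illustrate what the PDB intermediate step involves, here is a sketch that
formats one fixed-column ATOM record by hand. It does not use the BioJava
structure classes or any Jmol call; PdbAtomLine and its parameters are made up
for this example, alternate-location and insertion-code columns are left
blank, and the atom name is expected to arrive already padded to four
characters (for example " CA ").

```java
import java.util.Locale;

final class PdbAtomLine {
    /**
     * Builds a 78-character ATOM record: serial in columns 7-11, atom name in
     * 13-16, residue name in 18-20, chain in 22, residue number in 23-26,
     * x/y/z in 31-54, occupancy and B-factor in 55-66, element in 77-78.
     */
    static String format(int serial, String atomName, String resName, char chain,
                         int resSeq, double x, double y, double z,
                         double occupancy, double bFactor, String element) {
        // Locale.ROOT keeps the decimal separator a '.' on every machine.
        return String.format(Locale.ROOT,
                "ATOM  %5d %-4s %3s %c%4d    %8.3f%8.3f%8.3f%6.2f%6.2f          %2s",
                serial, atomName, resName, chain, resSeq,
                x, y, z, occupancy, bFactor, element);
    }

    public static void main(String[] args) {
        System.out.println(format(1, " CA ", "ALA", 'A', 1,
                11.104, 6.134, -6.504, 1.00, 0.00, "C"));
    }
}
```

A string built from such lines is the kind of input a PDB reader expects,
which is why the file format works as a loose coupling between the two
libraries.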

Some documentation on how to incorporate Jmol into another application can 
be found at
http://wiki.jmol.org/ApplicationsEmbeddingJmol

> The website news mentions
> org.biojava.bio.structure for holding 3D coordinates (though it's 
> missing
> from the JavaDoc on the website),

As far as I know the JavaDoc relates to the last public release, which 
did not contain the structure classes.  Guess it is really time to make 
a new biojava release!

Greetings,
Andreas

-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                               Hinxton, Cambridge CB10 1SA, UK

From tmo at ebi.ac.uk  Tue Jan 11 05:49:52 2005
From: tmo at ebi.ac.uk (Tom Oinn)
Date: Tue Jan 11 05:46:36 2005
Subject: [Biojava-l] 3D viewing?
In-Reply-To: <AB592808-63B5-11D9-84AA-001124313E58@sanger.ac.uk>
References: <200501110752.31210.e.willighagen@science.ru.nl>
	<AB592808-63B5-11D9-84AA-001124313E58@sanger.ac.uk>
Message-ID: <41E3AF50.2020803@ebi.ac.uk>

Andreas Prlic wrote:
> Hi Egon!
> 
> I am the person who contributed the biojava - structure classes. As it 
> turns out I am a very happy Jmol user already for quite a while now!  :-)

Seconded - Jmol is very cool and surprisingly easy to integrate if you 
have PDB format data lying around somewhere in your code. I believe it 
supports other structure formats as well although I've never used them.

Cheers,

Tom



From e.willighagen at science.ru.nl  Tue Jan 11 06:24:40 2005
From: e.willighagen at science.ru.nl (Egon Willighagen)
Date: Tue Jan 11 06:30:08 2005
Subject: [Biojava-l] 3D viewing?
In-Reply-To: <41E3AF50.2020803@ebi.ac.uk>
References: <200501110752.31210.e.willighagen@science.ru.nl>
	<AB592808-63B5-11D9-84AA-001124313E58@sanger.ac.uk>
	<41E3AF50.2020803@ebi.ac.uk>
Message-ID: <200501111224.40229.e.willighagen@science.ru.nl>

On Tuesday 11 January 2005 11:49 am, Tom Oinn wrote:
> Andreas Prlic wrote:
> > I am the person who contributed the biojava - structure classes. As it
> > turns out I am a very happy Jmol user already for quite a while now!  :-)
>
> Seconded - Jmol is very cool and surprisingly easy to integrate if you
> have PDB format data lying around somewhere in your code. I believe it
> supports other structure formats as well although I've never used them.

Hi Tom,

I'm actually one of the Jmol developers (though we should thank Miguel for the 
excellent 3D rendering of proteins), but was actually referring to a tighter 
integration of BioJava and Jmol...

On http://wiki.jmol.org/JmolCdkIntegration there is this code fragment:

===================================
public void viewCDKModel(ChemFile aCDKChemFileObject) {
  JmolPanel jmolPanel = new JmolPanel();
  contentPane.add(jmolPanel);
        
  JmolViewer viewer = jmolPanel.getViewer();
  viewer.openClientFile("", "", aCDKChemFileObject);
}

class JmolPanel extends JPanel {
  JmolViewer viewer;
  JmolAdapter adapter;
  JmolPanel() {
    // use CDK IO
    adapter = new CdkJmolAdapter(null);
    viewer = new JmolViewer(this, adapter);
  }
}
===================================

And I was wondering about a JmolAdapter implementation for BioJava...
So that it would be possible to do:

===================================
public void viewBioJavaModel(Sequence sequence) {
  JmolPanel jmolPanel = new JmolPanel();
  contentPane.add(jmolPanel);
        
  JmolViewer viewer = jmolPanel.getViewer();
  viewer.openClientFile("", "", sequence);
}

class JmolPanel extends JPanel {
  JmolViewer viewer;
  JmolAdapter adapter;
  JmolPanel() {
    // use BioJava IO (BioJavaJmolAdapter is the proposed adapter, it does not exist yet)
    adapter = new BioJavaJmolAdapter(null);
    viewer = new JmolViewer(this, adapter);
  }
}
===================================

Or some BioJava class instead of Sequence...

This would remove the serialization to PDB, and allow new possibilities... for 
which we might need to extend the model adapter, but that's no issue...

Egon
From ola.spjuth at farmbio.uu.se  Tue Jan 11 06:52:38 2005
From: ola.spjuth at farmbio.uu.se (Ola Spjuth)
Date: Tue Jan 11 06:45:59 2005
Subject: [Biojava-l] Biojava DB
Message-ID: <1105444357.3095.37.camel@zidane>

Hello,

I am thinking of using BioJava for managing sequences and annotations of
sequences.

BioJava seems to have database support for storing sequences in
relational databases. How developed is the support for storing annotated
sequences? Is there any documentation (other than JavaDoc) on working
with sequences and annotations, and database persistence of these?

Best regards

   .../Ola Spjuth


From tmo at ebi.ac.uk  Tue Jan 11 07:17:43 2005
From: tmo at ebi.ac.uk (Tom Oinn)
Date: Tue Jan 11 07:15:09 2005
Subject: [Biojava-l] 3D viewing?
In-Reply-To: <200501111224.40229.e.willighagen@science.ru.nl>
References: <200501110752.31210.e.willighagen@science.ru.nl>	<AB592808-63B5-11D9-84AA-001124313E58@sanger.ac.uk>	<41E3AF50.2020803@ebi.ac.uk>
	<200501111224.40229.e.willighagen@science.ru.nl>
Message-ID: <41E3C3E7.9070305@ebi.ac.uk>

Egon Willighagen wrote:
> On Tuesday 11 January 2005 11:49 am, Tom Oinn wrote:
> 
>>Andreas Prlic wrote:
>>
>>>I am the person who contributed the biojava - structure classes. As it
>>>turns out I am a very happy Jmol user already for quite a while now!  :-)
>>
>>Seconded - Jmol is very cool and surprisingly easy to integrate if you
>>have PDB format data lying around somewhere in your code. I believe it
>>supports other structure formats as well although I've never used them.
> 
> 
> Hi Tom,
> 
> I'm actually one of the Jmol developers (though we should thank Miguel for the 
> excellent 3D rendering of proteins), but was actually referring to a tighter 
> integration of BioJava and Jmol...

Hi Egon,

Although I agree that in principle this is a reasonable thing to do I'm 
not convinced that it's worth it in this case - is there any information 
loss in converting to PDB format then reading back into JMol? If not 
then I'd leave it at that, there would be relatively few gains from 
being able to do so directly.

One thing that I believe systems like Spice take advantage of is the 
ability to include scripting instructions in the input to JMol; I 
haven't looked at the biojava 3d classes in detail but I would have 
thought they'd lack this functionality - it's nothing to do with 
representing the structure fundamentally. If you were to implement a 
direct Jmol view over the biojava classes you'd have to have some way of 
duplicating this functionality.

Having a relatively standard intermediate representation such as PDB 
format flatfiles is generally a good thing and makes it easier to link 
components in a loosely coupled fashion at the expense potentially of 
some efficiency. My thoughts are that the efficiency is no big deal in 
this case and that the convenience of the intermediate representation 
(not to mention that it already exists and works!) makes it the 
preferred option.

Cheers,

Tom
From e.willighagen at science.ru.nl  Tue Jan 11 07:27:37 2005
From: e.willighagen at science.ru.nl (Egon Willighagen)
Date: Tue Jan 11 07:29:26 2005
Subject: [Biojava-l] 3D viewing?
In-Reply-To: <41E3C3E7.9070305@ebi.ac.uk>
References: <200501110752.31210.e.willighagen@science.ru.nl>
	<200501111224.40229.e.willighagen@science.ru.nl>
	<41E3C3E7.9070305@ebi.ac.uk>
Message-ID: <200501111327.37971.e.willighagen@science.ru.nl>

On Tuesday 11 January 2005 01:17 pm, Tom Oinn wrote:
> Having a relatively standard intermediate representation such as PDB
> format flatfiles is generally a good thing and makes it easier to link
> components in a loosely coupled fashion at the expense potentially of
> some efficiency. My thoughts are that the efficiency is no big deal in
> this case and that the convenience of the intermediate representation
> (not to mention that it already exists and works!) makes it the
> preferred option.

True :) 

Egon
From ap3 at sanger.ac.uk  Tue Jan 11 08:03:15 2005
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Tue Jan 11 07:59:08 2005
Subject: [Biojava-l] 3D viewing?
In-Reply-To: <200501111144.18889.e.willighagen@science.ru.nl>
References: <200501110752.31210.e.willighagen@science.ru.nl>
	<F9342CB4-63B3-11D9-84AA-001124313E58@sanger.ac.uk>
	<200501111144.18889.e.willighagen@science.ru.nl>
Message-ID: <28444FBA-63D1-11D9-84AA-001124313E58@sanger.ac.uk>

Hi Egon!

What would be the advantages of using a BioJavaModel Adapter?

* faster loading of the structure into Jmol?
* lower memory consumption of the application?
* if you rotate the structure in Jmol, the rotated coordinates are 
available through the biojava structure object?

How about the performance of the Jmol display, would it be as fast?


Cheers,
Andreas


-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                               Hinxton, Cambridge CB10 1SA, UK

From e.willighagen at science.ru.nl  Tue Jan 11 08:43:43 2005
From: e.willighagen at science.ru.nl (Egon Willighagen)
Date: Tue Jan 11 08:40:35 2005
Subject: [Biojava-l] 3D viewing?
In-Reply-To: <28444FBA-63D1-11D9-84AA-001124313E58@sanger.ac.uk>
References: <200501110752.31210.e.willighagen@science.ru.nl>
	<200501111144.18889.e.willighagen@science.ru.nl>
	<28444FBA-63D1-11D9-84AA-001124313E58@sanger.ac.uk>
Message-ID: <200501111443.43297.e.willighagen@science.ru.nl>

On Tuesday 11 January 2005 02:03 pm, Andreas Prlic wrote:
> Hi Egon!
>
> What would be the advantages of using a BioJavaModel Adapter?
>
> * faster loading of the structure into Jmol?
> * lower memory consumption of the application?

Yes, I think so, because one does not need to make a large String first (or a 
File if that is used as intermediate) for the PDB serialization...

> * if you rotate the structure in Jmol, the rotated coordinates are
> available through the biojava structure object ?
>
> how about performance of the Jmol display, would it be as fast?

The Jmol viewer still copies things into private data classes, so the 
rendering speed is identical.

Because of this copying, I'm now realizing that it might not get updated when 
the original data object is changed... so, it might not be that useful 
actually...
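
The effect of that copying can be shown with a toy example. CopyingViewer
below is a made-up class, not Jmol code; it only mimics a viewer that takes a
private copy of the coordinates it is given.

```java
import java.util.Arrays;

final class CopyingViewerDemo {
    /** Hypothetical viewer that keeps its own copy of the coordinates. */
    static final class CopyingViewer {
        private final double[] coords;

        CopyingViewer(double[] source) {
            // Private copy: later edits to 'source' are invisible to the viewer.
            this.coords = Arrays.copyOf(source, source.length);
        }

        double first() {
            return coords[0];
        }
    }

    /** Edits the model after handing it over and returns what the viewer still holds. */
    static double viewAfterEdit() {
        double[] model = {1.0, 2.0, 3.0};
        CopyingViewer viewer = new CopyingViewer(model);
        model[0] = 99.0;       // change the original data object
        return viewer.first(); // the viewer still holds the value from copy time
    }

    public static void main(String[] args) {
        System.out.println("viewer sees " + viewAfterEdit());
    }
}
```

Under that assumption an adapter saves the serialization step but would still
need an explicit reload to pick up changes.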

Egon
From msouthern at exsar.com  Tue Jan 11 09:48:36 2005
From: msouthern at exsar.com (Mark Southern)
Date: Tue Jan 11 09:44:57 2005
Subject: [Biojava-l] SequencePanel changes between 1.3 and 1.4pre1
In-Reply-To: <200501111341.j0BDfEKt007588@portal.open-bio.org>
Message-ID: <2C879FB52902524C85E8B616CF276F1A1CC53F@cartasrv.carta.local>

I've just started looking at 1.4pre1 (sorry to take so long!).

I am using SequencePanels to view protein sequences. I add multiple
SequencePanels (each having the same sequence) to a Container with a
BoxLayout. Each displays a different RangeLocation with the overall effect
that the sequence appears to wrap over many lines. The credit for this idea
goes to Matthew Pocock and it was working very nicely :-)
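
For reference, the per-line ranges for such a wrapped display can be computed
as in the sketch below. WrapRanges is a made-up helper that uses plain ints;
the assumption is that each {min, max} pair would then be turned into a
RangeLocation and given to one SequencePanel.

```java
import java.util.ArrayList;
import java.util.List;

final class WrapRanges {
    /** Returns 1-based, inclusive {min, max} pairs, one per display line. */
    static List<int[]> ranges(int sequenceLength, int residuesPerLine) {
        if (residuesPerLine <= 0) {
            throw new IllegalArgumentException("residuesPerLine must be positive");
        }
        List<int[]> out = new ArrayList<>();
        for (int min = 1; min <= sequenceLength; min += residuesPerLine) {
            // The last line may be shorter than the others.
            int max = Math.min(min + residuesPerLine - 1, sequenceLength);
            out.add(new int[] {min, max});
        }
        return out;
    }

    public static void main(String[] args) {
        for (int[] r : ranges(130, 60)) {
            System.out.println(r[0] + ".." + r[1]);
        }
    }
}
```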


It breaks in 1.4pre1, however, and the change that causes this is that the
SequencePanel's position within its Container ( getX() and getY() ) is
added to the Graphics2D translation in the SequencePanel's paintComponent()
method. The SequencePanel is then painted in an unseen part of its
Container.

i.e. from:

g2.translate(leadingBorder.getSize() - minAcross + insets.left, insets.top);

to:

g2.translate(leadingBorder.getSize() - minAcross + insets.left + getX(),
             insets.top + getY());


I can imagine a SequencePanel being used in a Container with a
BorderLayout or BoxLayout, but I can't imagine it being arranged at a
coordinate other than (0,0). Can someone please explain the rationale for
the 1.4pre1 way of doing things, or can we switch back to what 1.3 did?
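
The double offset can be reproduced without BioJava. Swing hands
paintComponent() a Graphics whose origin has already been moved to the
component's top-left corner, so adding getX() and getY() again shifts the
drawing twice. The sketch below is only an illustration: TranslateDemo and
paintOrigin are made-up names, and a BufferedImage stands in for the parent
Container.

```java
import java.awt.Graphics2D;
import java.awt.geom.AffineTransform;
import java.awt.image.BufferedImage;

final class TranslateDemo {
    /**
     * Returns the {x, y} origin, in parent coordinates, at which a child
     * located at (childX, childY) would start drawing.
     */
    static double[] paintOrigin(int childX, int childY, boolean addPositionAgain) {
        BufferedImage parent = new BufferedImage(1, 1, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g2 = parent.createGraphics();
        try {
            // Stands in for the translation Swing applies before paintComponent().
            g2.translate(childX, childY);
            if (addPositionAgain) {
                // The extra getX()/getY() term described for 1.4pre1.
                g2.translate(childX, childY);
            }
            AffineTransform t = g2.getTransform();
            return new double[] { t.getTranslateX(), t.getTranslateY() };
        } finally {
            g2.dispose();
        }
    }

    public static void main(String[] args) {
        double[] as13 = paintOrigin(0, 40, false);
        double[] as14 = paintOrigin(0, 40, true);
        System.out.println("1.3-style origin:     " + as13[0] + ", " + as13[1]);
        System.out.println("1.4pre1-style origin: " + as14[0] + ", " + as14[1]);
    }
}
```

With the second panel of a vertical BoxLayout at y = 40, the 1.3-style call
draws at y = 40 in the parent, while the 1.4pre1-style call draws at y = 80,
which can fall outside the panel's own clip.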

Cheers,

Mark.


From heuermh at acm.org  Tue Jan 11 17:39:42 2005
From: heuermh at acm.org (Michael Heuer)
Date: Tue Jan 11 17:35:43 2005
Subject: [Biojava-l] biosql support in bioperl?
In-Reply-To: <200501111443.43297.e.willighagen@science.ru.nl>
Message-ID: <Pine.GSO.4.44.0501111732420.29567-100000@shell3.shore.net>


I'm afraid to ask this on the bioperl list for fear of getting flamed, but
I was always under the assumption that bioperl could read and write from a
biosql schema, like I have been doing with biojava for some time.

The bioperl HOWTO docs read

Important

Support for the biosql protocol is disabled as of Bioperl version 1.4.
We hope to remedy this in a subsequent release.

and the module referred to in Bio::DB::Registry

'biosql' => 'Bio::DB::BioSQL::BioDatabaseAdaptor'

doesn't seem to exist at all.

Does any one know, do I need to use an earlier version, or am I missing
something obvious?

(please don't hurt me  :)

   michael

From mark.schreiber at group.novartis.com  Tue Jan 11 19:48:01 2005
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Tue Jan 11 19:44:33 2005
Subject: [Biojava-l] Biojava DB
Message-ID: <OF017FABE9.D5FBC3C0-ON48256F87.00043C85-48256F87.0004682A@EU.novartis.net>

Hi -

You would probably want to use something like BioSQL. There is some 
documentation under the OBDA section of this page 
http://www.biojava.org/docs/bj_in_anger/index.htm

There are some people on the list who are actively working with this. You 
would probably also want to look at the BioSQL webpages and mailing list.

- Mark


Ola Spjuth <ola.spjuth@farmbio.uu.se>
Sent by: biojava-l-bounces@portal.open-bio.org
01/11/2005 07:52 PM

 
        To:     biojava-l@biojava.org
        cc:     (bcc: Mark Schreiber/GP/Novartis)
        Subject:        [Biojava-l] Biojava DB


Hello,

I am thinking of using BioJava for managing sequences and annotations of
sequences.

BioJava seems to have database support for storing sequences in
relational databases. How developed is the support for storing annotated
sequences? Is there any documentation (other than JavaDoc) on working
with sequences and annotations, and database persistence of these?

Best regards

   .../Ola Spjuth


_______________________________________________
Biojava-l mailing list  -  Biojava-l@biojava.org
http://biojava.org/mailman/listinfo/biojava-l


From Anna.Henricson at cgb.ki.se  Thu Jan 13 03:40:05 2005
From: Anna.Henricson at cgb.ki.se (Anna Henricson)
Date: Thu Jan 13 03:36:40 2005
Subject: [Biojava-l] IllegalArgumentException when parsing an embl file
Message-ID: <EOEJIPKFOLMCOPIEHGGNAEEECGAA.Anna.Henricson@cgb.ki.se>

Hi,
I'm parsing the feature table in an embl file to retrieve the information
under feature key CDS. For instance, I am calculating the number of exons,
the length of the exons, retrieving the protein sequence and id etc.
Sometimes an IllegalArgumentException is thrown by the code
sequence = seqIterator.nextSequence();	//(see below in this email)

I guess there is some problem in the embl file with the Location, so that
the Sequence cannot be instantiated, and as a result these sequences will
not be present in my resulting output file. Why is this exception thrown and
is there any way to avoid or handle this problem? Please bear in mind that I
am new to BioJava and therefore would greatly appreciate a more detailed
explanation.
Thanks!
/Anna

The code and the exceptions that are thrown are as follows:

....
	private Sequence sequence;

....

	SequenceIterator seqIterator = SeqIOTools.readEmbl (bufferedReader);
	while (seqIterator.hasNext()){
		try{
			sequence = seqIterator.nextSequence();
		}

		catch (BioException e){
			e.printStackTrace();
		}
		catch (NoSuchElementException e){
			e.printStackTrace();
		}
....

java.lang.IllegalArgumentException: Location [1045891,1046196] is outside 1..1000000
        at org.biojava.bio.seq.impl.SimpleFeature.<init>(SimpleFeature.java:306)
        at org.biojava.bio.seq.impl.SimpleStrandedFeature.<init>(SimpleStrandedFeature.java:74)
        at sun.reflect.GeneratedConstructorAccessor1.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:274)
        at org.biojava.bio.seq.SimpleFeatureRealizer$TemplateImpl.realize(SimpleFeatureRealizer.java:138)
rethrown as org.biojava.bio.BioException: Couldn't realize feature
        at org.biojava.bio.seq.SimpleFeatureRealizer$TemplateImpl.realize(SimpleFeatureRealizer.java:144)
        at org.biojava.bio.seq.SimpleFeatureRealizer.realizeFeature(SimpleFeatureRealizer.java:94)
        at org.biojava.bio.seq.impl.SimpleSequence.realizeFeature(SimpleSequence.java:198)
        at org.biojava.bio.seq.impl.SimpleSequence.createFeature(SimpleSequence.java:204)
        at org.biojava.bio.seq.io.SequenceBuilderBase.makeSequence(SequenceBuilderBase.java:168)
        at org.biojava.bio.seq.io.SmartSequenceBuilder.makeSequence(SmartSequenceBuilder.java:87)
        at org.biojava.bio.seq.io.SequenceBuilderFilter.makeSequence(SequenceBuilderFilter.java:98)
        at org.biojava.bio.seq.io.StreamReader.nextSequence(StreamReader.java:101)
        at EmblFileParser.<init>(EmblFileParser.java:34)
        at EmblToExintFormat.main(EmblToExintFormat.java:57)

--------------------------------------------
Anna Henricson, MSc, PhD student
Center for Genomics and Bioinformatics (CGB)
Karolinska Institutet
S-171 77 Stockholm
Sweden
Phone: +46 (0)8 524 87296
Fax: +46 (0)8 337983


From mark.schreiber at group.novartis.com  Thu Jan 13 04:00:54 2005
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Thu Jan 13 03:57:20 2005
Subject: [Biojava-l] IllegalArgumentException when parsing an embl file
Message-ID: <OFC406E26E.90AE677D-ON48256F88.00303A3D-48256F88.00318800@EU.novartis.net>

Hi Anna -

It seems the problem may be with the EMBL file.

java.lang.IllegalArgumentException: Location [1045891,1046196] is outside
1..1000000

This part indicates that BioJava is trying to make a Feature with the
Location [1045891,1046196], which is outside the bounds of the available
sequence. Creating features outside the available Sequence is not allowed.
Apparently BioJava is able to find 1000000 bases. Is there actually more
sequence than this in the EMBL file?

There are two possible solutions (unless you can get a full version of the 
EMBL file)...

1) If you are only interested in comparing Feature Locations you could use 
a DummySequence of the appropriate length.
2) You could write a custom org.biojava.bio.seq.io.SeqIOFilter which 
overrides the startFeature method and only passes the Feature.Template 
to the delegate if the Location in that Feature.Template is inside the 
valid range.

If you want to use option 2 and need more help then post to the list.
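
The shape of option 2 can be sketched in plain Java. This does not use the
real BioJava classes: FeatureListener, FeatureTemplate and
RangeCheckingFilter are hypothetical stand-ins for the listener, the
Feature.Template and a SeqIOFilter subclass. The sequence length is assumed
to be known up front; in an EMBL entry the feature table comes before the
sequence, so a real filter would have to obtain it some other way.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a feature template: a type and a 1-based inclusive range. */
final class FeatureTemplate {
    final String type;
    final int min;
    final int max;

    FeatureTemplate(String type, int min, int max) {
        this.type = type;
        this.min = min;
        this.max = max;
    }
}

/** Hypothetical stand-in for the listener that receives parser events. */
interface FeatureListener {
    void startFeature(FeatureTemplate template);
    void endFeature();
}

/** Forwards a feature to the delegate only if it lies inside 1..sequenceLength. */
final class RangeCheckingFilter implements FeatureListener {
    private final FeatureListener delegate;
    private final int sequenceLength;
    private int skippedDepth = 0; // greater than 0 while inside a dropped feature

    RangeCheckingFilter(FeatureListener delegate, int sequenceLength) {
        this.delegate = delegate;
        this.sequenceLength = sequenceLength;
    }

    @Override
    public void startFeature(FeatureTemplate t) {
        boolean inside = t.min >= 1 && t.min <= t.max && t.max <= sequenceLength;
        if (skippedDepth > 0 || !inside) {
            skippedDepth++;            // drop this feature and anything nested in it
        } else {
            delegate.startFeature(t);
        }
    }

    @Override
    public void endFeature() {
        if (skippedDepth > 0) {
            skippedDepth--;            // swallow the end event of a dropped feature
        } else {
            delegate.endFeature();
        }
    }
}

/** Records what reaches it; used only to demonstrate the filter. */
final class CollectingListener implements FeatureListener {
    final List<String> started = new ArrayList<>();
    int ended = 0;

    @Override
    public void startFeature(FeatureTemplate t) { started.add(t.type); }

    @Override
    public void endFeature() { ended++; }
}
```

The point of the depth counter is that the delegate must never see an end
event for a feature whose start event was dropped, otherwise the builder
behind it gets out of step.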

- Mark

Mark Schreiber
Principal Scientist (Bioinformatics)

Novartis Institute for Tropical Diseases (NITD)
10 Biopolis Road
#05-01 Chromos
Singapore 138670
www.nitd.novartis.com

phone +65 6722 2973
fax  +65 6722 2910



From kalle.naslund at genpat.uu.se  Thu Jan 13 05:06:17 2005
From: kalle.naslund at genpat.uu.se (=?ISO-8859-1?Q?Kalle_N=E4slund?=)
Date: Thu Jan 13 05:04:00 2005
Subject: [Biojava-l] IllegalArgumentException when parsing an embl file
In-Reply-To: <OFC406E26E.90AE677D-ON48256F88.00303A3D-48256F88.00318800@EU.novartis.net>
References: <OFC406E26E.90AE677D-ON48256F88.00303A3D-48256F88.00318800@EU.novartis.net>
Message-ID: <41E64819.7020100@genpat.uu.se>

mark.schreiber@group.novartis.com wrote:

>Hi Anna -
>
>It seems the problem may be with the EMBL file.
>
>java.lang.IllegalArgumentException: Location [1045891,1046196] is outside
>1..1000000
>
>This part indicates BioJava is trying to make a Feature with the Location [1045891,1046196] which is outside the bounds of the available sequence. It is not allowed 
>to create features outside of the available Sequence. Apparently BioJava 
>is able to find 1000000 bases. Is there actually more sequence than this in the EMBL file?
>
>  
>
Just a quick observation that might be relevant: it seems that the only
place in SimpleFeature that throws an IllegalArgumentException because the
range is outside the sequence has been commented out for some time, with
the CVS comment "Various modifications to make life easier" (SimpleFeature
rev 1.22). So this problem seems to have occurred for other people too, and
a quick and dirty solution was found.

So I guess that, if Anna is aware of the consequences (things might go
wrong if you want to use these Features), a solution might be to try an
up-to-date BioJava CVS build.

mvh Kalle
From piroska.devay at pharma.novartis.com  Mon Jan 17 10:53:46 2005
From: piroska.devay at pharma.novartis.com (piroska.devay@pharma.novartis.com)
Date: Mon Jan 17 10:49:58 2005
Subject: [Biojava-l] (no subject)
Message-ID: <OFB32D5BDA.659E98AE-ONC1256F8C.005172D0-C1256F8C.0057524A@EU.novartis.net>

Dear All,

I am new to BioJava and unfortunately new to Java also.
While parsing Fasta output I was able to modify the FastaSearchSAXParser
to print the parsed data on standard output.
In the Fasta output the query-hit alignments are not returned; instead the 
query sequence and the subject sequence are returned separately.
If the sequences were shifted by Fasta for matching, '-' symbols are 
inserted 
(-----------------------QVQLQQSGNELAKPGASMKMSCRASGYSFTSYWIHWLKQRPDQGLEWIGYIDPATAYTESNQKFKDKAILTADRS)
I would like to align these sequence strings. I would simply have 2 
strings as input, or have them converted into SymbolLists.
I can't seem to find the right class to do this. 
Could someone offer advice or refer me to some sample programs that I can 
browse through or a more detailed tutorial?

Thanks very much,

Piroska
From mark.schreiber at group.novartis.com  Mon Jan 17 21:34:56 2005
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Mon Jan 17 21:31:12 2005
Subject: [Biojava-l] (no subject)
Message-ID: <OFC5FE17A0.E2B41B95-ON48256F8D.000D3AAE-48256F8D.000E2F87@EU.novartis.net>

Hi -

I'm not too clear on what you are trying to do but you may find the Blast 
and Fasta tutorials on this page helpful: 
http://www.biojava.org/docs/bj_in_anger/index.htm. Much of the material 
in the Blast tutorials is relevant to the Fasta parsers.

From my understanding of your email you are not getting some of the 
information you want from the standard classes. You should know that the 
SearchContentAdapters provided with BioJava do not capture every detail of 
a search, only the bits we thought were most interesting. If you need to 
get more (or less) information you will need to make a custom 
SearchContentAdapter (usually you just extend the SearchContentAdapter and 
override some or all of the methods).  Particularly you may want to look 
at http://www.biojava.org/docs/bj_in_anger/blastecho.htm which shows how 
to echo events from a BlastLikeSAXParser to STDOUT. It should be very easy 
to change this to echo for a FastaSearchSAXParser. Running this program 
will help you determine where the things you are looking for may end up 
and help you decide if and how you need to make a custom 
SearchContentAdapter to get the information you want.

Hope this helps,

Mark

Mark Schreiber
Principal Scientist (Bioinformatics)

Novartis Institute for Tropical Diseases (NITD)
10 Biopolis Road
#05-01 Chromos
Singapore 138670
www.nitd.novartis.com

phone +65 6722 2973
fax  +65 6722 2910




From d.lapointe at comcast.net  Thu Jan 20 08:46:15 2005
From: d.lapointe at comcast.net (David Lapointe)
Date: Thu Jan 20 08:42:40 2005
Subject: [Biojava-l] BioSQL 
Message-ID: <200501200846.15274.d.lapointe@comcast.net>

In BioJavaInAnger David Huen has the disclaimer at the end of his useful 
section

NOTE: If you are using the 1.3 version of Biojava with the Singapore schema, 
do not install biosqldb-assembly-pg.sql or biosql-accelerators-pg.sql as 
described above. All you will need is the new biosqldb-pg.sql. There 
appear to be performance issues in some cases when the other stuff is 
installed also. This note will be updated eventually to reflect this advice.

It is not clear to me which way to go.  How does one know whether one has the 
Singapore schema?
-- 
 .david
 David Lapointe
"Love goes out the door when money  comes innuendo." - G.Marx
From smh1008 at cus.cam.ac.uk  Thu Jan 20 12:50:35 2005
From: smh1008 at cus.cam.ac.uk (David Huen)
Date: Thu Jan 20 12:46:46 2005
Subject: [Biojava-l] BioSQL
In-Reply-To: <200501200846.15274.d.lapointe@comcast.net>
References: <200501200846.15274.d.lapointe@comcast.net>
Message-ID: <200501201750.35335.smh1008@cus.cam.ac.uk>

On Thursday 20 Jan 2005 13:46, David Lapointe wrote:
> In BioJavaInAnger David Huen has the disclaimer at the end of his useful
> section
>
> NOTE: If you are using the 1.3 version of Biojava with the Singapore
> schema, do not install biosqldb-assembly-pg.sql or
> biosql-accelerators-pg.sql as described above. All you will need is the
> new biosqldb-pg.sql. There appear to be performance issues in some
> cases when the other stuff is installed also. This note will be updated
> eventually to reflect this advice.
>
> It is not clear to me which way to go.  How does one know whether one has
> the Singapore schema ?

I believe that the current schemas are at:-
http://cvs.open-bio.org/cgi-bin/viewcvs/viewcvs.cgi/biosql-schema/sql/?cvsroot=biosql

The last time I tried with postgres, I believe biosqldb-pg.sql alone 
was sufficient (I am relying on hazy memories here; does anyone else know 
better?) and drop-tables.sql is useful for cleaning up.  Performance 
was quite slow, but Thomas Down found that introducing a further index 
improved things:

create index seqfeatureloc_seqfeature on location (seqfeature_id);

I don't know if this has been committed since.  If you have performance 
problems with postgres, do investigate adding indices in the postgres case.

Regards,
David H.
From dan.baggott.work at gmail.com  Thu Jan 20 18:02:44 2005
From: dan.baggott.work at gmail.com (Dan Baggott)
Date: Thu Jan 20 17:58:47 2005
Subject: [Biojava-l] reading nib sequence files
Message-ID: <922876fd050120150262960c68@mail.gmail.com>

Does anyone have any Java code for reading from .nib nucleotide
sequence files (i.e. what's used by the UCSC folks)?   I know Jim Kent
et al. have some utilities (I think in C) for reading from nib files
but am wondering about Java...

Thanks,

Dan
From dlondon at ebi.ac.uk  Thu Jan 20 12:58:59 2005
From: dlondon at ebi.ac.uk (Darin London)
Date: Thu Jan 20 19:32:35 2005
Subject: [Biojava-l] BOSC 2005
Message-ID: <20050120175859.GA7254@parrot.ebi.ac.uk>

 {Please pass the word!}
 
 MEETING ANNOUNCEMENT & CALL FOR SPEAKERS

 The 6th annual Bioinformatics Open Source Conference (BOSC'2005) is organized by the
 not-for-profit Open Bioinformatics Foundation. The meeting will take place
 June 23-24, 2005 in Detroit, Michigan, USA, and is one of several Special Interest
 Group (SIG) meetings occurring in conjunction with the 13th International Conference
 on Intelligent Systems for Molecular Biology.

 see http://www.iscb.org/ismb2005 for more information.

 Because of the power of many Open Source bioinformatics packages in
 use by the Research Community today, it is not too presumptuous to say 
 that the work of the Open Source Bioinformatics Community represents 
 the cutting edge of Bioinformatics in general. This has been repeatedly 
 demonstrated by the quality of presentations at previous BOSC conferences.
 This year, at BOSC 2005, we want to continue this tradition of excellence, 
 while presenting this message to a wider part of the Research Community.  
 Please, pass this message on to anyone you know that is interested in
 Bioinformatics software. 


 BOSC PROGRAM & CONTACT INFO
 
 * Web: http://www.open-bio.org/bosc2005/
 * Email: bosc@open-bio.org
 
 FEES

  TO BE ANNOUNCED. Watch the bosc website for more information.
 
 
 SPEAKERS & ABSTRACTS WANTED
 
 The program committee is currently seeking abstracts for talks at BOSC 
 2005. BOSC is a great opportunity for you to tell the community about 
 your use, development, or philosophy of open source software development 
 in bioinformatics. The committee will select several submitted abstracts 
 for 25-minute talks and others for shorter "lightning" talks. Accepted 
 abstracts will be published on the BOSC web site.
 
 If you are interested in speaking at BOSC 2005, 
 please send us before April 26, 2005:
 
 * an abstract (no more than a few paragraphs)
 * a URL for the project page, if applicable
 * information about the open source license used for your software or 
   your release plans.

 Abstracts will be accepted for submission until April 26, 2005.
 Abstracts chosen for presentation will be announced May 12, 2005 
 (before the ISMB Early Registration Deadline).

 LIGHTNING-TALK SPEAKERS WANTED!
 
 The program committee is currently seeking speakers for the lightning 
 talks at BOSC 2005. Lightning talks are quick - only five minutes 
 long - and a great opportunity for you to give people a quick 
 summary of your open source project, code, idea, or vision of the future.

 If you are interested in giving a lightning talk at BOSC 2005, 
 please send us:

 * a brief title and summary (one or two lines)
 * a URL for the project page, if applicable
 * information about the open source license used for your software or 
   your release plans.

 We will accept entries on-line until BOSC starts, but
 space for demos and lightning talks is limited.
    
 SOFTWARE DEMONSTRATIONS WANTED!
 If you are involved in the development of Open Source Bioinformatics Software, 
 you are invited to provide a short demonstration to attendees of BOSC 2005.

 If you are interested in giving a software demonstration at BOSC 2005,
 please send us:

 * a brief title and summary (one or two lines)
 * a URL for the project page, if applicable
 * Internet connectivity requirements (e.g. website Application served on the 
   world wide web, or web based client application).

   We will accept entries on-line until the BOSC starts, but
   space for demos and lightning talks is limited. 

** Because the mission of the OBF is to promote Open Source software, we will favor
   submissions for projects that apply a recognized Open Source License, or adhere to
   the general Open Source Philosophy. See the following websites for further details:
   http://www.opensource.org/licenses/
   http://www.opensource.org/docs/definition.php


  SESSION CHAIRS WANTED
  If you would like to be involved in BOSC 2005, we invite you to chair a session.  This will 
  not require much of your time.  You will be given a schedule of presenters during your session. 
  You simply introduce each speaker, and manage the time of their presentation (25 minutes for full 
  presentations, 5-10 minutes for lightning talks/demos, depending on the number of entries).

  If you are interested in chairing a session, please send us your name and affiliation (if applicable).

-- 
cheers,

Darin London dlondon@ebi.ac.uk    European Bioinformatics Institute, 
+44 (0)1223 49 2566               Wellcome Trust Genome Campus, Hinxton 
+44 (0)1223 49 4468 (fax)         Cambridgeshire CB10 1SD, UK
From mark.schreiber at group.novartis.com  Sun Jan 23 21:10:35 2005
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Sun Jan 23 21:06:48 2005
Subject: [Biojava-l] reading nib sequence files
Message-ID: <OF4ED37E07.9E461415-ON48256F93.000BE47B-48256F93.000BF4F7@EU.novartis.net>

In short, no.

Do you have a description of the format? It may not be too hard to adapt 
an existing parser.

- Mark




From hollandr at gis.a-star.edu.sg  Sun Jan 23 21:48:44 2005
From: hollandr at gis.a-star.edu.sg (Richard HOLLAND)
Date: Sun Jan 23 21:45:59 2005
Subject: [Biojava-l] reading nib sequence files
Message-ID: <6D9E9B9DF347EF4385F6271C64FB8D56015B2C8B@BIONIC.biopolis.one-north.com>

It's a compressed binary format. I doubt BioJava would be able to read
it without a lot of effort as the current parser framework is set up for
text input only.

Richard Holland
Bioinformatics Specialist
GIS extension 8199   
 
---------------------------------------------
This email is confidential and may be privileged. If you are not the
intended recipient, please delete it and notify us immediately. Please
do not copy or use it for any purpose, or disclose its content to any
other person. Thank you.
---------------------------------------------



From td2 at sanger.ac.uk  Mon Jan 24 03:34:04 2005
From: td2 at sanger.ac.uk (Thomas Down)
Date: Mon Jan 24 03:30:07 2005
Subject: [Biojava-l] reading nib sequence files
In-Reply-To: <6D9E9B9DF347EF4385F6271C64FB8D56015B2C8B@BIONIC.biopolis.one-north.com>
References: <6D9E9B9DF347EF4385F6271C64FB8D56015B2C8B@BIONIC.biopolis.one-north.com>
Message-ID: <B4BF6812-6DE2-11D9-A9E5-000A95C8B056@sanger.ac.uk>


On 24 Jan 2005, at 02:48, Richard HOLLAND wrote:

> It's a compressed binary format. I doubt BioJava would be able to read
> it without a lot of effort as the current parser framework is set up 
> for
> text input only.

Nib support probably wouldn't fit into the text-oriented parsing 
framework, but I'm sure it could be supported somehow if there was 
demand.  A quick google doesn't turn up any format documentation, but 
Jim Kent's IO code is at:

           http://www.soe.ucsc.edu/~kent/src/unzipped/lib/nib.c

One interesting way to handle this might be to open the nib file as a 
MappedByteBuffer, and back a SymbolList directly using that, 
potentially giving us an efficient way of working with huge sequences. 
Any interest in that?

           Thomas.

From mark.schreiber at group.novartis.com  Mon Jan 24 03:37:16 2005
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Mon Jan 24 03:33:26 2005
Subject: [Biojava-l] reading nib sequence files
Message-ID: <OF56F10C0B.C32E819F-ON48256F93.002F4E82-48256F93.002F5BAC@EU.novartis.net>

I'd need to brush up on my nio, and my c !




From hollandr at gis.a-star.edu.sg  Mon Jan 24 03:47:12 2005
From: hollandr at gis.a-star.edu.sg (Richard HOLLAND)
Date: Mon Jan 24 03:44:07 2005
Subject: [Biojava-l] reading nib sequence files
Message-ID: <6D9E9B9DF347EF4385F6271C64FB8D56015B2CF1@BIONIC.biopolis.one-north.com>

I think storing sequences internally in a compressed binary form would
be a good idea regardless, for any symbol list. Currently
each Symbol in a SymbolList requires one word of memory (the size of a
memory pointer to the singleton Symbol instances). Therefore any
SymbolList of length X containing symbols from an n-ary alphabet would
require X words of memory to store it, plus the overhead of the
SymbolList and n Symbol singleton instances (admittedly shared between
all SymbolLists currently in memory).

If you used a compressed binary format internally, doing away with
explicit Symbol references and representing each symbol in a ByteBuffer
as binary values (00 for A, 01 for T, 10 for C, 11 for G etc.), you
would require much less space than even the singleton model above. This
way you could fit four DNA symbols into a single byte of memory, as
opposed to four words of memory. The number of bits required for a
symbol in any given alphabet is merely log base 2 of the size of the
alphabet, rounded up to the next whole number. E.g. for the English
alphabet of 26 letters only, you would need 5 bits, or in terms of whole
bytes, you would be able to fit 8 symbols into 5 bytes. 
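
The arithmetic can be checked with a few lines of Java. This is only a
sketch; the class and method names are made up for illustration:

```java
// Hypothetical helper for the bits-per-symbol arithmetic.
final class BitsPerSymbol {
    // ceil(log2(alphabetSize)) using integer arithmetic only.
    static int bitsFor(int alphabetSize) {
        if (alphabetSize < 1) {
            throw new IllegalArgumentException("alphabet must not be empty");
        }
        return 32 - Integer.numberOfLeadingZeros(alphabetSize - 1);
    }

    // Whole bytes needed to hold symbolCount symbols of the given width.
    static long bytesFor(long symbolCount, int bits) {
        return (symbolCount * bits + 7) / 8;
    }
}
```

bitsFor(4) is 2 and bitsFor(26) is 5, so bytesFor(4, 2) is 1 and
bytesFor(8, 5) is 5, matching the figures above.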

To do this you would need to define a 'bits' parameter on the alphabet
which is calculated from the number of symbols in the alphabet, a
'bitMap' parameter on the alphabet which maps symbols to bit values (and
vice versa with 'inverseBitMap'), and keep a separate 'length' parameter
in the SymbolList which would be used to tell the binary decoder when to
stop parsing the sequence (as you can only store whole bytes, there will
often be trailing zeroes in the buffer which could be misleading without
this extra parameter).
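
As a rough sketch of how those pieces might fit together, using plain
chars instead of BioJava Symbols. PackedChars and all of its members are
hypothetical names, and the alphabet is simply passed in as a String:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical fixed-width packed sequence over a small char alphabet.
final class PackedChars {
    private final int bits;                      // bits per symbol
    private final Map<Character, Integer> bitMap = new HashMap<>();
    private final char[] inverseBitMap;          // code -> symbol
    private final byte[] data;                   // packed, most significant bit first
    private final int length;                    // symbols stored; ignores padding bits

    PackedChars(String alphabet, String sequence) {
        inverseBitMap = alphabet.toCharArray();
        bits = 32 - Integer.numberOfLeadingZeros(inverseBitMap.length - 1);
        for (int i = 0; i < inverseBitMap.length; i++) {
            bitMap.put(inverseBitMap[i], i);
        }
        length = sequence.length();
        data = new byte[(length * bits + 7) / 8];
        for (int i = 0; i < length; i++) {
            Integer code = bitMap.get(sequence.charAt(i));
            if (code == null) {
                throw new IllegalArgumentException("not in alphabet: " + sequence.charAt(i));
            }
            for (int b = 0; b < bits; b++) {
                if (((code >> (bits - 1 - b)) & 1) != 0) {
                    int pos = i * bits + b;
                    data[pos / 8] |= 0x80 >>> (pos % 8);
                }
            }
        }
    }

    // Decodes one symbol on the fly; the length check hides the padding bits.
    char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int code = 0;
        for (int b = 0; b < bits; b++) {
            int pos = index * bits + b;
            int bit = (data[pos / 8] >>> (7 - pos % 8)) & 1;
            code = (code << 1) | bit;
        }
        return inverseBitMap[code];
    }

    int length() {
        return length;
    }

    int byteCount() {
        return data.length;
    }
}
```

With the alphabet "ATCG", the seven symbols of "GATTACA" fit in two
bytes. The two unused trailing bits would decode as an extra 'A' if the
length were not stored separately.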

You could always return singleton Symbol objects if requested, by
decoding the binary sequence on the fly, but you would no longer need to
store the sequence using them.

Is this worth considering for the big BioJava rewrite?

Richard Holland
Bioinformatics Specialist
GIS extension 8199   
 
---------------------------------------------
This email is confidential and may be privileged. If you are not the
intended recipient, please delete it and notify us immediately. Please
do not copy or use it for any purpose, or disclose its content to any
other person. Thank you.
---------------------------------------------



From mark.schreiber at group.novartis.com  Mon Jan 24 03:52:46 2005
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Mon Jan 24 03:48:51 2005
Subject: [Biojava-l] reading nib sequence files
Message-ID: <OF41FA0D2E.F69C186B-ON48256F93.00308230-48256F93.0030C73F@EU.novartis.net>

BioJava does already do some compression on large sequences (or at least 
it used to). As you say, you can bit-pack a lot. Ambiguity causes 
problems, as you can have more than four symbols for DNA (including n, y, r,
etc).
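
One possible way round that, given here as an assumption rather than as
what BioJava does, is to spend four bits per base with one bit per
unambiguous nucleotide, so an ambiguity code is just the OR of the bases
it covers. The bit assignments and names below are arbitrary:

```java
// Hypothetical 4-bit bitmask encoding of DNA with ambiguity codes.
final class DnaAmbiguity {
    static final int A = 1, C = 2, G = 4, T = 8;

    // Maps a base character to its 4-bit mask (only a few IUPAC codes shown).
    static int encode(char base) {
        switch (Character.toUpperCase(base)) {
            case 'A': return A;
            case 'C': return C;
            case 'G': return G;
            case 'T': return T;
            case 'R': return A | G;          // purine
            case 'Y': return C | T;          // pyrimidine
            case 'N': return A | C | G | T;  // any base
            default:
                throw new IllegalArgumentException("unsupported base: " + base);
        }
    }

    // True if the two codes share at least one unambiguous base.
    static boolean matches(char first, char second) {
        return (encode(first) & encode(second)) != 0;
    }
}
```

This costs twice the space of the two-bit packing, but all sixteen
values fit in one nibble and matching is a single AND.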

Does Jim Kent's scheme offer better compression? Even if it doesn't, the 
use of a ByteBuffer will probably increase the speed of the current 
implementations.

- Mark


From hollandr at gis.a-star.edu.sg  Mon Jan 24 03:59:27 2005
From: hollandr at gis.a-star.edu.sg (Richard HOLLAND)
Date: Mon Jan 24 03:56:23 2005
Subject: [Biojava-l] reading nib sequence files
Message-ID: <6D9E9B9DF347EF4385F6271C64FB8D56015B2CF5@BIONIC.biopolis.one-north.com>

NIB files store one base per 4 bits, non-variable, giving a 50%
compression rate and a maximum arity of 16 different base values per
position.

Richard Holland
Bioinformatics Specialist
GIS extension 8199   
 

From mark.schreiber at group.novartis.com  Mon Jan 24 04:17:16 2005
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Mon Jan 24 04:14:09 2005
Subject: [Biojava-l] reading nib sequence files
Message-ID: <OF8EB16D91.42D81E1E-ON48256F93.00322DFB-48256F93.00330526@EU.novartis.net>

BioJava uses (or at least can use) the PackedSymbolList for large 
sequences. It uses an array of longs to represent the packed bits.

There may be some advantage to using a ByteBuffer, hard to know.

- Mark


From verhoeff2 at gis.a-star.edu.sg  Mon Jan 24 04:16:29 2005
From: verhoeff2 at gis.a-star.edu.sg (VERHOEF Frans)
Date: Mon Jan 24 04:14:14 2005
Subject: [Biojava-l] reading nib sequence files
Message-ID: <6D9E9B9DF347EF4385F6271C64FB8D56D73461@BIONIC.biopolis.one-north.com>

You could always write it out through a ZipOutputStream for even more
compression.

Frans


_______________________________________________
Biojava-l mailing list  -  Biojava-l@biojava.org
http://biojava.org/mailman/listinfo/biojava-l

From hollandr at gis.a-star.edu.sg  Mon Jan 24 04:19:07 2005
From: hollandr at gis.a-star.edu.sg (Richard HOLLAND)
Date: Mon Jan 24 04:16:34 2005
Subject: [Biojava-l] reading nib sequence files
Message-ID: <6D9E9B9DF347EF4385F6271C64FB8D56015B2CF8@BIONIC.biopolis.one-north.com>

The trouble with ZIP is that to do random-access reads of the sequence
(eg. give me all bases from X to Y) you have to unzip the whole sequence
each time. That makes it quite a bit slower. The solution needs to be a
compression algorithm of some kind which allows instant random access
without slowing down the create/update process too much either. Hence a
custom fixed-width binary solution would be the first thing that comes
to mind, but it may not be the only one.
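
To illustrate why fixed width gives cheap random access: the bytes that
hold bases X to Y can be computed directly, without touching anything
before them. A small sketch, with 1-based inclusive coordinates and
made-up names:

```java
// Hypothetical offset arithmetic for a fixed-width packed sequence.
final class FixedWidthRange {
    // Index of the byte holding the first bit of symbol x (1-based).
    static long firstByte(long x, int bits) {
        if (x < 1 || bits < 1) {
            throw new IllegalArgumentException("x and bits must be positive");
        }
        return ((x - 1) * bits) / 8;
    }

    // Index of the byte holding the last bit of symbol y (1-based).
    static long lastByte(long y, int bits) {
        if (y < 1 || bits < 1) {
            throw new IllegalArgumentException("y and bits must be positive");
        }
        return (y * bits - 1) / 8;
    }
}
```

For example, at 2 bits per base, bases 1000001 to 1000100 sit in bytes
250000 to 250024, so a 100-base read touches 25 bytes wherever it falls
in the sequence.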

Richard Holland
Bioinformatics Specialist
GIS extension 8199   
 
> > > -----Original Message-----
> > > From: mark.schreiber@group.novartis.com
> > > [mailto:mark.schreiber@group.novartis.com] 
> > > Sent: Monday, January 24, 2005 4:37 PM
> > > To: Thomas Down
> > > Cc: biojava-list List; Richard HOLLAND; 
> > > "<baggott2@llnl.gov"@novartis.com
> > > Subject: Re: [Biojava-l] reading nib sequence files
> > > 
> > > 
> > > I'd need to brush up on my nio, and my c !
> > > 
> > > 
> > > 
> > > 
> > > 
> > > Thomas Down <td2@sanger.ac.uk>
> > > 01/24/2005 04:34 PM
> > > 
> > > 
> > >         To:     "Richard HOLLAND" <hollandr@gis.a-star.edu.sg>
> > >         cc:     "<baggott2@llnl.gov>", biojava-list List 
> > > <biojava-l@biojava.org>, Mark
> > > Schreiber/GP/Novartis@PH
> > >         Subject:        Re: [Biojava-l] reading nib sequence files
> > > 
> > > 
> > > 
> > > On 24 Jan 2005, at 02:48, Richard HOLLAND wrote:
> > > 
> > > > It's a compressed binary format. I doubt BioJava would be
> > > able to read
> > > > it without a lot of effort as the current parser framework
> > > is set up
> > > > for
> > > > text input only.
> > > 
> > > Nib support probably wouldn't fit into the text-oriented parsing
> > > framework, but I'm sure it could be supported somehow if 
> there was 
> > > demand.  A quick google doesn't turn up any format 
> > documentation, but
> > > Jim Kent's IO code is at:
> > > 
> > >            http://www.soe.ucsc.edu/~kent/src/unzipped/lib/nib.c
> > > 
> > > One interesting way to handle this might be to open the nib
> > file as a
> > > MappedByteBuffer, and back a SymbolList directly using that --
> > > potentially giving us an efficient way of working with huge 
> > > sequences.. 
> > >   Any interest in that?
> > > 
> > >            Thomas.
> > > 
> > > 
> > > 
> > > 
> > > 
> > 
> > 
> > 
> > 
> 
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l@biojava.org 
> http://biojava.org/mailman/listinfo/biojava-l
> 

From td2 at sanger.ac.uk  Mon Jan 24 04:22:27 2005
From: td2 at sanger.ac.uk (Thomas Down)
Date: Mon Jan 24 04:18:31 2005
Subject: [Biojava-l] reading nib sequence files
In-Reply-To: <OF8EB16D91.42D81E1E-ON48256F93.00322DFB-48256F93.00330526@EU.novartis.net>
References: <OF8EB16D91.42D81E1E-ON48256F93.00322DFB-48256F93.00330526@EU.novartis.net>
Message-ID: <76C0E31C-6DE9-11D9-A9E5-000A95C8B056@sanger.ac.uk>


On 24 Jan 2005, at 09:17, mark.schreiber@group.novartis.com wrote:

> BioJava uses (or at least can use) the PackedSymbolList for large
> sequences. It uses an array of longs to represent the packed bits.
>
> There may be some advantage to using a ByteBuffer, hard to know.

The main reason I was thinking of using MappedByteBuffer is that if 
you're accessing a large amount of sequence it won't necessarily all 
get loaded into memory at once.  This could, for example, make random 
access to a multi-gigabase sequence database bearable on a basic 
desktop computer.  Just a thought, not sure how much demand there is 
for this.
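
A minimal sketch of that idea, assuming a plain one-byte-per-residue file
rather than the real nib layout. MappedSequence is a hypothetical helper,
not an existing BioJava class; subStr() follows the 1-based, inclusive
coordinates BioJava uses elsewhere. Only the pages actually touched need
to be resident, which is the point made above.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper, not a BioJava class: random access over a mapped file. */
class MappedSequence {
    private final MappedByteBuffer buffer;

    MappedSequence(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // Assumption: the file is smaller than 2 GB, the limit for one mapping.
            // The mapping remains valid after the channel is closed.
            buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    int length() {
        return buffer.capacity();
    }

    /** Returns residues start..end, 1-based and inclusive. */
    String subStr(int start, int end) {
        if (start < 1 || end > length() || start > end) {
            throw new IndexOutOfBoundsException(start + ".." + end);
        }
        byte[] out = new byte[end - start + 1];
        for (int i = 0; i < out.length; i++) {
            // Absolute read: the OS pages data in on demand.
            out[i] = buffer.get(start - 1 + i);
        }
        return new String(out, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("seq", ".txt");
        Files.write(tmp, "ACGTACGTNN".getBytes(StandardCharsets.US_ASCII));
        MappedSequence seq = new MappedSequence(tmp);
        System.out.println(seq.subStr(3, 6)); // prints GTAC
    }
}
```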

            Thomas.

From mark.schreiber at group.novartis.com  Mon Jan 24 04:26:12 2005
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Mon Jan 24 04:29:39 2005
Subject: [Biojava-l] reading nib sequence files
Message-ID: <OF3DB242C5.929C5B35-ON48256F93.0033BA8B-48256F93.0033D6B4@EU.novartis.net>

Would make an interesting comp sci project as you would probably need a 
way to split the DB over multiple files when you hit the limits of your 
OS.
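
The bookkeeping for such a split is mostly address arithmetic. The sketch
below assumes equal-sized segments; SegmentAddress and locate() are
hypothetical names for illustration. The same arithmetic is needed for
memory mapping even on one file, because a single FileChannel.map() region
cannot exceed Integer.MAX_VALUE bytes.

```java
/** Hypothetical helper: maps a global offset to (segment, local offset). */
final class SegmentAddress {
    final int segment; // index of the file (or mapped region) holding the position
    final long offset; // offset within that segment

    SegmentAddress(int segment, long offset) {
        this.segment = segment;
        this.offset = offset;
    }

    /** Assumes every segment except possibly the last holds segmentSize bytes. */
    static SegmentAddress locate(long globalOffset, long segmentSize) {
        if (globalOffset < 0 || segmentSize <= 0) {
            throw new IllegalArgumentException("offset must be >= 0 and size > 0");
        }
        int segment = Math.toIntExact(globalOffset / segmentSize);
        return new SegmentAddress(segment, globalOffset % segmentSize);
    }

    public static void main(String[] args) {
        // Byte 5,000,000,000 of a database split into 2 GB - 1 byte segments.
        SegmentAddress a = locate(5_000_000_000L, Integer.MAX_VALUE);
        System.out.println(a.segment + " " + a.offset); // prints 2 705032706
    }
}
```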

- Mark


Thomas Down <td2@sanger.ac.uk>
01/24/2005 05:22 PM

 
        To:     Mark Schreiber/GP/Novartis@PH
        cc:     biojava-list List <biojava-l@biojava.org>
        Subject:        Re: [Biojava-l] reading nib sequence files


On 24 Jan 2005, at 09:17, mark.schreiber@group.novartis.com wrote:

> BioJava uses (or at least can use) the PackedSymbolList for large
> sequences. It uses an array of longs to represent the packed bits.
>
> There may be some advantage to using a ByteBuffer, hard to know.

The main reason I was thinking of using MappedByteBuffer is that if 
you're accessing a large amount of sequence it won't necessarily all 
get loaded into memory at once.  This could, for example, make random 
access to a multi-gigabase sequence database bearable on a basic 
desktop computer.  Just a thought, not sure how much demand there is 
for this.

            Thomas.


From hollandr at gis.a-star.edu.sg  Mon Jan 24 04:37:50 2005
From: hollandr at gis.a-star.edu.sg (Richard HOLLAND)
Date: Mon Jan 24 04:36:48 2005
Subject: [Biojava-l] reading nib sequence files
Message-ID: <6D9E9B9DF347EF4385F6271C64FB8D56015B2CFC@BIONIC.biopolis.one-north.com>

I would have thought anything that makes life easier when you need it
but doesn't make life harder when you don't would be a good thing, as
long as it doesn't take too much programming effort. Seeing as you are
intending to rewrite BioJava from scratch anyway, I think this would be
a good thing to work into the mechanism.

Richard Holland
Bioinformatics Specialist
GIS extension 8199   
 
---------------------------------------------
This email is confidential and may be privileged. If you are not the
intended recipient, please delete it and notify us immediately. Please
do not copy or use it for any purpose, or disclose its content to any
other person. Thank you.
---------------------------------------------


> -----Original Message-----
> From: biojava-l-bounces@portal.open-bio.org 
> [mailto:biojava-l-bounces@portal.open-bio.org] On Behalf Of 
> Thomas Down
> Sent: Monday, January 24, 2005 5:22 PM
> To: mark.schreiber@group.novartis.com
> Cc: biojava-list List
> Subject: Re: [Biojava-l] reading nib sequence files
> 
> 
> 
> On 24 Jan 2005, at 09:17, mark.schreiber@group.novartis.com wrote:
> 
> > BioJava uses (or at least can use) the PackedSymbolList for large
> > sequences. It uses an array of longs to represent the packed bits.
> >
> > There may be some advantage to using a ByteBuffer, hard to know.
> 
> The main reason I was thinking of using MappedByteBuffer is that if 
> you're accessing a large amount of sequence it won't necessarily all 
> get loaded into memory at once.  This could, for example, make random 
> access to a multi-gigabase sequence database bearable on a basic 
> desktop computer.  Just a thought, not sure how much demand there is 
> for this.
> 
>             Thomas.
> 
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l@biojava.org
> http://biojava.org/mailman/listinfo/biojava-l
> 

From Russell.Smithies at agresearch.co.nz  Mon Jan 24 14:29:37 2005
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Mon Jan 24 14:25:56 2005
Subject: [Biojava-l] reading nib sequence files
Message-ID: <D5DBA313349A4B458528BE63B387F36C936529@imail.agresearch.co.nz>

You don't need to extract the whole file with ZipInputStream first.
I managed to get the part I wanted by setting the offset to the start of
the sequence (was using zipped chromosomes in fasta format) and the
buffer to the length I wanted.
It was a year or 2 ago and I probably don't have the code anymore but it
is possible  ;-)
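
Since the original code is gone, here is a rough reconstruction of the
approach, assuming java.util.zip.ZipFile rather than ZipInputStream.
ZipSlice, read() and writeDemoZip() are hypothetical names. Only the
requested slice is kept in memory, but skip() still has to inflate the
bytes before the offset, so the cost grows with how far into the entry
you read.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper: read a byte range from one entry of a zip file. */
class ZipSlice {
    static byte[] read(Path zip, String entryName, long offset, int length) throws IOException {
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            ZipEntry entry = zf.getEntry(entryName);
            if (entry == null) {
                throw new IOException("No such entry: " + entryName);
            }
            try (InputStream in = zf.getInputStream(entry)) {
                long toSkip = offset;
                while (toSkip > 0) {
                    // skip() decompresses and discards; it does not seek.
                    long skipped = in.skip(toSkip);
                    if (skipped <= 0) {
                        throw new IOException("Offset is past the end of the entry");
                    }
                    toSkip -= skipped;
                }
                byte[] out = new byte[length];
                int filled = 0;
                while (filled < length) {
                    int n = in.read(out, filled, length - filled);
                    if (n < 0) {
                        break; // entry ended before the requested length
                    }
                    filled += n;
                }
                return Arrays.copyOf(out, filled);
            }
        }
    }

    /** Builds a tiny zipped FASTA file for the demonstration. */
    static Path writeDemoZip() throws IOException {
        Path zip = Files.createTempFile("chr", ".zip");
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zip))) {
            out.putNextEntry(new ZipEntry("chr.fa"));
            out.write(">chr\nACGTACGTACGT\n".getBytes(StandardCharsets.US_ASCII));
            out.closeEntry();
        }
        return zip;
    }

    public static void main(String[] args) throws IOException {
        // The header line ">chr\n" is 5 bytes, so the sequence starts at offset 5.
        byte[] slice = read(writeDemoZip(), "chr.fa", 5, 4);
        System.out.println(new String(slice, StandardCharsets.US_ASCII)); // prints ACGT
    }
}
```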

Russell Smithies

Bioinformatics Software Developer
AgResearch Invermay
Private Bag 50034
Puddle Alley
Mosgiel
New Zealand 

-----Original Message-----
From: biojava-l-bounces@portal.open-bio.org
[mailto:biojava-l-bounces@portal.open-bio.org] On Behalf Of Richard
HOLLAND
Sent: Monday, 24 January 2005 10:19 p.m.
To: VERHOEF Frans; mark.schreiber@group.novartis.com
Cc: biojava-list List; Thomas Down
Subject: RE: [Biojava-l] reading nib sequence files

The trouble with ZIP is that to do random-access reads of the sequence
(eg. give me all bases from X to Y) you have to unzip the whole sequence
each time. That makes it quite a bit slower. The solution needs to be a
compression algorithm of some kind which allows instant random access
without slowing down the create/update process too much either. Hence a
custom fixed-width binary solution would be the first thing that comes
to mind, but it may not be the only one.
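
A small sketch of what such a fixed-width scheme could look like. This is
an illustration under assumptions, not BioJava's PackedSymbolList:
PackedBases, bitsFor() and symbolAt() are hypothetical names, and the
alphabet is passed as a plain string. Because every symbol takes the same
number of bits, symbol i starts at bit i * bits, so a lookup needs no
decompression of what comes before it. The symbol count is stored
separately because the last byte may contain padding bits.

```java
/** Hypothetical fixed-width packed store; not BioJava's PackedSymbolList. */
class PackedBases {
    private final char[] alphabet; // code -> symbol
    private final int bits;        // bits per symbol
    private final int length;      // symbol count, kept apart from the bytes
    private final byte[] data;

    /** ceil(log2(n)), with a minimum of one bit. */
    static int bitsFor(int alphabetSize) {
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(alphabetSize - 1));
    }

    PackedBases(String symbols, String sequence) {
        alphabet = symbols.toCharArray();
        bits = bitsFor(alphabet.length);
        length = sequence.length();
        data = new byte[(int) (((long) length * bits + 7) / 8)];
        for (int i = 0; i < length; i++) {
            int code = symbols.indexOf(sequence.charAt(i));
            if (code < 0) {
                throw new IllegalArgumentException("Not in alphabet: " + sequence.charAt(i));
            }
            // Write the code most significant bit first.
            for (int b = 0; b < bits; b++) {
                if (((code >> (bits - 1 - b)) & 1) != 0) {
                    long bit = (long) i * bits + b;
                    data[(int) (bit >>> 3)] |= (byte) (0x80 >>> (int) (bit & 7));
                }
            }
        }
    }

    int byteLength() {
        return data.length;
    }

    /** 0-based random access in constant time. */
    char symbolAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int code = 0;
        for (int b = 0; b < bits; b++) {
            long bit = (long) index * bits + b;
            int value = (data[(int) (bit >>> 3)] >> (7 - (int) (bit & 7))) & 1;
            code = (code << 1) | value;
        }
        return alphabet[code];
    }

    public static void main(String[] args) {
        // Code order A=00, T=01, C=10, G=11.
        PackedBases packed = new PackedBases("ATCG", "GATTACA");
        StringBuilder decoded = new StringBuilder();
        for (int i = 0; i < 7; i++) {
            decoded.append(packed.symbolAt(i));
        }
        System.out.println(decoded + " in " + packed.byteLength() + " bytes"); // GATTACA in 2 bytes
        System.out.println(bitsFor(4) + " " + bitsFor(16) + " " + bitsFor(26)); // 2 4 5
    }
}
```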

Richard Holland
Bioinformatics Specialist
GIS extension 8199   
 
---------------------------------------------
This email is confidential and may be privileged. If you are not the
intended recipient, please delete it and notify us immediately. Please
do not copy or use it for any purpose, or disclose its content to any
other person. Thank you.
---------------------------------------------


> -----Original Message-----
> From: VERHOEF Frans 
> Sent: Monday, January 24, 2005 5:16 PM
> To: Richard HOLLAND; mark.schreiber@group.novartis.com
> Cc: Thomas Down; biojava-list List
> Subject: RE: [Biojava-l] reading nib sequence files
> 
> 
> You could always ZIPStream it out for even more compression.
> 
> Frans
> 
> -----Original Message-----
> From: biojava-l-bounces@portal.open-bio.org 
> [mailto:biojava-l-bounces@portal.open-bio.org] On Behalf Of 
> Richard HOLLAND
> Sent: Monday, January 24, 2005 04:59 PM
> To: mark.schreiber@group.novartis.com
> Cc: Thomas Down; biojava-list List
> Subject: RE: [Biojava-l] reading nib sequence files
> 
> NIB files store one base per 4 bits, non-variable, giving a 
> 50% compression rate and a maximum arity of 16 different base 
> values per position.
> 
> Richard Holland
> Bioinformatics Specialist
> GIS extension 8199   
>  
> ---------------------------------------------
> This email is confidential and may be privileged. If you are 
> not the intended recipient, please delete it and notify us 
> immediately. Please do not copy or use it for any purpose, or 
> disclose its content to any other person. Thank you.
> ---------------------------------------------
> 
> 
> > -----Original Message-----
> > From: mark.schreiber@group.novartis.com
> > [mailto:mark.schreiber@group.novartis.com] 
> > Sent: Monday, January 24, 2005 4:53 PM
> > To: Richard HOLLAND
> > Cc: baggott2@llnl.gov; biojava-list List; Thomas Down
> > Subject: RE: [Biojava-l] reading nib sequence files
> > 
> > 
> > BioJava does already do some compression on large sequences
> > (or at least 
> > it used to). Like you say you can bit pack a lot. Ambiguity causes 
> > problems as you can have more than four symbols for DNA 
> > (including n, y, r 
> > etc).
> > 
> > Does Jim Kent's schema offer better compression? Even if it
> > doesn't, the 
> > use of a ByteBuffer will probably increase the speed of the current 
> > implementations.
> > 
> > - Mark
> > 
> > 
> > 
> > 
> > 
> > "Richard HOLLAND" <hollandr@gis.a-star.edu.sg>
> > 01/24/2005 04:47 PM
> > 
> >  
> >         To:     Mark Schreiber/GP/Novartis@PH, "Thomas Down" 
> > <td2@sanger.ac.uk>
> >         cc:     "biojava-list List" <biojava-l@biojava.org>, 
> > <baggott2@llnl.gov>
> >         Subject:        RE: [Biojava-l] reading nib sequence files
> > 
> > 
> > I think the idea of storing sequences internally as 
> compressed binary 
> > sequence would be a good idea regardless, for any symbol list. 
> > Currently each Symbol in a SymbolList requires one word of 
> memory (the 
> > size of a memory pointer to the singleton Symbol 
> instances). Therefore 
> > any SymbolList of length X containing symbols from an n-ary 
> alphabet 
> > would require X words of memory to store it, plus the 
> overhead of the
> > SymbolList and n Symbol singleton instances (admittedly 
> shared between
> > all SymbolLists currently in memory).
> > 
> > If you used a compressed binary format internally, doing away with 
> > explicit Symbol references and representing each symbol in a 
> > ByteBuffer as binary values (00 for A, 01 for T, 10 for C, 11 for G 
> > etc.), you would require much less space than even the 
> singleton model
> > above. This
> > way you could fit four DNA symbols into a single byte of memory, as
> > opposed to four words of memory. The number of bits required for a
> > symbol in any given alphabet is merely log base 2 of the size of the
> > alphabet, rounded up to the nearest whole number. eg. for 
> the English
> > alphabet of 26 letters only, you would need 5 bits, or in 
> > terms of whole
> > bytes, you would be able to fit 8 symbols into 5 bytes. 
> > 
> > To do this you would need to define a 'bits' parameter on 
> the alphabet 
> > which is calculated from the number of symbols in the alphabet, a 
> > 'bitMap' parameter on the alphabet which maps symbols to bit values 
> > (and vice versa with 'inverseBitMap'), and keep a separate
> > 'length' parameter
> > in the SymbolList which would be used to tell the binary 
> > decoder when to
> > stop parsing the sequence (as you can only store whole bytes, 
> > there will
> > often be trailing zeroes in the buffer which could be 
> > misleading without
> > this extra parameter).
> > 
> > You could always return singleton Symbol objects if requested, by 
> > decoding the binary sequence on the fly, but you would no 
> longer need 
> > to store the sequence using them.
> > 
> > Is this worth considering for the big BioJava rewrite?
> > 
> > Richard Holland
> > Bioinformatics Specialist
> > GIS extension 8199
> >  
> > ---------------------------------------------
> > This email is confidential and may be privileged. If you 
> are not the 
> > intended recipient, please delete it and notify us 
> immediately. Please 
> > do not copy or use it for any purpose, or disclose its 
> content to any 
> > other person. Thank you.
> > ---------------------------------------------
> > 
> > 
> > > -----Original Message-----
> > > From: mark.schreiber@group.novartis.com
> > > [mailto:mark.schreiber@group.novartis.com] 
> > > Sent: Monday, January 24, 2005 4:37 PM
> > > To: Thomas Down
> > > Cc: biojava-list List; Richard HOLLAND; 
> > > "<baggott2@llnl.gov"@novartis.com
> > > Subject: Re: [Biojava-l] reading nib sequence files
> > > 
> > > 
> > > I'd need to brush up on my nio, and my c !
> > > 
> > > 
> > > 
> > > 
> > > 
> > > Thomas Down <td2@sanger.ac.uk>
> > > 01/24/2005 04:34 PM
> > > 
> > > 
> > >         To:     "Richard HOLLAND" <hollandr@gis.a-star.edu.sg>
> > >         cc:     "<baggott2@llnl.gov>", biojava-list List 
> > > <biojava-l@biojava.org>, Mark
> > > Schreiber/GP/Novartis@PH
> > >         Subject:        Re: [Biojava-l] reading nib sequence files
> > > 
> > > 
> > > 
> > > On 24 Jan 2005, at 02:48, Richard HOLLAND wrote:
> > > 
> > > > It's a compressed binary format. I doubt BioJava would be
> > > able to read
> > > > it without a lot of effort as the current parser framework
> > > is set up
> > > > for
> > > > text input only.
> > > 
> > > Nib support probably wouldn't fit into the text-oriented parsing
> > > framework, but I'm sure it could be supported somehow if 
> there was 
> > > demand.  A quick google doesn't turn up any format 
> > documentation, but
> > > Jim Kent's IO code is at:
> > > 
> > >            http://www.soe.ucsc.edu/~kent/src/unzipped/lib/nib.c
> > > 
> > > One interesting way to handle this might be to open the nib
> > file as a
> > > MappedByteBuffer, and back a SymbolList directly using that --
> > > potentially giving us an efficient way of working with huge 
> > > sequences.. 
> > >   Any interest in that?
> > > 
> > >            Thomas.
> > > 
> > > 
> > > 
> > > 
> > > 
> > 
> > 
> > 
> > 
> 
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l@biojava.org 
> http://biojava.org/mailman/listinfo/biojava-l
> 

_______________________________________________
Biojava-l mailing list  -  Biojava-l@biojava.org
http://biojava.org/mailman/listinfo/biojava-l
=======================================================================
Attention: The information contained in this message and/or attachments
from AgResearch Limited is intended only for the persons or entities
to which it is addressed and may contain confidential and/or privileged
material. Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by persons or
entities other than the intended recipients is prohibited by AgResearch
Limited. If you have received this message in error, please notify the
sender immediately.
=======================================================================

From mark.schreiber at group.novartis.com  Mon Jan 24 19:52:09 2005
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Mon Jan 24 19:48:22 2005
Subject: [Biojava-l] Re: [Biojava-dev] BLAST-like
Message-ID: <OF8FD4885A.FE8C3401-ON48256F94.0004A0F6-48256F94.0004C708@EU.novartis.net>

You should be able to find the answer in the archives somewhere. You need 
to call setModeLazy() or something similar on the BlastLikeSAXParser.


"badr al-daihani" <aldaihani@hotmail.co.uk>
Sent by: biojava-dev-bounces@portal.open-bio.org
01/24/2005 11:43 PM

 
        To:     biojava-dev@biojava.org
        cc:     (bcc: Mark Schreiber/GP/Novartis)
        Subject:        [Biojava-dev] BLAST-like


Hi guys
I'm trying to parse a BLAST file using the BlastPareser class in BioJava 
in anger, but I got this error:


org.xml.sax.SAXException: Program ncbi-blastp Version 2.2.5 is not 
supported 
by the biojava blast-like parsing framework


Any help will be appreciated.

Best regards

_________________________________________________________________
It's fast, it's easy and it's free. Get MSN Messenger today! 
http://www.msn.co.uk/messenger

_______________________________________________
biojava-dev mailing list
biojava-dev@biojava.org
http://biojava.org/mailman/listinfo/biojava-dev


From hollandr at gis.a-star.edu.sg  Wed Jan 26 02:54:34 2005
From: hollandr at gis.a-star.edu.sg (Richard HOLLAND)
Date: Wed Jan 26 02:51:45 2005
Subject: [Biojava-l] PatternHunter parsing
Message-ID: <6D9E9B9DF347EF4385F6271C64FB8D56015B2E28@BIONIC.biopolis.one-north.com>

Are there any BioJava or BioPerl modules for parsing PatternHunter output? It's very similar to Blast output, so if there isn't one already, would other people be interested in using one if I wrote one?

cheers,
Richard

Richard Holland
Bioinformatics Specialist
GIS extension 8199   
 
---------------------------------------------
This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its content to any other person. Thank you.
---------------------------------------------


From heuermh at acm.org  Wed Jan 26 14:55:46 2005
From: heuermh at acm.org (Michael Heuer)
Date: Wed Jan 26 15:20:42 2005
Subject: [Biojava-l] grouped features using dazzle GFFAnnotationSource
In-Reply-To: <B4BF6812-6DE2-11D9-A9E5-000A95C8B056@sanger.ac.uk>
Message-ID: <Pine.GSO.4.44.0501261447570.13703-100000@shell3.shore.net>

Hello,

I would like to use the dazzle GFFAnnotationSource to display predicted
transcripts on the Ensembl web interface, but it's a little bit difficult
to follow tags through from the GFF file to what ends up on the Ensembl
web interface.

I imagine I would want a row in the GFF file for each exon and then group
each exon together into a transcript using something in the tag/value
field.  Does anyone have an example?

   michael

From td2 at sanger.ac.uk  Thu Jan 27 05:00:05 2005
From: td2 at sanger.ac.uk (Thomas Down)
Date: Thu Jan 27 04:55:56 2005
Subject: [Biojava-l] grouped features using dazzle GFFAnnotationSource
In-Reply-To: <Pine.GSO.4.44.0501261447570.13703-100000@shell3.shore.net>
References: <Pine.GSO.4.44.0501261447570.13703-100000@shell3.shore.net>
Message-ID: <381BA466-704A-11D9-A8BD-000A95C8B056@sanger.ac.uk>


On 26 Jan 2005, at 19:55, Michael Heuer wrote:

> Hello,
>
> I would like to use the dazzle GFFAnnotationSource to display predicted
> transcripts on the Ensembl web interface, but it's a little bit 
> difficult
> to follow tags through from the GFF file to what ends up on the Ensembl
> web interface.
>
> I imagine I would want a row in the GFF file for each exon and then 
> group
> each exon together into a transcript using something in the tag/value
> field.  Does anyone have an example?

For Ensembl display, it should be sufficient to include a property 
named "id" on each GFF record -- all features with matching IDs will be 
grouped by the ensembl web-code.

Something like:

2L      annotation      exon       1       7528    0.0     .       0	id "Foo"; some_attribute "Bar"
2L      annotation      exon       9492    9835    0.0     .       0	id "Foo"

Should behave the way you want.
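
To make the grouping rule concrete, here is a rough sketch of collecting
GFF records by their "id" attribute. It is not the Dazzle or Ensembl code;
GffGrouper and groupById() are hypothetical names, and the records are
assumed to be tab-separated with the attributes in the ninth column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: groups GFF records that share the same "id" value. */
class GffGrouper {
    // Matches: id "value" at the start of the attributes or after a semicolon.
    private static final Pattern ID = Pattern.compile("(?:^|;)\\s*id\\s+\"([^\"]*)\"");

    /** Returns id -> list of "start..end" spans, in input order. */
    static Map<String, List<String>> groupById(List<String> records) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String record : records) {
            String[] cols = record.split("\t", 9);
            if (cols.length < 9) {
                continue; // no attribute column, nothing to group on
            }
            Matcher m = ID.matcher(cols[8]);
            if (m.find()) {
                groups.computeIfAbsent(m.group(1), k -> new ArrayList<>())
                      .add(cols[3] + ".." + cols[4]);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
            String.join("\t", "2L", "annotation", "exon", "1", "7528", "0.0", ".", "0",
                        "id \"Foo\"; some_attribute \"Bar\""),
            String.join("\t", "2L", "annotation", "exon", "9492", "9835", "0.0", ".", "0",
                        "id \"Foo\""));
        System.out.println(groupById(records)); // prints {Foo=[1..7528, 9492..9835]}
    }
}
```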


              Thomas


From dan.baggott.work at gmail.com  Thu Jan 27 18:01:45 2005
From: dan.baggott.work at gmail.com (Dan Baggott)
Date: Thu Jan 27 17:57:40 2005
Subject: [Biojava-l] reading nib sequence files
In-Reply-To: <D5DBA313349A4B458528BE63B387F36C936529@imail.agresearch.co.nz>
References: <D5DBA313349A4B458528BE63B387F36C936529@imail.agresearch.co.nz>
Message-ID: <922876fd05012715013fc9fc8d@mail.gmail.com>

That question started off a flurry...  Thanks for the input!  So, from
my narrow and selfish perspective, the short of this thread is that
there isn't any "ready to go" nib i/o code and that the existing
BioJava parsing framework is not designed to deal with binary files so
it would be less than trivial to adapt it.

I don't have much experience with reading from large files (binary or
otherwise).  Is there a general consensus on the path of least
resistance for implementing fast random access to large-ish nucleotide
sequences (i.e. on the order of a human chromosome in size)?  I'm not so
concerned about the size of the sequence files, just speed of access. 
I mentioned the nib format in the first place because I was impressed
with the speed at which Jim Kent's nibFrag utility extracts sequence
-- pretty much immediately from the human perspective.

Dan

On Tue, 25 Jan 2005 08:29:37 +1300, Smithies, Russell
<Russell.Smithies@agresearch.co.nz> wrote:
> You don't need to extract the whole file with ZipInputStream first.
> I managed to get the part I wanted by setting the offset to the start of
> the sequence (was using zipped chromosomes in fasta format) and the
> buffer to the length I wanted.
> It was a year or 2 ago and I probably don't have the code anymore but it
> is possible  ;-)
> 
> Russell Smithies
> 
> Bioinformatics Software Developer
> AgResearch Invermay
> Private Bag 50034
> Puddle Alley
> Mosgiel
> New Zealand
> 
> -----Original Message-----
> From: biojava-l-bounces@portal.open-bio.org
> [mailto:biojava-l-bounces@portal.open-bio.org] On Behalf Of Richard
> HOLLAND
> 
> Sent: Monday, 24 January 2005 10:19 p.m.
> To: VERHOEF Frans; mark.schreiber@group.novartis.com
> Cc: biojava-list List; Thomas Down
> Subject: RE: [Biojava-l] reading nib sequence files
> 
> The trouble with ZIP is that to do random-access reads of the sequence
> (eg. give me all bases from X to Y) you have to unzip the whole sequence
> each time. That makes it quite a bit slower. The solution needs to be a
> compression algorithm of some kind which allows instant random access
> without slowing down the create/update process too much either. Hence a
> custom fixed-width binary solution would be the first thing that comes
> to mind, but it may not be the only one.
> 
> Richard Holland
> Bioinformatics Specialist
> GIS extension 8199
> 
> ---------------------------------------------
> This email is confidential and may be privileged. If you are not the
> intended recipient, please delete it and notify us immediately. Please
> do not copy or use it for any purpose, or disclose its content to any
> other person. Thank you.
> ---------------------------------------------
> 
> > -----Original Message-----
> > From: VERHOEF Frans
> > Sent: Monday, January 24, 2005 5:16 PM
> > To: Richard HOLLAND; mark.schreiber@group.novartis.com
> > Cc: Thomas Down; biojava-list List
> > Subject: RE: [Biojava-l] reading nib sequence files
> >
> >
> > You could always ZIPStream it out for even more compression.
> >
> > Frans
> >
> > -----Original Message-----
> > From: biojava-l-bounces@portal.open-bio.org
> > [mailto:biojava-l-bounces@portal.open-bio.org] On Behalf Of
> > Richard HOLLAND
> > Sent: Monday, January 24, 2005 04:59 PM
> > To: mark.schreiber@group.novartis.com
> > Cc: Thomas Down; biojava-list List
> > Subject: RE: [Biojava-l] reading nib sequence files
> >
> > NIB files store one base per 4 bits, non-variable, giving a
> > 50% compression rate and a maximum arity of 16 different base
> > values per position.
> >
> > Richard Holland
> > Bioinformatics Specialist
> > GIS extension 8199
> >
> > ---------------------------------------------
> > This email is confidential and may be privileged. If you are
> > not the intended recipient, please delete it and notify us
> > immediately. Please do not copy or use it for any purpose, or
> > disclose its content to any other person. Thank you.
> > ---------------------------------------------
> >
> >
> > > -----Original Message-----
> > > From: mark.schreiber@group.novartis.com
> > > [mailto:mark.schreiber@group.novartis.com]
> > > Sent: Monday, January 24, 2005 4:53 PM
> > > To: Richard HOLLAND
> > > Cc: baggott2@llnl.gov; biojava-list List; Thomas Down
> > > Subject: RE: [Biojava-l] reading nib sequence files
> > >
> > >
> > > BioJava does already do some compression on large sequences
> > > (or at least
> > > it used to). Like you say you can bit pack a lot. Ambiguity causes
> > > problems as you can have more than four symbols for DNA
> > > (including n, y, r
> > > etc).
> > >
> > > Does Jim Kent's schema offer better compression? Even if it
> > > doens't the
> > > use of a ByteBuffer will probably increase the speed of the current
> > > implementations.
> > >
> > > - Mark
> > >
> > >
> > >
> > >
> > >
> > > "Richard HOLLAND" <hollandr@gis.a-star.edu.sg>
> > > 01/24/2005 04:47 PM
> > >
> > >
> > >         To:     Mark Schreiber/GP/Novartis@PH, "Thomas Down"
> > > <td2@sanger.ac.uk>
> > >         cc:     "biojava-list List" <biojava-l@biojava.org>,
> > > <baggott2@llnl.gov>
> > >         Subject:        RE: [Biojava-l] reading nib sequence files
> > >
> > >
> > > I think the idea of storing sequences internally as
> > compressed binary
> > > sequence would be a good idea regardless, for any symbol list.
> > > Currently each Symbol in a SymbolList requires one word of
> > memory (the
> > > size of a memory pointer to the singleton Symbol
> > instances). Therefore
> > > any SymbolList of length X containing symbols from an n-ary
> > alphabet
> > > would require X words of memory to store it, plus the
> > overhead of the
> > > SymbolList and n Symbol singleton instances (admittedly
> > shared between
> > > all SymbolLists currently in memory).
> > >
> > > If you used a compressed binary format internally, doing away with
> > > explicit Symbol references and representing each symbol in a
> > > ByteBuffer as binary values (00 for A, 01 for T, 10 for C, 11 for G
> > > etc.), you would require much less space than even the
> > singleton model
> > > above. This
> > > way you could fit four DNA symbols into a single byte of memory, as
> > > opposed to four words of memory. The number of bits required for a
> > > symbol in any given alphabet is merely log base 2 of the size of the
> > > alphabet, rounded up to the nearest whole number. eg. for
> > the English
> > > alphabet of 26 letters only, you would need 5 bits, or in
> > > terms of whole
> > > bytes, you would be able to fit 8 symbols into 5 bytes.
> > >
> > > To do this you would need to define a 'bits' parameter on
> > the alphabet
> > > which is calculated from the number of symbols in the alphabet, a
> > > 'bitMap' parameter on the alphabet which maps symbols to bit values
> > > (and vice versa with 'inverseBitMap'), and keep a separate
> > > 'length' parameter
> > > in the SymbolList which would be used to tell the binary
> > > decoder when to
> > > stop parsing the sequence (as you can only store whole bytes,
> > > there will
> > > often be trailing zeroes in the buffer which could be
> > > misleading without
> > > this extra parameter).
> > >
> > > You could always return singleton Symbol objects if requested, by
> > > decoding the binary sequence on the fly, but you would no
> > longer need
> > > to store the sequence using them.
> > >
> > > Is this worth considering for the big BioJava rewrite?
> > >
> > > Richard Holland
> > > Bioinformatics Specialist
> > > GIS extension 8199
> > >
> > > ---------------------------------------------
> > > This email is confidential and may be privileged. If you
> > are not the
> > > intended recipient, please delete it and notify us
> > immediately. Please
> > > do not copy or use it for any purpose, or disclose its
> > content to any
> > > other person. Thank you.
> > > ---------------------------------------------
> > >
> > >
> > > > -----Original Message-----
> > > > From: mark.schreiber@group.novartis.com
> > > > [mailto:mark.schreiber@group.novartis.com]
> > > > Sent: Monday, January 24, 2005 4:37 PM
> > > > To: Thomas Down
> > > > Cc: biojava-list List; Richard HOLLAND;
> > > > "<baggott2@llnl.gov"@novartis.com
> > > > Subject: Re: [Biojava-l] reading nib sequence files
> > > >
> > > >
> > > > I'd need to brush up on my nio, and my c !
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > Thomas Down <td2@sanger.ac.uk>
> > > > 01/24/2005 04:34 PM
> > > >
> > > >
> > > >         To:     "Richard HOLLAND" <hollandr@gis.a-star.edu.sg>
> > > >         cc:     "<baggott2@llnl.gov>", biojava-list List
> > > > <biojava-l@biojava.org>, Mark
> > > > Schreiber/GP/Novartis@PH
> > > >         Subject:        Re: [Biojava-l] reading nib sequence files
> > > >
> > > >
> > > >
> > > > On 24 Jan 2005, at 02:48, Richard HOLLAND wrote:
> > > >
> > > > > It's a compressed binary format. I doubt BioJava would be
> > > > able to read
> > > > > it without a lot of effort as the current parser framework
> > > > is set up
> > > > > for
> > > > > text input only.
> > > >
> > > > Nib support probably wouldn't fit into the text-oriented parsing
> > > > framework, but I'm sure it could be supported somehow if
> > there was
> > > > demand.  A quick google doesn't turn up any format
> > > documentation, but
> > > > Jim Kent's IO code is at:
> > > >
> > > >            http://www.soe.ucsc.edu/~kent/src/unzipped/lib/nib.c
> > > >
> > > > One interesting way to handle this might be to open the nib
> > > file as a
> > > > MappedByteBuffer, and back a SymbolList directly using that --
> > > > potentially giving us an efficient way of working with huge
> > > > sequences..
> > > >   Any interest in that?
> > > >
> > > >            Thomas.
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> > >
> >
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l@biojava.org
> > http://biojava.org/mailman/listinfo/biojava-l
> >
> 
> =======================================================================
> Attention: The information contained in this message and/or attachments
> from AgResearch Limited is intended only for the persons or entities
> to which it is addressed and may contain confidential and/or privileged
> material. Any review, retransmission, dissemination or other use of, or
> taking of any action in reliance upon, this information by persons or
> entities other than the intended recipients is prohibited by AgResearch
> Limited. If you have received this message in error, please notify the
> sender immediately.
> =======================================================================
> 
>
From mark.schreiber at group.novartis.com  Thu Jan 27 22:24:38 2005
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Thu Jan 27 22:21:12 2005
Subject: [Biojava-l] reading nib sequence files
Message-ID: <OF6218D651.A0AB9A23-ON48256F97.0012AC97-48256F97.0012BCC7@EU.novartis.net>

I think if you want to use Java the nio packages are the way to go.

Just my $0.02
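
A minimal sketch of what random access through nio might look like. It is
not a nib reader: the 8-byte header, the two-bases-per-byte packing (first
base in the high nibble) and the T/C/A/G/N code table are assumptions made
for illustration and would need checking against nib.c. The names
PackedNibbleReader and fetch are made up for this example.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class PackedNibbleReader {
    // Assumption: a fixed 8-byte header precedes the packed bases.
    static final int HEADER_BYTES = 8;
    // Assumption: nibble values 0..4 map to these bases; other values become 'N'.
    static final char[] CODES = {'T', 'C', 'A', 'G', 'N'};

    /** Returns bases [start, start + length), 0-based, mapping only the bytes needed. */
    static String fetch(Path file, long start, int length) throws IOException {
        if (length <= 0) {
            return "";
        }
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long firstByte = HEADER_BYTES + start / 2;
            long lastByte = HEADER_BYTES + (start + length - 1) / 2;
            // Map just the window that covers the requested bases.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY,
                                          firstByte, lastByte - firstByte + 1);
            StringBuilder sb = new StringBuilder(length);
            for (int i = 0; i < length; i++) {
                long base = start + i;
                int b = buf.get((int) (base / 2 - start / 2)) & 0xFF;
                // Even positions sit in the high nibble, odd positions in the low one.
                int nibble = (base % 2 == 0) ? (b >>> 4) : (b & 0x0F);
                sb.append(nibble < CODES.length ? CODES[nibble] : 'N');
            }
            return sb.toString();
        }
    }
}
```

Because the offset of any base is simple arithmetic, the cost of a fetch
depends on the length requested rather than on the size of the chromosome.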


Dan Baggott <dan.baggott.work@gmail.com>
Sent by: biojava-l-bounces@portal.open-bio.org
01/28/2005 07:01 AM
Please respond to baggott2

 
        To:     biojava-list List <biojava-l@biojava.org>
        cc:     (bcc: Mark Schreiber/GP/Novartis)
        Subject:        Re: [Biojava-l] reading nib sequence files


That question started off a flurry...  Thanks for the input!  So, from
my narrow and selfish perspective, the short of this thread is that
there isn't any "ready to go" nib i/o code and that the existing
BioJava parsing framework is not designed to deal with binary files so
it would be less than trivial to adapt it.

I don't have much experience with reading from large files (binary or
otherwise).  Is there a general consensus on the path of least
resistance for implementing fast random access to large-ish nucleotide
sequences (i.e. on the order of a human chromosome)?  I'm not so
concerned about the size of the sequence files, just speed of access. 
I mentioned the nib format in the first place because I was impressed
with the speed at which Jim Kent's nibFrag utility extracts sequence
-- pretty much immediately from the human perspective.

Dan

On Tue, 25 Jan 2005 08:29:37 +1300, Smithies, Russell
<Russell.Smithies@agresearch.co.nz> wrote:
> You don't need to extract the whole file with ZipInputStream first.
> I managed to get the part I wanted by setting the offset to the start of
> the sequence (was using zipped chromosomes in fasta format) and the
> buffer to the length I wanted.
> It was a year or 2 ago and I probably don't have the code anymore but it
> is possible  ;-)
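
A rough sketch of the approach described above, assuming a zip archive whose
first entry holds the sequence. ZipRangeReader and readRange are hypothetical
names. The offset is counted in uncompressed bytes, so for FASTA it would
have to allow for the header line and line breaks. Note that skip() on a
ZipInputStream still inflates everything before the offset and discards it,
so nothing is extracted to disk but the work grows with the offset.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipInputStream;

final class ZipRangeReader {
    /** Reads up to length bytes starting at an uncompressed offset in the first entry. */
    static String readRange(InputStream rawZip, long offset, int length) throws IOException {
        try (ZipInputStream zin = new ZipInputStream(rawZip)) {
            if (zin.getNextEntry() == null) {
                throw new IOException("empty zip archive");
            }
            // skip() may skip fewer bytes than asked for, so loop until done.
            long remaining = offset;
            while (remaining > 0) {
                long skipped = zin.skip(remaining);
                if (skipped <= 0) {
                    throw new EOFException("offset is past the end of the entry");
                }
                remaining -= skipped;
            }
            // read() may also return short counts, so fill the buffer in a loop.
            byte[] buf = new byte[length];
            int filled = 0;
            while (filled < length) {
                int n = zin.read(buf, filled, length - filled);
                if (n < 0) {
                    break; // entry ended before length bytes were available
                }
                filled += n;
            }
            return new String(buf, 0, filled, StandardCharsets.US_ASCII);
        }
    }
}
```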
> 
> Russell Smithies
> 
> Bioinformatics Software Developer
> AgResearch Invermay
> Private Bag 50034
> Puddle Alley
> Mosgiel
> New Zealand
> 
> -----Original Message-----
> From: biojava-l-bounces@portal.open-bio.org
> [mailto:biojava-l-bounces@portal.open-bio.org] On Behalf Of Richard
> HOLLAND
> 
> Sent: Monday, 24 January 2005 10:19 p.m.
> To: VERHOEF Frans; mark.schreiber@group.novartis.com
> Cc: biojava-list List; Thomas Down
> Subject: RE: [Biojava-l] reading nib sequence files
> 
> The trouble with ZIP is that to do random-access reads of the sequence
> (eg. give me all bases from X to Y) you have to unzip the whole sequence
> each time. That makes it quite a bit slower. The solution needs to be a
> compression algorithm of some kind which allows instant random access
> without slowing down the create/update process too much either. Hence a
> custom fixed-width binary solution would be the first thing that comes
> to mind, but it may not be the only one.
> 
> Richard Holland
> Bioinformatics Specialist
> GIS extension 8199
> 
> ---------------------------------------------
> This email is confidential and may be privileged. If you are not the
> intended recipient, please delete it and notify us immediately. Please
> do not copy or use it for any purpose, or disclose its content to any
> other person. Thank you.
> ---------------------------------------------
> 
> > -----Original Message-----
> > From: VERHOEF Frans
> > Sent: Monday, January 24, 2005 5:16 PM
> > To: Richard HOLLAND; mark.schreiber@group.novartis.com
> > Cc: Thomas Down; biojava-list List
> > Subject: RE: [Biojava-l] reading nib sequence files
> >
> >
> > You could always ZIPStream it out for even more compression.
> >
> > Frans
> >
> > -----Original Message-----
> > From: biojava-l-bounces@portal.open-bio.org
> > [mailto:biojava-l-bounces@portal.open-bio.org] On Behalf Of
> > Richard HOLLAND
> > Sent: Monday, January 24, 2005 04:59 PM
> > To: mark.schreiber@group.novartis.com
> > Cc: Thomas Down; biojava-list List
> > Subject: RE: [Biojava-l] reading nib sequence files
> >
> > NIB files store one base per 4 bits, non-variable, giving a
> > 50% compression rate and a maximum arity of 16 different base
> > values per position.
> >
> > Richard Holland
> > Bioinformatics Specialist
> > GIS extension 8199
> >
> >
> >
> > > -----Original Message-----
> > > From: mark.schreiber@group.novartis.com
> > > [mailto:mark.schreiber@group.novartis.com]
> > > Sent: Monday, January 24, 2005 4:53 PM
> > > To: Richard HOLLAND
> > > Cc: baggott2@llnl.gov; biojava-list List; Thomas Down
> > > Subject: RE: [Biojava-l] reading nib sequence files
> > >
> > >
> > > BioJava does already do some compression on large sequences (or at
> > > least it used to). Like you say you can bit pack a lot. Ambiguity
> > > causes problems as you can have more than four symbols for DNA
> > > (including n, y, r etc).
> > >
> > > Does Jim Kent's schema offer better compression? Even if it doesn't,
> > > the use of a ByteBuffer will probably increase the speed of the
> > > current implementations.
> > >
> > > - Mark
> > >
> > >
> > >
> > >
> > >
> > > "Richard HOLLAND" <hollandr@gis.a-star.edu.sg>
> > > 01/24/2005 04:47 PM
> > >
> > >
> > >         To:     Mark Schreiber/GP/Novartis@PH, "Thomas Down"
> > > <td2@sanger.ac.uk>
> > >         cc:     "biojava-list List" <biojava-l@biojava.org>,
> > > <baggott2@llnl.gov>
> > >         Subject:        RE: [Biojava-l] reading nib sequence files
> > >
> > >
> > > I think the idea of storing sequences internally as compressed binary
> > > sequence would be a good idea regardless, for any symbol list.
> > > Currently each Symbol in a SymbolList requires one word of memory (the
> > > size of a memory pointer to the singleton Symbol instances). Therefore
> > > any SymbolList of length X containing symbols from an n-ary alphabet
> > > would require X words of memory to store it, plus the overhead of the
> > > SymbolList and n Symbol singleton instances (admittedly shared between
> > > all SymbolLists currently in memory).
> > >
> > > If you used a compressed binary format internally, doing away with
> > > explicit Symbol references and representing each symbol in a
> > > ByteBuffer as binary values (00 for A, 01 for T, 10 for C, 11 for G
> > > etc.), you would require much less space than even the singleton model
> > > above. This way you could fit four DNA symbols into a single byte of
> > > memory, as opposed to four words of memory. The number of bits required
> > > for a symbol in any given alphabet is merely log base 2 of the size of
> > > the alphabet, rounded up to the nearest whole number. E.g. for the
> > > English alphabet of 26 letters only, you would need 5 bits, or in terms
> > > of whole bytes, you would be able to fit 8 symbols into 5 bytes.
> > >
> > > To do this you would need to define a 'bits' parameter on the alphabet
> > > which is calculated from the number of symbols in the alphabet, a
> > > 'bitMap' parameter on the alphabet which maps symbols to bit values
> > > (and vice versa with 'inverseBitMap'), and keep a separate 'length'
> > > parameter in the SymbolList which would be used to tell the binary
> > > decoder when to stop parsing the sequence (as you can only store whole
> > > bytes, there will often be trailing zeroes in the buffer which could be
> > > misleading without this extra parameter).
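
The proposal above can be sketched roughly as follows. This is an
illustration only and not BioJava code: PackedSymbolList is a hypothetical
class, it uses plain chars in place of Symbol objects and a byte[] in place
of a ByteBuffer, and it keeps 'bits', 'bitMap' and the inverse map in the
list itself rather than on an Alphabet.

```java
import java.util.HashMap;
import java.util.Map;

final class PackedSymbolList {
    private final char[] inverseBitMap;                        // bit value -> symbol
    private final Map<Character, Integer> bitMap = new HashMap<>(); // symbol -> bit value
    private final int bits;                                    // bits per symbol
    private final int length;                                  // symbols actually stored
    private final byte[] packed;

    PackedSymbolList(String alphabet, String sequence) {
        inverseBitMap = alphabet.toCharArray();
        for (int i = 0; i < inverseBitMap.length; i++) {
            bitMap.put(inverseBitMap[i], i);
        }
        // ceil(log2(alphabet size)) in integer arithmetic, with a minimum of 1 bit.
        bits = Math.max(1, 32 - Integer.numberOfLeadingZeros(inverseBitMap.length - 1));
        length = sequence.length();
        packed = new byte[(int) (((long) length * bits + 7) / 8)];
        for (int i = 0; i < length; i++) {
            Integer code = bitMap.get(sequence.charAt(i));
            if (code == null) {
                throw new IllegalArgumentException("not in alphabet: " + sequence.charAt(i));
            }
            long bitPos = (long) i * bits;
            // Write the code most significant bit first.
            for (int b = bits - 1; b >= 0; b--, bitPos++) {
                if (((code >>> b) & 1) != 0) {
                    packed[(int) (bitPos >>> 3)] |= (byte) (0x80 >>> (int) (bitPos & 7));
                }
            }
        }
    }

    int bitsPerSymbol() { return bits; }
    int byteCount() { return packed.length; }
    int length() { return length; }

    /** Decodes one symbol on the fly; 'length' guards against trailing padding bits. */
    char symbolAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        long bitPos = (long) index * bits;
        int code = 0;
        for (int b = 0; b < bits; b++, bitPos++) {
            int bit = (packed[(int) (bitPos >>> 3)] >>> (7 - (int) (bitPos & 7))) & 1;
            code = (code << 1) | bit;
        }
        return inverseBitMap[code];
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(symbolAt(i));
        }
        return sb.toString();
    }
}
```

With the alphabet "ATCG" this gives 2 bits per symbol, and with the 26
letter alphabet it gives 5 bits, so 8 symbols occupy 5 bytes as described.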
> > >
> > > You could always return singleton Symbol objects if requested, by
> > > decoding the binary sequence on the fly, but you would no longer need
> > > to store the sequence using them.
> > >
> > > Is this worth considering for the big BioJava rewrite?
> > >
> > > Richard Holland
> > > Bioinformatics Specialist
> > > GIS extension 8199
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: mark.schreiber@group.novartis.com
> > > > [mailto:mark.schreiber@group.novartis.com]
> > > > Sent: Monday, January 24, 2005 4:37 PM
> > > > To: Thomas Down
> > > > Cc: biojava-list List; Richard HOLLAND;
> > > > "<baggott2@llnl.gov"@novartis.com
> > > > Subject: Re: [Biojava-l] reading nib sequence files
> > > >
> > > >
> > > > I'd need to brush up on my nio, and my c !
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > Thomas Down <td2@sanger.ac.uk>
> > > > 01/24/2005 04:34 PM
> > > >
> > > >
> > > >         To:     "Richard HOLLAND" <hollandr@gis.a-star.edu.sg>
> > > >         cc:     "<baggott2@llnl.gov>", biojava-list List
> > > > <biojava-l@biojava.org>, Mark
> > > > Schreiber/GP/Novartis@PH
> > > >         Subject:        Re: [Biojava-l] reading nib sequence files
> > > >
> > > >
> > > >
> > > > On 24 Jan 2005, at 02:48, Richard HOLLAND wrote:
> > > >
> > > > > It's a compressed binary format. I doubt BioJava would be able to read
> > > > > it without a lot of effort as the current parser framework is set up
> > > > > for text input only.
> > > >
> > > > Nib support probably wouldn't fit into the text-oriented parsing
> > > > framework, but I'm sure it could be supported somehow if there was
> > > > demand.  A quick google doesn't turn up any format documentation, but
> > > > Jim Kent's IO code is at:
> > > >
> > > >            http://www.soe.ucsc.edu/~kent/src/unzipped/lib/nib.c
> > > >
> > > > One interesting way to handle this might be to open the nib file as a
> > > > MappedByteBuffer, and back a SymbolList directly using that --
> > > > potentially giving us an efficient way of working with huge
> > > > sequences.  Any interest in that?
> > > >
> > > >            Thomas.
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> > >
> >
> >
> 
>


From mark.schreiber at group.novartis.com  Mon Jan 31 01:57:38 2005
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Mon Jan 31 01:53:48 2005
Subject: [Biojava-l] Validate Annotation vs Ontology
Message-ID: <OF964B9C68.7AF4369D-ON48256F9A.002614A5-48256F9A.00263D26@EU.novartis.net>

Hello -

I have an Ontology in a BioSQL DB and I would like to  validate an 
Annotation against the terms in that DB. Is there a way to create an 
AnnotationType from an Ontology?

- Mark