From gongwuming at gmail.com Mon Jan 3 06:45:39 2005
From: gongwuming at gmail.com (Wuming Gong)
Date: Mon Jan 3 06:42:46 2005
Subject: [Biojava-l] Is there a BioJava wrapper for Consensus
Message-ID: <24d6fd0505010303451496711f@mail.gmail.com>

Hi List,

I wonder whether there is already a BioJava class for the standalone
version of Consensus (Hertz, G. Z. and Stormo G. D., 1999,
Bioinformatics, 15, 563-577). If not, I want to write one. Could you
please recommend some tutorials or documents for writing such a wrapper
in BioJava?

Thanks.

Wuming

From mark.schreiber at group.novartis.com Thu Jan 6 21:43:13 2005
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Thu Jan 6 21:39:53 2005
Subject: [Biojava-l] Exception not being caught.
Message-ID:

I tend to agree. This probably shouldn't be an error. It should be
possible to recover from it.

Having said that, changing it might be a bit of a headache. You don't
need to catch Errors, but if we make it a checked Exception then old
code will no longer compile, because checked exceptions must be caught
or declared. A possible way around would be to make it a
RuntimeException, which you don't need to catch, but then you are pretty
much back to the situation of catching a Throwable, so there is no real
advantage. Also, RuntimeExceptions are pretty much reserved for things
that happen due to bad programming, such as NullPointerExceptions.

As a general rule BioJava shouldn't use Errors unless something truly
bad happens, such as not being able to locate a critical resource like
AlphabetManager.xml. That sort of thing would be very hard to recover
from and should be an error. Someone passing some crap sequence to a
parser shouldn't be an error.

- Mark

"Richard HOLLAND"
Sent by: biojava-l-bounces@portal.open-bio.org
12/29/2004 01:14 PM

To: "Michael Heuer"
cc: biojava-l@biojava.org, (bcc: Mark Schreiber/GP/Novartis)
Subject: RE: [Biojava-l] Exception not being caught.

Hi,

I resolved the exception by adding a trim() call to truncate whitespace
from the ends of the sequence before passing it to DNATools. There were
some weird trailing blank symbols, \0, \r, \n and the like. Fair enough
that it threw a wobbly when it encountered these.

However, I am sure there are situations where you would want to safely
know whether a sequence contained invalid characters (e.g. if accepting
free-text sequence information via a web interface). In this case, you
would want to catch the exception in the usual manner. Should this
particular BioError not be a plain normal BioException that people could
catch easily?

cheers,
Richard

Richard Holland
Bioinformatics Specialist
GIS extension 8199
---------------------------------------------
This email is confidential and may be privileged. If you are not the
intended recipient, please delete it and notify us immediately. Please
do not copy or use it for any purpose, or disclose its content to any
other person. Thank you.
---------------------------------------------

> -----Original Message-----
> From: Michael Heuer [mailto:heuermh@shell3.shore.net] On
> Behalf Of Michael Heuer
> Sent: Wednesday, December 29, 2004 12:25 PM
> To: Richard HOLLAND
> Cc: biojava-l@biojava.org
> Subject: Re: [Biojava-l] Exception not being caught.
>
> On Wed, 29 Dec 2004, Richard HOLLAND wrote:
>
> > I am getting an exception thrown by my code that never seems to get
> > caught. I am not sure if this is because of BioJava or because of a
> > lack of understanding of Exceptions on my part? The exception causes
> > the program to grind to an immediate halt. My method throws the
> > general Exception class, but the exception thrown by BioJava seems to
> > escape that detail and treats it as though my method were not
> > handling exceptions at all. I would expect the calling method, which
> > wraps the call in a try {} catch (Exception e) {} statement, to catch
> > it? But apparently not? Why not?!!
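
A minimal sketch of the behaviour asked about here, in plain Java with no BioJava on the classpath. ParserError is a hypothetical stand-in for an Error subclass such as BioError, and parse() is an invented toy parser, so this only illustrates the language rule, not BioJava's actual code: a catch (Exception e) clause does not see subclasses of Error, while catch (Throwable t) does.

```java
public class ErrorVersusException {

    /** Hypothetical stand-in for an Error subclass such as BioError. */
    public static class ParserError extends Error {
        public ParserError(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Toy parser (an assumption for this sketch): accepts a, c, g, t, n only. */
    public static void parse(String seq) {
        for (char c : seq.toCharArray()) {
            if ("acgtn".indexOf(Character.toLowerCase(c)) < 0) {
                // Wrap the underlying problem in an Error, as in the trace in this thread.
                throw new ParserError("Something has gone badly wrong",
                        new IllegalArgumentException("unexpected character code " + (int) c));
            }
        }
    }

    /** Reports which catch clause, if any, handled the failure. */
    public static String tryParse(String seq) {
        try {
            try {
                parse(seq);
                return "ok";
            } catch (Exception e) {
                // Never reached for ParserError: Error is not a subclass of Exception.
                return "caught as Exception";
            }
        } catch (Throwable t) {
            return "caught as Throwable: " + t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryParse("acgt"));            // ok
        System.out.println(tryParse("acgt\r\n"));        // caught as Throwable: ParserError
        System.out.println(tryParse("acgt\r\n".trim())); // ok
    }
}
```

The last call mirrors the trim() workaround described at the top of this message: String.trim() strips leading and trailing characters with code points up to and including the space character, which covers \0, \r and \n.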
> >
> > The method in BioJava I am using is DNATools.createDNASequence.
> >
> > Here is the exception:
> >
> > Exception in thread "main" org.biojava.bio.BioError: Something has gone
> > badly wrong with DNA
> >         at org.biojava.bio.seq.DNATools.createDNA(DNATools.java:158)
>
> Unfortunately BioError is not an exception, it is an Error.
>
> I believe you can catch them with
>
> try
> {
>   // ...
> }
> catch (Throwable t)
> {
>   // ...
> }
>
> but you probably shouldn't be. From the BioError javadoc:
>
>   For developers:
>   Throw this when something has gone wrong and in general people should
>   not be handling it.
>
> >         at org.biojava.bio.seq.DNATools.createDNASequence(DNATools.java:176)
> >         at gis.aads.pipeline.LibraryFastaBuilder.run(LibraryFastaBuilder.java:234)
> >         at gis.pipeline.Main.main(Main.java:125)
> > Caused by: org.biojava.bio.symbol.IllegalSymbolException: This
> > tokenization doesn't contain character: ''
> >         at org.biojava.bio.seq.io.CharacterTokenization.parseTokenChar(CharacterTokenization.java:175)
>
> This is the real problem: the parser doesn't know what to do with the
> character ''. I don't know exactly what that means, but does the string
> you pull out of the database clob look reasonable?
>
>    michael
>
> > And here is the method that calls it (or bits of it anyhow, and an
> > example calling method):
> >
> > public void doTheThing() {
> >     MyClass otherClass = new MyClass();
> >     try {
> >         int rc = otherClass.run();
> >         System.out.println("rc was "+rc);
> >     } catch (Exception e) {
> >         System.out.println("oops!");
> >     }
> > }
> >
> > public int run() throws Exception {
> >     ....
> >     // For each library, get all trimmed seqs.
> >     for (String lib : libs) {
> >         log.info("Processing library "+lib);
> >         ....
> >         // Get the sequences.
> >         seqq.execute(lib);
> >         rs = seqq.results();
> >
> >         // Log info.
> >         log.info("Processing fasta.");
> >         while (rs.next()) {
> >             // Get details.
> >             String seqID = rs.getString(1);
> >             char direction = UserSampleID.getDirection(seqID);
> >             Clob seqclob = rs.getClob(2);
> >             String seqstr =
> >                 seqclob.getSubString((long)1,(int)seqclob.length());
> >             if (seqstr.length()
> >
> >             // Create the sequence and format it into fasta.
> >             Sequence seq = DNATools.createDNASequence(seqstr, seqID);
> >             ByteArrayOutputStream baos = new ByteArrayOutputStream();
> >             SeqIOTools.writeFasta(baos,seq);
> >             baos.flush();
> >
> >             // For each seq, if reverse, add to reverse temp file.
> >             // Else, add to forward temp file.
> >             switch (direction) {
> >                 case 'R':
> >                     reverseWriter.write(baos.toString());
> >                     break;
> >                 case 'F':
> >                     forwardWriter.write(baos.toString());
> >                     break;
> >                 default:
> >                     log.warning("Unknown direction "+direction+" received for sequence "+seqID);
> >                     rc = PipelineApp.FAILURE;
> >                     continue;
> >             }
> >             ....
> >         }
> >         ....
> >     }
> >     ....
> > }
> >
> > I understand that the exception is thrown because of an invalid
> > sequence, but I don't understand why it isn't being caught.
> >
> > Richard Holland
> > Bioinformatics Specialist
> > GIS extension 8199
> >
> > _______________________________________________
> > Biojava-l mailing list - Biojava-l@biojava.org
> > http://biojava.org/mailman/listinfo/biojava-l

From mark.schreiber at group.novartis.com Thu Jan 6 21:50:17 2005
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Thu Jan 6 21:46:52 2005
Subject: [Biojava-l] CVS
Message-ID:

Take a look at: http://cvs.biojava.org/

Felipe Albrecht
Sent by: biojava-l-bounces@portal.open-bio.org
12/29/2004 02:06 PM
Please respond to Felipe Albrecht

To: biojava-l@biojava.org
cc: (bcc: Mark Schreiber/GP/Novartis)
Subject: [Biojava-l] CVS

Hello,

I'm a computer science student and I intend to study molecular biology
with bioinformatics. I have good skills with the Java platform and I
love to code and to study molecular biology.

I want to work with BioJava and, for now, to understand BioJava better,
I am thinking of helping with BioJava's upgrade from Java 1.4 to 1.5.
For this, I need to know what the CVS project directory is. Is the
repository ":pserver:cvs@cvs.open-bio.org:/home/repository/biojava"?
And which project contains the most current (and unstable) code?

Thanks,
Felipe Albrecht

From mark.schreiber at group.novartis.com Thu Jan 6 21:54:11 2005
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Thu Jan 6 21:50:43 2005
Subject: [Biojava-l] Blast SAX parser output
Message-ID:

Use a different SearchContentHandler (or override BlastHitSummaryWriter)
to write to a file instead of STDOUT.

Or you could redirect STDOUT to a file (are you allowed to do that with
a cron job??)
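
The second suggestion can also be tried from inside the program rather than in the crontab entry. A minimal sketch, assuming the progress lines really are written through System.out (the thread does not confirm which stream is used); StdoutRedirect and runWithStdout are hypothetical names for this illustration, and the Runnable stands in for the breader.parse(is) call.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;

public class StdoutRedirect {

    /**
     * Runs the task with System.out temporarily pointed at the given sink,
     * then restores the original stream even if the task throws.
     */
    public static void runWithStdout(OutputStream sink, Runnable task) {
        PrintStream original = System.out;
        PrintStream replacement = new PrintStream(sink, true);
        System.setOut(replacement);
        try {
            task.run();
        } finally {
            replacement.flush();
            System.setOut(original);
        }
    }

    public static void main(String[] args) {
        // Capture in memory here; a FileOutputStream would send the text to a file instead.
        ByteArrayOutputStream captured = new ByteArrayOutputStream();
        runWithStdout(captured, () -> System.out.println("obj=score 317"));
        System.out.println("captured " + captured.size() + " bytes of progress output");
    }
}
```

Passing a java.io.FileOutputStream as the sink would collect the progress output in a log file. System.setOut changes the stream for the whole JVM, so this is only reasonable when nothing else needs to print while the parse is running.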

"Richard HOLLAND"
Sent by: biojava-l-bounces@portal.open-bio.org
12/30/2004 11:54 AM

To:
cc: (bcc: Mark Schreiber/GP/Novartis)
Subject: [Biojava-l] Blast SAX parser output

Is there any way to stop the blast parser code from outputting progress?
I get lots of the following and it's clogging up my unix mailbox, as the
job is run through cron:

obj=score 317
obj=expectValue 7e-86
obj=numberOfIdentities 158
obj=alignmentSize 160
obj=percentageIdentity 98
obj=numberOfPositives 159
obj=numberOfPositives 159
obj=queryFrame plus2
obj=querySequenceStart 29
obj=querySequenceEnd 508
obj=querySequence DKHWMPVTKLGRLVKDMKIKSLEEIYLFSLPIKESEIIDFFLGASLKD
EVLKIMPVQKQTRAGQRTRFKAFVAIGDYNGHVGLGVKCSKEVATAIRGAIILAKLSIVPVRRGYWGNKIGKPHTVPCKV
TGRCGSVLVRLIPAPRGTGIVSAPVPKKLLMM
obj=subjectSequenceStart 31
obj=subjectSequenceEnd 190
obj=subjectSequence DKEWIPVTKLGRLVKDMKIKSLEEIYLFSLPIKESEIIDFFLGASLKD
EVLKIMPVQKQTRAGQRTRFKAFVAIGDYNGHVGLGVKCSKEVATAIRGAIILAKLSIVPVRRGYWGNKIGKPHTVPCKV
TGRCGSVLVRLIPAPRGTGIVSAPVPKKLLMM
....

The code producing this is:

File parsedBlast = safe.tempfile();
SearchContentHandler handler = new BlastHitSummaryWriter(new BufferedWriter(new FileWriter(parsedBlast)));
SeqSimilarityAdapter adapter = new SeqSimilarityAdapter();
adapter.setSearchContentHandler(handler);
BlastLikeSAXParser breader = new BlastLikeSAXParser();
breader.setModeLazy();
InputSource is = new InputSource(new FileReader(blast));
breader.setContentHandler(adapter);
breader.parse(is);

cheers,
Richard

Richard Holland
Bioinformatics Specialist
GIS extension 8199

From wim.glassee at ua.ac.be Fri Jan 7 10:01:34 2005
From: wim.glassee at ua.ac.be (Wim Glassee)
Date: Fri Jan 7 09:58:14 2005
Subject: [Biojava-l] BioJava 1.5
Message-ID: <41DEA44E.7080706@ua.ac.be>

Hi all,

anybody have any idea if and when a biojava 1.5 is coming? A
java.sun.com article stated late 2004, early 2005. I would personally be
interested in a build on top of the 1.5 codebase. Partly because of the
internal xml library.

Thanks,
Wim

From mark.schreiber at group.novartis.com Sun Jan 9 20:03:05 2005
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Sun Jan 9 19:59:36 2005
Subject: [Biojava-l] BioJava 1.5
Message-ID:

I'd like to see biojava 1.4 come out first!!

Seriously though, I'm assuming you mean a version that uses java 1.5
(java 5). I think Matthew Pocock is working on something; I think it's
called BioJava2 and represents a major redesign (in which case it
probably shouldn't be called biojava, but that's just semantics). Not
too sure what the website is.

Having said all that, I think there is no reason why we cannot start
using java 1.5 in the standard old biojava (probably after biojava 1.4
is finalised), as long as most of the community is happy with that
proposal.

- Mark

From ghhu at info.biology.mcmaster.ca Mon Jan 10 10:58:59 2005
From: ghhu at info.biology.mcmaster.ca (Guanhong Hu)
Date: Mon Jan 10 10:55:25 2005
Subject: [Biojava-l] can use biojava in JApplet?
Message-ID:

Hi,

I am new to biojava. I developed a small java application using some
biojava modules. It runs O.K. But when I convert this application to a
JApplet, it compiles O.K., but when I use appletviewer to test it, it
gives the following error message:

java.lang.NoClassDefFoundError: org/biojava/bio/symbol/IllegalSymbolException
        at java.lang.Class.getDeclaredConstructors0(Native Method)
        at java.lang.Class.privateGetDeclaredConstructors(Class.java:1610)
        at java.lang.Class.getConstructor0(Class.java:1922)
        at java.lang.Class.newInstance0(Class.java:278)
        at java.lang.Class.newInstance(Class.java:261)
        at sun.applet.AppletPanel.createApplet(AppletPanel.java:617)
        at sun.applet.AppletPanel.runLoader(AppletPanel.java:546)
        at sun.applet.AppletPanel.run(AppletPanel.java:298)
        at java.lang.Thread.run(Thread.java:534)

and the applet is not initialized.

I don't know what's wrong. Can biojava be used in a JApplet???

Any reply is greatly appreciated.

Thanks
Guanhong

From hollandr at gis.a-star.edu.sg Mon Jan 10 20:07:29 2005
From: hollandr at gis.a-star.edu.sg (Richard HOLLAND)
Date: Mon Jan 10 20:05:10 2005
Subject: [Biojava-l] can use biojava in JApplet?
Message-ID: <6D9E9B9DF347EF4385F6271C64FB8D560140EC63@BIONIC.biopolis.one-north.com>

Sounds like a classpath problem. You need to include the biojava.jar
file either in the same folder as the applet, or in the global java
resources (something like /usr/java/lib/ext) on the server the applet is
hosted on. Just using individual class files will not work as there are
many interdependencies.
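
One way to confirm the classpath diagnosis before changing the deployment is to ask the class loader directly. A minimal sketch using only the standard Class.forName call; ClasspathCheck and isLoadable are hypothetical names for this illustration, and the only BioJava-specific detail is the class name taken from the error message.

```java
public class ClasspathCheck {

    /**
     * Returns true if the named class can be found by the loader that loaded
     * this class. The class is located but not initialized.
     */
    public static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // NoClassDefFoundError is a LinkageError, so both failure modes land here.
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "org.biojava.bio.symbol.IllegalSymbolException";
        System.out.println(name + (isLoadable(name)
                ? " was found"
                : " was NOT found; check that biojava.jar is on the classpath"));
    }
}
```

Run with and without biojava.jar on the classpath, the message should flip between the two cases; the same check could be called at the start of the applet's init() to see what the applet's own class loader can reach.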

Richard Holland
Bioinformatics Specialist
GIS extension 8199

From e.willighagen at science.ru.nl Tue Jan 11 01:52:31 2005
From: e.willighagen at science.ru.nl (Egon Willighagen)
Date: Tue Jan 11 01:49:10 2005
Subject: [Biojava-l] 3D viewing?
Message-ID: <200501110752.31210.e.willighagen@science.ru.nl>

Hi all,

I've been monitoring this list for a few weeks now, but am a bit puzzled
by the project's nature. There does not seem to be that much activity...
how actively is it used and developed?

Anyway, what I wanted to ask is this. The website news mentions
org.biojava.bio.structure for holding 3D coordinates (though it's
missing from the JavaDoc on the website), so I was wondering what you
use for 3D display, and would like to discuss the option of using
Jmol [1] for 3D rendering of proteins.

Egon

1. http://www.jmol.org/

--
e.willighagen@science.ru.nl
PhD student on Molecular Representation in Chemometrics
Radboud University Nijmegen
http://www.cac.science.ru.nl/people/egonw/
GPG: 1024D/D6336BA6

From ap3 at sanger.ac.uk Tue Jan 11 04:46:29 2005
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Tue Jan 11 04:42:25 2005
Subject: [Biojava-l] 3D viewing?
In-Reply-To: <200501110752.31210.e.willighagen@science.ru.nl>
References: <200501110752.31210.e.willighagen@science.ru.nl>
Message-ID:

Hi Egon!

I am the person who contributed the biojava structure classes. As it
turns out, I have been a very happy Jmol user for quite a while now! :-)

Using both Biojava and Jmol, I am working on a 3D DAS (distributed
annotation system) client to visualize annotations of proteins in both
sequence and structure.

http://www.efamily.org.uk/software/dasclients/spice/

To interact between Biojava and Jmol I use the Biojava code to create a
PDB file and use this file as an input to Jmol.

Some documentation on how to incorporate Jmol into another application
can be found at
http://wiki.jmol.org/ApplicationsEmbeddingJmol

> The website news mentions
> org.biojava.bio.structure for holding 3D coordinates (though it's
> missing from the JavaDoc on the website),

As far as I know the JavaDoc relates to the last public release, which
did not contain the structure classes. Guess it is really time to make a
new biojava release!

Greetings,
Andreas

-----------------------------------------------------------------------
Andreas Prlic
Wellcome Trust Sanger Institute
Hinxton, Cambridge CB10 1SA, UK

From tmo at ebi.ac.uk Tue Jan 11 05:49:52 2005
From: tmo at ebi.ac.uk (Tom Oinn)
Date: Tue Jan 11 05:46:36 2005
Subject: [Biojava-l] 3D viewing?
In-Reply-To:
References: <200501110752.31210.e.willighagen@science.ru.nl>
Message-ID: <41E3AF50.2020803@ebi.ac.uk>

Andreas Prlic wrote:
> Hi Egon!
>
> I am the person who contributed the biojava - structure classes. As it
> turns out I am a very happy Jmol user already for quite a while now! :-)

Seconded - Jmol is very cool and surprisingly easy to integrate if you
have PDB format data lying around somewhere in your code. I believe it
supports other structure formats as well, although I've never used them.

Cheers,
Tom

From e.willighagen at science.ru.nl Tue Jan 11 06:24:40 2005
From: e.willighagen at science.ru.nl (Egon Willighagen)
Date: Tue Jan 11 06:30:08 2005
Subject: [Biojava-l] 3D viewing?
In-Reply-To: <41E3AF50.2020803@ebi.ac.uk>
References: <200501110752.31210.e.willighagen@science.ru.nl> <41E3AF50.2020803@ebi.ac.uk>
Message-ID: <200501111224.40229.e.willighagen@science.ru.nl>

On Tuesday 11 January 2005 11:49 am, Tom Oinn wrote:
> Andreas Prlic wrote:
> > I am the person who contributed the biojava - structure classes.
> > As it turns out I am a very happy Jmol user already for quite a
> > while now! :-)
>
> Seconded - Jmol is very cool and surprisingly easy to integrate if you
> have PDB format data lying around somewhere in your code. I believe it
> supports other structure formats as well although I've never used them.

Hi Tom,

I'm actually one of the Jmol developers (though we should thank Miguel
for the excellent 3D rendering of proteins), but was actually referring
to a tighter integration of BioJava and Jmol...

On http://wiki.jmol.org/JmolCdkIntegration there is this code fragment:

===================================
public void viewCDKModel(ChemFile aCDKChemFileObject) {
    JmolPanel jmolPanel = new JmolPanel();
    contentPane.add(jmolPanel);
    JmolViewer viewer = jmolPanel.getViewer();
    viewer.openClientFile("", "", aCDKChemFileObject);
}

class JmolPanel extends JPanel {
    JmolViewer viewer;
    JmolAdapter adapter;
    JmolPanel() {
        // use CDK IO
        adapter = new CdkJmolAdapter(null);
        viewer = new JmolViewer(this, adapter);
    }
}
===================================

And I was wondering about a JmolAdapter implementation for BioJava...
So that it would be possible to do:

===================================
public void viewBioJavaModel(Sequence sequence) {
    JmolPanel jmolPanel = new JmolPanel();
    contentPane.add(jmolPanel);
    JmolViewer viewer = jmolPanel.getViewer();
    viewer.openClientFile("", "", sequence);
}

class JmolPanel extends JPanel {
    JmolViewer viewer;
    JmolAdapter adapter;
    JmolPanel() {
        // use the proposed BioJava adapter instead of CDK IO
        adapter = new BioJavaJmolAdapter(null);
        viewer = new JmolViewer(this, adapter);
    }
}
===================================

Or some BioJava class instead of Sequence...

This would remove the serialization to PDB, and allow new
possibilities... for which we might need to extend the model adapter,
but that's no issue...

Egon

From ola.spjuth at farmbio.uu.se Tue Jan 11 06:52:38 2005
From: ola.spjuth at farmbio.uu.se (Ola Spjuth)
Date: Tue Jan 11 06:45:59 2005
Subject: [Biojava-l] Biojava DB
Message-ID: <1105444357.3095.37.camel@zidane>

Hello,

I am thinking of using BioJava for managing sequences and annotations of
sequences. BioJava seems to have database support for storing sequences
in relational databases. How developed is the support for storing
annotated sequences? Is there any documentation (other than JavaDoc) on
working with sequences and annotations, and database persistence of
these?

Best regards

.../Ola Spjuth

From tmo at ebi.ac.uk Tue Jan 11 07:17:43 2005
From: tmo at ebi.ac.uk (Tom Oinn)
Date: Tue Jan 11 07:15:09 2005
Subject: [Biojava-l] 3D viewing?
In-Reply-To: <200501111224.40229.e.willighagen@science.ru.nl>
References: <200501110752.31210.e.willighagen@science.ru.nl> <41E3AF50.2020803@ebi.ac.uk> <200501111224.40229.e.willighagen@science.ru.nl>
Message-ID: <41E3C3E7.9070305@ebi.ac.uk>

Egon Willighagen wrote:
> On Tuesday 11 January 2005 11:49 am, Tom Oinn wrote:
>
>>Andreas Prlic wrote:
>>
>>>I am the person who contributed the biojava - structure classes. As it
>>>turns out I am a very happy Jmol user already for quite a while now! :-)
>>
>>Seconded - Jmol is very cool and surprisingly easy to integrate if you
>>have PDB format data lying around somewhere in your code. I believe it
>>supports other structure formats as well although I've never used them.
>
> Hi Tom,
>
> I'm actually one of the Jmol developers (though we should thank Miguel
> for the excellent 3D rendering of proteins), but was actually referring
> to a tighter integration of BioJava and Jmol...

Hi Egon,

Although I agree that in principle this is a reasonable thing to do, I'm
not convinced that it's worth it in this case - is there any information
loss in converting to PDB format then reading back into JMol?
If not then I'd leave it at that; there would be relatively few gains
from being able to do so directly.

One thing that I believe systems like Spice take advantage of is the
ability to include scripting instructions in the input to JMol; I
haven't looked at the biojava 3d classes in detail but I would have
thought they'd lack this functionality - it's nothing to do with
representing the structure fundamentally. If you were to implement a
direct Jmol view over the biojava classes you'd have to have some way of
duplicating this functionality.

Having a relatively standard intermediate representation such as PDB
format flatfiles is generally a good thing and makes it easier to link
components in a loosely coupled fashion, at the expense potentially of
some efficiency. My thoughts are that the efficiency is no big deal in
this case and that the convenience of the intermediate representation
(not to mention that it already exists and works!) makes it the
preferred option.

Cheers,
Tom

From e.willighagen at science.ru.nl Tue Jan 11 07:27:37 2005
From: e.willighagen at science.ru.nl (Egon Willighagen)
Date: Tue Jan 11 07:29:26 2005
Subject: [Biojava-l] 3D viewing?
In-Reply-To: <41E3C3E7.9070305@ebi.ac.uk>
References: <200501110752.31210.e.willighagen@science.ru.nl> <200501111224.40229.e.willighagen@science.ru.nl> <41E3C3E7.9070305@ebi.ac.uk>
Message-ID: <200501111327.37971.e.willighagen@science.ru.nl>

On Tuesday 11 January 2005 01:17 pm, Tom Oinn wrote:
> Having a relatively standard intermediate representation such as PDB
> format flatfiles is generally a good thing and makes it easier to link
> components in a loosely coupled fashion at the expense potentially of
> some efficiency. My thoughts are that the efficiency is no big deal in
> this case and that the convenience of the intermediate representation
> (not to mention that it already exists and works!) makes it the
> preferred option.

True :)

Egon

From ap3 at sanger.ac.uk Tue Jan 11 08:03:15 2005
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Tue Jan 11 07:59:08 2005
Subject: [Biojava-l] 3D viewing?
In-Reply-To: <200501111144.18889.e.willighagen@science.ru.nl>
References: <200501110752.31210.e.willighagen@science.ru.nl> <200501111144.18889.e.willighagen@science.ru.nl>
Message-ID: <28444FBA-63D1-11D9-84AA-001124313E58@sanger.ac.uk>

Hi Egon!

What would be the advantages of using a BioJavaModel Adapter?

* faster loading of the structure into Jmol?
* lower memory consumption of the application?
* if you rotate the structure in Jmol, are the rotated coordinates
  available through the biojava structure object?

How about performance of the Jmol display, would it be as fast?

Cheers,
Andreas

-----------------------------------------------------------------------
Andreas Prlic
Wellcome Trust Sanger Institute
Hinxton, Cambridge CB10 1SA, UK

From e.willighagen at science.ru.nl Tue Jan 11 08:43:43 2005
From: e.willighagen at science.ru.nl (Egon Willighagen)
Date: Tue Jan 11 08:40:35 2005
Subject: [Biojava-l] 3D viewing?
In-Reply-To: <28444FBA-63D1-11D9-84AA-001124313E58@sanger.ac.uk>
References: <200501110752.31210.e.willighagen@science.ru.nl> <200501111144.18889.e.willighagen@science.ru.nl> <28444FBA-63D1-11D9-84AA-001124313E58@sanger.ac.uk>
Message-ID: <200501111443.43297.e.willighagen@science.ru.nl>

On Tuesday 11 January 2005 02:03 pm, Andreas Prlic wrote:
> Hi Egon!
>
> What would be the advantages of using a BioJavaModel Adapter ?
>
> * faster loading of structure into Jmol ?
> * lower memory consumption of application ?

Yes, I think so, because one does not need to make a large String first
(or a File if that is used as intermediate) for the PDB serialization...

> * if you rotate the structure in Jmol, the rotated coordinates are
> available through the biojava structure object ?
>
> how about performance of the Jmol display, would it be as fast?

The Jmol viewer still copies things into private data classes, so the
rendering speed is identical. Because of this copying, I'm now realizing
that it might not get updated when the original data object is
changed... so, it might not be that useful actually...

Egon

From msouthern at exsar.com Tue Jan 11 09:48:36 2005
From: msouthern at exsar.com (Mark Southern)
Date: Tue Jan 11 09:44:57 2005
Subject: [Biojava-l] SequencePanel changes between 1.3 and 1.4pre1
In-Reply-To: <200501111341.j0BDfEKt007588@portal.open-bio.org>
Message-ID: <2C879FB52902524C85E8B616CF276F1A1CC53F@cartasrv.carta.local>

I've just started looking at 1.4pre1 (sorry to take so long!).

I am using SequencePanels to view protein sequences. I add multiple
SequencePanels (each having the same sequence) to a Container with a
BoxLayout. Each displays a different RangeLocation, with the overall
effect that the sequence appears to wrap over many lines. The credit for
this idea goes to Matthew Pocock and it was working very nicely :-)

It breaks in 1.4pre1, however, and the change that causes this is that
the SequencePanel's position within its Container ( getX() and getY() )
is added to the Graphics2D translation in the SequencePanel's
paintComponent() method. The SequencePanel is then painted in an unseen
part of its Container.

i.e. from;

g2.translate(leadingBorder.getSize() - minAcross + insets.left, insets.top);

to;

g2.translate(leadingBorder.getSize() - minAcross + insets.left + getX(), insets.top + getY());

I can imagine a SequencePanel being used in a Container with a
BorderLayout or BoxLayout, but I can't imagine it being arranged at a
coordinate other than (0,0).

Can someone please explain the rationale for the 1.4pre1 way of doing
things, or can we switch back to what 1.3 did?

Cheers,

Mark.

From heuermh at acm.org Tue Jan 11 17:39:42 2005
From: heuermh at acm.org (Michael Heuer)
Date: Tue Jan 11 17:35:43 2005
Subject: [Biojava-l] biosql support in bioperl?
In-Reply-To: <200501111443.43297.e.willighagen@science.ru.nl>
Message-ID:

I'm afraid to ask this on the bioperl list for fear of getting flamed,
but I was always under the assumption that bioperl could read and write
from a biosql schema, like I have been doing with biojava for some time.

The bioperl HOWTO docs read

  Important
  Support for the biosql protocol is disabled as of Bioperl version
  1.4. We hope to remedy this in a subsequent release.

and the module referred to in Bio::DB::Registry

  'biosql' => 'Bio::DB::BioSQL::BioDatabaseAdaptor'

doesn't seem to exist at all.

Does anyone know, do I need to use an earlier version, or am I missing
something obvious? (please don't hurt me :)

   michael

From mark.schreiber at group.novartis.com Tue Jan 11 19:48:01 2005
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Tue Jan 11 19:44:33 2005
Subject: [Biojava-l] Biojava DB
Message-ID:

Hi -

You would probably want to use something like BioSQL. There is some
documentation under the OBDA section of this page
http://www.biojava.org/docs/bj_in_anger/index.htm

There are some people on the list who are actively working with this.
You would probably also want to look at the BioSQL webpages and mailing
list.

- Mark

From Anna.Henricson at cgb.ki.se Thu Jan 13 03:40:05 2005
From: Anna.Henricson at cgb.ki.se (Anna Henricson)
Date: Thu Jan 13 03:36:40 2005
Subject: [Biojava-l] IllegalArgumentException when parsing an embl file
Message-ID:

Hi,

I'm parsing the feature table in an embl file to retrieve the
information under feature key CDS. For instance, I am calculating the
number of exons, the length of the exons, retrieving the protein
sequence and id etc.

Sometimes an IllegalArgumentException is thrown by the code

sequence = seqIterator.nextSequence(); //(see below in this email)

I guess there is some problem in the embl file with the Location, so
that the Sequence cannot be instantiated, and as a result these
sequences will not be present in my resulting output file. Why is this
exception thrown and is there any way to avoid or handle this problem?

Please bear in mind that I am new to BioJava and therefore would greatly
appreciate a more detailed explanation.

Thanks!
/Anna

The code and the exceptions that are thrown are as follows:

....
private Sequence sequence;
....
SequenceIterator seqIterator = SeqIOTools.readEmbl(bufferedReader);
while (seqIterator.hasNext()) {
    try {
        sequence = seqIterator.nextSequence();
    } catch (BioException e) {
        e.printStackTrace();
    } catch (NoSuchElementException e) {
        e.printStackTrace();
    }
    ....

java.lang.IllegalArgumentException: Location [1045891,1046196] is outside 1..1000000
        at org.biojava.bio.seq.impl.SimpleFeature.<init>(SimpleFeature.java:306)
        at org.biojava.bio.seq.impl.SimpleStrandedFeature.<init>(SimpleStrandedFeature.java:74)
        at sun.reflect.GeneratedConstructorAccessor1.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:274)
        at org.biojava.bio.seq.SimpleFeatureRealizer$TemplateImpl.realize(SimpleFeatureRealizer.java:138)
rethrown as org.biojava.bio.BioException: Couldn't realize feature
        at org.biojava.bio.seq.SimpleFeatureRealizer$TemplateImpl.realize(SimpleFeatureRealizer.java:144)
        at org.biojava.bio.seq.SimpleFeatureRealizer.realizeFeature(SimpleFeatureRealizer.java:94)
        at org.biojava.bio.seq.impl.SimpleSequence.realizeFeature(SimpleSequence.java:198)
        at org.biojava.bio.seq.impl.SimpleSequence.createFeature(SimpleSequence.java:204)
        at org.biojava.bio.seq.io.SequenceBuilderBase.makeSequence(SequenceBuilderBase.java:168)
        at org.biojava.bio.seq.io.SmartSequenceBuilder.makeSequence(SmartSequenceBuilder.java:87)
        at org.biojava.bio.seq.io.SequenceBuilderFilter.makeSequence(SequenceBuilderFilter.java:98)
        at org.biojava.bio.seq.io.StreamReader.nextSequence(StreamReader.java:101)
        at EmblFileParser.<init>(EmblFileParser.java:34)
        at EmblToExintFormat.main(EmblToExintFormat.java:57)

--------------------------------------------
Anna Henricson, MSc, PhD student
Center for Genomics and Bioinformatics (CGB)
Karolinska Institutet
S-171 77 Stockholm
Sweden
Phone: +46 (0)8 524 87296
Fax: +46 (0)8 337983

From mark.schreiber at group.novartis.com Thu Jan 13 04:00:54 2005
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Thu Jan 13 03:57:20 2005
Subject: [Biojava-l] IllegalArgumentException when parsing an embl file
Message-ID:

Hi Anna -

It seems the problem may be with the EMBL file.

java.lang.IllegalArgumentException: Location [1045891,1046196] is
outside 1..1000000

This part indicates BioJava is trying to make a Feature with the
Location [1045891,1046196], which is outside the bounds of the available
sequence.
It is not allowed to create features outside of the available Sequence. Apparently BioJava is able to find 1000000 bases. Is there actually more sequence than this in the EMBL file?

There are two possible solutions (unless you can get a full version of the EMBL file)...

1) If you are only interested in comparing Feature Locations, you could use a DummySequence of the appropriate length.

2) You could write a custom org.biojava.bio.seq.io.SeqIOFilter which overrides the startFeature method and only passes the Feature.Template to the delegate if the Location in that Feature.Template is inside the valid range.

If you want to use option 2 and need more help, then post to the list.

- Mark

Mark Schreiber
Principal Scientist (Bioinformatics)

Novartis Institute for Tropical Diseases (NITD)
10 Biopolis Road
#05-01 Chromos
Singapore 138670
www.nitd.novartis.com

phone +65 6722 2973
fax +65 6722 2910

"Anna Henricson"
Sent by: biojava-l-bounces@portal.open-bio.org
01/13/2005 04:40 PM

To: "Biojava"
cc: (bcc: Mark Schreiber/GP/Novartis)
Subject: [Biojava-l] IllegalArgumentException when parsing an embl file

Hi,

I'm parsing the feature table in an embl file to retrieve the information under feature key CDS. For instance, I am calculating the number of exons, the length of the exons, retrieving the protein sequence and id etc.

Sometimes an IllegalArgumentException is thrown by the code

    sequence = seqIterator.nextSequence(); //(see below in this email)

I guess there is some problem in the embl file with the Location, so that the Sequence cannot be instantiated, and as a result these sequences will not be present in my resulting output file. Why is this exception thrown and is there any way to avoid or handle this problem?

Please bear in mind that I am new to BioJava and therefore would greatly appreciate a more detailed explanation.

Thanks!
/Anna

The code and the exceptions that are thrown are as follows:

    ....
    private Sequence sequence;
    ....
    SequenceIterator seqIterator = SeqIOTools.readEmbl(bufferedReader);
    while (seqIterator.hasNext()) {
        try {
            sequence = seqIterator.nextSequence();
        } catch (BioException e) {
            e.printStackTrace();
        } catch (NoSuchElementException e) {
            e.printStackTrace();
        }
    ....

java.lang.IllegalArgumentException: Location [1045891,1046196] is outside 1..1000000
    at org.biojava.bio.seq.impl.SimpleFeature.<init>(SimpleFeature.java:306)
    at org.biojava.bio.seq.impl.SimpleStrandedFeature.<init>(SimpleStrandedFeature.java:74)
    at sun.reflect.GeneratedConstructorAccessor1.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:274)
    at org.biojava.bio.seq.SimpleFeatureRealizer$TemplateImpl.realize(SimpleFeatureRealizer.java:138)
rethrown as org.biojava.bio.BioException: Couldn't realize feature
    at org.biojava.bio.seq.SimpleFeatureRealizer$TemplateImpl.realize(SimpleFeatureRealizer.java:144)
    at org.biojava.bio.seq.SimpleFeatureRealizer.realizeFeature(SimpleFeatureRealizer.java:94)
    at org.biojava.bio.seq.impl.SimpleSequence.realizeFeature(SimpleSequence.java:198)
    at org.biojava.bio.seq.impl.SimpleSequence.createFeature(SimpleSequence.java:204)
    at org.biojava.bio.seq.io.SequenceBuilderBase.makeSequence(SequenceBuilderBase.java:168)
    at org.biojava.bio.seq.io.SmartSequenceBuilder.makeSequence(SmartSequenceBuilder.java:87)
    at org.biojava.bio.seq.io.SequenceBuilderFilter.makeSequence(SequenceBuilderFilter.java:98)
    at org.biojava.bio.seq.io.StreamReader.nextSequence(StreamReader.java:101)
    at EmblFileParser.<init>(EmblFileParser.java:34)
    at EmblToExintFormat.main(EmblToExintFormat.java:57)

--------------------------------------------
Anna Henricson, MSc, PhD student
Center for Genomics and Bioinformatics (CGB)
Karolinska Institutet
S-171 77 Stockholm
Sweden
Phone: +46 (0)8 524 87296
Fax: +46 (0)8 337983
_______________________________________________
Biojava-l mailing list - Biojava-l@biojava.org
http://biojava.org/mailman/listinfo/biojava-l

From kalle.naslund at genpat.uu.se Thu Jan 13 05:06:17 2005
From: kalle.naslund at genpat.uu.se (Kalle Näslund)
Date: Thu Jan 13 05:04:00 2005
Subject: [Biojava-l] IllegalArgumentException when parsing an embl file
In-Reply-To:
References:
Message-ID: <41E64819.7020100@genpat.uu.se>

mark.schreiber@group.novartis.com wrote:

>Hi Anna -
>
>It seems the problem may be with the EMBL file.
>
>java.lang.IllegalArgumentException: Location [1045891,1046196] is outside
>1..1000000
>
>This part indicates BioJava is trying to make a Feature with the Location [1045891,1046196] which is outside the bounds of the available sequence. It is not allowed
>to create features outside of the available Sequence. Apparently BioJava
>is able to find 1000000 bases. Is there actually more sequence than this in the EMBL file?
>

Just a quick observation that might be relevant: it seems that the only place in SimpleFeature that throws an IllegalArgumentException due to the range being outside the sequence has been commented out for some time, with the cvs comment "Various modifications to make life easier" (SimpleFeature rev 1.22). So this problem seems to have occurred for more people, and a quick and dirty solution was found.
So I guess, if Anna is aware of the consequences (things might go bad if you want to use these Features), a solution might be to try an up-to-date biojava cvs build.

mvh Kalle

From piroska.devay at pharma.novartis.com Mon Jan 17 10:53:46 2005
From: piroska.devay at pharma.novartis.com (piroska.devay@pharma.novartis.com)
Date: Mon Jan 17 10:49:58 2005
Subject: [Biojava-l] (no subject)
Message-ID:

Dear All,

I am new to biojava and unfortunately new to Java also. When parsing Fasta output, I was able to modify the FastaSearchSAXParser to return the parsed data on the standard output. In the Fasta output the query-hit alignments are not returned; instead the query sequence and the subject sequence are returned separately. If the sequences were shifted by Fasta for matching, '-' symbols are inserted

(-----------------------QVQLQQSGNELAKPGASMKMSCRASGYSFTSYWIHWLKQRPDQGLEWIGYIDPATAYTESNQKFKDKAILTADRS)

I would like to align these sequence strings. I would simply have 2 strings as input, or have them converted into SymbolLists. I can't seem to find the right class to do this. Could someone offer advice or refer me to some sample programs that I can browse through, or a more detailed tutorial?

Thanks very much,
Piroska

From mark.schreiber at group.novartis.com Mon Jan 17 21:34:56 2005
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Mon Jan 17 21:31:12 2005
Subject: [Biojava-l] (no subject)
Message-ID:

Hi -

I'm not too clear on what you are trying to do, but you may find the Blast and Fasta tutorials on this page helpful:

http://www.biojava.org/docs/bj_in_anger/index.htm

Much of the material in the blast tutorials is relevant to the Fasta parsers.

From my understanding of your email, you are not getting some of the information you want from the standard classes. You should know that the SearchContentAdapters provided with BioJava do not capture every detail of a search, only the bits we thought were most interesting.
If you need to get more (or less) information you will need to make a custom SearchContentAdapter (usually you just extend the SearchContentAdapter and override some or all of the methods). Particularly you may want to look at http://www.biojava.org/docs/bj_in_anger/blastecho.htm which shows how to echo events from a BlastLikeSAXParser to STDOUT. It should be very easy to change this to echo for a FastaSearchSAXParser. Running this program will help you determine where the things you are looking for may end up and help you decide if and how you need to make a custom SearchContentAdapter to get the information you want. Hope this helps, Mark Mark Schreiber Principal Scientist (Bioinformatics) Novartis Institute for Tropical Diseases (NITD) 10 Biopolis Road #05-01 Chromos Singapore 138670 www.nitd.novartis.com phone +65 6722 2973 fax +65 6722 2910 Piroska Devay/PH/Novartis@PH Sent by: biojava-l-bounces@portal.open-bio.org 01/17/2005 11:53 PM To: biojava-l@biojava.org cc: (bcc: Mark Schreiber/GP/Novartis) Subject: [Biojava-l] (no subject) Dear All, I am new to biojava and unfortunately new to Java also. Parsing a Fasta output I could modify the FastaSearchSAX Parser to return the parsed data on the standard output. In the Fasta output the query-hit alignments are not returned, instead the query sequence and the subject sequence are returned separately. If the sequences were shifted by Fasta for matching, '-' symbols are inserted (-----------------------QVQLQQSGNELAKPGASMKMSCRASGYSFTSYWIHWLKQRPDQGLEWIGYIDPATAYTESNQKFKDKAILTADRS) I would like to align these sequence-strings. I simply would have 2 strings as an input or converted into SymbolLists. I don't seem to find the right class to do this. Could someone offer advice or refer me to some sample programs that I can browse through or a more detailed tutorial? 
Thanks very much, Piroska _______________________________________________ Biojava-l mailing list - Biojava-l@biojava.org http://biojava.org/mailman/listinfo/biojava-l From d.lapointe at comcast.net Thu Jan 20 08:46:15 2005 From: d.lapointe at comcast.net (David Lapointe) Date: Thu Jan 20 08:42:40 2005 Subject: [Biojava-l] BioSQL Message-ID: <200501200846.15274.d.lapointe@comcast.net> In BioJavaInAnger David Huen has the disclaimer at the end of his useful section NOTE: If you are using the 1.3 version of Biojava with the Singapore schema, do not install biosqldb-assembly-pg.sql or biosql-accelerators-pg.sql as described above. All you will need is the the new biosqldb-pg.sql. There appear to be performance issues in some cases when the other stuff is installed also. This note will be updated eventually to reflect this advice. It is not clear to me which way to go. How does one know whether one has the SIngapore schema ? -- .david David Lapointe "Love goes out the door when money comes innuendo." - G.Marx From smh1008 at cus.cam.ac.uk Thu Jan 20 12:50:35 2005 From: smh1008 at cus.cam.ac.uk (David Huen) Date: Thu Jan 20 12:46:46 2005 Subject: [Biojava-l] BioSQL In-Reply-To: <200501200846.15274.d.lapointe@comcast.net> References: <200501200846.15274.d.lapointe@comcast.net> Message-ID: <200501201750.35335.smh1008@cus.cam.ac.uk> On Thursday 20 Jan 2005 13:46, David Lapointe wrote: > In BioJavaInAnger David Huen has the disclaimer at the end of his useful > section > > NOTE: If you are using the 1.3 version of Biojava with the Singapore > schema, do not install biosqldb-assembly-pg.sql or > biosql-accelerators-pg.sql as described above. All you will need is the > the new biosqldb-pg.sql. There appear to be performance issues in some > cases when the other stuff is installed also. This note will be updated > eventually to reflect this advice. > > It is not clear to me which way to go. How does one know whether one has > the SIngapore schema ? 
I believe that the current schemas are at:-

http://cvs.open-bio.org/cgi-bin/viewcvs/viewcvs.cgi/biosql-schema/sql/?cvsroot=biosql

The last time I tried with postgres, I believe the biosqldb-pg.sql alone was sufficient (I am relying on hazy memories here, anyone else know better?) and the drop-tables.sql is useful for cleaning up.

Performance was quite slow but Thomas Down found that introducing a further index improved things:-

    create index seqfeatureloc_seqfeature on location (seqfeature_id);

I don't know if this has been committed since. If you have performance problems with postgres, do investigate adding indices to the postgres case.

Regards,
David H.

From dan.baggott.work at gmail.com Thu Jan 20 18:02:44 2005
From: dan.baggott.work at gmail.com (Dan Baggott)
Date: Thu Jan 20 17:58:47 2005
Subject: [Biojava-l] reading nib sequence files
Message-ID: <922876fd050120150262960c68@mail.gmail.com>

Does anyone have any java code for reading from .nib nucleotide sequence files (ie what's used by the UCSC folks)? I know Jim Kent et al. have some utilities (I think in C) for reading from nib files but am wondering about java...

Thanks,

Dan

From dlondon at ebi.ac.uk Thu Jan 20 12:58:59 2005
From: dlondon at ebi.ac.uk (Darin London)
Date: Thu Jan 20 19:32:35 2005
Subject: [Biojava-l] BOSC 2005
Message-ID: <20050120175859.GA7254@parrot.ebi.ac.uk>

{Please pass the word!}

MEETING ANNOUNCEMENT & CALL FOR SPEAKERS

The 6th annual Bioinformatics Open Source Conference (BOSC'2005) is organized by the not-for-profit Open Bioinformatics Foundation. The meeting will take place June 23-24, 2005 in Detroit, Michigan, USA, and is one of several Special Interest Group (SIG) meetings occurring in conjunction with the 13th International Conference on Intelligent Systems for Molecular Biology. See http://www.iscb.org/ismb2005 for more information.

Because of the power of many Open Source bioinformatics packages in use by the Research Community today, it is not too presumptuous to say that the work of the Open Source Bioinformatics Community represents the cutting edge of Bioinformatics in general. This has been repeatedly demonstrated by the quality of presentations at previous BOSC conferences. This year, at BOSC 2005, we want to continue this tradition of excellence, while presenting this message to a wider part of the Research Community. Please pass this message on to anyone you know who is interested in Bioinformatics software.

BOSC PROGRAM & CONTACT INFO

 * Web: http://www.open-bio.org/bosc2005/
 * Email: bosc@open-bio.org

FEES TO BE ANNOUNCED. Watch the bosc website for more information.

SPEAKERS & ABSTRACTS WANTED

The program committee is currently seeking abstracts for talks at BOSC 2005. BOSC is a great opportunity for you to tell the community about your use, development, or philosophy of open source software development in bioinformatics. The committee will select several submitted abstracts for 25-minute talks and others for shorter "lightning" talks. Accepted abstracts will be published on the BOSC web site.

If you are interested in speaking at BOSC 2005, please send us before April 26, 2005:

 * an abstract (no more than a few paragraphs)
 * a URL for the project page, if applicable
 * information about the open source license used for your software or your release plans.

Abstracts will be accepted for submission until April 26, 2005. Abstracts chosen for presentation will be announced May 12, 2005 (before the ISMB Early Registration Deadline).

LIGHTNING-TALK SPEAKERS WANTED!

The program committee is currently seeking speakers for the lightning talks at BOSC 2005. Lightning talks are quick - only five minutes long - and a great opportunity for you to give people a quick summary of your open source project, code, idea, or vision of the future.

If you are interested in giving a lightning talk at BOSC 2005, please send us:

 * a brief title and summary (one or two lines)
 * a URL for the project page, if applicable
 * information about the open source license used for your software or your release plans.

We will accept entries on-line until BOSC starts, but space for demos and lightning talks is limited.

SOFTWARE DEMONSTRATIONS WANTED!

If you are involved in the development of Open Source Bioinformatics Software, you are invited to provide a short demonstration to attendees of BOSC 2005.

If you are interested in giving a software demonstration at BOSC 2005, please send us:

 * a brief title and summary (one or two lines)
 * a URL for the project page, if applicable
 * Internet connectivity requirements (e.g. website application served on the world wide web, or web-based client application).

We will accept entries on-line until BOSC starts, but space for demos and lightning talks is limited.

** Because the mission of the OBF is to promote Open Source software, we will favor submissions for projects that apply a recognized Open Source License, or adhere to the general Open Source Philosophy. See the following websites for further details:

 http://www.opensource.org/licenses/
 http://www.opensource.org/docs/definition.php

SESSION CHAIRS WANTED

If you would like to be involved in BOSC 2005, we invite you to chair a session. This will not require much of your time. You will be given a schedule of presenters during your session. You simply introduce each speaker, and manage the time of their presentation (25 minutes for full presentations, 5-10 minutes for lightning talks/demos, depending on the number of entries).

If you are interested in chairing a session, please send us your name and affiliation (if applicable).

--
cheers,
Darin London

dlondon@ebi.ac.uk
European Bioinformatics Institute,      +44 (0)1223 49 2566
Wellcome Trust Genome Campus, Hinxton   +44 (0)1223 49 4468 (fax)
Cambridgeshire CB10 1SD, UK

From mark.schreiber at group.novartis.com Sun Jan 23 21:10:35 2005
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Sun Jan 23 21:06:48 2005
Subject: [Biojava-l] reading nib sequence files
Message-ID:

In short, no.

Do you have a description of the format? It may not be too hard to adapt an existing parser.

- Mark

Dan Baggott
Sent by: biojava-l-bounces@portal.open-bio.org
01/21/2005 07:02 AM
Please respond to baggott2

To: biojava-l@biojava.org
cc: (bcc: Mark Schreiber/GP/Novartis)
Subject: [Biojava-l] reading nib sequence files

Does anyone have any java code for reading from .nib nucleotide sequence files (ie what's used by the UCSC folks)? I know Jim Kent et al. have some utilities (I think in C) for reading from nib files but am wondering about java...

Thanks,

Dan
_______________________________________________
Biojava-l mailing list - Biojava-l@biojava.org
http://biojava.org/mailman/listinfo/biojava-l

From hollandr at gis.a-star.edu.sg Sun Jan 23 21:48:44 2005
From: hollandr at gis.a-star.edu.sg (Richard HOLLAND)
Date: Sun Jan 23 21:45:59 2005
Subject: [Biojava-l] reading nib sequence files
Message-ID: <6D9E9B9DF347EF4385F6271C64FB8D56015B2C8B@BIONIC.biopolis.one-north.com>

It's a compressed binary format. I doubt BioJava would be able to read it without a lot of effort, as the current parser framework is set up for text input only.

Richard Holland
Bioinformatics Specialist
GIS extension 8199

---------------------------------------------
This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its content to any other person. Thank you.
--------------------------------------------- > -----Original Message----- > From: biojava-l-bounces@portal.open-bio.org > [mailto:biojava-l-bounces@portal.open-bio.org] On Behalf Of > mark.schreiber@group.novartis.com > Sent: Monday, January 24, 2005 10:11 AM > To: baggott2@llnl.gov > Cc: biojava-l-bounces@portal.open-bio.org; biojava-l@biojava.org > Subject: Re: [Biojava-l] reading nib sequence files > > > In short, no. > > Do you have a description of the format? It may not be too > hard to adapt > an existing parser. > > - Mark > > > > > > Dan Baggott > Sent by: biojava-l-bounces@portal.open-bio.org > 01/21/2005 07:02 AM > Please respond to baggott2 > > > To: biojava-l@biojava.org > cc: (bcc: Mark Schreiber/GP/Novartis) > Subject: [Biojava-l] reading nib sequence files > > > Does anyone hava any java code for reading from .nib nucleotide > sequence files (ie what's used by the UCSC folks)? I know Jim Kent > et al. have some utilities (I think in C) for reading from nib files > but am wondering about java... > > Thanks, > > Dan > _______________________________________________ > Biojava-l mailing list - Biojava-l@biojava.org > http://biojava.org/mailman/listinfo/biojava-l > > > > _______________________________________________ > Biojava-l mailing list - Biojava-l@biojava.org > http://biojava.org/mailman/listinfo/biojava-l > From td2 at sanger.ac.uk Mon Jan 24 03:34:04 2005 From: td2 at sanger.ac.uk (Thomas Down) Date: Mon Jan 24 03:30:07 2005 Subject: [Biojava-l] reading nib sequence files In-Reply-To: <6D9E9B9DF347EF4385F6271C64FB8D56015B2C8B@BIONIC.biopolis.one-north.com> References: <6D9E9B9DF347EF4385F6271C64FB8D56015B2C8B@BIONIC.biopolis.one-north.com> Message-ID: On 24 Jan 2005, at 02:48, Richard HOLLAND wrote: > It's a compressed binary format. I doubt BioJava would be able to read > it without a lot of effort as the current parser framework is set up > for > text input only. 
Nib support probably wouldn't fit into the text-oriented parsing framework, but I'm sure it could be supported somehow if there was demand. A quick google doesn't turn up any format documentation, but Jim Kent's IO code is at: http://www.soe.ucsc.edu/~kent/src/unzipped/lib/nib.c One interesting way to handle this might be to open the nib file as a MappedByteBuffer, and back a SymbolList directly using that -- potentially giving us an efficient way of working with huge sequences.. Any interest in that? Thomas. From mark.schreiber at group.novartis.com Mon Jan 24 03:37:16 2005 From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com) Date: Mon Jan 24 03:33:26 2005 Subject: [Biojava-l] reading nib sequence files Message-ID: I'd need to brush up on my nio, and my c ! Thomas Down 01/24/2005 04:34 PM To: "Richard HOLLAND" cc: "", biojava-list List , Mark Schreiber/GP/Novartis@PH Subject: Re: [Biojava-l] reading nib sequence files On 24 Jan 2005, at 02:48, Richard HOLLAND wrote: > It's a compressed binary format. I doubt BioJava would be able to read > it without a lot of effort as the current parser framework is set up > for > text input only. Nib support probably wouldn't fit into the text-oriented parsing framework, but I'm sure it could be supported somehow if there was demand. A quick google doesn't turn up any format documentation, but Jim Kent's IO code is at: http://www.soe.ucsc.edu/~kent/src/unzipped/lib/nib.c One interesting way to handle this might be to open the nib file as a MappedByteBuffer, and back a SymbolList directly using that -- potentially giving us an efficient way of working with huge sequences.. Any interest in that? Thomas. 
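
[Editorial sketch, not part of the original thread.] The MappedByteBuffer idea Thomas describes could look roughly like the following, assuming one base per 4-bit nibble with the first base of each byte in the high-order bits. The class name NibbleSequence, the headerBytes and length parameters, and the 16-entry codeTable are all hypothetical: nothing here is taken from nib.c or from BioJava, so the real header size, base count and nibble-to-base mapping would have to be read out of Jim Kent's code. Positions are 1-based to match SymbolList conventions.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical read-only view over a buffer holding one base per 4-bit nibble. */
final class NibbleSequence {
    private final ByteBuffer data;   // whole file, header included
    private final int headerBytes;   // assumption: caller knows the header size
    private final int length;        // number of bases, supplied by the caller
    private final char[] codeTable;  // assumption: 16 entries, nibble value -> base

    NibbleSequence(ByteBuffer data, int headerBytes, int length, char[] codeTable) {
        if (codeTable.length != 16) {
            throw new IllegalArgumentException("codeTable needs 16 entries");
        }
        if (data.limit() < headerBytes + (length + 1) / 2) {
            throw new IllegalArgumentException("buffer too short for " + length + " bases");
        }
        this.data = data;
        this.headerBytes = headerBytes;
        this.length = length;
        this.codeTable = codeTable.clone();
    }

    /** Memory-maps the file; the mapping stays valid after the channel is closed. */
    static NibbleSequence map(Path file, int headerBytes, int length, char[] codeTable)
            throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            return new NibbleSequence(buf, headerBytes, length, codeTable);
        }
    }

    int length() {
        return length;
    }

    /** Returns the base at a 1-based position without copying the sequence. */
    char charAt(int pos) {
        if (pos < 1 || pos > length) {
            throw new IndexOutOfBoundsException("Position " + pos + " is outside 1.." + length);
        }
        int i = pos - 1;
        int b = data.get(headerBytes + i / 2) & 0xFF;        // absolute read, no position change
        int nibble = (i % 2 == 0) ? (b >>> 4) : (b & 0x0F);  // high nibble first (assumed)
        return codeTable[nibble];
    }
}
```

A SymbolList implementation would wrap charAt and translate each character to a Symbol on demand, so only the pages of the file that are actually touched need to be resident in memory.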
From hollandr at gis.a-star.edu.sg Mon Jan 24 03:47:12 2005 From: hollandr at gis.a-star.edu.sg (Richard HOLLAND) Date: Mon Jan 24 03:44:07 2005 Subject: [Biojava-l] reading nib sequence files Message-ID: <6D9E9B9DF347EF4385F6271C64FB8D56015B2CF1@BIONIC.biopolis.one-north.com> I think the idea of storing sequences internally as compressed binary sequence would be a good idea regardless, for any symbol list. Currently each Symbol in a SymbolList requires one word of memory (the size of a memory pointer to the singleton Symbol instances). Therefore any SymbolList of length X containing symbols from an n-ary alphabet would require X words of memory to store it, plus the overhead of the SymbolList and n Symbol singleton instances (admittedly shared between all SymbolLists currently in memory). If you used a compressed binary format internally, doing away with explicit Symbol references and representing each symbol in a ByteBuffer as binary values (00 for A, 01 for T, 10 for C, 11 for G etc.), you would require much less space than even the singleton model above. This way you could fit four DNA symbols into a single byte of memory, as opposed to four words of memory. The number of bits required for a symbol in any given alphabet is merely log base 2 of the size of the alphabet, rounded up to the nearest whole number. eg. for the English alphabet of 26 letters only, you would need 5 bits, or in terms of whole bytes, you would be able to fit 8 symbols into 5 bytes. 
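
[Editorial sketch, not part of the original message.] The arithmetic above (bits per symbol = log base 2 of the alphabet size, rounded up) can be written out as a small Java helper. PackedSymbols and its methods are hypothetical names, not BioJava API; the integer codes stand in for whatever symbol-to-bits map an alphabet would define, using the 00 = A, 01 = T, 10 = C, 11 = G example for DNA.

```java
/** Hypothetical helper, not part of BioJava: fixed-width bit packing of symbol codes. */
final class PackedSymbols {
    private PackedSymbols() {}

    /** Bits per symbol: log2(alphabetSize) rounded up, with a floor of 1. */
    static int bitsPerSymbol(int alphabetSize) {
        if (alphabetSize < 1) {
            throw new IllegalArgumentException("alphabet must not be empty");
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(alphabetSize - 1));
    }

    /** Whole bytes needed to hold 'length' symbols of the given width. */
    static int bytesNeeded(int length, int bits) {
        return (length * bits + 7) / 8;
    }

    /** Packs codes most-significant-bit first; unused trailing bits stay zero. */
    static byte[] pack(int[] codes, int bits) {
        byte[] out = new byte[bytesNeeded(codes.length, bits)];
        for (int i = 0; i < codes.length; i++) {
            if ((codes[i] >>> bits) != 0) {
                throw new IllegalArgumentException("code " + codes[i] + " needs more than " + bits + " bits");
            }
            for (int b = 0; b < bits; b++) {
                int bitIndex = i * bits + b;
                if (((codes[i] >> (bits - 1 - b)) & 1) != 0) {
                    out[bitIndex / 8] |= (byte) (1 << (7 - bitIndex % 8));
                }
            }
        }
        return out;
    }

    /** Decodes the symbol code at a 0-based index; the caller tracks the true length. */
    static int codeAt(byte[] packed, int bits, int index) {
        int code = 0;
        for (int b = 0; b < bits; b++) {
            int bitIndex = index * bits + b;
            int bit = (packed[bitIndex / 8] >> (7 - bitIndex % 8)) & 1;
            code = (code << 1) | bit;
        }
        return code;
    }
}
```

With these definitions bitsPerSymbol(4) is 2 and bitsPerSymbol(26) is 5, and bytesNeeded(8, 5) is 5, matching the figures in the message. Because the last byte may contain padding bits, the decoder cannot tell where the sequence ends on its own, which is why a separate length has to be stored alongside the packed data.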
To do this you would need to define a 'bits' parameter on the alphabet which is calculated from the number of symbols in the alphabet, a 'bitMap' parameter on the alphabet which maps symbols to bit values (and vice versa with 'inverseBitMap'), and keep a separate 'length' parameter in the SymbolList which would be used to tell the binary decoder when to stop parsing the sequence (as you can only store whole bytes, there will often be trailing zeroes in the buffer which could be misleading without this extra parameter). You could always return singleton Symbol objects if requested, by decoding the binary sequence on the fly, but you would no longer need to store the sequence using them. Is this worth considering for the big BioJava rewrite? Richard Holland Bioinformatics Specialist GIS extension 8199 --------------------------------------------- This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its content to any other person. Thank you. --------------------------------------------- > -----Original Message----- > From: mark.schreiber@group.novartis.com > [mailto:mark.schreiber@group.novartis.com] > Sent: Monday, January 24, 2005 4:37 PM > To: Thomas Down > Cc: biojava-list List; Richard HOLLAND; > " Subject: Re: [Biojava-l] reading nib sequence files > > > I'd need to brush up on my nio, and my c ! > > > > > > Thomas Down > 01/24/2005 04:34 PM > > > To: "Richard HOLLAND" > cc: "", biojava-list List > , Mark > Schreiber/GP/Novartis@PH > Subject: Re: [Biojava-l] reading nib sequence files > > > > On 24 Jan 2005, at 02:48, Richard HOLLAND wrote: > > > It's a compressed binary format. I doubt BioJava would be > able to read > > it without a lot of effort as the current parser framework > is set up > > for > > text input only. 
> > Nib support probably wouldn't fit into the text-oriented parsing > framework, but I'm sure it could be supported somehow if there was > demand. A quick google doesn't turn up any format documentation, but > Jim Kent's IO code is at: > > http://www.soe.ucsc.edu/~kent/src/unzipped/lib/nib.c > > One interesting way to handle this might be to open the nib file as a > MappedByteBuffer, and back a SymbolList directly using that -- > potentially giving us an efficient way of working with huge > sequences.. > Any interest in that? > > Thomas. > > > > > From mark.schreiber at group.novartis.com Mon Jan 24 03:52:46 2005 From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com) Date: Mon Jan 24 03:48:51 2005 Subject: [Biojava-l] reading nib sequence files Message-ID: BioJava does already do some compression on large sequences (or at least it used to). Like you say you can bit pack a lot. Ambiguity causes problems as you can have more than four symbols for DNA (including n, y, r etc). Does Jim Kent's schema offer better compression? Even if it doens't the use of a ByteBuffer will probably increase the speed of the current implementations. - Mark "Richard HOLLAND" 01/24/2005 04:47 PM To: Mark Schreiber/GP/Novartis@PH, "Thomas Down" cc: "biojava-list List" , Subject: RE: [Biojava-l] reading nib sequence files I think the idea of storing sequences internally as compressed binary sequence would be a good idea regardless, for any symbol list. Currently each Symbol in a SymbolList requires one word of memory (the size of a memory pointer to the singleton Symbol instances). Therefore any SymbolList of length X containing symbols from an n-ary alphabet would require X words of memory to store it, plus the overhead of the SymbolList and n Symbol singleton instances (admittedly shared between all SymbolLists currently in memory). 
If you used a compressed binary format internally, doing away with explicit Symbol references and representing each symbol in a ByteBuffer as binary values (00 for A, 01 for T, 10 for C, 11 for G etc.), you would require much less space than even the singleton model above. This way you could fit four DNA symbols into a single byte of memory, as opposed to four words of memory. The number of bits required for a symbol in any given alphabet is merely log base 2 of the size of the alphabet, rounded up to the nearest whole number. eg. for the English alphabet of 26 letters only, you would need 5 bits, or in terms of whole bytes, you would be able to fit 8 symbols into 5 bytes. To do this you would need to define a 'bits' parameter on the alphabet which is calculated from the number of symbols in the alphabet, a 'bitMap' parameter on the alphabet which maps symbols to bit values (and vice versa with 'inverseBitMap'), and keep a separate 'length' parameter in the SymbolList which would be used to tell the binary decoder when to stop parsing the sequence (as you can only store whole bytes, there will often be trailing zeroes in the buffer which could be misleading without this extra parameter). You could always return singleton Symbol objects if requested, by decoding the binary sequence on the fly, but you would no longer need to store the sequence using them. Is this worth considering for the big BioJava rewrite? Richard Holland Bioinformatics Specialist GIS extension 8199 --------------------------------------------- This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its content to any other person. Thank you. 
---------------------------------------------

> -----Original Message-----
> From: mark.schreiber@group.novartis.com
> [mailto:mark.schreiber@group.novartis.com]
> Sent: Monday, January 24, 2005 4:37 PM
> To: Thomas Down
> Cc: biojava-list List; Richard HOLLAND
> Subject: Re: [Biojava-l] reading nib sequence files
>
> I'd need to brush up on my nio, and my C!
>
> Thomas Down
> 01/24/2005 04:34 PM
>
> To: "Richard HOLLAND"
> cc: biojava-list List, Mark Schreiber/GP/Novartis@PH
> Subject: Re: [Biojava-l] reading nib sequence files
>
> On 24 Jan 2005, at 02:48, Richard HOLLAND wrote:
>
> > It's a compressed binary format. I doubt BioJava would be able to read
> > it without a lot of effort as the current parser framework is set up
> > for text input only.
>
> Nib support probably wouldn't fit into the text-oriented parsing
> framework, but I'm sure it could be supported somehow if there was
> demand. A quick google doesn't turn up any format documentation, but
> Jim Kent's IO code is at:
>
> http://www.soe.ucsc.edu/~kent/src/unzipped/lib/nib.c
>
> One interesting way to handle this might be to open the nib file as a
> MappedByteBuffer, and back a SymbolList directly using that --
> potentially giving us an efficient way of working with huge sequences.
> Any interest in that?
>
> Thomas.

From hollandr at gis.a-star.edu.sg Mon Jan 24 03:59:27 2005
From: hollandr at gis.a-star.edu.sg (Richard HOLLAND)
Date: Mon Jan 24 03:56:23 2005
Subject: [Biojava-l] reading nib sequence files
Message-ID: <6D9E9B9DF347EF4385F6271C64FB8D56015B2CF5@BIONIC.biopolis.one-north.com>

NIB files store one base per 4 bits, non-variable, giving a 50% compression rate and a maximum arity of 16 different base values per position.

Richard Holland
Bioinformatics Specialist
GIS extension 8199

> -----Original Message-----
> From: mark.schreiber@group.novartis.com
> [mailto:mark.schreiber@group.novartis.com]
> Sent: Monday, January 24, 2005 4:53 PM
> To: Richard HOLLAND
> Cc: baggott2@llnl.gov; biojava-list List; Thomas Down
> Subject: RE: [Biojava-l] reading nib sequence files
>
> BioJava does already do some compression on large sequences (or at least
> it used to). Like you say you can bit pack a lot. Ambiguity causes
> problems as you can have more than four symbols for DNA (including n, y, r
> etc).
>
> Does Jim Kent's schema offer better compression? Even if it doesn't the
> use of a ByteBuffer will probably increase the speed of the current
> implementations.
>
> - Mark
>
> "Richard HOLLAND"
> 01/24/2005 04:47 PM
>
> To: Mark Schreiber/GP/Novartis@PH, "Thomas Down"
> cc: "biojava-list List"
> Subject: RE: [Biojava-l] reading nib sequence files
>
> I think the idea of storing sequences internally as compressed binary
> sequence would be a good idea regardless, for any symbol list. Currently
> each Symbol in a SymbolList requires one word of memory (the size of a
> memory pointer to the singleton Symbol instances). Therefore any
> SymbolList of length X containing symbols from an n-ary alphabet would
> require X words of memory to store it, plus the overhead of the
> SymbolList and n Symbol singleton instances (admittedly shared between
> all SymbolLists currently in memory).

From mark.schreiber at group.novartis.com Mon Jan 24 04:17:16 2005
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Mon Jan 24 04:14:09 2005
Subject: [Biojava-l] reading nib sequence files
Message-ID:

BioJava uses (or at least can use) the PackedSymbolList for large sequences. It uses an array of longs to represent the packed bits.

There may be some advantage to using a ByteBuffer, hard to know.

- Mark

From verhoeff2 at gis.a-star.edu.sg Mon Jan 24 04:16:29 2005
From: verhoeff2 at gis.a-star.edu.sg (VERHOEF Frans)
Date: Mon Jan 24 04:14:14 2005
Subject: [Biojava-l] reading nib sequence files
Message-ID: <6D9E9B9DF347EF4385F6271C64FB8D56D73461@BIONIC.biopolis.one-north.com>

You could always ZIPStream it out for even more compression.

Frans

_______________________________________________
Biojava-l mailing list - Biojava-l@biojava.org
http://biojava.org/mailman/listinfo/biojava-l

From hollandr at gis.a-star.edu.sg Mon Jan 24 04:19:07 2005
From: hollandr at gis.a-star.edu.sg (Richard HOLLAND)
Date: Mon Jan 24 04:16:34 2005
Subject: [Biojava-l] reading nib sequence files
Message-ID: <6D9E9B9DF347EF4385F6271C64FB8D56015B2CF8@BIONIC.biopolis.one-north.com>

The trouble with ZIP is that to do random-access reads of the sequence (e.g. give me all bases from X to Y) you have to unzip the whole sequence each time. That makes it quite a bit slower. The solution needs to be a compression algorithm of some kind which allows instant random access without slowing down the create/update process too much either. Hence a custom fixed-width binary solution would be the first thing that comes to mind, but it may not be the only one.

Richard Holland
Bioinformatics Specialist
GIS extension 8199
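Thomas's suggestion of backing a sequence with a MappedByteBuffer, combined with the one-base-per-4-bits layout Richard describes, might look roughly like the sketch below. Everything specific here is an assumption for illustration: the class name `MappedNibbleSequence`, the `headerBytes` parameter, the high-nibble-first order and the code table are not taken from nib.c, which should be consulted for the real layout.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch: 4-bit packed bases read through a memory-mapped file. */
final class MappedNibbleSequence {
    // Assumed code table, for illustration only; consult nib.c for the real one.
    private static final char[] CODES = "TCAGN???????????".toCharArray();

    private final MappedByteBuffer map;
    private final int length; // number of bases; the last byte may be only half used

    /**
     * @param file        file holding the packed bases
     * @param headerBytes bytes to skip before the packed data (an assumed layout detail)
     * @param length      number of bases stored
     */
    MappedNibbleSequence(Path file, long headerBytes, int length) throws IOException {
        this.length = length;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            this.map = ch.map(FileChannel.MapMode.READ_ONLY, headerBytes, (length + 1L) / 2);
        }
    }

    /** Returns the base at a 0-based index; assumes the first base of a byte is the high nibble. */
    char baseAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int b = map.get(index >>> 1);
        int nibble = (index & 1) == 0 ? (b >> 4) & 0xF : b & 0xF;
        return CODES[nibble];
    }

    /** Decodes the bases in [from, to) on the fly. */
    String subSequence(int from, int to) {
        StringBuilder sb = new StringBuilder(Math.max(0, to - from));
        for (int i = from; i < to; i++) {
            sb.append(baseAt(i));
        }
        return sb.toString();
    }
}
```

Because every base sits at a fixed bit offset, a range read only touches the bytes it needs, and the operating system decides which parts of the file are actually paged into memory. A single mapping is limited to about 2 GB, so a very large sequence would need several mappings.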
> > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > Biojava-l mailing list - Biojava-l@biojava.org > http://biojava.org/mailman/listinfo/biojava-l > From td2 at sanger.ac.uk Mon Jan 24 04:22:27 2005 From: td2 at sanger.ac.uk (Thomas Down) Date: Mon Jan 24 04:18:31 2005 Subject: [Biojava-l] reading nib sequence files In-Reply-To: References: Message-ID: <76C0E31C-6DE9-11D9-A9E5-000A95C8B056@sanger.ac.uk> On 24 Jan 2005, at 09:17, mark.schreiber@group.novartis.com wrote: > BioJava uses (or at least can use) the PackedSymbolList for large > sequences. It uses an array of longs to represent the packed bits. > > There may be some advantage to using a ByteBuffer, hard to know. The main reason I was thinking for using MappedByteBuffer is that if you're accessing a large amount of sequence it won't necessarily all get loaded into memory at once. This could, for example, make random access to a multi-gigabase sequence database bearable on a basic desktop computer. Just a thought, not sure how much demand there is for this. Thomas. From mark.schreiber at group.novartis.com Mon Jan 24 04:26:12 2005 From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com) Date: Mon Jan 24 04:29:39 2005 Subject: [Biojava-l] reading nib sequence files Message-ID: Would make an interesting comp sci project as you would probably need a way to split the DB over multiple files when you hit the limits of your OS. - Mark Thomas Down 01/24/2005 05:22 PM To: Mark Schreiber/GP/Novartis@PH cc: biojava-list List Subject: Re: [Biojava-l] reading nib sequence files On 24 Jan 2005, at 09:17, mark.schreiber@group.novartis.com wrote: > BioJava uses (or at least can use) the PackedSymbolList for large > sequences. It uses an array of longs to represent the packed bits. > > There may be some advantage to using a ByteBuffer, hard to know. 
The main reason I was thinking for using MappedByteBuffer is that if you're accessing a large amount of sequence it won't necessarily all get loaded into memory at once. This could, for example, make random access to a multi-gigabase sequence database bearable on a basic desktop computer. Just a thought, not sure how much demand there is for this. Thomas. From hollandr at gis.a-star.edu.sg Mon Jan 24 04:37:50 2005 From: hollandr at gis.a-star.edu.sg (Richard HOLLAND) Date: Mon Jan 24 04:36:48 2005 Subject: [Biojava-l] reading nib sequence files Message-ID: <6D9E9B9DF347EF4385F6271C64FB8D56015B2CFC@BIONIC.biopolis.one-north.com> I would have thought anything that makes life easier when you need it but doesn't make life harder when you don't would be a good thing, as long as it doesn't take too much programming effort. Seeing as you are intending to rewrite BioJava from scratch anyway, I think this would be a good thing to work into the mechanism. Richard Holland Bioinformatics Specialist GIS extension 8199 --------------------------------------------- This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its content to any other person. Thank you. --------------------------------------------- > -----Original Message----- > From: biojava-l-bounces@portal.open-bio.org > [mailto:biojava-l-bounces@portal.open-bio.org] On Behalf Of > Thomas Down > Sent: Monday, January 24, 2005 5:22 PM > To: mark.schreiber@group.novartis.com > Cc: biojava-list List > Subject: Re: [Biojava-l] reading nib sequence files > > > > On 24 Jan 2005, at 09:17, mark.schreiber@group.novartis.com wrote: > > > BioJava uses (or at least can use) the PackedSymbolList for large > > sequences. It uses an array of longs to represent the packed bits. > > > > There may be some advantage to using a ByteBuffer, hard to know. 
> > The main reason I was thinking for using MappedByteBuffer is that if > you're accessing a large amount of sequence it won't necessarily all > get loaded into memory at once. This could, for example, make random > access to a multi-gigabase sequence database bearable on a basic > desktop computer. Just a thought, not sure how much demand there is > for this. > > Thomas. > > _______________________________________________ > Biojava-l mailing list - Biojava-l@biojava.org > http://biojava.org/mailman/listinfo/biojava-l > From Russell.Smithies at agresearch.co.nz Mon Jan 24 14:29:37 2005 From: Russell.Smithies at agresearch.co.nz (Smithies, Russell) Date: Mon Jan 24 14:25:56 2005 Subject: [Biojava-l] reading nib sequence files Message-ID: You don't need to extract the whole file with ZipInputStream first. I managed to get the part I wanted by setting the offset to the start of the sequence (was using zipped chromosomes in fasta format) and the buffer to the length I wanted. It was a year or 2 ago and I probably don't have the code anymore but it is possible ;-) Russell Smithies Bioinformatics Software Developer AgResearch Invermay Private Bag 50034 Puddle Alley Mosgiel New Zealand -----Original Message----- From: biojava-l-bounces@portal.open-bio.org [mailto:biojava-l-bounces@portal.open-bio.org] On Behalf Of Richard HOLLAND Sent: Monday, 24 January 2005 10:19 p.m. To: VERHOEF Frans; mark.schreiber@group.novartis.com Cc: biojava-list List; Thomas Down Subject: RE: [Biojava-l] reading nib sequence files The trouble with ZIP is that to do random-access reads of the sequence (eg. give me all bases from X to Y) you have to unzip the whole sequence each time. That makes it quite a bit slower. The solution needs to be a compression algorithm of some kind which allows instant random access without slowing down the create/update process too much either. Hence a custom fixed-width binary solution would be the first thing that comes to mind, but it may not be the only one. 
Richard Holland
Bioinformatics Specialist
GIS extension 8199

---------------------------------------------
This email is confidential and may be privileged. If you are not the
intended recipient, please delete it and notify us immediately. Please
do not copy or use it for any purpose, or disclose its content to any
other person. Thank you.
---------------------------------------------

> -----Original Message-----
> From: VERHOEF Frans
> Sent: Monday, January 24, 2005 5:16 PM
> To: Richard HOLLAND; mark.schreiber@group.novartis.com
> Cc: Thomas Down; biojava-list List
> Subject: RE: [Biojava-l] reading nib sequence files
>
> You could always ZIPStream it out for even more compression.
>
> Frans
>
> -----Original Message-----
> From: biojava-l-bounces@portal.open-bio.org
> [mailto:biojava-l-bounces@portal.open-bio.org] On Behalf Of Richard HOLLAND
> Sent: Monday, January 24, 2005 04:59 PM
> To: mark.schreiber@group.novartis.com
> Cc: Thomas Down; biojava-list List
> Subject: RE: [Biojava-l] reading nib sequence files
>
> NIB files store one base per 4 bits, non-variable, giving a 50%
> compression rate and a maximum arity of 16 different base values
> per position.
>
> Richard Holland
> Bioinformatics Specialist
> GIS extension 8199
>
> > -----Original Message-----
> > From: mark.schreiber@group.novartis.com
> > [mailto:mark.schreiber@group.novartis.com]
> > Sent: Monday, January 24, 2005 4:53 PM
> > To: Richard HOLLAND
> > Cc: baggott2@llnl.gov; biojava-list List; Thomas Down
> > Subject: RE: [Biojava-l] reading nib sequence files
> >
> > BioJava does already do some compression on large sequences (or at
> > least it used to). Like you say you can bit pack a lot. Ambiguity
> > causes problems as you can have more than four symbols for DNA
> > (including n, y, r etc).
> >
> > Does Jim Kent's schema offer better compression? Even if it doesn't,
> > the use of a ByteBuffer will probably increase the speed of the
> > current implementations.
> >
> > - Mark
> >
> > "Richard HOLLAND"
> > 01/24/2005 04:47 PM
> >
> > To: Mark Schreiber/GP/Novartis@PH, "Thomas Down"
> > cc: "biojava-list List"
> > Subject: RE: [Biojava-l] reading nib sequence files
> >
> > I think the idea of storing sequences internally as compressed binary
> > sequence would be a good idea regardless, for any symbol list.
> > Currently each Symbol in a SymbolList requires one word of memory (the
> > size of a memory pointer to the singleton Symbol instances). Therefore
> > any SymbolList of length X containing symbols from an n-ary alphabet
> > would require X words of memory to store it, plus the overhead of the
> > SymbolList and n Symbol singleton instances (admittedly shared between
> > all SymbolLists currently in memory).
> >
> > If you used a compressed binary format internally, doing away with
> > explicit Symbol references and representing each symbol in a
> > ByteBuffer as binary values (00 for A, 01 for T, 10 for C, 11 for G
> > etc.), you would require much less space than even the singleton model
> > above. This way you could fit four DNA symbols into a single byte of
> > memory, as opposed to four words of memory. The number of bits
> > required for a symbol in any given alphabet is merely log base 2 of
> > the size of the alphabet, rounded up to the nearest whole number. eg.
> > for the English alphabet of 26 letters only, you would need 5 bits, or
> > in terms of whole bytes, you would be able to fit 8 symbols into 5
> > bytes.
> >
> > To do this you would need to define a 'bits' parameter on the alphabet
> > which is calculated from the number of symbols in the alphabet, a
> > 'bitMap' parameter on the alphabet which maps symbols to bit values
> > (and vice versa with 'inverseBitMap'), and keep a separate 'length'
> > parameter in the SymbolList which would be used to tell the binary
> > decoder when to stop parsing the sequence (as you can only store whole
> > bytes, there will often be trailing zeroes in the buffer which could
> > be misleading without this extra parameter).
> >
> > You could always return singleton Symbol objects if requested, by
> > decoding the binary sequence on the fly, but you would no longer need
> > to store the sequence using them.
> >
> > Is this worth considering for the big BioJava rewrite?
> >
> > Richard Holland
> > Bioinformatics Specialist
> > GIS extension 8199
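
The bit-packing idea in the quoted message above can be sketched in plain Java. This is only an illustration under stated assumptions: `PackedSymbolList`, `bitsFor`, `symbolAt` and `byteSize` are hypothetical names, not BioJava classes or methods, and the alphabet is passed as a plain string whose character positions serve as the bit values (so "ATCG" gives 00 for A, 01 for T, 10 for C, 11 for G, as in the message).

```java
/** Hypothetical sketch of a bit-packed symbol list; not part of BioJava. */
final class PackedSymbolList {
    private final String alphabet; // position in this string = bit value of the symbol
    private final int bits;        // bits per symbol
    private final int length;      // symbol count; needed because the last byte may be padded
    private final byte[] data;     // packed symbols, most significant bit first

    /** Bits needed per symbol: log2(alphabetSize) rounded up, and at least 1. */
    static int bitsFor(int alphabetSize) {
        if (alphabetSize < 1) {
            throw new IllegalArgumentException("alphabet must not be empty");
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(alphabetSize - 1));
    }

    PackedSymbolList(String alphabet, String sequence) {
        this.alphabet = alphabet;
        this.bits = bitsFor(alphabet.length());
        this.length = sequence.length();
        this.data = new byte[Math.toIntExact(((long) length * bits + 7) / 8)];
        for (int i = 0; i < length; i++) {
            int code = alphabet.indexOf(sequence.charAt(i));
            if (code < 0) {
                throw new IllegalArgumentException("symbol not in alphabet: " + sequence.charAt(i));
            }
            // Write the code one bit at a time, highest bit first.
            for (int b = 0; b < bits; b++) {
                long pos = (long) i * bits + b;
                int bit = (code >> (bits - 1 - b)) & 1;
                data[(int) (pos >>> 3)] |= (byte) (bit << (7 - (int) (pos & 7)));
            }
        }
    }

    /** Decodes the symbol at a 0-based index on the fly. */
    char symbolAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int code = 0;
        for (int b = 0; b < bits; b++) {
            long pos = (long) index * bits + b;
            code = (code << 1) | ((data[(int) (pos >>> 3)] >> (7 - (int) (pos & 7))) & 1);
        }
        return alphabet.charAt(code);
    }

    /** Number of bytes actually used for storage. */
    int byteSize() {
        return data.length;
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(symbolAt(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        PackedSymbolList dna = new PackedSymbolList("ATCG", "GATTACA");
        System.out.println(dna + " stored in " + dna.byteSize() + " bytes");
    }
}
```

The separate `length` field plays the role of the 'length' parameter from the message: without it the padding bits in the last byte would decode as extra symbols. This sketch ignores ambiguity symbols, which would enlarge the alphabet and therefore the bits per symbol.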
> > > -----Original Message-----
> > > From: mark.schreiber@group.novartis.com
> > > [mailto:mark.schreiber@group.novartis.com]
> > > Sent: Monday, January 24, 2005 4:37 PM
> > > To: Thomas Down
> > > Cc: biojava-list List; Richard HOLLAND
> > > Subject: Re: [Biojava-l] reading nib sequence files
> > >
> > > I'd need to brush up on my nio, and my C!
> > >
> > > Thomas Down
> > > 01/24/2005 04:34 PM
> > >
> > > To: "Richard HOLLAND"
> > > cc: biojava-list List, Mark Schreiber/GP/Novartis@PH
> > > Subject: Re: [Biojava-l] reading nib sequence files
> > >
> > > On 24 Jan 2005, at 02:48, Richard HOLLAND wrote:
> > >
> > > > It's a compressed binary format. I doubt BioJava would be able to
> > > > read it without a lot of effort as the current parser framework
> > > > is set up for text input only.
> > >
> > > Nib support probably wouldn't fit into the text-oriented parsing
> > > framework, but I'm sure it could be supported somehow if there was
> > > demand. A quick google doesn't turn up any format documentation,
> > > but Jim Kent's IO code is at:
> > >
> > > http://www.soe.ucsc.edu/~kent/src/unzipped/lib/nib.c
> > >
> > > One interesting way to handle this might be to open the nib file as
> > > a MappedByteBuffer, and back a SymbolList directly using that --
> > > potentially giving us an efficient way of working with huge
> > > sequences. Any interest in that?
> > >
> > > Thomas.
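
The memory-mapped approach suggested above might look roughly like the following. This is a minimal sketch, not BioJava code: `NibbleSequence` and its methods are hypothetical, and the file layout (a fixed-size header followed by two 4-bit codes per byte, high nibble first) and the code-to-base table are assumptions supplied by the caller. The real nib layout should be checked against Jim Kent's nib.c before relying on any of this.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical random-access view over a buffer holding one base per 4 bits. */
final class NibbleSequence {
    private final ByteBuffer buf;
    private final int headerBytes; // assumed: bytes to skip before the packed bases
    private final int length;      // number of bases
    private final String codes;    // assumed: position in this string = 4-bit code of the base

    NibbleSequence(ByteBuffer buf, int headerBytes, int length, String codes) {
        if (headerBytes < 0 || length < 0 || headerBytes + (length + 1L) / 2 > buf.limit()) {
            throw new IllegalArgumentException("buffer too small for header and sequence");
        }
        this.buf = buf;
        this.headerBytes = headerBytes;
        this.length = length;
        this.codes = codes;
    }

    /** Maps the whole file read-only; the mapping stays valid after the channel is closed. */
    static NibbleSequence map(Path file, int headerBytes, int length, String codes) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            return new NibbleSequence(mapped, headerBytes, length, codes);
        }
    }

    /** Returns the base at a 0-based index without touching the rest of the file. */
    char baseAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int b = buf.get(headerBytes + index / 2); // absolute read, no shared position state
        int code = (index % 2 == 0) ? (b >> 4) & 0xF : b & 0xF;
        return code < codes.length() ? codes.charAt(code) : '?';
    }

    /** Returns bases from start (inclusive) to end (exclusive). */
    String subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException(start + ".." + end);
        }
        StringBuilder sb = new StringBuilder(end - start);
        for (int i = start; i < end; i++) {
            sb.append(baseAt(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // In-memory stand-in for a mapped file: 2 header bytes, then 4 packed bases.
        ByteBuffer demo = ByteBuffer.wrap(new byte[] {0, 0, 0x21, 0x34});
        System.out.println(new NibbleSequence(demo, 2, 4, "TCAGN").subSequence(0, 4));
    }
}
```

Because every base sits at a fixed offset, fetching bases X to Y touches only the pages that contain them, which is the property a ZIP-compressed file lacks. A SymbolList backed by such a view would still need a mapping from codes to Symbol objects, which is left out here.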
_______________________________________________
Biojava-l mailing list - Biojava-l@biojava.org
http://biojava.org/mailman/listinfo/biojava-l

=======================================================================
Attention: The information contained in this message and/or attachments
from AgResearch Limited is intended only for the persons or entities
to which it is addressed and may contain confidential and/or privileged
material. Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by persons or
entities other than the intended recipients is prohibited by AgResearch
Limited. If you have received this message in error, please notify the
sender immediately.
=======================================================================

From mark.schreiber at group.novartis.com Mon Jan 24 19:52:09 2005
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Mon Jan 24 19:48:22 2005
Subject: [Biojava-l] Re: [Biojava-dev] BLAST-like
Message-ID:

You should be able to find the answer in the archives somewhere. You
need to call setModeLazy() or something similar on the
BlastLikeSAXParser.

"badr al-daihani"
Sent by: biojava-dev-bounces@portal.open-bio.org
01/24/2005 11:43 PM

To: biojava-dev@biojava.org
cc: (bcc: Mark Schreiber/GP/Novartis)
Subject: [Biojava-dev] BLAST-like

Hi guys

I'm trying to parse a BLAST file using the BlastParser class from
BioJava in Anger, but I got this error:

org.xml.sax.SAXException: Program ncbi-blastp Version 2.2.5 is not
supported by the biojava blast-like parsing framework

Any help will be appreciated.

Best regards
_______________________________________________
biojava-dev mailing list
biojava-dev@biojava.org
http://biojava.org/mailman/listinfo/biojava-dev

From hollandr at gis.a-star.edu.sg Wed Jan 26 02:54:34 2005
From: hollandr at gis.a-star.edu.sg (Richard HOLLAND)
Date: Wed Jan 26 02:51:45 2005
Subject: [Biojava-l] PatternHunter parsing
Message-ID: <6D9E9B9DF347EF4385F6271C64FB8D56015B2E28@BIONIC.biopolis.one-north.com>

Are there any BioJava or BioPerl modules for parsing PatternHunter
output? It's very similar to Blast output, so if there isn't one
already, would other people be interested in using one if I wrote one?

cheers,
Richard

Richard Holland
Bioinformatics Specialist
GIS extension 8199

From heuermh at acm.org Wed Jan 26 14:55:46 2005
From: heuermh at acm.org (Michael Heuer)
Date: Wed Jan 26 15:20:42 2005
Subject: [Biojava-l] grouped features using dazzle GFFAnnotationSource
In-Reply-To:
Message-ID:

Hello,

I would like to use the dazzle GFFAnnotationSource to display predicted
transcripts on the Ensembl web interface, but it's a little bit
difficult to follow tags through from the GFF file to what ends up on
the Ensembl web interface.

I imagine I would want a row in the GFF file for each exon and then
group each exon together into a transcript using something in the
tag/value field. Does anyone have an example?

michael

From td2 at sanger.ac.uk Thu Jan 27 05:00:05 2005
From: td2 at sanger.ac.uk (Thomas Down)
Date: Thu Jan 27 04:55:56 2005
Subject: [Biojava-l] grouped features using dazzle GFFAnnotationSource
In-Reply-To:
References:
Message-ID: <381BA466-704A-11D9-A8BD-000A95C8B056@sanger.ac.uk>

On 26 Jan 2005, at 19:55, Michael Heuer wrote:

> Hello,
>
> I would like to use the dazzle GFFAnnotationSource to display predicted
> transcripts on the Ensembl web interface, but it's a little bit
> difficult to follow tags through from the GFF file to what ends up on
> the Ensembl web interface.
>
> I imagine I would want a row in the GFF file for each exon and then
> group each exon together into a transcript using something in the
> tag/value field. Does anyone have an example?

For Ensembl display, it should be sufficient to include a property named
"id" on each GFF record -- all features with matching IDs will be
grouped by the Ensembl web code. Something like:

2L annotation exon 1 7528 0.0 . 0 id "Foo"; some_attribute "Bar"
2L annotation exon 9492 9835 0.0 . 0 id "Foo"

should behave the way you want.

Thomas

From dan.baggott.work at gmail.com Thu Jan 27 18:01:45 2005
From: dan.baggott.work at gmail.com (Dan Baggott)
Date: Thu Jan 27 17:57:40 2005
Subject: [Biojava-l] reading nib sequence files
In-Reply-To:
References:
Message-ID: <922876fd05012715013fc9fc8d@mail.gmail.com>

That question started off a flurry... Thanks for the input!

So, from my narrow and selfish perspective, the short of this thread is
that there isn't any "ready to go" nib i/o code and that the existing
BioJava parsing framework is not designed to deal with binary files so
it would be less than trivial to adapt it.

I don't have much experience with reading from large files (binary or
otherwise). Is there a general consensus on the path of least
resistance for implementing fast random access to large-ish nucleotide
sequences (ie on the order of human chromosome sized)?
I'm not so concerned about the size of the sequence files, just speed of
access. I mentioned the nib format in the first place because I was
impressed with the speed at which Jim Kent's nibFrag utility extracts
sequence -- pretty much immediately from the human perspective.

Dan

On Tue, 25 Jan 2005 08:29:37 +1300, Smithies, Russell wrote:
> You don't need to extract the whole file with ZipInputStream first.
> I managed to get the part I wanted by setting the offset to the start
> of the sequence (was using zipped chromosomes in fasta format) and the
> buffer to the length I wanted.
> It was a year or 2 ago and I probably don't have the code anymore but
> it is possible ;-)
>
> Russell Smithies
>
> Bioinformatics Software Developer
> AgResearch Invermay
> Private Bag 50034
> Puddle Alley
> Mosgiel
> New Zealand
>
> -----Original Message-----
> From: biojava-l-bounces@portal.open-bio.org
> [mailto:biojava-l-bounces@portal.open-bio.org] On Behalf Of Richard HOLLAND
> Sent: Monday, 24 January 2005 10:19 p.m.
> To: VERHOEF Frans; mark.schreiber@group.novartis.com
> Cc: biojava-list List; Thomas Down
> Subject: RE: [Biojava-l] reading nib sequence files
>
> The trouble with ZIP is that to do random-access reads of the sequence
> (eg. give me all bases from X to Y) you have to unzip the whole
> sequence each time. That makes it quite a bit slower. The solution
> needs to be a compression algorithm of some kind which allows instant
> random access without slowing down the create/update process too much
> either. Hence a custom fixed-width binary solution would be the first
> thing that comes to mind, but it may not be the only one.
>
> Richard Holland
> Bioinformatics Specialist
> GIS extension 8199

From mark.schreiber at group.novartis.com Thu Jan 27 22:24:38 2005
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Thu Jan 27 22:21:12 2005
Subject: [Biojava-l] reading nib sequence files
Message-ID:

I think if you want to use Java the nio packages are the way to go.

Just my $0.02

From mark.schreiber at group.novartis.com Mon Jan 31 01:57:38 2005
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Mon Jan 31 01:53:48 2005
Subject: [Biojava-l] Validate Annotation vs Ontology
Message-ID:

Hello -

I have an Ontology in a BioSQL DB and I would like to validate an
Annotation against the terms in that DB. Is there a way to create an
AnnotationType from an Ontology?

- Mark