From mrp@sanger.ac.uk Wed Nov 1 11:19:02 2000 Date: Wed, 01 Nov 2000 11:19:02 +0000 From: Matthew Pocock mrp@sanger.ac.uk Subject: [Biojava-l] Anotatable Symbol
Hi Mike.

There are several ways to do this without breaking anything we have at the moment. Firstly, you could add a method to ProteinTools:

  double getResidueMass(Symbol s) throws IllegalSymbolException

You could store the mass information in a format similar to resources/org/biojava/bio/seq/TranslationTables.xml (which is loaded by RNATools). The problem with this is that you would end up with many getResidueMassByBla methods. Alternatively, you could write a new interface like this:

  public interface SymbolProperty {
    FiniteAlphabet getAlphabet();
    double getValue(Symbol s) throws IllegalSymbolException;
  }

You could then have ProteinTools provide several well-known versions - mass, charge, size etc. - and load the data from a SymbolProperty.xml resource. It also leaves the door open to things like DNA physical properties.

Another way to do this is to add the data to AlphabetManager.xml directly. You would have to modify the DTD so that the description element could have <key type="java.lang.String">mass<value type="java.lang.Double">90.3</value></key> style children, and then extend the symbolForXML code to handle this. The description elements should probably move to being <key type="java.lang.String"><value type="java.lang.String">The description goes in here</value></key>

My money is on the interface option, as it lets you plug in new physical properties without needing access to AlphabetManager.xml, including parameterising algorithms at run-time - TranslationTables ended up being great for this. The down-side for heavily computational algorithms is that you will have to perform some type of search within the implementations to find the value associated with a symbol. The issue of how to implement this search optimally is nicely solved by the AlphabetIndex interface (just in), so it may not be that bad in practice.
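A minimal sketch of the interface option, for illustration only. It does not use the BioJava API: Residue, ResidueProperty and ArrayResidueProperty are hypothetical stand-ins for Symbol, SymbolProperty and one possible implementation, IllegalArgumentException stands in for IllegalSymbolException, and the linear indexOf search stands in for whatever lookup AlphabetIndex would provide. The serine masses are the ones from the XML example in this thread. It suggests that several properties could be instances of one class rather than separate classes:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a BioJava Symbol; compared by identity.
final class Residue {
    final String name;
    Residue(String name) { this.name = name; }
}

// Hypothetical, simplified form of the SymbolProperty idea from the thread.
interface ResidueProperty {
    double getValue(Residue r);
}

// One implementation serves every numeric property: values[i] belongs to alphabet.get(i).
final class ArrayResidueProperty implements ResidueProperty {
    private final List<Residue> alphabet;
    private final double[] values;

    ArrayResidueProperty(List<Residue> alphabet, double[] values) {
        if (alphabet.size() != values.length) {
            throw new IllegalArgumentException("one value per residue is required");
        }
        this.alphabet = alphabet;
        this.values = values.clone();
    }

    @Override
    public double getValue(Residue r) {
        int i = alphabet.indexOf(r); // linear search; an index structure could replace this
        if (i < 0) {
            throw new IllegalArgumentException("not in alphabet: " + r.name);
        }
        return values[i];
    }
}

final class ResidueMassDemo {
    static final Residue SER = new Residue("SER");
    static final List<Residue> ALPHABET = Arrays.asList(SER);
    // Two properties, one class: only the data differs.
    static final ResidueProperty MONO_MASS =
            new ArrayResidueProperty(ALPHABET, new double[] {87.03203});
    static final ResidueProperty AVG_MASS =
            new ArrayResidueProperty(ALPHABET, new double[] {87.0782});

    public static void main(String[] args) {
        System.out.println("mono " + MONO_MASS.getValue(SER));
        System.out.println("avg  " + AVG_MASS.getValue(SER));
    }
}
```

In a real implementation the values would presumably be loaded from a resource file rather than hard-coded, as suggested above.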
I have a feeling that the overhead of finding a particular key within an annotation bundle will be higher than the cost of looking up a double based upon the amino-acid, as hash-codes have to be calculated, and lots of functions and members are fetched to traverse the hash table.

What do other people think?

Mike Jones wrote:

> I am starting to work on a package for biojava that can be used for MS
> experimental data. Initially for proteins. So I need a way to annotate
> amino acids with their atomic mass. I would appreciate the help of those
> who have done such things. Can I just modify the AlphabetManager.xml.
> Say add a new Alphabet
>
> I would rather not rewrite each symbol but if I were this is how it
> would look.
> <alphabet name="RESIDUE_MASS" parent="PROTEIN">
>   <symbol name="s">
>     <short>S</short>
>     <long>SER</long>
>     <mono-mass>87.03203</mono-mass>
>     <avg-mass>87.0782</avg-mass>
>   </symbol>
>
> ...
>
> To do this though I imagine I would have to modify
> AlphabetManager.symbolFromXML.
>
> Please let me know if I am missing something or if any body has any
> ideas.
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l@biojava.org
> http://biojava.org/mailman/listinfo/biojava-l

From Raakesh.Syal@requisite.com Wed Nov 1 15:22:55 2000 Date: Wed, 1 Nov 2000 10:22:55 -0500 From: Raakesh Syal Raakesh.Syal@requisite.com Subject: [Biojava-l] biojava learning tools
Hi

I am a science major with a background in programming. I would like some more information regarding learning biojava, either in the form of online tutorials or books.

Thanks
Raakesh Syal

From mjones@mpi.com Wed Nov 1 19:54:57 2000 Date: Wed, 01 Nov 2000 14:54:57 -0500 From: Mike Jones mjones@mpi.com Subject: [Biojava-l] Anotatable Symbol
I think the interface idea sounds good, but doesn't that seem like a lot of extra classes if you make one for each property type? I would need at least 2 for residue masses (mon and iso topic masses). Maybe it could be more generic, like:

  public interface SymbolProperty {
    FiniteAlphabet getAlphabet();
    Object getValue(Symbol s, String type)
        throws IllegalSymbolException, UnknownTypeException;
  }

Also, why would I want to return a FiniteAlphabet for each SymbolProperty?

I would like to get a better look at the AlphabetIndex source. Since I am behind a pretty serious firewall here I can't use cvs to get the latest source. Do you have a zipped archive containing the code?

From vij_ivai@hotmail.com Mon Nov 6 05:03:32 2000 Date: Mon, 06 Nov 2000 00:03:32 EST From: Vijay Narayanasamy vij_ivai@hotmail.com Subject: [Biojava-l] BLAST Networking Questions
Dear all,

I'm playing with the following project through this weekend. I guess someone has done this, or knows how to do it, already. I would like to do the following:

1. Get the sequence data from the user with a GUI.
2. Send the sequence to the BLAST NCBI server.
3. Get the output from the server.
4. Present the output (maybe in a different form) to the user.

So the questions are: how do I connect to the BLAST server, and how do I submit the data to the appropriate database search?

Is it possible to do this with Java Servlets? If so, how? Any other suggestions or comments?

Sincerely,

Vijay
nvijay@psu.edu
http://www.personal.psu.edu/vxn115

From garnhart@cisunix.unh.edu Mon Nov 6 12:03:30 2000 Date: Mon, 06 Nov 2000 07:03:30 -0500 From: Nancy J. Garnhart garnhart@cisunix.unh.edu Subject: [Biojava-l] BLAST Networking Questions
NCBI provides a stable URL that may be used to perform BLAST searches from another program (i.e., without interactive use of a Web browser). A demonstration client (ftp://ncbi.nlm.nih.gov/blast/blasturl/) and a README demonstrate how to access this URL.

The above is copied right out of the BLAST overview page: http://www.ncbi.nlm.nih.gov/BLAST/blast_overview.html

Nancy

From Robin.Emig@maxygen.com Mon Nov 6 21:00:08 2000 Date: Mon, 6 Nov 2000 13:00:08 -0800 From: Emig, Robin Robin.Emig@maxygen.com Subject: [Biojava-l] Symbols are 1 Char?
I am trying to create a translation program that is based off of a codon bias table. I am having a little trouble actually creating the class though, because I thought I'd create it as follows:

a Class with the following members
  SimpleDistribution (where the alphabet is DNA codons)
  Translation Table (where one alphabet is codons and the other is AA's)

The problem is that the alphabets (built from symbols) are only 1 char elements, i.e. I can't represent ATG as a symbol. Am I missing something? Is there a way to have a symbol be multiple chars? Even the interface defines it as a char.

-Robin

Robin Emig
Bioinformatics Specialist
515 Galveston Dr
Redwood City, CA 94063
Maxygen Inc
650-298-5493

From td2@sanger.ac.uk Tue Nov 7 12:06:13 2000 Date: Tue, 7 Nov 2000 12:06:13 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] Symbols are 1 Char?
On Mon, Nov 06, 2000 at 01:00:08PM -0800, Emig, Robin wrote:
>
> I am trying to create a translation program that is based off of a
> codon bias table. I am having a little trouble actually creating the class
> though because I thought I'd create it as follows
>
> a Class with the following members
> SimpleDistribution (where the alphabet is DNA codons)
> Translation Table (where one alphabet is codons and the other is AA's)
> The problem is that the alphabets (built from symbols) are only 1 char
> elements, ie I can't represent ATG as a symbol. Am I missing something, is
> there a way to have a symbol be multiple chars? Even the interface defines
> it as a char.

Hi...

BioJava Symbol objects certainly aren't tied to representing a single `char'. There is a convenience method, getToken(), which returns a char, but there isn't a requirement that this be anything meaningful (checks documentation -- yes, looks like the documentation of getToken() could do with some clarifications...)

The easy way to represent codons is to use a cross-product alphabet. This is an ordered list of `child' alphabets, and contains symbols which are ordered lists of symbols from these child alphabets. So you can do something like:

  // Generate the alphabet DNA x DNA x DNA
  CrossProductAlphabet codonAlphabet = AlphabetManager.
      getCrossProductAlphabet(Collections.nCopies(3, DNATools.getDNA()));

  // Obtain a specific symbol from the codon alphabet
  List baseList = new ArrayList();
  baseList.add(DNATools.a());
  baseList.add(DNATools.t());
  baseList.add(DNATools.g());
  Symbol startCodon = codonAlphabet.getSymbol(baseList);

You can do all the normal tricks with a cross-product alphabet, including constructing a distribution, and using it to store your codon bias table.

If you call the `getToken' method on symbols in the codon alphabet, you'll get a unique (but not meaningful) char. On the other hand, getName() will return a sensible string representation of the ordered list.

Hope this helps,

   Thomas.
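To make the cross-product idea concrete without relying on any particular BioJava release, here is a small sketch in plain Java. Everything in it (CrossProductSketch, crossProduct, bases as Strings) is hypothetical and is not the BioJava API; it only shows that the cross product of three copies of a four-letter alphabet is the set of ordered triples, i.e. the codons:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class CrossProductSketch {
    // Returns every ordered tuple that takes one symbol from each child alphabet, in order.
    static List<List<String>> crossProduct(List<List<String>> alphabets) {
        List<List<String>> tuples = new ArrayList<>();
        tuples.add(new ArrayList<String>()); // start from the single empty tuple
        for (List<String> alphabet : alphabets) {
            List<List<String>> next = new ArrayList<>();
            for (List<String> prefix : tuples) {
                for (String symbol : alphabet) {
                    List<String> extended = new ArrayList<>(prefix);
                    extended.add(symbol);
                    next.add(extended);
                }
            }
            tuples = next;
        }
        return tuples;
    }

    public static void main(String[] args) {
        List<String> dna = Arrays.asList("a", "c", "g", "t");
        List<List<String>> codons = crossProduct(Collections.nCopies(3, dna));
        System.out.println(codons.size());                                  // 64
        System.out.println(codons.contains(Arrays.asList("a", "t", "g"))); // true
    }
}
```

Each tuple plays the role of one codon symbol; a codon bias table would then be a mapping from these tuples to weights.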
--
One of the advantages of being disorderly is that one is constantly making exciting discoveries.
  -- A. A. Milne

From mrp@sanger.ac.uk Tue Nov 7 13:25:26 2000 Date: Tue, 07 Nov 2000 13:25:26 +0000 From: Matthew Pocock mrp@sanger.ac.uk Subject: [Biojava-l] BLAST Networking Questions
Vijay Narayanasamy wrote:

> Dear all,
>
> I'm playing with the following project thru this weekend. I would
> like to do the following. I guess some one would have done this or know this
> already.
>
> I would like to do the following:
>
> 1. Get the sequence data from the user with a GUI.

There is a demo called seqviewer.EmblViewer that is a crude example of how to build a simple sequence GUI. You should be able to pull out the bits of this that you need - sequence loading, feature rendering, scaling etc.

> 2. Send the sequence to the BLAST NCBI server

Nancy covered this...

> 3. Get the output from the server

and this.

> 4. Present the output (may be in a different form) to the user.

You can use org.biojava.bio.program.sax.BlastLikeSAXParser to parse the resulting text into useful information. You could then build new features on the query sequence (and update the viewer with them), or spit out the hits in some text format, or whatever.

Good luck

Matthew

> So the questions are , how to connect with the BLAST server and how to input
> the data in the appropriate database search?
>
> Is it possible to do with Java Servlets? If so how? Any other suggestions or
> comments?
>
> Sincerely,
>
> Vijay
>
> nvijay@psu.edu
>
> http://www.personal.psu.edu/vxn115

From td2@sanger.ac.uk Tue Nov 7 16:33:35 2000 Date: Tue, 7 Nov 2000 16:33:35 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] [1.1] Sequence I/O rethink
Hi...

I'd guess that the biological sequence I/O code is one of the most widely useful parts of BioJava. The current system has served us quite well so far, but there are some issues that have cropped up, and I think the time might be ripe for a proper discussion of what we want from the package in the future.

Issues which would be worth addressing (in no particular order):

- It's not entirely clear how to handle alignments within the current I/O framework.

- SequenceFormat classes tend to be tightly coupled to one particular mechanism for constructing SymbolLists. The mechanism used by all the current SequenceFormats is rather inefficient (both in time and space) when handling very long pieces of sequence.

- There is not always an easy way to control the rules used to convert data from a sequence file into BioJava Annotation bundles and Feature objects. Some attempts /have/ been made in this direction (look at FastaDescriptionReader and FeatureBuilder). Unfortunately, this kind of functionality currently has to be implemented on a per-format basis, and has its limitations. For instance, there is no simple way to aggregate several feature-table entries in an EMBL file into a single BioJava feature.

- The I/O framework only works on files which contain sequence data. It would be nice if at least some parts of it could be applied to the handling of, for example, GFF files (which currently have an entirely separate framework).

What I'm potentially thinking of is an event-driven framework for parsing all kinds of sequence files (by which I include sequence-and-feature formats like EMBL, sequence-only like FASTA, feature-only like GFF, and alignments). We already have a simple event-driven system in BioJava (org.biojava.bio.program.gff) and it works pretty well. There would then be a major refactoring of SequenceFactory so that it can act as a listener for the event stream.

NOTE: I'm talking here primarily about changes to the guts of the I/O framework. I hope there won't be any significant increase in the number of lines of code needed in the simple case of reading a sequence from a common file format (EMBL, Genbank, FASTA).

I know there are a number of people on the list who are interested in file parsing, so it would be good to hear everyone's thoughts and requirements before we finalize any API.

Just to start the ball rolling, I've got an extension to the current I/O framework which decouples SymbolList creation from file parsing. I've been using this myself for a few weeks now, and it considerably improves performance (3-4 times) and peak memory usage (potentially a factor of almost two) when reading large sequences. This certainly doesn't address all the issues with the I/O framework, but it shows one area where some real improvements can be made. If you want to try this out, there is source code and class files in:

  http://www.biojava.org/proposals/newio.jar

There's also javadoc at:

  http://www.biojava.org/proposals/newio-doc/index.html

Any comments?

   Thomas.
--
One of the advantages of being disorderly is that one is constantly making exciting discoveries.
  -- A. A. Milne

From rik@cs.ucsd.edu Tue Nov 7 22:40:38 2000 Date: Tue, 07 Nov 2000 14:40:38 -0800 From: Richard K. Belew rik@cs.ucsd.edu Subject: [Biojava-l] biojava.dp doc/tutorials?
mr. matthew pocock and biojava,

i'm contemplating using the dp example as the focus for a project in a CS data structures class i will be teaching with a bioinformatics spin. might anyone else have done something similar, or developed other materials motivating the code? i find the author 'Samiul Hasan' in one source file, but don't know how to contact him either. thanks for any help,

rik

--
Richard K. Belew
rik@cs.ucsd.edu
http://www.cs.ucsd.edu/~rik
Computer Science & Engr. Dept.
Univ. California -- San Diego      858 / 534-2601
9500 Gilman Dr. (0114)             858 / 532-0702 (msgs)
La Jolla CA 92093-0114 USA         858 / 534-7029 (fax)

From mrp@sanger.ac.uk Wed Nov 8 15:24:24 2000 Date: Wed, 08 Nov 2000 15:24:24 +0000 From: Matthew Pocock mrp@sanger.ac.uk Subject: [Biojava-l] biojava.dp doc/tutorials?
Hi Rik,

I hope that UCSD is having better weather than we are. England seems to be totally below water at the moment.

"Richard K. Belew" wrote:

> mr. matthew pocock and biojava,
>
> i'm contemplating using the dp example
> as the focus for a project in a CS data structures
> class will be teaching with a bioinformatics

*blush*

> spin. might anyone else have done
> something similar, or developed other materials
> motivating the code? i find the author 'Samiul Hasan'
> in one source file, but don't know how to contact
> him either? thanks for any help,

The DP objects have changed a bit since the 1.01 release - and for the better. We should put up a new snapshot of the project on the web-site as soon as I fix a show-stopper bug in the alphabet indexing code.

The DP stuff was designed from the ground up to be primarily a data structure. For pair-wise DP, there is now an interpreter object that performs alignments by 'interpreting' the HMM, and I have in development (but not in CVS) a compiler that 'compiles' the HMMs to java byte-code, which should be faster and produce bytecode that is *very* optimizable by hotspot.

I have used the HMMs to model various biological sequences, and found them to be very flexible. They may be prohibitively slow on some older VMs for high-throughput work, but for testing architectures and training models, this speed penalty is more than outweighed by the ease with which you can build your particular model.

There is sadly almost no tutorial documentation. I think that Samiul is interested in writing some. He has been using the package to model histone binding sites, and I think he is the nearest person we have to a user (as opposed to a developer) at this time.

Please feel free to bother me at any time about how the code works, why it looks like that, or tell me of any difficulties/bugs. I hope that it ends up being useful as a teaching aid for your class.

All the best,

Matthew

From mrp@sanger.ac.uk Wed Nov 8 16:58:15 2000 Date: Wed, 08 Nov 2000 16:58:15 +0000 From: Matthew Pocock mrp@sanger.ac.uk Subject: [Biojava-l] New BioCorba IDL
Hello all.

BioCorba is the bio* project that defines idl that should allow the projects to interoperate programmatically. I think it is a very good thing to have, particularly as it potentially allows different parts of informatics problems to be tackled in different languages without re-writing all the code.

Those of you subscribed to the bioxml mailing list will know that Alan Robinson has made a proposal for a new BioCorba idl (http://biocorba.org/pipermail/biocorba-l/2000-November/000044.html). The new IDL should be better behaved in situations where server memory is an issue. Regardless of how perfect it is, it is definitely an improvement over the current data model, and handles things like feature hierarchies more cleanly.

The BioCorba server and client have a separate life-cycle to the BioJava core, so our code should be in a separate CVS module (but in the current biojava repository) - how about a module called biocorba?

Do any of you use the current BioCorba client/server? Is anybody interested in being the BioJava spokesperson for BioCorba-related things and/or our BioCorba developer?

Anyway, thanks to Alan for putting together this revision, and getting a reference server & client together.

Matthew

From rik@cs.ucsd.edu Wed Nov 8 18:17:21 2000 Date: Wed, 08 Nov 2000 10:17:21 -0800 From: Richard K. Belew rik@cs.ucsd.edu Subject: [Biojava-l] biojava.dp doc/tutorials?
hi matthew and samiul,

Matthew Pocock wrote:
>
> I hope that UCSD is having better weather than we are. England seems to
> be totaly below water at the moment.

and i thought you Brit's enjoyed being all wet:) we don't have weather in SoCal, just the sort of stasis that breeds politicians, like Ronald Reagan.

> The DP objects have changed a bit since the 1.01 release - and for the
> better. We should put up a new snapshot of the project on the web-site
> as soon as I fix a show-stopper bug in the alphabet indexing code.

ah! i am having problems running the SearchProfile demo:

> Loading sequences
> java.util.NoSuchElementException: There is no parser 'symbol' defined in alphabet PROTEIN+X
>   at org.biojava.bio.symbol.AbstractAlphabet.getParser(AbstractAlphabet.java:58)
>   at SearchProfile.readSequenceDB(SearchProfile.java:110)
>   at SearchProfile.main(SearchProfile.java:21)

maybe this is related?

the new hacks you are developing sound very neat, so do let me know when you are ready to let others play. note that this is a relatively intro course in our curriculum, so my goal is to use things like DP to understand how good, efficient designs can make java work even on larger HMMs. the ideal progression will be to let them work on some small dataset (probably over DNA strings), then consider scaling issues as the length of strings increases and we move to proteins.

> There is sadly almost no tutorial documentation. I think that Samiul is
> interested in writing some. He has been using the package to model
> histone binding sites, and I think he is the nearest persone we have to
> a user (as oposed to developer) at this time.

thanks Samiul for also replying! can you point me to any prelim writeups re: your use of these routines in your own work? that might help me slant what i develop (eg, towards data sets relevant to you)?

do you think it is worth bugging Durbin et al with this same question? they'd be the sort of academics that i'd imagine also using this in a class somewhere? or are they all already listening in on biojava?

> Please feel free to bother me at any time about how the code works, why
> it looks like that, or tell me of any difficulties/bugs.

thanks again, i'll probably take you up on that.

best,
rik

--
Richard K. Belew
rik@cs.ucsd.edu
http://www.cs.ucsd.edu/~rik
Computer Science & Engr. Dept.
Univ. California -- San Diego      858 / 534-2601
9500 Gilman Dr. (0114)             858 / 532-0702 (msgs)
La Jolla CA 92093-0114 USA         858 / 534-7029 (fax)

From td2@sanger.ac.uk Thu Nov 9 13:47:56 2000 Date: Thu, 9 Nov 2000 13:47:56 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] [newio] Proposed event-notification interfaces
--/9DWx/yDrRhgMJTb
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Hi...

I've been making a little more progress with my plans for refactoring the sequence I/O framework for BioJava 1.1. I've attached two interfaces:

  SeqIOListener     Generic listener for events produced by parsing biological sequence data
  SequenceBuilder   SeqIOListener which builds a new BioJava sequence object.

Rebuilding the I/O framework around these interfaces would meet the following objectives:

- Decoupling all parts of the Sequence construction process from the file parsing.

- An easy way to plug in filter and transducer objects between the parser and the Sequence construction step.

- Potential to handle `feature-only' formats like GFF and GAME.

Issues which are still open:

- Exactly how should multiple sequence alignments be handled within the framework? One suggestion made internally at Sanger would be to use a separate SequenceBuilder for each component of the alignments. I'd welcome comments from anyone who uses BioJava Alignments on this topic. Are there any commonly used formats for `annotated' alignments, with data which should be built into BioJava feature objects?

- Are there any extra methods on SeqIOListener which I've missed? For instance, it's tempting to have a specific method for notifying the listener about a sequence's database ID, if this is present in the file. Any thoughts?

Let me know what you think of these,

   Thomas
--
``If I was going to carry a large axe on my back to a diplomatic function I think I'd want it glittery too.''
  -- Terry Pratchett

--/9DWx/yDrRhgMJTb
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="SeqIOListener.java"

package newio;

import java.io.IOException;

import org.biojava.bio.seq.Feature;
import org.biojava.bio.symbol.IllegalSymbolException;

/**
 * Notification interface for objects which listen to a sequence stream
 * parser.
 *
 * @author Thomas Down
 * @since 1.1 [newio proposal]
 */

public interface SeqIOListener {
    /**
     * Start the processing of a sequence. This method exists primarily
     * to enforce the life-cycles of SeqIOListener objects.
     */

    public void startSequence();

    /**
     * Notify the listener that processing of the sequence is complete.
     */

    public void endSequence();

    /**
     * Notify the listener of symbol data.
     *
     * <p>
     * NOTE: The SymbolReader is only guaranteed to be valid within
     * this call. If the listener does not fully read all the data,
     * the parser <em>may</em> assume that it is not required, and
     * skip it.
     * </p>
     */

    public void addSymbols(SymbolReader sr)
        throws IOException, IllegalSymbolException;

    /**
     * Notify the listener of a sequence-wide property. This might
     * be stored as an entry in the sequence's annotation bundle.
     */

    public void addSequenceProperty(String key, Object value);

    /**
     * Notify the listener that a new feature object is starting.
     * Every call to startFeature should have a corresponding call
     * to endFeature. If the listener is concerned with a hierarchy
     * of features, it should maintain a stack of `open' features.
     */

    public void startFeature(Feature.Template templ);

    /**
     * Mark the end of data associated with one specific feature.
     */

    public void endFeature();

    /**
     * Notify the listener of a feature property.
     */

    public void addFeatureProperty(String key, Object value);
}

--/9DWx/yDrRhgMJTb
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="SequenceBuilder.java"

package newio;

import org.biojava.bio.BioException;
import org.biojava.bio.seq.*;

/**
 * Interface for objects which accumulate state via SeqIOListener,
 * then construct a Sequence object.
 *
 * <p>
 * It is possible to build `transducer' objects which implement this
 * interface and pass on filtered notifications to a second, underlying
 * SequenceBuilder. In this case, they should provide a
 * <code>makeSequence</code> method which delegates to the underlying
 * SequenceBuilder.
 * </p>
 *
 * @author Thomas Down
 * @since 1.1 [newio proposal]
 */

public interface SequenceBuilder extends SeqIOListener {
    /**
     * Return the Sequence object which has been constructed
     * by this builder. This method is only expected to succeed
     * after the endSequence() notifier has been called.
     */

    public Sequence makeSequence()
        throws BioException;
}

--/9DWx/yDrRhgMJTb--

From loraine@loraine.net Thu Nov 9 22:58:39 2000 Date: Thu, 9 Nov 2000 14:58:39 -0800 (PST) From: Ann Loraine loraine@loraine.net Subject: [Biojava-l] [newio] Proposed event-notification interfaces
On Thu, 9 Nov 2000, Thomas Down wrote:

> Hi...
>
> I've been making a little more progress with my plans for
> refactoring the sequence I/O framework for BioJava 1.1. I've
> attached two interfaces:
>
> SeqIOListener Generic listener for events produced by
> parsing biological sequence data
>
> SequenceBuilder SeqIOListener which builds a new BioJava
> sequence object.
>
> Rebuilding the I/O framework around these interfaces would
> meet the following objectives:
>
> - Decoupling all parts of the Sequence construction process
> from the file parsing.

Yes! I like this concept!

> - An easy way to plug in filter and transducer objects between
> the parser and the Sequence construction step.

Yes again!

> - Potential to handle `feature-only' formats like GFF and GAME.

You could build a double-parser that extracts coordinates from a GFF/GAME file and then grabs the corresponding sequence out of a fasta db.

> Issues which are still open:
>
> - Exactly how should multiple sequence alignments be handled
> within the framework? One suggestion made internally at
> sanger would be to use a separate SequenceBuilder for each
> component of the alignments. I'd welcome comments from anyone
> who uses BioJava Alignments on this topic. Are there any
> commonly used formats for `annotated' alignments, with
> data which should be built into BioJava feature objects?

Please allow in-between residues annotations as well as on-top-of residues annotations.

For instance, in-between annotations are useful for mapping splice sites onto alignments. On-top-of annotations are useful for flagging individual residues.

> - Are there any extra methods on SeqIOListener which I've
> missed? For instance, it's tempting to have a specific
> method for notifying the listener about a sequence's
> database ID, if this is present in the file. Any thoughts?

I would focus on designing the event class so that it can adequately capture the information being parsed, and then write your listeners based on the events.

It also seems like you would want to have a general enough type of event that could handle structured information (name-value pairs, named lists, etc.) in which you don't know anything about the semantics of what's coming.

In cases where you do, you could have your parser broadcast more specialized events - subclasses of your very general base class event.

The hard part in my mind is: where is the best place to put semantics? For instance, what objects need to know about database id, locus name, etc., and what objects just need to know about name-value/name-list pairs?

I hope this is useful!

-Ann

From jtang@gene.com Fri Nov 10 01:46:00 2000 Date: Thu, 09 Nov 2000 17:46:00 -0800 From: Jerry (Zhijun) Tang jtang@gene.com Subject: [Biojava-l] problem with biojava-1.00.jar
Hi, I download the jar file. But I have problem to include it as a library in JBuilder4. When I used "jar xvf biojava-1.00.jar" to see the classes in it, I got the following message: java.util.zip.ZipException: invalid entry size (expected 156 but got 158 bytes) at java.util.zip.ZipInputStream.readEnd(ZipInputStream.java:355) ............... Please help, JerryFrom td2@sanger.ac.uk Fri Nov 10 11:21:12 2000 Date: Fri, 10 Nov 2000 11:21:12 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] [newio] Proposed event-notification interfaces
On Thu, Nov 09, 2000 at 02:58:39PM -0800, Ann Loraine wrote: > > > - Potential to handle `feature-only' formats like GFF and GAME. > > You could build a double-parser that extracts coordinates from > a GFF/GAME file and then grabs the corresponding sequence out of > a fasta db. Yes indeed. This idea makes me lean even further towards the idea that there should be some special mechanism on the SeqIOListener interface for notifying a database ID (so that you can easily write a SequenceBuilder which listens for this, then goes and fetches the sequence data). Then a structure like this should work nicely: GFFParser ---> FetchSymbolsSequenceBuilder ---> DefaultSequenceBuilder ^ | FastaParser <------+ > Please allow in-between residues annotations as well as > on-top-of residues annotations. > > For instance, in-between annotations are useful for mapping splice > sites onto alignments. On-top-of anotations are useful for flagging > individual residues. This isn't really an issue for the I/O framework -- I'd assumed that the parsers would just generate standard BioJava Location objects. It's the current Location interface which forbids `between positions' locations -- in particular, the use of inclusive coordinates. I guess it should be possible to change the Location interface, although doing this without breaking too many of the current semantics might not be easy. Up to now, I've always seen the splicing problem in terms of exon and intron features, which can be modeled fine using our current interface, but I can see that if you want to deal with individual splice sites, matters become harder. > I would focus on designing the event class so that it can adequately > capture the information being parsed, and then write your listeners > based on the events. My current prototype for the SeqIOListener owes more to the SAX DocumentHandler interface and friends than to AWT event listeners. 
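For readers unfamiliar with the SAX style, here is a minimal sketch of what a listener with one notification method per record type might look like, and how a filter could be plugged in between a parser and a builder. It is not the proposed SeqIOListener itself: the interface name SketchSeqListener, the classes PropertyCollector and UpperCaseKeyFilter, and all method signatures are assumptions made for illustration; only the notification names startSequence, addSequenceProperty and endSequence come from this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical SAX-style listener: one notification method per record type.
// The method names echo those mentioned in this thread; the signatures are assumptions.
interface SketchSeqListener {
    void startSequence();
    void addSequenceProperty(String key, Object value);
    void endSequence();
}

// A builder-style listener that simply collects the properties it is notified about.
class PropertyCollector implements SketchSeqListener {
    private final Map<String, Object> props = new LinkedHashMap<>();
    private boolean finished = false;

    public void startSequence() { props.clear(); finished = false; }
    public void addSequenceProperty(String key, Object value) { props.put(key, value); }
    public void endSequence() { finished = true; }

    // Only expected to succeed once endSequence() has been called.
    Map<String, Object> makeResult() {
        if (!finished) {
            throw new IllegalStateException("endSequence() has not been called yet");
        }
        return props;
    }
}

// A filter sits between the parser and the builder and rewrites events on the way through.
class UpperCaseKeyFilter implements SketchSeqListener {
    private final SketchSeqListener delegate;

    UpperCaseKeyFilter(SketchSeqListener delegate) { this.delegate = delegate; }

    public void startSequence() { delegate.startSequence(); }
    public void addSequenceProperty(String key, Object value) {
        delegate.addSequenceProperty(key.toUpperCase(), value);
    }
    public void endSequence() { delegate.endSequence(); }
}
```

A parser in this arrangement would only ever hold a SketchSeqListener reference, so it does not need to know whether it is talking to a filter or to the final builder.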
I'm not sure we actually need any/many specialized event objects -- instead, I've been trying to think about each type of record that a parser might find, and add suitable notification methods for each. The closest I've got to using an event object is for the startFeature notification, where the existing Feature.Template (and subclasses) objects are used to wrap up the Location and other basic information for the feature. As things stand at the moment, the database ID of the sequence would be passed to listeners using the addSequenceProperty notification. But probably it's important enough that we should have either a special notification method, or pass it as a parameter to the existing startSequence notification. Thanks, Thomas. -- ``If I was going to carry a large axe on my back to a diplomatic function I think I'd want it glittery too.'' -- Terry PratchettFrom td2@sanger.ac.uk Fri Nov 10 11:35:21 2000 Date: Fri, 10 Nov 2000 11:35:21 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] problem with biojava-1.00.jar
On Thu, Nov 09, 2000 at 05:46:00PM -0800, Jerry (Zhijun) Tang wrote: > I download the jar file. But I have problem to include it as a library > in JBuilder4. When I used "jar xvf biojava-1.00.jar" to see the classes > in it, I got the following message: > java.util.zip.ZipException: invalid entry size (expected 156 but got 158 > bytes) > at java.util.zip.ZipInputStream.readEnd(ZipInputStream.java:355) > Hi... There's actually a 1.01 release, which fixes a few minor bugs in 1.00. However, I don't think that's got anything to do with the problem you mention. The jar files are created using Sun's standard JAR tool, and shouldn't be causing any problems. My best guess is that your jar file got corrupted at some point during or after download. If you downloaded by HTTP (using the link from the front page of the web site) it's possible that your browser corrupted the file during transit. Right now, our web server appears to be returning the Content-Type for jar files as text/plain (ooops) which means that many browsers will do some newline processing on the data. This will be bad news for any binary file. I'll try to get this fixed, but in the meantime: - If using Netscape, try holding down the SHIFT key when you download a file (this might work in other browsers, too, but I'm not sure). - Download from our FTP site instead: ftp://ftp.biojava.org/pub/biojava - If this still fails, try avoiding your browser completely and using a command-line FTP client Hope this helps, Thomas. -- ``If I was going to carry a large axe on my back to a diplomatic function I think I'd want it glittery too.'' -- Terry PratchettFrom td2@sanger.ac.uk Fri Nov 10 12:00:17 2000 Date: Fri, 10 Nov 2000 12:00:17 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] problem with biojava-1.00.jar
On Fri, Nov 10, 2000 at 11:35:21AM +0000, Thomas Down wrote: > > Right now, > our web server appears to be returning the Content-Type for > jar files as text/plain (ooops) which means that many browsers > will do some newline processing on the data. This will be bad > news for any binary file. I'll try to get this fixed, but in > the meantime: Okay, we've fixed the server, and it now gives a more sensible MIME type, so you should be able to download again without any trouble. Thomas. -- ``If I was going to carry a large axe on my back to a diplomatic function I think I'd want it glittery too.'' -- Terry PratchettFrom mrp@sanger.ac.uk Fri Nov 10 12:22:19 2000 Date: Fri, 10 Nov 2000 12:22:19 +0000 From: Matthew Pocock mrp@sanger.ac.uk Subject: [Biojava-l] problem with biojava-1.00.jar
*embarrassed* The MIME type for jar files was absent, so jars were being served as text/plain. They are now sent as something sensible and binary, application, jar-ish. Could you try to download 1.01 again, and see if you still get a corrupted jar? Thanks & sorry. MatthewFrom mrp@sanger.ac.uk Fri Nov 10 12:32:11 2000 Date: Fri, 10 Nov 2000 12:32:11 +0000 From: Matthew Pocock mrp@sanger.ac.uk Subject: [Biojava-l] [newio] Proposed event-notification interfaces
Hi Ann. Ann Loraine wrote: > Please allow in-between residues annotations as well as > on-top-of residues annotations. > > For instance, in-between annotations are useful for mapping splice > sites onto alignments. On-top-of anotations are useful for flagging > individual residues. The current location framework is effectively built around the concept of sets of symbol indices. Thus, there is no 'between'. This has caused problems for edit operations - GappedSymbolList for example is a bit tortuous in its definition of where to insert new gap characters. If you want to think of the current locations in terms of between-ness, then the min represents between it and the previous symbol, and max represents between it and the following symbol. Since min < max, there is no way to represent 'between'. The options are a) A completely new position object. Pros - it can look however you want it to. Cons - it will not play well with locations b) A location implementation that is empty, and still represents a gap immediately before min & immediately after max, and where max = min-1? Pros - this would fit the current math cleanly, and would let you add features to splice-sites. Cons - it kind-of breaks the Location concept. c) A location implementation where max = min + 1, but is empty and represents the position between the two indices. Pros - nothing broken. Cons - We would have to adjust the Location docs to state that min & max are not contained in the location in this very special case - no biggie. My vote is c). What about you? Matthew > I hope this is useful! > > -Ann > > _______________________________________________ > Biojava-l mailing list - Biojava-l@biojava.org > http://biojava.org/mailman/listinfo/biojava-lFrom td2@sanger.ac.uk Fri Nov 10 17:21:12 2000 Date: Fri, 10 Nov 2000 17:21:12 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] [newio] Proposed event-notification interfaces
On Fri, Nov 10, 2000 at 12:32:11PM +0000, Matthew Pocock wrote: > > The current location frame-work is effectively built around the concept of > sets of symbol indecies. Thus, there is no 'between'. This has caused > problems for edit operations - GappedSymbolList for example is a bit tortuous > in its definition of where to insert new gap characters. If you want to think > of the current locations in terms of between-ness, then the min represents > between it and the previous symbol, and max represents between it an the > following symbol. Since min < max, there is no way to represent 'between'. > > The options are > > a) A completely new position object. Pros - it can look however you want it > to. Cons - it will not play well with locations This could be a bit awkward, especially when it comes to attaching Position objects to Features (I guess we'd want a common base interface for Position and Location, and I haven't a clue what that would look like). Could also lead to quite a bit of special case code :(. Probably worth exploring options which use the existing Location interface first, anyway. > b) A location implementation that is empty, and still represents a gap > emediately before min & emediately after max, and where max = min-1? Pros - > this would fit the current math cleanly, and would let you add features to > splice-sites. Cons - it kind-of breaks the Location concept. > > c) A location implementation where max = min + 1, but is empty and represents > the position between the two indecies. Pros - nothing broken. Cons - We would > have to adjust the Location docs to state that min & max are not contained in > the location in this very special case - no biggie. I think I prefer plan b -- to me this seems to be the smallest possible change of current Location semantics. One question concerns the semantics of the union operation for `cut' locations. If we have two cut locations, should the union method give: - The empty location (i.e. 
the union operator is only considering positions contained within the two locations). - A `compound cut' location -- I guess calling blockIterator on this will return the two individual cut-points. - Something else entirely? What about the case the union of a `cut' location and a normal `coverage' location? Any thoughts? Thomas. -- ``If I was going to carry a large axe on my back to a diplomatic function I think I'd want it glittery too.'' -- Terry PratchettFrom mrp@sanger.ac.uk Fri Nov 10 19:11:50 2000 Date: Fri, 10 Nov 2000 19:11:50 +0000 From: Matthew Pocock mrp@sanger.ac.uk Subject: [Biojava-l] Re: DP bug query
(cc'ed to the list) Hi Richard, This is because it should be looking for a parser under the name "token", not "symbol". I have changed my copy & checked it into CVS. I guess I need to check the demos more frequently. I don't actually remember what this demo did. It is all a bit hazy back there. Best of luck, Matthew "Richard K. Belew" wrote: > hi matthew, > > i'm brand new to the biojava list so please excuse newbie tendancies. > > but the following query was buried in my (8 nov) thread around tutorials: > > > > Matthew Pocock wrote: > > > > > > The DP objects have changed a bit since the 1.01 release - and for the > > > better. We should put up a new snapshot of the project on the web-site > > > as soon as I fix a show-stopper bug in the alphabet indexing code. > > > > ah! i am having problems running the SearchProfile demo: > > > > > Loading sequences > > > java.util.NoSuchElementException: There is no parser 'symbol' defined in > > > alphabet PROTEIN+X > > > at org.biojava.bio.symbol.AbstractAlphabet.getParser(AbstractAlphabet.java:58) > > > at SearchProfile.readSequenceDB(SearchProfile.java:110) > > > at SearchProfile.main(SearchProfile.java:21) > > > > maybe this is related? > > if this is unrelated to your new fixes i'll continue to dig in on it. > > and since a few other topics have come and gone thru the list i > thought this might have slid by? > > thanks again, > rik > > -- > Richard K. Belew rik@cs.ucsd.edu > http://www.cs.ucsd.edu/~rik > Computer Science & Engr. Dept. > Univ. California -- San Diego 858 / 534-2601 > 9500 Gilman Dr. (0114) 858 / 532-0702 (msgs) > La Jolla CA 92093-0114 USA 858 / 534-7029 (fax)From anthonygoss@yahoo.com Sat Nov 11 00:56:30 2000 Date: Fri, 10 Nov 2000 16:56:30 -0800 From: Anthony Goss anthonygoss@yahoo.com Subject: [Biojava-l] I need some Java people
I am looking for some Java people, experience with J2EE. Plus would be Web methods, Web Logic, Netscape Application Server, and/or KIVA. I will give you a 15% raise from what you are making now. Depending on your location, there may be travel involved. But, you do not have to relocate. Please give me a call if you are interested. Thank you in advance, Anthony E. Goss Ph: 832-577-8890From tony_parsons@sandwich.pfizer.com Sun Nov 12 22:37:12 2000 Date: Sun, 12 Nov 2000 22:37:12 -0000 From: tony_parsons@sandwich.pfizer.com tony_parsons@sandwich.pfizer.com Subject: [Biojava-l] RE: Biojava-l digest, Vol 1 #173 - 2 msgs
Oh Please!, We are all looking for good people in this area. I don't exactly recall the constitution of this mailing list, but I thought it was for discourse about biojava rather than a free for all advertisement agency. If not can someone let me know if blatant job advertisements for free OK here? Best regards, Tony Parsons Dr.Tony Parsons, VOX : + 44 1304 646596 Information Management & Architecture, FAX : + 44 1304 656285 Pfizer Central Research, e-mail: Tony_Parsons@sandwich.pfizer.com Sandwich, CT13 9NJ UKFrom Robin.Emig@maxygen.com Tue Nov 14 02:47:32 2000 Date: Mon, 13 Nov 2000 18:47:32 -0800 From: Emig, Robin Robin.Emig@maxygen.com Subject: [Biojava-l] TokenParser and CrossProduct
I just tried to use the TokenParser on a crossproduct alphabet and it didn't work because the tokenParser class constructor sets up a map between a single character and symbol. Can a registered cvs person fix this? -Robin Robin Emig Bioinformatics Specialist 515 Galveston Dr Redwood City, CA 94063 Maxygen Inc 650-298-5493From td2@sanger.ac.uk Tue Nov 14 12:35:52 2000 Date: Tue, 14 Nov 2000 12:35:52 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] TokenParser and CrossProduct
On Mon, Nov 13, 2000 at 06:47:32PM -0800, Emig, Robin wrote: > I just tried to use the TokenParser on a crossproduct alphabet and > it didn't work because the tokenParser class constructor sets up a map > between a single character and symbol. Can a registered cvs person fix this? Just out of interest, are you actually explicitly constructing a TokenParser, or using the form: Alphabet alpha = ... Parser alphaTokens = alpha.getParser("token"); The "token" parser of a given alphabet is only defined if there exists a well-defined mapping between Symbols in that alphabet and printable characters in the unicode set. This is true of the simple DNA, RNA, and Protein alphabets, and I guess also for some other simple alphabets you might want to work with (dice rolls, coin tosses, whatever). Cross Product symbols are harder -- I guess we could define a standard single-char representation for some cases, like DNA x DNA, but it might be hard to get this accepted as a standard outside BioJava. And things get /really/ complicated once you get to alphabet like ((DNA x DNA x DNA) x Protein) (which is an entirely reasonable use of cross-products -- you might use that to represent an alignment of coding DNA against a protein sequence). On the other hand, CrossProductAlphabets do have a defined "name" parser. The symbols have names like (cytosine, adenine). This is a pretty verbose format for storing large amounts of alignment, but it is at least unambiguous. You are of course welcome to define your own token-mapping and parser implementation for your favourite cross-product alphabets, but unless you're working with a very common case, I'm not sure if this really belongs in the BioJava core. What definitely does need doing is some more documentation about the relationship between alphabets and parsers, and the cases where token-mappings do and don't exist. We may also want to change the SymbolParser interface a little bit as we switch to the new event-based I/O framework. 
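The difference between single-character tokens and tuple names described above can be illustrated with a small, self-contained sketch. This is not BioJava code: the class CrossProductNameSketch and its methods are hypothetical, the lower-case one-letter DNA tokens are an assumption, and only the "(cytosine, adenine)" naming style is taken from this message. Nested products such as ((DNA x DNA x DNA) x Protein) are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in BioJava.
class CrossProductNameSketch {
    // Simple alphabets have a well-defined token: one printable character per symbol.
    static final Map<Character, String> DNA_TOKENS = new HashMap<>();
    static {
        DNA_TOKENS.put('a', "adenine");
        DNA_TOKENS.put('c', "cytosine");
        DNA_TOKENS.put('g', "guanine");
        DNA_TOKENS.put('t', "thymine");
    }

    // Build the verbose but unambiguous name of a (flat) cross-product symbol,
    // e.g. the token string "ca" becomes "(cytosine, adenine)".
    static String tupleName(String tokens) {
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < tokens.length(); i++) {
            String name = DNA_TOKENS.get(tokens.charAt(i));
            if (name == null) {
                throw new IllegalArgumentException("No symbol for token '" + tokens.charAt(i) + "'");
            }
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(name);
        }
        return sb.append(")").toString();
    }

    // Split a flat tuple name back into its component symbol names.
    // Nested tuples are not supported by this sketch.
    static List<String> parseTupleName(String name) {
        String trimmed = name.trim();
        if (!trimmed.startsWith("(") || !trimmed.endsWith(")")) {
            throw new IllegalArgumentException("Not a tuple name: " + name);
        }
        String inner = trimmed.substring(1, trimmed.length() - 1);
        List<String> parts = new ArrayList<>();
        for (String part : inner.split(",")) {
            parts.add(part.trim());
        }
        return parts;
    }
}
```

The point of the sketch is that the name form scales to any number of components without inventing new characters, at the cost of verbosity, whereas a token form needs an agreed single-character code for every tuple.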
I'm still very open to ideas about how CrossProductSymbols and Alignments ought to be handled for I/O. So we may be able to get something like the behaviour you want in future. Happy hacking, Thomas. -- ``If I was going to carry a large axe on my back to a diplomatic function I think I'd want it glittery too.'' -- Terry PratchettFrom mrp@sanger.ac.uk Tue Nov 14 14:50:10 2000 Date: Tue, 14 Nov 2000 14:50:10 +0000 From: Matthew Pocock mrp@sanger.ac.uk Subject: [Biojava-l] RE: Biojava-l digest, Vol 1 #173 - 2 msgs
Anthony Goss, The Bio* mailing lists are for discussing Java in Bioinformatics, and in particular Biojava. Do not send recruitment adverts via this list. If you have any queries with regards to BioJava, or any of the other Bio* projects then please contact me personally. Matthew Pocock tony_parsons@sandwich.pfizer.com wrote: > Oh Please!, > > We are all looking for good people in this area. I don't exactly recall the > constitution of this mailing list, but I thought it was for discourse about > biojava rather than a free for all advertisement agency. > > If not can someone let me know if blatant job advertisements for free OK > here? > > Best regards, > > Tony Parsons > > Dr.Tony Parsons, VOX : + 44 1304 646596 > Information Management & Architecture, FAX : + 44 1304 656285 > Pfizer Central Research, e-mail: > Tony_Parsons@sandwich.pfizer.com > Sandwich, > CT13 9NJ UKFrom Robin.Emig@maxygen.com Tue Nov 14 18:37:03 2000 Date: Tue, 14 Nov 2000 10:37:03 -0800 From: Emig, Robin Robin.Emig@maxygen.com Subject: [Biojava-l] NameParser
Is the best way to deal with situations where multiple tokens (or names) are really the same Symbol to subclass NameParser and add checks in it that simply map the redundant names to a proper unique one, and then parse? The reason I ask is that I am reading in CodonBiasTables which often have END, TER or STP as the stop/terminal codon. I don't mind representing all of these as the same symbol, because they are in my case, but I wanted to know if there was a better way to do this, such as editing/creating an alphabet to do this. I was also thinking of possibly creating a translation alphabet, essentially something that could set up all the mappings for a java.util.Map. -RobinFrom mrp@sanger.ac.uk Tue Nov 14 19:03:36 2000 Date: Tue, 14 Nov 2000 19:03:36 +0000 From: Matthew Pocock mrp@sanger.ac.uk Subject: [Biojava-l] NameParser
Hi Robin, Just to make sure we are on the same page, you are suggesting that END, TER and STP all be legal names for a single termination symbol in the protein-with-termination alphabet (retrievable from ProteinTools.getTAlphabet()). The codon tables map from DNA^3 to protein-with-termination, and the codon-bias tables give you a distribution over DNA^3 for a given protein-with-termination symbol. I suggest a three-part solution. 1) Add a method to NameParser that lets you associate a name with a symbol. It will look something like: addSymbolForName(String name, Symbol sym) throws IllegalSymbolException; It will add a map from name to sym, assuming that sym is within the alphabet for the parser, and that name is not currently in use in that parser. You may wish to add the corresponding remove method for breaking associations. 2) Add a 'synonym' element to the AlphabetManager.xml resource, and to the termination symbol add the synonyms. 3) Modify AlphabetManager.java so that it adds the synonyms to the name parser. Does this sound do-able, or is it a bit complex? All the best, Matthew "Emig, Robin" wrote: > Is the best way to deal situations where multiple tokens(or name) > are really the same Symbol is to SubClass NameParser and add checks in it > that symply map the redundant names to a proper unique one, and then parse. > The reason I ask is that I am reading in CodonBiasTables which often > have END TER or STP as the stop/terminal codon. I don't mind representing > all of these as the same symbol, because they are in my case, but I wanted > to know if there was a better way to do this, such as editing/creating and > alphabet to do this. I was thinking of also creating possible a translation > alphabet, essentially something that could set up all the mappings for a > java.Map. 
> -Robin > _______________________________________________ > Biojava-l mailing list - Biojava-l@biojava.org > http://biojava.org/mailman/listinfo/biojava-lFrom alan.mcculloch@agresearch.co.nz Thu Nov 16 04:43:45 2000 Date: Thu, 16 Nov 2000 17:43:45 +1300 From: McCulloch, Alan alan.mcculloch@agresearch.co.nz Subject: [Biojava-l] database for biojava
Does anybody have any tips on the right approach to setting up a database on top of which would sit biojava ? The platform will be Oracle 8 and I am very keen to NOT do my own data model (in the same way I'm keen to not do my own api/object design which is why I want to use something like biojava !) - I want to use a standard model if possible, if there is such a thing. Can a relational data model of some sort be derived from biojava ? Maybe I could use something from the bioxml project ? I'd be grateful for any tips on where to start. thanks Alan McCulloch Bioinformatics Software Engineer AgResearch NZ PS One thing I'm interested in is the possibly of using Oracle CLOBs and LOBs to perhaps store structured data or documents in single database fields (and so avoid a totally normalised design for storing document contents , which can be a pain) - however this is secondary to trying to use a "standard" sequence data model if possible, if there is such a thing.From td2@sanger.ac.uk Fri Nov 17 12:16:41 2000 Date: Fri, 17 Nov 2000 12:16:41 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] [newio] Initial code landing
Hi... I've been working on a first implementation of the proposed new I/O interfaces. Things still aren't set in stone, but some practical tests should make further development much easier. So far, the interfaces seem to be working very well. Unless there are any objections, I'd like to get these changes into the BioJava 1.1 main codebase as soon as possible. I'd like to check them in late this afternoon (probably around 18:00 UTC). Please let me know now if this is likely to cause any problems. For people who are currently relying on CVS BioJava, it's probably worth grabbing an up-to-date copy now before these changes land. The good news is that for simple applications, there should only be one change to make your code compatible with the new I/O: StreamReader is now constructed with a SequenceBuilderFactory (new interface) rather than the old SequenceFactory. I hope everyone can start using the new interfaces soon. Happy hacking! Thomas. -- ``If I was going to carry a large axe on my back to a diplomatic function I think I'd want it glittery too.'' -- Terry PratchettFrom mrp@sanger.ac.uk Fri Nov 17 12:41:32 2000 Date: Fri, 17 Nov 2000 12:41:32 +0000 From: Matthew Pocock mrp@sanger.ac.uk Subject: [Biojava-l] byte-code generator
Hi. Those of you working from the CVS repository may have noticed that there is a new file called bytecode.jar in there. It is a fully functional library for generating Java byte-code and then loading and running this code (credits: Thomas did most of the coding, he and I designed it & I have done some debugging). It can be effectively used as a macro-assembler for the Java VM. In the core biojava project it is currently used for generating the projected feature proxies (magic supplied by Thomas), and in my experimental dynamic-programming compiler. We intend to release the bytecode generator under the LGPL as a separate package to biojava. It is currently in a CVS module called bytecode (a sister project to biojava-live). It has a separate development cycle to biojava, so it is appropriate to keep releases of each one separate. As far as we are aware, it is the only open-source program of its kind (we may be wrong - tell us). It also has the cleanest API of any of the bytecode generators that I evaluated (of course, I would say that!). It is light-weight enough that generating classes is not noticeably more expensive than loading byte-code direct from disk. If you are interested in this functionality, feel free to check out the bytecode module from anonymous CVS. It is currently under (un) documented. This is bleeding-edge stuff, so treat it carefully. If it doesn't do what you expect, then it may well be doing the wrong thing. As always, all questions, bugs, ideas, flames gratefully received. Matthew PocockFrom armhold@cs.rutgers.edu Fri Nov 17 15:29:16 2000 Date: Fri, 17 Nov 2000 10:29:16 -0500 From: George Armhold armhold@cs.rutgers.edu Subject: [Biojava-l] introduction, and some BLAST/Genscan code
Hello, I just subscribed to the biojava list and would like to say hello to everyone, as well as offer up some code. Browsing through the list archives I found a message from Vijay Narayanasamy who was looking for some code to talk to the BLAST server at NCBI. It so happens that I just completed such a class, and I'm happy to share it with anyone that may find it useful. Here's a (simplified) example: String mySequence = createSequence(); BlastConnection blast = new BlastConnection(); blast.setQuerySequence(mySequence); blast.setProgram(BlastConnection.BLASTN); blast.setDatabase(BlastConnection.DBEST); blast.setExpect(10f); String requestID = blast.submit(); // wait some amount of time for server to process String results = blast.getResults(requestID); if (results.equals(BlastConnection.IN_PROGRESS)) System.out.println("request ID " + requestID + " is still in progress. "); else System.out.println(results); I also have some code for talking to a Genscan server. The code has been in use at our site for a few weeks, but has not seen extensive testing yet. Source, binaries and documentation are available at http://bigbio.rutgers.edu/~armhold/bioinf. -- George Armhold Rutgers University Bioinformatics InitiativeFrom mrp@sanger.ac.uk Fri Nov 17 15:47:17 2000 Date: Fri, 17 Nov 2000 15:47:17 +0000 From: Matthew Pocock mrp@sanger.ac.uk Subject: [Biojava-l] introduction, and some BLAST/Genscan code
Thanks George, This looks like really cool code. It is the part we are missing from the process. CAT checked in blast parsing code (very cool) and Gerald Loeffler checked in a set of interfaces for requesting and representing blast hits. A standard object that actually performs the blast search would be great. If you are interested in contributing and maintaining this code and/or becoming a BioJava developer, I can sort you out with a CVS account. In the mean time, I will read through the rest of your docs. You may like to look at the documentation for the packages: org.biojava.bio.search org.biojava.bio.program org.biojava.bio.program.sax org.biojava.bio.program.xml We are between releases at the moment (hopefully we should be able to get a 1.1 out the door in the not-too-distant future...), so none of the APIs are above discussion. Thanks for the message. All the best, Matthew George Armhold wrote: > Hello, > > I just subscribed to the biojava list and would like to say hello to > everyone, as well as offer up some code. Browsing through the list > archives I found a message from Vijay Narayanasamy who was looking for > some code to talk to the BLAST server at NCBI. It so happens that I > just completed such a class, and I'm happy to share it with anyone > that may find it useful. Here's a (simplified) example: > > String mySequence = createSequence(); > BlastConnection blast = new BlastConnection(); > blast.setQuerySequence(mySequence); > blast.setProgram(BlastConnection.BLASTN); > blast.setDatabase(BlastConnection.DBEST); > blast.setExpect(10f); > String requestID = blast.submit(); > > // wait some amount of time for server to process > > String results = blast.getResults(requestID); > if (results.equals(BlastConnection.IN_PROGRESS)) > System.out.println("request ID " + requestID + " is still in > progress. > "); > else > System.out.println(results); > > I also have some code for talking to a Genscan server. 
The code has > been in use at our site for a few weeks, but has not seen extensive > testing yet. Source, binaries and documentation are available at > http://bigbio.rutgers.edu/~armhold/bioinf. > > -- > George Armhold > Rutgers University > Bioinformatics Initiative > _______________________________________________ > Biojava-l mailing list - Biojava-l@biojava.org > http://biojava.org/mailman/listinfo/biojava-lFrom armhold@cs.rutgers.edu Fri Nov 17 16:05:52 2000 Date: Fri, 17 Nov 2000 11:05:52 -0500 From: George Armhold armhold@cs.rutgers.edu Subject: [Biojava-l] introduction, and some BLAST/Genscan code
I should clarify something. My BlastConnection code interacts with the WWW server at NCBI, not the Blast server directly. It basically does an HTTP POST to submit the sequence. So it is subject to the whims of their webmaster, should they decide to change their CGI script. I hope my previous message was not misleading. I am planning on working on something that does talk to the server directly, which would be the Java equivalent of a "blast client". (If anyone is currently working on this I'd like to talk with them.) -- George Armhold Rutgers University Bioinformatics InitiativeFrom hlapp@gmx.net Fri Nov 17 17:49:05 2000 Date: Fri, 17 Nov 2000 09:49:05 -0800 From: Hilmar Lapp hlapp@gmx.net Subject: [Biojava-l] byte-code generator
Matthew Pocock wrote: > > We intend to release the bytecode generater under lgpl as a seperate > package to biojava. It is currently in a cvs module called bytecode (a > sister project to biojava-live). It has a sepreate development cycle to > biojava, so it is apropreate to keep releases of each one seperate. As > far as we are aware, it is the only open-source program of it's kind (we > may be wrong - tell us). It also has the cleanest API of any of the > bytecode generaters that I evaluated (of course, I would say that!). It > is light-weight enough that generating classes is not noticably more > expensive than loading byte-code direct from disk. > I'm sure you evaluated the JavaClass API (http://www.inf.fu-berlin.de/~dahm/JavaClass/), too (which BTW is also open-source). I can't judge the extent of overlap between the functionality of that package and your own, but there probably is some. Was there a particular reason not to collaborate with them, or is their interface not clean enough? Hilmar -- ----------------------------------------------------------------- Hilmar Lapp email: hlapp@gmx.net GNF, San Diego, Ca. 92122 phone: +1 858 812 1757 -----------------------------------------------------------------From td2@sanger.ac.uk Fri Nov 17 18:12:45 2000 Date: Fri, 17 Nov 2000 18:12:45 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] [newio] Code landed
I've just checked in the first revision of my new sequence I/O implementation. There's still more work left to be done, but hopefully most of the framework is now in place. Please everyone test this, read the code, shout at me if I've got something wrong, etc., etc.

What's new:

  - Event-notification based sequence input, with full decoupling of the parsing from Sequence object creation.

  - A standard way to filter sequence and feature-table data as it is read into BioJava -- just implement the SequenceBuilder interface (see FastaDescriptionLineParser and EmblProcessor for examples)

  - Faster and more memory-efficient parsing of large sequences.

  - The irritating FASTA line-length bug dead and gone forever :).

What's currently missing:

  - No GENBANK parser. If anyone else wants to take this on, feel free (look at the new EmblLikeFormat and EmblProcessor classes for ideas), otherwise I'll try to revive the old implementation.

  - IndexedSequenceDB was clobbered by one of the internal API changes -- it's not a hard fix, but I've temporarily disabled it until we've worked out the neatest way to fit this functionality onto the new framework.

How to use it:

From the outside, I've tried to make the minimum possible API changes. If you just use the I/O framework via the StreamReader class, the only major change you'll see is that you now need to provide a SequenceBuilderFactory in place of the old SequenceFactory. The `standard' implementation is at SimpleSequenceBuilder.FACTORY. But in practice, you may want to wrap this up in one or more extra layers of sequence processing.

As a quick example, I've attached a newio version of the GCContent demo program. I'm in the process of updating the other demo programs in the repository.
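
The decoupling described above (the parser only reports events, and a separate listener decides what, if anything, to build) can be sketched in plain Java as follows. This is an illustration only: SeqListener, parseFasta and GcCounter are hypothetical names invented for this sketch, not the newio classes, and the real SequenceBuilder interface may look quite different.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class EventParseSketch {

    /** Hypothetical listener: receives parse events and decides what to keep. */
    public interface SeqListener {
        void startSequence(String description);
        void addSymbols(char[] block, int offset, int length);
        void endSequence();
    }

    /** Minimal FASTA reader that only emits events; it never builds sequence objects. */
    public static void parseFasta(BufferedReader in, SeqListener listener) throws IOException {
        boolean open = false;
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) continue;
            if (line.startsWith(">")) {
                if (open) listener.endSequence();
                listener.startSequence(line.substring(1));
                open = true;
            } else if (open) {
                // Residues arrive block by block, whatever the line length of the file.
                char[] block = line.toCharArray();
                listener.addSymbols(block, 0, block.length);
            }
        }
        if (open) listener.endSequence();
    }

    /** A listener that keeps only running counts, so no full sequence is ever stored. */
    public static class GcCounter implements SeqListener {
        public final Map<String, Double> gcPercent = new LinkedHashMap<>();
        private String name;
        private long gc;
        private long total;

        public void startSequence(String description) {
            name = description;
            gc = 0;
            total = 0;
        }

        public void addSymbols(char[] block, int offset, int length) {
            for (int i = offset; i < offset + length; i++) {
                char c = Character.toLowerCase(block[i]);
                if (c == 'g' || c == 'c') gc++;
                total++;
            }
        }

        public void endSequence() {
            gcPercent.put(name, total == 0 ? 0.0 : (gc * 100.0) / total);
        }
    }

    /** Convenience wrapper: GC percentage per record for FASTA text held in a string. */
    public static Map<String, Double> gcBySequence(String fasta) {
        GcCounter counter = new GcCounter();
        try {
            parseFasta(new BufferedReader(new StringReader(fasta)), counter);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counter.gcPercent;
    }

    public static void main(String[] args) {
        System.out.println(gcBySequence(">a\nGGCC\n>b\nAT\nGC\n"));
    }
}
```

Because the reader never touches the result type, a different listener (one that builds full sequence objects, or one that filters records by description) could be swapped in without changing the parsing code, which is the point of the design.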
For people who were previously using EmblParser, this has now been replaced by a lighter-weight EmblLikeParser (which should also work for formats like SwissProt, Transfac, UTRdb, and so on). Output from this is converted into something resembling the old parser using the EmblProcessor filter class.

Happy hacking!

Thomas.

PS. For anyone who wants a copy of the last BioJava without newio, a checkout at 17:00 UTC today should be safe

--
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
                                               -- Terry Pratchett

Attachment: GCContent.java

    package seq;

    import java.io.*;

    import org.biojava.bio.seq.io.*;
    import org.biojava.bio.seq.*;
    import org.biojava.bio.symbol.*;

    public class GCContent {
        public static void main(String[] args) throws Exception {
            if (args.length != 1)
                throw new Exception("usage: java GCContent filename.fa");
            String fileName = args[0];

            // Set up stream reader
            Alphabet dna = DNATools.getDNA();
            SymbolParser dnaParser = dna.getParser("token");
            BufferedReader br = new BufferedReader(
                    new FileReader(fileName));
            SequenceBuilderFactory fact =
                    new FastaDescriptionLineParser.Factory(
                            SimpleSequenceBuilder.FACTORY);
            StreamReader stream = new StreamReader(br,
                    new FastaFormat(), dnaParser, fact);

            // Iterate over all sequences in the stream
            while (stream.hasNext()) {
                Sequence seq = stream.nextSequence();
                int gc = 0;
                for (int pos = 1; pos <= seq.length(); ++pos) {
                    Symbol sym = seq.symbolAt(pos);
                    if (sym == DNATools.g() || sym == DNATools.c())
                        ++gc;
                }
                System.out.println(seq.getName() + ": " +
                        ((gc * 100.0) / seq.length()) + "%");
            }
        }
    }

From td2@sanger.ac.uk Mon Nov 20 11:46:42 2000 Date: Mon, 20 Nov 2000 11:46:42 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] Xerces-J updated
I've just upgraded the xerces.jar file in the biojava-live CVS tree to match Xerces-J 1.2.1. This has better support for XML Schema validation, and also includes a fix which is relied on by some new code I'll be checking in this afternoon. Unless you are relying on some very specific bit of Xerces behaviour, you should just be able to do a normal CVS update and pick up the new jar file -- but your update may take a few minutes longer than normal. Let me know if this causes any problems, Thomas. -- ``If I was going to carry a large axe on my back to a diplomatic function I think I'd want it glittery too.'' -- Terry PratchettFrom td2@sanger.ac.uk Mon Nov 20 12:03:56 2000 Date: Mon, 20 Nov 2000 12:03:56 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] BioJava DAS client
Hi... After a couple of times when it's managed to get lost in my TODO list, I've now merged the phase 1 BioJava DAS client into the main CVS tree. If you are interested, take a look at the package org.biojava.bio.program.das (remember, if you are doing a CVS update, use "update -dP" to pick up newly created directories). With this package, you can create a BioJava-style SequenceDB which reflects the contents of a DAS reference datasource, then layer feature sets from one or more `annotation' servers on top of the reference sequence. The code is currently missing a query- optimizer module I'm working on at the moment, and it needs more testing against different server implementations. It should, however, be fully functional -- if you can't access your favourite DAS server, please report this as a bug. What I haven't done yet is build a graphical client application. If anyone is interested in working on this, it shouldn't be too hard to wire the client code up to the BioJava GUI packages to give at least a first-pass attempt at a viewer. Matthew demonstrated something like this at ISMB over the summer, so we know it's possible :). For people who are interested in DAS, there is now a new web site for the protocol specifications: http://biodas.org/ Happy hacking, Thomas -- ``If I was going to carry a large axe on my back to a diplomatic function I think I'd want it glittery too.'' -- Terry PratchettFrom td2@sanger.ac.uk Mon Nov 20 15:15:59 2000 Date: Mon, 20 Nov 2000 15:15:59 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] Dazzle server update
The latest development version of Dazzle, my DAS server toolkit, is now available via CVS from the BioJava repository. Details of how to access this can be found on http://cvs.biojava.org/. The module name is "dazzle".

Dazzle is a Java servlet which uses the BioJava core APIs for handling sequence data and tables of features. The idea is to provide a simple framework for handling DAS requests and generating the basic documents, which can be parameterized for specific purposes by plugging in one or more DASDataSource objects. These allow the same servlet to work well as either a reference or an annotation server on the DAS network.

I've got a couple of plans in mind for Dazzle:

  - Combine with the experimental biojava-ensembl bridge code to serve human genome data directly out of the Ensembl project's SQL database. This is a good testcase for Dazzle as a practical, scalable (!) server, as well as providing a sensible reference point for other people wanting to offer human annotation. The bridge code is very close to a position where I could serve sequence data (a little bit more work is still needed to get genes and features working, though).

  - Package dazzle with a standalone servlet container (Tomcat? Interalia picoServer? something else?) and a simple admin tool to give a `10 minute' DAS server installation. This should allow you to drop in some GFF/Game/whatever files and start serving annotations straight away.

There are a few changes since the 0.04 tarball I put out a while back:

  - No longer needs to build from the same source tree as a DAS client -- I use the standard BioJava client code instead.

  - Some of the scalability bottlenecks fixed (but still more to go -- startup time for annotation servers is rather slower than I'd hope).

  - Tidied up the output -- should be able to generate 100% compliant DAS/0.98 documents.

If anyone is in a hurry to try it, the instructions for 0.04 should still work.
Otherwise, in the next few days I'm hoping to make the following changes: - Stabilise the DASDataSource interface - Migrate to servlets 2.2 (it currently builds against 2.1, but I don't know of any production quality 2.1 containers) - Improve lazy data source instantiation. - Write new installation documents. I'll make a `proper' release once these changes are made. Thomas. -- ``If I was going to carry a large axe on my back to a diplomatic function I think I'd want it glittery too.'' -- Terry PratchettFrom mrp@sanger.ac.uk Mon Nov 20 15:32:03 2000 Date: Mon, 20 Nov 2000 15:32:03 +0000 From: Matthew Pocock mrp@sanger.ac.uk Subject: [Biojava-l] byte-code generator
Hi Hilmar,

The honest answer is that when we were searching for available bytecode generators, we didn't find JavaClass. Of the ones we did find, none fitted our specs (both sensible & open-source).

I think from looking at JavaClass, it is aimed more at disassembling and editing bytecode, whereas ours is optimized for generating it from scratch, either as the back-end for a compiler, or as a macro-assembler. Also, lots of the things that you have to handle explicitly in JavaClass (constant pool entries, jump points and the like) we take care of "under the hood", so that it is much easier to write class generators that are like C++ templates, and to organically re-use functionality (e.g. re-use a max or isNaN macro).

Horses for courses. It is a shame that we didn't spot this one earlier.

Matthew

Hilmar Lapp wrote:
> I'm sure you evaluated the JavaClass API
> (http://www.inf.fu-berlin.de/~dahm/JavaClass/), too (which BTW is also
> open-source). I can't judge the extent of overlap between the
> functionality of that package and your own, but there probably is some.
> Was there a particular reason not to collaborate with them, or is their
> interface not clean enough?
>
> Hilmar
>
> --
> -----------------------------------------------------------------
> Hilmar Lapp email: hlapp@gmx.net
> GNF, San Diego, Ca. 92122 phone: +1 858 812 1757
> -----------------------------------------------------------------

From td2@sanger.ac.uk Mon Nov 20 16:10:13 2000 Date: Mon, 20 Nov 2000 16:10:13 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] [newio] Light refactoring of the SeqIOListener interface
(This is only really of interest to people who write SequenceFormats or SequenceBuilders -- the external API is unchanged since Friday) In the interest of simplicity, I've changed the SeqIOListener interface slightly so that we just notify the listener of blocks of Symbols, rather than passing around SymbolReader objects. This still allows us to optimize the SymbolList creation process -- performance and peak memory are essentially unchanged by this. But the API is slightly simpler, and it will make it much easier to write SequenceFormats which sit on top of some other parser system (I'm thinking especially about XML formats here sitting on top of SAX parsers). Let me know if there are any problems with this, Thomas. -- ``If I was going to carry a large axe on my back to a diplomatic function I think I'd want it glittery too.'' -- Terry PratchettFrom lstein@cshl.org Mon Nov 20 16:28:18 2000 Date: Mon, 20 Nov 2000 11:28:18 -0500 (EST) From: Lincoln Stein lstein@cshl.org Subject: [Biojava-l] Dazzle server update
Outstanding, on both the client and dazzle fronts! Lincoln Thomas Down writes: > The latest development version of Dazzle, my DAS server toolkit, is now > available via CVS from the BioJava repository. Details of > how to access this can be found on http://cvs.biojava.org/. > The module name is "dazzle". > > Dazzle is a Java servlet which uses to BioJava core APIs > for handling sequence data and tables of features. > The idea is to provide a simple framework for handling DAS > requests and generating the basic documents, which can be > parameterized for specific purposes by plugging in one or > more DASDataSource objects. These allow the same servlet > to work well as either a reference or an annotation server > on the DAS network. I've got a couple of plans in mind for > Dazzle: > > - Combine with the experimental biojava-ensembl bridge code > to serve human genome data directly out of the Ensembl > project's SQL database. This is a good testcase for Dazzle > as a practical, scalable (!) server, as well as providing > a sensible reference point for other people wanting to offer > human annotation. The bridge code is very close to a position > where I could serve sequence data (a little bit more work is > still needed to get genes and features working, though). > > - Package dazzle with a standalone servlet container (Tomcat? > Interalia picoServer? something else?) and a simple admin > tool to give a `10 minute' DAS server installation. This > should allow you to drop in some GFF/Game/whatever files and > start serving annotations straight away). > > There are a few changes since the 0.04 tarball I put out a > while back: > > - No longer needs to build from the same source tree as > a DAS client -- I use the standard BioJava client code > instead. > > - Some of the scalability bottlenecks fixed (but still more > to go -- startup time for annotation servers is rather slower > that I'd hope). 
> > - Tidied up the output -- should be able to generate 100% > compliant DAS/0.98 documents. > > If anyone is in a hurry to try it, the instructions for 0.04 should > still work. Otherwise, in the next few days I'm hoping to make the > following changes: > > - Stabilise the DASDataSource interface > > - Migrate to servlets 2.2 (it currently builds against 2.1, > but I don't know of any production quality 2.1 containers) > > - Improve lazy data source instantiation. > > - Write new installation documents. > > I'll make a `proper' release once these changes are made. > > Thomas. > -- > ``If I was going to carry a large axe on my back to a diplomatic > function I think I'd want it glittery too.'' > -- Terry Pratchett -- ======================================================================== Lincoln D. Stein Cold Spring Harbor Laboratory lstein@cshl.org Cold Spring Harbor, NY NOW HIRING BIOINFORMATICS POSTDOCTORAL FELLOWS AND PROGRAMMERS. PLEASE WRITE FOR DETAILS. ========================================================================From birney@ebi.ac.uk Mon Nov 20 17:14:12 2000 Date: Mon, 20 Nov 2000 17:14:12 +0000 (GMT) From: Ewan Birney birney@ebi.ac.uk Subject: [Biojava-l] Re: Dazzle server update
On Mon, 20 Nov 2000, Thomas Down wrote: > The latest development version of Dazzle, my DAS server toolkit, is now > available via CVS from the BioJava repository. Details of > how to access this can be found on http://cvs.biojava.org/. > The module name is "dazzle". Wow... I am very excited about this. If we can mount an ensembl reference server that will mean there is (another) heavy server serving out data. I am not sure if you have met the wonders of the static_golden_path table yet in ensembl or not, but if not, I suspect you will soon. drop me a note if you want a quick tour ;)From td2@sanger.ac.uk Mon Nov 20 18:45:56 2000 Date: Mon, 20 Nov 2000 18:45:56 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] Oooops...
Just realized that I forgot to `CVS add' a vital file when I committed my last round of I/O changes -- I'll check it in tomorrow am.

Guess that if we are playing by EnsEMBL rules I owe large quantities of beer... (although, in my defence, everything /did/ compile and run on my machine before I checked in...)

If anyone is desperate, I'll try to e-mail the file this evening.

Thomas@home (currently without CVS access).

--
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
                                               -- Terry Pratchett

From birney@ebi.ac.uk Mon Nov 20 18:58:39 2000 Date: Mon, 20 Nov 2000 18:58:39 +0000 (GMT) From: Ewan Birney birney@ebi.ac.uk Subject: [Biojava-l] Oooops...
On Mon, 20 Nov 2000, Thomas Down wrote:

> Just realized that I forgot to `CVS add' a vital file when I committed
> my last round of I/O changes -- I'll check it in tomorrow am.
>
> Guess that if we are playing by EnsEMBL rules I owe large
> quantities of beer... (although, in my defence, everything
> /did/ compile and run on my machine before I checked in...)

<smile> beers rules rock </smile>

On bioperl/ensembl, I keep two checkout'd directory structures when I am working: one for development, and one for paranoid "clean room" tests. In fact, it is better to have the second one on a different machine if possible....

(ewan, who has been here many times before)

>
> If anyone is desperate, I'll try to e-mail the file this evening.
>
> Thomas@home (currently without CVS access).
> --
> ``If I was going to carry a large axe on my back to a diplomatic
> function I think I'd want it glittery too.''
> -- Terry Pratchett
>

-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
<birney@ebi.ac.uk>.
-----------------------------------------------------------------

From td2@sanger.ac.uk Tue Nov 21 10:47:50 2000 Date: Tue, 21 Nov 2000 10:47:50 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] ChunkedSymbolListBuilder.java
...is now safely checked in... *blush* Thomas. -- ``If I was going to carry a large axe on my back to a diplomatic function I think I'd want it glittery too.'' -- Terry PratchettFrom td2@sanger.ac.uk Wed Nov 22 14:08:57 2000 Date: Wed, 22 Nov 2000 14:08:57 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] [Dazzle] Servlet API update, and a `quick test' release
I've now upgraded the Dazzle servlet to work with the Servlet API version 2.2 -- previous versions used 2.1, but this now seems to be dead. It should also work with 2.3 containers, once they start to appear.

For people looking for a Servlet 2.2 container, the leading open source product seems to be Tomcat (http://jakarta.apache.org/). I've been testing with Tomcat 3.2beta7, and this seems to work well.

One of the advantages of servlet 2.2 containers is that they support a standard servlet deployment mechanism. I've prepared a `quick test' release of Dazzle, based on the configuration I've been using for some of my own testing. You can download this at:

  ftp://ftp.biojava.org/pub/biojava/dazzle/dazzle-test-0.05.war

This contains the servlet itself, the libraries it requires (BioJava and Xerces-J), some test data, and a deployment descriptor file, all wrapped up in the Servlet 2.2 `Web application' format. To try it out:

  - Install a servlet 2.2 container (tomcat)

  - Download the test distribution and rename it to `das.war'

  - Drop das.war into the deployment directory of the container (The standard Tomcat distribution has a `webapps' directory).

  - Restart the servlet container

  - Test by pointing your web browser to:

      <base_url_of_container>/das

    This should give an HTML `welcome' page generated by the servlet

Let me know how this works out -- I'd be especially interested if anyone was testing using a container other than Tomcat.

In the meantime, I'm still working on the bridge which will allow us to serve EnsEMBL -- watch this space.

Thomas.

--
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
                                               -- Terry Pratchett

From td2@sanger.ac.uk Wed Nov 22 15:19:57 2000 Date: Wed, 22 Nov 2000 15:19:57 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] database for biojava
Just found this languishing at the end of my INBOX -- sorry...

On Thu, Nov 16, 2000 at 05:43:45PM +1300, McCulloch, Alan wrote:
> Does anybody have any tips on the right approach to setting up a database
> on top of which would sit biojava ?
>
> The platform will be Oracle 8 and I am very keen to NOT do my
> own data model (in the same way I'm keen to not do my own api/object
> design which is why I want to use something like biojava !) - I want to
> use a standard model if possible, if there is such a thing.
>
> Can a relational data model of some sort be derived from biojava ?

It certainly should be possible to build a new relational model based on BioJava. Our basic model (simple sequence data, hierarchical features) is really pretty simple -- the only problems I can see might be:

  - Sparse locations -- it'll be a little bit of extra work to store these in the relational model. I guess I'd go for having a `span' table:

      create table location_span (
          location_id int not null,
          min_pos int not null,
          max_pos int not null
      ) ;

    So each location is modeled by one or more location_span rows. Of course, the BioJava interfaces don't actually /require/ you to store sparse locations -- only implement this if you're actually going to need it.

  - Polymorphic features -- I guess the easiest way might be to have a separate table for each class of Feature object you want to store, but this means hardwiring the supported feature classes at a fairly low level. Another approach would be to have a table like:

      create table feature (
          id sequence,
          sequence_id int not null,
          parent_id int,
          location_id int not null,
          type text,
          source text,
          biojava_feature blob
      ) ;

    so you're storing the `universal' properties of the feature, and then serializing the whole feature object and dumping it in the blob.

But before you start implementing from scratch, you might like to take a look at what the EnsEMBL people have been doing (http://www.ensembl.org).
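
As a rough illustration of the `span' table idea above, a sparse location can be flattened into one location_span row per contiguous block, and its overall bounds recovered from those rows. The sketch below is plain Java with hypothetical names (LocationSpanSketch, SpanRow, toRows, bounds); it does not use the BioJava Location interface or any real database driver, and the insert statement is only an assumption about how such rows might be written.

```java
import java.util.ArrayList;
import java.util.List;

public class LocationSpanSketch {

    // Parameterised insert matching the three columns of the location_span table
    // (an assumed statement, shown for orientation only; nothing here talks to a database).
    public static final String INSERT_SQL =
            "insert into location_span (location_id, min_pos, max_pos) values (?, ?, ?)";

    /** One row of location_span: a single contiguous block of a location. */
    public static final class SpanRow {
        public final int locationId;
        public final int minPos;
        public final int maxPos;

        public SpanRow(int locationId, int minPos, int maxPos) {
            if (minPos > maxPos) {
                throw new IllegalArgumentException("min_pos must not exceed max_pos");
            }
            this.locationId = locationId;
            this.minPos = minPos;
            this.maxPos = maxPos;
        }
    }

    /** Flattens a sparse location, given as {min, max} blocks, into one row per block. */
    public static List<SpanRow> toRows(int locationId, int[][] blocks) {
        List<SpanRow> rows = new ArrayList<>();
        for (int[] block : blocks) {
            rows.add(new SpanRow(locationId, block[0], block[1]));
        }
        return rows;
    }

    /** Recovers the overall {min, max} of a location from its rows. */
    public static int[] bounds(List<SpanRow> rows) {
        if (rows.isEmpty()) {
            throw new IllegalArgumentException("a location needs at least one span");
        }
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (SpanRow r : rows) {
            min = Math.min(min, r.minPos);
            max = Math.max(max, r.maxPos);
        }
        return new int[] {min, max};
    }

    public static void main(String[] args) {
        // A two-block location, e.g. something like join(10..20,35..50)
        List<SpanRow> rows = toRows(7, new int[][] {{10, 20}, {35, 50}});
        for (SpanRow r : rows) {
            System.out.println(r.locationId + " " + r.minPos + " " + r.maxPos);
        }
        int[] b = bounds(rows);
        System.out.println("overall: " + b[0] + ".." + b[1]);
    }
}
```

A contiguous location is simply the one-row case, so code that never needs sparse locations pays nothing extra for the table layout.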
The EnsEMBL project has a fairly sophisticated model for storing genomic data in a relational model (currently using MySQL, but I've had the main tables running on PostgreSQL, and I know someone is working on an Oracle port). The EnsEMBL tables are more closely geared towards one specific application than the BioJava model is, but it might be worth looking to see if your data will fit into this model.

I've been working on some Java interfaces for EnsEMBL -- all experimental code at the moment. Feel free to take a look at the following CVS modules if you're interested (in the main BioJava repository):

  ensembl
      Lightweight Java wrappers round the ensembl SQL tables (largely complete for reading, maybe 40-50% done for writing)

  biojava-ensembl
      Bridge which allows EnsEMBL databases to be viewed as BioJava SequenceDBs (currently pretty experimental)

Hope this helps,

Thomas.

--
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
                                               -- Terry Pratchett

From loraine@loraine.net Mon Nov 27 06:47:59 2000 Date: Sun, 26 Nov 2000 22:47:59 -0800 (PST) From: Ann Loraine loraine@loraine.net Subject: [Biojava-l] BioJava DAS client
> > What I haven't done yet is build a graphical client application. > If anyone is interested in working on this, it shouldn't be too > hard to wire the client code up to the BioJava GUI packages to give > at least a first-pass attempt at a viewer. Matthew demonstrated > something like this at ISMB over the summer, so we know it's > possible :). > You also might want to check out Jazz - a Java open source toolkit for building graphical applications, such as genome and sequence map viewers! Jazz is the open source, Java heir to Pad++, groundbreaking zooming graphical interface project headed by Ben Bederson, now a prof at the Human Computer Interaction Lab at the University of Maryland, USA. The URL: http://www.cs.umd.edu/hcil/jazz/ -Ann > For people who are interested in DAS, there is now a new web > site for the protocol specifications: > > http://biodas.org/ > > Happy hacking, > > Thomas > -- > ``If I was going to carry a large axe on my back to a diplomatic > function I think I'd want it glittery too.'' > -- Terry Pratchett > _______________________________________________ > Biojava-l mailing list - Biojava-l@biojava.org > http://biojava.org/mailman/listinfo/biojava-l >From kdj@sanger.ac.uk Mon Nov 27 14:13:18 2000 Date: 27 Nov 2000 14:13:18 +0000 From: Keith James kdj@sanger.ac.uk Subject: [Biojava-l] Fasta & EMBL feature table parsing
Hi, I'm one of the Sanger Pathogen Sequencing Unit annotators and I've been writing/using OO Perl stuff for EMBL feature table editing, Blast/Fasta/HMMER/EMBOSS etc. sequence analysis. I'm a Java newbie looking to see if the 'grass is greener' on the Java side of the fence. I spent a weekend reading the Javadoc and trying things out. No problems. Now I have some questions: I want to implement a Fasta search output parser (for the nicer -m 10 form of output). I have a Perl implementation right now. Going through the list archive I found lots of discussion regarding the Blast SAX-type parser. Would this be the preferred way to cope with Fasta? This might be a bit of a challenge for me as I am initially confused by the various layers of the SAX-type system, but I'm sure I'll sort it out. (How does the SAX-type parser fit in with the code in org.biojava.bio.search?) And an observation: The EMBL flatfile feature table parser (at least, as it was until the new io stuff) would overwrite qualifiers. e.g. where there were several /gene names in a feature, only the last one would be retained. Also quirks similar to earlier Bioperl (like discarding information from < and > in locations, which is important for us to keep). Are these going to be addressed in the io shakeup? On a related note, if nobody is going to implement writeSequence for EMBL, then I'll offer to do it. cheers, Keith -- -= Keith James - kdj@sanger.ac.uk - http://www.sanger.ac.uk/Users/kdj =- The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SAFrom mrp@sanger.ac.uk Mon Nov 27 15:03:07 2000 Date: Mon, 27 Nov 2000 15:03:07 +0000 From: Matthew Pocock mrp@sanger.ac.uk Subject: [Biojava-l] Fasta & EMBL feature table parsing
Hi Keith, You should drop in some time and say hello (D322). Keith James wrote: > Hi, > > I'm one of the Sanger Pathogen Sequencing Unit annotators and I've > been writing/using OO Perl stuff for EMBL feature table editing, > Blast/Fasta/HMMER/EMBOSS etc. sequence analysis. I'm a Java newbie > looking to see if the 'grass is greener' on the Java side of the > fence. > > I spent a weekend reading the Javadoc and trying things out. No > problems. Now I have some questions: > Wow - you could make stuff work from reading the docs? They must be better than I remember... > > I want to implement a Fasta search output parser (for the nicer -m 10 > form of output). I have a Perl implementation right now. Going through > the list archive I found lots of discussion regarding the Blast > SAX-type parser. Would this be the preferred way to cope with Fasta? > This might be a bit of a challenge for me as I am initially confused > by the various layers of the SAX-type system, but I'm sure I'll sort > it out. > SAX would be the ideal way to do this, but as you say, it does require a level of effort that may be disproportionately high. > > (How does the SAX-type parser fit in with the code in > org.biojava.bio.search?) > bio.search specifies how the biojava objects for representing search methods & results should appear. The parsing framework specifies how the results flow through the application as a stream of data. It is easy to build bio.search objects from the xml streams by extracting interesting stuff. However, with the streams, you can do on-the-fly translation into other formats e.g. HTML. You could also build the bio.search objects directly from the fasta search output, or build them to represent the results of your personal search algorithm writen in Java. > > And an observation: > > The EMBL flatfile feature table parser (at least, as it was until the > new io stuff) would overwrite qualifiers. e.g. 
where there were
> several /gene names in a feature, only the last one would be
> retained. Also quirks similar to earlier Bioperl (like discarding
> information from < and > in locations, which is important for us to
> keep). Are these going to be addressed in the io shakeup?
>

The qualifier overwriting should be addressed by the new IO (fingers crossed). Fuzzy locations are evil. I ducked handling this one until somebody required it. You require it, so I guess the days of ducking are over. I am willing to add a new implementation of the Location interface called FuzzyLocation. It will have isMinFuzzy() and isMaxFuzzy() boolean methods, and will decorate another Location for all the other location methods. This way I think we can store everything & lose nothing. Sounds good?

>
> On a related note, if nobody is going to implement writeSequence for
> EMBL, then I'll offer to do it.

Thanks - once the new IO has settled down, this would be great.

>
>
> cheers,
>
> Keith
>

Matthew

>
> --
>
> -= Keith James - kdj@sanger.ac.uk - http://www.sanger.ac.uk/Users/kdj =-
> The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA

From simon.brocklehurst@CambridgeAntibody.com Mon Nov 27 15:29:55 2000 Date: Mon, 27 Nov 2000 15:29:55 +0000 From: Simon Brocklehurst simon.brocklehurst@CambridgeAntibody.com Subject: [Biojava-l] Fasta & EMBL feature table parsing
Hi Keith,

First, yes the grass is greener with Java ;-)

The SAX2 event-based parsing framework is designed to be extensible (for example as well as the blast/wu-blast/hmmer stuff, there is proof-of-principle 3-D structure stuff which will be enhanced shortly).

I'm sure you're not alone about being confused - I don't think there is enough documentation there to make it easy to get going on using the parsers to build applications, let alone extending the system by writing new SAX parsers.

I have been meaning to put up some more documentation and tutorials on the biojava web site to make it easy for people to get going. As a start on this, I will try to get some UML class diagram stuff up late today. This should certainly help you figure out what classes can be reused.

The place to start with this kind of thing is to figure out exactly what SAX2 events you will need to throw. What this means is that you need to work out what the XML format would be if your data was actually in XML format, and then put together an XML DTD or Schema to describe it.

If you have any detailed questions, please feel free to drop a note to the list and I will do my best to help.

Simon
--
Simon M. Brocklehurst, Ph.D.
Head of Bioinformatics & Advanced IS
Cambridge Antibody Technology
The Science Park, Melbourn, Cambridgeshire, UK
http://www.CambridgeAntibody.com/
mailto:simon.brocklehurst@CambridgeAntibody.com

From kdj@sanger.ac.uk Mon Nov 27 17:11:19 2000 Date: 27 Nov 2000 17:11:19 +0000 From: Keith James kdj@sanger.ac.uk Subject: [Biojava-l] Fasta & EMBL feature table parsing
>>>>> "Matthew" == Matthew Pocock <mrp@sanger.ac.uk> writes:

    >> And an observation:
    >>
    >> The EMBL flatfile feature table parser (at least, as it was
    >> until the new io stuff) would overwrite qualifiers. e.g. where
    >> there were several /gene names in a feature, only the last one
    >> would be retained. Also quirks similar to earlier Bioperl (like
    >> discarding information from < and > in locations, which is
    >> important for us to keep). Are these going to be addressed in
    >> the io shakeup?

    Matthew> The qualifier overwriting should be addressed by the new
    Matthew> IO (fingers crossed). Fuzzy locations are evil. I ducked
    Matthew> handling this one until somebody required it. You
    Matthew> require it, so I guess the days of ducking are over. I am
    Matthew> willing to add a new implementation of the Location
    Matthew> interface called FuzzyLocation. It will have isMinFuzzy()
    Matthew> and isMaxFuzzy() boolean methods, and will decorate
    Matthew> another Location for all the other location methods. This
    Matthew> way I think we can store everything & lose
    Matthew> nothing. Sounds good?

I think we call fuzzy locations something different e.g.

FT fuzzy_3p complement(130.140..2780)
FT fuzzy_both 123.130..789.796

Thankfully, I have some Perl classes to deal with these and I'm going to ignore them.

The < and > fuzziness is more important for us because they signify e.g. that there is more of the feature on an adjacent cosmid, or perhaps just 'beware incomplete CDS'. We sometimes use this to reconstitute bacterial genes across cosmid overlaps.

Support for these would be great.

Keith

--

-= Keith James - kdj@sanger.ac.uk - http://www.sanger.ac.uk/Users/kdj =-
The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA

From mrp@sanger.ac.uk Mon Nov 27 17:20:21 2000 Date: Mon, 27 Nov 2000 17:20:21 +0000 From: Matthew Pocock mrp@sanger.ac.uk Subject: [Biojava-l] Fasta & EMBL feature table parsing
Keith James wrote:

> >>>>> "Matthew" == Matthew Pocock <mrp@sanger.ac.uk> writes:
>
> I think we call fuzzy locations something different e.g.
>
> FT fuzzy_3p complement(130.140..2780)
> FT fuzzy_both 123.130..789.796
>
> Thankfully, I have some Perl classes to deal with these and I'm going
> to ignore them.
>
> The < and > fuzziness is more important for us because they signify
> e.g. that there is more of the feature on an adjacent cosmid, or
> perhaps just 'beware incomplete CDS'. We sometimes use this to
> reconstitute bacterial genes across cosmid overlaps.
>
> Support for these would be great.

I have just checked in org.biojava.bio.symbol.FuzzyLocation which deals with < and > locations (getMinFuzzy & getMaxFuzzy are the two properties). I don't know how to handle the interval case (x.y rather than x..y) so I intend to duck that until absolutely necessary. In an earlier post, there was a request for 'between' locations - I still can't see how to do that cleanly, so I haven't added it yet.

Matthew

> Keith
>
> --
>
> -= Keith James - kdj@sanger.ac.uk - http://www.sanger.ac.uk/Users/kdj =-
> The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA

From simon.brocklehurst@CambridgeAntibody.com Mon Nov 27 20:41:24 2000 Date: Mon, 27 Nov 2000 20:41:24 +0000 From: Simon Brocklehurst simon.brocklehurst@CambridgeAntibody.com Subject: [Biojava-l] Fasta & EMBL feature table parsing
Dear All,

As per my previous post, there are now some detailed UML and JavaDocs for SAXParser writers (i.e. lots of detail e.g. including Classes with package-level visibility etc.) up in the following location:

http://www.biojava.org/parsingTutorial1/

It's very much a beginning (understatement!). That is, if anyone is going to get anything out of this, they really need some understanding of Java, XML parsing using SAX2, and how SAXParsers work in general.

NB For people reading the archive, please note that the above URL is a *temporary* location - I had to put it here due to issues with permissions on the web server. I expect this content (and later versions) will move into the tutorials section of the biojava web site soon.

Simon
--
Simon M. Brocklehurst, Ph.D.
Head of Bioinformatics & Advanced IS
Cambridge Antibody Technology
The Science Park, Melbourn, Cambridgeshire, UK
http://www.CambridgeAntibody.com/
mailto:simon.brocklehurst@CambridgeAntibody.com

From kdj@sanger.ac.uk Mon Nov 27 21:18:50 2000 Date: 27 Nov 2000 21:18:50 +0000 From: Keith James kdj@sanger.ac.uk Subject: [Biojava-l] Fasta & EMBL feature table parsing
>>>>> "Simon" == Simon Brocklehurst <simon.brocklehurst@CambridgeAntibody.com> writes:

    Simon> The place to start with this kind of thing is to figure out
    Simon> exactly what SAX2 events you will need to throw. What this
    Simon> means is that you need to work out what the XML format
    Simon> would be if your data was actually in XML format, and then
    Simon> put together a XML DTD or Schema to describe it.

That's what I figured. Our group needs to work out what DTDs we will be using for annotation and search result interchange in general too. I hope we can pull all this together.

I'll plough through the code and see what I can make of it. The diagram (I've had a look at it now) is very helpful. Ta.

cheers,

--

-= Keith James - kdj@sanger.ac.uk - http://www.sanger.ac.uk/Users/kdj =-
The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA

From mrp@sanger.ac.uk Tue Nov 28 19:07:40 2000 Date: Tue, 28 Nov 2000 19:07:40 +0000 From: Matthew Pocock mrp@sanger.ac.uk Subject: [Biojava-l] CrossProductSymbols & stuff
Hi.

It is that time again when I am looking at the symbol and alphabet indexes. All in all, they are working very well. The one wrinkle for me at the moment is the Cross Product stuff.

Pre 1.0 I changed Symbol so that all symbols were ambiguous, but AtomicSymbol is a sub-interface that guarantees that the only symbol it matches is itself. I am proposing to do a similar flip with CrossProduct symbol - all symbols are thought of as being cross products of other symbols, but a special sub-set are prime - they can only be represented by raising themselves to the power 1, not by multiplying any other symbols together.

This flip should not in practice change the day-to-day use of BioJava one jot. It will, however, clean up some of the internals for handling alignments and probability distributions. The gap symbol should also become less schizoid (I hope for embarrassment's sake that none of you have given the gap symbol a good poke around). I hope that we can get things to be pretty much binary compatible when seen from the outside. It will certainly have settled down by the time we get around to a 1.1 release.

This all came to light because I am trying to write strand-reversible 2nd order HMMs for modeling chromosomes, and the current scheme makes life painful.

All those with objections speak now & loudly, or next time you check out from CVS, this will all have been silently changed.

Matthew

From Robin.Emig@maxygen.com Wed Nov 29 16:17:03 2000 Date: Wed, 29 Nov 2000 08:17:03 -0800 From: Emig, Robin Robin.Emig@maxygen.com Subject: [Biojava-l] CrossProductAlphabet
I have problems creating a cross product alphabet where the size is greater than 1000 symbols. This is the limit at which a sparse cross product alphabet gets created instead of a SimpleCrossProductAlphabet. I keep getting the exception

No symbol for token 'GGG' found in alphabet (null x null x null).

But both the original alphabet and the cross product alphabet do get instantiated. I am creating a codon alphabet that includes ambiguities.

-Robin

From Robin.Emig@maxygen.com Wed Nov 29 20:59:20 2000 Date: Wed, 29 Nov 2000 12:59:20 -0800 From: Emig, Robin Robin.Emig@maxygen.com Subject: [Biojava-l] CrossProductSymbols & stuff
How about adding an AlphabetManager.getCrossProductAlphabet(collection, boolean) method, where the boolean is true if we want to instantiate all the symbols, and then putting some upper limit (say 100000) in place before creating a sparse cross product alphabet?

-Robin

From smarkel@netgenics.com Thu Nov 30 02:29:21 2000 Date: Wed, 29 Nov 2000 18:29:21 -0800 From: Scott Markel smarkel@netgenics.com Subject: [Biojava-l] seeking comments on proposed changes
We'd like to propose some changes and would like to get the group's feedback.

* Location.empty.equals(Location.empty) evaluates to false. The problem is that EmptyLocation returns Integer.MIN_VALUE from the getMax() method and the LocationComparator determines the distance between the max of two Locations using subtraction. In this case of comparing Location.empty to itself the max values are both maximally negative so subtracting does not result in 0. We'd like to change EmptyLocation's equals() method.

* FastaFormat doesn't use Java-like facilities such as reading lines as Strings from a BufferedReader. We tripped over this while tracking down a bug regarding DOS formatted end-of-line characters in a FASTA file. We have a fix to the DOS format bug that could be checked in, but we're wondering if using BufferedReader's readLine() method might be a safer approach that avoids that kind of problem.

* We also noticed that when FastaFormat processes a sequence file a new String object is instantiated for each character in the sequence so that it can be parsed and added to the SymbolList. We've noticed a big performance hit for large sequences (100K - 10M bp). We'd like to do one of the following.

  - Add a method that mimics parseToken(), but takes a primitive char. This new method might live in either SymbolParser or a derived interface. Change the implementation of TokenParser's parse() method to not use substring(), which causes more Strings to be instantiated.

  - Change FastaFormat to use the current interface but instantiate a String per symbol in the alphabet and reuse them rather than creating a String per sequence character.

Comments?

Scott

--
Scott Markel, Ph.D.       NetGenics, Inc.
smarkel@netgenics.com     4350 Executive Drive
Tel: 858 455 5223         Suite 260
FAX: 858 455 1388         San Diego, CA 92121

From td2@sanger.ac.uk Thu Nov 30 11:31:59 2000 Date: Thu, 30 Nov 2000 11:31:59 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] seeking comments on proposed changes
On Wed, Nov 29, 2000 at 06:29:21PM -0800, Scott Markel wrote:
> We'd like to propose some changes and would like to get the group's
> feedback.

Great -- BioJava has grown quite a bit since the 1.0 release, and the more review it gets before 1.1 the better.

> * Location.empty.equals(Location.empty) evaluates to false. The
> problem is that EmptyLocation returns Integer.MIN_VALUE from the
> getMax() method and the LocationComparator determines the distance
> between the max of two Locations using subtraction. In this case of
> comparing Location.empty to itself the max values are both maximally
> negative so subtracting does not result in 0. We'd like to change
> EmptyLocation's equals() method.

That sounds reasonable to me...

> * FastaFormat doesn't use Java-like facilities such as reading lines
> as Strings from a BufferedReader. We tripped over this while
> tracking down a bug regarding DOS formatted end-of-line characters
> in a FASTA file. We have a fix to the DOS format bug that could be
> checked in, but we're wondering if using BufferedReader's readLine()
> method might be a safer approach that avoids that kind of problem.

There is actually a very good reason why FastaFormat doesn't use BufferedReader.readLine (I went to some trouble when I rewrote it to stop using readLine). The trouble is that some FASTA files have long (potentially /very/ long) description lines. The only way to detect when you've hit the end of one sequence is to see the start of the description line of the next sequence. The contract for the SequenceFormat.readSequence method is to read exactly one sequence from the stream, and then leave it parked at the start of the next sequence (this is important for allowing IndexedSequenceDB to work).
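The contract described above (read exactly one sequence, then leave the stream parked at the start of the next one) can be illustrated without readLine by reading one character at a time and only ever marking a single character of lookahead. What follows is a minimal sketch under stated assumptions, not the BioJava FastaFormat code: the class FastaRecordReader, its Record holder and the readRecord method are hypothetical names, the record is returned as plain Strings rather than a Sequence, and both '\r' and '\n' are treated as line ends.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical illustration; not part of BioJava. */
class FastaRecordReader {

    /** One record: the description line (without '>') and the raw sequence characters. */
    static final class Record {
        final String description;
        final String sequence;

        Record(String description, String sequence) {
            this.description = description;
            this.sequence = sequence;
        }
    }

    /**
     * Reads exactly one record and leaves the reader positioned on the '>'
     * of the next record. Returns null at end of stream.
     */
    static Record readRecord(Reader in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("Reader must support mark/reset");
        }
        int c = in.read();
        while (c == '\n' || c == '\r') {   // skip blank lines before the header
            c = in.read();
        }
        if (c == -1) {
            return null;
        }
        if (c != '>') {
            throw new IOException("Expected '>' at start of record");
        }
        // Description line: consumed char by char, so its length never
        // has to fit inside any mark/reset buffer.
        StringBuilder description = new StringBuilder();
        while ((c = in.read()) != -1 && c != '\n' && c != '\r') {
            description.append((char) c);
        }
        StringBuilder sequence = new StringBuilder();
        boolean atLineStart = true;
        while (true) {
            in.mark(1);                    // one character of lookahead is enough
            c = in.read();
            if (c == -1) {
                break;
            }
            if (c == '\n' || c == '\r') {  // accept DOS, Unix and bare-CR line ends
                atLineStart = true;
                continue;
            }
            if (atLineStart && c == '>') { // next record: push the '>' back
                in.reset();
                break;
            }
            atLineStart = false;
            sequence.append((char) c);
        }
        return new Record(description.toString(), sequence.toString());
    }

    public static void main(String[] args) throws IOException {
        Reader in = new StringReader(">seq1 demo\r\nACGT\r\nAC\r\n>seq2\nGG\n");
        Record r;
        while ((r = readRecord(in)) != null) {
            System.out.println(r.description + " -> " + r.sequence);
        }
    }
}
```

Because the mark is taken immediately before each single read, the amount of data between mark and reset is always one character, whatever the length of the description line.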
Since Java doesn't allow truly random access on normal streams (only mark/reset), it's actually NOT safe to use readLine -- previous versions of BioJava did this, and ended up breaking if you used files with description lines bigger than the buffer of the BufferedReader :(.

That said, you've clearly found a bug with FASTA files containing return characters -- glad someone found this sooner rather than later. But it's safer to just accept return characters as well as newlines, rather than using readLine() -- I'll check this in in a minute (bad Thomas for being Unix-centric).

> * We also noticed that when FastaFormat processes a sequence file a
> new String object is instantiated for each character in the sequence
> so that it can be parsed and added to the SymbolList. We've noticed
> a big performance hit for large sequences (100K - 10M bp).

I know this isn't ideal (although it was actually less of a problem than I thought on the VMs I tested -- still worth fixing, though). I've been thinking about changes to the SymbolParser interface for a while, but haven't got round to doing anything.

> We'd like to do one of the following.
>
> - Add a method that mimics parseToken(), but takes a primitive char.
> This new method might live in either SymbolParser or a derived
> interface. Change the implementation of TokenParser's parse()
> method to not use substring(), which causes more Strings to be
> instantiated.
>
> - Change FastaFormat to use the current interface but instantiate a
> String per symbol in the alphabet and reuse them rather than
> creating a String per sequence character.

I'd be quite happy to see the first of these options implemented -- go ahead and do it now if you're being held back by the performance issues.

An alternative solution which I've been thinking about is a `symbol-stream' parsing approach. The broad idea is that the SymbolParser gains an extra method to create a `streaming context' object.
Blocks of primitive chars go in, blocks of Symbols come out. There are two possible ways this might be done:

- Have a method on SymbolParser which takes a Java Reader and returns a SymbolReader (the same interface I used in my initial newio proposal). SequenceFormats just provide a custom Reader implementation which exposes the raw sequence character data.

- Have a special `streaming context' interface alongside the parser. This has a (SAX-like) characters(char[], int, int) method. A streaming context accepts character data, parses it, and passes blocks of symbols on to a SeqIOListener.

I think I'm starting to prefer the second of these proposals, and we then get rid of SymbolReader completely.

The reason I'd like to use one of these two systems in preference to just having a parseToken(char) method is that, while these approaches should be just as efficient for streams with a single character -> Symbol encoding, they can also be used on multiple character -> Symbol encoded streams. I think the current SymbolParser interface was designed with multi-char -> Symbol encodings in mind.

On the other hand, I'm open to being told that this is overkill and we should just concentrate on single-char -> Symbol parsing for now.

Thomas.
--
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
    -- Terry Pratchett
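The second proposal above (a streaming context with a SAX-like characters(char[], int, int) method) might look roughly like the sketch below for the single-character case. Everything here is an assumption made for illustration: Sym stands in for a BioJava Symbol, SymbolBlockListener stands in for whatever callback a SeqIOListener would offer, and CharStreamingContext is a hypothetical name. None of these are real BioJava types.

```java
import java.util.Map;

/** Hypothetical stand-in for a Symbol; only a name is modelled. */
final class Sym {
    final String name;

    Sym(String name) {
        this.name = name;
    }
}

/** Hypothetical listener that receives blocks of parsed symbols. */
interface SymbolBlockListener {
    void addSymbols(Sym[] syms, int start, int length);
}

/** Hypothetical streaming context with a SAX-like characters() method. */
interface StreamingContext {
    void characters(char[] data, int start, int length);
}

/**
 * Single-character encoding: each char is mapped to a symbol through an
 * array lookup, so no String is created per sequence character.
 */
final class CharStreamingContext implements StreamingContext {
    private final Sym[] table = new Sym[128];   // ASCII-only lookup table
    private final SymbolBlockListener listener;

    CharStreamingContext(Map<Character, Sym> encoding, SymbolBlockListener listener) {
        for (Map.Entry<Character, Sym> e : encoding.entrySet()) {
            char c = e.getKey();
            if (c >= table.length) {
                throw new IllegalArgumentException("Non-ASCII token: " + c);
            }
            table[c] = e.getValue();
        }
        this.listener = listener;
    }

    public void characters(char[] data, int start, int length) {
        Sym[] block = new Sym[length];
        for (int i = 0; i < length; i++) {
            char c = data[start + i];
            Sym s = c < table.length ? table[c] : null;
            if (s == null) {
                throw new IllegalArgumentException("No symbol for character '" + c + "'");
            }
            block[i] = s;
        }
        // Hand the whole block on in one call rather than symbol by symbol.
        listener.addSymbols(block, 0, length);
    }
}
```

A multi-character encoding would presumably need the context to buffer any partial token left at the end of one characters() call until the next call arrives; that is not shown here.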