From mrp@sanger.ac.uk Wed Nov 1 11:19:02 2000 Date: Wed, 01 Nov 2000 11:19:02 +0000 From: Matthew Pocock mrp@sanger.ac.uk Subject: [Biojava-l] Anotatable Symbol
Hi Mike.

There are several ways to do this without breaking anything we have at the
moment. Firstly, you could add a method to ProteinTools

double getResidueMass(Symbol s) throws IllegalSymbolException

You could store the mass information in a format similar to
resources/org/biojava/bio/seq/TranslationTables.xml (which is loaded by
RNATools). The problem with this is that you would end up with many
getResidueMassByBla methods. Alternatively, you could write a new interface
like this:

public interface SymbolProperty {
  FiniteAlphabet getAlphabet();
  double getValue(Symbol s) throws IllegalSymbolException;
}

You could then have ProteinTools provide several well-known versions - mass,
charge, size etc. and load the data from a SymbolProperty.xml resource. It
also leaves the door open to things like DNA physical properties.
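
A minimal sketch of how one such property could be backed by a simple table.
This is an illustration only, not BioJava code: symbols are reduced to
one-letter strings, getAlphabet() is left out, and IllegalSymbolException,
TableSymbolProperty and SymbolPropertyDemo are hypothetical stand-ins
defined here so that the example is self-contained.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for the BioJava exception of the same name (simplified, hypothetical).
class IllegalSymbolException extends Exception {
    IllegalSymbolException(String message) { super(message); }
}

// Simplified SymbolProperty: symbols are plain one-letter strings here,
// and getAlphabet() is omitted to keep the sketch self-contained.
interface SymbolProperty {
    double getValue(String symbol) throws IllegalSymbolException;
}

// Hypothetical table-backed implementation; a real one might be filled from XML.
class TableSymbolProperty implements SymbolProperty {
    private final Map<String, Double> values = new HashMap<>();

    TableSymbolProperty put(String symbol, double value) {
        values.put(symbol, value);
        return this;
    }

    @Override
    public double getValue(String symbol) throws IllegalSymbolException {
        Double v = values.get(symbol);
        if (v == null) {
            throw new IllegalSymbolException("No value for symbol " + symbol);
        }
        return v;
    }
}

class SymbolPropertyDemo {
    // Sums a property over a sequence, e.g. the residue masses of a peptide.
    static double total(SymbolProperty property, String sequence)
            throws IllegalSymbolException {
        double sum = 0.0;
        for (int i = 0; i < sequence.length(); i++) {
            sum += property.getValue(String.valueOf(sequence.charAt(i)));
        }
        return sum;
    }
}
```

With this shape, a mass table would be built as
new TableSymbolProperty().put("S", 87.03203), and a charge or size table would
be a second instance of the same class rather than a new method on
ProteinTools.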

Another way to do this is to add the data to AlphabetManager.xml directly.
You would have to modify the DTD so that the description element could have
<key type="java.lang.String">mass<value
type="java.lang.Double">90.3</value></key> style children, and then extend
the symbolFromXML code to handle this. The description elements should
probably move to being <key type="java.lang.String"><value
type="java.lang.String">The description goes in here</value></key>.

My money is on the interface option, as it lets you plug in new physical
properties without having to have access to AlphabetManager.xml, including
parameterising algorithms at run-time - TranslationTables ended up being
great for this. The down-side for heavily computational algorithms is that
you will have to perform some type of search within the implementations to
find the value associated with a symbol. The issue of how to implement this
search optimally is nicely solved by the AlphabetIndex interface (just in),
so it may not be that bad in practice. I have a feeling that the overhead of
finding a particular key within an annotation bundle will be higher than the
cost of looking up a double based upon the amino-acid, as hash-codes have to
be calculated, and lots of functions and members are fetched to traverse the
hash table.
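
To make the cost argument concrete, here is a rough sketch of an indexed
lookup: each symbol of a finite alphabet is given an integer position once,
and the property values then live in a plain double[]. This is not the
AlphabetIndex API; CharAlphabetIndex and IndexedProperty are hypothetical
names, and symbols are assumed to be single ASCII characters.

```java
import java.util.Arrays;

// Hypothetical stand-in for the idea of an alphabet index: every symbol of a
// finite alphabet is assigned a fixed integer position once, up front.
class CharAlphabetIndex {
    private final int[] positions = new int[128];  // ASCII token -> position, -1 if absent
    private final int size;

    CharAlphabetIndex(String tokens) {
        Arrays.fill(positions, -1);
        for (int i = 0; i < tokens.length(); i++) {
            positions[tokens.charAt(i)] = i;
        }
        size = tokens.length();
    }

    int size() { return size; }

    int indexOf(char token) {
        int pos = token < positions.length ? positions[token] : -1;
        if (pos < 0) {
            throw new IllegalArgumentException("Symbol not in alphabet: " + token);
        }
        return pos;
    }
}

// A property stored as a double[] parallel to the index, so that a lookup is
// two array reads and involves no hashing.
class IndexedProperty {
    private final CharAlphabetIndex index;
    private final double[] values;

    IndexedProperty(CharAlphabetIndex index) {
        this.index = index;
        this.values = new double[index.size()];
    }

    void set(char token, double value) { values[index.indexOf(token)] = value; }

    double get(char token) { return values[index.indexOf(token)]; }
}
```

The trade-off is that the table is tied to one alphabet and one fixed
ordering, which is presumably why a property would want to expose the
alphabet it is defined over.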

What do other people think?

Mike Jones wrote:

> I am starting to work on a package for biojava that can be used for MS
> experimental data. Initially for proteins. So I need a way to annotate
> amino acids with their atomic mass. I would appreciate the help of those
> who have done such things. Can I just modify AlphabetManager.xml,
> say, by adding a new Alphabet?
>
> I would rather not rewrite each symbol but if I were this is how it
> would look.
> <alphabet name="RESIDUE_MASS" parent="PROTEIN">
>     <symbol name="s">
>             <short>S</short>
>             <long>SER</long>
>             <mono-mass>87.03203</mono-mass>
>             <avg-mass>87.0782</avg-mass>
>     </symbol>
>
> ...
>
> To do this though I imagine I would have to modify
> AlphabetManager.symbolFromXML.
>
> Please let me know if I am missing something or if anybody has any
> ideas.
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l@biojava.org
> http://biojava.org/mailman/listinfo/biojava-l


From Raakesh.Syal@requisite.com Wed Nov 1 15:22:55 2000 Date: Wed, 1 Nov 2000 10:22:55 -0500 From: Raakesh Syal Raakesh.Syal@requisite.com Subject: [Biojava-l] biojava learning tools
Hi, I am a science major with a background in programming.  I would like some
more information about learning biojava, either in the form of online
tutorials or books.
Thanks
Raakesh Syal

From mjones@mpi.com Wed Nov 1 19:54:57 2000 Date: Wed, 01 Nov 2000 14:54:57 -0500 From: Mike Jones mjones@mpi.com Subject: [Biojava-l] Anotatable Symbol
I think the interface idea sounds good, but doesn't that seem like a lot of
extra classes if you would make one for each property type? I would need at
least 2 for residue masses (monoisotopic and average masses). Maybe it could
be more generic, like:



public interface SymbolProperty {
  FiniteAlphabet getAlphabet();
  Object getValue(Symbol s, String type)
    throws IllegalSymbolException, UnknownTypeException;
}

Also why would I want to return a FiniteAlphabet for each SymbolProperty?

I would like to get a better look at the AlphabetIndex source. Since I am
behind a pretty serious firewall here, I can't use CVS to get the latest
source. Do you have a zipped archive containing the code?



From vij_ivai@hotmail.com Mon Nov 6 05:03:32 2000 Date: Mon, 06 Nov 2000 00:03:32 EST From: Vijay Narayanasamy vij_ivai@hotmail.com Subject: [Biojava-l] BLAST Networking Questions
Dear all,

        I'm playing with the following project through this weekend. I guess
someone will have done this, or know how to do it, already.

I would like to do the following:

1. Get the sequence data from the user with a GUI.

2. Send the sequence to the BLAST NCBI server

3. Get the output from the server

4. Present the output (maybe in a different form) to the user.

So the questions are: how do I connect to the BLAST server, and how do I
submit the data to the appropriate database search?

Is it possible to do this with Java Servlets? If so, how? Any other
suggestions or comments?

Sincerely,

Vijay

nvijay@psu.edu

http://www.personal.psu.edu/vxn115



From garnhart@cisunix.unh.edu Mon Nov 6 12:03:30 2000 Date: Mon, 06 Nov 2000 07:03:30 -0500 From: Nancy J. Garnhart garnhart@cisunix.unh.edu Subject: [Biojava-l] BLAST Networking Questions
NCBI provides a stable URL that may be used to perform BLAST searches from
another program (i.e., without interactive use of a Web browser). A
demonstration client (ftp://ncbi.nlm.nih.gov/blast/blasturl/) and a README
demonstrate how to access this URL.

The above is copied right out of the BLAST overview page:
http://www.ncbi.nlm.nih.gov/BLAST/blast_overview.html

Nancy

on 11/6/00 12:03 AM, Vijay Narayanasamy at vij_ivai@hotmail.com wrote:

> Dear all,
> 
> I'm playing with the following project thru this weekend. I would
> like to do the following. I guess some one would have done this or know this
> already.
> 
> I would like to do the following:
> 
> 1. Get the sequence data from the user with a GUI.
> 
> 2. Send the sequence to the BLAST NCBI server
> 
> 3. Get the output from the server
> 
> 4. Present the output (may be in a different form) to the user.
> 
> So the questions are , how to connect with the BLAST server and how to input
> the data in the appropriate database search?
> 
> Is it possible to do with Java Servlets? If so how? Any other suggestions or
> comments?
> 
> Sincerely,
> 
> Vijay
> 
> nvijay@psu.edu
> 
> http://www.personal.psu.edu/vxn115
> 


From Robin.Emig@maxygen.com Mon Nov 6 21:00:08 2000 Date: Mon, 6 Nov 2000 13:00:08 -0800 From: Emig, Robin Robin.Emig@maxygen.com Subject: [Biojava-l] Symbols are 1 Char?
I am trying to create a translation program that is based on a codon bias
table. I am having a little trouble actually creating the class, though,
because I thought I'd create it as follows:

a class with the following members:
SimpleDistribution (where the alphabet is DNA codons)
Translation Table (where one alphabet is codons and the other is AAs)

The problem is that the alphabets (built from symbols) are only 1-char
elements, i.e. I can't represent ATG as a symbol. Am I missing something? Is
there a way to have a symbol be multiple chars? Even the interface defines
it as a char.
-Robin



Robin Emig
Bioinformatics Specialist
515 Galveston Dr
Redwood City, CA 94063
Maxygen Inc
650-298-5493



From td2@sanger.ac.uk Tue Nov 7 12:06:13 2000 Date: Tue, 7 Nov 2000 12:06:13 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] Symbols are 1 Char?
On Mon, Nov 06, 2000 at 01:00:08PM -0800, Emig, Robin wrote:
>
> 	I am trying to create a translation program that is based off of a
> codon bias table. I am having a little trouble actually creating the class
> though because I thought I'd create it as follows
> 
> a Class with the following members
> SimpleDistribution (where the alphabet is DNA codons)
> Translation Table (where one alphabet is codons and the other is AA's)
> The problem is that the alphabets (built from symbols) are only 1 char
> elements, ie I can't represent ATG as a symbol. Am I missing something, is
> there a way to have a symbol be multiple chars? Even the interface defines
> it as a char.

Hi...

BioJava Symbol objects certainly aren't tied to representing
a single `char'.  There is a convenience method, getToken(),
which returns a char, but there isn't a requirement that this
be anything meaningful (checks documentation -- yes, looks like
the documentation of getToken() could do with some clarification...)

The easy way to represent codons is to use a cross-product
alphabet.  This is an ordered list of `child' alphabets, and
contains symbols which are ordered lists of symbols from
these child alphabets.  So you can do something like:

  // Generate the alphabet DNA x DNA x DNA

  CrossProductAlphabet codonAlphabet = AlphabetManager.
          getCrossProductAlphabet(Collections.nCopies(3, DNATools.getDNA()));

  // Obtain a specific symbol from the codon alphabet

  List baseList = new ArrayList();
  baseList.add(DNATools.a());
  baseList.add(DNATools.t());
  baseList.add(DNATools.g());
  Symbol startCodon = codonAlphabet.getSymbol(baseList);


You can do all the normal tricks with a cross-product alphabet,
including constructing a distribution, and using it to store
your codon bias table.

If you call the `getToken' method on symbols in the codon alphabet,
you'll get a unique (but not meaningful) char.  On the other hand,
getName() will return a sensible string representation of the
ordered list.
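
For readers without BioJava to hand, the cross-product idea itself can be
sketched with plain collections. This is not the CrossProductAlphabet
implementation: symbols are plain strings, and the CrossProduct class and
its name() format are hypothetical, chosen only to show what "ordered lists
of symbols from the child alphabets" means.

```java
import java.util.ArrayList;
import java.util.List;

class CrossProduct {
    // Builds every ordered list that takes one symbol from each child alphabet.
    static List<List<String>> of(List<List<String>> childAlphabets) {
        List<List<String>> result = new ArrayList<>();
        result.add(new ArrayList<String>());  // start from the empty tuple
        for (List<String> alphabet : childAlphabets) {
            List<List<String>> extended = new ArrayList<>();
            for (List<String> prefix : result) {
                for (String symbol : alphabet) {
                    List<String> tuple = new ArrayList<>(prefix);
                    tuple.add(symbol);
                    extended.add(tuple);
                }
            }
            result = extended;
        }
        return result;
    }

    // A readable name for a compound symbol, in the spirit of getName().
    static String name(List<String> tuple) {
        return "(" + String.join(" ", tuple) + ")";
    }
}
```

Passing three copies of the four-letter DNA alphabet, for example
Collections.nCopies(3, Arrays.asList("a", "c", "g", "t")), gives the 64
codons, each of which is one compound symbol rather than one char.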

Hope this helps,

   Thomas.
-- 
One of the advantages of being disorderly is that one is
constantly making exciting discoveries.
                                       -- A. A. Milne

From mrp@sanger.ac.uk Tue Nov 7 13:25:26 2000 Date: Tue, 07 Nov 2000 13:25:26 +0000 From: Matthew Pocock mrp@sanger.ac.uk Subject: [Biojava-l] BLAST Networking Questions
Vijay Narayanasamy wrote:

> Dear all,
>
>         I'm playing with the following project thru this weekend. I would
> like to do the following. I guess some one would have done this or know this
> already.
>
> I would like to do the following:
>
> 1. Get the sequence data from the user with a GUI.

There is a demo called seqviewer.EmblViewer that is a crude example of how to
build a simple sequence GUI. You should be able to pull out the bits of this
that you need - sequence loading, feature rendering, scaling etc.

>
> 2. Send the sequence to the BLAST NCBI server

Nancy covered this...

>
>
> 3. Get the output from the server
>

and this.

>
> 4. Present the output (may be in a different form) to the user.

You can use org.biojava.bio.program.sax.BlastLikeSAXParser to parse the
resulting text into useful information. You could then build new features on
the query sequence (and update the viewer with them), or spit out the hits in
some text format, or whatever.

Good luck

Matthew



From td2@sanger.ac.uk Tue Nov 7 16:33:35 2000 Date: Tue, 7 Nov 2000 16:33:35 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] [1.1] Sequence I/O rethink
Hi...

I'd guess that the biological sequence I/O code is one of the most
widely useful parts of BioJava.  The current system has
served us quite well so far, but there are some issues that
have cropped up, and I think the time might be ripe for a
proper discussion of what we want from the package in the
future.

Issues which would be worth addressing (in no particular order):

  - It's not entirely clear how to handle alignments within
    the current I/O framework.

  - SequenceFormat classes tend to be tightly coupled to
    one particular mechanism for constructing SymbolLists.
    The mechanism used by all the current SequenceFormats
    is rather inefficient (both in time and space) when 
    handling very long pieces of sequence.

  - There is not always an easy way to control the rules
    used to convert data from a sequence file into BioJava
    Annotation bundles and Feature objects.  Some attempts
    /have/ been made in this direction (look at FastaDescriptionReader
    and FeatureBuilder).  Unfortunately, this kind of
    functionality currently has to be implemented on
    a per-format basis, and has its limitations.  For
    instance, there is no simple way to aggregate several
    feature-table entries in an EMBL file into a single
    BioJava feature.

  - The I/O framework only works on files which contain sequence
    data.  It would be nice if at least some parts of it could
    be applied to the handling of, for example, GFF files (which
    currently have an entirely separate framework).

What I'm thinking of is potentially an event-driven framework for parsing
all kinds of sequence files (in which I include sequence-and-feature
formats like EMBL, sequence-only like FASTA, feature-only like GFF,
and alignments).  We already have a simple event-driven system
in BioJava (org.biojava.bio.program.gff) and it works pretty well.
There would then be a major refactoring of SequenceFactory so that
it can act as a listener for the event stream.
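
As a rough illustration of the event-driven style (not the proposed API), a
parser can simply report what it sees to a listener and leave it to the
listener to decide what, if anything, to build. RecordListener,
FastaEventParser and LengthListener are hypothetical names, and the sketch
handles only the simplest FASTA layout.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical callback interface; the names are illustrative only.
interface RecordListener {
    void startRecord(String description);
    void residues(String chunk);
    void endRecord();
}

class FastaEventParser {
    // Reads FASTA text and reports what it sees; it builds no sequence objects itself.
    static void parse(Reader in, RecordListener listener) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        boolean open = false;
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith(">")) {
                if (open) {
                    listener.endRecord();
                }
                listener.startRecord(line.substring(1).trim());
                open = true;
            } else if (open) {
                listener.residues(line);
            }
        }
        if (open) {
            listener.endRecord();
        }
    }
}

// One possible listener: it only measures lengths and never stores the residues.
class LengthListener implements RecordListener {
    final List<String> summary = new ArrayList<>();
    private String current;
    private int length;

    public void startRecord(String description) { current = description; length = 0; }
    public void residues(String chunk) { length += chunk.length(); }
    public void endRecord() { summary.add(current + "=" + length); }
}
```

The same parser could feed a listener that builds full sequence objects, one
that only collects features, or one that just counts, without the parsing
code changing.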

NOTE: I'm talking here primarily about changes to the guts of
the I/O framework.  I hope there won't be any significant
increase in the number of lines of code needed in the simple
case of reading a sequence from a common file format (EMBL, Genbank,
FASTA).

I know there are a number of people on the list who are interested
in file parsing, so it would be good to hear everyone's thoughts
and requirements before we finalize any API.


Just to start the ball rolling, I've written an extension to the
current I/O framework which decouples SymbolList creation
from file parsing.  I've been using this myself for a few
weeks now, and it considerably improves speed (3-4 times faster)
and peak memory usage (potentially by a factor of almost two) when
reading large sequences.  This certainly doesn't address all
the issues with the I/O framework, but it shows one area where
some real improvements can be made.

If you want to try this out, there is source code and class
files in:

  http://www.biojava.org/proposals/newio.jar

There's also javadoc at:

  http://www.biojava.org/proposals/newio-doc/index.html

Any comments?

   Thomas.
-- 
One of the advantages of being disorderly is that one is
constantly making exciting discoveries.
                                       -- A. A. Milne

From rik@cs.ucsd.edu Tue Nov 7 22:40:38 2000 Date: Tue, 07 Nov 2000 14:40:38 -0800 From: Richard K. Belew rik@cs.ucsd.edu Subject: [Biojava-l] biojava.dp doc/tutorials?
mr. matthew pocock and biojava,

i'm contemplating using the dp example
as the focus for a project in a CS data structures
class i will be teaching with a bioinformatics
spin.  might anyone else have done
something similar, or developed other materials
motivating the code?  i find the author 'Samiul Hasan'
in one source file, but don't know how to contact
him either?  thanks for any help,

	rik

-- 
Richard K. Belew                          rik@cs.ucsd.edu
                              http://www.cs.ucsd.edu/~rik
Computer Science & Engr. Dept.      
Univ. California -- San Diego       858 / 534-2601       
9500 Gilman Dr. (0114)              858 / 532-0702 (msgs)
La Jolla CA 92093-0114 USA	    858 / 534-7029 (fax)

From mrp@sanger.ac.uk Wed Nov 8 15:24:24 2000 Date: Wed, 08 Nov 2000 15:24:24 +0000 From: Matthew Pocock mrp@sanger.ac.uk Subject: [Biojava-l] biojava.dp doc/tutorials?
Hi Rik,

I hope that UCSD is having better weather than we are. England seems to
be totally below water at the moment.

"Richard K. Belew" wrote:

> mr. matthew pocock and biojava,
>
> i'm contemplating using the dp example
> as the focus for a project in a CS data structures
> class will be teaching with a bioinformatics

*blush*

>
> spin.  might anyone else have done
> something similar, or developed other materials
> motivating the code?  i find the author 'Samiul Hasan'
> in one source file, but don't know how to contact
> him either?  thanks for any help,

The DP objects have changed a bit since the 1.01 release - and for the
better. We should put up a new snapshot of the project on the web-site
as soon as I fix a show-stopper bug in the alphabet indexing code. The
DP stuff was designed from the ground-up to be primarily a data
structure. For pair-wise DP, there is now an interpreter object that
performs alignments by 'interpreting' the HMM, and I have in development
(but not CVS) a compiler that 'compiles' the HMMs to java byte-code
which should be faster & produce bytecode that is *very* optimizable by
hotspot.

I have used the HMMs to model various biological sequences, and found
them to be very flexible. They may be prohibitively slow on some older
VMs for high-throughput work, but for testing architectures & training
models, this speed penalty is more than outweighed by the ease with
which you can build your particular model.

There is sadly almost no tutorial documentation. I think that Samiul is
interested in writing some. He has been using the package to model
histone binding sites, and I think he is the nearest person we have to
a user (as opposed to developer) at this time.

Please feel free to bother me at any time about how the code works, why
it looks like that, or tell me of any difficulties/bugs. I hope that it
ends up being useful as a teaching aid for your class.

All the best,

Matthew



From mrp@sanger.ac.uk Wed Nov 8 16:58:15 2000 Date: Wed, 08 Nov 2000 16:58:15 +0000 From: Matthew Pocock mrp@sanger.ac.uk Subject: [Biojava-l] New BioCorba IDL
Hello all.

BioCorba is the bio* project that defines IDL that should allow the
projects to interoperate programmatically. I think it is a very good
thing to have, particularly as it potentially allows different parts of
informatics problems to be tackled in different languages without
re-writing all the code.

Those of you subscribed to the bioxml mailing list will know that Alan
Robinson has made a proposal for a new BioCorba idl
(http://biocorba.org/pipermail/biocorba-l/2000-November/000044.html).
The new IDL should be better behaved in situations where server memory
is an issue. Regardless of how perfect it is, it is definitely an
improvement over the current data model, and handles things like feature
hierarchies more cleanly.

The BioCorba server and client have a separate life-cycle from the BioJava
core, so our code should be in a separate CVS module (but in the current
biojava repository) - how about a module called biocorba? Do any of you
use the current BioCorba client/server? Is anybody interested in being
the BioJava spokesperson for BioCorba-related things and/or our BioCorba
developer?

Anyway, thanks to Alan for putting together this revision, and getting a
reference server & client together.

Matthew


From rik@cs.ucsd.edu Wed Nov 8 18:17:21 2000 Date: Wed, 08 Nov 2000 10:17:21 -0800 From: Richard K. Belew rik@cs.ucsd.edu Subject: [Biojava-l] biojava.dp doc/tutorials?
hi matthew and samiul,

Matthew Pocock wrote:
> 
> I hope that UCSD is having better weather than we are. England seems to
> be totaly below water at the moment.

and i thought you Brits enjoyed being all wet:)  we don't
have weather in SoCal, just the sort of stasis that breeds politicians,
like Ronald Reagan.

> The DP objects have changed a bit since the 1.01 release - and for the
> better. We should put up a new snapshot of the project on the web-site
> as soon as I fix a show-stopper bug in the alphabet indexing code. 

ah!  i am having problems running the SearchProfile demo:

> Loading sequences
> java.util.NoSuchElementException: There is no parser 'symbol' defined in 
>		alphabet PROTEIN+X
>         at org.biojava.bio.symbol.AbstractAlphabet.getParser(AbstractAlphabet.java:58)
>         at SearchProfile.readSequenceDB(SearchProfile.java:110)
>         at SearchProfile.main(SearchProfile.java:21)        

maybe this is related?

the new hacks you are developing sound very neat, so do let me know
when you are ready to let others play.  note that this is a relatively
intro course in our curriculum, so my goal is to use things like DP to
understand how good, efficient designs can make java work even on
larger HMMs.  the ideal progression will be to let them work on
some small dataset (probably over DNA strings), then consider scaling
issues as the length of strings increases and we move to proteins.

> There is sadly almost no tutorial documentation. I think that Samiul is
> interested in writing some. He has been using the package to model
> histone binding sites, and I think he is the nearest persone we have to
> a user (as oposed to developer) at this time.

thanks Samiul for also replying!  can you point me to any prelim
writeups re: your use of these routines in your own work?  that might
help me slant what i develop (eg, towards data sets relevant to you)?

do you think it is worth bugging Durbin et al with this same question?
they'd be the sort of academics that i'd imagine also using this
in a class somewhere?  or are they all already listening in on biojava?

> Please feel free to bother me at any time about how the code works, why
> it looks like that, or tell me of any difficulties/bugs. 

thanks again, i'll probably take you up on that.

best,
	rik


-- 
Richard K. Belew                          rik@cs.ucsd.edu
                              http://www.cs.ucsd.edu/~rik
Computer Science & Engr. Dept.      
Univ. California -- San Diego       858 / 534-2601       
9500 Gilman Dr. (0114)              858 / 532-0702 (msgs)
La Jolla CA 92093-0114 USA	    858 / 534-7029 (fax)

From td2@sanger.ac.uk Thu Nov 9 13:47:56 2000 Date: Thu, 9 Nov 2000 13:47:56 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] [newio] Proposed event-notification interfaces

Hi...

I've been making a little more progress with my plans for
refactoring the sequence I/O framework for BioJava 1.1.  I've
attached two interfaces:

  SeqIOListener     Generic listener for events produced by
                    parsing biological sequence data

  SequenceBuilder   SeqIOListener which builds a new BioJava
                    sequence object.

Rebuilding the I/O framework around these interfaces would
meet the following objectives:

  - Decoupling all parts of the Sequence construction process
    from the file parsing.

  - An easy way to plug in filter and transducer objects between
    the parser and the Sequence construction step.

  - Potential to handle `feature-only' formats like GFF and GAME.
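
A small sketch of the filter idea from the objectives above, under heavy
simplifying assumptions: the listener here carries only sequence-wide
properties, and PropertyListener, PropertyCollector and KeyFilter are
hypothetical names rather than part of the proposal. The point is only that
a filter implements the same interface as the thing it wraps.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Cut-down, hypothetical version of a sequence event listener (properties only).
interface PropertyListener {
    void startSequence();
    void addSequenceProperty(String key, Object value);
    void endSequence();
}

// End of the chain: accumulates whatever it is told.
class PropertyCollector implements PropertyListener {
    final Map<String, Object> properties = new LinkedHashMap<>();
    boolean finished = false;

    public void startSequence() { properties.clear(); finished = false; }
    public void addSequenceProperty(String key, Object value) { properties.put(key, value); }
    public void endSequence() { finished = true; }
}

// Sits between the parser and the collector and forwards only the wanted keys.
class KeyFilter implements PropertyListener {
    private final PropertyListener delegate;
    private final Set<String> wanted;

    KeyFilter(PropertyListener delegate, Set<String> wanted) {
        this.delegate = delegate;
        this.wanted = new HashSet<>(wanted);
    }

    public void startSequence() { delegate.startSequence(); }

    public void addSequenceProperty(String key, Object value) {
        if (wanted.contains(key)) {
            delegate.addSequenceProperty(key, value);
        }
    }

    public void endSequence() { delegate.endSequence(); }
}
```

A parser would be handed the KeyFilter and would not need to know that a
collector sits behind it; several such filters could be chained in the same
way.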

Issues which are still open:

  - Exactly how should multiple sequence alignments be handled
    within the framework?  One suggestion made internally at
    sanger would be to use a separate SequenceBuilder for each
    component of the alignments.  I'd welcome comments from anyone
    who uses BioJava Alignments on this topic.  Are there any
    commonly used formats for `annotated' alignments, with
    data which should be built into BioJava feature objects?

  - Are there any extra methods on SeqIOListener which I've
    missed?  For instance, it's tempting to have a specific
    method for notifying the listener about a sequence's
    database ID, if this is present in the file.  Any thoughts?

Let me know what you think of these,

   Thomas
-- 
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
           -- Terry Pratchett

Attachment: SeqIOListener.java

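package newio;

import java.io.IOException;

import org.biojava.bio.seq.Feature;
import org.biojava.bio.symbol.IllegalSymbolException;

// SymbolReader is assumed to live in this newio package.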

/**
 * Notification interface for objects which listen to a sequence stream
 * parser.
 *
 * @author Thomas Down
 * @since 1.1 [newio proposal]
 */

public interface SeqIOListener {
    /**
     * Start the processing of a sequence.  This method exists primarily
     * to enforce the life-cycles of SeqIOListener objects.
     */

    public void startSequence();

    /**
     * Notify the listener that processing of the sequence is complete.
     */

    public void endSequence();

    /**
     * Notify the listener of symbol data.
     *
     * <p>
     * NOTE: The SymbolReader is only guaranteed to be valid within
     * this call.  If the listener does not fully read all the data,
     * the parser <em>may</em> assume that it is not required, and
     * skip it.
     * </p>
     */

    public void addSymbols(SymbolReader sr)
        throws IOException, IllegalSymbolException;

    /**
     * Notify the listener of a sequence-wide property.  This might
     * be stored as an entry in the sequence's annotation bundle.
     */

    public void addSequenceProperty(String key, Object value);

    /**
     * Notify the listener that a new feature object is starting.
     * Every call to startFeature should have a corresponding call
     * to endFeature.  If the listener is concerned with a hierarchy
     * of features, it should maintain a stack of `open' features.
     */

    public void startFeature(Feature.Template templ);

    /**
     * Mark the end of data associated with one specific feature.
     */

    public void endFeature();

    /**
     * Notify the listener of a feature property.
     */

    public void addFeatureProperty(String key, Object value);
}

Attachment: SequenceBuilder.java

package newio;

import org.biojava.bio.BioException;
import org.biojava.bio.seq.*;

/**
 * Interface for objects which accumulate state via SeqIOListener,
 * then construct a Sequence object.
 *
 * <p>
 * It is possible to build `transducer' objects which implement this
 * interface and pass on filtered notifications to a second, underlying
 * SequenceBuilder.  In this case, they should provide a
 * <code>makeSequence</code> method which delegates to the underlying
 * SequenceBuilder.
 * </p>
 *
 * @author Thomas Down
 * @since 1.1 [newio proposal]
 */

public interface SequenceBuilder extends SeqIOListener {
    /**
     * Return the Sequence object which has been constructed
     * by this builder.  This method is only expected to succeed
     * after the endSequence() notifier has been called.
     */

    public Sequence makeSequence() throws BioException;
}


From loraine@loraine.net Thu Nov 9 22:58:39 2000 Date: Thu, 9 Nov 2000 14:58:39 -0800 (PST) From: Ann Loraine loraine@loraine.net Subject: [Biojava-l] [newio] Proposed event-notification interfaces
On Thu, 9 Nov 2000, Thomas Down wrote:

> Hi...
> 
> I've been making a little more progress with my plans for
> refactoring the sequence I/O framework for BioJava 1.1.  I've
> attached two interfaces:
> 
>   SeqIOListener     Generic listener for events produced by
>                     parsing biological sequence data
> 
>   SequenceBuilder   SeqIOListener which builds a new BioJava
>                     sequence object.
> 
> Rebuilding the I/O framework around these interfaces would
> meet the following objectives:
> 
>   - Decoupling all parts of the Sequence construction process
>     from the file parsing.

Yes!  I like this concept!

> 
>   - An easy way to plug in filter and transducer objects between
>     the parser and the Sequence construction step.

Yes again!

> 
>   - Potential to handle `feature-only' formats like GFF and GAME.

You could build a double-parser that extracts coordinates from
a GFF/GAME file and then grabs the corresponding sequence out of
a fasta db.

> 
> Issues which are still open:
> 
>   - Exactly how should multiple sequence alignments be handled
>     within the framework?  One suggestion made internally at
>     sanger would be to use a separate SequenceBuilder for each
>     component of the alignments.  I'd welcome comments from anyone
>     who uses BioJava Alignments on this topic.  Are there any
>     commonly used formats for `annotated' alignments, with
>     data which should be built into BioJava feature objects?

Please allow in-between residues annotations as well as
on-top-of residues annotations.

For instance, in-between annotations are useful for mapping splice
sites onto alignments.  On-top-of annotations are useful for flagging
individual residues.

> 
>   - Are there any extra methods on SeqIOListener which I've
>     missed?  For instance, it's tempting to have a specific
>     method for notifying the listener about a sequence's
>     database ID, if this is present in the file.  Any thoughts?
> 

I would focus on designing the event class so that it can adequately
capture the information being parsed, and then write your listeners
based on the events.

Also seems like you would want to have a general enough type of event
that could handle structured information (name-value pairs, named
lists, etc) in which you don't know anything about the semantics of
what's coming.  

In cases where you do, you could have your parser broadcast more
specialized events - subclasses of your very general base class event.

The hard part in my mind is: where is the best place to put semantics?
For instance, what objects need to know about database id, locus name,
etc, and what objects just need to know about name-value/name-list pairs?

I hope this is useful!

-Ann


From jtang@gene.com Fri Nov 10 01:46:00 2000 Date: Thu, 09 Nov 2000 17:46:00 -0800 From: Jerry (Zhijun) Tang jtang@gene.com Subject: [Biojava-l] problem with biojava-1.00.jar
Hi,

I downloaded the jar file, but I have a problem including it as a library
in JBuilder4. When I used "jar xvf biojava-1.00.jar" to see the classes
in it, I got the following message:
java.util.zip.ZipException: invalid entry size (expected 156 but got 158
bytes)
        at java.util.zip.ZipInputStream.readEnd(ZipInputStream.java:355)

        ...............

Please help, Jerry


From td2@sanger.ac.uk Fri Nov 10 11:21:12 2000 Date: Fri, 10 Nov 2000 11:21:12 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] [newio] Proposed event-notification interfaces
On Thu, Nov 09, 2000 at 02:58:39PM -0800, Ann Loraine wrote:
>
> >   - Potential to handle `feature-only' formats like GFF and GAME.
> 
> You could build a double-parser that extracts coordinates from
> a GFF/GAME file and then grabs the corresponding sequence out of
> a fasta db.

Yes indeed.  This idea makes me lean even further towards
the idea that there should be some special mechanism on
the SeqIOListener interface for notifying a database ID
(so that you can easily write a SequenceBuilder which listens
for this, then goes and fetches the sequence data).

Then a structure like this should work nicely:

   GFFParser --->  FetchSymbolsSequenceBuilder ---> DefaultSequenceBuilder
                           ^
                           |
        FastaParser <------+
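
A minimal sketch of the middle box in the structure above, assuming a
much-reduced listener interface. MiniSeqListener, FetchSymbolsListener, the
"id" property key and the Map standing in for a fasta db are all hypothetical
names for illustration; they are not the proposed SeqIOListener API.

```java
import java.util.Map;

/** Hypothetical, much-reduced listener; the real proposal has more notifications. */
interface MiniSeqListener {
    void addSequenceProperty(String key, Object value);
    void addSymbols(String symbols);
    void endSequence();
}

/**
 * Forwards every notification to a delegate. When a database ID goes past,
 * it also looks up the residues for that ID and passes them downstream.
 */
class FetchSymbolsListener implements MiniSeqListener {
    private final MiniSeqListener delegate;
    private final Map<String, String> sequencesById; // stands in for a fasta db

    FetchSymbolsListener(MiniSeqListener delegate, Map<String, String> sequencesById) {
        this.delegate = delegate;
        this.sequencesById = sequencesById;
    }

    public void addSequenceProperty(String key, Object value) {
        delegate.addSequenceProperty(key, value);
        if ("id".equals(key)) { // assumed property key for the database ID
            String symbols = sequencesById.get(String.valueOf(value));
            if (symbols != null) {
                delegate.addSymbols(symbols);
            }
        }
    }

    public void addSymbols(String symbols) {
        delegate.addSymbols(symbols);
    }

    public void endSequence() {
        delegate.endSequence();
    }
}
```

Because the ID arrives here as an ordinary property, the filter has to know
which key to look for; a dedicated notification would remove that guesswork.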


> Please allow in-between residues annotations as well as
> on-top-of residues annotations.
> 
> For instance, in-between annotations are useful for mapping splice
> sites onto alignments.  On-top-of annotations are useful for flagging
> individual residues.

This isn't really an issue for the I/O framework -- I'd
assumed that the parsers would just generate standard BioJava
Location objects.  It's the current Location interface which
forbids `between positions' locations -- in particular, the use
of inclusive coordinates.

I guess it should be possible to change the Location interface,
although doing this without breaking too many of the current
semantics might not be easy.

Up to now, I've always seen the splicing problem in terms of
exon and intron features, which can be modeled fine using our
current interface, but I can see that if you want to deal with
individual splice sites, matters become harder.

> I would focus on designing the event class so that it can adequately
> capture the information being parsed, and then write your listeners
> based on the events.

My current prototype for the SeqIOListener owes more to the
SAX DocumentHandler interface and friends than to AWT event
listeners.  I'm not sure we actually need any/many specialized
event objects -- instead, I've been trying to think about each
type of record that a parser might find, and add suitable
notification methods for each.  The closest I've got to using
an event object is for the startFeature notify, where the
existing Feature.Template (and subclasses) objects are used
to wrap up the Location and other basic information for the
feature.

As things stand at the moment, the database ID of the
sequence would be passed to listeners using the
addSequenceProperty notification.  But probably it's
important enough that we should have either a special
notification method, or pass it as a parameter to the
existing startSequence notification.

Thanks,

Thomas.
-- 
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
           -- Terry Pratchett

From td2@sanger.ac.uk Fri Nov 10 11:35:21 2000 Date: Fri, 10 Nov 2000 11:35:21 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] problem with biojava-1.00.jar
On Thu, Nov 09, 2000 at 05:46:00PM -0800, Jerry (Zhijun) Tang wrote:
> I downloaded the jar file, but I have a problem including it as a library
> in JBuilder4. When I used "jar xvf biojava-1.00.jar" to see the classes
> in it, I got the following message:
> java.util.zip.ZipException: invalid entry size (expected 156 but got 158
> bytes)
>         at java.util.zip.ZipInputStream.readEnd(ZipInputStream.java:355)
> 

Hi...

There's actually a 1.01 release, which fixes a few minor bugs in
1.00.  However, I don't think that's got anything to do with
the problem you mention.  The jar files are created using Sun's
standard JAR tool, and shouldn't be causing any problems.  My
best guess is that your jar file got corrupted at some point
during or after download.  If you downloaded by HTTP (using
the link from the front page of the web site) it's possible
that your browser corrupted the file during transit.  Right now,
our web server appears to be returning the Content-Type for
jar files as text/plain (ooops) which means that many browsers
will do some newline processing on the data.  This will be bad
news for any binary file.  I'll try to get this fixed, but in
the meantime:

  - If using Netscape, try holding down the SHIFT key when you
    download a file (this might work in other browsers, too, but
    I'm not sure).

  - Download from our FTP site instead:

         ftp://ftp.biojava.org/pub/biojava

  - If this still fails, try avoiding your browser completely
    and using a command-line FTP client
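
To tell whether a downloaded copy is damaged before pointing an IDE at it,
one option is to read every entry through java.util.zip.ZipInputStream, the
same class that raised the exception in the original report. The sketch below
is only an illustration; JarCheck and countReadableEntries are made-up names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipInputStream;

/** Hypothetical helper: reads a whole archive so that damaged entries surface as exceptions. */
class JarCheck {
    /**
     * Returns the number of entries that were read to the end.
     * A damaged entry surfaces as an IOException (typically a ZipException).
     */
    static int countReadableEntries(InputStream raw) throws IOException {
        int count = 0;
        byte[] buffer = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(raw)) {
            while (zip.getNextEntry() != null) {
                while (zip.read(buffer) != -1) {
                    // Discard the data; reading to the end of the entry
                    // is what triggers the size and CRC checks.
                }
                count++;
            }
        }
        return count;
    }
}
```

A call such as
JarCheck.countReadableEntries(new FileInputStream("biojava-1.00.jar"))
should either return the entry count or throw. A result of 0 is also
suspicious, since a stream that does not begin with a zip entry simply
yields no entries.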

Hope this helps,

   Thomas.
-- 
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
           -- Terry Pratchett

From td2@sanger.ac.uk Fri Nov 10 12:00:17 2000 Date: Fri, 10 Nov 2000 12:00:17 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] problem with biojava-1.00.jar
On Fri, Nov 10, 2000 at 11:35:21AM +0000, Thomas Down wrote:
> 
>  Right now,
> our web server appears to be returning the Content-Type for
> jar files as text/plain (ooops) which means that many browsers
> will do some newline processing on the data.  This will be bad
> news for any binary file.  I'll try to get this fixed, but in
> the meantime:

Okay, we've fixed the server, and it now gives a more sensible
MIME type, so you should be able to download again without any
trouble.

Thomas.
-- 
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
           -- Terry Pratchett

From mrp@sanger.ac.uk Fri Nov 10 12:22:19 2000 Date: Fri, 10 Nov 2000 12:22:19 +0000 From: Matthew Pocock mrp@sanger.ac.uk Subject: [Biojava-l] problem with biojava-1.00.jar
*embarrassed*

The MIME-type for jar files was absent. Jars were being converted into
text/plain. They are now sent as something sensible and binary, application,
jar-ish. Could you try to download 1.01 again, and see if you still get a
corrupted jar?

Thanks & sorry.

Matthew



From mrp@sanger.ac.uk Fri Nov 10 12:32:11 2000 Date: Fri, 10 Nov 2000 12:32:11 +0000 From: Matthew Pocock mrp@sanger.ac.uk Subject: [Biojava-l] [newio] Proposed event-notification interfaces
Hi Ann.

Ann Loraine wrote:

> Please allow in-between residues annotations as well as
> on-top-of residues annotations.
>
> For instance, in-between annotations are useful for mapping splice
> sites onto alignments.  On-top-of annotations are useful for flagging
> individual residues.

The current location framework is effectively built around the concept of
sets of symbol indices. Thus, there is no 'between'. This has caused
problems for edit operations - GappedSymbolList for example is a bit tortuous
in its definition of where to insert new gap characters. If you want to think
of the current locations in terms of between-ness, then the min represents
between it and the previous symbol, and max represents between it and the
following symbol. Since min <= max, there is no way to represent 'between'.

The options are

a) A completely new position object. Pros - it can look however you want it
to. Cons - it will not play well with locations

b) A location implementation that is empty, and still represents a gap
immediately before min & immediately after max, and where max = min-1? Pros -
this would fit the current math cleanly, and would let you add features to
splice-sites. Cons - it kind-of breaks the Location concept.

c) A location implementation where max = min + 1, but is empty and represents
the position between the two indices. Pros - nothing broken. Cons - We would
have to adjust the Location docs to state that min & max are not contained in
the location in this very special case - no biggie.
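
A rough sketch of what option c) might look like, written as a standalone
class so that it does not presume anything about the real Location interface.
BetweenLocation and isInside are hypothetical names used only for illustration.

```java
/** Hypothetical sketch of option c); not part of the BioJava Location API. */
final class BetweenLocation {
    private final int min;
    private final int max;

    /** Represents the gap between the symbols at index and index + 1. */
    BetweenLocation(int index) {
        this.min = index;
        this.max = index + 1;
    }

    int getMin() {
        return min;
    }

    int getMax() {
        return max;
    }

    /** The location is empty: it contains no symbol index, not even min or max. */
    boolean contains(int index) {
        return false;
    }

    /** True when both flanking symbols lie within the inclusive range. */
    boolean isInside(int rangeMin, int rangeMax) {
        return rangeMin <= min && max <= rangeMax;
    }
}
```

The special case that would need documenting shows up in contains(): min and
max are reported, yet neither index is a member of the location.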

My vote is c). What about you?

Matthew

> I hope this is useful!
>
> -Ann
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l@biojava.org
> http://biojava.org/mailman/listinfo/biojava-l


From td2@sanger.ac.uk Fri Nov 10 17:21:12 2000 Date: Fri, 10 Nov 2000 17:21:12 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] [newio] Proposed event-notification interfaces
On Fri, Nov 10, 2000 at 12:32:11PM +0000, Matthew Pocock wrote:
> 
> The current location framework is effectively built around the concept of
> sets of symbol indices. Thus, there is no 'between'. This has caused
> problems for edit operations - GappedSymbolList for example is a bit tortuous
> in its definition of where to insert new gap characters. If you want to think
> of the current locations in terms of between-ness, then the min represents
> between it and the previous symbol, and max represents between it and the
> following symbol. Since min <= max, there is no way to represent 'between'.
> 
> The options are
> 
> a) A completely new position object. Pros - it can look however you want it
> to. Cons - it will not play well with locations

This could be a bit awkward, especially when it comes to attaching
Position objects to Features (I guess we'd want a common base
interface for Position and Location, and I haven't a clue what
that would look like).  Could also lead to quite a bit of special
case code :(.

Probably worth exploring options which use the existing Location
interface first, anyway.

> b) A location implementation that is empty, and still represents a gap
> immediately before min & immediately after max, and where max = min-1? Pros -
> this would fit the current math cleanly, and would let you add features to
> splice-sites. Cons - it kind-of breaks the Location concept.
> 
> c) A location implementation where max = min + 1, but is empty and represents
> the position between the two indices. Pros - nothing broken. Cons - We would
> have to adjust the Location docs to state that min & max are not contained in
> the location in this very special case - no biggie.

I think I prefer plan b -- to me this seems to be the smallest
possible change of current Location semantics.

One question concerns the semantics of the union operation for
`cut' locations.  If we have two cut locations, should the union
method give:

  - The empty location (i.e. the union operator is only considering
                        positions contained within the two locations).

  - A `compound cut' location -- I guess calling blockIterator on
    this will return the two individual cut-points.

  - Something else entirely?

What about the union of a `cut' location and a normal
`coverage' location?

Any thoughts?

  Thomas.
-- 
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
           -- Terry Pratchett

From mrp@sanger.ac.uk Fri Nov 10 19:11:50 2000 Date: Fri, 10 Nov 2000 19:11:50 +0000 From: Matthew Pocock mrp@sanger.ac.uk Subject: [Biojava-l] Re: DP bug query
(cc'ed to the list)

Hi Richard,

This is because it should be looking for a parser under the name "token", not "symbol". I have
changed my copy & checked it into CVS. I guess I need to check the demos more frequently. I
don't actually remember what this demo did. It is all a bit hazy back there.

Best of luck,

Matthew

"Richard K. Belew" wrote:

> hi matthew,
>
> i'm brand new to the biojava list so please excuse newbie tendencies.
>
> but the following query was buried in my (8 nov) thread around tutorials:
>
> > > Matthew Pocock wrote:
> > >
> > > The DP objects have changed a bit since the 1.01 release - and for the
> > > better. We should put up a new snapshot of the project on the web-site
> > > as soon as I fix a show-stopper bug in the alphabet indexing code.
> >
> > ah!  i am having problems running the SearchProfile demo:
> >
> > > Loading sequences
> > > java.util.NoSuchElementException: There is no parser 'symbol' defined in
> > >               alphabet PROTEIN+X
> > >         at org.biojava.bio.symbol.AbstractAlphabet.getParser(AbstractAlphabet.java:58)
> > >         at SearchProfile.readSequenceDB(SearchProfile.java:110)
> > >         at SearchProfile.main(SearchProfile.java:21)
> >
> > maybe this is related?
>
> if this is unrelated to your new fixes i'll continue to dig in on it.
>
> and since a few other topics have come and gone thru the list i
> thought this might have slid by?
>
> thanks again,
>         rik
>
> --
> Richard K. Belew                          rik@cs.ucsd.edu
>                               http://www.cs.ucsd.edu/~rik
> Computer Science & Engr. Dept.
> Univ. California -- San Diego       858 / 534-2601
> 9500 Gilman Dr. (0114)              858 / 532-0702 (msgs)
> La Jolla CA 92093-0114 USA          858 / 534-7029 (fax)


From anthonygoss@yahoo.com Sat Nov 11 00:56:30 2000 Date: Fri, 10 Nov 2000 16:56:30 -0800 From: Anthony Goss anthonygoss@yahoo.com Subject: [Biojava-l] I need some Java people
I am looking for some Java people, experience with J2EE.
Plus would be Web methods, Web Logic, Netscape Application Server, and/or KIVA.

I will give you a 15% raise from what you are making now.

Depending on your location, there may be travel involved.  But, you do not have to relocate.

Please give me a call if you are interested.

Thank you in advance,

Anthony E. Goss
Ph: 832-577-8890

From tony_parsons@sandwich.pfizer.com Sun Nov 12 22:37:12 2000 Date: Sun, 12 Nov 2000 22:37:12 -0000 From: tony_parsons@sandwich.pfizer.com tony_parsons@sandwich.pfizer.com Subject: [Biojava-l] RE: Biojava-l digest, Vol 1 #173 - 2 msgs
Oh Please!,

We are all looking for good people in this area. I don't exactly recall the
constitution of this mailing list, but I thought it was for discourse about
biojava rather than a free for all advertisement agency.

If not, can someone let me know whether blatant job advertisements for free are
OK here?

Best regards,

Tony Parsons

Dr.Tony Parsons,                          VOX :   + 44 1304 646596 
Information Management & Architecture,    FAX :   + 44 1304 656285 
Pfizer Central Research,                  e-mail:
Tony_Parsons@sandwich.pfizer.com 
Sandwich, 
CT13 9NJ UK 

-----Original Message-----
From: biojava-l-request@biojava.org
[mailto:biojava-l-request@biojava.org]
Sent: 11 November 2000 17:01
To: biojava-l@biojava.org
Subject: Biojava-l digest, Vol 1 #173 - 2 msgs



From Robin.Emig@maxygen.com Tue Nov 14 02:47:32 2000 Date: Mon, 13 Nov 2000 18:47:32 -0800 From: Emig, Robin Robin.Emig@maxygen.com Subject: [Biojava-l] TokenParser and CrossProduct
	I just tried to use the TokenParser on a crossproduct alphabet and
it didn't work because the tokenParser class constructor sets up a map
between a single character and symbol. Can a registered cvs person fix this?
-Robin



Robin Emig
Bioinformatics Specialist
515 Galveston Dr
Redwood City, CA 94063
Maxygen Inc
650-298-5493



From td2@sanger.ac.uk Tue Nov 14 12:35:52 2000 Date: Tue, 14 Nov 2000 12:35:52 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] TokenParser and CrossProduct
On Mon, Nov 13, 2000 at 06:47:32PM -0800, Emig, Robin wrote:
> 	I just tried to use the TokenParser on a crossproduct alphabet and
> it didn't work because the tokenParser class constructor sets up a map
> between a single character and symbol. Can a registered cvs person fix this?

Just out of interest, are you actually explicitly constructing
a TokenParser, or using the form:

  Alphabet alpha = ...
  Parser alphaTokens = alpha.getParser("token");

The "token" parser of a given alphabet is only defined if
there exists a well-defined mapping between Symbols in that
alphabet and printable characters in the unicode set.  This
is true of the simple DNA, RNA, and Protein alphabets, and
I guess also for some other simple alphabets you might want
to work with (dice rolls, coin tosses, whatever).  Cross
Product symbols are harder -- I guess we could define a
standard single-char representation for some cases, like
DNA x DNA, but it might be hard to get this accepted as
a standard outside BioJava.  And things get /really/ complicated
once you get to alphabets like ((DNA x DNA x DNA) x Protein)
(which is an entirely reasonable use of cross-products -- you
might use that to represent an alignment of coding DNA against a
protein sequence).

On the other hand, CrossProductAlphabets do have a defined
"name" parser.  The symbols have names like (cytosine, adenine).
This is a pretty verbose format for storing large amounts of
alignment, but it is at least unambiguous.
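
To illustrate why the "name" form stays unambiguous where single characters
run out, here is a small sketch that parses a flat tuple such as
(cytosine, adenine) by looking each name up in the map for its position.
Everything in it is hypothetical: CrossProductNames is not a BioJava class,
component symbols are represented by plain Characters, and nested products
like ((DNA x DNA x DNA) x Protein) are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for flat (non-nested) cross-product symbol names. */
class CrossProductNames {
    /**
     * Parses text of the form "(name1, name2, ...)". The i-th name is looked
     * up in the i-th map, which stands in for that component alphabet's
     * name parser.
     */
    static List<Character> parse(String text, List<Map<String, Character>> components) {
        String trimmed = text.trim();
        if (!trimmed.startsWith("(") || !trimmed.endsWith(")")) {
            throw new IllegalArgumentException("Expected (name, name, ...): " + text);
        }
        String[] names = trimmed.substring(1, trimmed.length() - 1).split(",");
        if (names.length != components.size()) {
            throw new IllegalArgumentException("Expected " + components.size()
                    + " names but found " + names.length);
        }
        List<Character> result = new ArrayList<>();
        for (int i = 0; i < names.length; i++) {
            Character symbol = components.get(i).get(names[i].trim());
            if (symbol == null) {
                throw new IllegalArgumentException("Unknown name at position " + i
                        + ": " + names[i].trim());
            }
            result.add(symbol);
        }
        return result;
    }
}
```

The result is a list with one component per position, which is exactly the
information a single printable character cannot carry in the general case.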

You are of course welcome to define your own token-mapping
and parser implementation for your favourite cross-product
alphabets, but unless you're working with a very common case,
I'm not sure if this really belongs in the BioJava core.

What definitely does need doing is some more documentation
about the relationship between alphabets and parsers, and
the cases where token-mappings do and don't exist.  We may
also want to change the SymbolParser interface a little bit 
as we switch to the new event-based I/O framework.  I'm still
very open to ideas about how CrossProductSymbols and 
Alignments ought to be handled for I/O.  So we may be able
to get something like the behaviour you want in future.

Happy hacking,

   Thomas.
-- 
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
           -- Terry Pratchett

From mrp@sanger.ac.uk Tue Nov 14 14:50:10 2000 Date: Tue, 14 Nov 2000 14:50:10 +0000 From: Matthew Pocock mrp@sanger.ac.uk Subject: [Biojava-l] RE: Biojava-l digest, Vol 1 #173 - 2 msgs
Anthony Goss,

The Bio* mailing lists are for discussing Java in Bioinformatics, and in
particular Biojava. Do not send recruitment adverts via this list. If you have
any queries with regards to BioJava, or any of the other Bio* projects then
please contact me personally.

Matthew Pocock

tony_parsons@sandwich.pfizer.com wrote:

> Oh Please!,
>
> We are all looking for good people in this area. I don't exactly recall the
> constitution of this mailing list, but I thought it was for discourse about
> biojava rather than a free for all advertisement agency.
>
> If not can someone let me know if blatant job advertisements for free OK
> here?
>
> Best regards,
>
> Tony Parsons
>
> Dr.Tony Parsons,                          VOX :   + 44 1304 646596
> Information Management & Architecture,    FAX :   + 44 1304 656285
> Pfizer Central Research,                  e-mail:
> Tony_Parsons@sandwich.pfizer.com
> Sandwich,
> CT13 9NJ UK


From Robin.Emig@maxygen.com Tue Nov 14 18:37:03 2000 Date: Tue, 14 Nov 2000 10:37:03 -0800 From: Emig, Robin Robin.Emig@maxygen.com Subject: [Biojava-l] NameParser
	Is the best way to deal with situations where multiple tokens (or names)
are really the same Symbol to subclass NameParser and add checks in it
that simply map the redundant names to a proper unique one, and then parse?
	The reason I ask is that I am reading in CodonBiasTables which often
have END, TER or STP as the stop/terminal codon. I don't mind representing
all of these as the same symbol, because they are in my case, but I wanted
to know if there was a better way to do this, such as editing/creating an
alphabet to do this. I was also thinking of possibly creating a translation
alphabet, essentially something that could set up all the mappings for a
java.util.Map.
-Robin

From mrp@sanger.ac.uk Tue Nov 14 19:03:36 2000 Date: Tue, 14 Nov 2000 19:03:36 +0000 From: Matthew Pocock mrp@sanger.ac.uk Subject: [Biojava-l] NameParser
Hi Robin,

Just to make sure we are on the same page, you are suggesting that END, TER and
STP all be legal names for a single termination symbol in the
protein-with-termination alphabet (retrievable from
ProteinTools.getTAlphabet()). The codon tables map from DNA^3 to
protein-with-termination, and the codon-bias tables give you a distribution over
DNA^3 for a given protein-with-termination symbol.

I suggest a three-part solution.

1) Add a method to NameParser that lets you associate a name with a symbol. It
will look something like:

addSymbolForName(String name, Symbol sym) throws IllegalSymbolException;

It will add a map from name to sym, assuming that sym is within the alphabet for
the parser, and that name is not currently in use in that parser. You may wish
to add the corresponding remove method for breaking associations.

2) Add a 'synonym' element to the AlphabetManager.xml resource, and to the
termination symbol add the synonyms.

3) Modify AlphabetManager.java so that it adds the synonyms to the name parser.
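
Step 1 could look roughly like the sketch below. It is deliberately detached
from the real NameParser: SynonymNameParser is a hypothetical class, the
symbol type is left generic, the alphabet is a plain Set, and
IllegalArgumentException stands in for IllegalSymbolException.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a name parser that accepts several names per symbol. */
class SynonymNameParser<S> {
    private final Set<S> alphabet;
    private final Map<String, S> symbolsByName = new HashMap<>();

    SynonymNameParser(Set<S> alphabet) {
        this.alphabet = alphabet;
    }

    /** Maps name to sym, provided sym is in the alphabet and name is unused. */
    void addSymbolForName(String name, S sym) {
        if (!alphabet.contains(sym)) {
            throw new IllegalArgumentException("Symbol not in alphabet: " + sym);
        }
        if (symbolsByName.containsKey(name)) {
            throw new IllegalArgumentException("Name already in use: " + name);
        }
        symbolsByName.put(name, sym);
    }

    /** Breaks the association for name, if there is one. */
    void removeSymbolForName(String name) {
        symbolsByName.remove(name);
    }

    /** Returns the symbol registered under name. */
    S parseName(String name) {
        S sym = symbolsByName.get(name);
        if (sym == null) {
            throw new IllegalArgumentException("No symbol for name: " + name);
        }
        return sym;
    }
}
```

With END, TER and STP all registered against the one termination symbol, a
codon-bias table using any of the three spellings would parse to the same
object.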

Does this sound do-able, or is it a bit complex?

All the best,

Matthew

"Emig, Robin" wrote:

>         Is the best way to deal with situations where multiple tokens (or names)
> are really the same Symbol to subclass NameParser and add checks in it
> that simply map the redundant names to a proper unique one, and then parse?
>         The reason I ask is that I am reading in CodonBiasTables which often
> have END, TER or STP as the stop/terminal codon. I don't mind representing
> all of these as the same symbol, because they are in my case, but I wanted
> to know if there was a better way to do this, such as editing/creating an
> alphabet to do this. I was also thinking of possibly creating a translation
> alphabet, essentially something that could set up all the mappings for a
> java.util.Map.
> -Robin


From alan.mcculloch@agresearch.co.nz Thu Nov 16 04:43:45 2000 Date: Thu, 16 Nov 2000 17:43:45 +1300 From: McCulloch, Alan alan.mcculloch@agresearch.co.nz Subject: [Biojava-l] database for biojava
Does anybody have any tips on the right approach to setting up a database
on top of which would sit biojava ?

The platform will be Oracle 8 and I am very keen to NOT do my
own data model (in the same way I'm keen to not do my own api/object
design which is why I want to use something like biojava !) - I want to 
use a standard model if possible, if there is such a thing.

Can a relational data model of some sort be derived from biojava ?

Maybe I could use something from the bioxml  project ?

I'd be grateful for any tips on where to start.

thanks

Alan McCulloch
Bioinformatics Software Engineer
AgResearch NZ

PS

One thing I'm interested in is the possibility of using Oracle CLOBs and LOBs
to perhaps store structured data or documents in single database fields (and
so avoid a totally normalised design for storing document contents
, which can be a pain) - however this is secondary to trying to use
a "standard" sequence data model if possible, if there is such a thing.



From td2@sanger.ac.uk Fri Nov 17 12:16:41 2000 Date: Fri, 17 Nov 2000 12:16:41 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] [newio] Initial code landing
Hi...

I've been working on a first implementation of the proposed
new I/O interfaces.  Things still aren't set in stone, but some
practical tests should make further development much easier.
So far, the interfaces seem to be working very well.

Unless there are any objections, I'd like to get these changes
into the BioJava 1.1 main codebase as soon as possible.  I'd like
to check them in late this afternoon (probably around 18:00 UTC).
Please let me know now if this is likely to cause any problems.

For people who are currently relying on CVS BioJava, it's probably
worth grabbing an up-to-date copy now before these changes land.

The good news is that for simple applications, there should only
be one change to make your code compatible with the new I/O:
StreamReader is now constructed with a SequenceBuilderFactory
(new interface) rather than the old SequenceFactory.  I hope
everyone can start using the new interfaces soon.

Happy hacking!

   Thomas.
-- 
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
           -- Terry Pratchett

From mrp@sanger.ac.uk Fri Nov 17 12:41:32 2000 Date: Fri, 17 Nov 2000 12:41:32 +0000 From: Matthew Pocock mrp@sanger.ac.uk Subject: [Biojava-l] byte-code generator
Hi.

Those of you working from the cvs repository may have noticed that there
is a new file called bytecode.jar in there. It is a fully functional
library for generating java byte-code and then loading and running this
code (credits: Thomas did most of the coding, he and I designed it & I
have done some debugging). It can be effectively used as a
macro-assembler for the Java VM. In the core biojava project it is
currently used for generating the projected feature proxies (magic
supplied by Thomas), and in my experimental dynamic-programming
compiler.

We intend to release the bytecode generator under the LGPL as a separate
package from biojava. It is currently in a cvs module called bytecode (a
sister project to biojava-live). It has a separate development cycle from
biojava, so it is appropriate to keep releases of each one separate. As
far as we are aware, it is the only open-source program of its kind (we
may be wrong - tell us). It also has the cleanest API of any of the
bytecode generators that I evaluated (of course, I would say that!). It
is light-weight enough that generating classes is not noticeably more
expensive than loading byte-code direct from disk.

If you are interested in this functionality, feel free to check out the
bytecode module from anonymous CVS.

It is currently under(un)documented. This is bleeding-edge stuff, so
treat it carefully. If it doesn't do what you expect, then it may well
be doing the wrong thing. As always, all questions, bugs, ideas and flames
are gratefully received.

Matthew Pocock


From armhold@cs.rutgers.edu Fri Nov 17 15:29:16 2000 Date: Fri, 17 Nov 2000 10:29:16 -0500 From: George Armhold armhold@cs.rutgers.edu Subject: [Biojava-l] introduction, and some BLAST/Genscan code
Hello,

I just subscribed to the biojava list and would like to say hello to
everyone, as well as offer up some code.  Browsing through the list
archives I found a message from Vijay Narayanasamy who was looking for
some code to talk to the BLAST server at NCBI.  It so happens that I
just completed such a class, and I'm happy to share it with anyone
that may find it useful. Here's a (simplified) example:

       String mySequence = createSequence();
       BlastConnection blast = new BlastConnection();
       blast.setQuerySequence(mySequence);
       blast.setProgram(BlastConnection.BLASTN);
       blast.setDatabase(BlastConnection.DBEST);
       blast.setExpect(10f);
       String requestID = blast.submit();

       // wait some amount of time for server to process

       String results = blast.getResults(requestID);
       if (results.equals(BlastConnection.IN_PROGRESS))
          System.out.println("request ID " + requestID +
                             " is still in progress.");
       else
          System.out.println(results);
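
The "wait some amount of time" step above could be turned into a simple
polling loop. The following is only a sketch under assumptions: it does not
use the real BlastConnection class; a Supplier<String> stands in for the
call to blast.getResults(requestID), and the IN_PROGRESS string is a made-up
sentinel playing the role of BlastConnection.IN_PROGRESS.

```java
import java.util.function.Supplier;

final class PollDemo {
    // Made-up sentinel meaning "not ready yet".
    static final String IN_PROGRESS = "IN_PROGRESS";

    // Calls fetch up to maxAttempts times, sleeping delayMillis between tries.
    // Returns the first result that is not IN_PROGRESS, or null if no result
    // arrived in time (or the thread was interrupted while waiting).
    static String poll(Supplier<String> fetch, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String result = fetch.get();
            if (!IN_PROGRESS.equals(result)) {
                return result;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // keep the interrupt flag set
                    return null;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Fake server: reports "in progress" twice, then returns a result.
        int[] calls = {0};
        Supplier<String> fakeServer =
                () -> (++calls[0] < 3) ? IN_PROGRESS : "3 hits found";
        System.out.println(poll(fakeServer, 5, 10L));
    }
}
```

Against a real public server the delay would want to be generous (many
seconds rather than milliseconds) so as not to hammer it.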


I also have some code for talking to a Genscan server.  The code has
been in use at our site for a few weeks, but has not seen extensive
testing yet.  Source, binaries and documentation are available at
http://bigbio.rutgers.edu/~armhold/bioinf.



--
George Armhold
Rutgers University
Bioinformatics Initiative

From mrp@sanger.ac.uk Fri Nov 17 15:47:17 2000 Date: Fri, 17 Nov 2000 15:47:17 +0000 From: Matthew Pocock mrp@sanger.ac.uk Subject: [Biojava-l] introduction, and some BLAST/Genscan code
Thanks George,

This looks like really cool code. It is the part we are missing from the
process. CAT checked in blast parsing code (very cool) and Gerald Loeffler
checked in a set of interfaces for requesting and representing blast hits.
A standard object that actually performs the blast search would be great.

If you are interested in contributing and maintaining this code and/or
becoming a BioJava developer, I can sort you out with a CVS account. In
the mean time, I will read through the rest of your docs. You may like to
look at the documentation for the packages:

org.biojava.bio.search
org.biojava.bio.program
org.biojava.bio.program.sax
org.biojava.bio.program.xml

We are between releases at the moment (hopefully we should be able to get a
1.1 out the door in the not-too-distant future...), so none of the APIs are
above discussion.

Thanks for the message. All the best,

Matthew

> _______________________________________________
> Biojava-l mailing list  -  Biojava-l@biojava.org
> http://biojava.org/mailman/listinfo/biojava-l


From armhold@cs.rutgers.edu Fri Nov 17 16:05:52 2000 Date: Fri, 17 Nov 2000 11:05:52 -0500 From: George Armhold armhold@cs.rutgers.edu Subject: [Biojava-l] introduction, and some BLAST/Genscan code
I should clarify something.  My BlastConnection code interacts with
the WWW server at NCBI, not the Blast server directly.  It basically
does an HTTP POST to submit the sequence.  So it is subject to the
whims of their webmaster, should they decide to change their CGI
script.  I hope my previous message was not misleading.  I am planning
on working on something that does talk to the server directly, which
would be the Java equivalent of a "blast client".  (If anyone is
currently working on this I'd like to talk with them.)
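
To make the "HTTP POST" point concrete, here is a sketch of how a set of
form parameters is URL-encoded into a POST body. It only builds the body
string and contacts no server. The parameter names (PROGRAM, QUERY) are
placeholders chosen for the example, not the actual fields of the NCBI CGI
script.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class FormBodyDemo {
    // Joins the parameters as key=value pairs separated by '&',
    // URL-encoding each key and value.
    static String encodeForm(Map<String, String> params) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        return body.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("PROGRAM", "blastn");        // placeholder field name
        params.put("QUERY", ">seq1\nACGT");     // placeholder field name
        System.out.println(encodeForm(params));
    }
}
```

Because the field names and the response format belong to the web front end,
any change on the server side would need a matching change here, which is
the fragility described above.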


--
George Armhold
Rutgers University
Bioinformatics Initiative

From hlapp@gmx.net Fri Nov 17 17:49:05 2000 Date: Fri, 17 Nov 2000 09:49:05 -0800 From: Hilmar Lapp hlapp@gmx.net Subject: [Biojava-l] byte-code generator
Matthew Pocock wrote:
> 
> We intend to release the bytecode generater under lgpl as a seperate
> package to biojava. It is currently in a cvs module called bytecode (a
> sister project to biojava-live). It has a sepreate development cycle to
> biojava, so it is apropreate to keep releases of each one seperate. As
> far as we are aware, it is the only open-source program of it's kind (we
> may be wrong - tell us). It also has the cleanest API of any of the
> bytecode generaters that I evaluated (of course, I would say that!). It
> is light-weight enough that generating classes is not noticably more
> expensive than loading byte-code direct from disk.
> 

I'm sure you evaluated the JavaClass API
(http://www.inf.fu-berlin.de/~dahm/JavaClass/), too (which BTW is also
open-source). I can't judge the extent of overlap between the
functionality of that package and your own, but there probably is some.
Was there a particular reason not to collaborate with them, or is their
interface not clean enough?

	Hilmar

-- 
-----------------------------------------------------------------
Hilmar Lapp                                email: hlapp@gmx.net
GNF, San Diego, Ca. 92122                  phone: +1 858 812 1757
-----------------------------------------------------------------

From td2@sanger.ac.uk Fri Nov 17 18:12:45 2000 Date: Fri, 17 Nov 2000 18:12:45 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] [newio] Code landed

I've just checked in the first revision of my new sequence I/O
implementation.  There's still more work left to be done, but
hopefully most of the framework is now in place.  Please everyone
test this, read the code, shout at me if I've got something wrong,
etc., etc.

What's new:

  - Event-notification based sequence input, with full
    decoupling of the parsing from Sequence object creation.

  - A standard way to filter sequence and feature-table
    data as it is read into BioJava -- just implement the
    SequenceBuilder interface (see FastaDescriptionLineParser
    and EmblProcessor for examples)

  - Faster and more memory-efficient parsing of large sequences.

  - The irritating FASTA line-length bug dead and gone 
    forever :).
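
To illustrate what "event-notification based" input with parsing decoupled
from object creation can look like, here is a self-contained sketch. The
SeqListener interface and its method names are invented for this example
and are not the actual newio interfaces; the point is only that the parser
emits events and the listener decides what, if anything, to build.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical listener; the real newio interfaces are not reproduced here.
interface SeqListener {
    void startSequence(String description);
    void addResidues(String residues);
    void endSequence();
}

final class FastaEvents {
    // Parser: turns FASTA text into events and knows nothing about
    // what kind of object (if any) the listener builds from them.
    static void parse(BufferedReader in, SeqListener listener) throws IOException {
        boolean open = false;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith(">")) {
                if (open) listener.endSequence();
                listener.startSequence(line.substring(1).trim());
                open = true;
            } else if (open && !line.trim().isEmpty()) {
                listener.addResidues(line.trim());
            }
        }
        if (open) listener.endSequence();
    }

    // One possible listener: keeps only a name and a length per record,
    // so the residues themselves are never held in memory.
    static final class LengthCollector implements SeqListener {
        final List<String> summaries = new ArrayList<>();
        private String name;
        private int length;
        public void startSequence(String description) { name = description; length = 0; }
        public void addResidues(String residues) { length += residues.length(); }
        public void endSequence() { summaries.add(name + ":" + length); }
    }

    static List<String> summarize(String fasta) {
        LengthCollector collector = new LengthCollector();
        try {
            parse(new BufferedReader(new StringReader(fasta)), collector);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector.summaries;
    }

    public static void main(String[] args) {
        System.out.println(summarize(">a\nACGT\nAC\n>b\nGG\n"));
    }
}
```

A filter in this style is just another listener that forwards (possibly
modified) events to the listener it wraps.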

What's currently missing:

  - No GENBANK parser.  If anyone else wants to take this
    on, feel free (look at the new EmblLikeFormat and
    EmblProcessor classes for ideas), otherwise I'll try
    to revive the old implementation.

  - IndexedSequenceDB was clobbered by one of the internal
    API changes -- it's not a hard fix, but I've temporarily
    disabled it until we've worked out the neatest way to fit
    this functionality onto the new framework.

How to use it:

From the outside, I've tried to make the minimum possible API
changes.  If you just use the I/O framework via the StreamReader
class, the only major change you'll see is that you now need to
provide a SequenceBuilderFactory in place of the old SequenceFactory.
The `standard' implementation is at SimpleSequenceBuilder.FACTORY.
In practice, though, you may want to wrap this up in one or more extra
layers of sequence processing.

As a quick example, I've attached a newio version of the GCContent
demo program.  I'm in the process of updating the other demo programs
in the repository.

For people who were previously using EmblParser, this has now
been replaced by a lighter-weight EmblLikeParser (which should
also work for formats like SwissProt, Transfac, UTRdb, and
so on).  Output from this is converted into something resembling
the old parser using the EmblProcessor filter class.

Happy hacking!

   Thomas.


PS. For anyone who wants a copy of the last BioJava without newio,
    a checkout at 17:00 UTC today should be safe.
-- 
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
           -- Terry Pratchett

[Attachment: GCContent.java]

package seq;

import java.io.*;

import org.biojava.bio.seq.io.*;
import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;

public class GCContent {
    public static void main(String[] args)
        throws Exception
    {
        if (args.length != 1)
            throw new Exception("usage: java GCContent filename.fa");
        String fileName = args[0];

        // Set up stream reader

        Alphabet dna = DNATools.getDNA();
        SymbolParser dnaParser = dna.getParser("token");
        BufferedReader br = new BufferedReader(
                                new FileReader(fileName));
        SequenceBuilderFactory fact = new FastaDescriptionLineParser.Factory(
                                          SimpleSequenceBuilder.FACTORY);
        StreamReader stream = new StreamReader(br,
                                               new FastaFormat(),
                                               dnaParser,
                                               fact);

        // Iterate over all sequences in the stream

        while (stream.hasNext()) {
            Sequence seq = stream.nextSequence();
            int gc = 0;
            for (int pos = 1; pos <= seq.length(); ++pos) {
                Symbol sym = seq.symbolAt(pos);
                if (sym == DNATools.g() || sym == DNATools.c())
                    ++gc;
            }
            System.out.println(seq.getName() + ": " +
                               ((gc * 100.0) / seq.length()) +
                               "%");
        }
    }
}


From td2@sanger.ac.uk Mon Nov 20 11:46:42 2000 Date: Mon, 20 Nov 2000 11:46:42 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] Xerces-J updated
I've just upgraded the xerces.jar file in the biojava-live CVS
tree to match Xerces-J 1.2.1.  This has better support for XML
Schema validation, and also includes a fix which is relied on by
some new code I'll be checking in this afternoon.

Unless you are relying on some very specific bit of Xerces
behaviour, you should just be able to do a normal CVS update
and pick up the new jar file -- but your update may take a few
minutes longer than normal.

Let me know if this causes any problems,


   Thomas.
-- 
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
           -- Terry Pratchett

From td2@sanger.ac.uk Mon Nov 20 12:03:56 2000 Date: Mon, 20 Nov 2000 12:03:56 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] BioJava DAS client
Hi...

After a couple of times when it's managed to get lost in my
TODO list, I've now merged the phase 1 BioJava DAS client into
the main CVS tree.  If you are interested, take a look at
the package

  org.biojava.bio.program.das

(remember, if you are doing a CVS update, use "update -dP"
to pick up newly created directories).

With this package, you can create a BioJava-style SequenceDB
which reflects the contents of a DAS reference datasource, then
layer feature sets from one or more `annotation' servers on top
of the reference sequence.  The code is currently missing a query-
optimizer module I'm working on at the moment, and it needs more
testing against different server implementations.  It should,
however, be fully functional -- if you can't access your favourite
DAS server, please report this as a bug. 

What I haven't done yet is build a graphical client application.
If anyone is interested in working on this, it shouldn't be too
hard to wire the client code up to the BioJava GUI packages to give
at least a first-pass attempt at a viewer.  Matthew demonstrated
something like this at ISMB over the summer, so we know it's
possible :).

For people who are interested in DAS, there is now a new web
site for the protocol specifications:

  http://biodas.org/

Happy hacking,

   Thomas
-- 
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
           -- Terry Pratchett

From td2@sanger.ac.uk Mon Nov 20 15:15:59 2000 Date: Mon, 20 Nov 2000 15:15:59 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] Dazzle server update
The latest development version of Dazzle, my DAS server toolkit, is now
available via CVS from the BioJava repository.  Details of
how to access this can be found on http://cvs.biojava.org/.
The module name is "dazzle".

Dazzle is a Java servlet which uses the BioJava core APIs
for handling sequence data and tables of features.
The idea is to provide a simple framework for handling DAS
requests and generating the basic documents, which can be
parameterized for specific purposes by plugging in one or
more DASDataSource objects.  These allow the same servlet
to work well as either a reference or an annotation server
on the DAS network.  I've got a couple of plans in mind for
Dazzle:

  - Combine with the experimental biojava-ensembl bridge code
    to serve human genome data directly out of the Ensembl
    project's SQL database.  This is a good testcase for Dazzle
    as a practical, scalable (!) server, as well as providing
    a sensible reference point for other people wanting to offer
    human annotation.  The bridge code is very close to a position
    where I could serve sequence data (a little bit more work is
    still needed to get genes and features working, though).

  - Package dazzle with a standalone servlet container (Tomcat?
    Interalia picoServer? something else?) and a simple admin
    tool to give a `10 minute' DAS server installation.  This
    should allow you to drop in some GFF/Game/whatever files and
    start serving annotations straight away.

There are a few changes since the 0.04 tarball I put out a 
while back:

  - No longer needs to build from the same source tree as
    a DAS client -- I use the standard BioJava client code
    instead.

  - Some of the scalability bottlenecks fixed (but still more
    to go -- startup time for annotation servers is rather slower
    than I'd hope).

  - Tidied up the output -- should be able to generate 100%
    compliant DAS/0.98 documents.

If anyone is in a hurry to try it, the instructions for 0.04 should
still work.  Otherwise, in the next few days I'm hoping to make the
following changes:

  - Stabilise the DASDataSource interface

  - Migrate to servlets 2.2 (it currently builds against 2.1,
    but I don't know of any production quality 2.1 containers)

  - Improve lazy data source instantiation.

  - Write new installation documents.

I'll make a `proper' release once these changes are made.

Thomas.
-- 
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
           -- Terry Pratchett

From mrp@sanger.ac.uk Mon Nov 20 15:32:03 2000 Date: Mon, 20 Nov 2000 15:32:03 +0000 From: Matthew Pocock mrp@sanger.ac.uk Subject: [Biojava-l] byte-code generator
Hi Hilmar,

The honest answer is that when we were searching for available bytecode
generators, we didn't find JavaClass. Of the ones we did find, none met our
specs (both sensible & open-source). From looking at JavaClass, I think it is
aimed more at disassembling and editing bytecode, whereas ours is optimized
for generating it from scratch, either as the back-end for a compiler, or as a
macro-assembler. Also, lots of the things that you have to handle explicitly
in JavaClass (constant pool entries, jump points and the like) we take care of
"under the hood", so that it is much easier to write class generators that are
like C++ templates, and to organically re-use functionality (e.g. re-use a max
or isNaN macro).

Horses for courses. It is a shame that we didn't spot this one earlier.

Matthew

Hilmar Lapp wrote:

> I'm sure you evaluated the JavaClass API
> (http://www.inf.fu-berlin.de/~dahm/JavaClass/), too (which BTW is also
> open-source). I can't judge the extent of overlap between the
> functionality of that package and your own, but there probably is some.
> Was there a particular reason not to collaborate with them, or is their
> interface not clean enough?
>
>         Hilmar
>


From td2@sanger.ac.uk Mon Nov 20 16:10:13 2000 Date: Mon, 20 Nov 2000 16:10:13 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] [newio] Light refactoring of the SeqIOListener interface
(This is only really of interest to people who write SequenceFormats
or SequenceBuilders -- the external API is unchanged since Friday)

In the interest of simplicity, I've changed the SeqIOListener
interface slightly so that we just notify the listener of blocks
of Symbols, rather than passing around SymbolReader objects.
This still allows us to optimize the SymbolList creation process --
performance and peak memory are essentially unchanged by this.
But the API is slightly simpler, and it will make it much easier
to write SequenceFormats which sit on top of some other parser
system (I'm thinking especially about XML formats here sitting
on top of SAX parsers).
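
As an illustration of the "blocks of symbols" style, here is a sketch of a
builder that is handed arbitrary-sized blocks and stores them in fixed-size
chunks, so no single huge array is ever reallocated. Everything here is an
assumption made for the example: plain chars stand in for Symbol objects,
and the addSymbols signature is invented rather than copied from
SeqIOListener.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkedBuilderDemo {
    private final int chunkSize;
    private final List<char[]> fullChunks = new ArrayList<>();
    private char[] current;
    private int used = 0;

    ChunkedBuilderDemo(int chunkSize) {
        this.chunkSize = chunkSize;
        this.current = new char[chunkSize];
    }

    // Called by a parser with each block of symbols it has read:
    // `length` symbols starting at block[start].
    void addSymbols(char[] block, int start, int length) {
        int copied = 0;
        while (copied < length) {
            if (used == chunkSize) {
                // Current chunk is full: set it aside and start a new one.
                fullChunks.add(current);
                current = new char[chunkSize];
                used = 0;
            }
            int n = Math.min(length - copied, chunkSize - used);
            System.arraycopy(block, start + copied, current, used, n);
            used += n;
            copied += n;
        }
    }

    int length() {
        return fullChunks.size() * chunkSize + used;
    }

    String makeString() {
        StringBuilder sb = new StringBuilder(length());
        for (char[] chunk : fullChunks) {
            sb.append(chunk);
        }
        sb.append(current, 0, used);
        return sb.toString();
    }

    public static void main(String[] args) {
        ChunkedBuilderDemo builder = new ChunkedBuilderDemo(4);
        builder.addSymbols("ACGTAC".toCharArray(), 0, 6);
        builder.addSymbols("xxGGxx".toCharArray(), 2, 2);
        System.out.println(builder.length() + " " + builder.makeString());
    }
}
```

The same addSymbols call works equally well when the blocks come from a
line-based flat-file parser or from the character callbacks of a SAX parser.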

Let me know if there are any problems with this,

   Thomas.
-- 
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
           -- Terry Pratchett

From lstein@cshl.org Mon Nov 20 16:28:18 2000 Date: Mon, 20 Nov 2000 11:28:18 -0500 (EST) From: Lincoln Stein lstein@cshl.org Subject: [Biojava-l] Dazzle server update
Outstanding, on both the client and dazzle fronts!

Lincoln


-- 
========================================================================
Lincoln D. Stein                           Cold Spring Harbor Laboratory
lstein@cshl.org			                  Cold Spring Harbor, NY

NOW HIRING BIOINFORMATICS POSTDOCTORAL FELLOWS AND PROGRAMMERS. 
PLEASE WRITE FOR DETAILS.
========================================================================

From birney@ebi.ac.uk Mon Nov 20 17:14:12 2000 Date: Mon, 20 Nov 2000 17:14:12 +0000 (GMT) From: Ewan Birney birney@ebi.ac.uk Subject: [Biojava-l] Re: Dazzle server update
On Mon, 20 Nov 2000, Thomas Down wrote:

> The latest development version of Dazzle, my DAS server toolkit, is now
> available via CVS from the BioJava repository.  Details of
> how to access this can be found on http://cvs.biojava.org/.
> The module name is "dazzle".


Wow...

I am very excited about this. If we can mount an ensembl reference server
that will mean there is (another) heavy server serving out data.


I am not sure if you have met the wonders of the static_golden_path table
in ensembl yet, but if not, I suspect you will soon. Drop me a note
if you want a quick tour ;)



From td2@sanger.ac.uk Mon Nov 20 18:45:56 2000 Date: Mon, 20 Nov 2000 18:45:56 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] Oooops...
Just realized that I forgot to `CVS add' a vital file when I committed
my last round of I/O changes -- I'll check it in tomorrow am.

Guess that if we are playing by EnsEMBL rules I owe large
quantities of beer... (although, in my defence, everything
/did/ compile and run on my machine before I checked in...)

If anyone is desperate, I'll try to e-mail the file this evening.

Thomas@home (currently without CVS access).
-- 
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
           -- Terry Pratchett

From birney@ebi.ac.uk Mon Nov 20 18:58:39 2000 Date: Mon, 20 Nov 2000 18:58:39 +0000 (GMT) From: Ewan Birney birney@ebi.ac.uk Subject: [Biojava-l] Oooops...
On Mon, 20 Nov 2000, Thomas Down wrote:

> Just realized that I forgot to `CVS add' a vital file when I committed
> my last round of I/O changes -- I'll check it in tomorrow am.
> 
> Guess that if we are playing by EnsEMBL rules I owe large
> quantities of beer... (although, in my defence, everything
> /did/ compile and run on my machine before I checked in...)

<smile> beers rules rock </smile>

On bioperl/ensembl, I keep two checked-out directory structures when I am
working: one for development, and one for paranoid "clean room" tests. In
fact, it is better to have the second on a different machine if possible....

(ewan, who has been here many times before)


> 
> If anyone is desperate, I'll try to e-mail the file this evening.
> 
> Thomas@home (currently without CVS access).

-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
<birney@ebi.ac.uk>. 
-----------------------------------------------------------------


From td2@sanger.ac.uk Tue Nov 21 10:47:50 2000 Date: Tue, 21 Nov 2000 10:47:50 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] ChunkedSymbolListBuilder.java
...is now safely checked in...

*blush*

Thomas.
-- 
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
           -- Terry Pratchett

From td2@sanger.ac.uk Wed Nov 22 14:08:57 2000 Date: Wed, 22 Nov 2000 14:08:57 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] [Dazzle] Servlet API update, and a `quick test' release
I've now upgraded the Dazzle servlet to work with the Servlet
API version 2.2 -- previous versions used 2.1, but this now
seems to be dead.  It should also work with 2.3 containers,
once they start to appear.

For people looking for a Servlet 2.2 container, the leading
open source product seems to be Tomcat (http://jakarta.apache.org/).
I've been testing with Tomcat 3.2beta7, and this seems to work
well.

One of the advantages of servlet 2.2 containers is that they
support a standard servlet deployment mechanism.  I've prepared
a `quick test' release of Dazzle, based on the configuration I've
been using for some of my own testing.  You can download this
at:

  ftp://ftp.biojava.org/pub/biojava/dazzle/dazzle-test-0.05.war

This contains the servlet itself, the libraries it requires
(BioJava and Xerces-J), some test data, and a deployment
descriptor file, all wrapped up in the Servlet 2.2 `Web application'
format.

To try it out:

  - Install a servlet 2.2 container (tomcat)

  - Download the test distribution and rename it to `das.war'

  - Drop das.war into the deployment directory of the container
    (The standard Tomcat distribution has a `webapps' directory).

  - Restart the servlet container

  - Test by pointing your web browser to:
        <base_url_of_container>/das
    This should give an HTML `welcome' page generated by the
    servlet

Let me know how this works out -- I'd be especially interested
if anyone was testing using a container other than Tomcat.

In the meantime, I'm still working on the bridge which will
allow us to serve EnsEMBL -- watch this space.

Thomas.
-- 
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
           -- Terry Pratchett

From td2@sanger.ac.uk Wed Nov 22 15:19:57 2000 Date: Wed, 22 Nov 2000 15:19:57 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] database for biojava
Just found this languishing at the end of my INBOX -- sorry...

On Thu, Nov 16, 2000 at 05:43:45PM +1300, McCulloch, Alan wrote:
> Does anybody have any tips on the right approach to setting up a database
> on top of which would sit biojava ?
> 
> The platform will be Oracle 8 and I am very keen to NOT do my
> own data model (in the same way I'm keen to not do my own api/object
> design which is why I want to use something like biojava !) - I want to 
> use a standard model if possible, if there is such a thing.
> 
> Can a relational data model of some sort be derived from biojava ?

It certainly should be possible to build a new relational model
based on BioJava.  Our basic model (simple sequence data,
hierarchical features) is really pretty simple -- the only
problems I can see might be:

  - Sparse locations -- it'll be a little bit of extra work to
    store these in the relational model.  I guess I'd go for
    having a `span' table:

      create table location_span (
        location_id         int not null,
        min_pos             int not null,
        max_pos             int not null
      ) ;
 
    So each location is modeled by one or more location_span
    rows.  Of course, the BioJava interfaces don't actually
    /require/ you to store sparse locations -- only implement
    this if you're actually going to need it.

  - Polymorphic features -- I guess the easiest way might be to
    have a separate table for each class of Feature object you
    want to store, but this means hardwiring the supported
    feature classes at a fairly low level.  Another approach
    would be to have a table like:

      create table feature (
          id               sequence,
          sequence_id      int not null,
          parent_id        int,
          location_id      int not null,
          type             text,
          source           text,
          biojava_feature  blob
      ) ;
      
    so you're storing the `universal' properties of the feature,
    and then serializing the whole feature object and dumping it
    in the blob. 
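
As a sketch of how the two ideas above might look from the Java side: a
sparse location is flattened into one (location_id, min_pos, max_pos) row
per contiguous span, and an object is turned into bytes for a blob column
using standard Java serialization. The Span class and the helper names are
invented for this example and are not BioJava classes; the JDBC code that
would actually write the rows is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LocationRows {
    // One contiguous block of a location (min and max inclusive). Hypothetical class.
    static final class Span implements Serializable {
        private static final long serialVersionUID = 1L;
        final int min;
        final int max;
        Span(int min, int max) { this.min = min; this.max = max; }
    }

    // Produces one {location_id, min_pos, max_pos} row per span.
    static List<int[]> toRows(int locationId, List<Span> spans) {
        List<int[]> rows = new ArrayList<>();
        for (Span s : spans) {
            rows.add(new int[] { locationId, s.min, s.max });
        }
        return rows;
    }

    // Serializes an object to the bytes that would go into the blob column.
    static byte[] toBlob(Serializable obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    // Reads an object back from blob bytes.
    static Object fromBlob(byte[] blob) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(blob))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<Span> exons = Arrays.asList(new Span(100, 200), new Span(300, 450));
        for (int[] row : toRows(7, exons)) {
            System.out.println("insert into location_span values ("
                    + row[0] + ", " + row[1] + ", " + row[2] + ");");
        }
        Span back = (Span) fromBlob(toBlob(exons.get(0)));
        System.out.println("round trip: " + back.min + ".." + back.max);
    }
}
```

One trade-off of the blob approach is that serialized objects can only be
read back by code with compatible class definitions, so the `universal'
columns are what other tools would have to query.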


But before you start implementing from scratch, you might like
to take a look at what the EnsEMBL people have been doing
(http://www.ensembl.org).  They've got a fairly sophisticated
model for storing genomic data in a relational model (currently
using MySQL, but I've had the main tables running on PostgreSQL,
and I know someone is working on an Oracle port).  The EnsEMBL
tables are more closely geared towards one specific application
than the BioJava model is, but it might be worth looking to
see if your data will fit into this model.

I've been working on some Java interfaces for EnsEMBL -- all
experimental code at the moment.  Feel free to take a look
at the following CVS modules if you're interested (in the main
BioJava repository):

  ensembl           Lightweight Java wrappers round the ensembl
                    SQL tables (largely complete for reading, maybe
                    40-50% done for writing)

  biojava-ensembl   Bridge which allows EnsEMBL databases to be
                    viewed as BioJava SequenceDBs  (currently
                    pretty experimental)

Hope this helps,

Thomas.
-- 
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
           -- Terry Pratchett

From loraine@loraine.net Mon Nov 27 06:47:59 2000 Date: Sun, 26 Nov 2000 22:47:59 -0800 (PST) From: Ann Loraine loraine@loraine.net Subject: [Biojava-l] BioJava DAS client
> 
> What I haven't done yet is build a graphical client application.
> If anyone is interested in working on this, it shouldn't be too
> hard to wire the client code up to the BioJava GUI packages to give
> at least a first-pass attempt at a viewer.  Matthew demonstrated
> something like this at ISMB over the summer, so we know it's
> possible :).
> 

You also might want to check out Jazz - a Java open source toolkit for
building graphical applications, such as genome and sequence map viewers!

Jazz is the open source, Java heir to Pad++, the groundbreaking zooming
graphical interface project headed by Ben Bederson, now a prof at the
Human Computer Interaction Lab at the University of Maryland, USA.

The URL: http://www.cs.umd.edu/hcil/jazz/

-Ann


> For people who are interested in DAS, there is now a new web
> site for the protocol specifications:
> 
>   http://biodas.org/
> 
> Happy hacking,
> 
>    Thomas
> -- 
> ``If I was going to carry a large axe on my back to a diplomatic
> function I think I'd want it glittery too.''
>            -- Terry Pratchett
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l@biojava.org
> http://biojava.org/mailman/listinfo/biojava-l
> 


From kdj@sanger.ac.uk Mon Nov 27 14:13:18 2000 Date: 27 Nov 2000 14:13:18 +0000 From: Keith James kdj@sanger.ac.uk Subject: [Biojava-l] Fasta & EMBL feature table parsing
Hi,

I'm one of the Sanger Pathogen Sequencing Unit annotators and I've
been writing/using OO Perl stuff for EMBL feature table editing,
Blast/Fasta/HMMER/EMBOSS etc. sequence analysis. I'm a Java newbie
looking to see if the 'grass is greener' on the Java side of the
fence.

I spent a weekend reading the Javadoc and trying things out. No
problems. Now I have some questions:

I want to implement a Fasta search output parser (for the nicer -m 10
form of output). I have a Perl implementation right now. Going through
the list archive I found lots of discussion regarding the Blast
SAX-type parser. Would this be the preferred way to cope with Fasta?
This might be a bit of a challenge for me as I am initially confused
by the various layers of the SAX-type system, but I'm sure I'll sort
it out.

(How does the SAX-type parser fit in with the code in
org.biojava.bio.search?)

And an observation:

The EMBL flatfile feature table parser (at least, as it was until the
new io stuff) would overwrite qualifiers. e.g. where there were
several /gene names in a feature, only the last one would be
retained. Also quirks similar to earlier Bioperl (like discarding
information from < and > in locations, which is important for us to
keep). Are these going to be addressed in the io shakeup?

On a related note, if nobody is going to implement writeSequence for
EMBL, then I'll offer to do it.

cheers,

Keith

-- 

-= Keith James - kdj@sanger.ac.uk - http://www.sanger.ac.uk/Users/kdj =-
The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA

From mrp@sanger.ac.uk Mon Nov 27 15:03:07 2000 Date: Mon, 27 Nov 2000 15:03:07 +0000 From: Matthew Pocock mrp@sanger.ac.uk Subject: [Biojava-l] Fasta & EMBL feature table parsing
Hi Keith,

You should drop in some time and say hello (D322).

Keith James wrote:

> Hi,
>
> I'm one of the Sanger Pathogen Sequencing Unit annotators and I've
> been writing/using OO Perl stuff for EMBL feature table editing,
> Blast/Fasta/HMMER/EMBOSS etc. sequence analysis. I'm a Java newbie
> looking to see if the 'grass is greener' on the Java side of the
> fence.
>
> I spent a weekend reading the Javadoc and trying things out. No
> problems. Now I have some questions:
>

Wow - you could make stuff work from reading the docs? They must be better
than I remember...

>
> I want to implement a Fasta search output parser (for the nicer -m 10
> form of output). I have a Perl implementation right now. Going through
> the list archive I found lots of discussion regarding the Blast
> SAX-type parser. Would this be the preferred way to cope with Fasta?
> This might be a bit of a challenge for me as I am initially confused
> by the various layers of the SAX-type system, but I'm sure I'll sort
> it out.
>

SAX would be the ideal way to do this, but as you say, it does require a
level of effort that may be disproportionately high.

>
> (How does the SAX-type parser fit in with the code in
> org.biojava.bio.search?)
>

bio.search specifies how the biojava objects for representing search methods
& results should appear. The parsing framework specifies how the results
flow through the application as a stream of data. It is easy to build
bio.search objects from the xml streams by extracting interesting stuff.
However, with the streams, you can do on-the-fly translation into other
formats e.g. HTML. You could also build the bio.search objects directly from
the fasta search output, or build them to represent the results of your
personal search algorithm written in Java.

>
> And an observation:
>
> The EMBL flatfile feature table parser (at least, as it was until the
> new io stuff) would overwrite qualifiers. e.g. where there were
> several /gene names in a feature, only the last one would be
> retained. Also quirks similar to earlier Bioperl (like discarding
> information from < and > in locations, which is important for us to
> keep). Are these going to be addressed in the io shakeup?
>

The qualifier overwriting should be addressed by the new IO (fingers
crossed). Fuzzy locations are evil. I ducked handling this one until
somebody required it. You require it, so I guess the days of ducking are
over. I am willing to add a new implementation of the Location interface
called FuzzyLocation. It will have isMinFuzzy() and isMaxFuzzy() boolean
methods, and will decorate another Location for all the other location
methods. This way I think we can store everything & lose nothing. Sounds
good?
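
A minimal sketch of the decorator idea, assuming a cut-down stand-in for
the Location interface. MiniLocation, MiniRange and FuzzyDecorator are
hypothetical names used only for illustration, not the classes in CVS:

```java
// Hypothetical stand-in for the real Location interface, cut down to
// three methods so the example is self-contained.
interface MiniLocation {
    int getMin();
    int getMax();
    boolean contains(int point);
}

// A plain closed range [min, max].
final class MiniRange implements MiniLocation {
    private final int min;
    private final int max;

    MiniRange(int min, int max) {
        this.min = min;
        this.max = max;
    }

    public int getMin() { return min; }
    public int getMax() { return max; }
    public boolean contains(int point) { return point >= min && point <= max; }
}

// Decorator: records whether each end was written with < or > and
// forwards every other question to the wrapped location.
final class FuzzyDecorator implements MiniLocation {
    private final MiniLocation parent;
    private final boolean minFuzzy;
    private final boolean maxFuzzy;

    FuzzyDecorator(MiniLocation parent, boolean minFuzzy, boolean maxFuzzy) {
        this.parent = parent;
        this.minFuzzy = minFuzzy;
        this.maxFuzzy = maxFuzzy;
    }

    public boolean isMinFuzzy() { return minFuzzy; }
    public boolean isMaxFuzzy() { return maxFuzzy; }

    // Pure delegation for the ordinary location methods.
    public int getMin() { return parent.getMin(); }
    public int getMax() { return parent.getMax(); }
    public boolean contains(int point) { return parent.contains(point); }

    // Renders the location in EMBL feature table style, e.g. <1..336
    @Override
    public String toString() {
        return (minFuzzy ? "<" : "") + getMin() + ".." + (maxFuzzy ? ">" : "") + getMax();
    }
}
```

The wrapper only adds the two flags. Everything else is answered by the
wrapped location, so the coordinates and the < and > markers can both be
written back out.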

>
> On a related note, if nobody is going to implement writeSequence for
> EMBL, then I'll offer to do it.

Thanks - once the new IO has settled down, this would be great.

>
>
> cheers,
>
> Keith
>

Matthew

>
> --
>
> -= Keith James - kdj@sanger.ac.uk - http://www.sanger.ac.uk/Users/kdj =-
> The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA


From simon.brocklehurst@CambridgeAntibody.com Mon Nov 27 15:29:55 2000 Date: Mon, 27 Nov 2000 15:29:55 +0000 From: Simon Brocklehurst simon.brocklehurst@CambridgeAntibody.com Subject: [Biojava-l] Fasta & EMBL feature table parsing
Hi Keith,

First, yes the grass is greener with Java ;-)

The SAX2 event-based parsing framework is designed to be extensible (for
example as well as the blast/wu-blast/hmmer stuff, there is
proof-of-principle 3-D structure stuff which will be enhanced shortly).

I'm sure you're not alone about being confused - I don't think there is
enough documentation there to make it easy to get going on using the parsers
to build applications, let alone extending the system by writing new SAX
parsers.

I have been meaning to put up some more documentation and tutorials on the
biojava web site to make it easy for people to get going.  As a start on
this, I will try to get some UML class diagram stuff up late today.  This
should certainly help you figure out what classes can be reused.

The place to start with this kind of thing is to figure out exactly what
SAX2 events you will need to throw.  What this means is that you need to
work out what the XML format would be if your data was actually in XML
format, and then put together an XML DTD or Schema to describe it.
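
As a rough illustration of what throwing SAX2 events looks like, here is
a sketch that reports one search hit to a standard
org.xml.sax.ContentHandler. The element names (Hit, Score), the id
attribute and the HitEventEmitter class are assumptions made up for this
example. They are not part of the existing framework or of any agreed DTD:

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical emitter: turns one parsed hit into the SAX2 events that
// would be produced by <Hit id="..."><Score>...</Score></Hit>.
final class HitEventEmitter {
    private final ContentHandler handler;

    HitEventEmitter(ContentHandler handler) {
        this.handler = handler;
    }

    void emitHit(String id, String score) throws SAXException {
        // Attributes travel with the start-element event.
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "id", "id", "CDATA", id);
        handler.startElement("", "Hit", "Hit", atts);

        // Text content is delivered through characters().
        handler.startElement("", "Score", "Score", new AttributesImpl());
        char[] chars = score.toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endElement("", "Score", "Score");

        handler.endElement("", "Hit", "Hit");
    }
}
```

Any ContentHandler can sit on the receiving end, for example one that
builds objects or one that writes HTML, which is why settling the element
names first is the useful starting point.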

If you have any detailed questions, please feel free to drop a note to the
list and I will do my best to help.

Simon
--
Simon M. Brocklehurst, Ph.D.
Head of Bioinformatics & Advanced IS
Cambridge Antibody Technology
The Science Park, Melbourn, Cambridgeshire, UK
http://www.CambridgeAntibody.com/
mailto:simon.brocklehurst@CambridgeAntibody.com



From kdj@sanger.ac.uk Mon Nov 27 17:11:19 2000 Date: 27 Nov 2000 17:11:19 +0000 From: Keith James kdj@sanger.ac.uk Subject: [Biojava-l] Fasta & EMBL feature table parsing
>>>>> "Matthew" == Matthew Pocock <mrp@sanger.ac.uk> writes:

    >>  And an observation:
    >> 
    >> The EMBL flatfile feature table parser (at least, as it was
    >> until the new io stuff) would overwrite qualifiers. e.g. where
    >> there were several /gene names in a feature, only the last one
    >> would be retained. Also quirks similar to earlier Bioperl (like
    >> discarding information from < and > in locations, which is
    >> important for us to keep). Are these going to be addressed in
    >> the io shakeup?

    Matthew> The qualifier overwriting should be addressed by the new
    Matthew> IO (fingers crossed). Fuzzy locations are evil. I ducked
    Matthew> handling this one until somebody required it. You
    Matthew> require it, so I guess the days of ducking are over. I am
    Matthew> willing to add a new implementation of the Location
    Matthew> interface called FuzzyLocation. It will have isMinFuzzy()
    Matthew> and isMaxFuzzy() boolean methods, and will decorate
    Matthew> another Location for all the other location methods. This
    Matthew> way I think we can store everything & lose
    Matthew> nothing. Sounds good?

I think we call fuzzy locations something different e.g.

FT   fuzzy_3p        complement(130.140..2780)
FT   fuzzy_both      123.130..789.796

Thankfully, I have some Perl classes to deal with these and I'm going
to ignore them.

The < and > fuzziness is more important for us because they signify
e.g. that there is more of the feature on an adjacent cosmid, or
perhaps just 'beware incomplete CDS'. We sometimes use this to
reconstitute bacterial genes across cosmid overlaps.

Support for these would be great.

Keith

-- 

-= Keith James - kdj@sanger.ac.uk - http://www.sanger.ac.uk/Users/kdj =-
The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA

From mrp@sanger.ac.uk Mon Nov 27 17:20:21 2000 Date: Mon, 27 Nov 2000 17:20:21 +0000 From: Matthew Pocock mrp@sanger.ac.uk Subject: [Biojava-l] Fasta & EMBL feature table parsing
Keith James wrote:

> >>>>> "Matthew" == Matthew Pocock <mrp@sanger.ac.uk> writes:
>
> I think we call fuzzy locations something different e.g.
>
> FT   fuzzy_3p        complement(130.140..2780)
> FT   fuzzy_both      123.130..789.796
>
> Thankfully, I have some Perl classes to deal with these and I'm going
> to ignore them.
>
> The < and > fuzziness is more important for us because they signify
> e.g. that there is more of the feature on an adjacent cosmid, or
> perhaps just 'beware incomplete CDS'. We sometimes use this to
> reconstitute bacterial genes across cosmid overlaps.
>
> Support for these would be great.
>

I have just checked in org.biojava.bio.symbol.FuzzyLocation which deals with
< and > locations (getMinFuzzy & getMaxFuzzy are the two properties). I
don't know how to handle the interval case (x.y rather than x..y) so I
intend to duck that until absolutely necessary.

In an earlier post, there was a request for 'between' locations - I still
can't see how to do that cleanly, so I haven't added it yet.

Matthew

>
> Keith
>
> --
>
> -= Keith James - kdj@sanger.ac.uk - http://www.sanger.ac.uk/Users/kdj =-
> The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA


From simon.brocklehurst@CambridgeAntibody.com Mon Nov 27 20:41:24 2000 Date: Mon, 27 Nov 2000 20:41:24 +0000 From: Simon Brocklehurst simon.brocklehurst@CambridgeAntibody.com Subject: [Biojava-l] Fasta & EMBL feature table parsing
Dear All,

As per my previous post, there are now some detailed UML and JavaDocs for
SAXParser writers (i.e. lots of detail e.g. including Classes with
package-level visibility etc.) up in the following location:

http://www.biojava.org/parsingTutorial1/

It's very much a beginning (understatement!). That is, if anyone is going to
get anything out of this, they really need some understanding of Java, XML
parsing using SAX2, and how SAXParsers work in general.

NB For people reading the archive, please note that the above URL is a
*temporary* location - I had to put it here due to issues with permissions
on the web server.  I expect this content (and later versions) will move
into the tutorials section of the biojava web site soon.

Simon
--
Simon M. Brocklehurst, Ph.D.
Head of Bioinformatics & Advanced IS
Cambridge Antibody Technology
The Science Park, Melbourn, Cambridgeshire, UK
http://www.CambridgeAntibody.com/
mailto:simon.brocklehurst@CambridgeAntibody.com



From kdj@sanger.ac.uk Mon Nov 27 21:18:50 2000 Date: 27 Nov 2000 21:18:50 +0000 From: Keith James kdj@sanger.ac.uk Subject: [Biojava-l] Fasta & EMBL feature table parsing
>>>>> "Simon" == Simon Brocklehurst <simon.brocklehurst@CambridgeAntibody.com> writes:

    Simon> The place to start with this kind of thing is to figure out
    Simon> exactly what SAX2 events you will need to throw.  What this
    Simon> means is that you need to work out what the XML format
    Simon> would be if your data was actually in XML format, and then
    Simon> put together a XML DTD or Schema to describe it.

That's what I figured. Our group needs to work out what DTDs we will
be using for annotation and search result interchange in general
too. I hope we can pull all this together.

I'll plough through the code and see what I can make of it. The
diagram (I've had a look at it now) is very helpful. Ta.

cheers,

-- 

-= Keith James - kdj@sanger.ac.uk - http://www.sanger.ac.uk/Users/kdj =-
The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA

From mrp@sanger.ac.uk Tue Nov 28 19:07:40 2000 Date: Tue, 28 Nov 2000 19:07:40 +0000 From: Matthew Pocock mrp@sanger.ac.uk Subject: [Biojava-l] CrossProductSymbols & stuff
Hi.

It is that time again when I am looking at the symbol and alphabet
indexes. All in all, they are working very well. The one wrinkle for me
at the moment is the Cross Product stuff. Pre 1.0 I changed Symbol so
that all symbols were ambiguous, but AtomicSymbol is a sub-interface
that guarantees that the only symbol it matches is itself. I am
proposing to do a similar flip with CrossProductSymbol - all symbols
are thought of as being cross products of other symbols, but a special
sub-set are prime - they can only be represented by raising themselves
to the power 1, not by multiplying any other symbols together.

This flip should not in practice change the day-to-day use of BioJava
one jot. It will, however, clean up some of the internals for handling
alignments and probability distributions. The gap symbol should also
become less schizoid (I hope for embarrassment's sake that none of you
have given the gap symbol a good poke around). I hope that we can get
things to be pretty much binary compatible when seen from the outside. It
will certainly have settled down by the time we get around to a 1.1
release.

This all came to light because I am trying to write strand-reversible
2nd order HMMs for modeling chromosomes, and the current scheme makes
life painful. All those with objections speak now & loudly, or next
time you check out from CVS, this will all have been silently changed.

Matthew




From Robin.Emig@maxygen.com Wed Nov 29 16:17:03 2000 Date: Wed, 29 Nov 2000 08:17:03 -0800 From: Emig, Robin Robin.Emig@maxygen.com Subject: [Biojava-l] CrossProductAlphabet
I have problems creating a cross product alphabet where the size is greater
than 1000 symbols. This is the limit at which a sparse cross product
alphabet gets created instead of a SimpleCrossProductAlphabet. I keep
getting the exception

  No symbol for token 'GGG' found in alphabet (null x null x null).

Both the original alphabet and the cross product alphabet do get
instantiated, though. I am creating a codon alphabet that includes
ambiguities.
-Robin

From Robin.Emig@maxygen.com Wed Nov 29 20:59:20 2000 Date: Wed, 29 Nov 2000 12:59:20 -0800 From: Emig, Robin Robin.Emig@maxygen.com Subject: [Biojava-l] CrossProductSymbols & stuff
How about adding a method

  AlphabetManager.getCrossProductAlphabet(collection, boolean)

where the boolean is true if we want to instantiate all the symbols, and
then putting some upper limit (say 100000) on the size before falling back
to a sparse cross product alphabet?
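
A sketch of how such a size check might look. CrossProductSizing and its
method names are hypothetical, the 1000 figure is the current cut-over
mentioned on the list, and the 15 symbols per position used in the usage
note is an assumption (the IUPAC nucleotide codes):

```java
// Hypothetical helper for choosing between a fully instantiated and a
// sparse cross product alphabet. Not part of AlphabetManager.
final class CrossProductSizing {
    // Assumed default: the existing cut-over point reported on the list.
    static final long DEFAULT_DENSE_LIMIT = 1000L;

    // The size of a cross product is the product of the component sizes.
    static long crossProductSize(int[] alphabetSizes) {
        long size = 1L;
        for (int n : alphabetSizes) {
            size = Math.multiplyExact(size, (long) n); // throws on overflow
        }
        return size;
    }

    // instantiateAll == true asks for every symbol to be created up front,
    // but only while the alphabet stays at or below upperLimit.
    static boolean useDenseAlphabet(int[] alphabetSizes, boolean instantiateAll, long upperLimit) {
        long size = crossProductSize(alphabetSizes);
        long limit = instantiateAll ? upperLimit : DEFAULT_DENSE_LIMIT;
        return size <= limit;
    }
}
```

For codons, three positions of 4 symbols give 64 entries, while three
positions of 15 symbols give 3375, which is over the default limit of
1000 but well under a limit of 100000.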

-Robin

From smarkel@netgenics.com Thu Nov 30 02:29:21 2000 Date: Wed, 29 Nov 2000 18:29:21 -0800 From: Scott Markel smarkel@netgenics.com Subject: [Biojava-l] seeking comments on proposed changes
We'd like to propose some changes and would like to get the group's
feedback.

  * Location.empty.equals(Location.empty) evaluates to false.  The
    problem is that EmptyLocation returns Integer.MIN_VALUE from the
    getMax() method and the LocationComparator determines the distance
    between the max of two Locations using subtraction.  In this case of
    comparing Location.empty to itself the max values are both maximally
    negative so subtracting does not result in 0.  We'd like to change
    EmptyLocation's equals() method.

  * FastaFormat doesn't use Java-like facilities such as reading lines
    as Strings from a BufferedReader.  We tripped over this while
    tracking down a bug regarding DOS formatted end-of-line characters
    in a FASTA file.  We have a fix to the DOS format bug that could be
    checked in, but we're wondering if using BufferedReader's readLine()
    method might be a safer approach that avoids that kind of problem.

  * We also noticed that when FastaFormat processes a sequence file a
    new String object is instantiated for each character in the sequence
    so that it can be parsed and added to the SymbolList.  We've noticed
    a big performance hit for large sequences (100K - 10M bp).

    We'd like to do one of the following.

    - Add a method that mimics parseToken(), but takes a primitive char.
      This new method might live in either SymbolParser or a derived
      interface.  Change the implementation of TokenParser's parse()
      method to not use substring(), which causes more Strings to be
      instantiated.

    - Change FastaFormat to use the current interface but instantiate a
      String per symbol in the alphabet and reuse them rather than
      creating a String per sequence character.

Comments?

Scott

-- 
Scott Markel, Ph.D.       NetGenics, Inc.
smarkel@netgenics.com     4350 Executive Drive
Tel: 858 455 5223         Suite 260
FAX: 858 455 1388         San Diego, CA  92121

From td2@sanger.ac.uk Thu Nov 30 11:31:59 2000 Date: Thu, 30 Nov 2000 11:31:59 +0000 From: Thomas Down td2@sanger.ac.uk Subject: [Biojava-l] seeking comments on proposed changes
On Wed, Nov 29, 2000 at 06:29:21PM -0800, Scott Markel wrote:
> We'd like to propose some changes and would like to get the group's
> feedback.

Great -- BioJava has grown quite a bit since the 1.0 release,
and the more review it gets before 1.1 the better.

>   * Location.empty.equals(Location.empty) evaluates to false.  The
>     problem is that EmptyLocation returns Integer.MIN_VALUE from the
>     getMax() method and the LocationComparator determines the distance
>     between the max of two Locations using subtraction.  In this case of
>     comparing Location.empty to itself the max values are both maximally
>     negative so subtracting does not result in 0.  We'd like to change
>     EmptyLocation's equals() method.

That sounds reasonable to me...

>   * FastaFormat doesn't use Java-like facilities such as reading lines
>     as Strings from a BufferedReader.  We tripped over this while
>     tracking down a bug regarding DOS formatted end-of-line characters
>     in a FASTA file.  we have a fix to the DOS format bug that could be
>     checked in, but we're wondering if using BufferedReader's readLine()
>     method might be a safer approach that avoids that kind of problem.

There is actually a very good reason why FastaFormat doesn't use
BufferedReader.readLine (I went to some trouble when I rewrote it
to stop using readLine).  The trouble is that some FASTA files have
long (potentially /very/ long) description lines.  The only way
to detect when you've hit the end of one sequence is to see the
start of the description line of the next sequence.  The contract
for the SequenceFormat.readSequence method is to read exactly one
sequence from the stream, and then leave it parked at the start
of the next sequence (this is important for allowing IndexedSequenceDB
to work).  Since Java doesn't allow truly random access on normal
streams (only mark/reset), it's actually NOT safe to use readLine --
previous versions of BioJava did this, and ended up breaking if you
used files with description lines bigger than the buffer of the
BufferedReader :(.

That said, you've clearly found a bug with FASTA files containing
return characters -- glad someone found this sooner rather than later.
But it's safer to just accept return characters as well as newlines,
rather than using readLine() -- I'll check this in in a minute
(bad Thomas for being Unix-centric).
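
A minimal sketch of the approach described here, not the code that was
checked in. It assumes the Reader supports mark/reset and is already
positioned just after a description line. FastaResidueScanner and
readResidues are hypothetical names:

```java
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper: reads the residues of one FASTA record and leaves
// the reader parked at the '>' that starts the next record.
final class FastaResidueScanner {
    static String readResidues(Reader in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("reader must support mark/reset");
        }
        StringBuilder residues = new StringBuilder();
        boolean atLineStart = true;
        while (true) {
            in.mark(1);               // never look ahead more than one char
            int c = in.read();
            if (c == -1) {
                break;                // end of stream ends the record
            }
            if (c == '\n' || c == '\r') {
                atLineStart = true;   // Unix, DOS and old Mac line ends
                continue;
            }
            if (c == '>' && atLineStart) {
                in.reset();           // un-read the start of the next record
                break;
            }
            residues.append((char) c);
            atLineStart = false;
        }
        return residues.toString();
    }
}
```

Because the mark never spans more than one character, the length of a
description line has no effect on whether reset() works.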

>   * We also noticed that when FastaFormat processes a sequence file a
>     new String object is instantiated for each character in the sequence
>     so that it can be parsed and added to the SymbolList.  We've noticed
>     a big performance hit for large sequences (100K - 10M bp).

I know this isn't ideal (although it was actually less of a problem
than I thought on the VMs I tested -- still worth fixing,
though).  I've been thinking about changes to the SymbolParser 
interface for a while, but haven't got round to doing anything.

>     We'd like to do one of the following.
> 
>     - Add a method that mimics parseToken(), but takes a primitive char.
>       This new method might live in either SymbolParser or a derived
>       interface.  Change the implementation of TokenParser's parse()
>       method to not use substring(), which causes more Strings to be
>       instantiated.
> 
>     - Change FastaFormat to use the current interface but instantiate a
>       String per symbol in the alphabet and reuse them rather than
>       creating a String per sequence character.

I'd be quite happy to see the first of these options implemented --
go ahead and do it now if you're being held back by the performance
issues.

An alternative solution which I've been thinking about is a
`symbol-stream' parsing approach.  The broad idea is that the
SymbolParser gains an extra method to create a `streaming context'
object.  Blocks of primitive chars go in, blocks of Symbols come
out.  There are two possible ways this might be done:

  - Have a method on SymbolParser which takes a java Reader
    and returns a SymbolReader (the same interface I used in
    my initial newio proposal).  SequenceFormats just provide
    a custom Reader implementation which exposes the raw sequence
    character data.

  - Have a special `streaming context' interface alongside the
    parser.  This has a (SAX-like) characters(char[], int, int)
    method.  A streaming context accepts character data, parses
    it, and passes blocks of symbols on to a SeqIOListener.
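
A rough sketch of what the second proposal could look like, using a
three-character token to show a multi-character -> Symbol encoding. All
names here (StreamingContext, SymbolBlockListener, FixedWidthContext) are
hypothetical, and Strings stand in for Symbol objects:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical listener standing in for the symbol-receiving side.
interface SymbolBlockListener {
    void addSymbols(String[] symbols, int start, int length);
}

// Hypothetical streaming context with a SAX-like characters() method.
interface StreamingContext {
    void characters(char[] data, int start, int length);
    void close();
}

// Groups incoming characters into fixed-width tokens. A token may be
// split across two calls, so the unfinished part is carried over.
final class FixedWidthContext implements StreamingContext {
    private final int width;
    private final SymbolBlockListener listener;
    private final char[] carry;
    private int carried = 0;

    FixedWidthContext(int width, SymbolBlockListener listener) {
        this.width = width;
        this.listener = listener;
        this.carry = new char[width];
    }

    public void characters(char[] data, int start, int length) {
        List<String> tokens = new ArrayList<>();
        for (int i = start; i < start + length; i++) {
            carry[carried++] = data[i];
            if (carried == width) {
                tokens.add(new String(carry, 0, width));
                carried = 0;
            }
        }
        // Pass on one block of symbols per block of characters.
        if (!tokens.isEmpty()) {
            listener.addSymbols(tokens.toArray(new String[0]), 0, tokens.size());
        }
    }

    public void close() {
        if (carried != 0) {
            throw new IllegalStateException("stream ended inside a token");
        }
    }
}
```

A format only has to hand over raw character blocks. Where the token
boundaries fall is the context's business, which is what lets the same
interface serve both single-char and multi-char encodings.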

I think I'm starting to prefer the second of these proposals,
and we then get rid of SymbolReader completely.

The reason I'd like to use one of these two systems in preference
to just having a parseToken(char) method is that, while these
approaches should be just as efficient for streams with a single
character -> Symbol encoding, they can also be used on multiple
character -> Symbol encoded streams.  I think the current
SymbolParser interface was designed with multi-char -> Symbol
encodings in mind.

On the other hand, I'm open to being told that this is overkill
and we should just concentrate on single-char -> Symbol parsing
for now.

Thomas.
-- 
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
           -- Terry Pratchett