From can.gencer at angiogenetics.se  Mon Nov  1 09:49:39 2004
From: can.gencer at angiogenetics.se (Can Gencer)
Date: Mon Nov  1 09:49:46 2004
Subject: [Biojava-l] Parsing a huge Blast File with Biojava
Message-ID: <1099320579.7620.8.camel@slyfox.angiogenetics.se>

Hello everyone,

We are trying to parse a quite large multiple BLAST results file (around
4GB), and the computer available has 1GB of RAM. However, when the code
in the cookbook is used (
"http://www.biojava.org/docs/bj_in_anger/BlastParser.htm"), using the
BlastLikeSAXParser it will give out an OutOfMemory exception after a
short while, and when I monitor the system during the parsing, I don't
see the memory usage going up significantly. It is the
parse(InputSource) method that throws the exception. Is there a way to
solve this problem?

Thanks,

Can

From smh1008 at cus.cam.ac.uk  Mon Nov  1 10:08:13 2004
From: smh1008 at cus.cam.ac.uk (David Huen)
Date: Mon Nov  1 10:06:47 2004
Subject: [Biojava-l] Parsing a huge Blast File with Biojava
In-Reply-To: <1099320579.7620.8.camel@slyfox.angiogenetics.se>
References: <1099320579.7620.8.camel@slyfox.angiogenetics.se>
Message-ID: <200411011508.13501.smh1008@cus.cam.ac.uk>

On Monday 01 Nov 2004 14:49, Can Gencer wrote:
> Hello everyone,
>
> We are trying to parse a quite large multiple BLAST results file (around
> 4GB), and the computer available has 1GB of RAM. However, when the code
> in the cookbook is used (
> "http://www.biojava.org/docs/bj_in_anger/BlastParser.htm"), using the
> BlastLikeSAXParser it will give out an OutOfMemory exception after a
> short while, and when I monitor the system during the parsing, I don't
> see the memory usage going up significantly. It is the
> parse(InputSource) method that throws the exception. Is there a way to
> solve this problem?
>
This is probably not the answer you want, but I'm parsing BLAST files at 
least as large as yours without this problem using the BlastXMLParserFacade 
class.  It may serve as a temporary workaround until someone who 
understands the other parser responds; I certainly don't.

There is also an alpha/beta-quality parser filter framework that could 
perhaps be used with the XML parser framework in CVS.

Regards,
David Huen
P.S. A number of fixes for parsing NCBI Blastn XML (the only part I use; 
the other parts may work too) have gone into the software in CVS, which may 
make it workable for you now.  In particular, the irritating DTD-related 
bug appears to be worked around.
From thomas at derkholm.net  Mon Nov  1 10:15:27 2004
From: thomas at derkholm.net (Thomas Down)
Date: Mon Nov  1 10:14:47 2004
Subject: [Biojava-l] Parsing a huge Blast File with Biojava
In-Reply-To: <1099320579.7620.8.camel@slyfox.angiogenetics.se>
References: <1099320579.7620.8.camel@slyfox.angiogenetics.se>
Message-ID: <20041101151527.GA27076@kalinda.derkholm.net>

On Mon, Nov 01, 2004 at 03:49:39PM +0100, Can Gencer wrote:
> Hello everyone,
> 
> We are trying to parse a quite large multiple BLAST results file (around
> 4GB), and the computer available has 1GB of RAM. However, when the code
> in the cookbook is used (
> "http://www.biojava.org/docs/bj_in_anger/BlastParser.htm"), using the
> BlastLikeSAXParser it will give out an OutOfMemory exception after a
> short while, and when I monitor the system during the parsing, I don't
> see the memory usage going up significantly. It is the
> parse(InputSource) method that throws the exception. Is there a way to
> solve this problem?

Hi,

When you use the BioJava blast parser as described in the BJIA
article, it does build a fairly comprehensive set of objects which
reflect the contents of the blast output.  If those objects
turn out to be bigger than your available memory, then you'll
either have to split up the output or process it in a "streaming"
fashion.

The BioJava blast parsers actually work by converting the blast
output to XML, which is then presented to a SAX ContentHandler.
The normal strategy is to use a ContentHandler which builds objects,
and this is what the BioJava BlastLikeSearchBuilder class is doing.
However, there's nothing to stop you writing a custom ContentHandler
which extracts the information you want directly from the XML
representation.  This strategy should let you process unlimited
amounts of blast output without running into memory problems, but
does involve a certain amount of work.  If you want to see what the
XML representation looks like, try the demos/nativeapps/BlastLike2XML.java 
script, included in the BioJava source distribution.
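
A minimal sketch of the streaming idea, using only the standard SAX API
rather than any BioJava class. The handler counts elements as events arrive
and keeps nothing else in memory. The class name HitCounter and the element
name passed to it are assumptions for illustration; check the real element
names in the BlastLike2XML output before relying on them.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: counts start tags with a given name and stores nothing else.
final class HitCounter extends DefaultHandler {
    private final String elementName; // assumed name; inspect the real XML first
    private long count;

    HitCounter(String elementName) {
        this.elementName = elementName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // qName is the raw, possibly prefixed, element name
        if (elementName.equals(qName)) {
            count++;
        }
    }

    long count() {
        return count;
    }

    // Convenience wrapper that runs the handler over plain XML with the JDK parser.
    static long countIn(InputSource source, String elementName) throws Exception {
        HitCounter handler = new HitCounter(elementName);
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return handler.count();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Results><Hit/><Hit/><Other/></Results>";
        System.out.println(countIn(new InputSource(new StringReader(xml)), "Hit"));
    }
}
```

The cookbook example hands its own handler to the BLAST parser with
setContentHandler(); presumably a handler like this one could be passed the
same way, but I have not verified that here.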

However, since you say "I don't see the memory usage going up
significantly", I'm wondering if your program is *really*
exhausting system memory, or if you're just hitting the default
limit on the Java heap size.  On many platforms, the default heap
size can be pretty low.  You can control it using the -Xmx and
-Xms options (try typing java -X for proper descriptions).  On 
a 1GB machine, I'd suggest trying something like:

       java -Xmx850M YourProgram

This allows Java to use the bulk of system memory, while still leaving
a bit for the operating system, etc.
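
One quick way to see whether the heap limit is the culprit is to print it
from inside the JVM. The sketch below uses only Runtime.maxMemory(); the
class name HeapCheck is made up for illustration.

```java
// Hypothetical helper: reports the maximum heap this JVM will try to use.
final class HeapCheck {
    // Value is governed by -Xmx (or the platform default when -Xmx is absent).
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Running it once without options and once with -Xmx850M should show whether
the option is taking effect.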

Hope this helps,

        Thomas.
From Peter.Ng at bccdc.ca  Mon Nov  1 13:32:21 2004
From: Peter.Ng at bccdc.ca (Ng, Peter)
Date: Mon Nov  1 13:30:49 2004
Subject: [Biojava-l] Navigating a Vector
Message-ID: <C04143BBDB37E24E820B5B6CE3EE1C6A0118D109@srvex04.phsabc.ehcnet.ca>

I'm trying to iterate through a database using a Vector and
previous/next JButtons.  How do I find the Vector index of the current
record so I can navigate forward and back in the Vector?  Thanks in
advance!
-- 
Regards,

Peter Ng
Laboratory Information Management Coordinator
Laboratory Services
BC Centre for Disease Control
655 West 12th Avenue
Vancouver BC  V5Z 4R4
Tel: 604-660-2058     
Fax: 604-660-6073
Web: www.bccdc.org


From mark.schreiber at group.novartis.com  Mon Nov  1 19:55:06 2004
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Mon Nov  1 19:53:40 2004
Subject: [Biojava-l] Navigating a Vector
Message-ID: <OFAC45B18C.71627252-ON48256F40.0004D0F7-48256F40.00050BB9@EU.novartis.net>

I'm not sure you can, especially because iterators on Vectors are not 
guaranteed to operate in any special order. If possible you should use an 
ArrayList or LinkedList. In this case you will be able to find the index 
or even ask for items by their index.

You can make a List or LinkedList out of a Vector as it is a Collection.

- Mark


_______________________________________________
Biojava-l mailing list  -  Biojava-l@biojava.org
http://biojava.org/mailman/listinfo/biojava-l


From rahul at genebrew.com  Mon Nov  1 21:54:50 2004
From: rahul at genebrew.com (Rahul Karnik)
Date: Mon Nov  1 21:44:54 2004
Subject: [Biojava-l] Navigating a Vector
In-Reply-To: <OFAC45B18C.71627252-ON48256F40.0004D0F7-48256F40.00050BB9@EU.novartis.net>
References: <OFAC45B18C.71627252-ON48256F40.0004D0F7-48256F40.00050BB9@EU.novartis.net>
Message-ID: <4186F6FA.3050405@genebrew.com>

mark.schreiber@group.novartis.com wrote:
> I'm not sure you can, especially because iterators on Vectors are not 
> guaranteed to operate in any special order. If possible you should use an 
> ArrayList or LinkedList. In this case you will be able to find the index 
> or even ask for items by their index.

While order is not guaranteed, you can actually loop over a Vector using 
a for loop and the Vector elementAt(int index) method. Besides, if you 
create an [Array|Linked]List from the Vector, you would get the same 
order. If you want to use an Iterator, Vector implements the iterator() 
method as well.

The main difference between Vector and ArrayList is that Vector is 
synchronized (thread-safe) and ArrayList is not.

http://java.sun.com/j2se/1.4.2/docs/api/java/util/ArrayList.html
http://java.sun.com/j2se/1.4.2/docs/api/java/util/Vector.html

Thanks,
Rahul
From fpepin at cs.mcgill.ca  Mon Nov  1 22:54:18 2004
From: fpepin at cs.mcgill.ca (Francois Pepin)
Date: Mon Nov  1 22:53:19 2004
Subject: [Biojava-l] Navigating a Vector
In-Reply-To: <OFAC45B18C.71627252-ON48256F40.0004D0F7-48256F40.00050BB9@EU.novartis.net>
References: <OFAC45B18C.71627252-ON48256F40.0004D0F7-48256F40.00050BB9@EU.novartis.net>
Message-ID: <1099367657.2942.290.camel@ybrig.MCB.McGill.CA>

Vector implements List, and List guarantees that the iterator visits the
elements in index order. Getting a ListIterator lets you go back and
forth.

indexOf(Object element) gives you the (first) index at which a given
element is found.
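
For the previous/next button case, a rough sketch (not from the original
question) is to keep the current index yourself and clamp it at both ends.
RecordNavigator is a made-up name, and the code uses generics, so it assumes
Java 5 or later.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Vector;

// Hypothetical helper: tracks the current position in any List, including a Vector.
final class RecordNavigator<T> {
    private final List<T> records;
    private int index; // index of the current record, starts at 0

    RecordNavigator(List<T> records) {
        if (records.isEmpty()) {
            throw new IllegalArgumentException("no records");
        }
        this.records = records;
    }

    T current() {
        return records.get(index);
    }

    int currentIndex() {
        return index;
    }

    // Moves forward one record, staying on the last record at the end.
    T next() {
        if (index < records.size() - 1) {
            index++;
        }
        return current();
    }

    // Moves back one record, staying on the first record at the start.
    T previous() {
        if (index > 0) {
            index--;
        }
        return current();
    }

    public static void main(String[] args) {
        Vector<String> v = new Vector<String>(Arrays.asList("rec1", "rec2", "rec3"));
        RecordNavigator<String> nav = new RecordNavigator<String>(v);
        nav.next();
        System.out.println(nav.current() + " at index " + nav.currentIndex());
        System.out.println("indexOf rec3: " + v.indexOf("rec3"));
    }
}
```

The button listeners would then just call next() or previous() and redisplay
current().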

Is there something I'm missing here? 

Francois

On Mon, 2004-11-01 at 19:55, mark.schreiber@group.novartis.com wrote:
> I'm not sure you can, especially because iterators on Vectors are not 
> guaranteed to operate in any special order. If possible you should use an 
> ArrayList or LinkedList. In this case you will be able to find the index 
> or even ask for items by their index.
> 
> You can make a List or LinkedList out of a Vector as it is a Collection.
> 
> - Mark

From luqiang at scbit.org  Thu Nov  4 13:42:20 2004
From: luqiang at scbit.org (Lu Qiang)
Date: Thu Nov  4 13:41:37 2004
Subject: [Biojava-l] Parsing blast result with a lot of hit
Message-ID: <200411041841.iA4IfTKr024979@portal.open-bio.org>

Hi, Guys,

If we try to parse a blast result with a lot of hits (for example, 5000 sequences blasted against themselves), the machine crashes.

This must be caused by an ArrayList storing all results.

How can we solve this problem?

regards,

Lu


From mark.schreiber at group.novartis.com  Thu Nov  4 20:01:03 2004
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Thu Nov  4 19:59:35 2004
Subject: [Biojava-l] Parsing blast result with a lot of hit
Message-ID: <OF442214E6.FCF258EE-ON48256F43.00056C0A-48256F43.00059766@EU.novartis.net>

Hello Lu Qiang -

We get this question a lot. I have posted below a recent response (by 
Thomas Down) to the same question:


Hi,

When you use the BioJava blast parser as described in the BJIA
article, it does build a fairly comprehensive set of objects which
reflect the contents of the blast output.  If those objects
turn out to be bigger than your available memory, then you'll
either have to split up the output or process it in a "streaming"
fashion.

The BioJava blast parsers actually work by converting the blast
output to XML, which is then presented to a SAX ContentHandler.
The normal strategy is to use a ContentHandler which builds objects,
and this is what the BioJava BlastLikeSearchBuilder class is doing.
However, there's nothing to stop you writing a custom ContentHandler
which extracts the information you want directly from the XML
representation.  This strategy should let you process unlimited
amounts of blast output without running into memory problems, but
does involve a certain amount of work.  If you want to see what the
XML representation looks like, try the demos/nativeapps/BlastLike2XML.java 
script, included in the BioJava source distribution.

However, since you say "I don't see the memory usage going up
significantly", I'm wondering if your program is *really*
exhausting system memory, or if you're just hitting the default
limit on the Java heap size.  On many platforms, the default heap
size can be pretty low.  You can control it using the -Xmx and
-Xms options (try typing java -X for proper descriptions).  On 
a 1GB machine, I'd suggest trying something like:

       java -Xmx850M YourProgram

This allows Java to use the bulk of system memory, while still leaving
a bit for the operating system, etc.

Hope this helps,

        Thomas.


Mark Schreiber
Principal Scientist (Bioinformatics)

Novartis Institute for Tropical Diseases (NITD)
10 Biopolis Road
#05-01 Chromos
Singapore 138670
www.nitd.novartis.com

phone +65 6722 2973
fax  +65 6722 2910


From phxgm at hotmail.com  Thu Nov  4 21:06:49 2004
From: phxgm at hotmail.com (PhxGM Gim)
Date: Thu Nov  4 21:05:43 2004
Subject: [Biojava-l] Parsing blast result with a lot of hit
Message-ID: <BAY15-F31cMhtTYDGGf00030594@hotmail.com>

What is the exact message you are receiving from the JVM when it aborts? I'm 
*assuming* it's the standard java.lang.OutOfMemoryError. You can increase the 
heap size allocated to the JVM at startup by passing a few switches to the 
java command. There are complete tutorials on how to set JVM heap sizes on 
the Sun site at java.sun.com. I have used these with some success when 
scaling Java apps and hope they are applicable to your situation.

Other than that, you can certainly do something about having all those 
instances in memory at any one time, perhaps by reading them 'on demand' from 
storage. Clearly you are going to have to solve the issue either by allocating 
more resources to the JVM or programmatically, by reading data only as needed 
instead of loading it all into memory. As I haven't yet encountered this 
particular issue in my own development with biojava, I do not know what 
constraints are imposed on a developer's ability to do this.

Again, I'm going to assume you have a Blast XML output file, which 
theoretically should be handled by either the BlastLikeSAXParser or the 
BlastXMLParser. Taken from the biojava docs on the BlastLikeSAXParser - "The 
biojava Blast-like parsing framework is designed to uses minimal memory,so 
that in principle, extremely large native outputs can be parsed and XML 
ContentHandlers can listen only for small amounts of information." 
(http://www.biojava.org/docs/api/org/biojava/bio/program/sax/BlastLikeSAXParser.html.) 
You can use 'event driven' SAX ContentHandlers that react to events 
triggered by the document you're parsing. Again, it claims to scale... 
whether it does or not is another issue.

hope this has been of at least some help,

jess vermont
chicago

From rahul at genebrew.com  Fri Nov  5 11:54:23 2004
From: rahul at genebrew.com (Rahul Karnik)
Date: Fri Nov  5 06:43:31 2004
Subject: [Biojava-l] Parsing blast result with a lot of hit
In-Reply-To: <200411041841.iA4IfTKr024979@portal.open-bio.org>
References: <200411041841.iA4IfTKr024979@portal.open-bio.org>
Message-ID: <418BB03F.3050903@genebrew.com>

Lu Qiang wrote:
> This must be caused by a ArrayList storing all results.

You have diagnosed the problem perfectly. The BlastLikeSearchBuilder
used in the BioJava in Anger example stores all the hits in an
ArrayList, which means that if you are parsing a large BLAST results
file, the whole of the file is effectively being stored in memory. The
better approach is to print the results to your output as you encounter
them. For this, you probably want to write your own implementation of
the SearchContentHandler interface (using BlastLikeSearchBuilder as a
guide) that outputs the results in the format you want, rather than
storing them in a List. Then replace BlastLikeSearchBuilder with your
own implementation.

Note that it is probably easier to up the memory available to Java, so
try that first if you haven't already. I would only recommend the
approach described above if you are running up against hardware limitations.

Thanks,
Rahul
From kvddrift at earthlink.net  Tue Nov  9 19:17:35 2004
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Tue Nov  9 19:15:47 2004
Subject: [Biojava-l] biojava and Xcode
Message-ID: <EBC744D4-32AD-11D9-85FD-003065A5FDCC@earthlink.net>

Hi,

I have been able to build biojava using Apple's Xcode 1.5. I also was 
able to make a separate small Xcode project and run some code that
uses biojava. What I would like to do is debug my 
code including the code it uses from biojava. I can step through my own 
code, but as soon as the debugger steps into a biojava function, it 
treats that as a black box, and I cannot see what happens 'under the 
hood'. Is there any way to accomplish this, either with Xcode, or 
another OS X or X11 app? I understand that this is impossible when I 
only have a jar file, but now I also have all the source code from 
biojava.

Maybe I need to create another target in the same project that contains 
the biojava source, but so far I have not been able to get this to 
work.


thanks,

- Koen.

From heuermh at acm.org  Tue Nov  9 19:30:34 2004
From: heuermh at acm.org (Michael Heuer)
Date: Tue Nov  9 19:28:44 2004
Subject: [Biojava-l] biojava and Xcode
In-Reply-To: <EBC744D4-32AD-11D9-85FD-003065A5FDCC@earthlink.net>
Message-ID: <Pine.GSO.4.44.0411091923380.7860-100000@shell3.shore.net>

Hello Koen,

Eclipse (www.eclipse.org) has a pretty slick debugger, can run a build
using ant, and runs well on MacOSX.  You can include the biojava jars in
your project and tell Eclipse where the biojava source is and it will step
into the biojava functions where appropriate.

I'd tell you exactly how to do this, but I can't stand using Eclipse
or any other IDE for very long because of their MDI windowing interfaces
(come on, I have a 1920x1200 desktop already!).  I'm an emacs/vi and
command-line maven and/or ant guy myself.

   michael



From fpepin at cs.mcgill.ca  Tue Nov  9 23:00:06 2004
From: fpepin at cs.mcgill.ca (Francois Pepin)
Date: Tue Nov  9 22:56:42 2004
Subject: [Biojava-l] biojava and Xcode
In-Reply-To: <EBC744D4-32AD-11D9-85FD-003065A5FDCC@earthlink.net>
References: <EBC744D4-32AD-11D9-85FD-003065A5FDCC@earthlink.net>
Message-ID: <1100059206.2056.21.camel@faery>

Hi Koen,

I've never tried it, but would it be able to follow the biojava code if
you had both the source and the class files in the jar?

Another way would be to compile biojava without making the jar file and
put the class files with yours. Then there would be no reason why the
debugger couldn't follow it.

I never use a debugger so I'm not quite sure why it couldn't follow it.
I'm mostly a fan of log4j (I guess the 1.4 logging system would work
fine too) and of using BeanShell to go step-by-step.

Francois


From mark.schreiber at group.novartis.com  Tue Nov  9 23:05:32 2004
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Tue Nov  9 23:03:52 2004
Subject: [Biojava-l] biojava and Xcode
Message-ID: <OF3B45707E.8E022482-ON48256F48.00164AB6-48256F48.00167B59@EU.novartis.net>

Just to add to this.

One of the best ways to debug serious problems is to follow the stack 
trace. If your bug causes an exception, you can find the class and line 
responsible in the stack trace. If the bug is more subtle, logging and/or 
assertions are the preferable way to go. It's also a good discipline.

- Mark




From td2 at sanger.ac.uk  Wed Nov 10 03:32:42 2004
From: td2 at sanger.ac.uk (Thomas Down)
Date: Wed Nov 10 03:30:56 2004
Subject: [Biojava-l] biojava and Xcode
In-Reply-To: <EBC744D4-32AD-11D9-85FD-003065A5FDCC@earthlink.net>
References: <EBC744D4-32AD-11D9-85FD-003065A5FDCC@earthlink.net>
Message-ID: <16A91F5D-32F3-11D9-B425-000A95C8B056@sanger.ac.uk>


On 10 Nov 2004, at 00:17, Koen van der Drift wrote:

> Hi,
>
> I have been able to build biojava using Apple's Xcode 1.5. I also was 
> able to make a separate small Xcode project and run some code that
> uses biojava. What I would like to do is debug my 
> code including the code it uses from biojava. I can step through my 
> own code, but as soon as the debugger steps into a biojava function, 
> it treats that as a black box, and I cannot see what happens 'under 
> the hood'. Is there any way to accomplish this, either with Xcode, or 
> another OS X or X11 app? I understand that this is impossible when I 
> only have a jar file, but now I also have all the source code from 
> biojava.
>
> Maybe I need to create another target in the same project that 
> contains the biojava source, but so far I have not been able to get 
> this to work.

I'm afraid I'm yet another guy who tends to use stacktraces and logging 
rather than diving in with a debugger...  but just to check...  you do 
have BioJava built with "Generate debugging symbols" set in the Java 
Compiler settings panel?

If that doesn't help, I agree that adding BioJava to the same project 
is probably the next logical step.  Why isn't that working?

          Thomas

From kvddrift at earthlink.net  Wed Nov 10 04:35:38 2004
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Wed Nov 10 04:34:07 2004
Subject: [Biojava-l] biojava and Xcode
In-Reply-To: <16A91F5D-32F3-11D9-B425-000A95C8B056@sanger.ac.uk>
References: <EBC744D4-32AD-11D9-85FD-003065A5FDCC@earthlink.net>
	<16A91F5D-32F3-11D9-B425-000A95C8B056@sanger.ac.uk>
Message-ID: <E1B430EC-32FB-11D9-A52E-003065A5FDCC@earthlink.net>


On Nov 10, 2004, at 3:32 AM, Thomas Down wrote:

> If that doesn't help, I agree that adding BioJava to the same project 
> is probably the next logical step.  Why isn't that working?
>

So far I was treating biojava and my own code as 2 different targets in 
the same project. I will try to make just one target and post here if 
it worked. Thanks all for the comments,

- Koen.

From kvddrift at earthlink.net  Thu Nov 11 17:21:25 2004
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Thu Nov 11 17:19:36 2004
Subject: [Biojava-l] opening unknown fasta file
Message-ID: <06409904-3430-11D9-9447-003065A5FDCC@earthlink.net>

Hi,

The BioJava tutorial (in anger) suggests the following code to open a 
fasta file:

[snip]

  // get the appropriate Alphabet
    Alphabet alpha = AlphabetManager.alphabetForName(args[1]);

  // get a SequenceDB of all sequences in the file
    SequenceDB db = SeqIOTools.readFasta(is, alpha);


But what should I do when I don't know if the fasta file contains a 
protein or dna sequence?


thanks,

- Koen.

From mark.schreiber at group.novartis.com  Thu Nov 11 21:01:13 2004
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Thu Nov 11 20:59:26 2004
Subject: [Biojava-l] opening unknown fasta file
Message-ID: <OF271FF870.50020420-ON48256F4A.0009FD94-48256F4A.000B191E@EU.novartis.net>

Hi Koen -

There was a method in SeqIOTools that can (mostly) guess the alphabet of a 
file, but it is deprecated because there is no standard convention for file 
naming.  ClustalW guesses by pre-reading the file and looking for 
symbols that are found in protein but don't occur in DNA. They claim its 
accuracy at guessing is in the high 90s (percent), but I'm not sure how they 
calculate that number.

Basically there is absolutely no failsafe way to know if a fasta file is 
DNA or protein (or RNA). It's perfectly reasonable to have a short peptide 
which contains only A, C, G and T, although it becomes very unlikely with 
longer sequences. If you have control over the files you could adopt some 
naming specification (I use .fna for fasta DNA or .faa for fasta amino 
acid). An alternative is to allow the format and alphabet to be specified 
in the arguments to the program.
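
The symbol-set check described above could look roughly like the sketch
below. This is not ClustalW's actual code and not a BioJava API; the class
name AlphabetGuess and its return strings are invented for illustration. As
noted, it is not failsafe: a peptide made only of letters that are also
nucleotide codes will be reported as nucleotide.

```java
// Hypothetical heuristic: any letter outside the IUPAC nucleotide codes means protein.
final class AlphabetGuess {
    // A, C, G, T, U plus the 11 IUPAC nucleotide ambiguity symbols.
    private static final String NUCLEOTIDE_CODES = "ACGTURYSWKMBDHVN";

    // Returns "PROTEIN" or "NUCLEOTIDE"; non-letters (gaps, digits, whitespace) are skipped.
    static String guess(String residues) {
        for (int i = 0; i < residues.length(); i++) {
            char c = Character.toUpperCase(residues.charAt(i));
            if (!Character.isLetter(c)) {
                continue;
            }
            if (NUCLEOTIDE_CODES.indexOf(c) < 0) {
                return "PROTEIN";
            }
        }
        return "NUCLEOTIDE";
    }

    public static void main(String[] args) {
        System.out.println(guess("acgtnnry"));
        System.out.println(guess("MKVLEF"));
    }
}
```

Only the sequence lines should be passed in, not the fasta description line.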

- Mark




From m.fortner at sbcglobal.net  Thu Nov 11 23:06:59 2004
From: m.fortner at sbcglobal.net (Mark A Fortner)
Date: Thu Nov 11 23:05:15 2004
Subject: [Biojava-l] opening unknown fasta file
In-Reply-To: <06409904-3430-11D9-9447-003065A5FDCC@earthlink.net>
Message-ID: <20041112040659.10085.qmail@web80303.mail.yahoo.com>

Koen,
One thing you might try is to parse the file, grab the
accession from the first line, and use regular
expressions to identify the type of sequence.  

Hope this helps,

Mark Fortner



From thomas at derkholm.net  Fri Nov 12 11:26:05 2004
From: thomas at derkholm.net (Thomas Down)
Date: Fri Nov 12 11:24:28 2004
Subject: [Biojava-l] opening unknown fasta file
In-Reply-To: <OF271FF870.50020420-ON48256F4A.0009FD94-48256F4A.000B191E@EU.novartis.net>
References: <OF271FF870.50020420-ON48256F4A.0009FD94-48256F4A.000B191E@EU.novartis.net>
Message-ID: <20041112162605.GA18883@kalinda.derkholm.net>

On Fri, Nov 12, 2004 at 10:01:13AM +0800, mark.schreiber@group.novartis.com wrote:
> 
> Basically there is absolutely no failsafe way to know if a fasta file is 
> DNA or Protein (or RNA). It's perfectly reasonable to have a short peptide 
> which contains only acg and t although it becomes very unlikely with 
> longer sequences.

The real problem isn't A, C, G, or T, but the other 11 ambiguity symbols
that appear in DNA sequences.  Ns are everywhere, but many of the other
ambiguities appear from time to time, too.

If we were *really* serious about alphabet-guessing (which scares me, to be
honest), one option would be to calculate histograms of character frequencies
in EMBL and Swissprot, and look for the closest match.  I believe that
Internet Explorer takes this approach when it hits a web page without an
explicitly-specified character encoding -- it apparently works pretty well...

Does anyone feel this serious?

       Thomas.
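
The histogram idea above can be sketched roughly as follows. This is only an
illustration under stated assumptions: the class and method names are
hypothetical, no BioJava API is used, and the reference histograms are built
from toy strings rather than from real EMBL or Swissprot statistics. It
compares letter frequencies by squared Euclidean distance, which is one
possible reading of "closest match".

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of histogram-based alphabet guessing. The reference histograms are
// built from whatever training text the caller supplies; the toy strings in
// main() are placeholders, not real EMBL or Swissprot statistics.
final class AlphabetGuessSketch {

    // Relative frequency of each letter A-Z in the text (case-insensitive).
    static double[] histogram(String text) {
        double[] freq = new double[26];
        int letters = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = Character.toUpperCase(text.charAt(i));
            if (c >= 'A' && c <= 'Z') {
                freq[c - 'A']++;
                letters++;
            }
        }
        if (letters > 0) {
            for (int i = 0; i < freq.length; i++) {
                freq[i] /= letters;
            }
        }
        return freq;
    }

    // Squared Euclidean distance between two histograms.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Name of the reference histogram closest to the query sequence.
    static String guess(String sequence, Map<String, double[]> references) {
        double[] query = histogram(sequence);
        String best = null;
        double bestDistance = Double.MAX_VALUE;
        for (Map.Entry<String, double[]> ref : references.entrySet()) {
            double d = distance(query, ref.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = ref.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, double[]> refs = new LinkedHashMap<String, double[]>();
        refs.put("DNA", histogram("ACGTACGTNNACGTTGCAAGCTRYACGT"));
        refs.put("PROTEIN", histogram("MKVLAAGIVGLSERTPWDNQHFYCMKLEA"));
        System.out.println(guess("GATTACAGATTACANNGATC", refs));
        System.out.println(guess("MSTLKEWQRHPLVDAGF", refs));
    }
}
```

With realistic reference histograms the same guess() call could run on the
first few hundred residues of a file before choosing an Alphabet. Short
sequences remain ambiguous whatever the distance measure.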
From jvermont at hotmail.com  Sat Nov 13 00:11:29 2004
From: jvermont at hotmail.com (j vermont)
Date: Sat Nov 13 00:10:58 2004
Subject: [Biojava-l] opening unknown fasta file
Message-ID: <BAY17-F5MBhe3WYIgoF00037292@hotmail.com>

IMO this should be addressed from the design standpoint of the APIs 
themselves. If you are *aware* of the nature of the file you're dealing with 
the APIs should support the ability to differentiate them programmatically, 
either via a Factory design pattern or through subclassing. It would be far 
more efficient to solve via architecture and design a general solution than 
it would be to design a 'parsing' or algorithmic based solution which will 
be specific only (I'm guessing) on a case by case basis. Not to mention the 
legit observation someone made about 'alphabet guessing.'
Obviously take my input for what it's worth, I'm a programmer by trade with 
an interest in genetics so I lean towards (and understand better) the comp 
science aspects of these discussions. I hope my humble suggestions are at 
least somewhat helpful. Based on my understanding of what is being discussed 
in this thread, however, you should be able to programmatically (not 
algorithmically) solve this particular scenario. I could look at it further 
(an API/design based or pattern based solution) when I get a chance, if 
anyone thinks it worthwhile.

just my thoughts,

jess vermont
chicago

Universes of virtually unlimited complexity can be created in the form of 
computer programs. (Joseph Weizenbaum)


>From: Thomas Down <thomas@derkholm.net>
>To: mark.schreiber@group.novartis.com
>CC: biojava-list <biojava-l@biojava.org>
>Subject: Re: [Biojava-l] opening unknown fasta file
>Date: Fri, 12 Nov 2004 16:26:05 +0000
>
>On Fri, Nov 12, 2004 at 10:01:13AM +0800, mark.schreiber@group.novartis.com 
>wrote:
> >
> > Basically there is absolutely no failsafe way to know if a fasta file is
> > DNA or Protein (or RNA). It's perfectly reasonable to have a short 
>peptide
> > which contains only acg and t although it becomes very unlikely with
> > longer sequences.
>
>The real problem isn't A, C, G, or T, but the other 11 ambiguity symbols
>that appear in DNA sequences.  Ns are everywhere, but many of the other
>ambiguities appear from time to time, too.
>
>If we were *really* serious about alphabet-guessing (which scares me, to be
>honest), one option would be to calculate histograms of character 
>frequencies
>in EMBL and Swissprot, and look for the closest match.  I believe that
>Internet Explorer takes this approach when it hits a web page without an
>explicitly-specified character encoding -- it apparently works pretty 
>well...
>
>Does anyone feel this serious?
>
>        Thomas.

From kvddrift at earthlink.net  Sat Nov 13 14:47:57 2004
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Sat Nov 13 14:46:13 2004
Subject: [Biojava-l] biojava and Xcode
In-Reply-To: <E1B430EC-32FB-11D9-A52E-003065A5FDCC@earthlink.net>
References: <EBC744D4-32AD-11D9-85FD-003065A5FDCC@earthlink.net>
	<16A91F5D-32F3-11D9-B425-000A95C8B056@sanger.ac.uk>
	<E1B430EC-32FB-11D9-A52E-003065A5FDCC@earthlink.net>
Message-ID: <EAF27D42-35AC-11D9-B29B-003065A5FDCC@earthlink.net>


On Nov 10, 2004, at 4:35 AM, Koen van der Drift wrote:

> So far I was treating biojava and my own code as 2 different targets  
> in the same project. I will try to make just one target and post here  
> if it worked. Thanks all for the comments,
>

To follow up on this, it's working now. The trick is to create an  
"Ant-based Application Jar" project in Xcode (1.5), and copy all the  
code from the src directory in biojava-1.4pre1 plus my own code into  
the project. I did have to comment out a couple of lines that start  
with assert to compile successfully, for instance:

/Users/koen/Desktop/biojavatest2/src/org/biojava/bio/symbol/ 
SimpleGappedSymbolList.java:408: warning: as of release 1.4, assert is  
a keyword, and may not be used as an identifier
     assert isSane() : "Data corrupted: " + blocks;
     ^
/Users/koen/Desktop/biojavatest2/src/org/biojava/bio/symbol/ 
SimpleGappedSymbolList.java:408: ';' expected
     assert isSane() : "Data corrupted: " + blocks;


Regarding the suggestions to use the stack trace, I have a C/C++ and  
GUI background, so I prefer to visually step through the code to see  
the flow and the values of each variable.


cheers,

- Koen.

From td2 at sanger.ac.uk  Sat Nov 13 15:13:39 2004
From: td2 at sanger.ac.uk (Thomas Down)
Date: Sat Nov 13 15:13:26 2004
Subject: [Biojava-l] biojava and Xcode
In-Reply-To: <EAF27D42-35AC-11D9-B29B-003065A5FDCC@earthlink.net>
References: <EBC744D4-32AD-11D9-85FD-003065A5FDCC@earthlink.net>
	<16A91F5D-32F3-11D9-B425-000A95C8B056@sanger.ac.uk>
	<E1B430EC-32FB-11D9-A52E-003065A5FDCC@earthlink.net>
	<EAF27D42-35AC-11D9-B29B-003065A5FDCC@earthlink.net>
Message-ID: <823016E2-35B0-11D9-A4CC-000A95C8B056@sanger.ac.uk>


On 13 Nov 2004, at 19:47, Koen van der Drift wrote:

>
> On Nov 10, 2004, at 4:35 AM, Koen van der Drift wrote:
>
>> So far I was treating biojava and my own code as 2 different targets 
>> in the same project. I will try to make just one target and post here 
>> if it worked. Thanks all for the comments,
>>
>
> To follow up on this, it's working now. The trick is to create an 
> "Ant-based Application Jar" project in Xcode (1.5), and copy all the 
> code from the src directory in biojava-1.4pre1 plus my own code into 
> the project. I did have to comment out a couple of lines that start 
> with assert to compile successfully, for instance

Assert is a Java 1.4 language feature.  Could you try opening the 
Target settings, looking at the "Java Compiler Settings" panel, and 
checking that "Source Version" is set to 1.4?  The default seems to be 
"Unspecified".  I've not tried building BioJava in Xcode (I'm an 
Eclipse user myself), but this seems like the most likely problem.

         Thomas.
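
For reference, a minimal sketch of the assert statement under discussion.
The class and method names are hypothetical, and isSane() here is a
stand-in, not the BioJava method of the same name. The sketch shows the
"condition : message" form from the compiler output above, plus a common
idiom for detecting whether assertions are switched on at run time.

```java
import java.util.Arrays;

// Sketch: the assert statement only compiles at source level 1.4 or later,
// and is only evaluated at run time when the JVM is started with -ea.
final class AssertSketch {

    // Hypothetical stand-in for a sanity check: blocks must be non-decreasing.
    static boolean isSane(int[] blocks) {
        for (int i = 1; i < blocks.length; i++) {
            if (blocks[i] < blocks[i - 1]) {
                return false;
            }
        }
        return true;
    }

    // Returns true when assertions are enabled in this JVM.
    static boolean assertionsEnabled() {
        boolean enabled = false;
        assert enabled = true; // the assignment only runs under -ea
        return enabled;
    }

    // Same shape as the line the compiler rejected at the older source level.
    static int checkedLength(int[] blocks) {
        assert isSane(blocks) : "Data corrupted: " + Arrays.toString(blocks);
        return blocks.length;
    }

    public static void main(String[] args) {
        System.out.println("assertions enabled: " + assertionsEnabled());
        System.out.println(checkedLength(new int[] {1, 2, 3}));
    }
}
```

Because assertions are skipped unless the JVM is started with -ea, commenting
them out changes nothing in a default run, but it does lose the checks for
anyone who enables them while debugging.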

From kvddrift at earthlink.net  Sat Nov 13 16:44:02 2004
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Sat Nov 13 16:42:10 2004
Subject: [Biojava-l] biojava and Xcode
In-Reply-To: <823016E2-35B0-11D9-A4CC-000A95C8B056@sanger.ac.uk>
References: <EBC744D4-32AD-11D9-85FD-003065A5FDCC@earthlink.net>
	<16A91F5D-32F3-11D9-B425-000A95C8B056@sanger.ac.uk>
	<E1B430EC-32FB-11D9-A52E-003065A5FDCC@earthlink.net>
	<EAF27D42-35AC-11D9-B29B-003065A5FDCC@earthlink.net>
	<823016E2-35B0-11D9-A4CC-000A95C8B056@sanger.ac.uk>
Message-ID: <229E521B-35BD-11D9-B29B-003065A5FDCC@earthlink.net>


On Nov 13, 2004, at 3:13 PM, Thomas Down wrote:

>>
>> To follow up on this, it's working now. The trick is to create an 
>> "Ant-based Application Jar" project in Xcode (1.5), and copy all the 
>> code from the src directory in biojava-1.4pre1 plus my own code into 
>> the project. I did have to comment out a couple of lines that start 
>> with assert to compile successfully, for instance
>
> Assert is a Java 1.4 language feature.  could you try opening the 
> Target settings, looking at the "Java Compiler Settings" panel, and 
> checking that "Source Version" is set to 1.4.  Default seems to be 
> "Unspecified".  I've not tried building BioJava in Xcode (I'm an 
> eclipse user myself), but this seems like the most likely problem.
>

That setting is unavailable when I use an "Ant-based Application Jar" 
project. When I switch to a "Java Tool" project, I do see that setting, 
but changing it to 1.4 doesn't solve the problem. However, I don't 
think that it is that big of a problem, so I will just leave the few 
instances of assert commented out.


thanks,

- Koen.

From ml-it-biojava at epigenomics.com  Tue Nov 16 05:09:43 2004
From: ml-it-biojava at epigenomics.com (Dirk Habighorst)
Date: Tue Nov 16 05:08:26 2004
Subject: [Biojava-l] TestDAS problem
Message-ID: <cncjl7$5js$1@broglie.epigenomics.epi>

Hi,

running the TestDAS example (biojava-live) causes the following exception:

Exception in thread "main" org.biojava.bio.BioRuntimeException: org.biojava.bio.BioException: DAS error (status code = 401) connecting to http://servlet.sanger.ac.uk:8080/das/ with query http://servlet.sanger.ac.uk:8080/das/entry_points
	at org.biojava.bio.program.das.DASSequenceDB.ids(DASSequenceDB.java:286)
	at das.TestDAS.main(TestDAS.java:25)
Caused by: org.biojava.bio.BioException: DAS error (status code = 401) connecting to http://servlet.sanger.ac.uk:8080/das/ with query http://servlet.sanger.ac.uk:8080/das/entry_points
	at org.biojava.bio.program.das.DASSequenceDB.ids(DASSequenceDB.java:261)
	... 1 more

I have tried several other DAS servers with the same result. By the way, are the sources for Matthew Pocock's BioJava DAS client available anywhere?

thanks, dirk
From thomas at derkholm.net  Tue Nov 16 05:25:08 2004
From: thomas at derkholm.net (Thomas Down)
Date: Tue Nov 16 05:23:15 2004
Subject: [Biojava-l] TestDAS problem
In-Reply-To: <cncjl7$5js$1@broglie.epigenomics.epi>
References: <cncjl7$5js$1@broglie.epigenomics.epi>
Message-ID: <20041116102508.GB27270@kalinda.derkholm.net>

On Tue, Nov 16, 2004 at 11:09:43AM +0100, Dirk Habighorst wrote:
> Hi,
> 
> running the TestDAS example (biojava-live) causes the following exception:
> 
> Exception in thread "main" org.biojava.bio.BioRuntimeException: 
> org.biojava.bio.BioException: DAS error (status code = 401) connecting to 
> http://servlet.sanger.ac.uk:8080/das/ with query 

That test script is meant to be pointed to an individual DAS data
source, not the root of a DAS server (which can potentially be
serving up many data sources).

Try something like:

     http://servlet.sanger.ac.uk:8080/das/homo_sapiens_core_25_34e/

(You can get a complete list of what's on offer by looking at
http://servlet.sanger.ac.uk:8080/das/ in a web browser).
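
As a rough illustration of the difference, the sketch below resolves the
relative request "entry_points" against both kinds of base URL using plain
java.net.URI. That the client forms its query this way is an assumption read
off the URLs in the exception message; the class and method names are
hypothetical and no BioJava DAS classes are involved.

```java
import java.net.URI;

// Sketch: how a relative "entry_points" request resolves against a server
// root versus an individual data source. That the client builds its query
// this way is an assumption based on the URLs in the exception message.
final class DasUrlSketch {

    static URI entryPoints(String baseUrl) {
        // Ensure a trailing slash so the last path segment is kept.
        String base = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
        return URI.create(base).resolve("entry_points");
    }

    public static void main(String[] args) {
        // Server root: the request goes to the server itself.
        System.out.println(entryPoints("http://servlet.sanger.ac.uk:8080/das/"));
        // Data source: the request is scoped to one data source.
        System.out.println(
            entryPoints("http://servlet.sanger.ac.uk:8080/das/homo_sapiens_core_25_34e/"));
    }
}
```

The first form matches the failing query in the stack trace; the second is
the kind of URL the test script expects.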

> I have tried several other das servers with the same result. By the way are 
> the sources for Matthew Pococks biojava das client available anywhere?

There's a version in CVS:

        http://cvs.open-bio.org/cgi-bin/viewcvs/viewcvs.cgi/das-client/?cvsroot=biojava

I think this is up to date, but as you can see it's not had much development
recently.

   Thomas.
From jdiggans at excelsiortech.com  Sat Nov 20 23:12:49 2004
From: jdiggans at excelsiortech.com (James Diggans)
Date: Sat Nov 20 23:00:51 2004
Subject: [Biojava-l] Errors to STDOUT in BioJava?
Message-ID: <41A015C1.5030801@excelsiortech.com>


A design-related question: In GenbankSequenceDB's getSequence() method, 
the exception-handling code when catching an Exception thrown when a bad 
accession is used to search (returning nothing from Genbank) prints an 
error to STDOUT rather than passing the Exception up the chain like a 
good little Java method should:

...
     } catch (Exception e) {
       System.out.println("Exception found in GenbankSequenceDB -- getSequence");
       System.out.println(e.toString());
       ExceptionFound = true;
       IOExceptionFound = true;
       return null;
     }

Is there a reason behind this? It results in an application that prints 
to STDOUT regardless of my wishes and also limits my ability to catch 
the Exception myself higher up in the stack to deal with it in an 
application-specific way. Just curious ... thanks.
-j

From jvermont at hotmail.com  Sun Nov 21 03:13:07 2004
From: jvermont at hotmail.com (j vermont)
Date: Sun Nov 21 03:12:02 2004
Subject: [Biojava-l] Errors to STDOUT in BioJava?
In-Reply-To: <41A015C1.5030801@excelsiortech.com>
Message-ID: <BAY17-F180F172151665A63111353DCC50@phx.gbl>

hello all,

I asked JD if a proper solution to this would be to rethrow Exception as 
such:

} catch (Exception e) {
       System.out.println("Exception found in GenbankSequenceDB -- getSequence");
       System.out.println(e.toString());
       ExceptionFound = true;
       IOExceptionFound = true;
       // create an instance of Exception and throw it here so it gets
       // passed back up the stack to the calling method; Exception is a
       // checked type, so getSequence() would have to declare it
       Exception myException = new Exception("bad accession error");
       throw myException;
     }

The compiler won't complain if you're not throwing a checked exception. Or 
you could perhaps put a check for the boolean ExceptionFound in a finally 
clause and throw an exception from there if ExceptionFound == true, such as:

} catch (Exception e) {
       System.out.println("Exception found in GenbankSequenceDB -- getSequence");
       System.out.println(e.toString());
       ExceptionFound = true;
       IOExceptionFound = true;
       return null;
     }
// always executed unless System.exit() is called
finally
{
  // check for state of boolean ExceptionFound here
  if(ExceptionFound)
    {
        // error occurred, throw an exception that will be handled further up
        // the stack
        throw new IllegalAccessionException();
    }
}


just some thoughts.

thanks for your time,

Jess Vermont
Chicago, Il.

Universes of virtually unlimited complexity can be created in the form of 
computer programs. (Joseph Weizenbaum)


>From: James Diggans <jdiggans@excelsiortech.com>
>To: biojava-l@biojava.org
>Subject: [Biojava-l] Errors to STDOUT in BioJava?
>Date: Sat, 20 Nov 2004 23:12:49 -0500
>
>
>A design-related question: In GenbankSequenceDB's getSequence() method, the 
>exception-handling code when catching an Exception thrown when a bad 
>accession is used to search (returning nothing from Genbank) prints an 
>error to STDOUT rather than passing the Exception up the chain like a good 
>little Java method should:
>
>...
>     } catch (Exception e) {
>       System.out.println("Exception found in GenbankSequenceDB -- getSequence");
>       System.out.println(e.toString());
>       ExceptionFound = true;
>       IOExceptionFound = true;
>       return null;
>     }
>
>Is there a reason behind this? It results in an application that prints to 
>STDOUT regardless of my wishes and also limits my ability to catch the 
>Exception myself higher up in the stack to deal with it in an 
>application-specific way. Just curious ... thanks.
>-j
>

From jdiggans at excelsiortech.com  Sun Nov 21 03:52:08 2004
From: jdiggans at excelsiortech.com (James Diggans)
Date: Sun Nov 21 03:43:54 2004
Subject: [Biojava-l] Errors to STDOUT in BioJava?
In-Reply-To: <BAY17-F180F172151665A63111353DCC50@phx.gbl>
References: <BAY17-F180F172151665A63111353DCC50@phx.gbl>
Message-ID: <41A05738.5070005@excelsiortech.com>


Certainly. I was asking the list to inquire whether this was an 
intentional design choice (i.e. making the class noisy regardless of 
calling class) or whether anyone would be averse to fixing it to just 
simply be:

 >>     } catch (Exception e) {
 >>       ExceptionFound = true;
 >>       IOExceptionFound = true;
 >>       return null;
 >>     }

or, my preference, to let the exception (in a more specific embodiment, 
say, InvalidIDException) percolate up the stack to the caller rather 
than being dealt with at this low level.
-j

j vermont wrote:
> hello all,
> 
> I asked JD if a proper solution to this would be to rethrow Exception as 
> such:
> 
> } catch (Exception e) {
> 
>>       System.out.println("Exception found in GenbankSequenceDB -- 
>> getSequence");
>>       System.out.println(e.toString());
>>       ExceptionFound = true;
>>       IOExceptionFound = true;
> 
>         //create instance of Exception and throw it here so it gets 
> passed back up the stack to
>         // the calling method....
>         Exception myException = new Exception("bad accession error");
>         throw myException;
> 
>>       return null;
>>     }
> 
> 
> the compiler won't complain if you're not throwing a checked exception. 
> Or you could perhaps put a check for the boolean ExceptionFound in a 
> finally clause and throw and exception from there if ExceptionFound == 
> true; such as
> 
> } catch (Exception e) {
>       System.out.println("Exception found in GenbankSequenceDB --
> getSequence");
>       System.out.println(e.toString());
>       ExceptionFound = true;
>       IOExceptionFound = true;
>       return null;
>     }
> //always executed unless system.exit() is called;
> finally
> {
>  // check for state of boolean ExceptionFound here
>  if(ExceptionFound)
>    {
>        //error occured, throw an exception that will be handled further up
>        //the stack
>        throw new IlllegalAccessionException();
>    }
> }
> 
> 
> 
> just some thoughts.
> 
> thanks for your time,
> 
> Jess Vermont
> Chicago, Il.
> 
> Universes of virtually unlimited complexity can be created in the form 
> of computer programs. (Joseph Weizenbaum)
> 
> 
From td2 at sanger.ac.uk  Sun Nov 21 06:18:59 2004
From: td2 at sanger.ac.uk (Thomas Down)
Date: Sun Nov 21 06:17:08 2004
Subject: [Biojava-l] Errors to STDOUT in BioJava?
In-Reply-To: <41A05738.5070005@excelsiortech.com>
References: <BAY17-F180F172151665A63111353DCC50@phx.gbl>
	<41A05738.5070005@excelsiortech.com>
Message-ID: <24347E20-3BAF-11D9-AE6B-000A95C8B056@sanger.ac.uk>


On 21 Nov 2004, at 08:52, James Diggans wrote:

>
> Certainly. I was asking the list to inquire whether this was an 
> intentional design choice (i.e. making the class noisy regardless of 
> calling class) or whether anyone would be averse to fixing it to just 
> simply be:
>
> >>     } catch (Exception e) {
> >>       ExceptionFound = true;
> >>       IOExceptionFound = true;
> >>       return null;
> >>     }
>
> or, my preference, to let the exception (in a more specific 
> embodiment, say, InvalidIDException) percolate up the stack to the 
> caller rather than being dealt with at this low level.

Hi James,

I'd certainly agree that this should be throwing an exception rather 
than returning null.  If this class were to implement the standard 
SequenceDB interface (incidentally, does anyone know why it doesn't?), 
then the getSequence method is allowed to throw IllegalIDException or 
BioException -- the `ideal' behaviour, where possible, is to throw 
IllegalIDException for the specific case of a non-existent ID being 
requested, BioException if the ID is valid but there's some other error 
getting at the data.  I don't know how easy the Genbank protocol makes 
it to distinguish the two cases.

I'll fix this at some point in the next few days, or I'd be happy to 
apply a patch if you've sorted this out yourself.

        Thomas.
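
The behaviour described above might look roughly like the sketch below. It
is not the GenbankSequenceDB code and not the eventual patch: the exception
classes are hypothetical stand-ins (so the sketch compiles without BioJava),
their subclass relationship is an assumption of the sketch, and the "remote
fetch" is simulated with an in-memory map.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for BioException and IllegalIDException so that the
// sketch compiles without BioJava on the classpath.
class SketchBioException extends Exception {
    SketchBioException(String message, Throwable cause) {
        super(message, cause);
    }
}

class SketchIllegalIDException extends SketchBioException {
    SketchIllegalIDException(String message) {
        super(message, null);
    }
}

final class SketchSequenceDB {
    private final Map<String, String> records = new HashMap<String, String>();
    private boolean offline = false;

    void put(String id, String sequence) { records.put(id, sequence); }
    void setOffline(boolean offline) { this.offline = offline; }

    // Simulated remote fetch; a real implementation would talk to Genbank.
    private String fetch(String id) throws IOException {
        if (offline) {
            throw new IOException("connection refused");
        }
        return records.get(id);
    }

    // Never prints and never returns null: every failure reaches the caller.
    String getSequence(String id) throws SketchBioException {
        String sequence;
        try {
            sequence = fetch(id);
        } catch (IOException e) {
            // The ID may be fine; something else went wrong getting the data.
            throw new SketchBioException("Could not retrieve " + id, e);
        }
        if (sequence == null) {
            throw new SketchIllegalIDException("No such ID: " + id);
        }
        return sequence;
    }

    public static void main(String[] args) {
        SketchSequenceDB db = new SketchSequenceDB();
        db.put("X00001", "ACGT");
        try {
            System.out.println(db.getSequence("X00001"));
            db.getSequence("bogus");
        } catch (SketchIllegalIDException e) {
            System.err.println("unknown ID: " + e.getMessage());
        } catch (SketchBioException e) {
            System.err.println("retrieval failed: " + e.getMessage());
        }
    }
}
```

The caller, not the library, decides whether anything is printed, and the
two failure modes can be told apart by exception type.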

From len at reeltwo.com  Mon Nov  1 15:13:31 2004
From: len at reeltwo.com (Len Trigg)
Date: Sun Nov 21 16:10:57 2004
Subject: [Biojava-l] BioSQL
In-Reply-To: <OF4344BD33.150F9265-ON48256F23.001FB186-48256F23.001FD0E2@EU.novartis.net>
References: <OF4344BD33.150F9265-ON48256F23.001FB186-48256F23.001FD0E2@EU.novartis.net>
Message-ID: <hbhdo91g7k.wl%len@reeltwo.com>

Mark Schreiber wrote:
> Does anyone have a current BioSQL schema that matches the BioJava 
> bindings? Preferably one for Oracle.

AFAIK, current BioSQL CVS matches what BioJava expects -- Hilmar
recently added the extra table that BioJava uses.  I've also attached
the Oracle schema that I've used (it's a bit simpler than the full
BioSQL Oracle schema).

Cheers,
Len.

-------------- next part --------------
-- conventions:
-- <table_name>_id is primary internal id (usually autogenerated)

-- Authors: Ewan Birney, Elia Stupka
-- Contributors: Hilmar Lapp, Aaron Mackey
--
-- Copyright Ewan Birney. You may use, modify, and distribute this code under
-- the same terms as Perl. See the Perl Artistic License.
--
-- comments to biosql - biosql-l@open-bio.org

--
-- Migration of the MySQL schema to InnoDB by Hilmar Lapp <hlapp at gmx.net>
-- Post-Cape Town changes by Hilmar Lapp.
-- Singapore changes by Hilmar Lapp and Aaron Mackey.
--


CREATE SEQUENCE biodatabase_pk_seq
	INCREMENT BY 1 START WITH 1 NOMAXVALUE NOMINVALUE NOCYCLE NOORDER;
CREATE SEQUENCE taxon_pk_seq
	INCREMENT BY 1 START WITH 1 NOMAXVALUE NOMINVALUE NOCYCLE NOORDER;
CREATE SEQUENCE ontology_pk_seq
	INCREMENT BY 1 START WITH 1 NOMAXVALUE NOMINVALUE NOCYCLE NOORDER;
CREATE SEQUENCE term_pk_seq
	INCREMENT BY 1 START WITH 1 NOMAXVALUE NOMINVALUE NOCYCLE NOORDER;
CREATE SEQUENCE term_relationship_pk_seq
	INCREMENT BY 1 START WITH 1 NOMAXVALUE NOMINVALUE NOCYCLE NOORDER;
CREATE SEQUENCE term_path_pk_seq
	INCREMENT BY 1 START WITH 1 NOMAXVALUE NOMINVALUE NOCYCLE NOORDER;
CREATE SEQUENCE bioentry_pk_seq
	INCREMENT BY 1 START WITH 1 NOMAXVALUE NOMINVALUE NOCYCLE NOORDER;
CREATE SEQUENCE bioentry_relationship_pk_seq
	INCREMENT BY 1 START WITH 1 NOMAXVALUE NOMINVALUE NOCYCLE NOORDER;
CREATE SEQUENCE dbxref_pk_seq
	INCREMENT BY 1 START WITH 1 NOMAXVALUE NOMINVALUE NOCYCLE NOORDER;
CREATE SEQUENCE reference_pk_seq
 	INCREMENT BY 1 START WITH 1 NOMAXVALUE NOMINVALUE NOCYCLE NOORDER;
CREATE SEQUENCE anncomment_pk_seq
 	INCREMENT BY 1 START WITH 1 NOMAXVALUE NOMINVALUE NOCYCLE NOORDER;
CREATE SEQUENCE seqfeature_pk_seq
 	INCREMENT BY 1 START WITH 1 NOMAXVALUE NOMINVALUE NOCYCLE NOORDER;
CREATE SEQUENCE seqfeature_relationship_pk_seq
 	INCREMENT BY 1 START WITH 1 NOMAXVALUE NOMINVALUE NOCYCLE NOORDER;
CREATE SEQUENCE location_pk_seq
 	INCREMENT BY 1 START WITH 1 NOMAXVALUE NOMINVALUE NOCYCLE NOORDER;


-- databases have bioentries. That is about it.
-- we do not store different versions of a database as different dbids
-- (there is no concept of versions of database). There is a concept of
-- versions of entries. Versions of databases deserve their own table and
-- join to bioentry table for tracking with versions of entries 

CREATE TABLE biodatabase (
  	biodatabase_id 	int  NOT NULL ,
  	name           	VARCHAR(128) NOT NULL,
	authority	VARCHAR(128),
	description	VARCHAR2(250),
	PRIMARY KEY (biodatabase_id),
  	UNIQUE (name)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX db_auth on biodatabase(authority) TABLESPACE "BIOSQL_INDEX";

-- we could insist that taxa are NCBI taxon id, but on reflection I made this
-- an optional extra line, as many flat file formats do not have the NCBI id
--
-- no organelle/sub species

-- corresponds to the node table of the NCBI taxonomy database
CREATE TABLE taxon (
       taxon_id		int  NOT NULL ,
       ncbi_taxon_id 	int,
       parent_taxon_id	int ,
       node_rank	VARCHAR(32),
       genetic_code	INT ,
       mito_genetic_code INT ,
       left_value	int ,
       right_value	int ,
       PRIMARY KEY (taxon_id),
       UNIQUE (ncbi_taxon_id),
       UNIQUE (left_value),
       UNIQUE (right_value)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX taxparent ON taxon(parent_taxon_id) TABLESPACE "BIOSQL_INDEX";

-- corresponds to the names table of the NCBI taxonomy database
CREATE TABLE taxon_name (
       taxon_id		int  NOT NULL,
       name		VARCHAR(255) NOT NULL,
       name_class	VARCHAR(32) NOT NULL,
       UNIQUE (taxon_id,name,name_class)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX taxnametaxonid ON taxon_name(taxon_id) TABLESPACE "BIOSQL_INDEX";
CREATE INDEX taxnamename    ON taxon_name(name) TABLESPACE "BIOSQL_INDEX";

-- this is the namespace (controlled vocabulary) ontology terms live in
-- we chose to have a separate table for this instead of reusing biodatabase
CREATE TABLE ontology (
       	ontology_id        int  NOT NULL ,
       	name	   	   VARCHAR(32) NOT NULL,
       	definition	   VARCHAR2(250),
	PRIMARY KEY (ontology_id),
	UNIQUE (name)
) TABLESPACE "BIOSQL_DATA";

-- any controlled vocab term, everything from full ontology
-- terms eg GO IDs to the various keys allowed as qualifiers
CREATE TABLE term (
       	term_id   int  NOT NULL ,
       	name	   	   VARCHAR(255) NOT NULL,
       	definition	   VARCHAR2(250),
	identifier	   VARCHAR(40),
	is_obsolete	   CHAR(1),
	ontology_id	   int  NOT NULL,
	PRIMARY KEY (term_id),
	UNIQUE (name,ontology_id),
	UNIQUE (identifier)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX term_ont ON term(ontology_id) TABLESPACE "BIOSQL_INDEX";

-- ontology terms have synonyms, here is how to store them
CREATE TABLE term_synonym (
       name		  VARCHAR(255) NOT NULL,
       term_id		  int  NOT NULL,
       PRIMARY KEY (term_id,name)
) TABLESPACE "BIOSQL_DATA";

-- ontology terms to dbxref association: ontology terms have dbxrefs
CREATE TABLE term_dbxref (
       	term_id	          int  NOT NULL,
       	dbxref_id         int  NOT NULL,
	rank		  SMALLINT,
	PRIMARY KEY (term_id, dbxref_id)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX trmdbxref_dbxrefid ON term_dbxref(dbxref_id) TABLESPACE "BIOSQL_INDEX";

-- relationship between controlled vocabulary / ontology term
-- we use subject/predicate/object but this could also
-- be thought of as child/relationship-type/parent.
-- the subject/predicate/object naming is better as we
-- can think of the graph as composed of statements.
--
-- we also treat the relationshiptypes / predicates as
-- controlled terms in themselves; this is quite useful
-- as a lot of systems (eg GO) will soon require
-- ontologies of relationship types (eg subtle differences
-- in the partOf relationship)
--
-- this table probably won''t be filled for a while, the core
-- will just treat ontologies as flat lists of terms

CREATE TABLE term_relationship (
        term_relationship_id int  NOT NULL ,
       	subject_term_id	int  NOT NULL,
       	predicate_term_id    int  NOT NULL,
       	object_term_id       int  NOT NULL,
	ontology_id	int  NOT NULL,
	PRIMARY KEY (term_relationship_id),
	UNIQUE (subject_term_id,predicate_term_id,object_term_id,ontology_id)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX trmrel_predicateid ON term_relationship(predicate_term_id) TABLESPACE "BIOSQL_INDEX";
CREATE INDEX trmrel_objectid ON term_relationship(object_term_id) TABLESPACE "BIOSQL_INDEX";
CREATE INDEX trmrel_ontid ON term_relationship(ontology_id) TABLESPACE "BIOSQL_INDEX";
-- you may want to add this for mysql because MySQL often is broken with
-- respect to using the composite index for the initial keys
--CREATE INDEX ontrel_subjectid ON term_relationship(subject_term_id);

-- the infamous transitive closure table on ontology term relationships
-- this is a warehouse approach - you will need to update this regularly
--
-- the triple of (subject, predicate, object) is the same as for ontology
-- relationships, with the exception of predicate being the greatest common
-- denominator of the relationships types visited in the path (i.e., if
-- relationship type A is-a relationship type B, the greatest common
-- denominator for path containing both types A and B is B)
--
-- See the GO database or Chado schema for other (and possibly better
-- documented) implementations of the transitive closure table approach.
CREATE TABLE term_path (
        term_path_id         int  NOT NULL ,
       	subject_term_id	     int  NOT NULL,
       	predicate_term_id    int  NOT NULL,
       	object_term_id       int  NOT NULL,
	ontology_id          int  NOT NULL,
	distance	     int ,
	PRIMARY KEY (term_path_id),
	UNIQUE (subject_term_id,predicate_term_id,object_term_id,ontology_id,distance)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX trmpath_predicateid ON term_path(predicate_term_id) TABLESPACE "BIOSQL_INDEX";
CREATE INDEX trmpath_objectid ON term_path(object_term_id) TABLESPACE "BIOSQL_INDEX";
CREATE INDEX trmpath_ontid ON term_path(ontology_id) TABLESPACE "BIOSQL_INDEX";
-- you may want to add this for mysql because MySQL often is broken with
-- respect to using the composite index for the initial keys
--CREATE INDEX trmpath_subjectid ON term_path(subject_term_id);


-- BioJava addition
CREATE TABLE term_relationship_term (
        term_relationship_id    int  DEFAULT 0 NOT NULL,
        term_id                 int  DEFAULT 0 NOT NULL,
        PRIMARY KEY (term_relationship_id,term_id)
) TABLESPACE "BIOSQL_DATA";

ALTER TABLE term_relationship_term ADD CONSTRAINT uni_term_relationship_id 
        UNIQUE (term_relationship_id) ENABLE VALIDATE;
ALTER TABLE term_relationship_term ADD CONSTRAINT uni_term_id 
        UNIQUE (term_id) ENABLE VALIDATE;


-- we can be a bioentry without a biosequence, but not vice versa
-- most things are going to be keyed off bioentry_id
--
-- accession is the stable id, display_id is a potentially volatile,
-- human readable name.
--
-- Version may be unknown, may be undefined, or may not exist for a certain
-- accession or database (namespace). We require it here to avoid
-- RDBMS-dependent enforcement variants (version is in a compound alternative key),
-- and to simplify query construction for UK look-ups. If there is no version
-- the convention is to put 0 (zero) here. Likewise, a record with a version
-- of zero means the version is to be interpreted as NULL.
--
-- not all entries have a taxon, but many do.
-- one bioentry only has one taxon! (weirdo chimeras are not handled. tough)
--
-- Name maps to display_id in bioperl. We have a different column name
-- here to avoid confusion with the naming convention for foreign keys.

CREATE TABLE bioentry (
	bioentry_id	int  NOT NULL ,
  	biodatabase_id  int  NOT NULL,
  	taxon_id     	int ,
  	name		VARCHAR(40) NOT NULL,
  	accession    	VARCHAR(40) NOT NULL,
  	identifier   	VARCHAR(40),
	division	VARCHAR(6),
  	description  	VARCHAR2(250),
  	version 	SMALLINT  NOT NULL, 
	PRIMARY KEY (bioentry_id),
  	UNIQUE (accession,biodatabase_id,version),
  	UNIQUE (identifier)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX bioentry_name ON bioentry(name) TABLESPACE "BIOSQL_INDEX";
CREATE INDEX bioentry_db   ON bioentry(biodatabase_id) TABLESPACE "BIOSQL_INDEX";
CREATE INDEX bioentry_tax  ON bioentry(taxon_id) TABLESPACE "BIOSQL_INDEX";


--
-- bioentry-bioentry relationships: these are typed
--
CREATE TABLE bioentry_relationship (
        bioentry_relationship_id int  NOT NULL ,
   	object_bioentry_id 	int  NOT NULL,
   	subject_bioentry_id 	int  NOT NULL,
   	term_id 		int  NOT NULL,
   	rank 			INT,
   	PRIMARY KEY (bioentry_relationship_id),
	UNIQUE (object_bioentry_id,subject_bioentry_id,term_id)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX bioentryrel_trm   ON bioentry_relationship(term_id) TABLESPACE "BIOSQL_INDEX";
CREATE INDEX bioentryrel_child ON bioentry_relationship(subject_bioentry_id) TABLESPACE "BIOSQL_INDEX";
-- you may want to add this for mysql because MySQL often is broken with
-- respect to using the composite index for the initial keys
--CREATE INDEX bioentryrel_parent ON bioentry_relationship(object_bioentry_id);

-- for deep (depth > 1) bioentry relationship trees we need a transitive
-- closure table too
CREATE TABLE bioentry_path (
   	object_bioentry_id 	int  NOT NULL,
   	subject_bioentry_id 	int  NOT NULL,
   	term_id 		int  NOT NULL,
	distance	     	int ,
	UNIQUE (object_bioentry_id,subject_bioentry_id,term_id,distance)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX bioentrypath_trm   ON bioentry_path(term_id) TABLESPACE "BIOSQL_INDEX";
CREATE INDEX bioentrypath_child ON bioentry_path(subject_bioentry_id) TABLESPACE "BIOSQL_INDEX";
-- you may want to add this for mysql because MySQL often is broken with
-- respect to using the composite index for the initial keys
--CREATE INDEX bioentrypath_parent ON bioentry_path(object_bioentry_id);

-- some bioentries will have a sequence
-- biosequence because sequence is sometimes a reserved word

CREATE TABLE biosequence (
  	bioentry_id     int  NOT NULL,
  	version     	SMALLINT, 
  	length      	int,
  	alphabet        VARCHAR(10),
  	seq 		LONG,
	PRIMARY KEY (bioentry_id)
) TABLESPACE "BIOSQL_DATA";

-- add these only if you want them:
-- ALTER TABLE biosequence ADD ( isoelec_pt NUMERIC(4,2) );
-- ALTER TABLE biosequence ADD ( mol_wgt DOUBLE PRECISION );
-- ALTER TABLE biosequence ADD ( perc_gc DOUBLE PRECISION );

-- database cross-references (e.g., GenBank:AC123456.1)
--
-- Version may be unknown, may be undefined, or may not exist for a certain
-- accession or database (namespace). We require it here to avoid RDBMS-
-- dependent enforcement variants (version is in a compound alternative key),
-- and to simplify query construction for UK look-ups. If there is no version
-- the convention is to put 0 (zero) here. Likewise, a record with a version
-- of zero means the version is to be interpreted as NULL.
--
CREATE TABLE dbxref (
        dbxref_id	int  NOT NULL ,
        dbname          VARCHAR(40) NOT NULL,
        accession       VARCHAR(40) NOT NULL,
	version		SMALLINT  NOT NULL,
	PRIMARY KEY (dbxref_id),
        UNIQUE(accession, dbname, version)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX dbxref_db  ON dbxref(dbname) TABLESPACE "BIOSQL_INDEX";

-- for roundtripping embl/genbank, we need to have the "optional ID"
-- for the dbxref.
--
-- another use of this table could be for storing
-- descriptive text for a dbxref. for example, we may want to
-- know stuff about the interpro accessions we store (without
-- importing all of interpro), so we can attach the text
-- description as a synonym
CREATE TABLE dbxref_qualifier_value (
       	dbxref_id 		int  NOT NULL,
       	term_id 		int  NOT NULL,
  	rank  		   	INT DEFAULT 0 NOT NULL,
       	value			VARCHAR2(100),
	PRIMARY KEY (dbxref_id,term_id,rank)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX dbxrefqual_dbx ON dbxref_qualifier_value(dbxref_id) TABLESPACE "BIOSQL_INDEX";
CREATE INDEX dbxrefqual_trm ON dbxref_qualifier_value(term_id) TABLESPACE "BIOSQL_INDEX";

-- Direct dblinks. It is tempting to do this
-- from bioentry_id to bioentry_id. But that won't work
-- during updates of one database - we would have to edit
-- this table each time. Better to do the join through accession
-- and db each time. Should be almost as cheap

CREATE TABLE bioentry_dbxref ( 
       	bioentry_id        int  NOT NULL,
       	dbxref_id          int  NOT NULL,
  	rank  		   SMALLINT,
	PRIMARY KEY (bioentry_id,dbxref_id)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX dblink_dbx  ON bioentry_dbxref(dbxref_id) TABLESPACE "BIOSQL_INDEX";

-- We can have multiple references per bioentry, but one reference
-- can also be used for the same bioentry.
--
-- No two references can reference the same reference database entry
-- (dbxref_id). This is where the MEDLINE id goes: PUBMED:123456.

CREATE TABLE reference (
  	reference_id       int  NOT NULL ,
	dbxref_id	   int ,
  	location 	   VARCHAR2(100) NOT NULL,
  	title    	   VARCHAR2(100),
  	authors  	   VARCHAR2(100) NOT NULL,
  	crc	   	   VARCHAR(32),
	PRIMARY KEY (reference_id),
	UNIQUE (dbxref_id),
	UNIQUE (crc)
) TABLESPACE "BIOSQL_DATA";

-- bioentry to reference associations
CREATE TABLE bioentry_reference (
  	bioentry_id 	int  NOT NULL,
  	reference_id 	int  NOT NULL,
  	start_pos	int,
  	end_pos	  	int,
  	rank  		SMALLINT DEFAULT 0 NOT NULL,
  	PRIMARY KEY(bioentry_id,reference_id,rank)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX bioentryref_ref ON bioentry_reference(reference_id) TABLESPACE "BIOSQL_INDEX";


-- We can have multiple comments per seqentry, and
-- comments can have embedded '\n' characters

CREATE TABLE anncomment (
  	comment_id  	int  NOT NULL ,
  	bioentry_id    	int  NOT NULL,
  	comment_text   	VARCHAR2(100) NOT NULL,
  	rank   		SMALLINT DEFAULT 0 NOT NULL,
	PRIMARY KEY (comment_id),
  	UNIQUE(bioentry_id, rank)
) TABLESPACE "BIOSQL_DATA";


-- tag/value and ontology term annotation for bioentries goes here
CREATE TABLE bioentry_qualifier_value (
	bioentry_id   		int  NOT NULL,
   	term_id  		int  NOT NULL,
   	value         		VARCHAR2(100),
	rank			INT DEFAULT 0 NOT NULL,
	UNIQUE (bioentry_id,term_id,rank)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX bioentryqual_trm ON bioentry_qualifier_value(term_id) TABLESPACE "BIOSQL_INDEX";

-- feature table. We cleanly handle
--   - simple locations
--   - split locations
--   - split locations on remote sequences

CREATE TABLE seqfeature (
   	seqfeature_id 		int  NOT NULL ,
   	bioentry_id   		int  NOT NULL,
   	type_term_id		int  NOT NULL,
   	source_term_id  	int  NOT NULL,
	display_name		VARCHAR(64),
   	rank 			SMALLINT  DEFAULT 0  NOT NULL,
	PRIMARY KEY (seqfeature_id),
	UNIQUE (bioentry_id,type_term_id,source_term_id,rank)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX seqfeature_trm  ON seqfeature(type_term_id) TABLESPACE "BIOSQL_INDEX";
CREATE INDEX seqfeature_fsrc ON seqfeature(source_term_id) TABLESPACE "BIOSQL_INDEX";
-- you may want to add this for mysql because MySQL often is broken with
-- respect to using the composite index for the initial keys
--CREATE INDEX seqfeature_bioentryid ON seqfeature(bioentry_id);

-- seqfeatures can be arranged in containment hierarchies.
-- one can imagine storing other relationships between features,
-- in this case the term_id can be used to type the relationship

CREATE TABLE seqfeature_relationship (
        seqfeature_relationship_id int  NOT NULL ,
   	object_seqfeature_id	int  NOT NULL,
   	subject_seqfeature_id 	int  NOT NULL,
   	term_id 	int  NOT NULL,
   	rank 			INT,
   	PRIMARY KEY (seqfeature_relationship_id),
	UNIQUE (object_seqfeature_id,subject_seqfeature_id,term_id)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX seqfeaturerel_trm   ON seqfeature_relationship(term_id) TABLESPACE "BIOSQL_INDEX";
CREATE INDEX seqfeaturerel_child ON seqfeature_relationship(subject_seqfeature_id) TABLESPACE "BIOSQL_INDEX";
-- you may want to add this for mysql because MySQL often is broken with
-- respect to using the composite index for the initial keys
--CREATE INDEX seqfeaturerel_parent ON seqfeature_relationship(object_seqfeature_id);

-- for deep (depth > 1) seqfeature relationship trees we need a transitive
-- closure table too
CREATE TABLE seqfeature_path (
   	object_seqfeature_id	int  NOT NULL,
   	subject_seqfeature_id 	int  NOT NULL,
   	term_id 		int  NOT NULL,
	distance	     	int ,
	UNIQUE (object_seqfeature_id,subject_seqfeature_id,term_id,distance)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX seqfeaturepath_trm   ON seqfeature_path(term_id) TABLESPACE "BIOSQL_INDEX";
CREATE INDEX seqfeaturepath_child ON seqfeature_path(subject_seqfeature_id) TABLESPACE "BIOSQL_INDEX";
-- you may want to add this for mysql because MySQL often is broken with
-- respect to using the composite index for the initial keys
--CREATE INDEX seqfeaturepath_parent ON seqfeature_path(object_seqfeature_id);

-- tag/value associations - or ontology annotations
CREATE TABLE seqfeature_qualifier_value (
	seqfeature_id 		int  NOT NULL,
   	term_id 		int  NOT NULL,
   	rank 			SMALLINT DEFAULT 0  NOT NULL,
   	value  			VARCHAR2(4000) NOT NULL,
   	PRIMARY KEY (seqfeature_id,term_id,rank)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX seqfeaturequal_trm ON seqfeature_qualifier_value(term_id) TABLESPACE "BIOSQL_INDEX";
   
-- DBXrefs for features. This is necessary for genome oriented viewpoints,
-- where you have a few long sequences (contigs, or chromosomes) with many
-- features on them. In that case the features are the semantic scope for
-- their annotation bundles, not the bioentry they are attached to.

CREATE TABLE seqfeature_dbxref ( 
       	seqfeature_id      int  NOT NULL,
       	dbxref_id          int  NOT NULL,
  	rank  		   SMALLINT,
	PRIMARY KEY (seqfeature_id,dbxref_id)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX feadblink_dbx  ON seqfeature_dbxref(dbxref_id) TABLESPACE "BIOSQL_INDEX";

-- basically we model everything as potentially having
-- any number of locations, ie, a split location. SimpleLocations
-- just have one location. We need to have a location id for the qualifier
-- associations of fuzzy locations.

-- please do not try to model complex assemblies with this thing. It won't
-- work. Check out the ensembl schema for this.

-- we allow nulls for start/end - this is useful for fuzzies as
-- standard range queries will not be included

-- for remote locations, the join to make is to DBXref
-- the FK to term is a possibility to store the type of the
-- location for determining in one hit whether it's a fuzzy or not

CREATE TABLE location (
	location_id		int  NOT NULL ,
   	seqfeature_id		int  NOT NULL,
	dbxref_id		int ,
	term_id			int ,
   	start_pos              	int,
   	end_pos                	int,
   	strand             	INT NOT NULL,
   	rank          		SMALLINT DEFAULT 0  NOT NULL,
	PRIMARY KEY (location_id),
   	UNIQUE (seqfeature_id, rank)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX seqfeatureloc_start ON location(start_pos, end_pos) TABLESPACE "BIOSQL_INDEX";
CREATE INDEX seqfeatureloc_dbx   ON location(dbxref_id) TABLESPACE "BIOSQL_INDEX";
CREATE INDEX seqfeatureloc_trm   ON location(term_id) TABLESPACE "BIOSQL_INDEX";

-- location qualifiers - mainly intended for fuzzies but anything
-- can go in here
-- some controlled vocab terms have slots;
-- fuzzies could be modeled as min_start(5), max_start(5)
-- 
-- there is no restriction on extending the fuzzy ontology
-- for your own nefarious aims, although the bio* apis will
-- most likely ignore these
CREATE TABLE location_qualifier_value (
	location_id		int  NOT NULL,
   	term_id 		int  NOT NULL,
   	value  			VARCHAR(255) NOT NULL,
   	int_value 		int,
	PRIMARY KEY (location_id,term_id)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX locationqual_trm ON location_qualifier_value(term_id) TABLESPACE "BIOSQL_INDEX";

--
-- Create the foreign key constraints
--

-- ontology term

ALTER TABLE term ADD CONSTRAINT FKont_term
	FOREIGN KEY (ontology_id) REFERENCES ontology(ontology_id)
	ON DELETE CASCADE;

-- term synonyms

ALTER TABLE term_synonym ADD CONSTRAINT FKterm_syn
	FOREIGN KEY (term_id) REFERENCES term(term_id)
	ON DELETE CASCADE;

-- term_dbxref

ALTER TABLE term_dbxref ADD CONSTRAINT FKdbxref_trmdbxref
       	FOREIGN KEY (dbxref_id) REFERENCES dbxref(dbxref_id)
	ON DELETE CASCADE;
ALTER TABLE term_dbxref ADD CONSTRAINT FKterm_trmdbxref
      FOREIGN KEY (term_id) REFERENCES term(term_id)
	ON DELETE CASCADE;

-- term_relationship

ALTER TABLE term_relationship ADD CONSTRAINT FKtrmsubject_trmrel
	FOREIGN KEY (subject_term_id) REFERENCES term(term_id)
	ON DELETE CASCADE;
ALTER TABLE term_relationship ADD CONSTRAINT FKtrmpredicate_trmrel
       	FOREIGN KEY (predicate_term_id) REFERENCES term(term_id)
	ON DELETE CASCADE;
ALTER TABLE term_relationship ADD CONSTRAINT FKtrmobject_trmrel
       	FOREIGN KEY (object_term_id) REFERENCES term(term_id)
	ON DELETE CASCADE;
ALTER TABLE term_relationship ADD CONSTRAINT FKterm_trmrel
       	FOREIGN KEY (ontology_id) REFERENCES ontology(ontology_id)
	ON DELETE CASCADE;

-- term_path

ALTER TABLE term_path ADD CONSTRAINT FKtrmsubject_trmpath
	FOREIGN KEY (subject_term_id) REFERENCES term(term_id)
	ON DELETE CASCADE;
ALTER TABLE term_path ADD CONSTRAINT FKtrmpredicate_trmpath
       	FOREIGN KEY (predicate_term_id) REFERENCES term(term_id)
	ON DELETE CASCADE;
ALTER TABLE term_path ADD CONSTRAINT FKtrmobject_trmpath
       	FOREIGN KEY (object_term_id) REFERENCES term(term_id)
	ON DELETE CASCADE;
ALTER TABLE term_path ADD CONSTRAINT FKontology_trmpath
       	FOREIGN KEY (ontology_id) REFERENCES ontology(ontology_id)
	ON DELETE CASCADE;

-- taxon, taxon_name

-- unfortunately, we can't constrain parent_taxon_id as it is violated
-- occasionally by the downloads available from NCBI
-- ALTER TABLE taxon ADD CONSTRAINT FKtaxon_taxon
--         FOREIGN KEY (parent_taxon_id) REFERENCES taxon(taxon_id);
ALTER TABLE taxon_name ADD CONSTRAINT FKtaxon_taxonname
        FOREIGN KEY (taxon_id) REFERENCES taxon(taxon_id)
        ON DELETE CASCADE;

-- bioentry

ALTER TABLE bioentry ADD CONSTRAINT FKtaxon_bioentry
	FOREIGN KEY (taxon_id) REFERENCES taxon(taxon_id);
ALTER TABLE bioentry ADD CONSTRAINT FKbiodatabase_bioentry
	FOREIGN KEY (biodatabase_id) REFERENCES biodatabase(biodatabase_id);

-- bioentry_relationship

ALTER TABLE bioentry_relationship ADD CONSTRAINT FKterm_bioentryrel
	FOREIGN KEY (term_id) REFERENCES term(term_id);
ALTER TABLE bioentry_relationship ADD CONSTRAINT FKparentent_bioentryrel
	FOREIGN KEY (object_bioentry_id) REFERENCES bioentry(bioentry_id)
	ON DELETE CASCADE;
ALTER TABLE bioentry_relationship ADD CONSTRAINT FKchildent_bioentryrel
	FOREIGN KEY (subject_bioentry_id) REFERENCES bioentry(bioentry_id)
	ON DELETE CASCADE;

-- bioentry_path

ALTER TABLE bioentry_path ADD CONSTRAINT FKterm_bioentrypath
	FOREIGN KEY (term_id) REFERENCES term(term_id);
ALTER TABLE bioentry_path ADD CONSTRAINT FKparentent_bioentrypath
	FOREIGN KEY (object_bioentry_id) REFERENCES bioentry(bioentry_id)
	ON DELETE CASCADE;
ALTER TABLE bioentry_path ADD CONSTRAINT FKchildent_bioentrypath
	FOREIGN KEY (subject_bioentry_id) REFERENCES bioentry(bioentry_id)
	ON DELETE CASCADE;

-- biosequence

ALTER TABLE biosequence ADD CONSTRAINT FKbioentry_bioseq
	FOREIGN KEY (bioentry_id) REFERENCES bioentry(bioentry_id)
	ON DELETE CASCADE;

-- comment

ALTER TABLE anncomment ADD CONSTRAINT FKbioentry_comment
	FOREIGN KEY(bioentry_id) REFERENCES bioentry(bioentry_id)
	ON DELETE CASCADE;

-- bioentry_dbxref

ALTER TABLE bioentry_dbxref ADD CONSTRAINT FKbioentry_dblink
        FOREIGN KEY (bioentry_id) REFERENCES bioentry(bioentry_id)
	ON DELETE CASCADE;
ALTER TABLE bioentry_dbxref ADD CONSTRAINT FKdbxref_dblink
       	FOREIGN KEY (dbxref_id) REFERENCES dbxref(dbxref_id)
	ON DELETE CASCADE;

-- dbxref_qualifier_value

ALTER TABLE dbxref_qualifier_value ADD CONSTRAINT FKtrm_dbxrefqual
	FOREIGN KEY (term_id) REFERENCES term(term_id);
ALTER TABLE dbxref_qualifier_value ADD CONSTRAINT FKdbxref_dbxrefqual
	FOREIGN KEY (dbxref_id) REFERENCES dbxref(dbxref_id)
	ON DELETE CASCADE;

-- bioentry_reference

ALTER TABLE bioentry_reference ADD CONSTRAINT FKbioentry_entryref
	FOREIGN KEY (bioentry_id) REFERENCES bioentry(bioentry_id)
	ON DELETE CASCADE;
ALTER TABLE bioentry_reference ADD CONSTRAINT FKreference_entryref
	FOREIGN KEY (reference_id) REFERENCES reference(reference_id)
	ON DELETE CASCADE;

-- bioentry_qualifier_value

ALTER TABLE bioentry_qualifier_value ADD CONSTRAINT FKbioentry_entqual
	FOREIGN KEY (bioentry_id) REFERENCES bioentry(bioentry_id)
	ON DELETE CASCADE;
ALTER TABLE bioentry_qualifier_value ADD CONSTRAINT FKterm_entqual
	FOREIGN KEY (term_id) REFERENCES term(term_id);

-- reference 
ALTER TABLE reference ADD CONSTRAINT FKdbxref_reference
      FOREIGN KEY ( dbxref_id ) REFERENCES dbxref ( dbxref_id ) ;

-- seqfeature

ALTER TABLE seqfeature ADD CONSTRAINT FKterm_seqfeature
	FOREIGN KEY (type_term_id) REFERENCES term(term_id);
ALTER TABLE seqfeature ADD CONSTRAINT FKsourceterm_seqfeature
	FOREIGN KEY (source_term_id) REFERENCES term(term_id);
ALTER TABLE seqfeature ADD CONSTRAINT FKbioentry_seqfeature
	FOREIGN KEY (bioentry_id) REFERENCES bioentry(bioentry_id)
	ON DELETE CASCADE;

-- seqfeature_relationship

ALTER TABLE seqfeature_relationship ADD CONSTRAINT FKterm_seqfeatrel
	FOREIGN KEY (term_id) REFERENCES term(term_id);
ALTER TABLE seqfeature_relationship ADD CONSTRAINT FKparentfeat_seqfeatrel
	FOREIGN KEY (object_seqfeature_id) REFERENCES seqfeature(seqfeature_id)
	ON DELETE CASCADE;
ALTER TABLE seqfeature_relationship ADD CONSTRAINT FKchildfeat_seqfeatrel
	FOREIGN KEY (subject_seqfeature_id) REFERENCES seqfeature(seqfeature_id)
	ON DELETE CASCADE;

-- seqfeature_path

ALTER TABLE seqfeature_path ADD CONSTRAINT FKterm_seqfeatpath
	FOREIGN KEY (term_id) REFERENCES term(term_id);
ALTER TABLE seqfeature_path ADD CONSTRAINT FKparentfeat_seqfeatpath
	FOREIGN KEY (object_seqfeature_id) REFERENCES seqfeature(seqfeature_id)
	ON DELETE CASCADE;
ALTER TABLE seqfeature_path ADD CONSTRAINT FKchildfeat_seqfeatpath
	FOREIGN KEY (subject_seqfeature_id) REFERENCES seqfeature(seqfeature_id)
	ON DELETE CASCADE;

-- seqfeature_qualifier_value
ALTER TABLE seqfeature_qualifier_value ADD CONSTRAINT FKterm_featqual
	FOREIGN KEY (term_id) REFERENCES term(term_id);
ALTER TABLE seqfeature_qualifier_value ADD CONSTRAINT FKseqfeature_featqual
	FOREIGN KEY (seqfeature_id) REFERENCES seqfeature(seqfeature_id)
	ON DELETE CASCADE;

-- seqfeature_dbxref

ALTER TABLE seqfeature_dbxref ADD CONSTRAINT FKseqfeature_feadblink
        FOREIGN KEY (seqfeature_id) REFERENCES seqfeature(seqfeature_id)
	ON DELETE CASCADE;
ALTER TABLE seqfeature_dbxref ADD CONSTRAINT FKdbxref_feadblink
       	FOREIGN KEY (dbxref_id) REFERENCES dbxref(dbxref_id)
	ON DELETE CASCADE;

-- location

ALTER TABLE location ADD CONSTRAINT FKseqfeature_location
	FOREIGN KEY (seqfeature_id) REFERENCES seqfeature(seqfeature_id)
	ON DELETE CASCADE;
ALTER TABLE location ADD CONSTRAINT FKdbxref_location
	FOREIGN KEY (dbxref_id) REFERENCES dbxref(dbxref_id);
ALTER TABLE location ADD CONSTRAINT FKterm_featloc
	FOREIGN KEY (term_id) REFERENCES term(term_id);

-- location_qualifier_value

ALTER TABLE location_qualifier_value ADD CONSTRAINT FKfeatloc_locqual
	FOREIGN KEY (location_id) REFERENCES location(location_id)
	ON DELETE CASCADE;
ALTER TABLE location_qualifier_value ADD CONSTRAINT FKterm_locqual
	FOREIGN KEY (term_id) REFERENCES term(term_id);


--
-- Triggers for automatic primary key generation and other sanity checks
--

CREATE OR REPLACE TRIGGER BID_location
  BEFORE INSERT
  on location
  -- 
  for each row
BEGIN
IF :new.location_id IS NULL THEN
    SELECT location_pk_seq.nextval INTO :new.location_id FROM DUAL;
END IF;
END;
/


CREATE OR REPLACE TRIGGER BID_seqfeature
  BEFORE INSERT
  on seqfeature
  -- 
  for each row
BEGIN
IF :new.seqfeature_id IS NULL THEN
    SELECT seqfeature_pk_seq.nextval INTO :new.seqfeature_id FROM DUAL;
END IF;
END;
/


CREATE OR REPLACE TRIGGER BID_seqfeature_relationship
  BEFORE INSERT
  on seqfeature_relationship
  -- 
  for each row
BEGIN
IF :new.seqfeature_relationship_id IS NULL THEN
    SELECT seqfeature_relationship_pk_seq.nextval INTO :new.seqfeature_relationship_id FROM DUAL;
END IF;
END;
/


CREATE OR REPLACE TRIGGER BID_anncomment
  BEFORE INSERT
  on anncomment
  -- 
  for each row
BEGIN
IF :new.comment_id IS NULL THEN
    SELECT anncomment_pk_seq.nextval INTO :new.comment_id FROM DUAL;
END IF;
END;
/


CREATE OR REPLACE TRIGGER BID_reference
  BEFORE INSERT
  on reference
  -- 
  for each row
BEGIN
IF :new.reference_id IS NULL THEN
    SELECT reference_pk_seq.nextval INTO :new.reference_id FROM DUAL;
END IF;
END;
/


CREATE OR REPLACE TRIGGER BID_bioentry_relationship
  BEFORE INSERT
  on bioentry_relationship
  -- 
  for each row
BEGIN
IF :new.bioentry_relationship_id IS NULL THEN
    SELECT bioentry_relationship_pk_seq.nextval INTO :new.bioentry_relationship_id FROM DUAL;
END IF;
END;
/


CREATE OR REPLACE TRIGGER BID_bioentry
  BEFORE INSERT
  on bioentry
  -- 
  for each row
BEGIN
IF :new.bioentry_id IS NULL THEN
    SELECT bioentry_pk_seq.nextval INTO :new.bioentry_id FROM DUAL;
END IF;
-- IF :new.Division IS NULL THEN
--    :new.Division := 'UNK';
-- END IF;
END;
/


CREATE OR REPLACE TRIGGER BID_term
  BEFORE INSERT
  on term
  -- 
  for each row
BEGIN
IF :new.term_id IS NULL THEN
    SELECT term_pk_seq.nextval INTO :new.term_id FROM DUAL;
END IF;
END;
/


CREATE OR REPLACE TRIGGER BID_term_relationship
  BEFORE INSERT
  on term_relationship
  -- 
  for each row
BEGIN
IF :new.term_relationship_id IS NULL THEN
    SELECT term_relationship_pk_seq.nextval INTO :new.term_relationship_id FROM DUAL;
END IF;
END;
/


CREATE OR REPLACE TRIGGER BID_term_path
  BEFORE INSERT
  on term_path
  -- 
  for each row
BEGIN
IF :new.term_path_id IS NULL THEN
    SELECT term_path_pk_seq.nextval INTO :new.term_path_id FROM DUAL;
END IF;
END;
/

CREATE OR REPLACE TRIGGER BID_ontology
  BEFORE INSERT
  on ontology
  -- 
  for each row
BEGIN
IF :new.ontology_id IS NULL THEN
    SELECT ontology_pk_seq.nextval INTO :new.ontology_id FROM DUAL;
END IF;
END;
/


CREATE OR REPLACE TRIGGER BID_taxon
  BEFORE INSERT
  on taxon
  -- 
  for each row
BEGIN
IF :new.taxon_id IS NULL THEN
    SELECT taxon_pk_seq.nextval INTO :new.taxon_id FROM DUAL;
END IF;
END;
/


CREATE OR REPLACE TRIGGER BID_biodatabase
  BEFORE INSERT
  on biodatabase
  -- 
  for each row
BEGIN
IF :new.biodatabase_id IS NULL THEN
    SELECT biodatabase_pk_seq.nextval INTO :new.biodatabase_id FROM DUAL;
END IF;
END;
/


CREATE OR REPLACE TRIGGER BID_dbxref
  BEFORE INSERT
  on dbxref
  -- 
  for each row
BEGIN
IF :new.dbxref_id IS NULL THEN
    SELECT dbxref_pk_seq.nextval INTO :new.dbxref_id FROM DUAL;
END IF;
END;
/



From mark.schreiber at group.novartis.com  Sun Nov 21 21:39:58 2004
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Sun Nov 21 21:38:05 2004
Subject: [Biojava-l] opening unknown fasta file
Message-ID: <OF0FC23168.E2B50C5C-ON48256F54.000E3CED-48256F54.000EA55A@EU.novartis.net>

One way to do this would be to create a Unicode alphabet (or ASCII 
alphabet) and read the file into a Sequence of that Alphabet, create a 
Distribution, compare it to the DNA/ RNA/ Protein distributions using 
DistributionTools and then convert it to the correct Alphabet.

Even more ambitious would be to read the whole file to a text buffer and 
guess the format and alphabet based on the usage of characters.

Does anyone feel inspired to do something like this? We are always getting 
emails from students looking for short projects. How about that one? My 
basic minimal requirement would be that the file should not be read twice. 
I/O is expensive; memory is cheap.
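
A rough sketch of the cheaper end of this idea, in plain Java with no BioJava classes. It makes one pass over residues that are already in memory and applies a simple composition rule, which is much cruder than comparing Distributions with DistributionTools. The class name, the Kind enum and the 0.9 cut-off are all assumptions made for illustration, and a short peptide made only of A, C, G and T will still be called DNA.

```java
class AlphabetGuess {
    enum Kind { DNA, RNA, PROTEIN, UNKNOWN }

    // Assumed cut-off: fraction of letters that must be A, C, G, T, U or N
    // before the sequence is called nucleotide. Not taken from BioJava.
    static final double NUCLEOTIDE_FRACTION = 0.9;

    static Kind guess(CharSequence residues) {
        int letters = 0, nucleotideLike = 0, t = 0, u = 0;
        for (int i = 0; i < residues.length(); i++) {
            char c = Character.toUpperCase(residues.charAt(i));
            if (c < 'A' || c > 'Z') continue; // skip gaps, digits, whitespace
            letters++;
            if ("ACGTUN".indexOf(c) >= 0) nucleotideLike++;
            if (c == 'T') t++;
            if (c == 'U') u++;
        }
        if (letters == 0) return Kind.UNKNOWN;
        if ((double) nucleotideLike / letters < NUCLEOTIDE_FRACTION) return Kind.PROTEIN;
        // More U than T is taken as a sign of RNA.
        return u > t ? Kind.RNA : Kind.DNA;
    }

    public static void main(String[] args) {
        System.out.println(guess("ACGTNNACGTACGT"));
        System.out.println(guess("ACGUACGU"));
        System.out.println(guess("MKVLAAGIEFPQ"));
    }
}
```

Because the counts are accumulated character by character, the same loop could be fed from a buffered reader so that the file is only read once.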

- Mark


Thomas Down <thomas@derkholm.net>
Sent by: biojava-l-bounces@portal.open-bio.org
11/13/2004 12:26 AM

 
        To:     Mark Schreiber/GP/Novartis@PH
        cc:     biojava-list <biojava-l@biojava.org>
        Subject:        Re: [Biojava-l] opening unknown fasta file


On Fri, Nov 12, 2004 at 10:01:13AM +0800, 
mark.schreiber@group.novartis.com wrote:
> 
> Basically there is absolutely no failsafe way to know if a fasta file is
> DNA or Protein (or RNA). It's perfectly reasonable to have a short peptide
> which contains only acg and t although it becomes very unlikely with
> longer sequences.

The real problem isn't A, C, G, or T, but the other 11 ambiguity symbols
that appear in DNA sequences.  Ns are everywhere, but many of the other
ambiguities appear from time to time, too.

If we were *really* serious about alphabet-guessing (which scares me, to be
honest), one option would be to calculate histograms of character frequencies
in EMBL and Swissprot, and look for the closest match.  I believe that
Internet Explorer takes this approach when it hits a web page without an
explicitly-specified character encoding -- it apparently works pretty well...

Does anyone feel this serious?

       Thomas.
_______________________________________________
Biojava-l mailing list  -  Biojava-l@biojava.org
http://biojava.org/mailman/listinfo/biojava-l


From jdiggans at excelsiortech.com  Mon Nov 22 01:38:20 2004
From: jdiggans at excelsiortech.com (James Diggans)
Date: Mon Nov 22 01:30:52 2004
Subject: [Biojava-l] Parsing MegaBLAST output files?
Message-ID: <41A1895C.7000302@excelsiortech.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


All, I'm attempting to use BioJava to parse the output from NCBI's
commandline MegaBLAST and receiving an error:

'Could not recognise the format of this file as one supported by the
framework.'

in a SAXException thrown by BlastLikeSAXParser. An old post to the
mailing list:

http://www.biojava.org/pipermail/biojava-dev/2002-October/000150.html

seems to indicate that this was fixed long ago via this commit to CVS:

http://cvs.biojava.org/cgi-bin/viewcvs/viewcvs.cgi/biojava-live/src/org/biojava/bio/program/ssbind/HeaderStAXHandler.java.diff?r1=1.3&r2=1.4&cvsroot=biojava

The MegaBLAST file I'm trying to parse is clean and my attempt at a
parse consists of (largely pulled from the recipe from BioJava in Anger):

- ------------------
InputStream is = new FileInputStream(blastResult);

BlastLikeSAXParser parser = new BlastLikeSAXParser();
SeqSimilarityAdapter adapter = new SeqSimilarityAdapter();
parser.setContentHandler(adapter);

alignmentResults = new ArrayList();
SearchContentHandler builder = new BlastLikeSearchBuilder(
        alignmentResults,
        new DummySequenceDB("queries"),
        new DummySequenceDBInstallation());

adapter.setSearchContentHandler(builder);

parser.parse(new InputSource(is));
- ------------------

Any ideas on why I'm getting the SAXException? Thanks ...
- -j

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3-nr1 (Windows XP)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFBoYlc75jgGJzUhNkRAu8zAJ9gTNoPouk4/29EDpWKcQVx5EB34gCg2MkD
DndldC3zi3bD2QKWgqMNOxs=
=TS47
-----END PGP SIGNATURE-----
From mark.schreiber at group.novartis.com  Mon Nov 22 19:45:38 2004
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Mon Nov 22 19:43:33 2004
Subject: [Biojava-l] Parsing MegaBLAST output files?
Message-ID: <OFD16692E9.042E9C18-ON48256F55.0003D662-48256F55.00042DC9@EU.novartis.net>

Hello -

MegaBLAST is not officially supported. This doesn't mean it won't work; it
just means we don't know if it will work. If it isn't too different from
normal blast it probably will.

The BlastLikeSAXParser has two modes: Lazy and Strict. If you call
setModeLazy() before parsing, it won't care if it doesn't recognise the
format as one that is tried and tested and will attempt to parse it
anyway. You should carefully check a few results, though, to make sure it is
going well. If things work, let us know so we can add MegaBLAST to the list
of trusted programs.

Hope this helps,

Mark




From johnny.hujol at comcast.net  Mon Nov 22 21:09:13 2004
From: johnny.hujol at comcast.net (Johnny Hujol)
Date: Mon Nov 22 21:08:02 2004
Subject: [Biojava-l] Exception found in GenbankSequenceDB -- getSequence
Message-ID: <41A29BC9.7090305@comcast.net>

Hi,
I'm using biojava-1.30-jdk14.jar with
C:\>java -version
java version "1.5.0"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0-b64)
Java HotSpot(TM) Client VM (build 1.5.0-b64, mixed mode)
On W2000 SP2.

When running this little Java code:

Sequence seqObject = null;

try {
    seqObject = genbankSequenceDB.getSequence(text);
    SeqIOTools.writeGenbank(System.out, seqObject);
} catch (Exception e1) {
    e1.printStackTrace();
}
String sequence = seqObject.seqString();

This throws the following exception:

got data from http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=55740946
Exception found in GenbankSequenceDB -- getSequence
org.biojava.bio.BioException: Could not read sequence
Exception in thread "AWT-EventQueue-0" java.lang.NullPointerException
    at org.jfb.chp1.listing1_6.SequenceForm1_6$4.focusLost(SequenceForm1_6.java:272)
    at java.awt.AWTEventMulticaster.focusLost(AWTEventMulticaster.java:172)
    at java.awt.Component.processFocusEvent(Component.java:5380)
    at java.awt.Component.processEvent(Component.java:5244)
    at java.awt.Container.processEvent(Container.java:1966)
    at java.awt.Component.dispatchEventImpl(Component.java:3955)
    at java.awt.Container.dispatchEventImpl(Container.java:2024)
    at java.awt.Component.dispatchEvent(Component.java:3803)
    at java.awt.KeyboardFocusManager.redispatchEvent(KeyboardFocusManager.java:1810)
    at java.awt.DefaultKeyboardFocusManager.typeAheadAssertions(DefaultKeyboardFocusManager.java:836)
    at java.awt.DefaultKeyboardFocusManager.dispatchEvent(DefaultKeyboardFocusManager.java:526)
    at java.awt.Component.dispatchEventImpl(Component.java:3841)
    at java.awt.Container.dispatchEventImpl(Container.java:2024)
    at java.awt.Component.dispatchEvent(Component.java:3803)
    at java.awt.EventQueue.dispatchEvent(EventQueue.java:463)
    at java.awt.EventDispatchThread.pumpOneEventForHierarchy(EventDispatchThread.java:234)
    at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:163)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:157)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:149)
    at java.awt.EventDispatchThread.run(EventDispatchThread.java:110)

Process finished with exit code 0

The url looks good but it seems that the parser does not get anything 
from the data returned by the server.
What's happening?
Any help would be appreciated.

Cheers,
J
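
One thing that can be read off the snippet itself: seqObject is still null when getSequence fails, so the seqString() call placed after the catch block would most likely raise the NullPointerException seen in the trace, whatever the reason for the failed fetch. A minimal sketch of keeping the dependent call inside the try follows. It uses a hypothetical fetch() stand-in rather than GenbankSequenceDB so that it runs without BioJava or network access; the class and method names are assumptions.

```java
class NullGuardSketch {
    // Hypothetical stand-in for a remote lookup that can fail.
    static String fetch(String id) throws Exception {
        if (id == null || id.isEmpty()) {
            throw new Exception("Could not read sequence");
        }
        return "acgt";
    }

    // Returns the sequence string, or null if the lookup failed.
    static String sequenceOrNull(String id) {
        try {
            String seq = fetch(id);
            // Use the result only after a successful fetch.
            return seq.toUpperCase();
        } catch (Exception e) {
            System.err.println("lookup failed for '" + id + "': " + e.getMessage());
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(sequenceOrNull("55740946"));
        System.out.println(sequenceOrNull(""));
    }
}
```

This does not explain why the sequence could not be read; it only keeps the failure from surfacing a second time as an unrelated exception.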
From jdiggans at excelsiortech.com  Tue Nov 23 00:08:02 2004
From: jdiggans at excelsiortech.com (James Diggans)
Date: Tue Nov 23 00:00:58 2004
Subject: [Biojava-l] Parsing MegaBLAST output files?
In-Reply-To: <OFD16692E9.042E9C18-ON48256F55.0003D662-48256F55.00042DC9@EU.novartis.net>
References: <OFD16692E9.042E9C18-ON48256F55.0003D662-48256F55.00042DC9@EU.novartis.net>
Message-ID: <41A2C5B2.8010302@excelsiortech.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Thanks for the reply, Mark. Setting the parser to be lazy (just before
the parse; it shouldn't matter where I do this as long as it's prior to
the parse, correct?) doesn't seem to help -- I still get the same SAX
exception. The MegaBLAST output seems, to my eye, to be identical to
that of blastn minus the header line:

	MEGABLAST 2.2.10 [Oct-19-2004]

Looking at the code for BlastLikeSAXParser, it seems, even in lazy mode,
to require that the header line contain at least a name with which it is
familiar (lazy just turns off interest in the version number). Would a
fix be as simple as adding 'MEGABLAST' to the list of acceptable names?
I can provide any interested dev w/ a sample output file from the
above-mentioned version of MegaBLAST.

If no one's interested, I'll follow up but it'll take me a lot longer
than those already familiar w/ the BioJava parser code.

Thanks all,
- -j

mark.schreiber@group.novartis.com wrote:
| Hello -
|
| MegaBLAST is not officially supported. This doesn't mean it won't work; it
| just means we don't know if it will work. If it isn't too different from
| normal blast it probably will.
|
| The BlastLikeSAXParser has two modes. Lazy and Strict. If you call
| setModeLazy() before parsing it won't care if it doesn't recognise the
| format as one that is tried and tested and will attempt to parse it
| anyway. You should carefully check a few results though to make sure it is
| going well. If things work let us know so we can add MegaBLAST to the list
| of trusted programs.
|
| Hope this helps,
|
| Mark
|
|
| James Diggans <jdiggans@excelsiortech.com>
| Sent by: biojava-l-bounces@portal.open-bio.org
| 11/22/2004 02:38 PM
|
|
|         To:     BioJava <biojava-l@biojava.org>
|         cc:     (bcc: Mark Schreiber/GP/Novartis)
|         Subject:        [Biojava-l] Parsing MegaBLAST output files?
|
|
|
|
| All, I'm attempting to use BioJava to parse the output from NCBI's
| commandline MegaBLAST and receiving an error:
|
| 'Could not recognise the format of this file as one supported by the
| framework.'
|
| in a SAXException thrown by BlastLikeSAXParser. An old post to the
| mailing list:
|
| http://www.biojava.org/pipermail/biojava-dev/2002-October/000150.html
|
| seems to indicate that this was fixed long ago via this commit to CVS:
|
|
| http://cvs.biojava.org/cgi-bin/viewcvs/viewcvs.cgi/biojava-live/src/org/biojava/bio/program/ssbind/HeaderStAXHandler.java.diff?r1=1.3&r2=1.4&cvsroot=biojava
|
| The MegaBLAST file I'm trying to parse is clean and my attempt at a
| parse consists of (largely pulled from the recipe from BioJava in Anger):
|
| ------------------
| InputStream is = new FileInputStream(blastResult);
|
| BlastLikeSAXParser parser = new BlastLikeSAXParser();
| SeqSimilarityAdapter adapter = new SeqSimilarityAdapter();
| parser.setContentHandler(adapter);
|
| alignmentResults = new ArrayList();
| SearchContentHandler builder = new
|                  BlastLikeSearchBuilder(alignmentResults,
|                                  new DummySequenceDB("queries"),
|                                  new DummySequenceDBInstallation());
|
| adapter.setSearchContentHandler(builder);
|
| parser.parse(new InputSource(is));
| ------------------
|
| Any ideas on why I'm getting the SAXException? Thanks ...
| -j
|
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3-nr1 (Windows XP)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFBosWy75jgGJzUhNkRAtL+AJ9V6JoMXSdT1AWPuFGMckUiMzFO5ACg2D1r
2R75Y4ElTIBxrMA+Pukgre0=
=Is3P
-----END PGP SIGNATURE-----
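
To make the header check described in the message above concrete, here is a
minimal sketch in plain Java. It is not the BlastLikeSAXParser code: the class
BlastHeaderCheck, its methods and the list of program names are all
assumptions made up for illustration. It only shows why a report whose first
line starts with MEGABLAST could be rejected by a name whitelist while a
blastn report passes, and what adding the name to the list would change.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class BlastHeaderCheck {

    // Illustrative whitelist only; NOT copied from BlastLikeSAXParser.
    static final List<String> KNOWN_PROGRAMS =
            Arrays.asList("BLASTN", "BLASTP", "BLASTX", "TBLASTN", "TBLASTX");

    /** Returns the first whitespace-separated token of the header, upper-cased. */
    static String programName(String headerLine) {
        String trimmed = headerLine.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        return trimmed.split("\\s+")[0].toUpperCase(Locale.ROOT);
    }

    /** True if the program named in the header is in the given whitelist. */
    static boolean isRecognised(String headerLine, List<String> known) {
        return known.contains(programName(headerLine));
    }

    public static void main(String[] args) {
        String mega = "\tMEGABLAST 2.2.10 [Oct-19-2004]";
        System.out.println(isRecognised(mega, KNOWN_PROGRAMS));   // name not in list

        // Hypothetical fix: extend the whitelist with the new program name.
        List<String> extended = new ArrayList<String>(KNOWN_PROGRAMS);
        extended.add("MEGABLAST");
        System.out.println(isRecognised(mega, extended));
    }
}
```

Whether the real parser needs more than a new name (for example different
handling of the hit list) can only be settled by running it on a sample file.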
From vc100 at doc.ic.ac.uk  Tue Nov 23 11:44:25 2004
From: vc100 at doc.ic.ac.uk (Vasa Curcin)
Date: Tue Nov 23 11:42:07 2004
Subject: [Biojava-l] Writing EMBL files
Message-ID: <41A368E9.4020800@doc.ic.ac.uk>

Hello,

We are loading an EMBL file into a SequenceDB and then writing it out 
again and getting the following error:

16:42:19,032 INFO  [STDOUT]     at org.biojava.bio.seq.io.EmblFileFormer.addSequenceProperty(EmblFileFormer.java:246)
16:42:19,032 INFO  [STDOUT]     at org.biojava.bio.seq.io.SeqIOEventEmitter.getSeqIOEvents(SeqIOEventEmitter.java:92)
16:42:19,032 INFO  [STDOUT]     at org.biojava.bio.seq.io.EmblLikeFormat.writeSequence(EmblLikeFormat.java:289)
16:42:19,032 INFO  [STDOUT]     at org.biojava.bio.seq.io.EmblLikeFormat.writeSequence(EmblLikeFormat.java:253)
16:42:19,032 INFO  [STDOUT]     at org.biojava.bio.seq.io.StreamWriter.writeStream(StreamWriter.java:63)
16:42:19,032 INFO  [STDOUT]     at org.biojava.bio.seq.io.SeqIOTools.writeEmbl(SeqIOTools.java:289)
16:42:19,032 INFO  [STDOUT]     at SequenceDBToText.process(SequenceDBToText.java:134)

This is the file we are using:

ID   AB126240   standard; genomic DNA; PRO; 1350 BP.
XX
AC   AB126240;
XX
SV   AB126240.1
XX
DT   03-SEP-2004 (Rel. 81, Created)
DT   03-SEP-2004 (Rel. 81, Last updated, Version 1)
XX
DE   Thermococcus kodakaraensis Tko1062 gene for phosphosugar mutase, complete
DE   cds.
XX
KW   .
XX
OS   Thermococcus kodakaraensis
OC   Archaea; Euryarchaeota; Thermococci; Thermococcales; Thermococcaceae;
OC   Thermococcus.
XX
RN   [1]
RP   1-1350
RA   Imanaka T., Atomi H., Rashid N.;
RT   ;
RL   Submitted (15-NOV-2003) to the EMBL/GenBank/DDBJ databases.
RL   Tadayuki Imanaka, Kyoto University, Synthetic Chemistry & Biological
RL   Chemistry, Graduate School of Engineering; Katsura, Nishikyo-ku, Kyoto
RL   615-8510, Japan (E-mail:imanaka@sbchem.kyoto-u.ac.jp, Tel:81-75-383-2777,
RL   Fax:81-75-383-2778)
XX
RN   [2]
RA   Rashid N., Kanai T., Atomi H., Imanaka T.;
RT   "Among Multiple Phosphomannomutase Gene Orthologues, Only One Gene Encodes
RT   a Protein with Phosphoglucomutase and Phosphomannomutase Activities in
RT   Thermococcus kodakaraensis";
RL   J. Bacteriol. 186:6070-6076(2004).
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..1350
FT                   /db_xref="taxon:69014"
FT                   /mol_type="genomic DNA"
FT                   /organism="Thermococcus kodakaraensis"
FT                   /strain="KOD1"
FT   CDS             1..1350
FT                   /codon_start=1
FT                   /transl_table=11
FT                   /gene="Tko1062"
FT                   /product="phosphosugar mutase"
FT                   /protein_id="BAD42439.1"
FT                   /translation="MGKYFGTSGIREVFNEKLTPELALKVGKALGTYLGGGKVVIGKDT
FT                   RTSGDVIKSAVISGLLSTGVDVIDIGLAPTPLTGFAIKLYGADAGVTITASHNPPEYNG
FT                   IKVWQANGMAYTSEMERELESIMDSGNFKKAPWNEIGTLRRADPSEEYINAALKFVKLE
FT                   NSYTVVLDSGNGAGSVVSPYLQRELGNRVISLNSHPSGFFVRELEPNAKSLSALAKTVR
FT                   VMKADVGIAHDGDADRIGVVDDQGNFVEYEVMLSLIAGYMLRKFGKGKIVTTVDAGFAL
FT                   DDYLRPLGGEVIRTRVGDVAVADELAKHGGVFGGEPSGTWIIPQWNLTPDGIFAGALVL
FT                   EMIDRLGPISELAKEVPRYVTLRAKIPCPNEKKAKAMEIIAREALKTFDYEGLIDIDGI
FT                   RIENGDWWILFRPSGTEPIMRITLEAHEEEKAKELMGKAERLVKKAISEA"
XX
SQ   Sequence 1350 BP; 339 A; 341 C; 417 G; 253 T; 0 other;
     atggggaagt acttcggaac cagcggaatc agggaagtct ttaatgagaa gctgacacct        60
     gagctggctc taaaggtcgg caaagccctt ggaacgtacc tcggcggcgg aaaggttgtt       120
     atcgggaagg ataccaggac tagcggcgac gttataaaat cagcagtcat aagcggactt       180
     ctctcaactg gtgttgatgt gattgacata ggtttagcgc caacgccgct cacgggcttt       240
     gcgataaagc tctacggtgc cgatgctggc gttaccatca cagcttctca caacccgccg       300
     gagtacaacg gcataaaggt gtggcaggcc aacggaatgg catacacctc tgagatggag       360
     cgtgaactcg agtccataat ggactcaggg aacttcaaaa aagctccctg gaatgagatc       420
     gggacgctta gaagggccga ccccagtgag gagtacataa acgcggcgct aaaattcgtc       480
     aaacttgaga actcctacac ggtcgtcctc gattctggaa acggtgcggg ctcggtggtc       540
     tccccctacc tccagcggga gctgggcaat agggttatct cgctcaactc ccacccgagc       600
     ggcttcttcg tcagggaact tgagccgaac gcgaagagcc tctccgccct agcgaagacc       660
     gttagagtga tgaaagccga cgtcggcata gcccacgacg gcgacgcaga taggatcggc       720
     gtcgttgatg atcagggcaa cttcgttgag tacgaggtca tgctctcgct catagcgggc       780
     tacatgctga ggaagttcgg gaaggggaaa atagttacca ccgttgatgc gggctttgct       840
     ttggacgact acctcagacc ccttggcgga gaagtcataa ggacgcgcgt tggtgatgtg       900
     gccgttgccg acgagctcgc aaaacacggc ggcgtcttcg gcggcgagcc gagtggcacg       960
     tggataatcc cgcagtggaa cctcaccccc gacggaatct ttgctggggc ccttgttctg      1020
     gagatgattg acagactcgg tccgataagc gagctggcca aggaagtccc gcgctacgtg      1080
     acgctccgcg ccaaaatccc ctgtccgaac gagaagaagg cgaaagccat ggagataata      1140
     gcgcgcgagg cactaaagac gttcgactac gaggggctga tagacataga tggaattagg      1200
     atagaaaacg gtgactggtg gatcctcttc cgcccgagcg gaaccgagcc gataatgcgc      1260
     ataactttgg aggcccacga ggaagagaag gcgaaggagc tgatggggaa ggcggagagg      1320
     ctggttaaga aagccatctc ggaggcctga                                       1350
//

 
Any ideas?

Regards,
Vasa
From mark.schreiber at group.novartis.com  Thu Nov 25 20:04:47 2004
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Thu Nov 25 20:02:40 2004
Subject: [Biojava-l] Re: Biojava query
Message-ID: <OF0D919652.207B1882-ON48256F58.0005A0B3-48256F58.0005EEAE@EU.novartis.net>

Hi Russell -

This is a script I use to see which blast items are treated as what kind 
of events. Note that the object model doesn't capture everything from a 
report but you can always extend or write your own listener that gets what 
you want. Use the BlastEcho program below to figure out which events you 
need to listen for...

import org.xml.sax.*;
import java.io.*;
import org.biojava.bio.program.sax.*;
import org.biojava.bio.program.ssbind.*;
import org.biojava.bio.search.*;

/**
 * <p>Echoes events from a BLAST-like SAX parser</p>
 * @author Mark Schreiber
 * @version 1.0
 */

public class BlastEcho {
  public BlastEcho() {
  }

  private void echo (InputSource source) throws IOException, SAXException{
    //make a BlastLikeSAXParser
    BlastLikeSAXParser parser = new BlastLikeSAXParser();
    parser.setModeLazy();

    ContentHandler handler = new SeqSimilarityAdapter();
    SearchContentHandler scHandler = new EchoSCHandler();
    ((SeqSimilarityAdapter)handler).setSearchContentHandler(scHandler);

    parser.setContentHandler(handler);
    parser.parse(source);
  }

  private class EchoSCHandler extends SearchContentAdapter{
    public void startHit(){
      System.out.println("startHit()");
    }
    public void endHit(){
      System.out.println("endHit()");
    }
    public void startSubHit(){
      System.out.println("startSubHit()");
    }
    public void endSubHit(){
      System.out.println("endSubHit()");
    }
    public void startSearch(){
      System.out.println("startSearch");
    }
    public void endSearch(){
      System.out.println("endSearch");
    }
    public void addHitProperty(Object key, Object val){
      System.out.println("\tHitProp:\t"+key+": "+val);
    }
    public void addSearchProperty(Object key, Object val){
      System.out.println("\tSearchProp:\t"+key+": "+val);
    }
    public void addSubHitProperty(Object key, Object val){
      System.out.println("\tSubHitProp:\t"+key+": "+val);
    }
  }

  public static void main(String[] args) throws Exception{
    InputSource is = new InputSource(new FileInputStream(args[0]));
    BlastEcho blastEcho = new BlastEcho();
    blastEcho.echo(is);
  }

}
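
For the query-length question quoted at the end of this message, one option
that avoids the object model altogether is to read the value straight from
the report text. A minimal sketch, assuming classic NCBI BLAST text output in
which the line after "Query=" reads something like "(1350 letters)"; the
class and method names are made up for illustration and are not part of
BioJava.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QueryLengthSketch {

    // Matches e.g. "(1350 letters)" or "(1,350 letters)".
    private static final Pattern LETTERS =
            Pattern.compile("\\(([\\d,]+) letters\\)");

    /** Returns the first "(N letters)" value in the text, or -1 if none is found. */
    static int firstQueryLength(String reportText) {
        Matcher m = LETTERS.matcher(reportText);
        if (!m.find()) {
            return -1;
        }
        // Drop thousands separators before parsing the number.
        return Integer.parseInt(m.group(1).replace(",", ""));
    }
}
```

This only returns the first match, so for a report holding several queries
the text would have to be split per query first. If the echo program above
shows the length arriving as a search property, listening for that event is
the cleaner route.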

Mark Schreiber
Principal Scientist (Bioinformatics)

Novartis Institute for Tropical Diseases (NITD)
10 Biopolis Road
#05-01 Chromos
Singapore 138670
www.nitd.novartis.com

phone +65 6722 2973
fax  +65 6722 2910


"Smithies, Russell" <Russell.Smithies@agresearch.co.nz>
11/26/2004 08:42 AM

 
        To:     Mark Schreiber/GP/Novartis@PH
        cc: 
        Subject:        Biojava query


Hi Mark,

Just a quick question about Blast parsing,
How do you get the length of the query sequence with the parser example
on BJinA?
It's not in the annotations of the SeqSimilaritySearchResult. 
That only has databaseId, queryId, program, and version :-(
 

Russell 


From bindu_j2000 at yahoo.com  Wed Nov 24 21:54:44 2004
From: bindu_j2000 at yahoo.com (smitha kantipudi)
Date: Sun Nov 28 12:14:37 2004
Subject: [Biojava-l] 3D Dot Matrix for sequence similarity 
Message-ID: <20041125025444.79984.qmail@web60003.mail.yahoo.com>

Hi,
 
Can anyone tell me how to implement a 3D dot matrix for sequence similarity in Java or Perl?
For this we have to use sum of pairs, an amino acid substitution matrix, etc.
 
Thank you in advance.
 
Smitha.
 

From voisingreg at yahoo.fr  Fri Nov 26 11:45:42 2004
From: voisingreg at yahoo.fr (gregory voisin)
Date: Sun Nov 28 12:14:48 2004
Subject: [Biojava-l] biojava and microArray:::::::::Proget
Message-ID: <20041126164542.48653.qmail@web60409.mail.yahoo.com>

Hi BioJava users and microarrayers,
 
I'm trying to develop some Java classes to manipulate expression data.
So far these classes are tailored to my own use, so for the moment they are fairly specific.
 
I'm not a software developer, just a humble bioinformatician, but I would like to start this project (which is far too big a fish for me alone) and work with several other people to develop a new, simple package for BioJava.
 
Thanks for your time and your energy.
 
May the wind fill your sails and the sun shine on your faces.
 
 
VOISIN greg.
Bioinformaticien.
Centre de recherche du CHUM.
MONTREAL
		
From heuermh at acm.org  Mon Nov 29 13:37:15 2004
From: heuermh at acm.org (Michael Heuer)
Date: Mon Nov 29 13:37:34 2004
Subject: [Biojava-l] biojava and microArray:::::::::Proget
In-Reply-To: <20041126164542.48653.qmail@web60409.mail.yahoo.com>
Message-ID: <Pine.GSO.4.44.0411291321370.14655-100000@shell3.shore.net>


I'm willing to coordinate efforts to bring gene expression support to
biojava.  However, I don't think it should be done without proper support
for MAGE and the MAGE Ontology, out of respect to those active standards
communities.

I've set up a wiki to discuss a biojava-expr library at

> http://hume.ccgb.umn.edu:8668/space/BiojavaExpr

Feel free to register to create an account, then edit that page or create
new ones linked from that page with your design considerations,
implementation ideas, or feature requirements.

Alternatively, I feel that the biojava developers mailing list is the most
appropriate venue for this kind of discussion, and would recommend that
anyone interested in contributing to a biojava-expr library subscribe to
and post there.

>  http://www.biojava.org/mailman/listinfo/biojava-dev

   michael


On Fri, 26 Nov 2004, gregory voisin wrote:

> Hi BioJava users and microarrayers,
>
> I'm trying to develop some Java classes to manipulate expression data.
> So far these classes are tailored to my own use, so for the moment they
> are fairly specific.
>
> I'm not a software developer, just a humble bioinformatician, but I
> would like to start this project (which is far too big a fish for me
> alone) and work with several other people to develop a new, simple
> package for BioJava.
>
> Thanks for your time and your energy.
>
> May the wind fill your sails and the sun shine on your faces.
>
> VOISIN greg.
> Bioinformaticien.
> Centre de recherche du CHUM.
> MONTREAL


From Anna.Henricson at cgb.ki.se  Tue Nov 30 05:06:19 2004
From: Anna.Henricson at cgb.ki.se (Anna Henricson)
Date: Tue Nov 30 05:04:16 2004
Subject: [Biojava-l] Parsing an EMBL flatfile
Message-ID: <EOEJIPKFOLMCOPIEHGGNCEAGCGAA.Anna.Henricson@cgb.ki.se>

Hi,
I'm new to BioJava and I'm trying to parse an EMBL flatfile. I am
especially interested in the information in the CDS section of the Feature
Table. I have looked at the examples and tutorials on the BioJava website
and tried using FeatureFilter.ByType("CDS"); however, that only gives me
the exons to join and not the information that follows, such as protein_id,
db_xref, the amino acid sequence, etc.
Instead, I have been trying to use the EmblLikeFormat class, EmblProcessor,
FeatureTableParser and EmblLikeLocationParser, but I can't really put it
together.
I would really appreciate some help, since I'm the only one around here that
is using Biojava.
Thanks!
/Anna

--------------------------------------------
Anna Henricson, MSc, PhD student
Center for Genomics and Bioinformatics (CGB)
Karolinska Institutet
S-171 77 Stockholm
Sweden
Phone: +46 (0)8 524 87296
Fax: +46 (0)8 337983


From mark.schreiber at group.novartis.com  Tue Nov 30 19:49:31 2004
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Tue Nov 30 19:47:20 2004
Subject: [Biojava-l] Parsing an EMBL flatfile
Message-ID: <OFC119C270.BB67B05D-ON48256F5D.0004417B-48256F5D.00048970@EU.novartis.net>

Hi Anna,

I think that information is probably ending up in an Annotation object. 
You can use the example TreeView program to interactively find out how a 
file is parsed and which features and Annotations end up where in the 
object model (http://www.biojava.org/docs/bj_in_anger/treeView.htm)

Let me know if this doesn't help.
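
Independently of BioJava, it can help to see exactly which qualifiers a CDS
entry carries, since those are the values one would expect to turn up as
annotation properties. Below is a minimal plain-Java sketch, assuming the
standard EMBL feature-table layout (feature key in columns 6-21, location and
qualifiers from column 22) and an unwrapped file. The class and method names
are hypothetical and this is not how BioJava's own parser works.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class EmblQualifierSketch {

    /** Returns one qualifier map per feature of the given type (e.g. "CDS"). */
    static List<Map<String, String>> qualifiers(List<String> lines, String type) {
        List<Map<String, String>> found = new ArrayList<Map<String, String>>();
        Map<String, String> current = null; // feature being read, or null if not wanted
        String openKey = null;              // qualifier whose quoted value is unfinished
        for (String line : lines) {
            if (!line.startsWith("FT") || line.length() <= 21) {
                continue;                   // not a feature-table line with content
            }
            String featureKey = line.substring(5, 21).trim(); // columns 6-21
            String rest = line.substring(21).trim();          // column 22 onwards
            if (!featureKey.isEmpty()) {    // a new feature begins on this line
                openKey = null;
                current = featureKey.equals(type)
                        ? new LinkedHashMap<String, String>() : null;
                if (current != null) {
                    found.add(current);
                }
            } else if (current != null && !rest.isEmpty()) {
                if (openKey != null) {      // continuation of a quoted value
                    boolean closes = rest.endsWith("\"");
                    String chunk = closes ? rest.substring(0, rest.length() - 1) : rest;
                    // Sequence data is joined directly, free text with a space.
                    String sep = openKey.equals("translation") ? "" : " ";
                    String soFar = current.get(openKey);
                    current.put(openKey, soFar.isEmpty() ? chunk : soFar + sep + chunk);
                    if (closes) {
                        openKey = null;
                    }
                } else if (rest.startsWith("/")) {  // start of a new qualifier
                    int eq = rest.indexOf('=');
                    String name = eq < 0 ? rest.substring(1) : rest.substring(1, eq);
                    String value = eq < 0 ? "" : rest.substring(eq + 1);
                    if (value.startsWith("\"")) {
                        value = value.substring(1);
                        if (value.endsWith("\"")) {
                            value = value.substring(0, value.length() - 1);
                        } else {
                            openKey = name; // value continues on following lines
                        }
                    }
                    // A repeated qualifier keeps only its last value in this sketch.
                    current.put(name, value);
                }
            }
        }
        return found;
    }
}
```

Calling qualifiers(lines, "CDS") on the lines of a flatfile gives one map per
CDS feature, from which protein_id, db_xref and translation can be looked up
by name. Location continuation lines and doubled quotes inside values are
deliberately ignored here.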

Regards,

Mark Schreiber
Principal Scientist (Bioinformatics)

Novartis Institute for Tropical Diseases (NITD)
10 Biopolis Road
#05-01 Chromos
Singapore 138670
www.nitd.novartis.com

phone +65 6722 2973
fax  +65 6722 2910


"Anna Henricson" <Anna.Henricson@cgb.ki.se>
Sent by: biojava-l-bounces@portal.open-bio.org
11/30/2004 06:06 PM

 
        To:     <biojava-l@biojava.org>
        cc:     (bcc: Mark Schreiber/GP/Novartis)
        Subject:        [Biojava-l] Parsing an EMBL flatfile


Hi,
I'm new to BioJava and I'm trying to parse an EMBL flatfile. I am
especially interested in the information in the CDS section of the Feature
Table. I have looked at the examples and tutorials on the BioJava website
and tried using FeatureFilter.ByType("CDS"); however, that only gives me
the exons to join and not the information that follows, such as protein_id,
db_xref, the amino acid sequence, etc.
Instead, I have been trying to use the EmblLikeFormat class, EmblProcessor,
FeatureTableParser and EmblLikeLocationParser, but I can't really put it
together.
I would really appreciate some help, since I'm the only one around here that
is using Biojava.
Thanks!
/Anna

--------------------------------------------
Anna Henricson, MSc, PhD student
Center for Genomics and Bioinformatics (CGB)
Karolinska Institutet
S-171 77 Stockholm
Sweden
Phone: +46 (0)8 524 87296
Fax: +46 (0)8 337983


_______________________________________________
Biojava-l mailing list  -  Biojava-l@biojava.org
http://biojava.org/mailman/listinfo/biojava-l


From xingenzhu at yahoo.com.cn  Mon Nov 29 12:46:11 2004
From: xingenzhu at yahoo.com.cn (Xingen Zhu)
Date: Mon Dec  6 09:20:48 2004
Subject: [Biojava-l] Parse Genbank file
Message-ID: <20041129174611.19544.qmail@web50601.mail.yahoo.com>

Hi all,
 I am a new user of BioJava. I use the following Java program to parse a GenBank file:

import java.util.*;
import java.io.*;

import org.biojava.bio.*;
import org.biojava.bio.symbol.*;
import org.biojava.bio.seq.*;
import org.biojava.bio.seq.io.*;

public class biojava {

  public static void main(String[] args) {
    try {
      File genbankFile = new File("e:\\java\\gb.txt");
      BufferedReader gReader = new BufferedReader(new
          InputStreamReader(new FileInputStream(genbankFile)));
      GenbankFormat gFormat = new GenbankFormat();
      Alphabet alpha = DNATools.getDNA();
    } catch (Throwable t) {
      t.printStackTrace();
      System.exit(1);
    }
  }
}
 
 
This program compiles but does not run. The error message is
 java.lang.NoClassDefFoundError
 
If I delete the following line
Alphabet alpha = DNATools.getDNA();
 
it compiles and runs.
 
Any ideas?
 
Thanks a lot.
Michael
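
A NoClassDefFoundError on a program that compiles usually means that a class
which was visible to the compiler cannot be found when the program runs, for
example because a jar is missing from the run-time classpath. The post does
not show which class the error names, so that cause is an assumption here.
The sketch below is one way to check it: it prints the run-time classpath and
reports whether given classes can be located. ClasspathCheck is a made-up
helper name; org.biojava.bio.seq.DNATools is the class used in the program
above.

```java
final class ClasspathCheck {

    /** Returns true if the named class can be located by this class's loader. */
    static boolean isLoadable(String className) {
        try {
            // 'false' = locate the class without running its static initialisers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // NoClassDefFoundError is a LinkageError, so it is covered here too.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
        String[] names = args.length > 0
                ? args
                : new String[] {"org.biojava.bio.seq.DNATools"};
        for (String name : names) {
            System.out.println(name + " -> " + (isLoadable(name) ? "found" : "MISSING"));
        }
    }
}
```

Run it with the same classpath as the failing program. If DNATools is
reported as found, the full error message should name the class that is
really missing, and that name can be passed as an argument to check again
after the classpath has been changed.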

