From can.gencer at angiogenetics.se Mon Nov 1 09:49:39 2004 From: can.gencer at angiogenetics.se (Can Gencer) Date: Mon Nov 1 09:49:46 2004 Subject: [Biojava-l] Parsing a huge Blast File with Biojava Message-ID: <1099320579.7620.8.camel@slyfox.angiogenetics.se> Hello everyone, We are trying to parse a quite large multiple BLAST results file (around 4GB), and the computer available has 1GB of RAM. However, when the code in the cookbook is used ( "http://www.biojava.org/docs/bj_in_anger/BlastParser.htm"), using the BlastLikeSAXParser it will give out an OutOfMemory exception after a short while, and when I monitor the system during the parsing, I don't see the memory usage going up significantly. It is the parse(InputSource) method that throws the exception. Is there a way to solve this problem ? Thanks, Can From smh1008 at cus.cam.ac.uk Mon Nov 1 10:08:13 2004 From: smh1008 at cus.cam.ac.uk (David Huen) Date: Mon Nov 1 10:06:47 2004 Subject: [Biojava-l] Parsing a huge Blast File with Biojava In-Reply-To: <1099320579.7620.8.camel@slyfox.angiogenetics.se> References: <1099320579.7620.8.camel@slyfox.angiogenetics.se> Message-ID: <200411011508.13501.smh1008@cus.cam.ac.uk> On Monday 01 Nov 2004 14:49, Can Gencer wrote: > Hello everyone, > > We are trying to parse a quite large multiple BLAST results file (around > 4GB), and the computer available has 1GB of RAM. However, when the code > in the cookbook is used ( > "http://www.biojava.org/docs/bj_in_anger/BlastParser.htm"), using the > BlastLikeSAXParser it will give out an OutOfMemory exception after a > short while, and when I monitor the system during the parsing, I don't > see the memory usage going up significantly. It is the > parse(InputSource) method that throws the exception. Is there a way to > solve this problem ? > This is probably not the answer you want but I'm parsing BLAST files at least as large as yours without this problem using the BlastXMLParserFacade class. 
Perhaps it may be a temporary workaround until someone who understands the other parser responds, I certainly don't. There is also a alpha/beta-quality parser filter framework that could perhaps be used with the XML parser framework in CVS. Regards, David Huen P.S. A number of fixes have gone into the XML parsing for NCBI Blastn (the only part I use, the other parts may work too)software in CVS which may make it workable for you now. In particular, the irritating DTD related bug appears to be worked around. From thomas at derkholm.net Mon Nov 1 10:15:27 2004 From: thomas at derkholm.net (Thomas Down) Date: Mon Nov 1 10:14:47 2004 Subject: [Biojava-l] Parsing a huge Blast File with Biojava In-Reply-To: <1099320579.7620.8.camel@slyfox.angiogenetics.se> References: <1099320579.7620.8.camel@slyfox.angiogenetics.se> Message-ID: <20041101151527.GA27076@kalinda.derkholm.net> On Mon, Nov 01, 2004 at 03:49:39PM +0100, Can Gencer wrote: > Hello everyone, > > We are trying to parse a quite large multiple BLAST results file (around > 4GB), and the computer available has 1GB of RAM. However, when the code > in the cookbook is used ( > "http://www.biojava.org/docs/bj_in_anger/BlastParser.htm"), using the > BlastLikeSAXParser it will give out an OutOfMemory exception after a > short while, and when I monitor the system during the parsing, I don't > see the memory usage going up significantly. It is the > parse(InputSource) method that throws the exception. Is there a way to > solve this problem ? Hi, When you use the BioJava blast parser as described in the BJIA article, it does build a fairly comprehensive set of objects which reflect the contents of the blast output. If those objects turn out to be bigger than your available memory, then you'll either have to split up the output or process it in a "streaming" fashion. The BioJava blast parsers actually work by converting the blast output to XML, which is then presented to a SAX contenthandler. 
The normal strategy is to use a ContentHandler which builds objects, and this is what the BioJava BlastLikeSearchBuilder class is doing. However, there's nothing to stop you writing a custom ContentHandler which extracts the information you want directly from the XML representation. This strategy should let you process unlimited amounts of blast output without running into memory problems, but does involve a certain amount of work. If you want to see what the XML representation looks like, try the demos/nativeapps/BlastLike2XML.java script, included in the BioJava source distribution. However, since you say "I don't see the memory usage going up significantly", I'm wondering if your program is *really* exhausting system memory, or if you're just hitting the default limit on the Java heap size. On many platforms, the default heap size can be pretty low. You can control it using the -Xmx and -Xms options (try typing java -X for proper descriptions). On a 1Gb machine, I'd suggest trying something like: java -Xmx850M YourProgram This allows Java to use the bulk of system memory, while still leaving a bit left for the operating system, etc. Hope this helps, Thomas. From Peter.Ng at bccdc.ca Mon Nov 1 13:32:21 2004 From: Peter.Ng at bccdc.ca (Ng, Peter) Date: Mon Nov 1 13:30:49 2004 Subject: [Biojava-l] Navigating a Vector Message-ID: I'm trying to iterate through a database using a Vector and previous/next JButtons. How do I find the Vector index of the current record so I can navigate forward and back in the Vector? Thanks in advance! 
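A minimal sketch of one way to approach the question above, assuming the previous/next buttons share a small navigator object that remembers the current position. The class and method names are hypothetical, and the generics syntax is from later Java releases than the one in use when this thread was written.

```java
import java.util.Vector;

/** Hypothetical helper: remembers which record the previous/next buttons are on. */
final class RecordNavigator<T> {
    private final Vector<T> records;
    private int index = 0; // position of the current record in the Vector

    RecordNavigator(Vector<T> records) {
        if (records.isEmpty()) {
            throw new IllegalArgumentException("no records to navigate");
        }
        this.records = records;
    }

    int currentIndex() { return index; }

    T current() { return records.get(index); }

    /** Moves forward one record, staying on the last record at the end. */
    T next() {
        if (index < records.size() - 1) index++;
        return current();
    }

    /** Moves back one record, staying on the first record at the start. */
    T previous() {
        if (index > 0) index--;
        return current();
    }

    public static void main(String[] args) {
        Vector<String> v = new Vector<String>();
        v.add("rec0");
        v.add("rec1");
        v.add("rec2");
        RecordNavigator<String> nav = new RecordNavigator<String>(v);
        System.out.println(nav.next() + " at index " + nav.currentIndex());
    }
}
```

In a Swing UI, the ActionListener of each JButton would simply call nav.next() or nav.previous() and redisplay nav.current().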
--
Regards,
Peter Ng
Laboratory Information Management Coordinator
Laboratory Services
BC Centre for Disease Control
655 West 12th Avenue
Vancouver BC V5Z 4R4
Tel: 604-660-2058
Fax: 604-660-6073
Web: www.bccdc.org

From mark.schreiber at group.novartis.com Mon Nov 1 19:55:06 2004
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Mon Nov 1 19:53:40 2004
Subject: [Biojava-l] Navigating a Vector
Message-ID:

I'm not sure you can, especially because iterators on Vectors are not guaranteed to operate in any special order. If possible you should use an ArrayList or LinkedList. In this case you will be able to find the index or even ask for items by their index.

You can make a List or LinkedList out of a Vector as it is a Collection.

- Mark

"Ng, Peter"
Sent by: biojava-l-bounces@portal.open-bio.org
11/02/2004 02:32 AM

To:
cc: (bcc: Mark Schreiber/GP/Novartis)
Subject: [Biojava-l] Navigating a Vector

I'm trying to iterate through a database using a Vector and previous/next JButtons. How do I find the Vector index of the current record so I can navigate forward and back in the Vector? Thanks in advance!

--
Regards,
Peter Ng
Laboratory Information Management Coordinator
Laboratory Services
BC Centre for Disease Control
655 West 12th Avenue
Vancouver BC V5Z 4R4
Tel: 604-660-2058
Fax: 604-660-6073
Web: www.bccdc.org

_______________________________________________
Biojava-l mailing list - Biojava-l@biojava.org
http://biojava.org/mailman/listinfo/biojava-l

From rahul at genebrew.com Mon Nov 1 21:54:50 2004
From: rahul at genebrew.com (Rahul Karnik)
Date: Mon Nov 1 21:44:54 2004
Subject: [Biojava-l] Navigating a Vector
In-Reply-To:
References:
Message-ID: <4186F6FA.3050405@genebrew.com>

mark.schreiber@group.novartis.com wrote:
> I'm not sure you can, especially because iterators on Vectors are not
> guaranteed to operate in any special order. If possible you should use an
> ArrayList or LinkedList.
> In this case you will be able to find the index
> or even ask for items by their index.

While order is not guaranteed, you can actually loop over a Vector using a for loop and the Vector elementAt(int index) method. Besides, if you create a [Array|Linked]List from the Vector, you would get the same order. If you want to use an Iterator, Vector implements the iterator() method as well. The only difference between Vector and ArrayList is that Vector is synchronized (threadsafe) and ArrayList is not.

http://java.sun.com/j2se/1.4.2/docs/api/java/util/ArrayList.html
http://java.sun.com/j2se/1.4.2/docs/api/java/util/Vector.html

Thanks,
Rahul

From fpepin at cs.mcgill.ca Mon Nov 1 22:54:18 2004
From: fpepin at cs.mcgill.ca (Francois Pepin)
Date: Mon Nov 1 22:53:19 2004
Subject: [Biojava-l] Navigating a Vector
In-Reply-To:
References:
Message-ID: <1099367657.2942.290.camel@ybrig.MCB.McGill.CA>

Vector implements List, and List guarantees that the iterator goes through the elements in the right order. Getting a ListIterator lets you go back and forth, and indexOf(Object element) gives you the (first) index at which a given element is found.

Is there something I'm missing here?

Francois

On Mon, 2004-11-01 at 19:55, mark.schreiber@group.novartis.com wrote:
> I'm not sure you can, especially because iterators on Vectors are not
> guaranteed to operate in any special order. If possible you should use an
> ArrayList or LinkedList. In this case you will be able to find the index
> or even ask for items by their index.
>
> You can make a List or LinkedList out of a Vector as it is a Collection.
>
> - Mark
>
> "Ng, Peter"
> Sent by: biojava-l-bounces@portal.open-bio.org
> 11/02/2004 02:32 AM
>
> To:
> cc: (bcc: Mark Schreiber/GP/Novartis)
> Subject: [Biojava-l] Navigating a Vector
>
> I'm trying to iterate through a database using a Vector and
> previous/next JButtons. How do I find the Vector index of the current
> record so I can navigate forward and back in the Vector?
> Thanks in advance!

From luqiang at scbit.org Thu Nov 4 13:42:20 2004
From: luqiang at scbit.org (Lu Qiang)
Date: Thu Nov 4 13:41:37 2004
Subject: [Biojava-l] Parsing blast result with a lot of hit
Message-ID: <200411041841.iA4IfTKr024979@portal.open-bio.org>

Hi, Guys,

If we try to parse a blast result with a lot of hits, the machine crashes, for example when 5000 sequences are blasted against themselves.

This must be caused by an ArrayList storing all results.

How can we solve this problem?

regards,

Lu

From mark.schreiber at group.novartis.com Thu Nov 4 20:01:03 2004
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Thu Nov 4 19:59:35 2004
Subject: [Biojava-l] Parsing blast result with a lot of hit
Message-ID:

Hello Lu Qiang -

We get this question a lot. I have posted below a recent response (by Thomas Down) to the same question:

Hi,

When you use the BioJava blast parser as described in the BJIA article, it does build a fairly comprehensive set of objects which reflect the contents of the blast output. If those objects turn out to be bigger than your available memory, then you'll either have to split up the output or process it in a "streaming" fashion.

The BioJava blast parsers actually work by converting the blast output to XML, which is then presented to a SAX ContentHandler. The normal strategy is to use a ContentHandler which builds objects, and this is what the BioJava BlastLikeSearchBuilder class is doing. However, there's nothing to stop you writing a custom ContentHandler which extracts the information you want directly from the XML representation. This strategy should let you process unlimited amounts of blast output without running into memory problems, but does involve a certain amount of work. If you want to see what the XML representation looks like, try the demos/nativeapps/BlastLike2XML.java script, included in the BioJava source distribution.
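The streaming strategy described in the paragraph above can be sketched with the standard SAX API alone. This is only an illustration under assumptions: the element name "Hit" and the tiny XML string are made up for the demonstration (the real element names should be checked against the output of BlastLike2XML), and the handler is driven here by the JDK's own XML parser rather than by a BioJava parser. The point is that the handler reacts to each event and keeps nothing but a counter, so memory use does not grow with the size of the input.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Counts matching elements as they stream past instead of building objects. */
final class StreamingHitCounter extends DefaultHandler {
    private final String elementName; // hypothetical name, e.g. "Hit"
    private int count;

    StreamingHitCounter(String elementName) {
        this.elementName = elementName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // localName can be empty when namespace processing is off; fall back to qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if (name.equals(elementName)) {
            count++; // process the event here, then forget it
        }
    }

    int count() { return count; }

    /** Parses the XML text with the JDK SAX parser and returns the element count. */
    static int countElements(String xml, String elementName) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            StreamingHitCounter handler = new StreamingHitCounter(elementName);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.count();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Results><Hit id=\"a\"/><Hit id=\"b\"/><Other/></Results>";
        System.out.println(countElements(xml, "Hit"));
    }
}
```

With BioJava, a handler of this kind would take the place of the object-building adapter used in the cookbook example; that wiring is not shown here.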
However, since you say "I don't see the memory usage going up significantly", I'm wondering if your program is *really* exhausting system memory, or if you're just hitting the default limit on the Java heap size. On many platforms, the default heap size can be pretty low. You can control it using the -Xmx and -Xms options (try typing java -X for proper descriptions). On a 1Gb machine, I'd suggest trying something like: java -Xmx850M YourProgram This allows Java to use the bulk of system memory, while still leaving a bit left for the operating system, etc. Hope this helps, Thomas. Mark Schreiber Principal Scientist (Bioinformatics) Novartis Institute for Tropical Diseases (NITD) 10 Biopolis Road #05-01 Chromos Singapore 138670 www.nitd.novartis.com phone +65 6722 2973 fax +65 6722 2910 "Lu Qiang" Sent by: biojava-l-bounces@portal.open-bio.org 11/05/2004 02:42 AM To: "biojava-l@biojava.org" cc: (bcc: Mark Schreiber/GP/Novartis) Subject: [Biojava-l] Parsing blast result with a lot of hit Hi, Guys, If we are tyring to parse a blast result with a lot of hits, the machine will be crashed, for example 5000 sequences blast themselves. This must be caused by a ArrayList storing all results. How to solve this problem? regards, Lu _______________________________________________ Biojava-l mailing list - Biojava-l@biojava.org http://biojava.org/mailman/listinfo/biojava-l From phxgm at hotmail.com Thu Nov 4 21:06:49 2004 From: phxgm at hotmail.com (PhxGM Gim) Date: Thu Nov 4 21:05:43 2004 Subject: [Biojava-l] Parsing blast result with a lot of hit Message-ID: what is the exact msg you are recieving from the JVM when it aborts? I'm *assuming* it's the standard "Out of Memory Exception." You can increase the heap size allocated to the JVM upon startup of the java application by throwing a few switches to the jvm invocation. there are complete tutorials on how to set the heap sizes for the jvms on the sun site at java.sun.com. 
i have used these to some degree of success when scaling java apps and hope it is applicable to your situation. other than that you can certainly do something about having all those instances in memory at any one time, perhaps read them 'on demand' from storage. clearly you are going to have to solve the issue via additional resource allocations to the JVM or programmatically by reading data only as needed instead of loading all the data into memory. As I haven't encountered this particular issue in my development as of yet (with biojava) I do not know what constraints are imposed on developers ability to do this. Again, I'm going to assume you have a Blast XML output file, which theoretically should be handled by either the BlastLikeSAXParser or the BlastXMLParser. Taken from the biojava docs on the BlastLikeSAXParser - "The biojava Blast-like parsing framework is designed to uses minimal memory,so that in principle, extremely large native outputs can be parsed and XML ContentHandlers can listen only for small amounts of information." (http://www.biojava.org/docs/api/org/biojava/bio/program/sax/BlastLikeSAXParser.html.) you can use an 'event driven' SAX parser ContentHandlers to trigger events caused by the XML document you're parsing. Again, it claims to scale... whether it does or not is another issue. hope this has been of at least some help, jess vermont chicago >From: "Lu Qiang" >To: "biojava-l@biojava.org" >Subject: [Biojava-l] Parsing blast result with a lot of hit >Date: Thu, 4 Nov 2004 18:42:20 +0000 > >Hi, Guys, > >If we are tyring to parse a blast result with a lot of hits, the machine >will be crashed, for example 5000 sequences blast themselves. > >This must be caused by a ArrayList storing all results. > >How to solve this problem? 
> >regards, > >Lu > > >_______________________________________________ >Biojava-l mailing list - Biojava-l@biojava.org >http://biojava.org/mailman/listinfo/biojava-l _________________________________________________________________ Express yourself instantly with MSN Messenger! Download today - it's FREE! http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/ From rahul at genebrew.com Fri Nov 5 11:54:23 2004 From: rahul at genebrew.com (Rahul Karnik) Date: Fri Nov 5 06:43:31 2004 Subject: [Biojava-l] Parsing blast result with a lot of hit In-Reply-To: <200411041841.iA4IfTKr024979@portal.open-bio.org> References: <200411041841.iA4IfTKr024979@portal.open-bio.org> Message-ID: <418BB03F.3050903@genebrew.com> Lu Qiang wrote: > This must be caused by a ArrayList storing all results. You have diagnosed the problem perfectly. The BlastLikeSearchBuilder used in the BioJava in Anger example stores all the hits in an ArrayList, which means that if you are parsing a large BLAST results file, the whole of the file is effectively being stored in memory. The better approach is to print the results to your output as you encounter them. For this, you probably want to write your own implementation of the SearchContentHandler interface (using BlastLikeSearchBuilder as a guide) that outputs the results in the format you want, rather than storing them in a List. Then replace BlastLikeSearchBuilder with your own implementation. Note that it is probably easier to up the memory available to Java, so try that first if you haven't already. I would only recommend the approach described above if you are running up against hardware limitations. Thanks, Rahul From kvddrift at earthlink.net Tue Nov 9 19:17:35 2004 From: kvddrift at earthlink.net (Koen van der Drift) Date: Tue Nov 9 19:15:47 2004 Subject: [Biojava-l] biojava and Xcode Message-ID: Hi, I have been able to build biojava using Apple's Xcode 1.5. 
I also was able to make a separate small Xcode project and run some code that uses biojava. What I would like to be able to do is, is to debug my code including the code it uses from biojava. I can step through my own code, but as soon as the debugger steps into a biojava function, it treats that as a black box, and I cannot see what happens 'under the hood'. Is there any way to accomplish this, either with Xcode, or another OS X or X11 app? I understand that this is impossible when I only have a jar file, but now I also have all the source code from biojava. Maybe I need to create another target in the same project that contains the biojava source, but so far I have not been able to get this to work. thanks, - Koen. From heuermh at acm.org Tue Nov 9 19:30:34 2004 From: heuermh at acm.org (Michael Heuer) Date: Tue Nov 9 19:28:44 2004 Subject: [Biojava-l] biojava and Xcode In-Reply-To: Message-ID: Hello Koen, Eclipse (www.eclipse.org) has a pretty slick debugger, can run a build using ant, and runs well on MacOSX. You can include the biojava jars in your project and tell Eclipse where the biojava source is and it will step into the biojava functions where appropriate. I'd tell you exactly how to do this, but I can't stand using Eclipse or any other IDE for very long because of their MDI windowing interfaces (come on, I have a 1920x1200 desktop already!). I'm an emacs/vi and command-line maven and/or ant guy myself. michael On Tue, 9 Nov 2004, Koen van der Drift wrote: > Hi, > > I have been able to build biojava using Apple's Xcode 1.5. I also was > able to make a separate small Xcode project and run some code that > uses biojava. What I would like to be able to do is, is to debug my > code including the code it uses from biojava. I can step through my own > code, but as soon as the debugger steps into a biojava function, it > treats that as a black box, and I cannot see what happens 'under the > hood'. 
Is there any way to accomplish this, either with Xcode, or > another OS X or X11 app? I understand that this is impossible when I > only have a jar file, but now I also have all the source code from > biojava. > > Maybe I need to create another target in the same project that contains > the biojava source, but so far I have not been able to get this to > work. > > > thanks, > > - Koen. > > _______________________________________________ > Biojava-l mailing list - Biojava-l@biojava.org > http://biojava.org/mailman/listinfo/biojava-l > From fpepin at cs.mcgill.ca Tue Nov 9 23:00:06 2004 From: fpepin at cs.mcgill.ca (Francois Pepin) Date: Tue Nov 9 22:56:42 2004 Subject: [Biojava-l] biojava and Xcode In-Reply-To: References: Message-ID: <1100059206.2056.21.camel@faery> Hi Koen, I've never tried it, but would it be able to follow the biojava code if you had both the source and the class files in the jar? Another way would be to compile biojava without making the jar file and put the class files with yours. Then there would be no reason why the debugger can't follow it out. I never use a debugger so I'm not quite sure why it couldn't follow it. I'm mostly a fan of log4j (I guess the 1.4 logging system would work fine too) and of using bean shell to go step-by-step. Francois On Tue, 2004-11-09 at 19:17, Koen van der Drift wrote: > Hi, > > I have been able to build biojava using Apple's Xcode 1.5. I also was > able to make a separate small Xcode project and run some code that > uses biojava. What I would like to be able to do is, is to debug my > code including the code it uses from biojava. I can step through my own > code, but as soon as the debugger steps into a biojava function, it > treats that as a black box, and I cannot see what happens 'under the > hood'. Is there any way to accomplish this, either with Xcode, or > another OS X or X11 app? I understand that this is impossible when I > only have a jar file, but now I also have all the source code from > biojava. 
> > Maybe I need to create another target in the same project that contains > the biojava source, but so far I have not been able to get this to > work. > > > thanks, > > - Koen. > > _______________________________________________ > Biojava-l mailing list - Biojava-l@biojava.org > http://biojava.org/mailman/listinfo/biojava-l > From mark.schreiber at group.novartis.com Tue Nov 9 23:05:32 2004 From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com) Date: Tue Nov 9 23:03:52 2004 Subject: [Biojava-l] biojava and Xcode Message-ID: Just to add to this. One of the best ways to debug serious problems is to follow the stack trace. If your bug causing exceptions then you can find the class and line responsible in the stack trace. If the bug is more subtle then logging and or assertions are the preferable way to go. It's also a good discipline. - Mark Francois Pepin Sent by: biojava-l-bounces@portal.open-bio.org 11/10/2004 12:00 PM To: Koen van der Drift cc: biojava-list , (bcc: Mark Schreiber/GP/Novartis) Subject: Re: [Biojava-l] biojava and Xcode Hi Koen, I've never tried it, but would it be able to follow the biojava code if you had both the source and the class files in the jar? Another way would be to compile biojava without making the jar file and put the class files with yours. Then there would be no reason why the debugger can't follow it out. I never use a debugger so I'm not quite sure why it couldn't follow it. I'm mostly a fan of log4j (I guess the 1.4 logging system would work fine too) and of using bean shell to go step-by-step. Francois On Tue, 2004-11-09 at 19:17, Koen van der Drift wrote: > Hi, > > I have been able to build biojava using Apple's Xcode 1.5. I also was > able to make a separate small Xcode project and run some code that > uses biojava. What I would like to be able to do is, is to debug my > code including the code it uses from biojava. 
I can step through my own > code, but as soon as the debugger steps into a biojava function, it > treats that as a black box, and I cannot see what happens 'under the > hood'. Is there any way to accomplish this, either with Xcode, or > another OS X or X11 app? I understand that this is impossible when I > only have a jar file, but now I also have all the source code from > biojava. > > Maybe I need to create another target in the same project that contains > the biojava source, but so far I have not been able to get this to > work. > > > thanks, > > - Koen. > > _______________________________________________ > Biojava-l mailing list - Biojava-l@biojava.org > http://biojava.org/mailman/listinfo/biojava-l > _______________________________________________ Biojava-l mailing list - Biojava-l@biojava.org http://biojava.org/mailman/listinfo/biojava-l From td2 at sanger.ac.uk Wed Nov 10 03:32:42 2004 From: td2 at sanger.ac.uk (Thomas Down) Date: Wed Nov 10 03:30:56 2004 Subject: [Biojava-l] biojava and Xcode In-Reply-To: References: Message-ID: <16A91F5D-32F3-11D9-B425-000A95C8B056@sanger.ac.uk> On 10 Nov 2004, at 00:17, Koen van der Drift wrote: > Hi, > > I have been able to build biojava using Apple's Xcode 1.5. I also was > able to make a separate small Xcode project and run some code that > uses biojava. What I would like to be able to do is, is to debug my > code including the code it uses from biojava. I can step through my > own code, but as soon as the debugger steps into a biojava function, > it treats that as a black box, and I cannot see what happens 'under > the hood'. Is there any way to accomplish this, either with Xcode, or > another OS X or X11 app? I understand that this is impossible when I > only have a jar file, but now I also have all the source code from > biojava. > > Maybe I need to create another target in the same project that > contains the biojava source, but so far I have not been able to get > this to work. 
I'm afraid I'm yet another guy who tends to use stacktraces and logging rather than diving in with a debugger... but just to check... you do have BioJava built with "Generate debugging symbols" set in the Java Compiler settings panel?

If that doesn't help, I agree that adding BioJava to the same project is probably the next logical step. Why isn't that working?

Thomas

From kvddrift at earthlink.net Wed Nov 10 04:35:38 2004
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Wed Nov 10 04:34:07 2004
Subject: [Biojava-l] biojava and Xcode
In-Reply-To: <16A91F5D-32F3-11D9-B425-000A95C8B056@sanger.ac.uk>
References: <16A91F5D-32F3-11D9-B425-000A95C8B056@sanger.ac.uk>
Message-ID:

On Nov 10, 2004, at 3:32 AM, Thomas Down wrote:

> If that doesn't help, I agree that adding BioJava to the same project
> is probably the next logical step. Why isn't that working?
>

So far I was treating biojava and my own code as 2 different targets in the same project. I will try to make just one target and post here if it worked. Thanks all for the comments,

- Koen.

From kvddrift at earthlink.net Thu Nov 11 17:21:25 2004
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Thu Nov 11 17:19:36 2004
Subject: [Biojava-l] opening unknown fasta file
Message-ID: <06409904-3430-11D9-9447-003065A5FDCC@earthlink.net>

Hi,

The BioJava tutorial (in anger) suggests the following code to open a fasta file:

[snip]

// get the appropriate Alphabet
Alphabet alpha = AlphabetManager.alphabetForName(args[1]);

// get a SequenceDB of all sequences in the file
SequenceDB db = SeqIOTools.readFasta(is, alpha);

But what should I do when I don't know if the fasta file contains a protein or DNA sequence?

thanks,

- Koen.
From mark.schreiber at group.novartis.com Thu Nov 11 21:01:13 2004
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Thu Nov 11 20:59:26 2004
Subject: [Biojava-l] opening unknown fasta file
Message-ID:

Hi Koen -

There is a method in SeqIOTools that can (mostly) guess the alphabet of a file, but it is deprecated because there is no standard convention for file naming. ClustalW guesses by pre-reading the file and looking for symbols that are found in protein but don't occur in DNA. They claim its accuracy at guessing is in the high 90s, but I'm not sure how they calculate that number.

Basically there is absolutely no failsafe way to know if a fasta file is DNA or Protein (or RNA). It's perfectly reasonable to have a short peptide which contains only A, C, G and T, although it becomes very unlikely with longer sequences.

If you have control over the files you could adopt some naming specification (I use .fna for fasta DNA or .faa for fasta amino acid). An alternative is to allow the specification of format and alphabet in the arguments to the program.

- Mark

Koen van der Drift
Sent by: biojava-l-bounces@portal.open-bio.org
11/12/2004 06:21 AM

To: biojava-list
cc: (bcc: Mark Schreiber/GP/Novartis)
Subject: [Biojava-l] opening unknown fasta file

Hi,

The BioJava tutorial (in anger) suggests the following code to open a fasta file:

[snip]

// get the appropriate Alphabet
Alphabet alpha = AlphabetManager.alphabetForName(args[1]);

// get a SequenceDB of all sequences in the file
SequenceDB db = SeqIOTools.readFasta(is, alpha);

But what should I do when I don't know if the fasta file contains a protein or DNA sequence?

thanks,

- Koen.
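The naming convention and the command-line override suggested in the reply above can be sketched as follows. The class and method names are hypothetical, and the strings "DNA" and "PROTEIN" are assumed to be the names that would be handed to AlphabetManager.alphabetForName; the sketch itself makes no BioJava calls.

```java
import java.util.Locale;

/** Hypothetical helper: picks an alphabet name from an argument or a file extension. */
final class FastaAlphabetChooser {

    /** Returns "DNA" for .fna, "PROTEIN" for .faa, or null if the extension says nothing. */
    static String alphabetNameFor(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".fna")) return "DNA";
        if (lower.endsWith(".faa")) return "PROTEIN";
        return null;
    }

    /** An alphabet given explicitly by the user always wins over the extension. */
    static String resolve(String fileName, String explicitAlphabet) {
        if (explicitAlphabet != null && !explicitAlphabet.isEmpty()) {
            return explicitAlphabet;
        }
        String guessed = alphabetNameFor(fileName);
        if (guessed == null) {
            throw new IllegalArgumentException(
                "cannot tell the alphabet of " + fileName + "; please specify it");
        }
        return guessed;
    }

    public static void main(String[] args) {
        System.out.println(resolve("genome.fna", null));
        System.out.println(resolve("mystery.fasta", "PROTEIN"));
    }
}
```

Failing loudly when neither source of information is available avoids silently reading a protein file with a DNA alphabet.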
_______________________________________________ Biojava-l mailing list - Biojava-l@biojava.org http://biojava.org/mailman/listinfo/biojava-l From m.fortner at sbcglobal.net Thu Nov 11 23:06:59 2004 From: m.fortner at sbcglobal.net (Mark A Fortner) Date: Thu Nov 11 23:05:15 2004 Subject: [Biojava-l] opening unknown fasta file In-Reply-To: <06409904-3430-11D9-9447-003065A5FDCC@earthlink.net> Message-ID: <20041112040659.10085.qmail@web80303.mail.yahoo.com> Koen, One thing you might try is to parse the file, grab the accession from the first line, and use regular expressions to identify the type of sequence. Hope this helps, Mark Fortner --- Koen van der Drift wrote: > Hi, > > The BioJava tutorial (in anger) suggests the > following code to open a > fasta file: > > [snip] > > // get the appropriate Alphabet > Alphabet alpha = > AlphabetManager.alphabetForName(args[1]); > > // get a SequenceDB of all sequences in the file > SequenceDB db = SeqIOTools.readFasta(is, alpha); > > > But what should I do when I don't know if the fasta > file contains a > protein or dna sequence? > > > thanks, > > - Koen. > > _______________________________________________ > Biojava-l mailing list - Biojava-l@biojava.org > http://biojava.org/mailman/listinfo/biojava-l > From thomas at derkholm.net Fri Nov 12 11:26:05 2004 From: thomas at derkholm.net (Thomas Down) Date: Fri Nov 12 11:24:28 2004 Subject: [Biojava-l] opening unknown fasta file In-Reply-To: References: Message-ID: <20041112162605.GA18883@kalinda.derkholm.net> On Fri, Nov 12, 2004 at 10:01:13AM +0800, mark.schreiber@group.novartis.com wrote: > > Bascially there is absolutely no failsafe way to know if a fasta file is > DNA or Protein (or RNA). It's perfectly reasonable to have a short peptide > which contains only acg and t although it becomes very unlikely with > longer sequences. The real problem isn't A, C, G, or T, but the other 11 ambiguity symbols that appear in DNA sequences. 
Ns are everywhere, but many of the other ambiguities appear from time to time, too.

If we were *really* serious about alphabet-guessing (which scares me, to be honest), one option would be to calculate histograms of character frequencies in EMBL and Swissprot, and look for the closest match. I believe that Internet Explorer takes this approach when it hits a web page without an explicitly-specified character encoding -- it apparently works pretty well...

Does anyone feel this serious?

Thomas.

From jvermont at hotmail.com Sat Nov 13 00:11:29 2004
From: jvermont at hotmail.com (j vermont)
Date: Sat Nov 13 00:10:58 2004
Subject: [Biojava-l] opening unknown fasta file
Message-ID:

IMO this should be addressed from a design standpoint of the APIs themselves. If you are *aware* of the nature of the file you're dealing with, the APIs should support the ability to differentiate them programmatically, either via a Factory design pattern or through subclassing. It would be far more efficient to solve this with a general solution at the level of architecture and design than to design a 'parsing' or algorithm-based solution which will be specific only (I'm guessing) on a case by case basis. Not to mention the legit observation someone made about 'alphabet guessing.'

Obviously take my input for what it's worth; I'm a programmer by trade with an interest in genetics, so I lean towards (and understand better) the comp science aspects of these discussions. I hope my humble suggestions are at least somewhat helpful. Based on my understanding of what is being discussed in this thread, however, you should be able to programmatically (not algorithmically) solve this particular scenario. I could look at it further (an API/design based or pattern based solution) when I get a chance, if anyone thinks it worthwhile.

just my thoughts,

jess vermont
chicago

Universes of virtually unlimited complexity can be created in the form of computer programs. (Joseph Weizenbaum)
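The composition-based guessing discussed in this thread (pre-reading the sequence and looking for symbols that occur in protein but not in DNA) can be sketched roughly as below. This is a crude heuristic under stated assumptions, not what ClustalW or BioJava actually do: any letter outside the 16 IUPAC nucleotide codes is taken as evidence of protein, and the 0.9 cut-off for the share of A, C, G, T, U and N is an arbitrary choice. As noted above, a short peptide made only of A, C, G and T will still be misclassified.

```java
/** Hypothetical heuristic: guesses the alphabet of a sequence from its letters. */
final class AlphabetGuesser {
    enum Guess { NUCLEOTIDE, PROTEIN, UNKNOWN }

    // A, C, G, T, U plus the 11 IUPAC ambiguity codes.
    private static final String NUCLEOTIDE_CODES = "ACGTURYSWKMBDHVN";
    private static final String CORE_CODES = "ACGTUN";
    private static final double CORE_FRACTION = 0.9; // arbitrary threshold

    static Guess guess(String residues) {
        int letters = 0;
        int core = 0;
        for (int i = 0; i < residues.length(); i++) {
            char c = Character.toUpperCase(residues.charAt(i));
            if (c < 'A' || c > 'Z') continue; // skip gaps, digits, whitespace
            letters++;
            if (NUCLEOTIDE_CODES.indexOf(c) < 0) {
                return Guess.PROTEIN; // e.g. E, F, I, L, P, Q never occur in DNA
            }
            if (CORE_CODES.indexOf(c) >= 0) core++;
        }
        if (letters == 0) return Guess.UNKNOWN;
        // Only ambiguity codes and few real bases: refuse to guess.
        return core >= CORE_FRACTION * letters ? Guess.NUCLEOTIDE : Guess.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(guess("ACGTNNACGTAC"));
        System.out.println(guess("MKVLAAGIVG"));
    }
}
```

Returning UNKNOWN rather than forcing an answer leaves room for the caller to fall back on an explicit alphabet argument.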
(Joseph Weizenbaum)

>From: Thomas Down
>To: mark.schreiber@group.novartis.com
>CC: biojava-list
>Subject: Re: [Biojava-l] opening unknown fasta file
>Date: Fri, 12 Nov 2004 16:26:05 +0000
>
>On Fri, Nov 12, 2004 at 10:01:13AM +0800, mark.schreiber@group.novartis.com wrote:
> >
> > Bascially there is absolutely no failsafe way to know if a fasta file is
> > DNA or Protein (or RNA). It's perfectly reasonable to have a short peptide
> > which contains only acg and t although it becomes very unlikely with
> > longer sequences.
>
>The real problem isn't A, C, G, or T, but the other 11 ambiguity symbols
>that appear in DNA sequences. Ns are everywhere, but many of the other
>ambiguities appear from time to time, too.
>
>If we were *really* serious about alphabet-guessing (which scares me, to be
>honest), one option would be to calculate histograms of character frequencies
>in EMBL and Swissprot, and look for the closest match. I believe that
>Internet Explorer takes this approach when it hits a web page without an
>explicitly-specified character encoding -- it apparently works pretty
>well...
>
>Does anyone feel this serious?
>
>    Thomas.

From kvddrift at earthlink.net Sat Nov 13 14:47:57 2004
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Sat Nov 13 14:46:13 2004
Subject: [Biojava-l] biojava and Xcode
In-Reply-To:
References: <16A91F5D-32F3-11D9-B425-000A95C8B056@sanger.ac.uk>
Message-ID:

On Nov 10, 2004, at 4:35 AM, Koen van der Drift wrote:

> So far I was treating biojava and my own code as 2 different targets
> in the same project. I will try to make just one target and post here
> if it worked.
> Thanks all for the comments,
>

To follow up on this, it's working now. The trick is to create an "Ant-based Application Jar" project in Xcode (1.5), and copy all the code from the src directory in biojava-1.4pre1 plus my own code into the project. I did have to comment out a couple of lines that start with assert to compile successfully, for instance:

/Users/koen/Desktop/biojavatest2/src/org/biojava/bio/symbol/SimpleGappedSymbolList.java:408: warning: as of release 1.4, assert is a keyword, and may not be used as an identifier
    assert isSane() : "Data corrupted: " + blocks;
    ^
/Users/koen/Desktop/biojavatest2/src/org/biojava/bio/symbol/SimpleGappedSymbolList.java:408: ';' expected
    assert isSane() : "Data corrupted: " + blocks;

Regarding the suggestions to use the stack trace, I have a C/C++ and GUI background, so I prefer to visually step through the code to see the flow and the values of each variable.

cheers,

- Koen.

From td2 at sanger.ac.uk Sat Nov 13 15:13:39 2004
From: td2 at sanger.ac.uk (Thomas Down)
Date: Sat Nov 13 15:13:26 2004
Subject: [Biojava-l] biojava and Xcode
In-Reply-To:
References: <16A91F5D-32F3-11D9-B425-000A95C8B056@sanger.ac.uk>
Message-ID: <823016E2-35B0-11D9-A4CC-000A95C8B056@sanger.ac.uk>

On 13 Nov 2004, at 19:47, Koen van der Drift wrote:

>
> On Nov 10, 2004, at 4:35 AM, Koen van der Drift wrote:
>
>> So far I was treating biojava and my own code as 2 different targets
>> in the same project. I will try to make just one target and post here
>> if it worked. Thanks all for the comments,
>>
>
> To follow up on this, it's working now. The trick is to create an
> "Ant-based Application Jar" project in Xcode (1.5), and copy all the
> code from the src directory in biojava-1.4pre1 plus my own code into
> the project. I did have to comment out a couple of lines that start
> with assert to compile successfully, for instance

Assert is a Java 1.4 language feature.
Could you try opening the Target settings, looking at the "Java Compiler Settings" panel, and checking that "Source Version" is set to 1.4? The default seems to be "Unspecified". I've not tried building BioJava in Xcode (I'm an Eclipse user myself), but this seems like the most likely problem.

    Thomas.

From kvddrift at earthlink.net Sat Nov 13 16:44:02 2004
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Sat Nov 13 16:42:10 2004
Subject: [Biojava-l] biojava and Xcode
In-Reply-To: <823016E2-35B0-11D9-A4CC-000A95C8B056@sanger.ac.uk>
References: <16A91F5D-32F3-11D9-B425-000A95C8B056@sanger.ac.uk> <823016E2-35B0-11D9-A4CC-000A95C8B056@sanger.ac.uk>
Message-ID: <229E521B-35BD-11D9-B29B-003065A5FDCC@earthlink.net>

On Nov 13, 2004, at 3:13 PM, Thomas Down wrote:

>>
>> To follow up on this, it's working now. The trick is to create an
>> "Ant-based Application Jar" project in Xcode (1.5), and copy all the
>> code from the src directory in biojava-1.4pre1 plus my own code into
>> the project. I did have to comment out a couple of lines that start
>> with assert to compile successfully, for instance
>
> Assert is a Java 1.4 language feature. could you try opening the
> Target settings, looking at the "Java Compiler Settings" panel, and
> checking that "Source Version" is set to 1.4. Default seems to be
> "Unspecified". I've not tried building BioJava in Xcode (I'm an
> eclipse user myself), but this seems like the most likely problem.
>

That setting is unavailable when I use an "Ant-based Application Jar" project. When I switch to a "Java Tool" project, I do see that setting, but changing it to 1.4 doesn't solve the problem. However, I don't think that it is that big of a problem, so I will just leave the few instances of assert commented out.

thanks,

- Koen.
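
For readers following the assert discussion above, here is a small Java sketch of the two ways such a sanity check can be written. As far as I know, a JDK 1.4 javac defaults to source level 1.3, so `assert` is only accepted as a keyword when the compiler is invoked with `-source 1.4`, and the check only runs when the JVM is started with `-ea`. The class `BlockListSketch` and its fields are hypothetical stand-ins invented for illustration; they are not BioJava's SimpleGappedSymbolList.

```java
/** Hypothetical stand-in for a class that guards its internal state. */
class BlockListSketch {
    private final int[] blockStarts;

    BlockListSketch(int[] blockStarts) {
        this.blockStarts = blockStarts;
    }

    /** Returns true when the block starts are in strictly ascending order. */
    boolean isSane() {
        for (int i = 1; i < blockStarts.length; i++) {
            if (blockStarts[i] <= blockStarts[i - 1]) {
                return false;
            }
        }
        return true;
    }

    /** Java 1.4 form: needs source level 1.4 or later, and only runs under "java -ea". */
    void checkWithAssert() {
        assert isSane() : "Data corrupted";
    }

    /** Portable form: compiles at any source level and always runs. */
    void checkExplicitly() {
        if (!isSane()) {
            throw new IllegalStateException("Data corrupted");
        }
    }
}
```

If the source level really cannot be changed in a given project type, rewriting the check in the explicit form is one alternative to commenting it out, at the cost of always paying for the check at run time.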
From ml-it-biojava at epigenomics.com Tue Nov 16 05:09:43 2004
From: ml-it-biojava at epigenomics.com (Dirk Habighorst)
Date: Tue Nov 16 05:08:26 2004
Subject: [Biojava-l] TestDAS problem
Message-ID:

Hi,

running the TestDAS example (biojava-live) causes the following exception:

Exception in thread "main" org.biojava.bio.BioRuntimeException: org.biojava.bio.BioException: DAS error (status code = 401) connecting to http://servlet.sanger.ac.uk:8080/das/ with query http://servlet.sanger.ac.uk:8080/das/entry_points
    at org.biojava.bio.program.das.DASSequenceDB.ids(DASSequenceDB.java:286)
    at das.TestDAS.main(TestDAS.java:25)
Caused by: org.biojava.bio.BioException: DAS error (status code = 401) connecting to http://servlet.sanger.ac.uk:8080/das/ with query http://servlet.sanger.ac.uk:8080/das/entry_points
    at org.biojava.bio.program.das.DASSequenceDB.ids(DASSequenceDB.java:261)
    ... 1 more

I have tried several other DAS servers with the same result. By the way, are the sources for Matthew Pocock's BioJava DAS client available anywhere?

thanks,
dirk

From thomas at derkholm.net Tue Nov 16 05:25:08 2004
From: thomas at derkholm.net (Thomas Down)
Date: Tue Nov 16 05:23:15 2004
Subject: [Biojava-l] TestDAS problem
In-Reply-To:
References:
Message-ID: <20041116102508.GB27270@kalinda.derkholm.net>

On Tue, Nov 16, 2004 at 11:09:43AM +0100, Dirk Habighorst wrote:
> Hi,
>
> running the TestDAS example (biojava-live) causes the following exception:
>
> Exception in thread "main" org.biojava.bio.BioRuntimeException:
> org.biojava.bio.BioException: DAS error (status code = 401) connecting to
> http://servlet.sanger.ac.uk:8080/das/ with query

That test script is meant to be pointed at an individual DAS data source, not the root of a DAS server (which can potentially be serving up many data sources).
Try something like:

    http://servlet.sanger.ac.uk:8080/das/homo_sapiens_core_25_34e/

(You can get a complete list of what's on offer by looking at http://servlet.sanger.ac.uk:8080/das/ in a web browser).

> I have tried several other das servers with the same result. By the way are
> the sources for Matthew Pococks biojava das client available anywhere?

There's a version in CVS:

    http://cvs.open-bio.org/cgi-bin/viewcvs/viewcvs.cgi/das-client/?cvsroot=biojava

I think this is up to date, but as you can see it's not had much development recently.

    Thomas.

From jdiggans at excelsiortech.com Sat Nov 20 23:12:49 2004
From: jdiggans at excelsiortech.com (James Diggans)
Date: Sat Nov 20 23:00:51 2004
Subject: [Biojava-l] Errors to STDOUT in BioJava?
Message-ID: <41A015C1.5030801@excelsiortech.com>

A design-related question: In GenbankSequenceDB's getSequence() method, the code that catches the Exception thrown when a bad accession is used to search (returning nothing from Genbank) prints an error to STDOUT rather than passing the Exception up the chain like a good little Java method should:

    ...
    } catch (Exception e) {
        System.out.println("Exception found in GenbankSequenceDB -- getSequence");
        System.out.println(e.toString());
        ExceptionFound = true;
        IOExceptionFound = true;
        return null;
    }

Is there a reason behind this? It results in an application that prints to STDOUT regardless of my wishes and also limits my ability to catch the Exception myself higher up in the stack to deal with it in an application-specific way. Just curious ... thanks.

-j

From jvermont at hotmail.com Sun Nov 21 03:13:07 2004
From: jvermont at hotmail.com (j vermont)
Date: Sun Nov 21 03:12:02 2004
Subject: [Biojava-l] Errors to STDOUT in BioJava?
In-Reply-To: <41A015C1.5030801@excelsiortech.com>
Message-ID:

hello all,

I asked JD if a proper solution to this would be to rethrow the exception, like this:

    } catch (Exception e) {
        System.out.println("Exception found in GenbankSequenceDB -- getSequence");
        System.out.println(e.toString());
        ExceptionFound = true;
        IOExceptionFound = true;
        // wrap the failure in an unchecked exception and throw it here so it
        // gets passed back up the stack to the calling method
        // ("return null;" is dropped: it would be unreachable after the throw)
        throw new RuntimeException("bad accession error", e);
    }

The compiler won't complain as long as you're throwing an unchecked exception. Or you could perhaps put a check for the boolean ExceptionFound in a finally clause and throw an exception from there if ExceptionFound == true, such as

    } catch (Exception e) {
        System.out.println("Exception found in GenbankSequenceDB -- getSequence");
        System.out.println(e.toString());
        ExceptionFound = true;
        IOExceptionFound = true;
        return null;
    }
    // always executed unless System.exit() is called
    finally {
        // check the state of the boolean ExceptionFound here
        if (ExceptionFound) {
            // error occurred, throw an exception that will be handled
            // further up the stack
            throw new IllegalAccessionException();
        }
    }

just some thoughts.

thanks for your time,

Jess Vermont
Chicago, Il.

Universes of virtually unlimited complexity can be created in the form of computer programs. (Joseph Weizenbaum)

>From: James Diggans
>To: biojava-l@biojava.org
>Subject: [Biojava-l] Errors to STDOUT in BioJava?
>Date: Sat, 20 Nov 2004 23:12:49 -0500
>
>
>A design-related question: In GenbankSequenceDB's getSequence() method, the
>exception-handling code when catching an Exception thrown when a bad
>accession is used to search (returning nothing from Genbank) prints an
>error to STDOUT rather than passing the Exception up the chain like a good
>little Java method should:
>
>...
> } catch (Exception e) {
>   System.out.println("Exception found in GenbankSequenceDB -- getSequence");
>   System.out.println(e.toString());
>   ExceptionFound = true;
>   IOExceptionFound = true;
>   return null;
> }
>
>Is there a reason behind this? It results in an application that prints to
>STDOUT regardless of my wishes and also limits my ability to catch the
>Exception myself higher up in the stack to deal with it in an
>application-specific way. Just curious ... thanks.
>-j

From jdiggans at excelsiortech.com Sun Nov 21 03:52:08 2004
From: jdiggans at excelsiortech.com (James Diggans)
Date: Sun Nov 21 03:43:54 2004
Subject: [Biojava-l] Errors to STDOUT in BioJava?
In-Reply-To:
References:
Message-ID: <41A05738.5070005@excelsiortech.com>

Certainly. I was asking the list to inquire whether this was an intentional design choice (i.e. making the class noisy regardless of calling class) or whether anyone would be averse to fixing it to simply be:

>> } catch (Exception e) {
>>   ExceptionFound = true;
>>   IOExceptionFound = true;
>>   return null;
>> }

or, my preference, to let the exception (in a more specific embodiment, say, InvalidIDException) percolate up the stack to the caller rather than being dealt with at this low level.
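
A minimal Java sketch of the "let it percolate" option described above. `ToySequenceDB` and this `InvalidIDException` are hypothetical stand-ins invented for illustration; they are not the real GenbankSequenceDB or any existing BioJava class, and the lookup is an in-memory map rather than a Genbank query.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical checked exception for an ID the database does not know. */
class InvalidIDException extends Exception {
    InvalidIDException(String message) {
        super(message);
    }
}

/** Toy in-memory sequence database; an assumption for this sketch only. */
class ToySequenceDB {
    private final Map<String, String> sequences = new HashMap<String, String>();

    ToySequenceDB() {
        sequences.put("X00001", "ACGTACGT");
    }

    /** Returns the sequence, or throws instead of printing to STDOUT and returning null. */
    String getSequence(String accession) throws InvalidIDException {
        String seq = sequences.get(accession);
        if (seq == null) {
            throw new InvalidIDException("No sequence found for accession " + accession);
        }
        return seq;
    }
}

class PropagationDemo {
    public static void main(String[] args) {
        ToySequenceDB db = new ToySequenceDB();
        try {
            System.out.println(db.getSequence("BOGUS"));
        } catch (InvalidIDException e) {
            // The caller, not the library, decides how to report the failure.
            System.err.println("Lookup failed: " + e.getMessage());
        }
    }
}
```

The point of the design is that the library method stays silent and the calling application chooses whether to log, retry, or show a dialog.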
-j

j vermont wrote:

> hello all,
>
> I asked JD if a proper solution to this would be to rethrow Exception as
> such:
>
> } catch (Exception e) {
>
>> System.out.println("Exception found in GenbankSequenceDB --
>> getSequence");
>> System.out.println(e.toString());
>> ExceptionFound = true;
>> IOExceptionFound = true;
>
> //create instance of Exception and throw it here so it gets
> passed back up the stack to
> // the calling method....
> Exception myException = new Exception("bad accession error");
> throw myException;
>
>> return null;
>> }
>
>
> the compiler won't complain if you're not throwing a checked exception.
> Or you could perhaps put a check for the boolean ExceptionFound in a
> finally clause and throw and exception from there if ExceptionFound ==
> true; such as
>
> } catch (Exception e) {
> System.out.println("Exception found in GenbankSequenceDB --
> getSequence");
> System.out.println(e.toString());
> ExceptionFound = true;
> IOExceptionFound = true;
> return null;
> }
> //always executed unless system.exit() is called;
> finally
> {
> // check for state of boolean ExceptionFound here
> if(ExceptionFound)
> {
> //error occured, throw an exception that will be handled further up
> //the stack
> throw new IlllegalAccessionException();
> }
> }
>
>
>
> just some thoughts.
>
> thanks for your time,
>
> Jess Vermont
> Chicago, Il.
>
> Universes of virtually unlimited complexity can be created in the form
> of computer programs. (Joseph Weizenbaum)
>
>

From td2 at sanger.ac.uk Sun Nov 21 06:18:59 2004
From: td2 at sanger.ac.uk (Thomas Down)
Date: Sun Nov 21 06:17:08 2004
Subject: [Biojava-l] Errors to STDOUT in BioJava?
In-Reply-To: <41A05738.5070005@excelsiortech.com>
References: <41A05738.5070005@excelsiortech.com>
Message-ID: <24347E20-3BAF-11D9-AE6B-000A95C8B056@sanger.ac.uk>

On 21 Nov 2004, at 08:52, James Diggans wrote:

>
> Certainly. I was asking the list to inquire whether this was an
> intentional design choice (i.e.
> making the class noisy regardless of
> calling class) or whether anyone would be averse to fixing it to just
> simply be:
>
> >> } catch (Exception e) {
> >> ExceptionFound = true;
> >> IOExceptionFound = true;
> >> return null;
> >> }
>
> or, my preference, to let the exception (in a more specific
> enbodiment, say, InvalidIDException) percolate up the stack to the
> caller rather than being dealt with at this low level.

Hi James,

I'd certainly agree that this should be throwing an exception rather than returning null. If this class were to implement the standard SequenceDB interface (incidentally, does anyone know why it doesn't?), then the getSequence method is allowed to throw IllegalIDException or BioException -- the `ideal' behaviour, where possible, is to throw IllegalIDException for the specific case of a non-existent ID being requested, and BioException if the ID is valid but there's some other error getting at the data. I don't know how easy the Genbank protocol makes it to distinguish the two cases.

I'll fix this at some point in the next few days, or I'd be happy to apply a patch if you've sorted this out yourself.

    Thomas.

From len at reeltwo.com Mon Nov 1 15:13:31 2004
From: len at reeltwo.com (Len Trigg)
Date: Sun Nov 21 16:10:57 2004
Subject: [Biojava-l] BioSQL
In-Reply-To:
References:
Message-ID:

Mark Schreiber wrote:
> Does anyone have a current BioSQL schema that matches the BioJava
> bindings? Preferably one for Oracle.

AFAIK, current BioSQL CVS matches what BioJava expects -- Hilmar recently added the extra table that BioJava uses. I've also attached the Oracle schema that I've used (it's a bit simpler than the full BioSQL Oracle schema).

Cheers,
Len.

-------------- next part --------------
-- conventions:
-- _id is primary internal id (usually autogenerated)
-- Authors: Ewan Birney, Elia Stupka
-- Contributors: Hilmar Lapp, Aaron Mackey
--
-- Copyright Ewan Birney. You may use, modify, and distribute this code under
-- the same terms as Perl.
-- See the Perl Artistic License.
--
-- comments to biosql - biosql-l@open-bio.org
--
-- Migration of the MySQL schema to InnoDB by Hilmar Lapp
-- Post-Cape Town changes by Hilmar Lapp.
-- Singapore changes by Hilmar Lapp and Aaron Mackey.
--

CREATE SEQUENCE biodatabase_pk_seq INCREMENT BY 1 START WITH 1 NOMAXVALUE NOMINVALUE NOCYCLE NOORDER;
CREATE SEQUENCE taxon_pk_seq INCREMENT BY 1 START WITH 1 NOMAXVALUE NOMINVALUE NOCYCLE NOORDER;
CREATE SEQUENCE ontology_pk_seq INCREMENT BY 1 START WITH 1 NOMAXVALUE NOMINVALUE NOCYCLE NOORDER;
CREATE SEQUENCE term_pk_seq INCREMENT BY 1 START WITH 1 NOMAXVALUE NOMINVALUE NOCYCLE NOORDER;
CREATE SEQUENCE term_relationship_pk_seq INCREMENT BY 1 START WITH 1 NOMAXVALUE NOMINVALUE NOCYCLE NOORDER;
CREATE SEQUENCE term_path_pk_seq INCREMENT BY 1 START WITH 1 NOMAXVALUE NOMINVALUE NOCYCLE NOORDER;
CREATE SEQUENCE bioentry_pk_seq INCREMENT BY 1 START WITH 1 NOMAXVALUE NOMINVALUE NOCYCLE NOORDER;
CREATE SEQUENCE bioentry_relationship_pk_seq INCREMENT BY 1 START WITH 1 NOMAXVALUE NOMINVALUE NOCYCLE NOORDER;
CREATE SEQUENCE dbxref_pk_seq INCREMENT BY 1 START WITH 1 NOMAXVALUE NOMINVALUE NOCYCLE NOORDER;
CREATE SEQUENCE reference_pk_seq INCREMENT BY 1 START WITH 1 NOMAXVALUE NOMINVALUE NOCYCLE NOORDER;
CREATE SEQUENCE anncomment_pk_seq INCREMENT BY 1 START WITH 1 NOMAXVALUE NOMINVALUE NOCYCLE NOORDER;
CREATE SEQUENCE seqfeature_pk_seq INCREMENT BY 1 START WITH 1 NOMAXVALUE NOMINVALUE NOCYCLE NOORDER;
CREATE SEQUENCE seqfeature_relationship_pk_seq INCREMENT BY 1 START WITH 1 NOMAXVALUE NOMINVALUE NOCYCLE NOORDER;
CREATE SEQUENCE location_pk_seq INCREMENT BY 1 START WITH 1 NOMAXVALUE NOMINVALUE NOCYCLE NOORDER;

-- database have bioentries. That is about it.
-- we do not store different versions of a database as different dbids
-- (there is no concept of versions of database). There is a concept of
-- versions of entries.
-- Versions of databases deserve their own table and
-- join to bioentry table for tracking with versions of entries

CREATE TABLE biodatabase (
    biodatabase_id int NOT NULL,
    name VARCHAR(128) NOT NULL,
    authority VARCHAR(128),
    description VARCHAR2(250),
    PRIMARY KEY (biodatabase_id),
    UNIQUE (name)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX db_auth on biodatabase(authority) TABLESPACE "BIOSQL_INDEX";

-- we could insist that taxa are NCBI taxon id, but on reflection I made this
-- an optional extra line, as many flat file formats do not have the NCBI id
--
-- no organelle/sub species

-- corresponds to the node table of the NCBI taxonomy database
CREATE TABLE taxon (
    taxon_id int NOT NULL,
    ncbi_taxon_id int,
    parent_taxon_id int,
    node_rank VARCHAR(32),
    genetic_code INT,
    mito_genetic_code INT,
    left_value int,
    right_value int,
    PRIMARY KEY (taxon_id),
    UNIQUE (ncbi_taxon_id),
    UNIQUE (left_value),
    UNIQUE (right_value)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX taxparent ON taxon(parent_taxon_id) TABLESPACE "BIOSQL_INDEX";

-- corresponds to the names table of the NCBI taxonomy database
CREATE TABLE taxon_name (
    taxon_id int NOT NULL,
    name VARCHAR(255) NOT NULL,
    name_class VARCHAR(32) NOT NULL,
    UNIQUE (taxon_id,name,name_class)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX taxnametaxonid ON taxon_name(taxon_id) TABLESPACE "BIOSQL_INDEX";
CREATE INDEX taxnamename ON taxon_name(name) TABLESPACE "BIOSQL_INDEX";

-- this is the namespace (controlled vocabulary) ontology terms live in
-- we chose to have a separate table for this instead of reusing biodatabase
CREATE TABLE ontology (
    ontology_id int NOT NULL,
    name VARCHAR(32) NOT NULL,
    definition VARCHAR2(250),
    PRIMARY KEY (ontology_id),
    UNIQUE (name)
) TABLESPACE "BIOSQL_DATA";

-- any controlled vocab term, everything from full ontology
-- terms eg GO IDs to the various keys allowed as qualifiers
CREATE TABLE term (
    term_id int NOT NULL,
    name VARCHAR(255) NOT NULL,
    definition VARCHAR2(250),
    identifier VARCHAR(40),
    is_obsolete CHAR(1),
    ontology_id int NOT NULL,
    PRIMARY KEY (term_id),
    UNIQUE (name,ontology_id),
    UNIQUE (identifier)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX term_ont ON term(ontology_id) TABLESPACE "BIOSQL_INDEX";

-- ontology terms have synonyms, here is how to store them
CREATE TABLE term_synonym (
    name VARCHAR(255) NOT NULL,
    term_id int NOT NULL,
    PRIMARY KEY (term_id,name)
) TABLESPACE "BIOSQL_DATA";

-- ontology terms to dbxref association: ontology terms have dbxrefs
CREATE TABLE term_dbxref (
    term_id int NOT NULL,
    dbxref_id int NOT NULL,
    rank SMALLINT,
    PRIMARY KEY (term_id, dbxref_id)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX trmdbxref_dbxrefid ON term_dbxref(dbxref_id) TABLESPACE "BIOSQL_INDEX";

-- relationship between controlled vocabulary / ontology term
-- we use subject/predicate/object but this could also
-- be thought of as child/relationship-type/parent.
-- the subject/predicate/object naming is better as we
-- can think of the graph as composed of statements.
--
-- we also treat the relationshiptypes / predicates as
-- controlled terms in themselves; this is quite useful
-- as a lot of systems (eg GO) will soon require
-- ontologies of relationship types (eg subtle differences
-- in the partOf relationship)
--
-- this table probably won''t be filled for a while, the core
-- will just treat ontologies as flat lists of terms

CREATE TABLE term_relationship (
    term_relationship_id int NOT NULL,
    subject_term_id int NOT NULL,
    predicate_term_id int NOT NULL,
    object_term_id int NOT NULL,
    ontology_id int NOT NULL,
    PRIMARY KEY (term_relationship_id),
    UNIQUE (subject_term_id,predicate_term_id,object_term_id,ontology_id)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX trmrel_predicateid ON term_relationship(predicate_term_id) TABLESPACE "BIOSQL_INDEX";
CREATE INDEX trmrel_objectid ON term_relationship(object_term_id) TABLESPACE "BIOSQL_INDEX";
CREATE INDEX trmrel_ontid ON term_relationship(ontology_id) TABLESPACE "BIOSQL_INDEX";
-- you may want to add this for mysql because MySQL often is
-- broken with
-- respect to using the composite index for the initial keys
--CREATE INDEX ontrel_subjectid ON term_relationship(subject_term_id);

-- the infamous transitive closure table on ontology term relationships
-- this is a warehouse approach - you will need to update this regularly
--
-- the triple of (subject, predicate, object) is the same as for ontology
-- relationships, with the exception of predicate being the greatest common
-- denominator of the relationship types visited in the path (i.e., if
-- relationship type A is-a relationship type B, the greatest common
-- denominator for path containing both types A and B is B)
--
-- See the GO database or Chado schema for other (and possibly better
-- documented) implementations of the transitive closure table approach.
CREATE TABLE term_path (
    term_path_id int NOT NULL,
    subject_term_id int NOT NULL,
    predicate_term_id int NOT NULL,
    object_term_id int NOT NULL,
    ontology_id int NOT NULL,
    distance int,
    PRIMARY KEY (term_path_id),
    UNIQUE (subject_term_id,predicate_term_id,object_term_id,ontology_id,distance)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX trmpath_predicateid ON term_path(predicate_term_id) TABLESPACE "BIOSQL_INDEX";
CREATE INDEX trmpath_objectid ON term_path(object_term_id) TABLESPACE "BIOSQL_INDEX";
CREATE INDEX trmpath_ontid ON term_path(ontology_id) TABLESPACE "BIOSQL_INDEX";
-- you may want to add this for mysql because MySQL often is broken with
-- respect to using the composite index for the initial keys
--CREATE INDEX trmpath_subjectid ON term_path(subject_term_id);

-- BioJava addition
-- (the trailing comma after the PRIMARY KEY clause has been removed; it is a syntax error)
CREATE TABLE term_relationship_term (
    term_relationship_id int DEFAULT 0 NOT NULL,
    term_id int DEFAULT 0 NOT NULL,
    PRIMARY KEY (term_relationship_id,term_id)
) TABLESPACE "BIOSQL_DATA";

ALTER TABLE term_relationship_term ADD CONSTRAINT uni_term_relationship_id UNIQUE (term_relationship_id) ENABLE VALIDATE;
ALTER TABLE term_relationship_term ADD CONSTRAINT uni_term_id UNIQUE (term_id) ENABLE VALIDATE;

-- we can
-- be a bioentry without a biosequence, but not vice versa
-- most things are going to be keyed off bioentry_id
--
-- accession is the stable id, display_id is a potentially volatile,
-- human readable name.
--
-- Version may be unknown, may be undefined, or may not exist for a certain
-- accession or database (namespace). We require it here to avoid RDBMS-
-- dependent enforcement variants (version is in a compound alternative key),
-- and to simplify query construction for UK look-ups. If there is no version
-- the convention is to put 0 (zero) here. Likewise, a record with a version
-- of zero means the version is to be interpreted as NULL.
--
-- not all entries have a taxon, but many do.
-- one bioentry only has one taxon! (weirdo chimeras are not handled. tough)
--
-- Name maps to display_id in bioperl. We have a different column name
-- here to avoid confusion with the naming convention for foreign keys.

CREATE TABLE bioentry (
    bioentry_id int NOT NULL,
    biodatabase_id int NOT NULL,
    taxon_id int,
    name VARCHAR(40) NOT NULL,
    accession VARCHAR(40) NOT NULL,
    identifier VARCHAR(40),
    division VARCHAR(6),
    description VARCHAR2(250),
    version SMALLINT NOT NULL,
    PRIMARY KEY (bioentry_id),
    UNIQUE (accession,biodatabase_id,version),
    UNIQUE (identifier)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX bioentry_name ON bioentry(name) TABLESPACE "BIOSQL_INDEX";
CREATE INDEX bioentry_db ON bioentry(biodatabase_id) TABLESPACE "BIOSQL_INDEX";
CREATE INDEX bioentry_tax ON bioentry(taxon_id) TABLESPACE "BIOSQL_INDEX";

--
-- bioentry-bioentry relationships: these are typed
--
CREATE TABLE bioentry_relationship (
    bioentry_relationship_id int NOT NULL,
    object_bioentry_id int NOT NULL,
    subject_bioentry_id int NOT NULL,
    term_id int NOT NULL,
    rank INT,
    PRIMARY KEY (bioentry_relationship_id),
    UNIQUE (object_bioentry_id,subject_bioentry_id,term_id)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX bioentryrel_trm ON bioentry_relationship(term_id) TABLESPACE "BIOSQL_INDEX";
CREATE INDEX bioentryrel_child
    ON bioentry_relationship(subject_bioentry_id) TABLESPACE "BIOSQL_INDEX";
-- you may want to add this for mysql because MySQL often is broken with
-- respect to using the composite index for the initial keys
--CREATE INDEX bioentryrel_parent ON bioentry_relationship(object_bioentry_id);

-- for deep (depth > 1) bioentry relationship trees we need a transitive
-- closure table too
CREATE TABLE bioentry_path (
    object_bioentry_id int NOT NULL,
    subject_bioentry_id int NOT NULL,
    term_id int NOT NULL,
    distance int,
    UNIQUE (object_bioentry_id,subject_bioentry_id,term_id,distance)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX bioentrypath_trm ON bioentry_path(term_id) TABLESPACE "BIOSQL_INDEX";
CREATE INDEX bioentrypath_child ON bioentry_path(subject_bioentry_id) TABLESPACE "BIOSQL_INDEX";
-- you may want to add this for mysql because MySQL often is broken with
-- respect to using the composite index for the initial keys
--CREATE INDEX bioentrypath_parent ON bioentry_path(object_bioentry_id);

-- some bioentries will have a sequence
-- biosequence because sequence is sometimes a reserved word

CREATE TABLE biosequence (
    bioentry_id int NOT NULL,
    version SMALLINT,
    length int,
    alphabet VARCHAR(10),
    seq LONG,
    PRIMARY KEY (bioentry_id)
) TABLESPACE "BIOSQL_DATA";

-- add these only if you want them:
-- ALTER TABLE biosequence ADD COLUMN ( isoelec_pt NUMERIC(4,2) );
-- ALTER TABLE biosequence ADD COLUMN ( mol_wgt DOUBLE PRECISION );
-- ALTER TABLE biosequence ADD COLUMN ( perc_gc DOUBLE PRECISION );

-- database cross-references (e.g., GenBank:AC123456.1)
--
-- Version may be unknown, may be undefined, or may not exist for a certain
-- accession or database (namespace). We require it here to avoid RDBMS-
-- dependent enforcement variants (version is in a compound alternative key),
-- and to simplify query construction for UK look-ups. If there is no version
-- the convention is to put 0 (zero) here.
-- Likewise, a record with a version
-- of zero means the version is to be interpreted as NULL.
--
CREATE TABLE dbxref (
    dbxref_id int NOT NULL,
    dbname VARCHAR(40) NOT NULL,
    accession VARCHAR(40) NOT NULL,
    version SMALLINT NOT NULL,
    PRIMARY KEY (dbxref_id),
    UNIQUE (accession, dbname, version)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX dbxref_db ON dbxref(dbname) TABLESPACE "BIOSQL_INDEX";

-- for roundtripping embl/genbank, we need to have the "optional ID"
-- for the dbxref.
--
-- another use of this table could be for storing
-- descriptive text for a dbxref. for example, we may want to
-- know stuff about the interpro accessions we store (without
-- importing all of interpro), so we can attach the text
-- description as a synonym
CREATE TABLE dbxref_qualifier_value (
    dbxref_id int NOT NULL,
    term_id int NOT NULL,
    rank INT DEFAULT 0 NOT NULL,
    value VARCHAR2(100),
    PRIMARY KEY (dbxref_id,term_id,rank)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX dbxrefqual_dbx ON dbxref_qualifier_value(dbxref_id) TABLESPACE "BIOSQL_INDEX";
CREATE INDEX dbxrefqual_trm ON dbxref_qualifier_value(term_id) TABLESPACE "BIOSQL_INDEX";

-- Direct dblinks. It is tempting to do this
-- from bioentry_id to bioentry_id. But that wont work
-- during updates of one database - we will have to edit
-- this table each time. Better to do the join through accession
-- and db each time. Should be almost as cheap

CREATE TABLE bioentry_dbxref (
    bioentry_id int NOT NULL,
    dbxref_id int NOT NULL,
    rank SMALLINT,
    PRIMARY KEY (bioentry_id,dbxref_id)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX dblink_dbx ON bioentry_dbxref(dbxref_id) TABLESPACE "BIOSQL_INDEX";

-- We can have multiple references per bioentry, but one reference
-- can also be used for the same bioentry.
--
-- No two references can reference the same reference database entry
-- (dbxref_id). This is where the MEDLINE id goes: PUBMED:123456.
CREATE TABLE reference (
    reference_id int NOT NULL,
    dbxref_id int,
    location VARCHAR2(100) NOT NULL,
    title VARCHAR2(100),
    authors VARCHAR2(100) NOT NULL,
    crc VARCHAR(32),
    PRIMARY KEY (reference_id),
    UNIQUE (dbxref_id),
    UNIQUE (crc)
) TABLESPACE "BIOSQL_DATA";

-- bioentry to reference associations
CREATE TABLE bioentry_reference (
    bioentry_id int NOT NULL,
    reference_id int NOT NULL,
    start_pos int,
    end_pos int,
    rank SMALLINT DEFAULT 0 NOT NULL,
    PRIMARY KEY (bioentry_id,reference_id,rank)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX bioentryref_ref ON bioentry_reference(reference_id) TABLESPACE "BIOSQL_INDEX";

-- We can have multiple comments per seqentry, and
-- comments can have embedded '\n' characters
CREATE TABLE anncomment (
    comment_id int NOT NULL,
    bioentry_id int NOT NULL,
    comment_text VARCHAR2(100) NOT NULL,
    rank SMALLINT DEFAULT 0 NOT NULL,
    PRIMARY KEY (comment_id),
    UNIQUE (bioentry_id, rank)
) TABLESPACE "BIOSQL_DATA";

-- tag/value and ontology term annotation for bioentries goes here
CREATE TABLE bioentry_qualifier_value (
    bioentry_id int NOT NULL,
    term_id int NOT NULL,
    value VARCHAR2(100),
    rank INT DEFAULT 0 NOT NULL,
    UNIQUE (bioentry_id,term_id,rank)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX bioentryqual_trm ON bioentry_qualifier_value(term_id) TABLESPACE "BIOSQL_INDEX";

-- feature table.
-- We cleanly handle
-- - simple locations
-- - split locations
-- - split locations on remote sequences

CREATE TABLE seqfeature (
    seqfeature_id int NOT NULL,
    bioentry_id int NOT NULL,
    type_term_id int NOT NULL,
    source_term_id int NOT NULL,
    display_name VARCHAR(64),
    rank SMALLINT DEFAULT 0 NOT NULL,
    PRIMARY KEY (seqfeature_id),
    UNIQUE (bioentry_id,type_term_id,source_term_id,rank)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX seqfeature_trm ON seqfeature(type_term_id) TABLESPACE "BIOSQL_INDEX";
CREATE INDEX seqfeature_fsrc ON seqfeature(source_term_id) TABLESPACE "BIOSQL_INDEX";
-- you may want to add this for mysql because MySQL often is broken with
-- respect to using the composite index for the initial keys
--CREATE INDEX seqfeature_bioentryid ON seqfeature(bioentry_id);

-- seqfeatures can be arranged in containment hierarchies.
-- one can imagine storing other relationships between features,
-- in this case the term_id can be used to type the relationship

CREATE TABLE seqfeature_relationship (
    seqfeature_relationship_id int NOT NULL,
    object_seqfeature_id int NOT NULL,
    subject_seqfeature_id int NOT NULL,
    term_id int NOT NULL,
    rank INT,
    PRIMARY KEY (seqfeature_relationship_id),
    UNIQUE (object_seqfeature_id,subject_seqfeature_id,term_id)
) TABLESPACE "BIOSQL_DATA";

CREATE INDEX seqfeaturerel_trm ON seqfeature_relationship(term_id) TABLESPACE "BIOSQL_INDEX";
CREATE INDEX seqfeaturerel_child ON seqfeature_relationship(subject_seqfeature_id) TABLESPACE "BIOSQL_INDEX";
-- you may want to add this for mysql because MySQL often is broken with
-- respect to using the composite index for the initial keys
--CREATE INDEX seqfeaturerel_parent ON seqfeature_relationship(object_seqfeature_id);

-- for deep (depth > 1) seqfeature relationship trees we need a transitive
-- closure table too
CREATE TABLE seqfeature_path (
    object_seqfeature_id int NOT NULL,
    subject_seqfeature_id int NOT NULL,
    term_id int NOT NULL,
    distance int,
    UNIQUE
(object_seqfeature_id,subject_seqfeature_id,term_id,distance) ) TABLESPACE "BIOSQL_DATA"; CREATE INDEX seqfeaturepath_trm ON seqfeature_path(term_id) TABLESPACE "BIOSQL_INDEX"; CREATE INDEX seqfeaturepath_child ON seqfeature_path(subject_seqfeature_id) TABLESPACE "BIOSQL_INDEX"; -- you may want to add this for mysql because MySQL often is broken with -- respect to using the composite index for the initial keys --CREATE INDEX seqfeaturerel_parent ON seqfeature_path(object_seqfeature_id); -- tag/value associations - or ontology annotations CREATE TABLE seqfeature_qualifier_value ( seqfeature_id int NOT NULL, term_id int NOT NULL, rank SMALLINT DEFAULT 0 NOT NULL, value VARCHAR2(4000) NOT NULL, PRIMARY KEY (seqfeature_id,term_id,rank) ) TABLESPACE "BIOSQL_DATA"; CREATE INDEX seqfeaturequal_trm ON seqfeature_qualifier_value(term_id) TABLESPACE "BIOSQL_INDEX"; -- DBXrefs for features. This is necessary for genome oriented viewpoints, -- where you have a few have long sequences (contigs, or chromosomes) with many -- features on them. In that case the features are the semantic scope for -- their annotation bundles, not the bioentry they are attached to. CREATE TABLE seqfeature_dbxref ( seqfeature_id int NOT NULL, dbxref_id int NOT NULL, rank SMALLINT, PRIMARY KEY (seqfeature_id,dbxref_id) ) TABLESPACE "BIOSQL_DATA"; CREATE INDEX feadblink_dbx ON seqfeature_dbxref(dbxref_id) TABLESPACE "BIOSQL_INDEX"; -- basically we model everything as potentially having -- any number of locations, ie, a split location. SimpleLocations -- just have one location. We need to have a location id for the qualifier -- associations of fuzzy locations. -- please do not try to model complex assemblies with this thing. It wont -- work. Check out the ensembl schema for this. 
-- we allow nulls for start/end - this is useful for fuzzies as -- standard range queries will not be included -- for remote locations, the join to make is to DBXref -- the FK to term is a possibility to store the type of the -- location for determining in one hit whether it's a fuzzy or not CREATE TABLE location ( location_id int NOT NULL , seqfeature_id int NOT NULL, dbxref_id int , term_id int , start_pos int, end_pos int, strand INT NOT NULL, rank SMALLINT DEFAULT 0 NOT NULL, PRIMARY KEY (location_id), UNIQUE (seqfeature_id, rank) ) TABLESPACE "BIOSQL_DATA"; CREATE INDEX seqfeatureloc_start ON location(start_pos, end_pos) TABLESPACE "BIOSQL_INDEX"; CREATE INDEX seqfeatureloc_dbx ON location(dbxref_id) TABLESPACE "BIOSQL_INDEX"; CREATE INDEX seqfeatureloc_trm ON location(term_id) TABLESPACE "BIOSQL_INDEX"; -- location qualifiers - mainly intended for fuzzies but anything -- can go in here -- some controlled vocab terms have slots; -- fuzzies could be modeled as min_start(5), max_start(5) -- -- there is no restriction on extending the fuzzy ontology -- for your own nefarious aims, although the bio* apis will -- most likely ignore these CREATE TABLE location_qualifier_value ( location_id int NOT NULL, term_id int NOT NULL, value VARCHAR(255) NOT NULL, int_value int, PRIMARY KEY (location_id,term_id) ) TABLESPACE "BIOSQL_DATA"; CREATE INDEX locationqual_trm ON location_qualifier_value(term_id) TABLESPACE "BIOSQL_INDEX"; -- -- Create the foreign key constraints -- -- ontology term ALTER TABLE term ADD CONSTRAINT FKont_term FOREIGN KEY (ontology_id) REFERENCES ontology(ontology_id) ON DELETE CASCADE; -- term synonyms ALTER TABLE term_synonym ADD CONSTRAINT FKterm_syn FOREIGN KEY (term_id) REFERENCES term(term_id) ON DELETE CASCADE; -- term_dbxref ALTER TABLE term_dbxref ADD CONSTRAINT FKdbxref_trmdbxref FOREIGN KEY (dbxref_id) REFERENCES dbxref(dbxref_id) ON DELETE CASCADE; ALTER TABLE term_dbxref ADD CONSTRAINT FKterm_trmdbxref FOREIGN KEY (term_id) REFERENCES 
term(term_id) ON DELETE CASCADE; -- term_relationship ALTER TABLE term_relationship ADD CONSTRAINT FKtrmsubject_trmrel FOREIGN KEY (subject_term_id) REFERENCES term(term_id) ON DELETE CASCADE; ALTER TABLE term_relationship ADD CONSTRAINT FKtrmpredicate_trmrel FOREIGN KEY (predicate_term_id) REFERENCES term(term_id) ON DELETE CASCADE; ALTER TABLE term_relationship ADD CONSTRAINT FKtrmobject_trmrel FOREIGN KEY (object_term_id) REFERENCES term(term_id) ON DELETE CASCADE; ALTER TABLE term_relationship ADD CONSTRAINT FKterm_trmrel FOREIGN KEY (ontology_id) REFERENCES ontology(ontology_id) ON DELETE CASCADE; -- term_path ALTER TABLE term_path ADD CONSTRAINT FKtrmsubject_trmpath FOREIGN KEY (subject_term_id) REFERENCES term(term_id) ON DELETE CASCADE; ALTER TABLE term_path ADD CONSTRAINT FKtrmpredicate_trmpath FOREIGN KEY (predicate_term_id) REFERENCES term(term_id) ON DELETE CASCADE; ALTER TABLE term_path ADD CONSTRAINT FKtrmobject_trmpath FOREIGN KEY (object_term_id) REFERENCES term(term_id) ON DELETE CASCADE; ALTER TABLE term_path ADD CONSTRAINT FKontology_trmpath FOREIGN KEY (ontology_id) REFERENCES ontology(ontology_id) ON DELETE CASCADE; -- taxon, taxon_name -- unfortunately, we can't constrain parent_taxon_id as it is violated -- occasionally by the downloads available from NCBI -- ALTER TABLE taxon ADD CONSTRAINT FKtaxon_taxon -- FOREIGN KEY (parent_taxon_id) REFERENCES taxon(taxon_id); ALTER TABLE taxon_name ADD CONSTRAINT FKtaxon_taxonname FOREIGN KEY (taxon_id) REFERENCES taxon(taxon_id) ON DELETE CASCADE; -- bioentry ALTER TABLE bioentry ADD CONSTRAINT FKtaxon_bioentry FOREIGN KEY (taxon_id) REFERENCES taxon(taxon_id); ALTER TABLE bioentry ADD CONSTRAINT FKbiodatabase_bioentry FOREIGN KEY (biodatabase_id) REFERENCES biodatabase(biodatabase_id); -- bioentry_relationship ALTER TABLE bioentry_relationship ADD CONSTRAINT FKterm_bioentryrel FOREIGN KEY (term_id) REFERENCES term(term_id); ALTER TABLE bioentry_relationship ADD CONSTRAINT FKparentent_bioentryrel 
FOREIGN KEY (object_bioentry_id) REFERENCES bioentry(bioentry_id) ON DELETE CASCADE; ALTER TABLE bioentry_relationship ADD CONSTRAINT FKchildent_bioentryrel FOREIGN KEY (subject_bioentry_id) REFERENCES bioentry(bioentry_id) ON DELETE CASCADE; -- bioentry_path ALTER TABLE bioentry_path ADD CONSTRAINT FKterm_bioentrypath FOREIGN KEY (term_id) REFERENCES term(term_id); ALTER TABLE bioentry_path ADD CONSTRAINT FKparentent_bioentrypath FOREIGN KEY (object_bioentry_id) REFERENCES bioentry(bioentry_id) ON DELETE CASCADE; ALTER TABLE bioentry_path ADD CONSTRAINT FKchildent_bioentrypath FOREIGN KEY (subject_bioentry_id) REFERENCES bioentry(bioentry_id) ON DELETE CASCADE; -- biosequence ALTER TABLE biosequence ADD CONSTRAINT FKbioentry_bioseq FOREIGN KEY (bioentry_id) REFERENCES bioentry(bioentry_id) ON DELETE CASCADE; -- comment ALTER TABLE anncomment ADD CONSTRAINT FKbioentry_comment FOREIGN KEY(bioentry_id) REFERENCES bioentry(bioentry_id) ON DELETE CASCADE; -- bioentry_dbxref ALTER TABLE bioentry_dbxref ADD CONSTRAINT FKbioentry_dblink FOREIGN KEY (bioentry_id) REFERENCES bioentry(bioentry_id) ON DELETE CASCADE; ALTER TABLE bioentry_dbxref ADD CONSTRAINT FKdbxref_dblink FOREIGN KEY (dbxref_id) REFERENCES dbxref(dbxref_id) ON DELETE CASCADE; -- dbxref_qualifier_value ALTER TABLE dbxref_qualifier_value ADD CONSTRAINT FKtrm_dbxrefqual FOREIGN KEY (term_id) REFERENCES term(term_id); ALTER TABLE dbxref_qualifier_value ADD CONSTRAINT FKdbxref_dbxrefqual FOREIGN KEY (dbxref_id) REFERENCES dbxref(dbxref_id) ON DELETE CASCADE; -- bioentry_reference ALTER TABLE bioentry_reference ADD CONSTRAINT FKbioentry_entryref FOREIGN KEY (bioentry_id) REFERENCES bioentry(bioentry_id) ON DELETE CASCADE; ALTER TABLE bioentry_reference ADD CONSTRAINT FKreference_entryref FOREIGN KEY (reference_id) REFERENCES reference(reference_id) ON DELETE CASCADE; -- bioentry_qualifier_value ALTER TABLE bioentry_qualifier_value ADD CONSTRAINT FKbioentry_entqual FOREIGN KEY (bioentry_id) REFERENCES 
bioentry(bioentry_id) ON DELETE CASCADE; ALTER TABLE bioentry_qualifier_value ADD CONSTRAINT FKterm_entqual FOREIGN KEY (term_id) REFERENCES term(term_id); -- reference ALTER TABLE reference ADD CONSTRAINT FKdbxref_reference FOREIGN KEY ( dbxref_id ) REFERENCES dbxref ( dbxref_id ) ; -- seqfeature ALTER TABLE seqfeature ADD CONSTRAINT FKterm_seqfeature FOREIGN KEY (type_term_id) REFERENCES term(term_id); ALTER TABLE seqfeature ADD CONSTRAINT FKsourceterm_seqfeature FOREIGN KEY (source_term_id) REFERENCES term(term_id); ALTER TABLE seqfeature ADD CONSTRAINT FKbioentry_seqfeature FOREIGN KEY (bioentry_id) REFERENCES bioentry(bioentry_id) ON DELETE CASCADE; -- seqfeature_relationship ALTER TABLE seqfeature_relationship ADD CONSTRAINT FKterm_seqfeatrel FOREIGN KEY (term_id) REFERENCES term(term_id); ALTER TABLE seqfeature_relationship ADD CONSTRAINT FKparentfeat_seqfeatrel FOREIGN KEY (object_seqfeature_id) REFERENCES seqfeature(seqfeature_id) ON DELETE CASCADE; ALTER TABLE seqfeature_relationship ADD CONSTRAINT FKchildfeat_seqfeatrel FOREIGN KEY (subject_seqfeature_id) REFERENCES seqfeature(seqfeature_id) ON DELETE CASCADE; -- seqfeature_path ALTER TABLE seqfeature_path ADD CONSTRAINT FKterm_seqfeatpath FOREIGN KEY (term_id) REFERENCES term(term_id); ALTER TABLE seqfeature_path ADD CONSTRAINT FKparentfeat_seqfeatpath FOREIGN KEY (object_seqfeature_id) REFERENCES seqfeature(seqfeature_id) ON DELETE CASCADE; ALTER TABLE seqfeature_path ADD CONSTRAINT FKchildfeat_seqfeatpath FOREIGN KEY (subject_seqfeature_id) REFERENCES seqfeature(seqfeature_id) ON DELETE CASCADE; -- seqfeature_qualifier_value ALTER TABLE seqfeature_qualifier_value ADD CONSTRAINT FKterm_featqual FOREIGN KEY (term_id) REFERENCES term(term_id); ALTER TABLE seqfeature_qualifier_value ADD CONSTRAINT FKseqfeature_featqual FOREIGN KEY (seqfeature_id) REFERENCES seqfeature(seqfeature_id) ON DELETE CASCADE; -- seqfeature_dbxref ALTER TABLE seqfeature_dbxref ADD CONSTRAINT FKseqfeature_feadblink FOREIGN KEY 
(seqfeature_id) REFERENCES seqfeature(seqfeature_id) ON DELETE CASCADE; ALTER TABLE seqfeature_dbxref ADD CONSTRAINT FKdbxref_feadblink FOREIGN KEY (dbxref_id) REFERENCES dbxref(dbxref_id) ON DELETE CASCADE; -- location ALTER TABLE location ADD CONSTRAINT FKseqfeature_location FOREIGN KEY (seqfeature_id) REFERENCES seqfeature(seqfeature_id) ON DELETE CASCADE; ALTER TABLE location ADD CONSTRAINT FKdbxref_location FOREIGN KEY (dbxref_id) REFERENCES dbxref(dbxref_id); ALTER TABLE location ADD CONSTRAINT FKterm_featloc FOREIGN KEY (term_id) REFERENCES term(term_id); -- location_qualifier_value ALTER TABLE location_qualifier_value ADD CONSTRAINT FKfeatloc_locqual FOREIGN KEY (location_id) REFERENCES location(location_id) ON DELETE CASCADE; ALTER TABLE location_qualifier_value ADD CONSTRAINT FKterm_locqual FOREIGN KEY (term_id) REFERENCES term(term_id); -- -- Triggers for automatic primary key generation and other sanity checks -- CREATE OR REPLACE TRIGGER BID_location BEFORE INSERT on location -- for each row BEGIN IF :new.location_id IS NULL THEN SELECT location_pk_seq.nextval INTO :new.location_id FROM DUAL; END IF; END; / CREATE OR REPLACE TRIGGER BID_seqfeature BEFORE INSERT on seqfeature -- for each row BEGIN IF :new.seqfeature_id IS NULL THEN SELECT seqfeature_pk_seq.nextval INTO :new.seqfeature_id FROM DUAL; END IF; END; / CREATE TRIGGER BID_seqfeature_relationship BEFORE INSERT on seqfeature_relationship -- for each row BEGIN IF :new.seqfeature_relationship_id IS NULL THEN SELECT seqfeature_relationship_pk_seq.nextval INTO :new.seqfeature_relationship_id FROM DUAL; END IF; END; / CREATE OR REPLACE TRIGGER BID_anncomment BEFORE INSERT on anncomment -- for each row BEGIN IF :new.comment_id IS NULL THEN SELECT anncomment_pk_seq.nextval INTO :new.comment_id FROM DUAL; END IF; END; / CREATE OR REPLACE TRIGGER BID_reference BEFORE INSERT on reference -- for each row BEGIN IF :new.reference_id IS NULL THEN SELECT reference_pk_seq.nextval INTO :new.reference_id FROM 
DUAL; END IF; END; / CREATE OR REPLACE TRIGGER BID_bioentry_relationship BEFORE INSERT on bioentry_relationship -- for each row BEGIN IF :new.bioentry_relationship_id IS NULL THEN SELECT bioentry_relationship_pk_seq.nextval INTO :new.bioentry_relationship_id FROM DUAL; END IF; END; / CREATE OR REPLACE TRIGGER BID_bioentry BEFORE INSERT on bioentry -- for each row BEGIN IF :new.bioentry_id IS NULL THEN SELECT bioentry_pk_seq.nextval INTO :new.bioentry_id FROM DUAL; END IF; -- IF :new.Division IS NULL THEN -- :new.Division := 'UNK'; -- END IF; END; / CREATE OR REPLACE TRIGGER BID_term BEFORE INSERT on term -- for each row BEGIN IF :new.term_id IS NULL THEN SELECT term_pk_seq.nextval INTO :new.term_id FROM DUAL; END IF; END; / CREATE OR REPLACE TRIGGER BID_term_relationship BEFORE INSERT on term_relationship -- for each row BEGIN IF :new.term_relationship_id IS NULL THEN SELECT term_relationship_pk_seq.nextval INTO :new.term_relationship_id FROM DUAL; END IF; END; / CREATE OR REPLACE TRIGGER BID_term_path BEFORE INSERT on term_path -- for each row BEGIN IF :new.term_path_id IS NULL THEN SELECT term_path_pk_seq.nextval INTO :new.term_path_id FROM DUAL; END IF; END; / CREATE OR REPLACE TRIGGER BID_ontology BEFORE INSERT on ontology -- for each row BEGIN IF :new.ontology_id IS NULL THEN SELECT ontology_pk_seq.nextval INTO :new.ontology_id FROM DUAL; END IF; END; / CREATE OR REPLACE TRIGGER BID_taxon BEFORE INSERT on taxon -- for each row BEGIN IF :new.taxon_id IS NULL THEN SELECT taxon_pk_seq.nextval INTO :new.taxon_id FROM DUAL; END IF; END; / CREATE OR REPLACE TRIGGER BID_biodatabase BEFORE INSERT on biodatabase -- for each row BEGIN IF :new.biodatabase_id IS NULL THEN SELECT biodatabase_pk_seq.nextval INTO :new.biodatabase_id FROM DUAL; END IF; END; / CREATE OR REPLACE TRIGGER BID_dbxref BEFORE INSERT on dbxref -- for each row BEGIN IF :new.dbxref_id IS NULL THEN SELECT dbxref_pk_seq.nextval INTO :new.dbxref_id FROM DUAL; END IF; END; / -------------- next part 
--------------
From mark.schreiber at group.novartis.com Sun Nov 21 21:39:58 2004
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Sun Nov 21 21:38:05 2004
Subject: [Biojava-l] opening unknown fasta file
Message-ID:

One way to do this would be to create a Unicode alphabet (or ASCII alphabet) and read the file into a Sequence of that Alphabet, create a Distribution, compare it to the DNA/RNA/Protein distributions using DistributionTools, and then convert it to the correct Alphabet.

Even more ambitious would be to read the whole file into a text buffer and guess the format and alphabet based on the usage of characters.

Does anyone feel inspired to do something like this? We are always getting emails from students looking for short projects. How about that one?

My basic minimal requirement would be that the file should not be read twice. I/O is expensive; memory is cheap.

- Mark

Thomas Down
Sent by: biojava-l-bounces@portal.open-bio.org
11/13/2004 12:26 AM
To: Mark Schreiber/GP/Novartis@PH
cc: biojava-list
Subject: Re: [Biojava-l] opening unknown fasta file

On Fri, Nov 12, 2004 at 10:01:13AM +0800, mark.schreiber@group.novartis.com wrote:
>
> Basically there is absolutely no failsafe way to know if a fasta file is
> DNA or Protein (or RNA). It's perfectly reasonable to have a short peptide
> which contains only acg and t although it becomes very unlikely with
> longer sequences.

The real problem isn't A, C, G, or T, but the other 11 ambiguity symbols that appear in DNA sequences. Ns are everywhere, but many of the other ambiguities appear from time to time, too.

If we were *really* serious about alphabet-guessing (which scares me, to be honest), one option would be to calculate histograms of character frequencies in EMBL and Swissprot, and look for the closest match. I believe that Internet Explorer takes this approach when it hits a web page without an explicitly-specified character encoding -- it apparently works pretty well...
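To make the histogram idea above a bit more concrete, here is a minimal sketch in plain Java. It is not BioJava code: the class name AlphabetGuesser and its methods are invented for this illustration, and the reference profiles are built from whatever sample sequences the caller supplies (in practice one would presumably derive them from EMBL and Swissprot, as suggested). It counts letter frequencies and picks the reference profile with the smallest squared distance.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of BioJava: guesses an alphabet label by
// comparing letter-frequency histograms against caller-supplied references.
class AlphabetGuesser {
    // label -> relative frequency of each letter A..Z
    private final Map<String, double[]> profiles = new LinkedHashMap<String, double[]>();

    // Relative frequency of each letter A-Z; case and non-letters are ignored.
    static double[] histogram(CharSequence text) {
        double[] freq = new double[26];
        int letters = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = Character.toUpperCase(text.charAt(i));
            if (c >= 'A' && c <= 'Z') {
                freq[c - 'A']++;
                letters++;
            }
        }
        if (letters > 0) {
            for (int i = 0; i < 26; i++) {
                freq[i] /= letters;
            }
        }
        return freq;
    }

    // Register a reference profile built from a sample of known type.
    void addReference(String label, CharSequence sample) {
        profiles.put(label, histogram(sample));
    }

    // Return the label whose profile is closest (squared Euclidean distance).
    String guess(CharSequence text) {
        if (profiles.isEmpty()) {
            throw new IllegalStateException("no reference profiles registered");
        }
        double[] h = histogram(text);
        String best = null;
        double bestDist = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, double[]> e : profiles.entrySet()) {
            double[] ref = e.getValue();
            double dist = 0.0;
            for (int i = 0; i < 26; i++) {
                double diff = h[i] - ref[i];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        AlphabetGuesser guesser = new AlphabetGuesser();
        // Tiny illustrative samples; real profiles would need far more data.
        guesser.addReference("DNA",
            "atggggaagtacttcggaaccagcggaatcagggaagtctttaatgagaagctgacacct");
        guesser.addReference("PROTEIN",
            "MGKYFGTSGIREVFNEKLTPELALKVGKALGTYLGGGKVVIGKDT");
        System.out.println(guesser.guess("gagctggctctaaaggtcggcaaagcccttggaacgtacc"));
    }
}
```

Because the whole input is held as a CharSequence, the file only needs to be read once into a buffer; the guess can then be made before handing the same buffer to the appropriate parser. Short inputs and sequences heavy in ambiguity symbols will still be guessed unreliably, so this is a heuristic at best.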
Does anyone feel this serious?

Thomas.
_______________________________________________
Biojava-l mailing list - Biojava-l@biojava.org
http://biojava.org/mailman/listinfo/biojava-l

From jdiggans at excelsiortech.com Mon Nov 22 01:38:20 2004
From: jdiggans at excelsiortech.com (James Diggans)
Date: Mon Nov 22 01:30:52 2004
Subject: [Biojava-l] Parsing MegaBLAST output files?
Message-ID: <41A1895C.7000302@excelsiortech.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

All, I'm attempting to use BioJava to parse the output from NCBI's commandline MegaBLAST and receiving an error:

'Could not recognise the format of this file as one supported by the framework.'

in a SAXException thrown by BlastLikeSAXParser. An old post to the mailing list:

http://www.biojava.org/pipermail/biojava-dev/2002-October/000150.html

seems to indicate that this was fixed long ago via this commit to CVS:

http://cvs.biojava.org/cgi-bin/viewcvs/viewcvs.cgi/biojava-live/src/org/biojava/bio/program/ssbind/HeaderStAXHandler.java.diff?r1=1.3&r2=1.4&cvsroot=biojava

The MegaBLAST file I'm trying to parse is clean and my attempt at a parse consists of (largely pulled from the recipe from BioJava in Anger):

- ------------------
InputStream is = new FileInputStream(blastResult);

BlastLikeSAXParser parser = new BlastLikeSAXParser();
SeqSimilarityAdapter adapter = new SeqSimilarityAdapter();
parser.setContentHandler(adapter);

alignmentResults = new ArrayList();
SearchContentHandler builder = new BlastLikeSearchBuilder(alignmentResults,
    new DummySequenceDB("queries"),
    new DummySequenceDBInstallation());

adapter.setSearchContentHandler(builder);

parser.parse(new InputSource(is));
- ------------------

Any ideas on why I'm getting the SAXException? Thanks ...
- -j
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3-nr1 (Windows XP)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFBoYlc75jgGJzUhNkRAu8zAJ9gTNoPouk4/29EDpWKcQVx5EB34gCg2MkD
DndldC3zi3bD2QKWgqMNOxs=
=TS47
-----END PGP SIGNATURE-----

From mark.schreiber at group.novartis.com Mon Nov 22 19:45:38 2004
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Mon Nov 22 19:43:33 2004
Subject: [Biojava-l] Parsing MegaBLAST output files?
Message-ID:

Hello -

MegaBLAST is not officially supported. This doesn't mean it won't work; it just means we don't know whether it will. If it isn't too different from normal BLAST, it probably will.

The BlastLikeSAXParser has two modes: lazy and strict. If you call setModeLazy() before parsing, it won't care if it doesn't recognise the format as one that is tried and tested, and will attempt to parse it anyway. You should carefully check a few results, though, to make sure it is going well. If things work, let us know so we can add MegaBLAST to the list of trusted programs.

Hope this helps,

Mark

James Diggans
Sent by: biojava-l-bounces@portal.open-bio.org
11/22/2004 02:38 PM
To: BioJava
cc: (bcc: Mark Schreiber/GP/Novartis)
Subject: [Biojava-l] Parsing MegaBLAST output files?

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

All, I'm attempting to use BioJava to parse the output from NCBI's commandline MegaBLAST and receiving an error:

'Could not recognise the format of this file as one supported by the framework.'

in a SAXException thrown by BlastLikeSAXParser.
An old post to the mailing list: http://www.biojava.org/pipermail/biojava-dev/2002-October/000150.html seems to indicate that this was fixed long ago via this commit to CVS: http://cvs.biojava.org/cgi-bin/viewcvs/viewcvs.cgi/biojava-live/src/org/biojava/bio/program/ssbind/HeaderStAXHandler.java.diff?r1=1.3&r2=1.4&cvsroot=biojava The MegaBLAST file I'm trying to parse is clean and my attempt at a parse consists of (largely pulled from the recipe from BioJava in Anger): - ------------------ InputStream is = new FileInputStream(blastResult); BlastLikeSAXParser parser = new BlastLikeSAXParser(); SeqSimilarityAdapter adapter = new SeqSimilarityAdapter(); parser.setContentHandler(adapter); alignmentResults = new ArrayList(); SearchContentHandler builder = new BlastLikeSearchBuilder(alignmentResults, ~ new DummySequenceDB("queries"), new DummySequenceDBInstallation()); adapter.setSearchContentHandler(builder); parser.parse(new InputSource(is)); - ------------------ Any ideas on why I'm getting the SAXException? Thanks ... - -j -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.3-nr1 (Windows XP) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFBoYlc75jgGJzUhNkRAu8zAJ9gTNoPouk4/29EDpWKcQVx5EB34gCg2MkD DndldC3zi3bD2QKWgqMNOxs= =TS47 -----END PGP SIGNATURE----- _______________________________________________ Biojava-l mailing list - Biojava-l@biojava.org http://biojava.org/mailman/listinfo/biojava-l From johnny.hujol at comcast.net Mon Nov 22 21:09:13 2004 From: johnny.hujol at comcast.net (Johnny Hujol) Date: Mon Nov 22 21:08:02 2004 Subject: [Biojava-l] Exception found in GenbankSequenceDB -- getSequence Message-ID: <41A29BC9.7090305@comcast.net> Hi, I'm using biojava-1.30-jdk14.jar with C:\>java -version java version "1.5.0" Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0-b64) Java HotSpot(TM) Client VM (build 1.5.0-b64, mixed mode) On W2000 SP2. 
When running this little Java code:

Sequence seqObject = null;
try {
    seqObject = genbankSequenceDB.getSequence(text);
    SeqIOTools.writeGenbank(System.out, seqObject);
} catch (Exception e1) {
    e1.printStackTrace();
}
String sequence = seqObject.seqString();

This throws the following exception:

got data from http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=55740946
Exception found in GenbankSequenceDB -- getSequence
org.biojava.bio.BioException: Could not read sequence
Exception in thread "AWT-EventQueue-0" java.lang.NullPointerException
    at org.jfb.chp1.listing1_6.SequenceForm1_6$4.focusLost(SequenceForm1_6.java:272)
    at java.awt.AWTEventMulticaster.focusLost(AWTEventMulticaster.java:172)
    at java.awt.Component.processFocusEvent(Component.java:5380)
    at java.awt.Component.processEvent(Component.java:5244)
    at java.awt.Container.processEvent(Container.java:1966)
    at java.awt.Component.dispatchEventImpl(Component.java:3955)
    at java.awt.Container.dispatchEventImpl(Container.java:2024)
    at java.awt.Component.dispatchEvent(Component.java:3803)
    at java.awt.KeyboardFocusManager.redispatchEvent(KeyboardFocusManager.java:1810)
    at java.awt.DefaultKeyboardFocusManager.typeAheadAssertions(DefaultKeyboardFocusManager.java:836)
    at java.awt.DefaultKeyboardFocusManager.dispatchEvent(DefaultKeyboardFocusManager.java:526)
    at java.awt.Component.dispatchEventImpl(Component.java:3841)
    at java.awt.Container.dispatchEventImpl(Container.java:2024)
    at java.awt.Component.dispatchEvent(Component.java:3803)
    at java.awt.EventQueue.dispatchEvent(EventQueue.java:463)
    at java.awt.EventDispatchThread.pumpOneEventForHierarchy(EventDispatchThread.java:234)
    at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:163)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:157)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:149)
    at java.awt.EventDispatchThread.run(EventDispatchThread.java:110)

Process finished with exit code 0

The
url looks good but it seems that the parser does not get anything from the data returned by the server. What's happening? Any help would be appreciated. Cheers, J From jdiggans at excelsiortech.com Tue Nov 23 00:08:02 2004 From: jdiggans at excelsiortech.com (James Diggans) Date: Tue Nov 23 00:00:58 2004 Subject: [Biojava-l] Parsing MegaBLAST output files? In-Reply-To: References: Message-ID: <41A2C5B2.8010302@excelsiortech.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Thanks for the reply, Mark. Setting the parser to be lazy (just before the parse; it shouldn't matter where I do this as long as it's prior to the parse, correct?) doesn't seem to help -- I still get the same SAX exception. The MegaBLAST output seems, to my eye, to be identical to that of blastn minus the header line: MEGABLAST 2.2.10 [Oct-19-2004] Looking at the code for BlastLikeSAXParser, it seems, even in lazy mode, to require that the header line contain at least a name with which it is familiar (lazy just turns off interest in the version number). Would a fix be as simple as adding 'MEGABLAST' to the list of acceptable names? I can provide any interested dev w/ a sample output file from the above-mentioned version of MegaBLAST. If no one's interested, I'll follow up but it'll take me a lot longer than those already familiar w/ the BioJava parser code. Thanks all, - -j mark.schreiber@group.novartis.com wrote: | Hello - | | MegaBLAST is not offcially supported. This doesn't mean it won't work it | just means we don't know if it will work. If it isn't too different from | normal blast it probably will. | | The BlastLikeSAXParser has two modes. Lazy and Strict. If you call | setModeLazy() before parsing it won't care if it doesn't recognise the | format as one that is tried and tested and will attempt to parse it | anyway. You should carefully check a few results though to make sure it is | going well. If things work let us know so we can add MegaBLAST to the list | of trusted programs. 
| | Hope this helps, | | Mark | | | James Diggans | Sent by: biojava-l-bounces@portal.open-bio.org | 11/22/2004 02:38 PM | | | To: BioJava | cc: (bcc: Mark Schreiber/GP/Novartis) | Subject: [Biojava-l] Parsing MegaBLAST output files? | | | | | All, I'm attempting to use BioJava to parse the output from NCBI's | commandline MegaBLAST and receiving an error: | | 'Could not recognise the format of this file as one supported by the | framework.' | | in a SAXException thrown by BlastLikeSAXParser. An old post to the | mailing list: | | http://www.biojava.org/pipermail/biojava-dev/2002-October/000150.html | | seems to indicate that this was fixed long ago via this commit to CVS: | | http://cvs.biojava.org/cgi-bin/viewcvs/viewcvs.cgi/biojava-live/src/org/biojava/bio/program/ssbind/HeaderStAXHandler.java.diff?r1=1.3&r2=1.4&cvsroot=biojava | | The MegaBLAST file I'm trying to parse is clean and my attempt at a | parse consists of (largely pulled from the recipe from BioJava in Anger): | | ------------------ | InputStream is = new FileInputStream(blastResult); | | BlastLikeSAXParser parser = new BlastLikeSAXParser(); | SeqSimilarityAdapter adapter = new SeqSimilarityAdapter(); | parser.setContentHandler(adapter); | | alignmentResults = new ArrayList(); | SearchContentHandler builder = new | BlastLikeSearchBuilder(alignmentResults, | ~ new DummySequenceDB("queries"), | new DummySequenceDBInstallation()); | | adapter.setSearchContentHandler(builder); | | parser.parse(new InputSource(is)); | ------------------ | | Any ideas on why I'm getting the SAXException? Thanks ... 
| -j | -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.3-nr1 (Windows XP) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFBosWy75jgGJzUhNkRAtL+AJ9V6JoMXSdT1AWPuFGMckUiMzFO5ACg2D1r 2R75Y4ElTIBxrMA+Pukgre0= =Is3P -----END PGP SIGNATURE----- From vc100 at doc.ic.ac.uk Tue Nov 23 11:44:25 2004 From: vc100 at doc.ic.ac.uk (Vasa Curcin) Date: Tue Nov 23 11:42:07 2004 Subject: [Biojava-l] Writing EMBL files Message-ID: <41A368E9.4020800@doc.ic.ac.uk> Hello, We are loading an EMBL file into a SequenceDB and then writing it out again and getting the following error: 16:42:19,032 INFO [STDOUT] at org.biojava.bio.seq.io.EmblFileFormer.addSequ enceProperty(EmblFileFormer.java:246) 16:42:19,032 INFO [STDOUT] at org.biojava.bio.seq.io.SeqIOEventEmitter.getS eqIOEvents(SeqIOEventEmitter.java:92) 16:42:19,032 INFO [STDOUT] at org.biojava.bio.seq.io.EmblLikeFormat.writeSe quence(EmblLikeFormat.java:289) 16:42:19,032 INFO [STDOUT] at org.biojava.bio.seq.io.EmblLikeFormat.writeSe quence(EmblLikeFormat.java:253) 16:42:19,032 INFO [STDOUT] at org.biojava.bio.seq.io.StreamWriter.writeStre am(StreamWriter.java:63) 16:42:19,032 INFO [STDOUT] at org.biojava.bio.seq.io.SeqIOTools.writeEmbl(S eqIOTools.java:289) 16:42:19,032 INFO [STDOUT] at SequenceDBToText.process(SequenceDBToText.jav a:134) This is the file we are using: ID AB126240 standard; genomic DNA; PRO; 1350 BP. XX AC AB126240; XX SV AB126240.1 XX DT 03-SEP-2004 (Rel. 81, Created) DT 03-SEP-2004 (Rel. 81, Last updated, Version 1) XX DE Thermococcus kodakaraensis Tko1062 gene for phosphosugar mutase, complete DE cds. XX KW . XX OS Thermococcus kodakaraensis OC Archaea; Euryarchaeota; Thermococci; Thermococcales; Thermococcaceae; OC Thermococcus. XX RN [1] RP 1-1350 RA Imanaka T., Atomi H., Rashid N.; RT ; RL Submitted (15-NOV-2003) to the EMBL/GenBank/DDBJ databases. 
RL Tadayuki Imanaka, Kyoto University, Synthetic Chemistry & Biological RL Chemistry, Graduate School of Engineering; Katsura, Nishikyo-ku, Kyoto RL 615-8510, Japan (E-mail:imanaka@sbchem.kyoto-u.ac.jp, Tel:81-75-383-2777, RL Fax:81-75-383-2778) XX RN [2] RA Rashid N., Kanai T., Atomi H., Imanaka T.; RT "Among Multiple Phosphomannomutase Gene Orthologues, Only One Gene Encodes RT a Protein with Phosphoglucomutase and Phosphomannomutase Activities in RT Thermococcus kodakaraensis"; RL J. Bacteriol. 186:6070-6076(2004). XX FH Key Location/Qualifiers FH FT source 1..1350 FT /db_xref="taxon:69014" FT /mol_type="genomic DNA" FT /organism="Thermococcus kodakaraensis" FT /strain="KOD1" FT CDS 1..1350 FT /codon_start=1 FT /transl_table=11 FT /gene="Tko1062" FT /product="phosphosugar mutase" FT /protein_id="BAD42439.1" FT /translation="MGKYFGTSGIREVFNEKLTPELALKVGKALGTYLGGGKVVIGKDT FT RTSGDVIKSAVISGLLSTGVDVIDIGLAPTPLTGFAIKLYGADAGVTITASHNPPEYNG FT IKVWQANGMAYTSEMERELESIMDSGNFKKAPWNEIGTLRRADPSEEYINAALKFVKLE FT NSYTVVLDSGNGAGSVVSPYLQRELGNRVISLNSHPSGFFVRELEPNAKSLSALAKTVR FT VMKADVGIAHDGDADRIGVVDDQGNFVEYEVMLSLIAGYMLRKFGKGKIVTTVDAGFAL FT DDYLRPLGGEVIRTRVGDVAVADELAKHGGVFGGEPSGTWIIPQWNLTPDGIFAGALVL FT EMIDRLGPISELAKEVPRYVTLRAKIPCPNEKKAKAMEIIAREALKTFDYEGLIDIDGI FT RIENGDWWILFRPSGTEPIMRITLEAHEEEKAKELMGKAERLVKKAISEA" XX SQ Sequence 1350 BP; 339 A; 341 C; 417 G; 253 T; 0 other; atggggaagt acttcggaac cagcggaatc agggaagtct ttaatgagaa gctgacacct 60 gagctggctc taaaggtcgg caaagccctt ggaacgtacc tcggcggcgg aaaggttgtt 120 atcgggaagg ataccaggac tagcggcgac gttataaaat cagcagtcat aagcggactt 180 ctctcaactg gtgttgatgt gattgacata ggtttagcgc caacgccgct cacgggcttt 240 gcgataaagc tctacggtgc cgatgctggc gttaccatca cagcttctca caacccgccg 300 gagtacaacg gcataaaggt gtggcaggcc aacggaatgg catacacctc tgagatggag 360 cgtgaactcg agtccataat ggactcaggg aacttcaaaa aagctccctg gaatgagatc 420 gggacgctta gaagggccga ccccagtgag gagtacataa acgcggcgct aaaattcgtc 480 aaacttgaga actcctacac ggtcgtcctc gattctggaa acggtgcggg 
ctcggtggtc 540 tccccctacc tccagcggga gctgggcaat agggttatct cgctcaactc ccacccgagc 600 ggcttcttcg tcagggaact tgagccgaac gcgaagagcc tctccgccct agcgaagacc 660 gttagagtga tgaaagccga cgtcggcata gcccacgacg gcgacgcaga taggatcggc 720 gtcgttgatg atcagggcaa cttcgttgag tacgaggtca tgctctcgct catagcgggc 780 tacatgctga ggaagttcgg gaaggggaaa atagttacca ccgttgatgc gggctttgct 840 ttggacgact acctcagacc ccttggcgga gaagtcataa ggacgcgcgt tggtgatgtg 900 gccgttgccg acgagctcgc aaaacacggc ggcgtcttcg gcggcgagcc gagtggcacg 960 tggataatcc cgcagtggaa cctcaccccc gacggaatct ttgctggggc ccttgttctg 1020 gagatgattg acagactcgg tccgataagc gagctggcca aggaagtccc gcgctacgtg 1080 acgctccgcg ccaaaatccc ctgtccgaac gagaagaagg cgaaagccat ggagataata 1140 gcgcgcgagg cactaaagac gttcgactac gaggggctga tagacataga tggaattagg 1200 atagaaaacg gtgactggtg gatcctcttc cgcccgagcg gaaccgagcc gataatgcgc 1260 ataactttgg aggcccacga ggaagagaag gcgaaggagc tgatggggaa ggcggagagg 1320 ctggttaaga aagccatctc ggaggcctga 1350 // Any ideas? Regards, Vasa From mark.schreiber at group.novartis.com Thu Nov 25 20:04:47 2004 From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com) Date: Thu Nov 25 20:02:40 2004 Subject: [Biojava-l] Re: Biojava query Message-ID: Hi Russell - This is a script I use to see which blast items are treated as what kind of events. Note that the object model doesn't capture everything from a report but you can always extend or write your own listener that gets what you want. Use the EchoBlast program to figure out which events you need to listen for... import org.xml.sax.*; import java.io.*; import org.biojava.bio.program.sax.*; import org.biojava.bio.program.ssbind.*; import org.biojava.bio.search.*; /** *

 * Echoes events from a BLAST-like SAX parser.

 * @author Mark Schreiber
 * @version 1.0
 */
public class BlastEcho {

  public BlastEcho() {
  }

  private void echo(InputSource source) throws IOException, SAXException {
    // make a BlastLikeSAXParser
    BlastLikeSAXParser parser = new BlastLikeSAXParser();
    parser.setModeLazy();

    ContentHandler handler = new SeqSimilarityAdapter();
    SearchContentHandler scHandler = new EchoSCHandler();
    ((SeqSimilarityAdapter) handler).setSearchContentHandler(scHandler);

    parser.setContentHandler(handler);
    parser.parse(source);
  }

  // prints every search event it receives
  private class EchoSCHandler extends SearchContentAdapter {
    public void startHit() {
      System.out.println("startHit()");
    }
    public void endHit() {
      System.out.println("endHit()");
    }
    public void startSubHit() {
      System.out.println("startSubHit()");
    }
    public void endSubHit() {
      System.out.println("endSubHit()");
    }
    public void startSearch() {
      System.out.println("startSearch");
    }
    public void endSearch() {
      System.out.println("endSearch");
    }
    public void addHitProperty(Object key, Object val) {
      System.out.println("\tHitProp:\t" + key + ": " + val);
    }
    public void addSearchProperty(Object key, Object val) {
      System.out.println("\tSearchProp:\t" + key + ": " + val);
    }
    public void addSubHitProperty(Object key, Object val) {
      System.out.println("\tSubHitProp:\t" + key + ": " + val);
    }
  }

  public static void main(String[] args) throws Exception {
    InputSource is = new InputSource(new FileInputStream(args[0]));
    BlastEcho blastEcho = new BlastEcho();
    blastEcho.echo(is);
  }
}

Mark Schreiber
Principal Scientist (Bioinformatics)
Novartis Institute for Tropical Diseases (NITD)
10 Biopolis Road
#05-01 Chromos
Singapore 138670
www.nitd.novartis.com
phone +65 6722 2973
fax +65 6722 2910

"Smithies, Russell"
11/26/2004 08:42 AM
To: Mark Schreiber/GP/Novartis@PH
cc:
Subject: Biojava query

Hi Mark,

Just a quick question about BLAST parsing: how do you get the length of the query sequence with the parser example on BJinA? It's not in the annotations of the SeqSimilaritySearchResult.
That only has databaseId, queryId, program, and version :-(

Russell

From bindu_j2000 at yahoo.com Wed Nov 24 21:54:44 2004
From: bindu_j2000 at yahoo.com (smitha kantipudi)
Date: Sun Nov 28 12:14:37 2004
Subject: [Biojava-l] 3D Dot Matrix for sequence similarity
Message-ID: <20041125025444.79984.qmail@web60003.mail.yahoo.com>

Hi,

Can anyone tell me how to implement a 3D dot matrix for sequence similarity in Java or Perl? For this we have to use sum of pairs, an amino acid substitution matrix, etc.

Thank you in advance.

Smitha.

From voisingreg at yahoo.fr Fri Nov 26 11:45:42 2004
From: voisingreg at yahoo.fr (gregory voisin)
Date: Sun Nov 28 12:14:48 2004
Subject: [Biojava-l] biojava and microArray:::::::::Proget
Message-ID: <20041126164542.48653.qmail@web60409.mail.yahoo.com>

Hi BioJava users and microarrayers,

I'm trying to develop some Java classes to manipulate expression data. For the moment these classes are tailored to my own use, so they are fairly specific.

I'm not a software developer, just a poor bioinformatician, but I would like to start this project (which is far too big a fish for me) and work with several other brains to develop a new, simple package for BioJava.
Thanks for your time and your energy.

May the wind fill your sails and the sun shine on your faces.

VOISIN greg.
Bioinformatician
Centre de recherche du CHUM
MONTREAL

From heuermh at acm.org Mon Nov 29 13:37:15 2004
From: heuermh at acm.org (Michael Heuer)
Date: Mon Nov 29 13:37:34 2004
Subject: [Biojava-l] biojava and microArray:::::::::Proget
In-Reply-To: <20041126164542.48653.qmail@web60409.mail.yahoo.com>
Message-ID:

I'm willing to coordinate efforts to bring gene expression support to biojava. However, I don't think it should be done without proper support for MAGE and the MAGE Ontology, out of respect to those active standards communities.

I've set up a wiki to discuss a biojava-expr library at

> http://hume.ccgb.umn.edu:8668/space/BiojavaExpr

Feel free to register to create an account, then edit that page or create new ones linked from that page with your design considerations, implementation ideas, or feature requirements.

Alternatively, I feel that the biojava developers mailing list is the most appropriate venue for this kind of discussion, and would recommend that anyone interested in contributing to a biojava-expr library subscribe to and post there.

> http://www.biojava.org/mailman/listinfo/biojava-dev

michael

On Fri, 26 Nov 2004, gregory voisin wrote:

> Hi BioJava users and microarrayers,
>
> I'm trying to develop some Java classes to manipulate expression data.
> For the moment these classes are tailored to my own use, so they are
> fairly specific.
>
> I'm not a software developer,
> just a poor bioinformatician, but I would like to start this project
> (which is far too big a fish for me) and work with several other brains
> to develop a new, simple package for BioJava.
>
> Thanks for your time and your energy.
>
> May the wind fill your sails and the sun shine on your faces.
>
> VOISIN greg.
> Bioinformatician
> Centre de recherche du CHUM
> MONTREAL

From Anna.Henricson at cgb.ki.se Tue Nov 30 05:06:19 2004
From: Anna.Henricson at cgb.ki.se (Anna Henricson)
Date: Tue Nov 30 05:04:16 2004
Subject: [Biojava-l] Parsing an EMBL flatfile
Message-ID:

Hi,

I'm new to BioJava and I'm trying to parse an EMBL flatfile. It's especially the information in the CDS section of the feature table that I want to retrieve. I have looked at the examples and tutorials on the BioJava website and tried using FeatureFilter.ByType("CDS"); however, that only gives me the exons to join and not the information that follows, such as protein_id, db_xref, the amino acid sequence, etc. I have also been trying to use the EmblLikeFormat class, EmblProcessor, FeatureTableParser and EmblLikeLocationParser, but I can't quite put it all together.

I would really appreciate some help, since I'm the only one around here who is using BioJava.

Thanks!
/Anna

--------------------------------------------
Anna Henricson, MSc, PhD student
Center for Genomics and Bioinformatics (CGB)
Karolinska Institutet
S-171 77 Stockholm
Sweden
Phone: +46 (0)8 524 87296
Fax: +46 (0)8 337983

From mark.schreiber at group.novartis.com Tue Nov 30 19:49:31 2004
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Tue Nov 30 19:47:20 2004
Subject: [Biojava-l] Parsing an EMBL flatfile
Message-ID:

Hi Anna,

I think that information is probably ending up in an Annotation object. You can use the example TreeView program to find out interactively how a file is parsed and which features and Annotations end up where in the object model (http://www.biojava.org/docs/bj_in_anger/treeView.htm).

Let me know if this doesn't help.

Regards,

Mark Schreiber
Principal Scientist (Bioinformatics)
Novartis Institute for Tropical Diseases (NITD)
10 Biopolis Road
#05-01 Chromos
Singapore 138670
www.nitd.novartis.com
phone +65 6722 2973
fax +65 6722 2910

"Anna Henricson"
Sent by: biojava-l-bounces@portal.open-bio.org
11/30/2004 06:06 PM
To:
cc: (bcc: Mark Schreiber/GP/Novartis)
Subject: [Biojava-l] Parsing an EMBL flatfile

Hi,

I'm new to BioJava and I'm trying to parse an EMBL flatfile. It's especially the information in the CDS section of the feature table that I want to retrieve. I have looked at the examples and tutorials on the BioJava website and tried using FeatureFilter.ByType("CDS"); however, that only gives me the exons to join and not the information that follows, such as protein_id, db_xref, the amino acid sequence, etc. I have also been trying to use the EmblLikeFormat class, EmblProcessor, FeatureTableParser and EmblLikeLocationParser, but I can't quite put it all together.

I would really appreciate some help, since I'm the only one around here who is using BioJava.

Thanks!
/Anna

--------------------------------------------
Anna Henricson, MSc, PhD student
Center for Genomics and Bioinformatics (CGB)
Karolinska Institutet
S-171 77 Stockholm
Sweden
Phone: +46 (0)8 524 87296
Fax: +46 (0)8 337983

_______________________________________________
Biojava-l mailing list - Biojava-l@biojava.org
http://biojava.org/mailman/listinfo/biojava-l

From xingenzhu at yahoo.com.cn Mon Nov 29 12:46:11 2004
From: xingenzhu at yahoo.com.cn (Xingen Zhu)
Date: Mon Dec 6 09:20:48 2004
Subject: [Biojava-l] Parse Genbank file
Message-ID: <20041129174611.19544.qmail@web50601.mail.yahoo.com>

Hi all,

I am a new user of BioJava. I use the following Java program to parse a GenBank file:

import java.util.*;
import java.io.*;
import org.biojava.bio.*;
import org.biojava.bio.symbol.*;
import org.biojava.bio.seq.*;
import org.biojava.bio.seq.io.*;

public class biojava {
  public static void main(String[] args) {
    try {
      File genbankFile = new File("e:\\java\\gb.txt");
      BufferedReader gReader = new BufferedReader(
          new InputStreamReader(new FileInputStream(genbankFile)));
      GenbankFormat gFormat = new GenbankFormat();
      Alphabet alpha = DNATools.getDNA();
    } catch (Throwable t) {
      t.printStackTrace();
      System.exit(1);
    }
  }
}

This program compiles but does not run. The error message is

java.lang.NoClassDefFoundError

If I delete the following line

Alphabet alpha = DNATools.getDNA();

it compiles and runs.

Any ideas? Thanks a lot.

Michael
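
A NoClassDefFoundError means that a class which was available at compile time could not be loaded or initialized at run time, so one way to narrow the problem down is to probe the run-time classpath directly. The following is only a minimal sketch under stated assumptions: ClasspathProbe and isLoadable are hypothetical names introduced here, the two class names are the ones implied by the imports in the program above, and nothing in this thread confirms what the actual cause was.

```java
public class ClasspathProbe {

  /**
   * Returns true if the named class can be loaded and initialized
   * with the classpath this JVM was started with.
   */
  public static boolean isLoadable(String className) {
    try {
      Class.forName(className);
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      // NoClassDefFoundError and ExceptionInInitializerError are LinkageErrors
      System.out.println(className + " -> " + e);
      return false;
    }
  }

  public static void main(String[] args) {
    // Assumed class names, taken from the imports in the program above;
    // pass other names on the command line to probe those instead.
    String[] names = args.length > 0 ? args : new String[] {
      "org.biojava.bio.seq.io.GenbankFormat",
      "org.biojava.bio.seq.DNATools"
    };
    for (String name : names) {
      System.out.println(name + ": " + (isLoadable(name) ? "ok" : "NOT loadable"));
    }
    // Show what the JVM actually searched
    System.out.println("java.class.path = " + System.getProperty("java.class.path"));
  }
}
```

Run it with exactly the same classpath as the failing program. If a class is reported as not loadable, the printed exception names the class that is really missing, which may be a class from another jar that the BioJava class depends on rather than the BioJava class itself.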