From dlampkin at xencor.com Mon Jan 5 13:02:38 2004
From: dlampkin at xencor.com (DeAngelo Lampkin)
Date: Mon Jan 5 13:09:46 2004
Subject: [Biojava-l] Blast Version
Message-ID:

Hi George,

I've seen this before. Check out the following link for the solution.
Basically, you have to set the parser to lazy mode.

http://www.biojava.org/pipermail/biojava-l/2003-July/003990.html

DeAngelo

-----Original Message-----
From: biojava-l-bounces@portal.open-bio.org
[mailto:biojava-l-bounces@portal.open-bio.org] On Behalf Of Y D Sun
Sent: Tuesday, December 23, 2003 9:38 AM
To: biojava-l@biojava.org
Subject: [Biojava-l] Blast Version

Hi,

What blast version does BlastLikeSAXParser support? I encounter the
following error when running the sample code of BLAST Result Parser in
Biojava In Anger to parse blast 2.2.5 output:

org.xml.sax.SAXException: Program ncbi-blastp Version 2.2.5 is not
supported by the biojava blast-like parsing framework
    at org.biojava.bio.program.sax.BlastLikeSAXParser.interpret(BlastLikeSAXParser.java:241)
    at org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java:160)

Thanks.
George
_______________________________________________
Biojava-l mailing list - Biojava-l@biojava.org
http://biojava.org/mailman/listinfo/biojava-l

From srinandakumar at yahoo.com Tue Jan 6 06:33:55 2004
From: srinandakumar at yahoo.com (nandakumar sridharan)
Date: Tue Jan 6 08:41:57 2004
Subject: [Biojava-l] biojava doubts and problems
Message-ID: <20040106113355.80521.qmail@web60501.mail.yahoo.com>

Are any reference books available for the biojava docs and tutorials?

GCContent.java gives the exception "usage: java GCContent filename.fa".
How do I solve it?

From david.huen at ntlworld.com Tue Jan 6 08:56:45 2004
From: david.huen at ntlworld.com (David Huen)
Date: Tue Jan 6 09:03:51 2004
Subject: [Biojava-l] biojava doubts and problems
In-Reply-To: <20040106113355.80521.qmail@web60501.mail.yahoo.com>
References: <20040106113355.80521.qmail@web60501.mail.yahoo.com>
Message-ID: <200401061356.46203.david.huen@ntlworld.com>

On Tuesday 06 Jan 2004 11:33 am, nandakumar sridharan wrote:
> Are any reference books available for the biojava docs and tutorials?

Please look at www.biojava.org for some material. Follow the link there
to "Biojava In Anger" for further useful cookbook-style materials.

> GCContent.java gives the exception "usage: java GCContent filename.fa".
> How do I solve it?

The above is not an exception but a Unix-style way of telling you the
command-line format for invoking the code. In this case, it appears to
be saying you need to provide it with a file in FASTA format:

java GCContent

Regards,
David Huen

From benjamins at Biomax.de Wed Jan 7 03:47:11 2004
From: benjamins at Biomax.de (Benjamin Schuster-Boeckler)
Date: Wed Jan 7 03:54:14 2004
Subject: [Biojava-l] Is there a way to determine the query within a FeatureFilter?
Message-ID: <3FFBC78F.4090900@Biomax.de>

As I understand the principle, a FeatureFilter is used to test one
Feature after another through its accept(Feature f) method. My problem
is that this is _far_ too slow. I need to check millions of Features, of
which only a few will be selected. So what I want to do is get the
"rule" that the FeatureFilter uses and translate it into an SQL query so
I can get the right Features from the database straight away.
Unfortunately, I don't see how this could be possible.
Thanks in advance,
Greetings,
Benjamin Schuster-Böckler

From benjamins at Biomax.de Wed Jan 7 04:25:23 2004
From: benjamins at Biomax.de (Benjamin Schuster-Boeckler)
Date: Wed Jan 7 04:32:26 2004
Subject: [Biojava-l] Is there a way to determine the query within a FeatureFilter?
In-Reply-To: <3FFBC78F.4090900@Biomax.de>
References: <3FFBC78F.4090900@Biomax.de>
Message-ID: <3FFBD083.5080501@Biomax.de>

Benjamin Schuster-Boeckler wrote:
> So what I want to do is get the "rule" that the FeatureFilter uses
> and translate it into an SQL query so I can get the right Features
> from the database straight away. Unfortunately, I don't see how this
> could be possible.

Ah, toString() could do it, am I right? I think it returns a neat
pattern that I could parse token by token...

From thomas at derkholm.net Wed Jan 7 08:45:08 2004
From: thomas at derkholm.net (Thomas Down)
Date: Wed Jan 7 08:02:39 2004
Subject: [Biojava-l] Is there a way to determine the query within a FeatureFilter?
In-Reply-To: <3FFBD083.5080501@Biomax.de>
References: <3FFBC78F.4090900@Biomax.de> <3FFBD083.5080501@Biomax.de>
Message-ID: <20040107134508.GC23814@firechild>

Once upon a time, Benjamin Schuster-Boeckler wrote:
> Ah, toString() could do it, am I right? I think it returns a neat
> pattern that I could parse token by token...

You certainly could do this if you wanted to. However, I would
generally advise against it. If you look at all the FeatureFilter
implementations in BioJava, you'll find that they all have accessor
methods for getting at their parameters. So while it's possible
to convert a filter to text and then parse it again, you'd do better
by looking at filters as parse trees, ready to analyse directly.

You'll find code which manipulates FeatureFilters in this way
throughout the BioJava code base. One bit which might be
particularly relevant to you is the (private) method sqlizeFilter
in BioSQLSequenceDB. It's pretty simplistic, but might give you
some ideas.

One thing to remember is that you don't have to translate
every part of a complex filter to SQL: just do the easy/important
bits, then use the FeatureFilter.accept methods to apply the parts
of the filter your translator doesn't understand, and you should
still get correct results with good performance.

In the past, David Huen has talked about some more general
solutions for FeatureFilter -> SQL translation, so he might
want to add something here,

Thomas.

From benjamins at Biomax.de Fri Jan 9 05:30:05 2004
From: benjamins at Biomax.de (Benjamin Schuster-Boeckler)
Date: Fri Jan 9 05:37:04 2004
Subject: [Biojava-l] I get an IndexOutOfBoundsException with visitFilter in WalkerFactory
Message-ID: <3FFE82AD.7030000@Biomax.de>

Hi.

I'm trying to handle the nodes of the filter tree with the visitFilter
method now.
To do so, I wrote a FilterHandler class that looks like:

--------------------------- snip -----------------------------
public final class FilterHandler implements org.biojava.utils.walker.Visitor {
    public String and( FeatureFilter.And ffa, String ch1, String ch2 ) {
        return ch1 + " AND " + ch2;
    }

    public String or( FeatureFilter.Or ffo, String ch1, String ch2 ) {
        return ch1 + " OR " + ch2;
    }

    public String not( FeatureFilter.Not ffn, String ch1 ) {
        return "NOT " + ch1;
    }

    public String byAncestor( FeatureFilter.ByAncestor ffb, String ch1) {
        return "BYANCESTOR" + ch1;
    }

    public String byClass( FeatureFilter.ByClass ffc ) {
        return ffc.getTestClass().getName();
    }

    public String byType( FeatureFilter.ByType fft ) {
        return fft.getType();
    }

    public String overlapsLocation( FeatureFilter.OverlapsLocation ffo ) {
        return "["+ffo.getLocation().getMin()+", "+ffo.getLocation().getMax()+"]";
    }
}
--------------------------- snap -----------------------------

Of course, this is just a test for now; later these methods should do
something more useful ;-)

Now, what I get is

--------------------------- snip -----------------------------
java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
    at java.util.ArrayList.RangeCheck(ArrayList.java:507)
    at java.util.ArrayList.get(ArrayList.java:324)
    at org.biojava.utils.walker.WalkerFactory.generateWalker(WalkerFactory.java:299)
    at org.biojava.utils.walker.WalkerFactory.getWalker(WalkerFactory.java:61)
    at org.biojava.bio.seq.FilterUtils.visitFilter(FilterUtils.java:988)
    at com.biomax.pedant3.das.ContigSequence.filter(ContigSequence.java:132)
--------------------------- snap -----------------------------

The filter that was passed to visitFilter was

Not(ByAncestor(ByClass(org.biojava.bio.seq.ComponentFeature))) <- gathered by toString()

but from debugging I found out that the error happened while
evaluating the "and" method in WalkerFactory.getWalker.
In line 299, the getWalker method tries to get as many elements out of
the vector "wrappedLVs" as there are parameters to the handler method,
but wrappedLVs has only 1 element, which is null, I think. How can this
be?!

Greetings,
Benjamin

From benjamins at Biomax.de Fri Jan 9 05:37:47 2004
From: benjamins at Biomax.de (Benjamin Schuster-Boeckler)
Date: Fri Jan 9 05:44:47 2004
Subject: [Biojava-l] I get an IndexOutOfBoundsException with visitFilter in WalkerFactory
In-Reply-To: <3FFE82AD.7030000@Biomax.de>
References: <3FFE82AD.7030000@Biomax.de>
Message-ID: <3FFE847B.4050900@Biomax.de>

Benjamin Schuster-Boeckler wrote:
> but from debugging I found out that the error happened while
> evaluating the "and" method in WalkerFactory.getWalker. In line 299,
> the getWalker method tries to get as many elements out of the vector
> "wrappedLVs" as there are parameters to the handler-method, but
> wrappedLVs only has 1 element, which is null I think. How can this be?!

Of course, it's generateWalker instead of getWalker!
Best regards,
Ben

From daviddebeule at pandora.be Sat Jan 10 11:38:53 2004
From: daviddebeule at pandora.be (david de beule)
Date: Sat Jan 10 11:45:53 2004
Subject: [Biojava-l] seqString() produces stacktrace
Message-ID: <001f01c3d798$3bedb1e0$f416a451@davidpc>

Hi all,

This piece of code:

    Alphabet dna1 = DNATools.getDNA();
    SymbolTokenization dnaToke1 = dna1.getTokenization("token");
    SymbolList symbolList = new SimpleSymbolList(dnaToke1, "ACTGGACCTAAGG");
    Sequence sequence = new SimpleSequence(symbolList, "test", "test", null);
    SimpleGappedSequence gappedSequence = new SimpleGappedSequence(sequence);

    gappedSequence.addGapsInView(4, 4);
    gappedSequence.removeGap(7);
    gappedSequence.removeGaps(4, 3);
    gappedSequence.addGapsInView(7, 2);
    gappedSequence.addGapsInView(9, 3);
    gappedSequence.addGapsInView(12, 2);
    gappedSequence.addGapsInView(14, 3);
    gappedSequence.addGapsInView(17, 2);
    gappedSequence.removeGap(18);
    gappedSequence.removeGaps(11, 6);

    System.out.println(gappedSequence.seqString());

breaks on the seqString() call and produces the following stack trace:

java.lang.ArrayIndexOutOfBoundsException: 13
    at org.biojava.bio.symbol.SimpleSymbolList.symbolAt(SimpleSymbolList.java:271)
    at org.biojava.bio.seq.impl.SimpleSequence.symbolAt(SimpleSequence.java:120)
    at org.biojava.bio.symbol.SimpleGappedSymbolList.symbolAt(SimpleGappedSymbolList.java:508)
    at org.biojava.bio.symbol.AbstractSymbolList$SymbolIterator.next(AbstractSymbolList.java:201)
    at org.biojava.bio.seq.io.CharacterTokenization.tokenizeSymbolList(CharacterTokenization.java:211)
    at org.biojava.bio.symbol.AlphabetManager$WellKnownTokenizationWrapper.tokenizeSymbolList(AlphabetManager.java:1383)
    at org.biojava.bio.symbol.AbstractSymbolList.seqString(AbstractSymbolList.java:102)

Can this be a bug?
Any help would be appreciated,

David De Beule

From getksn at rediffmail.com Sat Jan 10 10:00:36 2004
From: getksn at rediffmail.com (karla suri nath)
Date: Sat Jan 10 11:56:03 2004
Subject: [Biojava-l] wanted info !
Message-ID: <20040110150036.15072.qmail@webmail8.rediffmail.com>

Hello,

Can you please help me by sending info regarding the development of an
application of my own that will help chemists draw chemical structures
and also read the structures that are drawn?

From mark.schreiber at group.novartis.com Sun Jan 11 19:57:06 2004
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Sun Jan 11 20:01:05 2004
Subject: [Biojava-l] wanted info !
Message-ID:

Hi -

BioJava currently doesn't have support for chemical drawing. I would
recommend looking at the Java3D API from Sun, which has great support
for drawing things like molecules in 3D.

Of course, if you want to contribute some chem drawing code, it would
be most appreciated.

- Mark

Mark Schreiber
Principal Scientist (Bioinformatics)

Novartis Institute for Tropical Diseases (NITD)
1 Science Park Road
#04-14 The Capricorn
Singapore 117528

phone +65 6722 2973
fax +65 6722 2910

From mark.schreiber at group.novartis.com Sun Jan 11 21:57:40 2004
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Sun Jan 11 22:01:38 2004
Subject: [Biojava-l] seqString() produces stacktrace
Message-ID:

I've done some investigation and the offending line is:

    gappedSequence.removeGaps(11, 6);

This seems to be a real bug (at least in BJ1.3.1). Confusingly, if you
change the 6 to any other legal value, everything works as expected.
Matthew, do you know what's going on here?

- Mark

From matthew_pocock at yahoo.co.uk Mon Jan 12 12:41:10 2004
From: matthew_pocock at yahoo.co.uk (Matthew Pocock)
Date: Mon Jan 12 12:52:18 2004
Subject: [Biojava-l] seqString() produces stacktrace
In-Reply-To:
References:
Message-ID: <4002DC36.3020106@yahoo.co.uk>

I've fixed the symptom: at approx. line 285 of SimpleGappedSymbolList
we were renumbering the ungapped blocks before measuring the number of
gaps in the block of gaps being edited. I've added lots more
error-checking code to this class. The code is in CVS. Does this now
give the result you expected, or have I introduced a bug in removing
the error?

Matthew

From matthew_pocock at yahoo.co.uk Mon Jan 12 12:47:00 2004
From: matthew_pocock at yahoo.co.uk (Matthew Pocock)
Date: Mon Jan 12 12:58:07 2004
Subject: [Biojava-l] Is there a way to determine the query within a FeatureFilter?
In-Reply-To: <20040107134508.GC23814@firechild>
References: <3FFBC78F.4090900@Biomax.de> <3FFBD083.5080501@Biomax.de> <20040107134508.GC23814@firechild>
Message-ID: <4002DD94.4060109@yahoo.co.uk>

In addition to what Thomas said, you may also want to look at the
org.biojava.utils.walker package and its use in the FilterUtils code
(around line 986). This is a framework we developed to make it trivial
to walk a filter expression, possibly building an SQL query as you go.

Matthew

From eric_bellard at yahoo.com Tue Jan 13 08:35:06 2004
From: eric_bellard at yahoo.com (Eric BELLARD)
Date: Tue Jan 13 08:41:55 2004
Subject: [Biojava-l] how to calculate consensus from a fasta file
Message-ID: <20040113133506.52143.qmail@web41509.mail.yahoo.com>

Hi,

I'd like to first thank you all for your great job on this project.

I'm using biojava in a project to store some sequencing results.

In my application the user uploads sequences from a fasta file, and I'd
like to build an alignment from it.

With your project, I can easily parse the fasta file and get all the
sequences.

Let's consider the sequences as lines. I'd like to calculate the column
consensus using the DNA degenerate alphabet.

Does biojava implement a way to do this?

Thanks in advance.
Eric

From mark.schreiber at group.novartis.com Tue Jan 13 20:01:53 2004
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Tue Jan 13 20:05:45 2004
Subject: [Biojava-l] how to calculate consensus from a fasta file
Message-ID:

Hi Eric -

I'm not sure if this will solve your problem, but you could make an
Alignment object from the sequences and then use the methods of
DistributionTools to get a Distribution object for each position in the
Alignment. These distributions will tell you the frequency of each base
at each position in the Alignment, which you could use to make a
consensus. You can also use DistributionTools to calculate information
or entropy at each position.

Alternatively, you could generate a Markov model that represents the
alignment and probabilistically represents the consensus.

Hope this helps

Mark

From eric_bellard at yahoo.com Wed Jan 14 04:03:05 2004
From: eric_bellard at yahoo.com (Eric BELLARD)
Date: Wed Jan 14 04:09:53 2004
Subject: [Biojava-l] how to calculate consensus from a fasta file
In-Reply-To:
Message-ID: <20040114090306.57723.qmail@web41506.mail.yahoo.com>

Thanks for your response.

My problem is easier than you thought. I simply have to calculate the
ambiguity symbol for each column.

My solution is:
- create a list with a set of symbols for each column
- fill the set with each symbol of each sequence
- calculate the ambiguity symbol for each set of this list

It works pretty well, but if the sequences become too long I imagine
I'll use too much memory.

I'll try to find another solution using the alignment object in the
framework. At the moment I don't know the framework well enough to find
a solution of this kind with it. I'll try...

Anyway, thanks for your help.

Eric

From benjamins at Biomax.de Wed Jan 14 04:18:41 2004
From: benjamins at Biomax.de (Benjamin Schuster-Boeckler)
Date: Wed Jan 14 04:25:30 2004
Subject: [Biojava-l] Non-Deterministic behaviour of the FeatureFilter-Walker !?
Message-ID: <40050971.6070009@Biomax.de>

I'm witnessing a _very_ weird problem executing my code. I have a class
called FilterHandler that implements org.biojava.utils.walker.Visitor
and is used in conjunction with FilterUtils.visitFilter().
I added some debugging output to each handling method just to see
what's going on. So normally I get this:

------------------------------ snip ------------------------------
DEBUG das.ContigSequence - filter - And(And(ByType(orf) , Not(ByAncestor(ByClass(org.biojava.bio.seq.ComponentFeature)))) , Overlaps([1,1000000])), recurse=true
DEBUG das.FilterHandler - byType - orf
DEBUG das.FilterHandler - byClass - org.biojava.bio.seq.ComponentFeature
DEBUG das.FilterHandler - byAncestor - ByAncestor
DEBUG das.FilterHandler - not - Not
DEBUG das.FilterHandler - and - And
DEBUG das.FilterHandler - overlapsLocation - [1, 1000000]
DEBUG das.FilterHandler - and - And
------------------------------ snap ------------------------------

This is fine. The first line is simply filter.toString() before I pass
the filter to the visitFilter method.

Now for some reason, and I can't tell when exactly, I get this:

------------------------------ snip ------------------------------
DEBUG das.ContigSequence - filter - And(And(ByType(orf) , Not(ByAncestor(ByClass(org.biojava.bio.seq.ComponentFeature)))) , Overlaps([1,1000000])), recurse=true
WARN das.FilterHandler - and - Seems like some Filter was not evaluated!
DEBUG das.FilterHandler - and - And
------------------------------ snap ------------------------------

The warning is of course my own code; it just triggers if And was
called without any other filters evaluated beforehand. It seems like
the filter is not being processed: just the outer 'and' is visited, but
nothing else.

Any clue how this can be? I don't seem to get any error messages,
either. I'll attach the handler class at the end of this message.
Thanks in advance, Benjamin ------------------------------ snip ------------------------------ public final class FilterHandler implements org.biojava.utils.walker.Visitor { private Logger logger; //the log4j-logger to write to private Stack query; //keep track of recent calls private List types; //in case we have ByType queries private Location boundaries; private boolean isComponentQuery; public FilterHandler() { this.logger = Logger.getLogger(FilterHandler.class); //tell logger the name of this class query = new Stack(); //the query is still empty" types = new Vector(); //no types to look for, yet... isComponentQuery = false; //we don't look for structural features, do we? } public void and( FeatureFilter.And ffa ) { try{ query.pop(); query.pop(); } catch(EmptyStackException e) { logger.warn("Seems like some Filter was not evaluated!", e); } query.push("and"); logger.debug("And"); } public void or( FeatureFilter.Or ffo ) { try{ query.pop(); query.pop(); } catch(EmptyStackException e) { logger.warn("Seems like some Filter was not evaluated!", e); } query.push("or"); logger.debug("Or"); } public void not( FeatureFilter.Not ffn ) { try { if((String)query.pop() == "byAncestor") if((String)query.pop() == "byClass") if(isComponentQuery) isComponentQuery = false; } catch(EmptyStackException e) { logger.warn("This was not the usual case of 'not'!", e); } query.push("not"); logger.debug("Not"); } public void byAncestor( FeatureFilter.ByAncestor ffb ) { query.push("byAncestor"); logger.debug("ByAncestor"); } public void byClass( FeatureFilter.ByClass ffc ) { query.push("byClass"); if(ffc.getTestClass().getName().indexOf("ComponentFeature")>=0) isComponentQuery = true; logger.debug(ffc.getTestClass().getName()); } public void byType( FeatureFilter.ByType fft ) { query.push("byType"); types.add(fft.getType()); logger.debug(fft.getType()); } public void overlapsLocation( FeatureFilter.OverlapsLocation ffo ) { query.push("overlapsLocation"); boundaries = ffo.getLocation(); 
        logger.debug("["+ffo.getLocation().getMin()+", "+ffo.getLocation().getMax()+"]");
    }
------------------------------ snap ------------------------------

From eric_bellard at yahoo.com Wed Jan 14 05:40:44 2004
From: eric_bellard at yahoo.com (Eric BELLARD)
Date: Wed Jan 14 05:47:31 2004
Subject: [Biojava-l] ClassCastException during ambiguity calcul
Message-ID: <20040114104044.13010.qmail@web41509.mail.yahoo.com>

Hi,

I'm calculating the ambiguity of symbol sets and I get a ClassCastException. I've managed to isolate the bug. The following code triggers the exception:

Set lc_set = new HashSet();
SymbolList lc_sequence = DNATools.createDNA("AG");
Iterator lc_i = lc_sequence.iterator();
while (lc_i.hasNext()) {
    lc_set.add(lc_i.next());
}
lc_set.add(DNATools.getDNA().getGapSymbol());
try {
    Symbol lc_symbol = DNATools.getDNA().getAmbiguity(lc_set);
    System.out.println(lc_symbol.getName());
} catch (IllegalSymbolException e) {
    throw e;
}

If I don't add the gap symbol to the set, I don't get the exception. Even if an exception must be thrown when this code runs, I don't think a ClassCastException is the right one. Does someone have a clue? Is it a bug? Do you want me to propose a fix? I think I have good Java skills but the worst biological skills you've ever seen :-)

Regards,
Eric

From mark.schreiber at group.novartis.com Wed Jan 14 20:54:58 2004
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Wed Jan 14 20:58:50 2004
Subject: [Biojava-l] ClassCastException during ambiguity calcul
Message-ID:

Hi Eric -

I think the problem is caused by trying to determine the ambiguity for a set that contains the gap symbol. In reality there is no IUPAC ambiguity code for anything paired with a gap.
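To illustrate the point about IUPAC codes, here is a small stand-alone sketch that does not use BioJava. It looks up the standard IUPAC nucleotide ambiguity code for a set of bases and returns nothing for a set that includes a gap; the class `IupacAmbiguitySketch` and the method `ambiguityFor` are hypothetical names chosen for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.TreeSet;

// Hypothetical lookup table for IUPAC nucleotide ambiguity codes (not BioJava code).
class IupacAmbiguitySketch {
    private static final Map<String, Character> CODES = new HashMap<>();
    static {
        // Keys are the member bases in alphabetical order.
        CODES.put("A", 'A');   CODES.put("C", 'C');
        CODES.put("G", 'G');   CODES.put("T", 'T');
        CODES.put("AG", 'R');  CODES.put("CT", 'Y');
        CODES.put("CG", 'S');  CODES.put("AT", 'W');
        CODES.put("GT", 'K');  CODES.put("AC", 'M');
        CODES.put("CGT", 'B'); CODES.put("AGT", 'D');
        CODES.put("ACT", 'H'); CODES.put("ACG", 'V');
        CODES.put("ACGT", 'N');
    }

    // Returns the ambiguity code for the given symbols, or empty if none exists.
    static Optional<Character> ambiguityFor(String symbols) {
        TreeSet<Character> sorted = new TreeSet<>();
        for (char c : symbols.toUpperCase().toCharArray()) {
            sorted.add(c);
        }
        StringBuilder key = new StringBuilder();
        for (char c : sorted) {
            key.append(c);
        }
        return Optional.ofNullable(CODES.get(key.toString()));
    }

    public static void main(String[] args) {
        System.out.println("A,G   -> " + ambiguityFor("AG"));
        System.out.println("A,G,- -> " + ambiguityFor("AG-"));
    }
}
```

In this sketch the set {A, G} maps to R, while {A, G, gap} has no entry, which is the situation the isolated test case runs into.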
An ambiguity is like saying "either this nucleotide or that one" (as in an ambiguous DNA sequence read). It can't really cover the concept of "either this nucleotide or a gap".

- Mark

From immunoguest at hotmail.com Sun Jan 18 19:44:57 2004
From: immunoguest at hotmail.com (tai kwan do)
Date: Sun Jan 18 19:51:38 2004
Subject: [Biojava-l] Blast XML output question
Message-ID:

Hello,

I'm seeing a difference in the data output by stand-alone BLAST and online BLAST. The identities values differ between the XML output and the text output. The other difference I see is in the query and hit sequences.
I've included below the outputs using the same input parameters, does anyone know why this is the case? gb|AE000111.1|AE000111 Escherichia coli K-12 MG1655 section 1 of 400 of the complete genome Length = 10596 Score = 589 bits (297), Expect = e-168 Identities = 315/324 (97%) Strand = Plus / Plus Query: 237 aggtaacggtgcgggctgacgcgtacaggaaacacagaaaaaagcccgcacctgacagtg 296 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 237 aggtaacggtgcgggctgacgcgtacaggaaacacagaaaaaagcccgcacctgacagtg 296 Query: 297 cgggcnnnnnnnnncgaccaaaggtaacgaggtaacaaccatgcgagtgttgaagttcgg 356 ||||| |||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 297 cgggctttttttttcgaccaaaggtaacgaggtaacaaccatgcgagtgttgaagttcgg 356 Query: 357 cggtacatcagtggcaaatgcagaacgttttctgcgtgttgccgatattctggaaagcaa 416 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 357 cggtacatcagtggcaaatgcagaacgttttctgcgtgttgccgatattctggaaagcaa 416 Query: 417 tgccaggcaggggcaggtggccaccgtcctctctgcccccgccaaaatcaccaaccacct 476 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 417 tgccaggcaggggcaggtggccaccgtcctctctgcccccgccaaaatcaccaaccacct 476 Query: 477 ggtggcgatgattgaaaaaaccattagcggccaggatgctttacccaatatcagcgatgc 536 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct: 477 ggtggcgatgattgaaaaaaccattagcggccaggatgctttacccaatatcagcgatgc 536 Query: 537 cgaacgtatttttgccgaactttt 560 |||||||||||||||||||||||| Sbjct: 537 cgaacgtatttttgccgaactttt 560 1 gi|1786181|gb|AE000111.1|AE000111 Escherichia coli K-12 MG1655 section 1 of 400 of the complete genome AE000111 10596 1 589.253 297 1.04898e-168 237 560 237 560 1 1 324 324 324 AGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT 
AGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| _________________________________________________________________ Rethink your business approach for the new year with the helpful tips here. http://special.msn.com/bcentral/prep04.armx From tkirsten at izbi.uni-leipzig.de Fri Jan 23 10:35:47 2004 From: tkirsten at izbi.uni-leipzig.de (Toralf Kirsten) Date: Fri Jan 23 10:40:54 2004 Subject: [Biojava-l] GenBank XML File Parse Error Message-ID: <40113F53.8020407@izbi.uni-leipzig.de> Hi, I have to extract data from the GenBank XML files. For this purpose I use the biojava API. But I get a parser error. 
java.lang.StringIndexOutOfBoundsException: String index out of range: 12
    at java.lang.String.substring(String.java:1477)
    at org.biojava.bio.seq.io.GenbankContext.processHeaderLine(GenbankContext.java:621)
    at org.biojava.bio.seq.io.GenbankContext.processLine(GenbankContext.java:263)
    at org.biojava.bio.seq.io.GenbankFormat.readSequence(GenbankFormat.java:144)
    at org.biojava.bio.seq.io.StreamReader.nextSequence(StreamReader.java:100)
rethrown as org.biojava.bio.BioException: Could not read sequence
    at org.biojava.bio.seq.io.StreamReader.nextSequence(StreamReader.java:103)
    at de.izbi.gbm.logistics.GenBankBioJavaImporter.readFile(GenBankBioJavaImporter.java:41)
    at de.izbi.gbm.gui.GenBankBaseFrame.actionPerformed(GenBankBaseFrame.java:134)
    at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:1764)
    at javax.swing.AbstractButton$ForwardActionEvents.actionPerformed(AbstractButton.java:1817)
    at javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:419)
    at javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:257)
    at javax.swing.AbstractButton.doClick(AbstractButton.java:289)
    at javax.swing.plaf.basic.BasicMenuItemUI.doClick(BasicMenuItemUI.java:1109)
    at javax.swing.plaf.basic.BasicMenuItemUI$MouseInputHandler.mouseReleased(BasicMenuItemUI.java:943)
    at java.awt.Component.processMouseEvent(Component.java:5093)
    at java.awt.Component.processEvent(Component.java:4890)
    at java.awt.Container.processEvent(Container.java:1566)
    at java.awt.Component.dispatchEventImpl(Component.java:3598)
    at java.awt.Container.dispatchEventImpl(Container.java:1623)
    at java.awt.Component.dispatchEvent(Component.java:3439)
    at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:3450)
    at java.awt.LightweightDispatcher.processMouseEvent(Container.java:3165)
    at java.awt.LightweightDispatcher.dispatchEvent(Container.java:3095)
    at java.awt.Container.dispatchEventImpl(Container.java:1609)
    at java.awt.Window.dispatchEventImpl(Window.java:1585)
    at java.awt.Component.dispatchEvent(Component.java:3439)
    at java.awt.EventQueue.dispatchEvent(EventQueue.java:450)
    at java.awt.EventDispatchThread.pumpOneEventForHierarchy(EventDispatchThread.java:197)
    at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:150)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:144)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:136)
    at java.awt.EventDispatchThread.run(EventDispatchThread.java:99)

The program is quite simple. The user specifies the path and file name with the FileChooser component. Then I open the file and use the Sequence and Annotation classes, as shown in the attached method taken from an extended file class.

What I need are the sequence data of the GenBank entry (accession, sequence etc.) and also its features (start and end position, subtype like t-RNA, cds etc.).

Any hints are welcome.
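Before handing a file to a flat-file parser, it can help to check what kind of file it actually is. The sketch below is a stand-alone pre-check and not part of BioJava: it relies on GenBank flat-file records starting with a LOCUS line, and it assumes (as a heuristic only) that anything whose first non-blank line starts with `<` is XML. The class `GenbankFormatSniffSketch` and the method `sniff` are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical pre-check for GenBank input (not BioJava code).
class GenbankFormatSniffSketch {

    // Looks at the first non-blank line and rewinds the reader afterwards.
    // Assumption: that line lies within the first 8192 characters.
    static String sniff(BufferedReader in) {
        try {
            in.mark(8192);
            String result = "unknown";
            String line;
            while ((line = in.readLine()) != null) {
                String trimmed = line.trim();
                if (trimmed.isEmpty()) {
                    continue;                 // skip leading blank lines
                }
                if (trimmed.startsWith("LOCUS")) {
                    result = "flatfile";      // GenBank flat-file record header
                } else if (trimmed.startsWith("<")) {
                    result = "xml";           // heuristic: looks like markup
                }
                break;
            }
            in.reset();                       // leave the stream untouched for the parser
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        BufferedReader flat = new BufferedReader(
                new StringReader("LOCUS       EXAMPLE\nDEFINITION  example record\n"));
        BufferedReader xml = new BufferedReader(
                new StringReader("<?xml version=\"1.0\"?>\n<Example/>\n"));
        System.out.println(sniff(flat));
        System.out.println(sniff(xml));
    }
}
```

A caller could refuse to start the flat-file parser, or show a clearer message in the GUI, whenever the result is not `flatfile`.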
Thanks
Tori

---------------------
public GenBankBioJavaImporter(String path, String fileName, Connection genDbCon) {
    super();
    super.setPath(path);
    super.setFileName(fileName);
}

public boolean readFile() {
    if (!super.createInputFile()) return(false);
    // read the GenBank file; fileReaderHandler is a BufferedReader
    SequenceIterator sequences = SeqIOTools.readGenbank(super.fileReaderHandler);
    // iterate through the sequences
    while(sequences.hasNext()) {
        try {
            Sequence seq = sequences.nextSequence();
            // do stuff with the sequence
            System.out.println("Info: "+seq.getName()+", "+seq.getURN()+", "+seq.countFeatures());
            Annotation anno = seq.getAnnotation();
            //anno.getProperty()
        } catch (BioException ex) {
            // not in GenBank format
            ex.printStackTrace();
            super.closeInputFile();
            return(false);
        } catch (NoSuchElementException ex) {
            // request for more sequence when there isn't any
            ex.printStackTrace();
            super.closeInputFile();
            return(false);
        }
    }
    super.closeInputFile();
    return(true);
}

From thomas at derkholm.net Fri Jan 23 12:01:51 2004
From: thomas at derkholm.net (Thomas Down)
Date: Fri Jan 23 12:10:21 2004
Subject: [Biojava-l] GenBank XML File Parse Error
In-Reply-To: <40113F53.8020407@izbi.uni-leipzig.de>
References: <40113F53.8020407@izbi.uni-leipzig.de>
Message-ID: <20040123170151.GA5148@firechild>

Once upon a time, Toralf Kirsten wrote:
> Hi,
> I have to extract data from the GenBank XML files.
> For this purpose I use the biojava API. But I get a parser error.
>
> java.lang.StringIndexOutOfBoundsException: String index out of range: 12
> at java.lang.String.substring(String.java:1477)
> at org.biojava.bio.seq.io.GenbankContext.processHeaderLine
> (GenbankContext.java:621)
> [snip]
>
>
> The program is just simple. The user specifies path and file name by the
> FileChooser component. Then I open the file and apply the Sequence and
> Annotation classes as visible in the attached method taken from a extended
> file class.
>
> What I need are the sequence data of the GenBank entry (accession,
> sequence etc.)
> and also for its features (start, end position, subtype like t-RNA, cds
> etc.)

I'm afraid that BioJava doesn't currently support the XML version of GenBank records. The GenBank parser you are using expects the normal flatfile version of the GenBank records -- do you have access to these?

We should probably look at adding GenBank XML support to BioJava. Does anyone know how widely it's used? (I must admit I haven't met it before.)

Thomas.

From tkirsten at izbi.uni-leipzig.de Fri Jan 23 12:17:05 2004
From: tkirsten at izbi.uni-leipzig.de (Toralf Kirsten)
Date: Fri Jan 23 22:17:03 2004
Subject: [Biojava-l] GenBank XML File Parse Error
In-Reply-To: <20040123170151.GA5148@firechild>
References: <40113F53.8020407@izbi.uni-leipzig.de> <20040123170151.GA5148@firechild>
Message-ID: <40115711.5080000@izbi.uni-leipzig.de>

Thomas,

thanks for your answer. The plain ASCII text, or normal flat file as you called it, can be downloaded from the NCBI web page, so there is no problem using it. But we would like to use the XML file, because each term is accessible there at an atomic level.

Thanks again.
Toralf

From tkirsten at izbi.uni-leipzig.de Tue Jan 27 02:20:52 2004
From: tkirsten at izbi.uni-leipzig.de (Toralf Kirsten)
Date: Tue Jan 27 02:25:52 2004
Subject: [Biojava-l] GenBank Feature Extraction
Message-ID: <40161154.9050905@izbi.uni-leipzig.de>

Hi,

does anybody know how to extract the feature details, e.g. CDS /product, /translation etc., from the GenBank flat file? Any code example or hints about which class/classes should be used are welcome.

Thanks,
Toralf

From mark.schreiber at group.novartis.com Tue Jan 27 03:06:21 2004
From: mark.schreiber at group.novartis.com (mark.schreiber@group.novartis.com)
Date: Tue Jan 27 03:09:43 2004
Subject: [Biojava-l] GenBank Feature Extraction
Message-ID:

Hi -

Take a look at http://www.biojava.org/docs/bj_in_anger/filter.htm

- Mark

From tkirsten at izbi.uni-leipzig.de Tue Jan 27 11:10:41 2004
From: tkirsten at izbi.uni-leipzig.de (Toralf Kirsten)
Date: Tue Jan 27 11:15:49 2004
Subject: [Biojava-l] GenBank Feature Extraction
In-Reply-To:
References:
Message-ID: <40168D81.7000506@izbi.uni-leipzig.de>

Hi Mark,

thank you very much. It works. ;-)

Toralf

From sherpa at coresoft.be Tue Jan 27 13:55:39 2004
From: sherpa at coresoft.be (Frederik Decouttere)
Date: Tue Jan 27 14:02:08 2004
Subject: [Biojava-l] 2 problems found in org.biojava.bio.seq.db.biosql code
In-Reply-To: <40168D81.7000506@izbi.uni-leipzig.de>
Message-ID: <001501c3e507$286deb90$7f7ba8c0@SISKA>

Hi,

After doing some tests with the biojava - biosql code I think there are 2 (little) bugs in there:

- when persisting a Sequence which contains a Feature with a BetweenLocation, this Location gets converted to a RangeLocation upon retrieval
- when persisting a Sequence which contains (a) Feature(s) in 2 different biodatabases, an exception is thrown in the ontology code part of biojava

You can find an example for both cases below.

Environment: biojava.jar built from cvs, hsqldb database + biosql schema, jdk1.4

If someone
with knowledge of the biosql persistence code finds some time to have a look... I don't think it's very hard to fix.

Ciao
Frederik

######test code##################################################################
import java.util.*;
import org.biojava.bio.*;
import org.biojava.bio.seq.*;
import org.biojava.bio.seq.db.biosql.*;
import org.biojava.bio.seq.impl.*;
import org.biojava.bio.symbol.*;

public class BioSqlTest {
    String driver = "org.hsqldb.jdbcDriver" ;
    String user = "sa";
    String pass = "";
    String url = "jdbc:hsqldb:file:./database/biosqldb" ;

    public BioSqlTest() { }

    /**
     * Demonstrates the switch from BetweenLocation to RangeLocation
     * after a round trip through the persistence code.
     */
    public void generateFeaturePersistenceProblem() throws Exception {
        BioSQLSequenceDB db1 = new BioSQLSequenceDB(driver, url, user, pass , "testbiosqldb_1", true);
        db1.addSequence(getSequence()) ;
        Sequence seq = db1.getSequence("test_seq") ;
        for (Iterator iter = seq.features(); iter.hasNext();) {
            StrandedFeature f = (StrandedFeature) iter.next();
            Location loc = f.getLocation() ;
            /*
             * ERROR: Location is now a RangeLocation and not a BetweenLocation !
             */
            if(loc instanceof BetweenLocation) {
                // ok
                System.out.println("[feature] location is still a BetweenLocation: " + loc.getClass().getName());
            } else {
                // problem
                System.out.println("[feature] location is now an instance of: " + loc.getClass().getName());
            }
        }
    }

    public void generateOntolgyPersistenceProblem() throws Exception {
        BioSQLSequenceDB db2 = new BioSQLSequenceDB(driver, url, user, pass , "testbiosqldb_2", true);
        /*
         EXCEPTION: the second addSequence will cause an exception in the ontology persistence code

         Caused by: java.sql.SQLException: Failed to persist term: ATYPE from ontology: ontology: __biojava_guano with error: -9 : 23000
             at org.biojava.bio.seq.db.biosql.OntologySQL.persistTerm(OntologySQL.java:506)
             at org.biojava.bio.seq.db.biosql.OntologySQL.persistTerm(OntologySQL.java:468)
             ... 12 more
         Caused by: java.sql.SQLException: Violation of unique index: SYS_IDX_SYS_CT_9_10 in statement [insert into term (name, definition, ontology_id) values (?, ?, ?)]
             at org.hsqldb.jdbcDriver.throwError(Unknown Source)
             at org.hsqldb.jdbcPreparedStatement.executeUpdate(Unknown Source)
             at org.apache.commons.dbcp.DelegatingPreparedStatement.executeUpdate(DelegatingPreparedStatement.java:233)
             at org.apache.commons.dbcp.DelegatingPreparedStatement.executeUpdate(DelegatingPreparedStatement.java:233)
             at org.biojava.bio.seq.db.biosql.OntologySQL.persistTerm(OntologySQL.java:498)
             ... 13 more
        */
        db2.addSequence(getSequence()) ;
    }

    public static Sequence getSequence() throws Exception {
        SymbolList sl = DNATools.createDNA("ACTGGTGTACCCCAATGGGAATATC") ;
        Sequence sequence = new SimpleSequence(sl, null, "test_seq", null);
        sequence.createFeature(getFeature());
        return sequence ;
    }

    private static StrandedFeature.Template getFeature() throws Exception {
        SimpleAnnotation annotation = new SimpleAnnotation();
        annotation.setProperty("Comment", "comment line");
        StrandedFeature.Template templ = new StrandedFeature.Template();
        templ.annotation = annotation;
        templ.location = new BetweenLocation(new RangeLocation(3,4));
        templ.strand = StrandedFeature.POSITIVE;
        templ.type = "ATYPE";
        templ.source = "ASRC";
        return templ ;
    }

    public static void main(String[] args) {
        BioSqlTest test = new BioSqlTest() ;
        try {
            test.generateFeaturePersistenceProblem() ;
        } catch(Throwable t) {
            t.printStackTrace();
        }
        try {
            //test.generateOntolgyPersistenceProblem() ;
        } catch(Throwable t) {
            t.printStackTrace();
        } finally {
            System.exit(0);
        }
    }
}
From jan.wuerthner at uni-duesseldorf.de Thu Jan 29 14:23:54 2004
From: jan.wuerthner at uni-duesseldorf.de (Jan Würthner)
Date: Thu Jan 29 14:29:03 2004
Subject: [Biojava-l] creating a sequence...
Message-ID: <200401292023.54751.jan.wuerthner@uni-duesseldorf.de>

Hi,

I have written a BlastParser for XML-based BLAST results, in order to reconstruct SeqSimilaritySearchResult instances from the XML files returned by the NCBI for BLAST requests.

In constructing the search result, I use

SequenceBuilder sb = new SimpleSequenceBuilder();
SeqSimilaritySearchResult result
    = new SimpleSeqSimilaritySearchResult(sb.makeSequence(),
                                          new DummySequenceDB("dummy"),
                                          searchParameters,
                                          hits,
                                          new SimpleAnnotation(annotations));

where hits is a List of SeqSimilaritySearchHits, and searchParameters and annotations are Maps.

The XML file still contains information about the query, like ID and length, but not the whole query sequence, e.g.:

gi|1698579|gb|U60438.1|MMU60438
Mus musculus serum amyloid A protein isoform 2 mRNA, complete cds
576

Here is my question:

How can I carry this information over into the SimpleSeqSimilaritySearchResult? The sequence I obtain from

sb.makeSequence()

of course does not contain anything. Is there a way to construct a sequence from the query-ID, query-def and query-len? (Especially the length is something I need!)

Thanks in advance
Jan

--
Jan Würthner
Institute for Medical Microbiology
Building 22.21
Heinrich-Heine-University
Universitätsstraße 1
40225 Duesseldorf
Tel. +49 (0) 211 81 12461
URL: www.medmikro.uni-duesseldorf.de

From jan.wuerthner at uni-duesseldorf.de Thu Jan 29 15:54:26 2004
From: jan.wuerthner at uni-duesseldorf.de (Jan Würthner)
Date: Thu Jan 29 15:59:22 2004
Subject: [Biojava-l] creating a sequence...
In-Reply-To: <200401292023.54751.jan.wuerthner@uni-duesseldorf.de>
References: <200401292023.54751.jan.wuerthner@uni-duesseldorf.de>
Message-ID: <200401292154.26545.jan.wuerthner@uni-duesseldorf.de>

Sorry for answering my own email... After looking at it again, I think it is adequate to treat these three elements (query-ID, query-def and query-len) as annotations. There is no sense in assigning them to an unknown sequence, right? I have implemented it this way and it works fine. Thanks anyway.

Kind regards
Jan

--
Jan Würthner
Institute for Medical Microbiology
Building 22.21
Heinrich-Heine-University
Universitätsstraße 1
40225 Duesseldorf
Tel. +49 (0) 211 81 12461
URL: www.medmikro.uni-duesseldorf.de
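The annotation approach described in this message could look roughly like the sketch below. It only builds the Map; the key names `queryId`, `queryDef` and `queryLength` and the class `QueryAnnotationSketch` are assumptions made for this example, not names defined by BioJava or by the original parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that collects the query metadata as annotation properties.
class QueryAnnotationSketch {

    static Map<String, Object> queryAnnotations(String id, String def, int length) {
        Map<String, Object> annotations = new LinkedHashMap<>();
        annotations.put("queryId", id);            // assumed key name
        annotations.put("queryDef", def);          // assumed key name
        annotations.put("queryLength", length);    // assumed key name, stored as Integer
        return annotations;
    }

    public static void main(String[] args) {
        Map<String, Object> annotations = queryAnnotations(
                "gi|1698579|gb|U60438.1|MMU60438",
                "Mus musculus serum amyloid A protein isoform 2 mRNA, complete cds",
                576);
        // The length can be read back without needing a query sequence object.
        int length = (Integer) annotations.get("queryLength");
        System.out.println("query length = " + length);
    }
}
```

A Map built this way could then be passed to `new SimpleAnnotation(annotations)` as in the earlier message, so the query length stays available even though the query sequence itself is empty.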