From biojava at hannes.oib.com Thu Dec 1 07:59:10 2011
From: biojava at hannes.oib.com (Hannes Brandstätter-Müller)
Date: Thu, 1 Dec 2011 13:59:10 +0100
Subject: [Biojava-l] IndexOutOfBounds Exception when performing Pairwise Alignment

What am I doing wrong? I get:

Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
	at java.util.ArrayList.RangeCheck(ArrayList.java:547)
	at java.util.ArrayList.get(ArrayList.java:322)
	at org.biojava3.alignment.SimpleAlignedSequence.setLocation(SimpleAlignedSequence.java:362)
	at org.biojava3.alignment.SimpleAlignedSequence.<init>(SimpleAlignedSequence.java:88)
	at org.biojava3.alignment.SimpleProfile.<init>(SimpleProfile.java:118)
	at org.biojava3.alignment.SimpleSequencePair.<init>(SimpleSequencePair.java:86)
	at org.biojava3.alignment.SmithWaterman.setProfile(SmithWaterman.java:71)
	at org.biojava3.alignment.template.AbstractMatrixAligner.align(AbstractMatrixAligner.java:342)
	at org.biojava3.alignment.template.AbstractPairwiseSequenceAligner.getPair(AbstractPairwiseSequenceAligner.java:112)
	at org.biojava3.alignment.Alignments.getPairwiseAlignment(Alignments.java:208)

when calling

Alignments.getPairwiseAlignment(dnaSequence, target,
        PairwiseSequenceAlignerType.LOCAL, new SimpleGapPenalty(), matrix);

where matrix = new SimpleSubstitutionMatrix();
sequence and target are both DNASequence

Possible causes I can think of: target might contain IUB-Codes (not just ACGTU, but also RYKM...)

Hannes

From andreas at sdsc.edu Fri Dec 2 17:04:56 2011
From: andreas at sdsc.edu (Andreas Prlic)
Date: Fri, 2 Dec 2011 14:04:56 -0800
Subject: [Biojava-l] IndexOutOfBounds Exception when performing Pairwise Alignment

Hi Hannes,

Did you make sure to use the correct substitution matrix for the alignment? You need substitution scores for all nucleotides in your sequence to be present in the matrix...
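A quick way to test that idea before aligning is to list the symbols in a sequence that the matrix has no scores for. The following is only a sketch on plain strings, not BioJava API: the class name, `uncoveredSymbols` and the alphabet string are hypothetical, and the alphabet is assumed to be the set of symbols the matrix covers.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Sketch: report sequence symbols that a substitution matrix would have no scores for. */
class MatrixCoverageCheck {

    /**
     * Returns the symbols of {@code sequence} that do not occur in {@code matrixAlphabet},
     * in order of first appearance. Both arguments are plain strings (an assumption made
     * for illustration; real code would read the alphabet from the matrix object).
     */
    static Set<Character> uncoveredSymbols(String sequence, String matrixAlphabet) {
        Set<Character> missing = new LinkedHashSet<>();
        for (char symbol : sequence.toCharArray()) {
            if (matrixAlphabet.indexOf(symbol) < 0) {
                missing.add(symbol);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Hypothetical matrix that only scores the four unambiguous bases.
        String plainAlphabet = "ACGT";
        // A target containing IUB ambiguity codes.
        System.out.println(uncoveredSymbols("ACGTRYKM", plainAlphabet));
    }
}
```

An empty result means every symbol has a row and column in the matrix; anything else points at symbols the matrix cannot score.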

Andreas

_______________________________________________
Biojava-l mailing list - Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l

From biojava at hannes.oib.com Mon Dec 5 02:42:59 2011
From: biojava at hannes.oib.com (Hannes Brandstätter-Müller)
Date: Mon, 5 Dec 2011 08:42:59 +0100
Subject: [Biojava-l] IndexOutOfBounds Exception when performing Pairwise Alignment

Yes, that fixed that exception. Now, I'm getting null return value - must be still something wrong in the parameters...

SubstitutionMatrix matrix = SubstitutionMatrixHelper.getNuc4_4();
Alignments.getPairwiseAlignment(dnaSequence, target,
        PairwiseSequenceAlignerType.LOCAL, new SimpleGapPenalty(), matrix);

Where should I start looking for that?

Is there a simple way to align (or score, don't need the full alignment) a single DNA sequence against a List of sequences? I only found the "How to concurrently create a PSA for each pair in a sequence list in BioJava" cookbook entry, but that calculates a bunch of (for me) useless PSAs. Would it be better to perform a blast search (custom "library" to search against) for that?

Thanks,
Hannes

From andreas at sdsc.edu Mon Dec 5 20:57:43 2011
From: andreas at sdsc.edu (Andreas Prlic)
Date: Mon, 5 Dec 2011 17:57:43 -0800
Subject: [Biojava-l] IndexOutOfBounds Exception when performing Pairwise Alignment

> Now, I'm getting null return value - must be still something wrong in
> the parameters...
>
> Where should I start looking for that?

Try different gap penalties. I think the default ones are for protein alignments and one of the blosum matrices... If that does not help, can you send some of the sequences that are causing problems? There should be more informative error messages.

> Is there a simple way to align (or score, don't need the full
> alignment) a single DNA sequence against a List of sequences?

You could do a multiple sequence alignment.
http://www.biojava.org/wiki/BioJava:CookBook3:MSA

> would it be better to perform a blast search
> (custom "library" to search against) for that?

That depends on what you actually want to learn about your sequence. Blast is good for finding matches to new sequences that you did not know of before (but has worse alignment quality compared to dynamic programming).
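For a small reference library, the "score one query against each reference" case can also be sketched as plain dynamic programming without building full alignments. This is only an illustration, not the BioJava implementation: the class and method names are hypothetical, the score is a Smith-Waterman style local score with a linear gap penalty, and the values 5 / -4 / -10 for match, mismatch and gap are assumed for the example.

```java
import java.util.List;

/** Sketch: pick the reference with the best local alignment score for one query. */
class BestReferenceMatch {

    /**
     * Smith-Waterman style local score with a linear gap penalty.
     * Only the best score is kept, no traceback, so memory is two rows.
     */
    static int localScore(String a, String b, int match, int mismatch, int gap) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = 0;
            for (int j = 1; j <= b.length(); j++) {
                int diag = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch);
                int up = prev[j] + gap;
                int left = curr[j - 1] + gap;
                // Local alignment: a cell never drops below zero.
                curr[j] = Math.max(0, Math.max(diag, Math.max(up, left)));
                best = Math.max(best, curr[j]);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return best;
    }

    /** Returns the index of the best-scoring reference, or -1 for an empty list. */
    static int bestReferenceIndex(String query, List<String> references) {
        int bestIndex = -1;
        int bestScore = Integer.MIN_VALUE;
        for (int k = 0; k < references.size(); k++) {
            // Assumed scoring values, chosen for illustration only.
            int score = localScore(query, references.get(k), 5, -4, -10);
            if (score > bestScore) {
                bestScore = score;
                bestIndex = k;
            }
        }
        return bestIndex;
    }
}
```

This does one pass per reference, so the work grows with the number of references rather than with the number of reference pairs.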

Andreas

From biojava at hannes.oib.com Tue Dec 6 03:20:46 2011
From: biojava at hannes.oib.com (Hannes Brandstätter-Müller)
Date: Tue, 6 Dec 2011 09:20:46 +0100
Subject: [Biojava-l] IndexOutOfBounds Exception when performing Pairwise Alignment

On Tue, Dec 6, 2011 at 02:57, Andreas Prlic wrote:
>> Now, I'm getting null return value - must be still something wrong in
>> the parameters...
>>
>> Where should I start looking for that?
>
> try different gap penalties, I think the default ones are for protein
> alignments and one of the blosum matrices...
> If that does not help, can you send some of the sequences that are
> causing problems? There should be more informative error messages..

There are no other gap penalties predefined, and using a custom simple gap penalty with (gop=1, gep=1) also does not change the null outcome.

Here is a unit test case that fails for me:

public void testPSA() {
    String targetSeq = "CACGTTTCTTGTGGCAGCTTAAGTTTGAATGTCATTTCTTCAATGGGACGGA"
            + "GCGGGTGCGGTTGCTGGAAAGATGCATCTATAACCAAGAGGAGTCCGTGCGCTTCGACAGC"
            + "GACGTGGGGGAGTACCGGGCGGTGACGGAGCTGGGGCGGCCTGATGCCGAGTACTGGAACA"
            + "GCCAGAAGGACCTCCTGGAGCAGAGGCGGGCCGCGGTGGACACCTACTGCAGACACAACTA"
            + "CGGGGTTGGTGAGAGCTTCACAGTGCAGCGGCGAG";
    DNASequence target = new DNASequence(targetSeq,
            AmbiguityDNACompoundSet.getDNACompoundSet());
    String querySeq = "ACGAGTGCGTGTTTTCCCGCCTGGTCCCCAGGCCCCCTTTCCGTCCTCAGGAA"
            + "GACAGAGGAGGAGCCCCTCGGGCTGCAGGTGGTGGGCGTTGCGGCGGCGGCCGGTTAAGGT"
            + "TCCCAGTGCCCGCACCCGGCCCACGGGAGCCCCGGACTGGCGGCGTCACTGTCAGTGTCTT"
            + "CTCAGGAGGCCGCCTGTGTGACTGGATCGTTCGTGTCCCCACAGCACGTTTCTTGGAGTAC"
            + "TCTACGTCTGAGTGTCATTTCTTCAATGGGACGGAGCGGGTGCGGTTCCTGGACAGATACT"
            + "TCCATAACCAGGAGGAGAACGTGCGCTTCGACAGCGACGTGGGGGAGTTCCGGGCGGTGAC"
            + "GGAGCTGGGGCGGCCTGATGCCGAGTACTGGAACAGCCAGAAGGACATCCTGGAAGACGAG"
            + "CGGGCCGCGGTGGACACCTACTGCAGACACAACTACGGGGTTGTGAGAGCTTCACCGTGCA"
            + "GCGGCGAGACGCACTCGT";
    DNASequence query = new DNASequence(querySeq);
    SubstitutionMatrix matrix = SubstitutionMatrixHelper.getNuc4_4();
    SequencePair psa = Alignments.getPairwiseAlignment(query, target,
            PairwiseSequenceAlignerType.LOCAL, new SimpleGapPenalty(), matrix);
    assertNotNull(psa);
}

>> Is there a simple way to align (or score, don't need the full
>> alignment) a single DNA sequence against a List of sequences?
>
> You could do a multiple sequence alignment.
> http://www.biojava.org/wiki/BioJava:CookBook3:MSA

Yeah, but that also computes loads of unnecessary PSAs. I just need the following: I get some sequences (from a sequencing machine), and for each of these sequences I want to look up in my (small) 'library' of reference sequences which one would be the most likely. So I don't want PSAs of the reference sequences, just my query against each ref seq. Something like that should be in the biojava library itself; the only thing I found was to calculate PSAs of each sequence in a list (much like you need for an MSA). If biojava could offer that using the ConcurrencyTools stuff, that would be cool.

I really need to figure out the inner structure of the biojava classes and start implementing that stuff for myself, but the factory method stuff is kinda confusing to get the hang of.

As soon as I figure this out, I'm going to improve the hell out of the cookbook examples. Those are next to useless for my scenario.

Hannes

From biojava at hannes.oib.com Tue Dec 6 03:59:47 2011
From: biojava at hannes.oib.com (Hannes Brandstätter-Müller)
Date: Tue, 6 Dec 2011 09:59:47 +0100
Subject: [Biojava-l] IndexOutOfBounds Exception when performing Pairwise Alignment

Hah, I got some not-null results:

On Tue, Dec 6, 2011 at 09:20, Hannes Brandstätter-Müller wrote:
>     DNASequence query = new DNASequence(querySeq);

query.setCompoundSet(AmbiguityDNACompoundSet.getDNACompoundSet()); // inserting that helps.

>     SubstitutionMatrix matrix = SubstitutionMatrixHelper.getNuc4_4();
>     SequencePair psa = Alignments.getPairwiseAlignment(query, target,
>             PairwiseSequenceAlignerType.LOCAL, new SimpleGapPenalty(), matrix);
>     assertNotNull(psa);
> }

But when I try something similar in my production code, I get a java.lang.IndexOutOfBoundsException: Index: 0, Size: 0 again:

dnaSequence.setCompoundSet(AmbiguityDNACompoundSet.getDNACompoundSet());
// If I remove this line, the exception is gone again, but I get NULL result.
psa = Alignments.getPairwiseAlignment(dnaSequence, target,
        PairwiseSequenceAlignerType.LOCAL, new SimpleGapPenalty(), matrix);

The dnaSequence in that case is something that is passed to this method, and is a sequence generated by a fasta reader. It should have no ambiguity in there, just plain ACGT. It has a plain DNACompoundSet too.

Hannes

From biojava at hannes.oib.com Tue Dec 6 05:39:39 2011
From: biojava at hannes.oib.com (Hannes Brandstätter-Müller)
Date: Tue, 6 Dec 2011 11:39:39 +0100
Subject: [Biojava-l] IndexOutOfBounds Exception when performing Pairwise Alignment

Another update: that most recent exception was caused by a reference sequence consisting only of "NNNN". Looks like something that should be handled more gracefully to me :)

Hannes

From daniel.svozil at vscht.cz Tue Dec 6 06:33:24 2011
From: daniel.svozil at vscht.cz (Daniel Svozil)
Date: Tue, 6 Dec 2011 12:33:24 +0100
Subject: [Biojava-l] mmView - a tool for mmCIF exploration

Dear colleagues,

We would like to announce the availability of mmView, a web-based application that allows users to comfortably explore the structural data of biomacromolecules stored in the mmCIF (macromolecular Crystallographic Information File) format.
The mmView software system is primarily intended for educational purposes, but it can also serve as an auxiliary tool for working with biomolecular structures.

The mmView application is offered in two flavors: as a publicly available web server, http://ich.vscht.cz/projects/mmview/, and as an open-source stand-alone application (available from http://sourceforge.net/projects/mmview) that can be installed on the user's computer.

Petr Cech and Daniel Svozil

--
Daniel Svozil, PhD
Head of Laboratory of Informatics and Chemistry
Institute of Chemical Technology
Czech Republic

phone: +420 220 444 391
http://ich.vscht.cz/~svozil

From andreas at sdsc.edu Wed Dec 7 00:28:18 2011
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 6 Dec 2011 21:28:18 -0800
Subject: [Biojava-l] IndexOutOfBounds Exception when performing Pairwise Alignment

Hi Hannes,

A couple of things in response to your mails:

- Thanks for providing the example. I have moved it to a Cookbook page, since we had several DNA alignment related questions recently. http://biojava.org/wiki/BioJava;CookBook3:PSA_DNA

- I believe the AmbiguityDNACompoundSet is required, to match the compound set of the substitution matrix.

- About your sequence matching strategy: this really depends on the size and similarity of your DNA sequences. If you have many, then blast might be the way to go; otherwise just looking at the number of identical positions (psa.getNumIdenticals()) might be a quick and dirty solution for this as well.

- ConcurrencyTools is just a utility class for working with the Java concurrency framework. It should be easy to write your own Callable classes that do pairwise alignment, if you need to do this in multiple threads...
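The "own Callable classes" idea can be sketched with the standard java.util.concurrent classes alone. Everything named here is hypothetical, and `identicalPositions` is only a stand-in for a real pairwise alignment call: it counts matching positions over the shared length of two already aligned strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Sketch: score one query against every reference, one Callable per reference. */
class ConcurrentQueryScoring {

    /** Stand-in for an alignment: counts identical positions over the shared length. */
    static int identicalPositions(String a, String b) {
        int shared = Math.min(a.length(), b.length());
        int count = 0;
        for (int i = 0; i < shared; i++) {
            if (a.charAt(i) == b.charAt(i)) {
                count++;
            }
        }
        return count;
    }

    /** Returns one score per reference, in the same order as {@code references}. */
    static List<Integer> scoreAll(String query, List<String> references, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (String reference : references) {
                tasks.add(() -> identicalPositions(query, reference));
            }
            List<Integer> scores = new ArrayList<>();
            // invokeAll returns the futures in the order the tasks were submitted.
            for (Future<Integer> future : pool.invokeAll(tasks)) {
                scores.add(future.get());
            }
            return scores;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while scoring", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Scoring task failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Swapping the body of the lambda for a real alignment call keeps the rest unchanged, since each task only needs the query and one reference.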

Andreas

From andreas at sdsc.edu Wed Dec 7 14:33:50 2011
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 7 Dec 2011 11:33:50 -0800
Subject: [Biojava-l] mmView - a tool for mmCIF exploration

Thanks, Daniel,

what is the connection to BioJava? This seems to be a Python project? Otherwise, nice tool!

Andreas

From daniel.svozil at vscht.cz Wed Dec 7 14:43:23 2011
From: daniel.svozil at vscht.cz (Daniel Svozil)
Date: Wed, 7 Dec 2011 20:43:23 +0100
Subject: [Biojava-l] mmView - a tool for mmCIF exploration
In-Reply-To: <17c23e022b7c4cac9a273e5ded98874c@FE15.vscht.cz>
References: <17c23e022b7c4cac9a273e5ded98874c@FE15.vscht.cz>

Hi Andreas,

thanks for the compliment. Yes, it is a Python tool, with no direct connection to BioJava.
However, we were thinking that somebody from the bioinformatics community working with mmCIF files may find it useful.

Daniel

From biojava at hannes.oib.com Mon Dec 19 08:06:17 2011
From: biojava at hannes.oib.com (Hannes Brandstätter-Müller)
Date: Mon, 19 Dec 2011 14:06:17 +0100
Subject: [Biojava-l] Is a modification of the FASTA parser for my needs easy or should I implement something else?

Hi!

I have 2 files as output of my sequencer: one is a standard fasta, which is no problem; the other is a fasta file with the same headers, but instead of simple letters there are integer values from 0 to 40 denoting the quality of the sequencing at each position. Is it easy to adapt the biojava fasta parser to read such files (by feeding different classes to the parser), or should I write a specialized parser from scratch?

Hannes

From p.j.a.cock at googlemail.com Wed Dec 21 04:21:32 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 21 Dec 2011 09:21:32 +0000
Subject: [Biojava-l] Is a modification of the FASTA parser for my needs easy or should I implement something else?

You mean a QUAL file? These are often named example.qual or example.fasta.qual and used in conjunction with a matching FASTA file as a clumsy alternative to FASTQ (taking more space on disk too).

For Biopython we have a separate parser: although the header line handling is the same, the quality sequence is quite different (splitting on white space, ASCII strings then converted to integers).

Peter

From andreas at sdsc.edu Wed Dec 21 09:21:17 2011
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 21 Dec 2011 06:21:17 -0800
Subject: [Biojava-l] Is a modification of the FASTA parser for my needs easy or should I implement something else?

Hi Hannes,

if this is a more frequently used file format, it would be great to get a parser for this... Any patches are welcome! ;-)

Andreas

From biojava at hannes.oib.com Wed Dec 21 09:44:51 2011
From: biojava at hannes.oib.com (Hannes Brandstätter-Müller)
Date: Wed, 21 Dec 2011 15:44:51 +0100
Subject: [Biojava-l] Is a modification of the FASTA parser for my needs easy or should I implement something else?

On Wed, Dec 21, 2011 at 15:21, Andreas Prlic wrote:
> if this is a more frequently used file format, would be great to get a
> parser for this... Any patches are welcome ! ;-)

I just put together some kludgy thing that works for me now, but something needs to be done about this. I mailed a bit back and forth with Peter, and he pointed me to http://news.open-bio.org/news/2009/12/nar-fastq-format/ where Biojava is mentioned as understanding that quality information too. I guess that got lost in the move to 3.0?

If nothing happens until March/April, I'll pick this up as my pet project and work to help biojava support FASTQ and the FASTA/QUAL format. But until then I'm totally swamped. If someone else (perhaps the one who wrote the FASTA parser initially) could pick that up, that would be great.

Hannes

From andreas at sdsc.edu Wed Dec 21 10:09:45 2011
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 21 Dec 2011 07:09:45 -0800
Subject: [Biojava-l] Is a modification of the FASTA parser for my needs easy or should I implement something else?

The fastq parser is in the legacy biojava 1.8 and can still be downloaded if you want. Not sure how hard it would be to migrate it to biojava3.

A

From biojava at hannes.oib.com Thu Dec 22 02:34:32 2011
From: biojava at hannes.oib.com (Hannes Brandstätter-Müller)
Date: Thu, 22 Dec 2011 08:34:32 +0100
Subject: [Biojava-l] Is a modification of the FASTA parser for my needs easy or should I implement something else?

On Wed, Dec 21, 2011 at 16:09, Andreas Prlic wrote:
> The fastq parser is in the legacy biojava 1.8 and can still be
> downloaded if you want. Not sure how hard it would be to migrate it to
> biojava3.

FASTQ support in 3.0 would be nice. If no one else is doing it, I'll take a look at migrating it, but that won't happen before April next year.

Hannes

From biojava at hannes.oib.com Thu Dec 22 04:18:37 2011
From: biojava at hannes.oib.com (Hannes Brandstätter-Müller)
Date: Thu, 22 Dec 2011 10:18:37 +0100
Subject: [Biojava-l] Cookbook entry - feedback please

Hi!

I recently ran into the problem of having 2 very similar DNA sequences and wanting to get (1) a difference count and (2) a consensus sequence.

I asked on stackoverflow and biostar.stackexchange.com for some input, and the answers pointed me towards enums and a lookup table using these. So I sat down and wrote that (the LUT took me quite some time, I hope I did not make errors in there) and made a small class to quickly get to the result.

I posted the code as a cookbook page for future reference, possible inclusion of the code into the library and, of course, feedback.

http://biojava.org/wiki/BioJava:CookBook:Core:SequenceCompare

Hannes

From andreas at sdsc.edu Thu Dec 22 08:29:32 2011
From: andreas at sdsc.edu (Andreas Prlic)
Date: Thu, 22 Dec 2011 05:29:32 -0800
Subject: [Biojava-l] Cookbook entry - feedback please

Did you try to align the sequences first?
That would also give you the number of identical nucleotides. Andreas On Thu, Dec 22, 2011 at 1:18 AM, Hannes Brandst?tter-M?ller wrote: > Hi! > > I recently ran into the problem of having 2 very similar DNA Sequences > and wanting to get (1) a difference count and (2) a consensus > sequence. > > I asked on stackoverflow and biostar.stackexchange.com for some input, > and the answers pointed me towards enums and a lookup table using > these. So I sat down and wrote that (that LUT took me quite some time, > I hope I did not make errors in there) and made a small class to > quickly get to the result. > > I post the code as cookbook page for future reference, possible > inclusion of the code into the library and of course, feedback. > > http://biojava.org/wiki/BioJava:CookBook:Core:SequenceCompare > > Hannes > _______________________________________________ > Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l From biojava at hannes.oib.com Thu Dec 22 09:58:49 2011 From: biojava at hannes.oib.com (=?ISO-8859-1?Q?Hannes_Brandst=E4tter=2DM=FCller?=) Date: Thu, 22 Dec 2011 15:58:49 +0100 Subject: [Biojava-l] Cookbook entry - feedback please In-Reply-To: References: Message-ID: In my case, the sequences were all pre-aligned, but yes, that could be done first. A question concerning consensus sequences: how do you handle gaps? N (as I understood it) does not allow a gap. Is there a way to encode a "might be gap here, or A or G"? Hannes On Thu, Dec 22, 2011 at 14:29, Andreas Prlic wrote: > Did you try to align the sequences first? That would also give you the > number of identical nucleotides. > > Andreas > > > > On Thu, Dec 22, 2011 at 1:18 AM, Hannes Brandst?tter-M?ller > wrote: >> Hi! >> >> I recently ran into the problem of having 2 very similar DNA Sequences >> and wanting to get (1) a difference count and (2) a consensus >> sequence. 
>>
>> I asked on stackoverflow and biostar.stackexchange.com for some input,
>> and the answers pointed me towards enums and a lookup table using
>> these. So I sat down and wrote that (that LUT took me quite some time,
>> I hope I did not make errors in there) and made a small class to
>> quickly get to the result.
>>
>> I post the code as cookbook page for future reference, possible
>> inclusion of the code into the library and of course, feedback.
>>
>> http://biojava.org/wiki/BioJava:CookBook:Core:SequenceCompare
>>
>> Hannes

From andreas at sdsc.edu Thu Dec 22 10:26:44 2011
From: andreas at sdsc.edu (Andreas Prlic)
Date: Thu, 22 Dec 2011 07:26:44 -0800
Subject: [Biojava-l] Cookbook entry - feedback please
In-Reply-To:
References:
Message-ID:

> A question concerning consensus sequences: how do you handle gaps? N
> (as I understood it) does not allow a gap. Is there a way to encode a
> "might be gap here, or A or G"?

If the input is a multiple sequence alignment then you could count
frequencies at each position and take the most frequently occurring
nucleotide. For each position you could count a conservation score.

Andreas

From P.V.Troshin at dundee.ac.uk Fri Dec 23 18:47:58 2011
From: P.V.Troshin at dundee.ac.uk (Peter Troshin)
Date: Fri, 23 Dec 2011 23:47:58 +0000
Subject: [Biojava-l] Is a modification of the FASTA parser for my needs easy or should I implement something else?
Message-ID: <4EF5132E020000ED0001016B@ia-gw-6.dundee.ac.uk>

There is a nice FastQ parser and many other goodies available from the
http://picard.sourceforge.net/command-line-overview.shtml (just download
the source code) if you need one.

Regards,
Peter

>>> Andreas Prlic 12/21/11 3:12 PM >>>
The fastq parser is in the legacy biojava 1.8 and can still be
downloaded if you want.
Not sure how hard it would be to migrate it to biojava3. A On Wed, Dec 21, 2011 at 6:44 AM, Hannes Brandst?tter-M?ller wrote: > On Wed, Dec 21, 2011 at 15:21, Andreas Prlic wrote: >> Hi Hannes, >> >> if this is a more frequently used file format, would be great to get a >> parser for this... Any patches are welcome ! ;-) >> >> Andreas > > I just put together some kludgy thing that works for me now, but there > needs to be something done for this. > I mailed a bit back and forth with Peter, and he pointed me to > http://news.open-bio.org/news/2009/12/nar-fastq-format/ where Biojava > is mentioned to understand that quality information too. I guess that > got lost in the move to 3.0? > > If nothing happens until March/April, I'll pick up this as my pet > project and work to help biojava to support FASTQ and the FASTA/QUAL > format. But until then I'm totally swamped. If someone else (perhaps > the one who wrote the FASTA Parser initially) could pick that up, that > would be great. > > Hannes _______________________________________________ Biojava-l mailing list - Biojava-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biojava-l ************************************************************ Please consider the environment. Do you really need to print this email? From biojava at hannes.oib.com Thu Dec 1 12:59:10 2011 From: biojava at hannes.oib.com (=?ISO-8859-1?Q?Hannes_Brandst=E4tter=2DM=FCller?=) Date: Thu, 1 Dec 2011 13:59:10 +0100 Subject: [Biojava-l] IndexOutOfBounds Exception when performing Pairwise Alignment Message-ID: What am I doing wrong? 
I get: Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0 at java.util.ArrayList.RangeCheck(ArrayList.java:547) at java.util.ArrayList.get(ArrayList.java:322) at org.biojava3.alignment.SimpleAlignedSequence.setLocation(SimpleAlignedSequence.java:362) at org.biojava3.alignment.SimpleAlignedSequence.(SimpleAlignedSequence.java:88) at org.biojava3.alignment.SimpleProfile.(SimpleProfile.java:118) at org.biojava3.alignment.SimpleSequencePair.(SimpleSequencePair.java:86) at org.biojava3.alignment.SmithWaterman.setProfile(SmithWaterman.java:71) at org.biojava3.alignment.template.AbstractMatrixAligner.align(AbstractMatrixAligner.java:342) at org.biojava3.alignment.template.AbstractPairwiseSequenceAligner.getPair(AbstractPairwiseSequenceAligner.java:112) at org.biojava3.alignment.Alignments.getPairwiseAlignment(Alignments.java:208) when calling Alignments.getPairwiseAlignment(dnaSequence, target, PairwiseSequenceAlignerType.LOCAL, new SimpleGapPenalty(), matrix); where matrix = new SimpleSubstitutionMatrix();, sequence and target are both DNASequence Possible causes I can think of: target might contain IUB-Codes (not just ACGTU, but also RYKM...) Hannes From andreas at sdsc.edu Fri Dec 2 22:04:56 2011 From: andreas at sdsc.edu (Andreas Prlic) Date: Fri, 2 Dec 2011 14:04:56 -0800 Subject: [Biojava-l] IndexOutOfBounds Exception when performing Pairwise Alignment In-Reply-To: References: Message-ID: Hi Hannes, Did you make sure to use the correct substitution matrix for the alignment? You need substitution scores for all nucleotides in your sequence to be present in the matrix... Andreas On Thu, Dec 1, 2011 at 4:59 AM, Hannes Brandst?tter-M?ller wrote: > What am I doing wrong? I get: > > Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: > 0, Size: 0 > ? ? ? ?at java.util.ArrayList.RangeCheck(ArrayList.java:547) > ? ? ? ?at java.util.ArrayList.get(ArrayList.java:322) > ? ? ? 
?at org.biojava3.alignment.SimpleAlignedSequence.setLocation(SimpleAlignedSequence.java:362) > ? ? ? ?at org.biojava3.alignment.SimpleAlignedSequence.(SimpleAlignedSequence.java:88) > ? ? ? ?at org.biojava3.alignment.SimpleProfile.(SimpleProfile.java:118) > ? ? ? ?at org.biojava3.alignment.SimpleSequencePair.(SimpleSequencePair.java:86) > ? ? ? ?at org.biojava3.alignment.SmithWaterman.setProfile(SmithWaterman.java:71) > ? ? ? ?at org.biojava3.alignment.template.AbstractMatrixAligner.align(AbstractMatrixAligner.java:342) > ? ? ? ?at org.biojava3.alignment.template.AbstractPairwiseSequenceAligner.getPair(AbstractPairwiseSequenceAligner.java:112) > ? ? ? ?at org.biojava3.alignment.Alignments.getPairwiseAlignment(Alignments.java:208) > > when calling Alignments.getPairwiseAlignment(dnaSequence, target, > PairwiseSequenceAlignerType.LOCAL, new SimpleGapPenalty(), matrix); > > where matrix = new SimpleSubstitutionMatrix();, > sequence and target are both DNASequence > > Possible causes I can think of: target might contain IUB-Codes (not > just ACGTU, but also RYKM...) > > Hannes > _______________________________________________ > Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l From biojava at hannes.oib.com Mon Dec 5 07:42:59 2011 From: biojava at hannes.oib.com (=?ISO-8859-1?Q?Hannes_Brandst=E4tter=2DM=FCller?=) Date: Mon, 5 Dec 2011 08:42:59 +0100 Subject: [Biojava-l] IndexOutOfBounds Exception when performing Pairwise Alignment In-Reply-To: References: Message-ID: Yes, that fixed that exception. Now, I'm getting null return value - must be still something wrong in the parameters... SubstitutionMatrix matrix = SubstitutionMatrixHelper.getNuc4_4(); Alignments.getPairwiseAlignment(dnaSequence, target, PairwiseSequenceAlignerType.LOCAL, new SimpleGapPenalty(), matrix); Where should I start looking for that? 
Is there a simple way to align (or score, don't need the full alignment) a single DNA sequence against a List of sequences? I only found the "How to concurrently create a PSA for each pair in a sequence list in BioJava" cookbook entry, but that calculates a bunch of (for me) useless PSAs. would it be better to perform a blast search (custom "library" to search against) for that? Thanks, Hannes On Fri, Dec 2, 2011 at 23:04, Andreas Prlic wrote: > Hi Hannes, > > Did you make sure to use the correct substitution matrix for the > alignment? You need substitution scores for all nucleotides in your > sequence to be present in the matrix... > > Andreas > > On Thu, Dec 1, 2011 at 4:59 AM, Hannes Brandst?tter-M?ller > wrote: >> What am I doing wrong? I get: >> >> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: >> 0, Size: 0 >> ? ? ? ?at java.util.ArrayList.RangeCheck(ArrayList.java:547) >> ? ? ? ?at java.util.ArrayList.get(ArrayList.java:322) >> ? ? ? ?at org.biojava3.alignment.SimpleAlignedSequence.setLocation(SimpleAlignedSequence.java:362) >> ? ? ? ?at org.biojava3.alignment.SimpleAlignedSequence.(SimpleAlignedSequence.java:88) >> ? ? ? ?at org.biojava3.alignment.SimpleProfile.(SimpleProfile.java:118) >> ? ? ? ?at org.biojava3.alignment.SimpleSequencePair.(SimpleSequencePair.java:86) >> ? ? ? ?at org.biojava3.alignment.SmithWaterman.setProfile(SmithWaterman.java:71) >> ? ? ? ?at org.biojava3.alignment.template.AbstractMatrixAligner.align(AbstractMatrixAligner.java:342) >> ? ? ? ?at org.biojava3.alignment.template.AbstractPairwiseSequenceAligner.getPair(AbstractPairwiseSequenceAligner.java:112) >> ? ? ? 
?at org.biojava3.alignment.Alignments.getPairwiseAlignment(Alignments.java:208) >> >> when calling Alignments.getPairwiseAlignment(dnaSequence, target, >> PairwiseSequenceAlignerType.LOCAL, new SimpleGapPenalty(), matrix); >> >> where matrix = new SimpleSubstitutionMatrix();, >> sequence and target are both DNASequence >> >> Possible causes I can think of: target might contain IUB-Codes (not >> just ACGTU, but also RYKM...) >> >> Hannes >> _______________________________________________ >> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l > From andreas at sdsc.edu Tue Dec 6 01:57:43 2011 From: andreas at sdsc.edu (Andreas Prlic) Date: Mon, 5 Dec 2011 17:57:43 -0800 Subject: [Biojava-l] IndexOutOfBounds Exception when performing Pairwise Alignment In-Reply-To: References: Message-ID: > Now, I'm getting null return value - must be still something wrong in > the parameters... > > Where should I start looking for that? try different gap penalties, I think the default ones are for protein alignments and one of the blosum matrices... If that does not help, can you send some of the sequences that are causing problems? There should be more informative error messages.. > Is there a simple way to align (or score, don't need the full > alignment) a single DNA sequence against a List of sequences? You could do a multiple sequence alignment. http://www.biojava.org/wiki/BioJava:CookBook3:MSA would it be better to perform a blast search > (custom "library" to search against) for that? depends on what you actually want to learn about your sequence. Blast is good to find matches to new sequences, that you did not know of before (but has worse alignment quality compared to dynamic programming). 
Andreas From biojava at hannes.oib.com Tue Dec 6 08:20:46 2011 From: biojava at hannes.oib.com (=?ISO-8859-1?Q?Hannes_Brandst=E4tter=2DM=FCller?=) Date: Tue, 6 Dec 2011 09:20:46 +0100 Subject: [Biojava-l] IndexOutOfBounds Exception when performing Pairwise Alignment In-Reply-To: References: Message-ID: On Tue, Dec 6, 2011 at 02:57, Andreas Prlic wrote: >> Now, I'm getting null return value - must be still something wrong in >> the parameters... >> >> Where should I start looking for that? > > try different gap penalties, I think the default ones are for protein > alignments and one of the blosum matrices... > If that does not help, can you send some of the sequences that are > causing problems? There should be more informative error messages.. There are no other gap penalties predefined, and using a custom simple gap penalty with (gop=1, gep=1) also does not change the null outcome. Here is a unit test case that fails for me: public void testPSA() { String targetSeq = "CACGTTTCTTGTGGCAGCTTAAGTTTGAATGTCATTTCTTCAATGGGACGGA" + "GCGGGTGCGGTTGCTGGAAAGATGCATCTATAACCAAGAGGAGTCCGTGCGCTTCGACAGC" + "GACGTGGGGGAGTACCGGGCGGTGACGGAGCTGGGGCGGCCTGATGCCGAGTACTGGAACA" + "GCCAGAAGGACCTCCTGGAGCAGAGGCGGGCCGCGGTGGACACCTACTGCAGACACAACTA" + "CGGGGTTGGTGAGAGCTTCACAGTGCAGCGGCGAG"; DNASequence target = new DNASequence(targetSeq, AmbiguityDNACompoundSet.getDNACompoundSet()); String querySeq = "ACGAGTGCGTGTTTTCCCGCCTGGTCCCCAGGCCCCCTTTCCGTCCTCAGGAA" + "GACAGAGGAGGAGCCCCTCGGGCTGCAGGTGGTGGGCGTTGCGGCGGCGGCCGGTTAAGGT" + "TCCCAGTGCCCGCACCCGGCCCACGGGAGCCCCGGACTGGCGGCGTCACTGTCAGTGTCTT" + "CTCAGGAGGCCGCCTGTGTGACTGGATCGTTCGTGTCCCCACAGCACGTTTCTTGGAGTAC" + "TCTACGTCTGAGTGTCATTTCTTCAATGGGACGGAGCGGGTGCGGTTCCTGGACAGATACT" + "TCCATAACCAGGAGGAGAACGTGCGCTTCGACAGCGACGTGGGGGAGTTCCGGGCGGTGAC" + "GGAGCTGGGGCGGCCTGATGCCGAGTACTGGAACAGCCAGAAGGACATCCTGGAAGACGAG" + "CGGGCCGCGGTGGACACCTACTGCAGACACAACTACGGGGTTGTGAGAGCTTCACCGTGCA" + "GCGGCGAGACGCACTCGT"; DNASequence query = new DNASequence(querySeq); SubstitutionMatrix 
matrix = SubstitutionMatrixHelper.getNuc4_4(); SequencePair psa = Alignments.getPairwiseAlignment(query, target, PairwiseSequenceAlignerType.LOCAL, new SimpleGapPenalty(), matrix); assertNotNull(psa); } >> Is there a simple way to align (or score, don't need the full >> alignment) a single DNA sequence against a List of sequences? > > You could do a multiple sequence alignment. > http://www.biojava.org/wiki/BioJava:CookBook3:MSA yeah, but that also computes loads of unnecessary PSAs. I just need the following: I get some sequences (from a sequencing machine), and for each of these sequences I want to look up in my (small) 'library' of reference sequences which one would be the most likely. So, I don't want PSAs of the reference sequences, just my query against each ref seq - something like that should be in the biojava library itself, the only thing I found was to calculate PSAs of eact sequence in a list (much like you need for a MSA), but if biuojava could offer that using the ConcurrencyTools stuff, that would be cool - I really need to figure out the inner structure of the biojava classes and start implementing that stuff for myself, but the factory method stuff is kinda confusing to get a hang of. As soon as I figure this out, I'm going to improve the hell out of the cookbook examples. Those are next to useless for my scenario. Hannes From biojava at hannes.oib.com Tue Dec 6 08:59:47 2011 From: biojava at hannes.oib.com (=?ISO-8859-1?Q?Hannes_Brandst=E4tter=2DM=FCller?=) Date: Tue, 6 Dec 2011 09:59:47 +0100 Subject: [Biojava-l] IndexOutOfBounds Exception when performing Pairwise Alignment In-Reply-To: References: Message-ID: Hah, I got some not-null results: On Tue, Dec 6, 2011 at 09:20, Hannes Brandst?tter-M?ller wrote: > > public void testPSA() { > ? ? ? ?String targetSeq = > "CACGTTTCTTGTGGCAGCTTAAGTTTGAATGTCATTTCTTCAATGGGACGGA" > ? ? ? ? ? ? ? ?+ > "GCGGGTGCGGTTGCTGGAAAGATGCATCTATAACCAAGAGGAGTCCGTGCGCTTCGACAGC" > ? ? ? ? ? ? ? 
?+ > "GACGTGGGGGAGTACCGGGCGGTGACGGAGCTGGGGCGGCCTGATGCCGAGTACTGGAACA" > ? ? ? ? ? ? ? ?+ > "GCCAGAAGGACCTCCTGGAGCAGAGGCGGGCCGCGGTGGACACCTACTGCAGACACAACTA" > ? ? ? ? ? ? ? ?+ "CGGGGTTGGTGAGAGCTTCACAGTGCAGCGGCGAG"; > ? ? ? ?DNASequence target = new DNASequence(targetSeq, > AmbiguityDNACompoundSet.getDNACompoundSet()); > ? ? ? ?String querySeq = > "ACGAGTGCGTGTTTTCCCGCCTGGTCCCCAGGCCCCCTTTCCGTCCTCAGGAA" > ? ? ? ? ? ? ? ?+ > "GACAGAGGAGGAGCCCCTCGGGCTGCAGGTGGTGGGCGTTGCGGCGGCGGCCGGTTAAGGT" > ? ? ? ? ? ? ? ?+ > "TCCCAGTGCCCGCACCCGGCCCACGGGAGCCCCGGACTGGCGGCGTCACTGTCAGTGTCTT" > ? ? ? ? ? ? ? ?+ > "CTCAGGAGGCCGCCTGTGTGACTGGATCGTTCGTGTCCCCACAGCACGTTTCTTGGAGTAC" > ? ? ? ? ? ? ? ?+ > "TCTACGTCTGAGTGTCATTTCTTCAATGGGACGGAGCGGGTGCGGTTCCTGGACAGATACT" > ? ? ? ? ? ? ? ?+ > "TCCATAACCAGGAGGAGAACGTGCGCTTCGACAGCGACGTGGGGGAGTTCCGGGCGGTGAC" > ? ? ? ? ? ? ? ?+ > "GGAGCTGGGGCGGCCTGATGCCGAGTACTGGAACAGCCAGAAGGACATCCTGGAAGACGAG" > ? ? ? ? ? ? ? ?+ > "CGGGCCGCGGTGGACACCTACTGCAGACACAACTACGGGGTTGTGAGAGCTTCACCGTGCA" > ? ? ? ? ? ? ? ?+ "GCGGCGAGACGCACTCGT"; > ? ? ? ?DNASequence query = new DNASequence(querySeq); query.setCompoundSet(AmbiguityDNACompoundSet.getDNACompoundSet()); // inserting that helps. > ? ? ? ?SubstitutionMatrix matrix = > SubstitutionMatrixHelper.getNuc4_4(); > ? ? ? ?SequencePair psa = > Alignments.getPairwiseAlignment(query, target, > PairwiseSequenceAlignerType.LOCAL, new SimpleGapPenalty(), matrix); > ? ? ? ?assertNotNull(psa); > ? ?} But when I try something similar in my production code, I get an java.lang.IndexOutOfBoundsException: Index: 0, Size: 0 again dnaSequence.setCompoundSet(AmbiguityDNACompoundSet.getDNACompoundSet()); // If I remove this line, the exception is gone again, but I get NULL result. 
psa = Alignments.getPairwiseAlignment(dnaSequence, target, PairwiseSequenceAlignerType.LOCAL, new SimpleGapPenalty(), matrix); the dnaSequence in that case is something that is passed to this method, and is a sequence generated by a fasta reader - should have no ambiguity in there, just plain ACGT. It has a plain DNACompoundSet too. Hannes From biojava at hannes.oib.com Tue Dec 6 10:39:39 2011 From: biojava at hannes.oib.com (=?ISO-8859-1?Q?Hannes_Brandst=E4tter=2DM=FCller?=) Date: Tue, 6 Dec 2011 11:39:39 +0100 Subject: [Biojava-l] IndexOutOfBounds Exception when performing Pairwise Alignment In-Reply-To: References: Message-ID: another update: that most recent exception was caused by a reference sequence consisting only of "NNNN" - looks like something that should be handled more gracefully to me :) Hannes On Tue, Dec 6, 2011 at 09:59, Hannes Brandst?tter-M?ller wrote: > Hah, I got some not-null results: > > > On Tue, Dec 6, 2011 at 09:20, Hannes Brandst?tter-M?ller > wrote: >> >> public void testPSA() { >> ? ? ? ?String targetSeq = >> "CACGTTTCTTGTGGCAGCTTAAGTTTGAATGTCATTTCTTCAATGGGACGGA" >> ? ? ? ? ? ? ? ?+ >> "GCGGGTGCGGTTGCTGGAAAGATGCATCTATAACCAAGAGGAGTCCGTGCGCTTCGACAGC" >> ? ? ? ? ? ? ? ?+ >> "GACGTGGGGGAGTACCGGGCGGTGACGGAGCTGGGGCGGCCTGATGCCGAGTACTGGAACA" >> ? ? ? ? ? ? ? ?+ >> "GCCAGAAGGACCTCCTGGAGCAGAGGCGGGCCGCGGTGGACACCTACTGCAGACACAACTA" >> ? ? ? ? ? ? ? ?+ "CGGGGTTGGTGAGAGCTTCACAGTGCAGCGGCGAG"; >> ? ? ? ?DNASequence target = new DNASequence(targetSeq, >> AmbiguityDNACompoundSet.getDNACompoundSet()); >> ? ? ? ?String querySeq = >> "ACGAGTGCGTGTTTTCCCGCCTGGTCCCCAGGCCCCCTTTCCGTCCTCAGGAA" >> ? ? ? ? ? ? ? ?+ >> "GACAGAGGAGGAGCCCCTCGGGCTGCAGGTGGTGGGCGTTGCGGCGGCGGCCGGTTAAGGT" >> ? ? ? ? ? ? ? ?+ >> "TCCCAGTGCCCGCACCCGGCCCACGGGAGCCCCGGACTGGCGGCGTCACTGTCAGTGTCTT" >> ? ? ? ? ? ? ? ?+ >> "CTCAGGAGGCCGCCTGTGTGACTGGATCGTTCGTGTCCCCACAGCACGTTTCTTGGAGTAC" >> ? ? ? ? ? ? ? ?+ >> "TCTACGTCTGAGTGTCATTTCTTCAATGGGACGGAGCGGGTGCGGTTCCTGGACAGATACT" >> ? ? ? ? ? ? ? 
?+ >> "TCCATAACCAGGAGGAGAACGTGCGCTTCGACAGCGACGTGGGGGAGTTCCGGGCGGTGAC" >> ? ? ? ? ? ? ? ?+ >> "GGAGCTGGGGCGGCCTGATGCCGAGTACTGGAACAGCCAGAAGGACATCCTGGAAGACGAG" >> ? ? ? ? ? ? ? ?+ >> "CGGGCCGCGGTGGACACCTACTGCAGACACAACTACGGGGTTGTGAGAGCTTCACCGTGCA" >> ? ? ? ? ? ? ? ?+ "GCGGCGAGACGCACTCGT"; >> ? ? ? ?DNASequence query = new DNASequence(querySeq); > > query.setCompoundSet(AmbiguityDNACompoundSet.getDNACompoundSet()); // > inserting that helps. > >> ? ? ? ?SubstitutionMatrix matrix = >> SubstitutionMatrixHelper.getNuc4_4(); >> ? ? ? ?SequencePair psa = >> Alignments.getPairwiseAlignment(query, target, >> PairwiseSequenceAlignerType.LOCAL, new SimpleGapPenalty(), matrix); >> ? ? ? ?assertNotNull(psa); >> ? ?} > > But when I try something similar in my production code, I get an > java.lang.IndexOutOfBoundsException: Index: 0, Size: 0 again > > dnaSequence.setCompoundSet(AmbiguityDNACompoundSet.getDNACompoundSet()); > // If I remove this line, the exception is gone again, but I get NULL > result. > psa = Alignments.getPairwiseAlignment(dnaSequence, target, > PairwiseSequenceAlignerType.LOCAL, new SimpleGapPenalty(), matrix); > > the dnaSequence in that case is something that is passed to this > method, and is a sequence generated by a fasta reader - should have no > ambiguity in there, just plain ACGT. It has a plain DNACompoundSet > too. > > Hannes > From daniel.svozil at vscht.cz Tue Dec 6 11:33:24 2011 From: daniel.svozil at vscht.cz (Daniel Svozil) Date: Tue, 6 Dec 2011 12:33:24 +0100 Subject: [Biojava-l] mmView - a tool for mmCIF exploration In-Reply-To: References: Message-ID: Dear colleagues, We would like to announce the availability of mmView - the web-based application which allows to comfortably explore the structural data of biomacromolecules stored in the mmCIF (macromolecular Crystallographic Information File) format. 
The mmView software system is primarily intended for educational purposes but it can also serve as an auxiliary tool for working with biomolecular structures. The mmView application is offered in two flavors: as a publicly available web server http://ich.vscht.cz/projects/mmview/, and as an open-source stand-alone application (available from http://sourceforge.net/projects/mmview) that can be installed on the user?s computer. Petr Cech and Daniel Svozil -- Daniel Svozil, PhD Head of Laboratory of Informatics and Chemistry Institute of Chemical Technology Czech Republic phone: +420 220 444 391 http://ich.vscht.cz/~svozil From andreas at sdsc.edu Wed Dec 7 05:28:18 2011 From: andreas at sdsc.edu (Andreas Prlic) Date: Tue, 6 Dec 2011 21:28:18 -0800 Subject: [Biojava-l] IndexOutOfBounds Exception when performing Pairwise Alignment In-Reply-To: References: Message-ID: Hi Hannes, Couple of things in response to your mails: - thanks for providing the example, I have moved it to a Cookbook page, since we had several DNA alignment related questions recently. http://biojava.org/wiki/BioJava;CookBook3:PSA_DNA - I believe the AmbiguityDNACompoundSet is required, to match the compound set of the substitution matrix. - about your sequence matching strategy: this really depends on your size and similarity of DNA sequences. if you have many, then blast might be the way to go, otherwise just looking a number of identical positions (psa.getNumIdenticals()) might be a quick and dirty solution for this as well. - ConcurrencyTool is just a utility class for working with the Java concurrency framework. Should be easy to write your own Callable classes that do pairwise alignment, if you needs to do this in multiple threads... 
Andreas From andreas at sdsc.edu Wed Dec 7 19:33:50 2011 From: andreas at sdsc.edu (Andreas Prlic) Date: Wed, 7 Dec 2011 11:33:50 -0800 Subject: [Biojava-l] mmView - a tool for mmCIF exploration In-Reply-To: References: Message-ID: Thanks, Daniel, what is the connection to BioJava? This seems to be a Python project? Otherwise, nice tool! Andreas On Tue, Dec 6, 2011 at 3:33 AM, Daniel Svozil wrote: > Dear colleagues, > > We would like to announce the availability of mmView - the web-based > application which allows to comfortably explore the structural data of > biomacromolecules stored in the mmCIF (macromolecular Crystallographic > Information File) format. The mmView software system is primarily > intended for educational purposes but it can also serve as an > auxiliary tool for working with biomolecular structures. > > The mmView application is offered in two flavors: as a publicly > available web server http://ich.vscht.cz/projects/mmview/, and as an > open-source stand-alone application (available from > http://sourceforge.net/projects/mmview) that can be installed on the > user?s computer. > > Petr Cech and Daniel Svozil > > -- > Daniel Svozil, PhD > Head of Laboratory of Informatics and Chemistry > Institute of Chemical Technology > Czech Republic > > phone: +420 220 444 391 > http://ich.vscht.cz/~svozil > > _______________________________________________ > Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l From daniel.svozil at vscht.cz Wed Dec 7 19:43:23 2011 From: daniel.svozil at vscht.cz (Daniel Svozil) Date: Wed, 7 Dec 2011 20:43:23 +0100 Subject: [Biojava-l] mmView - a tool for mmCIF exploration In-Reply-To: <17c23e022b7c4cac9a273e5ded98874c@FE15.vscht.cz> References: <17c23e022b7c4cac9a273e5ded98874c@FE15.vscht.cz> Message-ID: Hi Andreas, thanks for the compliment. Yes, it is a Python tool, no direct connection ro BioJava. 
However, we were thinking somebody from bioinformatics community working with mmCIF files may find it useful. Daniel On Wed, Dec 7, 2011 at 8:33 PM, Andreas Prlic wrote: > Thanks, Daniel, > > what is the connection to BioJava? This seems to be a Python project? > Otherwise, nice tool! > > Andreas > > > On Tue, Dec 6, 2011 at 3:33 AM, Daniel Svozil wrote: >> Dear colleagues, >> >> We would like to announce the availability of mmView - the web-based >> application which allows to comfortably explore the structural data of >> biomacromolecules stored in the mmCIF (macromolecular Crystallographic >> Information File) format. The mmView software system is primarily >> intended for educational purposes but it can also serve as an >> auxiliary tool for working with biomolecular structures. >> >> The mmView application is offered in two flavors: as a publicly >> available web server http://ich.vscht.cz/projects/mmview/, and as an >> open-source stand-alone application (available from >> http://sourceforge.net/projects/mmview) that can be installed on the >> user?s computer. >> >> Petr Cech and Daniel Svozil >> >> -- >> Daniel Svozil, PhD >> Head of Laboratory of Informatics and Chemistry >> Institute of Chemical Technology >> Czech Republic >> >> phone: +420 220 444 391 >> http://ich.vscht.cz/~svozil >> >> _______________________________________________ >> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l -- Daniel Svozil, PhD Head of Laboratory of Informatics and Chemistry Institute of Chemical Technology Czech Republic phone: +420 220 444 391 http://ich.vscht.cz/~svozil From biojava at hannes.oib.com Mon Dec 19 13:06:17 2011 From: biojava at hannes.oib.com (=?ISO-8859-1?Q?Hannes_Brandst=E4tter=2DM=FCller?=) Date: Mon, 19 Dec 2011 14:06:17 +0100 Subject: [Biojava-l] Is a modification of the FASTA parser for my needs easy or should I implement something else? Message-ID: Hi! 
I have 2 files as output of my sequencer: one is a standard fasta, which is no problem, the other is a fasta file with the same headers, but instead of simple letters, there are integer values from 0 to 40 denoting the quality of the sequencing at this position. Is it easy to adapt the biojava fasta parser to read such files (by feeding different classes to the parser), or should I write a specialized parser from scratch? Hannes From p.j.a.cock at googlemail.com Wed Dec 21 09:21:32 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 21 Dec 2011 09:21:32 +0000 Subject: [Biojava-l] Is a modification of the FASTA parser for my needs easy or should I implement something else? In-Reply-To: References: Message-ID: On Monday, December 19, 2011, Hannes Brandst?tter-M?ller < biojava at hannes.oib.com> wrote: > Hi! > > I have 2 files as output of my sequencer: one is a standard fasta, > which is no problem, the other is a fasta file with the same headers, > but instead of simple letters, there are integer values from 0 to 40 > denoting the quality of the sequencing at this position. Is it easy to > adapt the biojava fasta parser to read such files (by feeding > different classes to the parser), or should I write a specialized > parser from scratch? > > Hannes You mean a QUAL file? Often named example.qual or example.fasta.qual and used in conjunction with a matching FASTA file as a clumsy alternative to FASTQ (taking more space on disk too). For Biopython we have a separate parser, although the Header line handling is the same, the quality sequence is quite different (splitting on white space, ASCII strings then converted to integers). Peter From andreas at sdsc.edu Wed Dec 21 14:21:17 2011 From: andreas at sdsc.edu (Andreas Prlic) Date: Wed, 21 Dec 2011 06:21:17 -0800 Subject: [Biojava-l] Is a modification of the FASTA parser for my needs easy or should I implement something else? 
In-Reply-To: References: Message-ID: Hi Hannes, if this is a more frequently used file format, would be great to get a parser for this... Any patches are welcome ! ;-) Andreas On Wed, Dec 21, 2011 at 1:21 AM, Peter Cock wrote: > On Monday, December 19, 2011, Hannes Brandst?tter-M?ller < > biojava at hannes.oib.com> wrote: >> Hi! >> >> I have 2 files as output of my sequencer: one is a standard fasta, >> which is no problem, the other is a fasta file with the same headers, >> but instead of simple letters, there are integer values from 0 to 40 >> denoting the quality of the sequencing at this position. Is it easy to >> adapt the biojava fasta parser to read such files (by feeding >> different classes to the parser), or should I write a specialized >> parser from scratch? >> >> Hannes > > You mean a QUAL file? Often named example.qual or > example.fasta.qual and used in conjunction with a matching > FASTA file as a clumsy alternative to FASTQ (taking more > space on disk too). > > For Biopython we have a separate parser, although the > Header line handling is the same, the quality sequence > is quite different (splitting on white space, ASCII strings > then converted to integers). > > Peter > > _______________________________________________ > Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l From biojava at hannes.oib.com Wed Dec 21 14:44:51 2011 From: biojava at hannes.oib.com (=?ISO-8859-1?Q?Hannes_Brandst=E4tter=2DM=FCller?=) Date: Wed, 21 Dec 2011 15:44:51 +0100 Subject: [Biojava-l] Is a modification of the FASTA parser for my needs easy or should I implement something else? In-Reply-To: References: Message-ID: On Wed, Dec 21, 2011 at 15:21, Andreas Prlic wrote: > Hi Hannes, > > if this is a more frequently used file format, would be great to get a > parser for this... Any patches are welcome ! 
;-) > > Andreas I just put together some kludgy thing that works for me now, but there needs to be something done for this. I mailed a bit back and forth with Peter, and he pointed me to http://news.open-bio.org/news/2009/12/nar-fastq-format/ where Biojava is mentioned to understand that quality information too. I guess that got lost in the move to 3.0? If nothing happens until March/April, I'll pick up this as my pet project and work to help biojava to support FASTQ and the FASTA/QUAL format. But until then I'm totally swamped. If someone else (perhaps the one who wrote the FASTA Parser initially) could pick that up, that would be great. Hannes From andreas at sdsc.edu Wed Dec 21 15:09:45 2011 From: andreas at sdsc.edu (Andreas Prlic) Date: Wed, 21 Dec 2011 07:09:45 -0800 Subject: [Biojava-l] Is a modification of the FASTA parser for my needs easy or should I implement something else? In-Reply-To: References: Message-ID: The fastq parser is in the legacy biojava 1.8 and can still be downloaded if you want. Not sure how hard it would be to migrate it to biojava3. A On Wed, Dec 21, 2011 at 6:44 AM, Hannes Brandst?tter-M?ller wrote: > On Wed, Dec 21, 2011 at 15:21, Andreas Prlic wrote: >> Hi Hannes, >> >> if this is a more frequently used file format, would be great to get a >> parser for this... Any patches are welcome ! ;-) >> >> Andreas > > I just put together some kludgy thing that works for me now, but there > needs to be something done for this. > I mailed a bit back and forth with Peter, and he pointed me to > http://news.open-bio.org/news/2009/12/nar-fastq-format/ where Biojava > is mentioned to understand that quality information too. I guess that > got lost in the move to 3.0? > > If nothing happens until March/April, I'll pick up this as my pet > project and work to help biojava to support FASTQ and the FASTA/QUAL > format. But until then I'm totally swamped. 
If someone else (perhaps > the one who wrote the FASTA Parser initially) could pick that up, that > would be great. > > Hannes From biojava at hannes.oib.com Thu Dec 22 07:34:32 2011 From: biojava at hannes.oib.com (=?ISO-8859-1?Q?Hannes_Brandst=E4tter=2DM=FCller?=) Date: Thu, 22 Dec 2011 08:34:32 +0100 Subject: [Biojava-l] Is a modification of the FASTA parser for my needs easy or should I implement something else? In-Reply-To: References: Message-ID: On Wed, Dec 21, 2011 at 16:09, Andreas Prlic wrote: > The fastq parser is in the legacy biojava 1.8 and can still be > downloaded if you want. Not sure how hard it would be to migrate it to > biojava3. > > A FASTQ support in 3.0 would be nice. If noone else is doing it, I'll take a look at migrating it, but that won't happen before April next year. Hannes From biojava at hannes.oib.com Thu Dec 22 09:18:37 2011 From: biojava at hannes.oib.com (=?ISO-8859-1?Q?Hannes_Brandst=E4tter=2DM=FCller?=) Date: Thu, 22 Dec 2011 10:18:37 +0100 Subject: [Biojava-l] Cookbook entry - feedback please Message-ID: Hi! I recently ran into the problem of having 2 very similar DNA Sequences and wanting to get (1) a difference count and (2) a consensus sequence. I asked on stackoverflow and biostar.stackexchange.com for some input, and the answers pointed me towards enums and a lookup table using these. So I sat down and wrote that (that LUT took me quite some time, I hope I did not make errors in there) and made a small class to quickly get to the result. I post the code as cookbook page for future reference, possible inclusion of the code into the library and of course, feedback. http://biojava.org/wiki/BioJava:CookBook:Core:SequenceCompare Hannes From andreas at sdsc.edu Thu Dec 22 13:29:32 2011 From: andreas at sdsc.edu (Andreas Prlic) Date: Thu, 22 Dec 2011 05:29:32 -0800 Subject: [Biojava-l] Cookbook entry - feedback please In-Reply-To: References: Message-ID: Did you try to align the sequences first? 
That would also give you the number of identical nucleotides.

Andreas

On Thu, Dec 22, 2011 at 1:18 AM, Hannes Brandstätter-Müller wrote:
> Hi!
>
> I recently ran into the problem of having 2 very similar DNA Sequences
> and wanting to get (1) a difference count and (2) a consensus
> sequence.
>
> I asked on stackoverflow and biostar.stackexchange.com for some input,
> and the answers pointed me towards enums and a lookup table using
> these. So I sat down and wrote that (that LUT took me quite some time,
> I hope I did not make errors in there) and made a small class to
> quickly get to the result.
>
> I post the code as cookbook page for future reference, possible
> inclusion of the code into the library and of course, feedback.
>
> http://biojava.org/wiki/BioJava:CookBook:Core:SequenceCompare
>
> Hannes

From biojava at hannes.oib.com Thu Dec 22 14:58:49 2011
From: biojava at hannes.oib.com (=?ISO-8859-1?Q?Hannes_Brandst=E4tter=2DM=FCller?=)
Date: Thu, 22 Dec 2011 15:58:49 +0100
Subject: [Biojava-l] Cookbook entry - feedback please
In-Reply-To:
References:
Message-ID:

In my case, the sequences were all pre-aligned, but yes, that could be
done first.

A question concerning consensus sequences: how do you handle gaps? N
(as I understood it) does not allow a gap. Is there a way to encode a
"might be gap here, or A or G"?

Hannes

On Thu, Dec 22, 2011 at 14:29, Andreas Prlic wrote:
> Did you try to align the sequences first? That would also give you the
> number of identical nucleotides.
>
> Andreas
>
> On Thu, Dec 22, 2011 at 1:18 AM, Hannes Brandstätter-Müller
> wrote:
>> Hi!
>>
>> I recently ran into the problem of having 2 very similar DNA Sequences
>> and wanting to get (1) a difference count and (2) a consensus
>> sequence.
>>
>> I asked on stackoverflow and biostar.stackexchange.com for some input,
>> and the answers pointed me towards enums and a lookup table using
>> these. So I sat down and wrote that (that LUT took me quite some time,
>> I hope I did not make errors in there) and made a small class to
>> quickly get to the result.
>>
>> I post the code as cookbook page for future reference, possible
>> inclusion of the code into the library and of course, feedback.
>>
>> http://biojava.org/wiki/BioJava:CookBook:Core:SequenceCompare
>>
>> Hannes

From andreas at sdsc.edu Thu Dec 22 15:26:44 2011
From: andreas at sdsc.edu (Andreas Prlic)
Date: Thu, 22 Dec 2011 07:26:44 -0800
Subject: [Biojava-l] Cookbook entry - feedback please
In-Reply-To:
References:
Message-ID:

> A question concerning consensus sequences: how do you handle gaps? N
> (as I understood it) does not allow a gap. Is there a way to encode a
> "might be gap here, or A or G"?

If the input is a multiple sequence alignment then you could count
frequencies at each position and take the most frequently occurring
nucleotide. For each position you could count a conservation score.

Andreas

From P.V.Troshin at dundee.ac.uk Fri Dec 23 23:47:58 2011
From: P.V.Troshin at dundee.ac.uk (Peter Troshin)
Date: Fri, 23 Dec 2011 23:47:58 +0000
Subject: [Biojava-l] Is a modification of the FASTA parser for my needs easy or should I implement something else?
Message-ID: <4EF5132E020000ED0001016B@ia-gw-6.dundee.ac.uk>

There is a nice FastQ parser and many other goodies available from
http://picard.sourceforge.net/command-line-overview.shtml (just
download the source code) if you need one.

Regards,
Peter

>>> Andreas Prlic 12/21/11 3:12 PM >>>
The fastq parser is in the legacy biojava 1.8 and can still be
downloaded if you want.
Not sure how hard it would be to migrate it to
biojava3.

A

On Wed, Dec 21, 2011 at 6:44 AM, Hannes Brandstätter-Müller wrote:
> On Wed, Dec 21, 2011 at 15:21, Andreas Prlic wrote:
>> Hi Hannes,
>>
>> if this is a more frequently used file format, would be great to get a
>> parser for this... Any patches are welcome ! ;-)
>>
>> Andreas
>
> I just put together some kludgy thing that works for me now, but there
> needs to be something done for this.
> I mailed a bit back and forth with Peter, and he pointed me to
> http://news.open-bio.org/news/2009/12/nar-fastq-format/ where Biojava
> is mentioned to understand that quality information too. I guess that
> got lost in the move to 3.0?
>
> If nothing happens until March/April, I'll pick up this as my pet
> project and work to help biojava to support FASTQ and the FASTA/QUAL
> format. But until then I'm totally swamped. If someone else (perhaps
> the one who wrote the FASTA Parser initially) could pick that up, that
> would be great.
>
> Hannes
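
For anyone who only needs the basic record structure while FASTQ support in BioJava 3 is pending, the four-line FASTQ layout discussed in this thread can be sketched in plain Java. This is only an illustrative sketch, not the BioJava 1.8 or Picard parser: the class name `FastqSketch`, its `Record` type and the `phredAt` helper are hypothetical, and it assumes unwrapped four-line records with Sanger-style (offset 33) quality encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Minimal four-line FASTQ reader (hypothetical; not the BioJava or Picard API). */
public class FastqSketch {

    /** One FASTQ record: identifier, bases, and the raw quality string. */
    public static final class Record {
        public final String id;
        public final String sequence;
        public final String quality;

        Record(String id, String sequence, String quality) {
            this.id = id;
            this.sequence = sequence;
            this.quality = quality;
        }

        /** Decodes one quality character, assuming a Sanger-style offset of 33. */
        public int phredAt(int index) {
            return quality.charAt(index) - 33;
        }
    }

    /** Reads all records; assumes each record occupies exactly four lines. */
    public static List<Record> parse(Reader in) {
        List<Record> records = new ArrayList<>();
        try {
            BufferedReader reader = new BufferedReader(in);
            String header;
            while ((header = reader.readLine()) != null) {
                if (header.isEmpty()) {
                    continue; // tolerate blank lines between records
                }
                String sequence = reader.readLine();
                String separator = reader.readLine();
                String quality = reader.readLine();
                if (!header.startsWith("@") || sequence == null || separator == null
                        || quality == null || !separator.startsWith("+")) {
                    throw new IllegalArgumentException("Malformed FASTQ record near: " + header);
                }
                if (sequence.length() != quality.length()) {
                    throw new IllegalArgumentException("Sequence/quality length mismatch in: " + header);
                }
                // Drop the leading '@' from the header line to get the identifier.
                records.add(new Record(header.substring(1), sequence, quality));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        String example = "@read1\nACGT\n+\nII5!\n";
        for (Record r : parse(new StringReader(example))) {
            System.out.println(r.id + " " + r.sequence + " Q0=" + r.phredAt(0));
        }
    }
}
```

The sketch keeps the quality string alongside the bases rather than discarding it, which is the piece the FASTA parser cannot carry. Multi-line (wrapped) FASTQ and the other quality offsets are deliberately left out.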