From markjschreiber at gmail.com  Fri Dec  1 23:27:16 2006
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Sat, 2 Dec 2006 12:27:16 +0800
Subject: [Biojava-l] [Biojava-dev] Willing to Contribute in BioJava
In-Reply-To:
References:
Message-ID: <93b45ca50612012027t28154242g540f847649dd6791@mail.gmail.com>

Hi -

There are some areas of the cookbook that need updating to make use of
the new biojava1.5 APIs. We also need more unit tests for a number of
classes.

- Mark

On 11/27/06, Sulaman Nawaz wrote:
> Subject: Willing to Contribute in BioJava
>
> Hi everybody,
>
> I mailed you earlier about contributing to BioJava. I have gone through
> the API and the Cookbook. Please help me identify how I can contribute
> to this open source project, or create a bio application using BioJava.
> I know Java programming, XML, databases (SQL), etc. I want to do this as
> a course project, and it could even be extended into a final program
> project. I
>
> ..............................
> WITH
> BEST REGARDS
> SULAMAN NAWAZ
>
> E_MAIL:kingsulaman at hotmail.com(primary)
> da_green_berret at yahoo.com(sec.)
> phone +923005096825(mobile)
> +92514580113(res.)
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-dev

From ilhami.visne at gmail.com  Tue Dec 12 17:57:07 2006
From: ilhami.visne at gmail.com (ilhami visne)
Date: Tue, 12 Dec 2006 23:57:07 +0100
Subject: [Biojava-l] Restriction Mapper - Thread (or dual core cpu) problem
Message-ID: <457F33C3.8070705@gmail.com>

Hello,

Last summer I wrote a program that uses Restriction Mapper. As in the
example (if I remember correctly), I used one thread per enzyme, and I
got an error every time. Then I noticed that with only one enzyme I got
no error. I thought this could be a thread-safety issue, because with
more than one enzyme more than one thread runs. So I changed my program
to be single-threaded, and it worked well, even for many enzymes. Until
this week...

One of my clients ran my program on a dual-CPU machine. Guess what? The
same error again! I have a single-CPU laptop; a friend of mine has a
dual-core laptop, so I tried it on that machine myself. And yes, that is
the problem: for the same file I get no error on my single-core machine,
but the same error every time on the dual-core CPU.

Two more important pieces of information:

1. Here I got an error for HpaII, but it can be any other enzyme.
2. My file has 24000 sequences. The sequence at which the exception is
   thrown is random too: sometimes the 5600th sequence, another time the
   17456th. I checked, and all sequences are normal.

Is this a known issue? Is there any solution?

The exception:

Exception in thread "Thread-13" org.biojava.bio.BioRuntimeException: Failed to complete search for HpaII CCGG (1/3)
	at org.biojava.bio.molbio.RestrictionSiteFinder.run(RestrictionSiteFinder.java:137)
	at org.biojava.utils.SimpleThreadPool$PooledThread.run(SimpleThreadPool.java:295)
Caused by: java.lang.NullPointerException
	at org.biojava.bio.seq.io.SymbolListCharSequence.charAt(SymbolListCharSequence.java:115)
	at java.lang.Character.codePointAt(Unknown Source)
	at java.util.regex.Pattern$Single.match(Unknown Source)
	at java.util.regex.Pattern$Curly.match(Unknown Source)
	at java.util.regex.Pattern$Start.match(Unknown Source)
	at java.util.regex.Matcher.search(Unknown Source)
	at java.util.regex.Matcher.find(Unknown Source)
	at org.biojava.bio.molbio.RestrictionSiteFinder.run(RestrictionSiteFinder.java:104)
	... 1 more

Thanks in advance.

P.S.: Here I got a NullPointerException. If I remember correctly, back
then I got an ArrayIndexOutOfBoundsException; the index was bigger than
the length of the sequence.
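
The trace above shows the regex search reading the sequence through
SymbolListCharSequence.charAt from a SimpleThreadPool worker thread, so
pool threads are involved even when the calling program is
single-threaded. This thread does not establish the actual cause inside
BioJava. As a general defensive pattern, though, each worker can be
given an immutable String snapshot of the sequence (for example from
SymbolList.seqString(); that this would avoid the error reported here is
an assumption) instead of a shared view. The sketch below is plain Java,
not BioJava code: SiteCounter, countSites and countAll are hypothetical
names, and recognition sites are treated as plain regex patterns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of BioJava.
class SiteCounter {

    // Counts non-overlapping matches of one site pattern in an immutable String.
    static int countSites(String sequence, String site) {
        Matcher matcher = Pattern.compile(site).matcher(sequence);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    // Runs one task per site on a fixed pool. The tasks share only the
    // immutable String, so no task can observe a half-built sequence view.
    static List<Integer> countAll(String sequence, List<String> sites, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String site : sites) {
                Callable<Integer> task = () -> countSites(sequence, site);
                futures.add(pool.submit(task));
            }
            List<Integer> counts = new ArrayList<>();
            for (Future<Integer> future : futures) {
                counts.add(future.get()); // blocks until that task has finished
            }
            return counts;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("site search failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> sites = Arrays.asList("CCGG", "GAATTC"); // HpaII and EcoRI sites
        System.out.println(countAll("AACCGGTTCCGGGAATTC", sites, 2)); // prints [2, 1]
    }
}
```

Results come back in submission order because the futures are read in
the order they were created, regardless of which worker finishes first.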
From mark.schreiber at novartis.com  Tue Dec 12 20:32:57 2006
From: mark.schreiber at novartis.com (mark.schreiber at novartis.com)
Date: Wed, 13 Dec 2006 09:32:57 +0800
Subject: [Biojava-l] Restriction Mapper - Thread (or dual core cpu) problem
Message-ID:

This does indeed sound like a thread problem. Can you post this to the
BioJava Bugzilla (there is a link on the homepage) to make sure we fix
it? Can you post code that replicates the bug as well?

This raises an interesting point. Because Java is inherently
multi-threaded, on multi-core machines you can expose thread issues you
never knew you had. We may see more of this sort of thing for a while as
dual-core CPUs become a commodity.

- Mark

Mark Schreiber
Research Investigator (Bioinformatics)

Novartis Institute for Tropical Diseases (NITD)
10 Biopolis Road
#05-01 Chromos
Singapore 138670
www.nitd.novartis.com
www.dengueinfo.org

phone +65 6722 2973
fax +65 6722 2910

From mucous at gmail.com  Fri Dec 22 15:26:10 2006
From: mucous at gmail.com (Denis Yuen)
Date: Fri, 22 Dec 2006 15:26:10 -0500
Subject: [Biojava-l] High Order HMM
Message-ID: <1b61230612221226r43ac6351sec9bd8b9521e7fbb@mail.gmail.com>

Hi,

New to HMMs and BioJava, so what I'm asking for is probably a dumb
question. But I figure it better to ask it rather than sit here and be
puzzled...

From the wiki article
http://www.biojava.org/wiki/BioJava:Tutorial:Dynamic_programming_examples
and the post
http://portal.open-bio.org/pipermail/biojava-l/2006-March/005387.html

I get the sense that in order to create a third-order HMM, reading a
protein sequence, and emitting symbols (e.g. create an alphabet
TriGreek from "alpha","beta","delta"), you would need to create one
state for each amino acid, and associate each state with an
OrderNDistribution using a cross product alphabet as in
AlphabetManager.generateCrossProductAlphaFromName("(Protein x Protein x TriGreek)").

So if you walked through a trimer AGF which emitted "alpha", you would
end in the state "F", which uses an OrderNDistribution where the first
protein (in the cross product alphabet) corresponds to the "A", the
second protein corresponds to the "G", and the last term corresponds
to "alpha."

This seems odd, so what I don't get is: should I be mixing emissions
with previous states in the cross product alphabet to create a third
order HMM? Or is there a better way?

I'm even more confused about how to define transition weights.

Obviously, I'm wrong about something... How do you define
states/distributions in a third order HMM?

Thanks

From markjschreiber at gmail.com  Sat Dec 23 07:08:58 2006
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Sat, 23 Dec 2006 20:08:58 +0800
Subject: [Biojava-l] High Order HMM
In-Reply-To: <1b61230612221226r43ac6351sec9bd8b9521e7fbb@mail.gmail.com>
References: <1b61230612221226r43ac6351sec9bd8b9521e7fbb@mail.gmail.com>
Message-ID: <93b45ca50612230408y712e1220sa29d29ac935f7088@mail.gmail.com>

> Hi,
>
> New to HMMs and BioJava, so what I'm asking for is probably a dumb question.
> But I figure it better to ask it rather than sit here and be puzzled...
>
> From the wiki article
> http://www.biojava.org/wiki/BioJava:Tutorial:Dynamic_programming_examples
> and the post http://portal.open-bio.org/pipermail/biojava-l/2006-March/005387.html
>
> I get the sense that in order to create a third-order HMM, reading a
> protein sequence, and emitting symbols (e.g. create an alphabet
> TriGreek from "alpha","beta","delta"), you would need to create one
> state for each amino acid, and associate each state with a
> OrderNDistribution using a cross product alphabet as in
> AlphabetManager.generateCrossProductAlphaFromName("(Protein x Protein
> x TriGreek)").
>
> So if you walked through a trimer AGF which emitted "alpha", you would
> end in the state "F", which uses a OrderNDistribution where the first
> protein (in the cross product alphabet) corresponds to the "A", the
> second protein corresponds to the "G", and the last term corresponds
> to "alpha."
>

Your problem sounds like you are trying to estimate observations of
amino acid delta based on the previous 2 observations (a second order
model). Thus you would use an OrderNDistribution in which p(Delta) is
conditioned on ProteinxProtein.

> This seems odd, so what I don't get, is should I be mixing emissions
> with previous states in the cross product alphabet to create a third
> order HMM? Or is there a better way?

An alternative would be to have your states emit 3 amino acids at
once. This would be a normal Distribution over the alphabet
proteinXproteinXprotein. Each amino acid triple would be completely
independent of the previous triple. This is not the same as the
OrderNDistribution, which emits single amino acids based on the
previous two.

>
> I'm even more confused about how to define transition weights.
>

Each state contains a Distribution of States. These states are from
the Alphabet of States that the state is connected to. The State
classes implement Symbol, so they can belong to Alphabets.
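
To make the distinction drawn earlier in this reply concrete: a
second-order emission model estimates p(symbol | previous two symbols),
whereas a triple-emitting state treats each block of three as one
independent draw. The sketch below shows only the first case, using
plain counts in plain Java. It does not use BioJava's
OrderNDistribution; SecondOrderModel, train and probability are
hypothetical names, and the maximum-likelihood estimate without
pseudocounts is an assumption made for brevity.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not BioJava code:
// estimates p(next symbol | previous two symbols) from observed counts.
class SecondOrderModel {

    // context (two preceding symbols) -> next symbol -> observed count
    private final Map<String, Map<Character, Integer>> counts = new HashMap<>();

    // Slides over the sequence one symbol at a time; every symbol after
    // the first two is one observation conditioned on its two predecessors.
    void train(String seq) {
        for (int i = 2; i < seq.length(); i++) {
            String context = seq.substring(i - 2, i);
            counts.computeIfAbsent(context, k -> new HashMap<>())
                  .merge(seq.charAt(i), 1, Integer::sum);
        }
    }

    // Maximum-likelihood estimate; returns 0.0 for a context never seen.
    double probability(String context, char next) {
        Map<Character, Integer> row = counts.get(context);
        if (row == null) {
            return 0.0;
        }
        int total = 0;
        for (int count : row.values()) {
            total += count;
        }
        return row.getOrDefault(next, 0) / (double) total;
    }

    public static void main(String[] args) {
        SecondOrderModel model = new SecondOrderModel();
        model.train("AGFAGFAGY");
        System.out.println(model.probability("AG", 'F'));
    }
}
```

In the training string the context AG is followed by F twice and by Y
once, so probability("AG", 'F') evaluates to 2/3. A triple-emitting
model would instead count AGF, AGF and AGY as three whole symbols with
no dependence between consecutive blocks.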
The Distribution of States gives the probability of transitioning to
each State in the Alphabet of States that the origin state connects
to. If your model is fully ergodic, each state connects to every other
state, so the transition Alphabet contains every other state (in fact,
in fully ergodic models states can connect to themselves, so the
transition Alphabet would include all states including the Magic
state). If your model has a more complex architecture, then the
transition Alphabet will include only the states you can transition
to.

Hope this helps.

> Obviously, I'm wrong about something... How do you define
> states/distributions in a third order HMM?
>
> Thanks
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

From markjschreiber at gmail.com  Fri Dec 29 03:03:45 2006
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Fri, 29 Dec 2006 16:03:45 +0800
Subject: [Biojava-l] Biojava 1.5 beta2 Released
Message-ID: <93b45ca50612290003m33458ea0k918d2e6bf82d0f1a@mail.gmail.com>

Dear All -

Just in time for 2007 we have released a new beta release of BioJava
1.5. The new version has quite a few more bug fixes and also contains
a (very) experimental preview of a phylogenetics package.

The release can be obtained at http://biojava.org/wiki/BioJava:Download

Happy New Year!

- Mark