From huijieqiao at gmail.com Thu Apr 1 23:02:37 2010
From: huijieqiao at gmail.com (Huijie Qiao)
Date: Fri, 2 Apr 2010 11:02:37 +0800
Subject: [Biojava-l] A bug in Class "org.biojavax.bio.seq.io.GenbankFormat"
Message-ID:

version 1.7.1

line 361
else if (sectionKey.equals(SOURCE_TAG)) {
    // ignore - can get all this from the first feature

Actually, the content of the SOURCE_TAG section and the first feature differ
in some gb files.

For example, the example file in
http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchseq_help.html
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=5&rettype=gb

The Source TAG is
SOURCE      Bos taurus (cattle)
  ORGANISM  Bos taurus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Laurasiatheria; Cetartiodactyla; Ruminantia;
            Pecora; Bovidae; Bovinae; Bos.

and the first feature tag is
FEATURES             Location/Qualifiers
     source          1..1136
                     /organism="Bos taurus"
                     /mol_type="mRNA"
                     /db_xref="taxon:9913"
                     /clone="pBB2I"
                     /tissue_type="liver"

I can't get the hierarchy info with the following code:
NCBITaxon taxon = seq.getTaxon();
System.out.println(taxon.getNameHierarchy());
The output is ".".

From holland at eaglegenomics.com Fri Apr 2 03:38:44 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Fri, 2 Apr 2010 08:38:44 +0100
Subject: [Biojava-l] A bug in Class "org.biojavax.bio.seq.io.GenbankFormat"
In-Reply-To:
References:
Message-ID: <8319D21D-9548-438A-BCB4-0CB9C5B7F568@eaglegenomics.com>

The parsers don't load the hierarchy from Genbank because it is redundant information separately available from NCBI taxonomy. Also it tends to be buggy and can differ between Genbank files for the same organism.

If you want the hierarchy, you need to be using BioJava in conjunction with BioSQL and load the NCBI taxonomy into your BioSQL instance ( http://www.biojava.org/wiki/BioJava:BioJavaXDocs#NCBI_Taxonomy_data ), from where BioJava can then retrieve it using the sample code you show in your email.
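For readers who only need the lineage exactly as it is printed in the flat file, and who accept the caveat above that it may be inconsistent between files, a minimal sketch in plain Java follows. It makes no BioJava calls; the class name OrganismLineage and its parse method are hypothetical, and it assumes the usual flat-file layout in which the lineage lines are indented and follow the ORGANISM line. Organism names that wrap onto a second line are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of BioJava: reads the lineage that follows
// the ORGANISM line of a GenBank flat-file record.
final class OrganismLineage {

    /** Returns the lineage names in file order, or an empty list if none is found. */
    static List<String> parse(String record) {
        StringBuilder lineage = new StringBuilder();
        boolean inOrganism = false;
        for (String line : record.split("\r?\n")) {
            String trimmed = line.trim();
            if (!inOrganism) {
                // The ORGANISM line itself carries the species name, so skip it.
                if (trimmed.startsWith("ORGANISM")) {
                    inOrganism = true;
                }
                continue;
            }
            // A blank or non-indented line ends the block (e.g. REFERENCE, FEATURES).
            if (trimmed.isEmpty() || !Character.isWhitespace(line.charAt(0))) {
                break;
            }
            lineage.append(trimmed).append(' ');
        }
        List<String> names = new ArrayList<>();
        for (String part : lineage.toString().split(";")) {
            String name = part.trim();
            if (name.endsWith(".")) {
                // The last lineage entry is terminated by a full stop.
                name = name.substring(0, name.length() - 1).trim();
            }
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Sample text taken from the record quoted in this thread.
        String record =
              "SOURCE      Bos taurus (cattle)\n"
            + "  ORGANISM  Bos taurus\n"
            + "            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;\n"
            + "            Mammalia; Eutheria; Laurasiatheria; Cetartiodactyla; Ruminantia;\n"
            + "            Pecora; Bovidae; Bovinae; Bos.\n"
            + "FEATURES             Location/Qualifiers\n";
        System.out.println(String.join(" > ", parse(record)));
    }
}
```

The sketch deliberately stays outside the parser: it only reflects what one file says, whereas the BioSQL route described above gives one consistent hierarchy per taxon.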

thanks,
Richard

On 2 Apr 2010, at 04:02, Huijie Qiao wrote:

> version 1.7.1
>
> line 361
> else if (sectionKey.equals(SOURCE_TAG)) {
>     // ignore - can get all this from the first feature
>
> Actually, the content of the SOURCE_TAG section and the first feature differ
> in some gb files.
>
> For example, the example file in
> http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchseq_help.html
> http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=5&rettype=gb
>
> The Source TAG is
> SOURCE      Bos taurus (cattle)
>   ORGANISM  Bos taurus
>             Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
>             Mammalia; Eutheria; Laurasiatheria; Cetartiodactyla; Ruminantia;
>             Pecora; Bovidae; Bovinae; Bos.
>
> and the first feature tag is
> FEATURES             Location/Qualifiers
>      source          1..1136
>                      /organism="Bos taurus"
>                      /mol_type="mRNA"
>                      /db_xref="taxon:9913"
>                      /clone="pBB2I"
>                      /tissue_type="liver"
>
> I can't get the hierarchy info with the following code:
> NCBITaxon taxon = seq.getTaxon();
> System.out.println(taxon.getNameHierarchy());
> The output is ".".
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From martin.jones at ed.ac.uk Fri Apr 2 07:23:21 2010
From: martin.jones at ed.ac.uk (Martin Jones)
Date: Fri, 2 Apr 2010 12:23:21 +0100
Subject: [Biojava-l] A bug in Class "org.biojavax.bio.seq.io.GenbankFormat"
In-Reply-To: <8319D21D-9548-438A-BCB4-0CB9C5B7F568@eaglegenomics.com>
References: <8319D21D-9548-438A-BCB4-0CB9C5B7F568@eaglegenomics.com>
Message-ID:

You can also get the hierarchy directly from the NCBI taxonomy dump... this is in Groovy but gives you the idea:

HashMap taxid2node = [:]
HashMap child2parent = [:]
// nodes.dmp columns: tax_id | parent tax_id | rank | ...
def nodePattern = ~/^(\d+)\t\|\t(\d+)\t\|\t(.+?)\t\|/
def count=0
new File("/home/martin/nodes.dmp").eachLine{ line ->
    count++
    def matcher = (line =~ nodePattern)
    if (matcher.matches()){
        Integer myId = matcher[0][1].toInteger()
        Integer parentId = matcher[0][2].toInteger()
        String myRank = matcher[0][3]
        // TreeNode is not defined in this message
        def node = new TreeNode(taxid : myId, rank:myRank)
        taxid2node[(myId)] = node
        child2parent[(myId)] = parentId
    }
}
// do something with the hash

-Martin

From andreas.prlic at gmail.com Sat Apr 3 11:08:57 2010
From: andreas.prlic at gmail.com (Andreas Prlic)
Date: Sat, 3 Apr 2010 08:08:57 -0700
Subject: [Biojava-l] Anonymous svn down
Message-ID: <2AF9846C-41A0-4CEB-9589-6DB698B2107A@gmail.com>

Hi,

the anonymous svn server seems to be down again. I have already contacted support @ obf, but have not received a response on when it should be back up. In the meantime, is anybody volunteering to set up a fallback mirror at github?

Andreas

From rmb32 at cornell.edu Sat Apr 3 16:09:27 2010
From: rmb32 at cornell.edu (Robert Buels)
Date: Sat, 03 Apr 2010 13:09:27 -0700
Subject: [Biojava-l] Google Summer of Code is *ON* for OBF projects!
Message-ID: <4BB7A077.4070802@cornell.edu>

Hi all,

Reminder: GSoC student proposals must be submitted to Google by April 9th, 19:00 UTC. That's less than a week away.

Students: you should ALREADY be working with mentors on the project mailing lists; they can help you get your proposal into shape.

So far, we have 5 proposals submitted to our org in Google's web app.
Keep them coming, and let's see some really good ones!

Rob Buels
OBF GSoC 2010 Administrator

From jianjiong.gao at gmail.com Sun Apr 4 02:33:15 2010
From: jianjiong.gao at gmail.com (Jianjiong Gao)
Date: Sun, 4 Apr 2010 01:33:15 -0500
Subject: [Biojava-l] GSoC project question
Message-ID:

Hello,

My name is Jianjiong Gao, a graduate student in the Computer Science Department at the University of Missouri-Columbia. I am very interested in applying for your GSoC project "Identification and Classification of Posttranslational Modification of Proteins". This project is highly related to my dissertation topic, "Bioinformatic analysis and prediction of phosphorylation and other PTMs." Although I have not touched the structural side of PTMs so far, I am really interested in learning about it and expanding my research into this field.

After reading the project description on the idea page (http://biojava.org/wiki/Google_Summer_of_Code), I have several questions regarding the *approach* section:

> 1. Establish a list of known PTMs and write code to locate these PTMs in a 3D protein structure.

Q1: There are many different types of PTMs. Do you have a list of PTMs of interest? Do you have priorities among the different PTMs?

Q2: Is there any available algorithm to locate the PTMs in a 3D protein structure? What is the difficulty in this task?

Q3: The PDB file contains annotations of residue modifications such as HETATM and MODRES. Can we utilize this information for localizing the PTMs?

> 2. Determine the protein residues that carry PTMs based on distance thresholds.
> 3. Traverse the sugar molecules and establish their link pattern based on connectivity.

Q4: Is this task to determine the types of glycosylation, i.e., N-linked glycosylation, O-N-acetylgalactosamine, O-glucose, etc.?

Q5: Is there any available algorithm to do this? What is the difficulty in this task? It looks complicated, with so many different types of glycosylation and structural isomers.

> 4. Present the PTMs as text in a linear notation and 2D graphical representations if time permits.

Q6: Can we use the SMILES format (http://en.wikipedia.org/wiki/Simplified_molecular_input_line_entry_specification) here? Or do we have any better options?

Thanks very much for your time. I am looking forward to hearing from you.

Best Regards,
-JJ

From rmb32 at cornell.edu Sun Apr 4 00:37:38 2010
From: rmb32 at cornell.edu (Robert Buels)
Date: Sat, 03 Apr 2010 21:37:38 -0700
Subject: [Biojava-l] Reminder: GSoC student applications due April 9, 19:00 UTC
Message-ID: <4BB81792.8060001@cornell.edu>

Hi all,

Sending this again with a different subject line, just in case.

GSoC student proposals must be submitted to Google through their web application by *April 9th, 19:00 UTC*. That's less than a week away.

Students: you should ALREADY be working with mentors on the project mailing lists; they can help you get your proposal into shape.

So far, we have 6 proposals submitted to our org in Google's web app. Keep them coming, and keep them good!

Rob Buels
OBF GSoC 2010 Administrator

From nagendravns at gmail.com Sun Apr 4 12:12:11 2010
From: nagendravns at gmail.com (nagendra kumar)
Date: Sun, 4 Apr 2010 21:42:11 +0530
Subject: [Biojava-l] how to add api
Message-ID:

Sir, I want to develop a project with BioJava. Please give me details on how to install the BioJava API on my system.

From chapman at cs.wisc.edu Sun Apr 4 13:54:59 2010
From: chapman at cs.wisc.edu (Mark Chapman)
Date: Sun, 04 Apr 2010 12:54:59 -0500
Subject: [Biojava-l] how to add api
In-Reply-To:
References:
Message-ID: <4BB8D273.7080601@cs.wisc.edu>

Everything you need is at:
http://biojava.org/wiki/BioJava:Download

On 4/4/2010 11:12 AM, nagendra kumar wrote:
> Sir, I want to develop a project with BioJava. Please give me details on
> how to install the BioJava API on my system.
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

From anantpossible at gmail.com Sun Apr 4 13:58:15 2010
From: anantpossible at gmail.com (Anant Jain)
Date: Sun, 4 Apr 2010 23:28:15 +0530
Subject: [Biojava-l] how to add api
In-Reply-To:
References:
Message-ID:

On 4/4/10, nagendra kumar wrote:
>
> Sir, I want to develop a project with BioJava. Please give me details on
> how to install the BioJava API on my system.
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

Hi,

To use the BioJava API, all you need to do is download the BioJava jar and perform the following steps:

1. Extract the jar; you will get some more jars and files.
2. Paste these jars into the following location: "C:\Program Files\Java\jre6\lib\ext", if Java is installed on your C drive.
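Whichever way the jars are installed (the lib\ext directory described above, or the -cp option of java and javac, which leaves the JRE untouched and also works on Java 9 and later, where the lib\ext mechanism no longer exists), it helps to confirm that the JVM can actually see them. A minimal sketch follows; the class name ClasspathCheck is hypothetical, and org.biojava.bio.seq.DNATools is used as the probe on the assumption that the BioJava 1.x core jar is the one being installed.

```java
// Hypothetical helper: reports whether a class is visible to the running JVM.
final class ClasspathCheck {

    /** Returns true if the named class can be located without initializing it. */
    static boolean isAvailable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Default probe: a class from the BioJava 1.x core jar (assumption);
        // pass any other fully qualified class name as the first argument.
        String probe = args.length > 0 ? args[0] : "org.biojava.bio.seq.DNATools";
        System.out.println(probe
                + (isAvailable(probe) ? " found" : " NOT found on the classpath"));
    }
}
```

Run it with the same classpath settings as the project itself; a "NOT found" result means the jars are not where the JVM is looking.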

--
Anant Jain
B.Tech Bioinformatics, RHCE

From sacomoto at gmail.com Tue Apr 6 01:29:23 2010
From: sacomoto at gmail.com (Gustavo Akio Tominaga Sacomoto)
Date: Tue, 6 Apr 2010 02:29:23 -0300
Subject: [Biojava-l] GSoC project on MSA
Message-ID:

Hello,

I'm currently a graduate student at the University of São Paulo (Brazil) and I'm quite interested in applying for the all-Java MSA project. I'm already familiar with the multiple sequence alignment problem: I developed a lossless filter for this problem as my undergraduate final project. The work is described here [http://www.almob.org/content/4/1/3] and there is an online version of the algorithm here [http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu].

Now, regarding the project, just to make it clear: when you say "straightforward approach for building up the MSA progressively", you mean the standard dynamic programming approach for pairwise alignment following the guide tree built in the second step, right?

One last question: should I send my proposal directly to Google's web app or here first?

Thanks,

Gustavo Sacomoto

From andreas at sdsc.edu Tue Apr 6 13:46:16 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 6 Apr 2010 10:46:16 -0700
Subject: [Biojava-l] GSoC project on MSA
In-Reply-To:
References:
Message-ID:

Hi Gustavo,

With straightforward I meant that we only have 3 months for this project and we should not try to solve all problems at the same time. Probably a realistic approach is to start with trying to keep things modular and simple (think interfaces and implementations) and stick to standard solutions that have been shown to work elsewhere. If there is more time in the project, one can then replace some of the implementations with technically more advanced ones.

Since we are doing things in Java, I am interested in having support for parallelisation wherever possible. Another issue is how to verify that the created alignments are meaningful. One could e.g. use the biojava structure modules to calculate protein structure alignments to verify the quality of the obtained multiple sequence alignments.

All applications have to be made via Google. We provide comments on drafts of proposals and try to work together with applicants to improve the submissions. Note: The application deadline is soon and speed is important now.

Andreas

--
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From sacomoto at gmail.com Tue Apr 6 14:53:04 2010
From: sacomoto at gmail.com (Gustavo Akio Tominaga Sacomoto)
Date: Tue, 6 Apr 2010 15:53:04 -0300
Subject: [Biojava-l] GSoC project on MSA
In-Reply-To:
References:
Message-ID:

Hello Andreas,

On Tue, Apr 6, 2010 at 2:46 PM, Andreas Prlic wrote:
> Hi Gustavo,
>
> With straightforward I meant that we only have 3 months for this project and
> we should not try to solve all problems at the same time. Probably a
> realistic approach is to start with trying to keep things modular and simple
> (think interfaces and implementations) and stick to standard solutions that
> have been shown to work elsewhere. If there is more time in the project, one
> can then replace some of the implementations with technically more advanced
> ones.

I think my question wasn't very clear. My intention in this project is to follow the approach (with the three steps) outlined on the project's page, using the classical progressive alignment heuristic: build the distance matrix, build the guide tree and, using this tree, progressively align more sequences together.

What I propose for the third step is a first implementation using the (simpler) dynamic programming described in the first CLUSTAL paper (I think it's from 1988), incrementally improving the algorithm to get closer to the one described in the CLUSTALW paper (from 1994). Is this more or less what you had in mind?

> Since we are doing things in Java, I am interested in having support for
> parallelisation wherever possible. Another issue is how to verify that the
> created alignments are meaningful. One could e.g. use the biojava structure
> modules to calculate protein structure alignments to verify the quality of
> the obtained multiple sequence alignments.

About parallel strategies, I think a relatively easy way to use them is in the distance matrix construction: we could have several threads calculating the pairwise alignments for different pairs of sequences in the set.

Now, measuring alignment quality is a tougher issue. The CLUSTALW paper doesn't give any way to measure the quality of the result; they consider a good alignment to be one that is hard to improve by eye (but they claim that for sufficiently similar sequences, with no pair less than 35% identical, the results are good). Can I do the same as in the CLUSTALW paper and leave the quality measure to the user? How concerned should I be with that in this project?

> All applications have to be made via Google. We provide comments on
> drafts of proposals and try to work together with applicants to improve the
> submissions. Note: The application deadline is soon and speed is important
> now.

I will try to send a proposal draft to this mailing list by tomorrow to get some feedback from you.

Thanks for your help.

gustavo

From andreas at sdsc.edu Tue Apr 6 17:27:15 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 6 Apr 2010 14:27:15 -0700
Subject: [Biojava-l] GSoC project on MSA
In-Reply-To:
References:
Message-ID:

Hi Gustavo,

In principle I agree to all, see details below:

> I think my question wasn't very clear. My intention in this project is
> to follow the approach (with the three steps) outlined on the project's
> page, using the classical progressive alignment heuristic: build the
> distance matrix, build the guide tree and, using this tree,
> progressively align more sequences together.

yes

> What I propose for the third step is a first implementation using the
> (simpler) dynamic programming described in the first CLUSTAL paper
> (I think it's from 1988), incrementally improving the algorithm to
> get closer to the one described in the CLUSTALW paper (from 1994). Is this
> more or less what you had in mind?

yes, sounds good.

> About parallel strategies, I think a relatively easy way to use them
> is in the distance matrix construction: we could have several threads
> calculating the pairwise alignments for different pairs of sequences in
> the set.

Correct. Probably a first implementation would be for a single machine / multi CPU. More advanced implementations could provide support e.g. for Map/Reduce, JPPF, or something like that...

> Now, measuring alignment quality is a tougher issue. The CLUSTALW
> paper doesn't give any way to measure the quality of the result; they
> consider a good alignment to be one that is hard to improve by eye (but
> they claim that for sufficiently similar sequences, with no pair less than
> 35% identical, the results are good). Can I do the same as in the CLUSTALW
> paper and leave the quality measure to the user? How concerned should
> I be with that in this project?

Getting an overall core algorithm that works should be the priority. The benchmarking part is not mandatory, but something to keep in mind... I have plenty of material for that, once we get to that stage...

> I will try to send a proposal draft to this mailing list by tomorrow
> to get some feedback from you.

Excellent, looking forward to it.

Andreas

--
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From sacomoto at gmail.com Wed Apr 7 01:29:31 2010
From: sacomoto at gmail.com (Gustavo Akio Tominaga Sacomoto)
Date: Wed, 7 Apr 2010 02:29:31 -0300
Subject: [Biojava-l] GSoC project on MSA
In-Reply-To:
References:
Message-ID:

Hi Andreas,

My proposal is pasted at the end of this e-mail. I'm waiting for your feedback.

Thanks,

gustavo

-------------------------------------------------------------

GSoC proposal

Abstract
--------

This project aims to develop an all-Java implementation of a multiple sequence alignment (MSA) algorithm to be added to the Biojava toolkit, using the progressive algorithm described in the CLUSTALW paper [1].

The Importance
--------------

Multiple sequence alignment is a frequently performed task in sequence analysis, with the goal of identifying new members of protein families and inferring phylogenetic relationships between proteins and genes. At present there is no Java-only implementation of this algorithm.
As such, the many existing Java-based bioinformatics tools and web sites would benefit from this implementation, and sequence analysis could be performed more easily by the end user.

About Me
--------

I am a graduate student at the University of São Paulo (Brazil). I got my undergraduate degree from the same university, with a major in Computer Science and a minor in Biology. I have been involved with bioinformatics for 5 years, always in sequence analysis and with particular interest in the MSA problem. In my undergraduate final project I developed a lossless filter (pruning algorithm) for the MSA problem; the work is published in [3] and there is an online implementation of the algorithm in [4]. Finally, I have experience with the C, C++, Java, Python and Ruby programming languages, and with the Git and SVN version control systems.

Project Plan
------------

The project is divided into four main steps. At the end of each step, a completely functional and bug-free new algorithm will be added to the Biojava code base. Each step depends strongly on the previous one, so careful testing will be done before moving to the next step. The four steps are described below, together with the estimated time for each. Some steps also list extra enhancements, which will be implemented if there is time remaining after all steps are completed.

** 1. Study the Biojava pairwise alignment code and update it to be compliant with Biojava 3.

The pairwise alignment will play an important role in the MSA algorithm. This step is also important for me to get used to the Biojava coding standards and get in touch with the Biojava dev community.

ETA: 2 weeks.

** 2. Implement the algorithm to build the distance matrix.

This is done using the pairwise alignment for each pair of sequences in the set to be aligned.

ETA: 1 week.

EXTRA: Enhance the basic algorithm to use parallel strategies: use several threads to calculate the pairwise alignments for different pairs in the sequence set.

** 3. Implement the algorithm to build the guide tree.

The guide tree is based on the distance matrix built in the previous step. The tree construction strategy adopted will be the Neighbor Joining algorithm.

ETA: 2 weeks.

** 4. Implement the algorithm for progressive MSA using the guide tree.

This is certainly the most difficult part of the project, so to make sure we deliver a fully functional MSA algorithm, a safer approach will be taken. First, the dynamic programming algorithm described in [2] will be implemented. Once this is done and the code is fully integrated into the Biojava code base, the features described in [1] will be incrementally added (and tested) in order to implement the full dynamic programming algorithm.

ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks.

EXTRA: Implement some benchmark technique to measure the final alignment quality.

References
----------

[1] http://www.ncbi.nlm.nih.gov/pubmed/7984417
[2] http://www.ncbi.nlm.nih.gov/pubmed/3243435
[3] http://www.almob.org/content/4/1/3
[4] http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu

From sma.hmc at gmail.com Wed Apr 7 03:52:34 2010
From: sma.hmc at gmail.com (Singer Ma)
Date: Wed, 7 Apr 2010 00:52:34 -0700
Subject: [Biojava-l] Questions about Summer of Code Project
Message-ID:

I had previously sent this, but was not part of the mailing list, so I can only assume it got lost in a spam loop.

I am interested in applying for the All-Java Multiple Sequence Alignment Google Summer of Code project. I wanted to create a project plan but had some questions about the package as it stands now.

1. What exactly has changed with the transition to BioJava 3? From what I've read on the BioJava 3 proposal page, it seems that the changes are to the organization of the code. Additionally, there are some new standards to follow. Java 6 usage is desired, but I am unsure which of the new features could be used in modifying pairwise sequence alignments.

2. Is the Neighbor Joining algorithm really the best for this? Are other multiple alignment implementations desired? I have implemented the neighbor joining algorithm very inefficiently in Python; it was not particularly difficult. This step seems like it will not take very long. As for parallelism, I have no experience with it in Java and will only have some experience with it in C; will that be an issue?

3. Is there a specific paper with the exact algorithm that should be implemented here?

General: Will use cases be provided? Will test data be provided? These would both be useful in coding the test cases, which seem to be coded first. Additionally, I have access to my current Windows machine as well as a Linux machine for testing, but no Mac.
While in theory with Java, if it works on one platform it works on another, and especially if it works on Linux it should be fine on Mac, should I be worried about strange peculiarities?

Thanks,
Singer Ma
Harvey Mudd College 2011

From ayates at ebi.ac.uk Wed Apr 7 07:27:27 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 7 Apr 2010 12:27:27 +0100
Subject: [Biojava-l] Anonymous svn down
In-Reply-To: <2AF9846C-41A0-4CEB-9589-6DB698B2107A@gmail.com>
References: <2AF9846C-41A0-4CEB-9589-6DB698B2107A@gmail.com>
Message-ID: <36985994-D7D3-4228-90C7-0C594DA0A706@ebi.ac.uk>

By the looks of things this is quite a simple process:

http://github.com/guides/import-from-subversion
http://blog.woobling.org/2009/06/git-svn-abandon.html
http://blog.johngoulah.com/2009/11/migrating-svn-to-git/

The difficult part seems to be providing an SVN -> GitHub user mapping. Apart from that, it's a question of how much space the import will take up.

Andy

On 3 Apr 2010, at 16:08, Andreas Prlic wrote:

> Hi,
>
> the anonymous svn server seems to be down again. I have already contacted support @ obf, but have not received a response on when it should be back up. In the meantime, is anybody volunteering to set up a fallback mirror at github?
>
> Andreas
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

--
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/

From Stefan.Bleckmann at uni-duesseldorf.de Wed Apr 7 08:08:45 2010
From: Stefan.Bleckmann at uni-duesseldorf.de (Stefan Bleckmann)
Date: Wed, 07 Apr 2010 14:08:45 +0200
Subject: [Biojava-l] SubstitutionMatrix
Message-ID: <0DAB7300-372D-43D1-BFC6-2D8DB1C7CCC1@uni-duesseldorf.de>

Hi all!
I have a problem reading the NUC4.2 and 4.4 matrix files with the SubstitutionMatrix class included in BioJava 1.7.1. A small example:

    File d = new File("/Users/-----/Desktop/NUC");
    FiniteAlphabet alphabet = (FiniteAlphabet) AlphabetManager.alphabetForName("DNA");
    try {
        @SuppressWarnings("unused")
        final SubstitutionMatrix matrix = new SubstitutionMatrix(alphabet, d);
    } catch (NumberFormatException e) {
        e.printStackTrace();
    } catch (NoSuchElementException e) {
        e.printStackTrace();
    } catch (BioException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }

Thrown exception:

    Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 0
        at java.lang.String.charAt(String.java:686)
        at org.biojava.bio.alignment.SubstitutionMatrix.parseMatrix(SubstitutionMatrix.java:304)
        at org.biojava.bio.alignment.SubstitutionMatrix.<init>(SubstitutionMatrix.java:100)
        at MatrixTest.main(MatrixTest.java:30)

All BLOSUM matrix files I have downloaded work, so I don't think there is a problem like wrong encoding or something similar. Does anybody have an idea?

Cheers
Stefan

From andreas.draeger at uni-tuebingen.de Wed Apr 7 09:32:23 2010
From: andreas.draeger at uni-tuebingen.de (Andreas Dräger)
Date: Wed, 07 Apr 2010 15:32:23 +0200
Subject: [Biojava-l] SubstitutionMatrix
In-Reply-To: <0DAB7300-372D-43D1-BFC6-2D8DB1C7CCC1@uni-duesseldorf.de>
References: <0DAB7300-372D-43D1-BFC6-2D8DB1C7CCC1@uni-duesseldorf.de>
Message-ID: <20100407153223.20121fzzwyubkr53@webmail.uni-tuebingen.de>

Hi Stefan,

Thank you for this hint. I don't know what the problem is. Recently, I tested it and it worked. I'll have a look at it tomorrow and come back to you with an answer pretty soon!

Cheers
Andreas

Dipl.-Bioinform.
Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 Tübingen
Germany

Phone: +49-7071-29-70436
Fax: +49-7071-29-5091

From holland at eaglegenomics.com Wed Apr 7 09:48:21 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 7 Apr 2010 14:48:21 +0100
Subject: [Biojava-l] SubstitutionMatrix
In-Reply-To: <20100407153223.20121fzzwyubkr53@webmail.uni-tuebingen.de>
References: <0DAB7300-372D-43D1-BFC6-2D8DB1C7CCC1@uni-duesseldorf.de> <20100407153223.20121fzzwyubkr53@webmail.uni-tuebingen.de>
Message-ID: <20ACD602-7575-46DB-AFD7-348AEB37CF68@eaglegenomics.com>

I've found the problem already - the SubstitutionMatrix class has a few inconsistencies in the use of trimmed and untrimmed versions of lines. The guessAlphabet() method in this case is falling over because of an unchecked blank line in the matrix file. I've submitted a patch to trunk which fixes all the inconsistencies and should also fix this problem with the NUC files.

On 7 Apr 2010, at 14:32, Andreas Dräger wrote:

> Hi Stefan,
>
> Thank you for this hint. I don't know what the problem is. Recently, I tested it and it worked. I'll have a look on it tomorrow and come back to you with an answer pretty soon!
>
> Cheers
> Andreas
>
> Dipl.-Bioinform.
> Andreas Dräger
> Eberhard Karls University Tübingen
> Center for Bioinformatics (ZBIT)
> Sand 1
> 72076 Tübingen
> Germany
>
> Phone: +49-7071-29-70436
> Fax: +49-7071-29-5091
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From Stefan.Bleckmann at uni-duesseldorf.de Wed Apr 7 10:01:04 2010
From: Stefan.Bleckmann at uni-duesseldorf.de (Stefan Bleckmann)
Date: Wed, 07 Apr 2010 16:01:04 +0200
Subject: [Biojava-l] SubstitutionMatrix
Message-ID: <512EA47A-6F40-4A38-B69D-5990D273C9DD@uni-duesseldorf.de>

Hi Richard,

Thanks for your fast reply. I found the same solution. Two additional line breaks in the file were the problem, which I didn't see in the editor I used to check the file.

Cheers
Stefan

From andreas.prlic at gmail.com Wed Apr 7 11:13:04 2010
From: andreas.prlic at gmail.com (Andreas Prlic)
Date: Wed, 7 Apr 2010 08:13:04 -0700
Subject: [Biojava-l] Anonymous svn down
In-Reply-To: <36985994-D7D3-4228-90C7-0C594DA0A706@ebi.ac.uk>
References: <2AF9846C-41A0-4CEB-9589-6DB698B2107A@gmail.com> <36985994-D7D3-4228-90C7-0C594DA0A706@ebi.ac.uk>
Message-ID:

Hi Andy,

In the meanwhile Kyle Ellrott already has set up a first github clone...

http://github.com/biojava/biojava

We are just monitoring it a bit to make sure it works properly...

Is the usermapping important? We have some 50+ users so that might be painful...

Andreas

On Wed, Apr 7, 2010 at 4:27 AM, Andy Yates wrote:
> By the looks of things this is quite a simple process to do:
>
> http://github.com/guides/import-from-subversion
>
> http://blog.woobling.org/2009/06/git-svn-abandon.html
>
> http://blog.johngoulah.com/2009/11/migrating-svn-to-git/
>
> The difficult things seem to be providing a SVN -> GitHub user mapping.
> Apart from that it's a question of how much space will the import take up
>
> Andy
>
> On 3 Apr 2010, at 16:08, Andreas Prlic wrote:
>
>> Hi,
>>
>> the anonymous svn server seems to be down again. I have already contacted support @ obf, but not received back a response, when it should be back up. In the meanwhile, is anybody volunteering to set up a failback mirror at github?
>>
>> Andreas
>> _______________________________________________
>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>
>
>
>

--
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From ayates at ebi.ac.uk Wed Apr 7 11:17:27 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 7 Apr 2010 16:17:27 +0100
Subject: [Biojava-l] Anonymous svn down
In-Reply-To:
References: <2AF9846C-41A0-4CEB-9589-6DB698B2107A@gmail.com> <36985994-D7D3-4228-90C7-0C594DA0A706@ebi.ac.uk>
Message-ID: <647FD3F8-5222-487C-872F-DF00B693C809@ebi.ac.uk>

Hey Andreas,

The user mapping file only matters if we want a coherent link between our SVN users & those who have a github account. For example any commit of mine appears as ayates however it would probably be of more use to link to my github user since that would have more information about what I'm doing with the repo e.g. writing some snazzy new BJ3 code :).

Andy

On 7 Apr 2010, at 16:13, Andreas Prlic wrote:

> Hi Andy,
>
> In the meanwhile Kyle Ellrott already has set up a first github clone...
> > http://github.com/biojava/biojava > > We are just monitoring it a bit to make sure it works properly... > > Is the usermapping important? We have some 50+ users so that might be > painful... > > Andreas > > On Wed, Apr 7, 2010 at 4:27 AM, Andy Yates wrote: >> By the looks of things this is quite a simple process to do: >> >> http://github.com/guides/import-from-subversion >> >> http://blog.woobling.org/2009/06/git-svn-abandon.html >> >> http://blog.johngoulah.com/2009/11/migrating-svn-to-git/ >> >> The difficult things seem to be providing a SVN -> GitHub user mapping. Apart from that it's a question of how much space will the import take up >> >> Andy >> >> On 3 Apr 2010, at 16:08, Andreas Prlic wrote: >> >>> Hi, >>> >>> the anonymous svn server seems to be down again. I have already contacted support @ obf, but not recieved back a response, when it should be back up. In the meanwhile, is anybody volunteering to set up a failback mirror at github? >>> >>> Andreas >>> _______________________________________________ >>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biojava-l >> >> -- >> Andrew Yates Ensembl Genomes Engineer >> EMBL-EBI Tel: +44-(0)1223-492538 >> Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 >> Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ >> >> >> >> >> > > > > -- > ----------------------------------------------------------------------- > Dr. 
> Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------

--
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/

From andreas at sdsc.edu Wed Apr 7 15:12:27 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 7 Apr 2010 12:12:27 -0700
Subject: [Biojava-l] GSoC project on MSA
In-Reply-To:
References:
Message-ID:

Hi Gustavo,

here are my $0.02:

* For some of your steps there is already code available in BioJava. It might be good to take a look at what is already there... (look at the alignment and phylo modules for dynamic programming and Neighbour-Joining)

* What about risks? Where do you expect difficulties, and how would you work around them?

* Step 4: Can you add more details? How do you plan to approach this? E.g. Clustalw has a number of rules implemented at this stage. Do you plan to support multiple rules as well, and how would you do this technically? Something nice would be the possibility to use structure alignments to guide the sequence alignments (structure module).

Andreas

> -------------------------------------------------------------
>
> GSoC proposal
>
> Abstract
> --------
>
> This project aims to develop an all-Java implementation of a multiple
> sequence alignment (MSA) algorithm to be added to the Biojava toolkit,
> using the progressive algorithm described in the CLUSTALW paper [1].
>
> The Importance
> --------------
>
> Multiple sequence alignment is a frequently performed task in sequence
> analysis with the goal to identify new members of protein families and
> infer phylogenetic relationships between proteins and genes. At the
> present there is no Java-only implementation for this algorithm.
> As such the number of already existing and Java related BioInformatics
> tools and web sites would benefit from this implementation and
> sequence analysis could be more easily performed by the end-user.
>
> About Me
> --------
>
> I am a graduate student at University of São Paulo (Brazil), I got my
> undergraduate degree from the same university with a major in Computer
> Science and a minor in Biology. I have been involved with
> Bioinformatics for 5 years, always with sequence analysis with
> particular interest in the MSA problem. Also, in my undergraduate
> final project I developed a lossless filter (pruning algorithm) for
> the MSA problem, the work is published in [3] and there is an online
> implementation of the algorithm in [4]. Finally, I have experience
> with the C, C++, Java, Python and Ruby programming languages; Git and
> SVN version control systems.
>
> Project Plan
> ------------
>
> The project is divided in four main steps, at the end of each step a
> completely functional and bug-free new algorithm will be added to the
> Biojava code base. It should be noticed that each step has a strong
> dependence on the previous one, so before move to the next step a
> careful testing will be done.
>
> The four steps are described below, estimated times for accomplishment
> of each step are also given and in some steps extra enhancements are
> described, they will be implemented if there is some time remaining
> after all steps are completed.
>
> ** 1. Study the Biojava pairwise alignment code and update it to be
> compliant with Biojava 3.
>
>  The pairwise alignment will play an important role in the MSA
> algorithm. This step is also important for me to get used to the
> Biojava coding standards and get in touch with the Biojava dev
> community.
>
>  ETA: 2 weeks.
>
> ** 2. Implement the algorithm to build the distance matrix.
>
>  This is done using the pairwise alignment for each pair of sequence
> in the set to be aligned.
>
>  ETA: 1 week.
>
>  EXTRA: Enhance the basic algorithm to use parallel strategies, use
> several threads to calculate the pairwise alignment for different
> pairs in the sequence set.
>
> ** 3. Implement the algorithm to build the guide tree.
>
>  The guide tree is based on the distance matrix built in the last
> step, the tree construction strategy adopted will be the Neighbor
> Joining Algorithm.
>
>  ETA: 2 weeks.
>
> ** 4. Implement the algorithm for progressive MSA using the guide tree.
>
>  This is certainly the most difficult part of the project, so to make
> sure we are going to deliver a fully functional MSA algorithm, a safer
> approach is going to be taken. In the first place, a dynamic
> programming algorithm described in [2] will be implemented. Once this
> get successfully done and the code fully integrated to the Biojava
> code base, the features described in [1] are going to be incrementally
> added (and tested) in order to implement the full dynamic programming
> algorithm.
>
>  ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks.
>
>  EXTRA: Implement some benchmark technique to measure the final
> alignment quality.
>
> References
> ----------
>
> [1] http://www.ncbi.nlm.nih.gov/pubmed/7984417
> [2] http://www.ncbi.nlm.nih.gov/pubmed/3243435
> [3] http://www.almob.org/content/4/1/3
> [4] http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu
>
>
>
> On Tue, Apr 6, 2010 at 6:27 PM, Andreas Prlic wrote:
>> Hi Gustavo,
>>
>> In principle I agree to all, see details below:
>>
>>
>> I think my question wasn't very clear, my intention in this project is
>>>
>>> to follow the approach (with the tree steps) outlined in the project's
>>> page. Using the classical progressive alignment heuristic: build the
>>> distance matrix, build the guide tree and using this tree
>>> progressively align more sequences together.
>> >> yes >> >>> >>> What I propose for the third step is a first implementation using the >>> (more simple) dynamic programming described in the first CLUSTAL paper >>> (I thinks it's from 1988) and incrementally improving the algorithm to >>> get closer to the one described in CLUSTALW paper (from 1994). Is this >>> more or less what you had in mind? >> >> yes, sounds good. >> >>> >>> About parallel strategies, I think a relative easy way we could use it >>> is in the distance matrix construction, we could have several threads >>> calculating the pairwise alignment for different pairs of sequence in >>> the set. >> >> Correct. Probably a first implementation would be for a single machine/ >> multi CPU. More advanced implementations could provide support e.g. for >> Map/Reduce, JPPF, or something like that... >> >>> Now, the alignment quality measures is a tougher issue. The CLUSTALW >>> paper doesn't give any way to measure the quality of the result, they >>> consider a good alignment the one that is hard to improve by eye (But >>> they claim that for sequences sufficient similar, no pair less than >>> 35% identical, the results are good). Can I do the same as in CLUSTALW >>> paper and leave the quality measure to the user? How concerned should >>> I be with that in this project? >> >> Getting an overall core-algorithm that works should be priority. The >> benchmarking part is not mandatory, but something to keep in mind... I have >> plenty of material for that, once we get to that stage... >> >>> I will try send to this mailing list a proposal draft until tomorrow >>> to have some feedback from you. >> >> Excellent, looking forward to it. >> >> Andreas >> >> -- >> ----------------------------------------------------------------------- >> Dr. 
Andreas Prlic >> Senior Scientist, RCSB PDB Protein Data Bank >> University of California, San Diego >> (+1) 858.246.0526 >> ----------------------------------------------------------------------- >> > -- ----------------------------------------------------------------------- Dr. Andreas Prlic Senior Scientist, RCSB PDB Protein Data Bank University of California, San Diego (+1) 858.246.0526 ----------------------------------------------------------------------- From andreas at sdsc.edu Wed Apr 7 15:30:19 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Wed, 7 Apr 2010 12:30:19 -0700 Subject: [Biojava-l] Questions about Summer of Code Project In-Reply-To: References: Message-ID: Hi Singer, > I had previously sent this, but was not part of the mailing list, so I > can only assume it got lost in a spam loop. You need to be subscribed in order to be able to post... > I was interested in applying for the All-Java Multiple Sequence > Alignment Google Summer of Code project. Several students have expressed their interest in this project. Depending on how the funding situation will be, at maximum one will be able to work on this... There is also a 2nd BioJava related project or you could propose your own ideas... http://biojava.org/wiki/Google_Summer_of_Code I wanted to create a project > plan but had some questions about the package as it stands now. > > 1. What exactly has changed with the transition to BioJava 3? From > what I've read on the BioJava 3 proposal page, it seems like that the > changes are to the organization of the code. Additionally there are > some new standards to follow. Java 6 usage is desired, but I am unsure > of what of the new features could be used in modifying pairwise > sequence alignments. BioJava is more modular in version 3. There is a new module for working with sequences. The current alignment module is still based on the old version of BioJava though. > > 2. Is the Neighbor Joining Algorithm really the best for this? 
> Are other multiple alignments implementations desired? I have implemented
> the neighbor joining algorithm very inefficiently in python, it was
> not particularly difficult.

NJ is a clustering technique, but there are also others.
http://en.wikipedia.org/wiki/Neighbor-joining
Another online lecture that might be useful is:
http://www.mbio.ncsu.edu/MB451/lecture/trees/lecture.html

> This step seems like it will not take very
> long. Additionally, parallelism, I have no experience with parallelism
> in Java and will only have some experience with it in C, will that be
> an issue?

I have never written multi-threaded code in C, but I would guess it is much, much easier in Java...

> 3. Is there a specific paper with the exact algorithm that should be
> implemented here?

We have only 3 months for this project so having a modular core algorithm that can be extended would be a priority. I recommend reading the Clustalw, T-Coffee and Muscle papers.

> General: Will use cases be provided? Will test data be provided? These
> would both be useful in coding the test cases which seem to be coded
> first.

I can provide plenty of data for that.

> Additionally, I have access to my current windows machine as well as
> as Linux machine for testing, but no Mac. While in theory with java,
> if it works on one, then it works on another, and especially with if
> it works on Linux, it should be fine on Mac, should I be worried about
> strange peculiarities?

In my experience Java works pretty fine on any platform. There might be issues with user interfaces that require testing, but we are not going to do user interfaces here...

Andreas

>
> Thanks,
> Singer Ma
> Harvey Mudd College 2011
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

--
-----------------------------------------------------------------------
Dr.
Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From andreas.draeger at uni-tuebingen.de Thu Apr 8 03:13:17 2010
From: andreas.draeger at uni-tuebingen.de (Andreas Dräger)
Date: Thu, 08 Apr 2010 09:13:17 +0200
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava alignment lead"]
Message-ID: <4BBD820D.9070200@uni-tuebingen.de>

Hi all,

This e-mail is just for your information about somebody new, who'd like to contribute to our project.

Cheers
Andreas

Subject: Re: Fwd: Proposing a project on "Biojava alignment lead"
From: Andreas Dräger
Date: Wed, 07 Apr 2010 09:27:13 +0200
To: Cai Shaojiang

Hi Cai Shaojiang,

Thank you for your e-mail! I don't know what happened to the e-mail list. Sometimes it takes a while due to the spam filters, I guess.

> I am a PhD student from National University of Singapore. My major research area is local alignment algorithms and data structures for SNP identification. And I have used Java and Eclipse for years for software development. I am very interested in your GSoC programme. I find that there is a module called "biojava-alignment lead" whose mentor is you. I want to propose a new project on this module. I have several questions about this module.

Yes, that's me. So great to get your support.

> 1. It seems that pairwise alignment is to find similarity between two short sequences. Existing pairwise alignment is based on dynamic programming, is it Smith-Waterman algorithm?

So, currently, BioJava contains three different alignment approaches. There are two deterministic algorithms, i.e., Smith-Waterman for local alignment and Needleman-Wunsch for global alignment. Third, there is the possibility to apply Hidden Markov Models for alignment. An example of the latter approach should be in the cookbook.

> 2.
> What is the exact task of "refactoring of underlying data structures"?

Yes, this is something I did last week already, but it could still be improved. The problem was that the alignment algorithms actually produced a kind of string that looks similar to the output of BLAST. This string contained the score, the computation time, the length of the alignment etc. The problem was that people wanted to perform higher-level computation on the score value or evaluate some other information. Now, the alignment will produce a data structure that contains all the information and can, in addition to that, also produce such a BLAST-like output. There is, however, still the following problem: The data structure requires both sequences in the pair-wise alignment to have an identical length. In case of local alignment this is especially stupid (actually), because gaps are inserted to fill the sequences. And then the data structure tries to keep the old sequence coordinates, leading to the effect that the numbers "query start", "query end", "subject start", and "subject end" are required to shift the sequences against each other when displaying the output. So, you cannot easily print the sequences below each other; you first have to shift them. Please check out the latest version of this package via anonymous svn and have a look ;-)

> 3. My existing research area is aiming to deal with aligning short read (10s~100s bp) against extremely long sequences (e.g., human genome). As far as I know, there is not existing such alignment tools implemented in Java. Would you consider this direction?

See, this would be very nice to include. But this requires that we no longer fill the short sequence with many, many gap symbols (just a waste of memory), but improve the data structure. There is already an UnequalLengthAlignment (just a data structure, no algorithm) and I think we could use this as a starting point.
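A minimal sketch of what such a gap-free result could look like, assuming a hypothetical class LocalAlignmentResult (the class, its fields and its methods are made up for illustration and are not part of BioJava). It stores only the aligned region of each sequence plus the start coordinates, and derives the end coordinates from the number of non-gap symbols:

```java
// Hypothetical illustration only; not an existing BioJava class.
final class LocalAlignmentResult {
    private final int score;
    private final int queryStart;        // 1-based position in the original query
    private final int subjectStart;      // 1-based position in the original subject
    private final String alignedQuery;   // aligned region only, '-' marks a gap
    private final String alignedSubject; // same length as alignedQuery

    LocalAlignmentResult(int score, int queryStart, String alignedQuery,
                         int subjectStart, String alignedSubject) {
        if (alignedQuery.length() != alignedSubject.length()) {
            throw new IllegalArgumentException("aligned regions must have equal length");
        }
        this.score = score;
        this.queryStart = queryStart;
        this.alignedQuery = alignedQuery;
        this.subjectStart = subjectStart;
        this.alignedSubject = alignedSubject;
    }

    // Number of residues (non-gap symbols) in an aligned string.
    private static int residues(String aligned) {
        int n = 0;
        for (int i = 0; i < aligned.length(); i++) {
            if (aligned.charAt(i) != '-') {
                n++;
            }
        }
        return n;
    }

    int getScore() { return score; }

    // End coordinates are derived, so they cannot drift out of sync.
    int getQueryEnd() { return queryStart + residues(alignedQuery) - 1; }

    int getSubjectEnd() { return subjectStart + residues(alignedSubject) - 1; }

    // BLAST-like two-line view; nothing is padded, so no shifting is needed.
    String toBlastLikeString() {
        return "Query " + queryStart + " " + alignedQuery + " " + getQueryEnd() + "\n"
             + "Sbjct " + subjectStart + " " + alignedSubject + " " + getSubjectEnd();
    }
}
```

With a layout like this, a short read aligned against a very long subject costs memory proportional to the aligned region rather than to the subject length.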
Then your algorithm should only produce such a data structure and this would be fine.

> 4. It seems that the existing tools is just lacking of some refactoring and representation interfaces. Any more underlying tasks?

Hm. Yes: With the release of BioJava 3 data structures have changed again. So maybe there's also some adaptation to the new structure required.

> I am keeping an eye on GSoC from last month, but sorry to find out that I sent the initial email to the mailing list before I subscribe it...

Ok. Sounds good. Thanks for your interest. So I suggest: Download the latest trunk, have a look, play around and if you can improve something we'll put it into the trunk and write your name into the authors' tag.

Cheers
Andreas

--
Dipl.-Bioinform. Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 Tübingen
Germany

Phone: +49-7071-29-70436
Fax: +49-7071-29-5091

From ayates at ebi.ac.uk Thu Apr 8 06:23:06 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Thu, 8 Apr 2010 11:23:06 +0100
Subject: [Biojava-l] Questions about Summer of Code Project
In-Reply-To:
References:
Message-ID: <7EE9A2D1-0DC6-4D52-BC49-0D219AB0D88E@ebi.ac.uk>

Hi Singer,

To add a bit more information to Andreas' comments. Java has a very mature concurrent execution library (java.util.concurrent) which was introduced in version 1.5. BioJava is a 1.6 project and so I would expect any multi-concurrent library to be using this. Extensions are available for this most notably the Google guava project, the Actor model found in Scala (with more pure Java implementations available) and the Map/Reduce paradigm first white-papered by Google.
The big rules about concurrency are:

* Mutable objects are the work of the devil & should be avoided
* Tasks & Futures are quite lightweight things to produce; threads are not
* Multiple tasks can be given to a queue to be processed by a number of threads in a pool
* Assume a non-linear execution pipeline and attempt to pass messages/jobs into queues when data is processed
* Assume that things will fail
* Write your program with a view to be concurrent; do not force concurrency on an already written program

Concurrent programs are very hard things to write and normally fail because what they attempt to do is too complex or too simple. Getting the balance right is hard but do-able. I can also recommend Brian Goetz's Java Concurrency in Practice (http://www.javaconcurrencyinpractice.com/).

Andy

On 7 Apr 2010, at 20:30, Andreas Prlic wrote:

> Hi Singer,
>
>> I had previously sent this, but was not part of the mailing list, so I
>> can only assume it got lost in a spam loop.
>
> You need to be subscribed in order to be able to post...
>
>> I was interested in applying for the All-Java Multiple Sequence
>> Alignment Google Summer of Code project.
>
> Several students have expressed their interest in this project.
> Depending on how the funding situation will be, at maximum one will be
> able to work on this... There is also a 2nd BioJava related project or
> you could propose your own ideas...
> http://biojava.org/wiki/Google_Summer_of_Code
>
>
> I wanted to create a project
>> plan but had some questions about the package as it stands now.
>>
>> 1. What exactly has changed with the transition to BioJava 3? From
>> what I've read on the BioJava 3 proposal page, it seems like that the
>> changes are to the organization of the code. Additionally there are
>> some new standards to follow. Java 6 usage is desired, but I am unsure
>> of what of the new features could be used in modifying pairwise
>> sequence alignments.
>
> BioJava is more modular in version 3.
There is a new module for > working with sequences. The current alignment module is still based on > the old version of BioJava though. > >> >> 2. Is the Neighbor Joining Algorithm really the best for this? Are >> other multiple alignments implementations desired? I have implemented >> the neighbor joining algorithm very inefficiently in python, it was >> not particularly difficult. > > NJ is a clustering technique, but there are also others. > http://en.wikipedia.org/wiki/Neighbor-joining > Another online lecture that might be useful is: > http://www.mbio.ncsu.edu/MB451/lecture/trees/lecture.html > > This step seems like it will not take very >> long. Additionally, parallelism, I have no experience with parallelism >> in Java and will only have some experience with it in C, will that be >> an issue? > > I have never written multi threaded code in C, but I would guess it is > much much easier in Java... > >> 3. Is there a specific paper with the exact algorithm that should be >> implemented here? > > We have only 3 months for this project so having a modular core > algorithm that can be extended would be a priority. I recommend > reading the Clustalw, T-Coffee and Muscle papers. > >> General: Will use cases be provided? Will test data be provided? These >> would both be useful in coding the test cases which seem to be coded >> first. > > I can provide plenty of data for that. > > >> Additionally, I have access to my current windows machine as well as >> as Linux machine for testing, but no Mac. While in theory with java, >> if it works on one, then it works on another, and especially with if >> it works on Linux, it should be fine on Mac, should I be worried about >> strange peculiarities? > >> From my experience Java works pretty fine on any platform. There might > be issues with user interfaces that require testing, but we are not > going to do user interfaces here... 
>
> Andreas
>
>
>>
>> Thanks,
>> Singer Ma
>> Harvey Mudd College 2011
>> _______________________________________________
>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
>
>
>
> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

--
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/

From sma.hmc at gmail.com Thu Apr 8 06:38:41 2010
From: sma.hmc at gmail.com (Singer Ma)
Date: Thu, 8 Apr 2010 03:38:41 -0700
Subject: [Biojava-l] Questions about Summer of Code Project
In-Reply-To: <7EE9A2D1-0DC6-4D52-BC49-0D219AB0D88E@ebi.ac.uk>
References: <7EE9A2D1-0DC6-4D52-BC49-0D219AB0D88E@ebi.ac.uk>
Message-ID:

So, my questions were generated from looking past just the Summer of Code proposal and into what BioJava 3 is supposed to do. BioJava 3, as part of its proposal, lists:

  Make methods parallel-aware and take advantage of this when possible, and provide a global variable to specify how much parallelisation can take place.

on http://www.biojava.org/wiki/BioJava3_Proposal

How important is this to incorporate into the Summer of Code project? Obviously anything that is already concurrent can remain that way, but for the new code in multiple sequence alignment, does this need to be parallel-aware? Clearly, in a multiple sequence alignment, certain things can be made parallel such as the initial distance matrix calculation, parts of the neighbor joining algorithm, etc.
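As a rough illustration of the distance matrix case, here is a minimal sketch using java.util.concurrent. The class name ParallelDistanceMatrix and the pDistance function (fraction of mismatching positions between two equal-length strings) are hypothetical stand-ins; a real implementation would presumably score each pair with a pairwise alignment instead, and nothing here is BioJava API. Each task only reads the shared input and returns its own row, so no mutable state is shared between threads:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; not part of BioJava.
final class ParallelDistanceMatrix {

    private ParallelDistanceMatrix() { }

    // Stand-in for an alignment-based distance: fraction of mismatching
    // positions between two strings of equal length.
    static double pDistance(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("sequences must have equal length");
        }
        if (a.isEmpty()) {
            return 0.0;
        }
        int mismatches = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                mismatches++;
            }
        }
        return (double) mismatches / a.length();
    }

    // One task per matrix row, executed by a fixed pool of worker threads.
    // For simplicity every pair is computed twice (once per row).
    static double[][] compute(final List<String> seqs, int threads) {
        final int n = seqs.size();
        List<Callable<double[]>> tasks = new ArrayList<Callable<double[]>>();
        for (int i = 0; i < n; i++) {
            final int row = i;
            tasks.add(new Callable<double[]>() {
                public double[] call() {
                    double[] d = new double[n]; // diagonal stays 0.0
                    for (int j = 0; j < n; j++) {
                        if (j != row) {
                            d[j] = pDistance(seqs.get(row), seqs.get(j));
                        }
                    }
                    return d;
                }
            });
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // invokeAll returns the futures in the same order as the tasks.
            List<Future<double[]>> futures = pool.invokeAll(tasks);
            double[][] matrix = new double[n][];
            for (int i = 0; i < n; i++) {
                matrix[i] = futures.get(i).get();
            }
            return matrix;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while computing distances", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a distance task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The thread count is passed in as a parameter here; a global setting for the degree of parallelisation, as the proposal page suggests, could simply supply that value.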
If I were to contribute, I would want to uphold the agreed upon standards as much as possible. I am just unsure of my capability to make multiple sequence alignment parallel-aware.

Singer

On Thu, Apr 8, 2010 at 3:23 AM, Andy Yates wrote:
> Hi Singer,
>
> To add a bit more information to Andreas' comments. Java has a very mature concurrent execution library (java.util.concurrent) which was introduced in version 1.5. BioJava is a 1.6 project and so I would expect any multi-concurrent library to be using this. Extensions are available for this most notably the Google guava project, the Actor model found in Scala (with more pure Java implementations available) and the Map/Reduce paradigm first white-papered by Google. The big rules about concurrency are:
>
> * Mutable objects are the work of the devil & should be avoided
> * Tasks & Futures are quite lightweight things to produce; threads are not
> * Multiple tasks can be given to a queue to be processed by a number of threads in a pool
> * Assume a non-linear execution pipeline and attempt to pass messages/jobs into queues when data is processed
> * Assume that things will fail
> * Write your program with a view to be concurrent; do not force concurrency on an already written program
>
> Concurrent programs are very hard things to write and normally fail because what they attempt to do is too complex or too simple. Getting the balance right is hard but do-able. I can also recommend Brian Goetz's Java Concurrency in Practice (http://www.javaconcurrencyinpractice.com/).
>
> Andy
>
> On 7 Apr 2010, at 20:30, Andreas Prlic wrote:
>
>> Hi Singer,
>>
>>> I had previously sent this, but was not part of the mailing list, so I
>>> can only assume it got lost in a spam loop.
>>
>> You need to be subscribed in order to be able to post...
>>
>>> I was interested in applying for the All-Java Multiple Sequence
>>> Alignment Google Summer of Code project.
>>
>> Several students have expressed their interest in this project.
>> Depending on how the funding situation will be, at maximum one will be >> able to work on this... There is also a 2nd BioJava related project or >> you could propose your own ideas... >> http://biojava.org/wiki/Google_Summer_of_Code >> >> >> I wanted to create a project >>> plan but had some questions about the package as it stands now. >>> >>> 1. What exactly has changed with the transition to BioJava 3? From >>> what I've read on the BioJava 3 proposal page, it seems like that the >>> changes are to the organization of the code. Additionally there are >>> some new standards to follow. Java 6 usage is desired, but I am unsure >>> of what of the new features could be used in modifying pairwise >>> sequence alignments. >> >> BioJava is more modular in version 3. There is a new module for >> working with sequences. The current alignment module is still based on >> the old version of BioJava though. >> >>> >>> 2. Is the Neighbor Joining Algorithm really the best for this? Are >>> other multiple alignments implementations desired? I have implemented >>> the neighbor joining algorithm very inefficiently in python, it was >>> not particularly difficult. >> >> NJ is a clustering technique, but there are also others. >> http://en.wikipedia.org/wiki/Neighbor-joining >> Another online lecture that might be useful is: >> http://www.mbio.ncsu.edu/MB451/lecture/trees/lecture.html >> >> This step seems like it will not take very >>> long. Additionally, parallelism, I have no experience with parallelism >>> in Java and will only have some experience with it in C, will that be >>> an issue? >> >> I have never written multi threaded code in C, but I would guess it is >> much much easier in Java... >> >>> 3. Is there a specific paper with the exact algorithm that should be >>> implemented here? >> >> We have only 3 months for this project so having a modular core >> algorithm that can be extended would be a priority. 
>> I recommend reading the Clustalw, T-Coffee and Muscle papers.
>>
>>> General: Will use cases be provided? Will test data be provided? These
>>> would both be useful in coding the test cases which seem to be coded
>>> first.
>>
>> I can provide plenty of data for that.
>>
>>> Additionally, I have access to my current windows machine as well as
>>> as Linux machine for testing, but no Mac. While in theory with java,
>>> if it works on one, then it works on another, and especially with if
>>> it works on Linux, it should be fine on Mac, should I be worried about
>>> strange peculiarities?
>>
>> From my experience Java works pretty fine on any platform. There might
>> be issues with user interfaces that require testing, but we are not
>> going to do user interfaces here...
>>
>> Andreas

From ayates at ebi.ac.uk Thu Apr 8 06:46:15 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Thu, 8 Apr 2010 11:46:15 +0100
Subject: [Biojava-l] Questions about Summer of Code Project
In-Reply-To:
References: <7EE9A2D1-0DC6-4D52-BC49-0D219AB0D88E@ebi.ac.uk>
Message-ID: <91C9DF16-E6EF-4B7A-ADC4-E781275514EB@ebi.ac.uk>

Ahhh okay. So when we wrote this section it was with a view towards
being able to do things in a concurrent manner as & when that
framework appears. BioJava3 is still in an incubation phase; a lot of
code is in place but we are all having to do this along with work
commitments (which in my case is working on a Perl project so my
work/BJ contributions are very limited).

Anyway, to go back to the question about being "framework" standard.
The MSA algorithm would be the first case we would have to make
concurrent (as far as I am aware, but Scooter is a better person to
confirm this) and so the framework for building a concurrent
application would come from this project. If the code is written using
the standard concurrent library interfaces then it should be possible
to transplant it into any concurrent Java framework, and that's really
the important thing here.

Andy

--
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/

From mitlox at op.pl Thu Apr 8 07:30:13 2010
From: mitlox at op.pl (xyz)
Date: Thu, 8 Apr 2010 21:30:13 +1000
Subject: [Biojava-l] Reading and writting Fastq files
In-Reply-To:
References: <20100330215047.084f6b00@wp01>
Message-ID: <20100408213013.63a99b8c@wp01>

On Wed, 31 Mar 2010 23:56:42 -0400 (EDT)
Michael Heuer wrote:

> import static ...RichSequence.Tools.*;
> import static ...RichSequence.IOTools.*;
>
> Fastq fastq = ...;
> Namespace namespace = ...;
> RichSequence richSequence = createRichSequence(
>   namespace,
>   fastq.getDescription(),
>   fastq.getSequence(),
>   DNATools.getDNA());
>
> writeFasta(outputStream, richSequence, namespace);

I have tried this but I got this error:

Fastq2Fasta.java:52: cannot find symbol
symbol  : method createRichSequence(org.biojavax.SimpleNamespace,java.lang.String,java.lang.String,org.biojava.bio.symbol.FiniteAlphabet)
location: class Fastq2Fasta
      RichSequence richSequence = createRichSequence(ns,
1 error

The complete code now looks like this:

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import org.biojava.bio.program.fastq.Fastq;
import org.biojava.bio.program.fastq.FastqBuilder;
import org.biojava.bio.program.fastq.FastqReader;
import org.biojava.bio.program.fastq.FastqVariant;
import org.biojava.bio.program.fastq.FastqWriter;
import org.biojava.bio.program.fastq.IlluminaFastqReader;
import org.biojava.bio.program.fastq.IlluminaFastqWriter;
import org.biojava.bio.seq.DNATools;
import org.biojavax.SimpleNamespace;
import org.biojavax.bio.seq.RichSequence;

public class Fastq2Fasta {

  public static void main(String[] args) throws FileNotFoundException, IOException {

    FileInputStream inputFastq = new FileInputStream("fastq2fasta.fastq");
    FastqReader qReader = new IlluminaFastqReader();

    FileOutputStream outputFastq = new FileOutputStream("fastq2fastaTrim.fastq");
    FastqWriter qWriter = new IlluminaFastqWriter();

    //SimpleNamespace ns = new SimpleNamespace("biojava");

    FileOutputStream outputFasta = new FileOutputStream("fastq2fastaTrim.fasta");

    for (Fastq fastq : qReader.read(inputFastq)) {
      System.out.println(fastq.getDescription());
      System.out.println(fastq.getSequence());
      String trimSeq = fastq.getSequence().substring(0, fastq.getSequence().length() - 6);
      System.out.println(trimSeq);
      System.out.println(fastq.getQuality());
      String trimQual = fastq.getQuality().substring(0, fastq.getQuality().length() - 6);
      System.out.println(trimQual);

      FastqBuilder trimFastq = new FastqBuilder();
      trimFastq.withVariant(FastqVariant.FASTQ_ILLUMINA);
      trimFastq.withDescription(fastq.getDescription());
      trimFastq.appendSequence(trimSeq);
      trimFastq.appendQuality(trimQual);

      qWriter.write(outputFastq, trimFastq.build());

      SimpleNamespace ns = new SimpleNamespace("biojava");
      RichSequence richSequence = createRichSequence(ns, fastq.getDescription(), trimSeq, DNATools.getDNA());
      RichSequence.IOTools.writeFasta(outputFasta, richSequence, ns);
    }
  }
}

What did I do wrong?
>
> > Suggestions:
> > 1)
> > After I trimmed the fastq files the header information for quality
> > is empty
> >
> > @HWI-EAS406:5:1:0:1390#0/1
> > GGGTGATGGCCGCTGCCGATGGCGTCAAAA
> > +
> > OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
> >
> > this reduced the size of the files but is it compatible with
> > SOAP and TopHat?
>
> Sorry, not sure what you are asking here.
>

Usually the @-header and the +-header are equal, e.g.

@HWI-EAS406:5:1:0:1390#0/1
+HWI-EAS406:5:1:0:1390#0/1

but after trimming and writing to a fastq file I got this:

@HWI-EAS406:5:1:0:1390#0/1
+

The +-header is empty. Is this OK, and is it compatible with the
standard?

Best regards,

From mitlox at op.pl Thu Apr 8 07:30:52 2010
From: mitlox at op.pl (xyz)
Date: Thu, 8 Apr 2010 21:30:52 +1000
Subject: [Biojava-l] readFasta problem
Message-ID: <20100408213052.662beb8e@wp01>

Hello,
I would like to read a fasta file without specifying in the code
whether it is DNA, RNA or protein, and I wrote this code:

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import org.biojava.bio.BioException;
import org.biojavax.SimpleNamespace;
import org.biojavax.bio.seq.RichSequence;
import org.biojavax.bio.seq.RichSequenceIterator;

public class SortFasta {

  public static void main(String[] args) throws FileNotFoundException, BioException {

    BufferedReader br = new BufferedReader(new FileReader("sortFasta.fasta"));
    SimpleNamespace ns = new SimpleNamespace("biojava");

    // You can use any of the convenience methods found in the BioJava 1.6 API
    //RichSequenceIterator rsi = RichSequence.IOTools.readFastaDNA(br, ns);
    RichSequenceIterator rsi = RichSequence.IOTools.readFasta(br, null, ns);

    // Since a single file can contain more than a sequence, you need
    // to iterate over rsi to get the information.
    while (rsi.hasNext()) {
      RichSequence rs = rsi.nextRichSequence();
      System.out.println(rs.getComments());
      System.out.println(rs.seqString());
    }
  }
}

but unfortunately I got the following error:

it the details that follow to biojava-l at biojava.org or post a bug
report to http://bugzilla.open-bio.org/

Format_object=org.biojavax.bio.seq.io.FastaFormat
Accession=
Id=
Comments=problem parsing symbols
Parse_block=atccccc
Stack trace follows ....

        at org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:222)
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
        ... 1 more
Caused by: java.lang.NullPointerException
        at org.biojava.bio.symbol.SimpleSymbolList.<init>(SimpleSymbolList.java:165)
        at org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:213)
        ... 2 more
Java Result: 1

What did I do wrong?

Thank you in advance.

Best regards,

From holland at eaglegenomics.com Thu Apr 8 07:41:25 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Thu, 8 Apr 2010 12:41:25 +0100
Subject: [Biojava-l] readFasta problem
In-Reply-To: <20100408213052.662beb8e@wp01>
References: <20100408213052.662beb8e@wp01>
Message-ID: <5B252A84-6822-4978-BBD1-83A8BA09EBC5@eaglegenomics.com>

You have passed null into the tokenizer parameter of
RichSequence.IOTools.readFasta() - this is not allowed. The parser
cannot guess the type of sequence; it must be told what to expect by
specifying the tokenizer to use. (Importantly this also means that you
cannot mix different types of sequence within the same file to be
parsed.)
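Since the parser will not guess, a caller who really does not know the sequence type in advance could peek at the first record and decide which reader to call, for example readFastaDNA as in the commented-out line of the original code. The sketch below is only a crude heuristic under stated assumptions: the class AlphabetGuessSketch and its Kind enum are hypothetical names, not BioJava APIs, and nucleotide IUPAC ambiguity codes other than N are misclassified as protein.

```java
/** Hypothetical pre-check, not a BioJava API: guesses the alphabet of raw residue text. */
final class AlphabetGuessSketch {

    enum Kind { DNA, RNA, PROTEIN }

    /**
     * Returns PROTEIN as soon as a letter outside A, C, G, T, U, N or '-' is seen;
     * otherwise RNA if any U was seen, else DNA. Case and whitespace are ignored.
     */
    static Kind guess(String residues) {
        boolean sawU = false;
        for (char c : residues.toUpperCase().toCharArray()) {
            if (Character.isWhitespace(c)) {
                continue;
            }
            if (c == 'U') {
                sawU = true;
            } else if ("ACGTN-".indexOf(c) < 0) {
                return Kind.PROTEIN;
            }
        }
        return sawU ? Kind.RNA : Kind.DNA;
    }

    public static void main(String[] args) {
        // "atccccc" is the Parse_block reported in the error message above.
        System.out.println(guess("atccccc"));
    }
}
```

Because a short protein made only of A, C, G and T residues would be misread as DNA, such a guess is best treated as a convenience default that the user can override.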

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From holland at eaglegenomics.com Thu Apr 8 07:36:36 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Thu, 8 Apr 2010 12:36:36 +0100
Subject: [Biojava-l] Reading and writting Fastq files
In-Reply-To: <20100408213013.63a99b8c@wp01>
References: <20100330215047.084f6b00@wp01> <20100408213013.63a99b8c@wp01>
Message-ID: <5EBA99CE-1DAC-442A-B7FD-2E738C7F586B@eaglegenomics.com>

You haven't included the two import static lines in your code. See the
first two lines of Michael's example code (expanding the ellipses to
the full package names).

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From chapman at cs.wisc.edu Thu Apr 8 08:47:12 2010
From: chapman at cs.wisc.edu (Mark Chapman)
Date: Thu, 08 Apr 2010 07:47:12 -0500
Subject: [Biojava-l] GSoC Application
Message-ID: <4BBDD050.6090208@cs.wisc.edu>

I would appreciate any feedback on my proposal from mentors or other
developers. Check it out at:
http://socghop.appspot.com/gsoc/student_proposal/show/google/gsoc2010/mark_chapman/t127055148817

Thanks in advance,
Mark

From caishaojiang at gmail.com Thu Apr 8 09:28:11 2010
From: caishaojiang at gmail.com (Cai Shaojiang)
Date: Thu, 8 Apr 2010 06:28:11 -0700
Subject: [Biojava-l] [Fwd: Re: GSoC project on MSA]
In-Reply-To: <4BBDCFD2.3000507@uni-tuebingen.de>
References: <4BBC80A8.5000608@uni-tuebingen.de> <4BBDCFD2.3000507@uni-tuebingen.de>
Message-ID:

Dear Sir:

I have submitted the proposal through Google.

Cheers.

On Thu, Apr 8, 2010 at 5:45 AM, Andreas Dräger <andreas.draeger at uni-tuebingen.de> wrote:

> Hi Cai,
>
> Oh yes, it is in the alignment package. But it is only an interface. It
> already has two sub-types: AbstractULAlignment and this has the
> implementation SubULAlignment. We should check first if we can already use
> these data structures to easily produce a paired alignment. Can you see how
> the AlignmentPair is produced by the alignment algorithms in the alignment
> package? We should do something similar but with this different data
> structure, I suggest.
>
> Cheers
> Andreas
>
> --
> Dipl.-Bioinform. Andreas Dräger
> Eberhard Karls University Tübingen
> Center for Bioinformatics (ZBIT)
> Sand 1
> 72076 Tübingen
> Germany
>
> Phone: +49-7071-29-70436
> Fax: +49-7071-29-5091
>

--
Cai Shaojiang
Department of Information Systems,
School of Computing,
National University of Singapore
Telephone: +65 93-4870-93
Email: caishaojiang at gmail.com; shaoj at comp.nus.edu.sg

From sacomoto at gmail.com Thu Apr 8 12:26:55 2010
From: sacomoto at gmail.com (Gustavo Akio Tominaga Sacomoto)
Date: Thu, 8 Apr 2010 13:26:55 -0300
Subject: [Biojava-l] GSoC project on MSA
In-Reply-To:
References:
Message-ID:

Hi Andreas,

On Wed, Apr 7, 2010 at 4:12 PM, Andreas Prlic wrote:
> Hi Gustavo,
>
> here my 0.02$:
>
> * For some of your steps there is already code available in BioJava.
> MIght be good to take a look at what is already there... (look at
> the alignment and phylo modules for dynamic programming and
> Neighbour-Joining)
>
> * What about risks? Where do you expect difficulties and how to work
> around them?
>
> * Step 4: Can you add more details? How do you plan to approach this?
> E.g. Clustalw has a number of rules implemented at this stage. Do you
> plan to support multiple rules as well and how to do this technically.
> Something nice would be the possibility to use structure alignments to
> guide the sequence alignments. (structure module)

Based on this I rewrote step 4 and added a "Main Risks" section. I
have pasted just the new version of step 4 and the new section at the
end of this e-mail.
Thank you very much for your feedback.

gustavo

-------------------------------------------------------------------------------------------

** 4. Implement the algorithm for progressive MSA and the MSA wrapper.

A progressive MSA is a heuristic approach to the MSA problem: at each
step a pairwise alignment is done between two sequences, between a
sequence and an alignment, or between two alignments. The multiple
alignment is thus built incrementally, with more sequences aligned
together at each iteration. The guide tree gives the order for this
incremental alignment: working bottom-up in the tree, sequences (or
groups of sequences) with greater similarity are aligned first.
Therefore, in order to have more flexible and reusable code, the
design will allow any binary tree over the sequences to be used as a
guide tree, not only the one built in the last step. This will allow
a priori phylogenetic or tertiary (structural) similarity knowledge to
be used to guide the multiple alignment order.

This is certainly the most difficult part of the project, so to make
sure we deliver a fully functional MSA algorithm, a safer approach is
going to be taken. First, a basic algorithm described in [2] will be
implemented. Once this is done and the code is fully integrated into
the Biojava code base, the features described in [1] will be added
(and tested) incrementally in order to implement the full algorithm.
This step is further divided into substeps.

*** 4.1 Implement a first, simpler dynamic programming (DP) algorithm.

This is the generalized pairwise alignment used in each iteration of
the progressive MSA. Gaps already present in one of the alignments
(profiles) remain fixed and gap opening penalties remain unchanged,
which means that opening new gaps inside existing gaps will be fully
penalized. The code for this algorithm is similar to the code for
regular pairwise alignment already present in Biojava.
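To make substep 4.1 concrete, here is a minimal sketch of a profile-to-profile global alignment in which existing gap columns stay fixed and every newly inserted gap column pays the full penalty. Everything in it is an assumption made for illustration: the class ProfileAlignSketch, the sum-of-pairs column score, the linear gap cost and the scoring constants. It is neither the algorithm of [2] nor existing BioJava code.

```java
/** Hypothetical sketch of substep 4.1; not BioJava code and not full CLUSTALW scoring. */
final class ProfileAlignSketch {

    static final int MATCH = 1;
    static final int MISMATCH = -1;
    static final int GAP = -2; // assumed linear gap cost per residue pair

    /** Sum-of-pairs score of column i of profile a against column j of profile b. */
    static int columnScore(String[] a, int i, String[] b, int j) {
        int score = 0;
        for (String x : a) {
            for (String y : b) {
                char p = x.charAt(i);
                char q = y.charAt(j);
                if (p == '-' && q == '-') {
                    continue; // two existing gaps: nothing to score
                }
                if (p == '-' || q == '-') {
                    score += GAP;
                } else {
                    score += (p == q) ? MATCH : MISMATCH;
                }
            }
        }
        return score;
    }

    /** Globally aligns two profiles; returns the rows of a followed by the rows of b. */
    static String[] align(String[] a, String[] b) {
        int n = a[0].length();
        int m = b[0].length();
        int gapColumn = GAP * a.length * b.length; // a new gap column is always fully penalised
        int[][] s = new int[n + 1][m + 1];
        for (int i = 1; i <= n; i++) s[i][0] = i * gapColumn;
        for (int j = 1; j <= m; j++) s[0][j] = j * gapColumn;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int diag = s[i - 1][j - 1] + columnScore(a, i - 1, b, j - 1);
                int up = s[i - 1][j] + gapColumn;
                int left = s[i][j - 1] + gapColumn;
                s[i][j] = Math.max(diag, Math.max(up, left));
            }
        }
        StringBuilder[] out = new StringBuilder[a.length + b.length];
        for (int r = 0; r < out.length; r++) out[r] = new StringBuilder();
        int i = n;
        int j = m;
        while (i > 0 || j > 0) { // trace back, emitting columns right to left
            boolean useA;
            boolean useB;
            if (i > 0 && j > 0 && s[i][j] == s[i - 1][j - 1] + columnScore(a, i - 1, b, j - 1)) {
                useA = true;
                useB = true;
            } else if (i > 0 && s[i][j] == s[i - 1][j] + gapColumn) {
                useA = true;
                useB = false;
            } else {
                useA = false;
                useB = true;
            }
            for (int r = 0; r < a.length; r++) out[r].append(useA ? a[r].charAt(i - 1) : '-');
            for (int r = 0; r < b.length; r++) out[a.length + r].append(useB ? b[r].charAt(j - 1) : '-');
            if (useA) i--;
            if (useB) j--;
        }
        String[] rows = new String[out.length];
        for (int r = 0; r < out.length; r++) rows[r] = out[r].reverse().toString();
        return rows;
    }
}
```

Columns are treated as indivisible units, which is what keeps the gaps of earlier alignment steps fixed; the later substeps would replace the constant gap cost with affine and position-specific penalties.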

*** 4.2 Implement the basic progressive MSA algorithm.

This substep implements the incremental algorithm that builds the MSA,
traversing a guide tree (a parameter: it could be the one built in
step 3 or any other) bottom-up and using the algorithm from substep
4.1 at each iteration.

*** 4.3 Implement the MSA wrapper.

The MSA wrapper is going to be a method that wraps steps 2, 3 and 4.2,
giving the final user a simple method to calculate the MSA. It
receives as parameters the set of sequences to be aligned, the gap
opening penalty, the gap extension penalty and the residue matrix, and
returns the MSA for the sequence set. At the end of this substep we
have a basic, fully functional MSA algorithm using the progressive
heuristic.

*** 4.4 Implement gap penalty rescaling and default parameter values.

The penalties for opening a new gap and for extending an existing one
(the affine gap weight model) are user-defined parameters. This
substep will define default values for these parameters, based on the
residue matrix, and implement global rescaling rules for them (based
on sequence lengths).

*** 4.5 Enhance the DP algorithm to use different sequence weights.

Based on the guide tree, a different weight is calculated for each
sequence (divergent sequences receive high values) and used in the
scoring scheme of the generalized DP algorithm.

*** 4.6 Enhance the DP algorithm to use position-based gap penalties.

The DP algorithm from substep 4.1 uses a globally defined gap opening
penalty. In this substep the algorithm is going to be modified to use
position-based penalties. This is simple once an array of opening
penalties, one per sequence position, is known. This array is
calculated from several hierarchical rules (only the first one that
fits, if any, is applied); these are rescaling rules, and the array is
initialized with the original gap opening penalty.
Given the hierarchical nature of the rules, they can be implemented in a incremental way, from the highest priority rule to the lowest, the algorithm of each step being a refinement of the previous one. I am omitting the detailed description of each rule. However, to verify if a given rule apply to a given position, all that is necessary is to check at most 16 adjacent positions and the same position in the other already aligned sequences. At the end of each of the following steps we a have functional algorithm, and after 4.6.4 the full CLUSTALW algorithm is complete. **** 4.6.1 Lowered gap opening penalties at existing gaps. **** 4.6.2 Increased gap opening penalties near existing gaps. **** 4.6.3 Reduced gap opening penalties in hydrophilic stretches. **** 4.6.4 Residue specific gap penalties. ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks. EXTRA: Implement some benchmark technique to measure the final alignment quality. Main Risks ---------- The main risk to this project is the intrinsic complexity of the MSA progressive algorithm. To deal with that we decided to break the implementation in a large number of small and manageable steps, and the steps are designed in a way that, at the end of each of them, we will have a complete and testable new function (or a modification of an existing one). Besides that, to be extra careful the project aims to produce a simple full functional MSA algorithm as early as possible, the estimated time is 8 weeks, this way we guarantee to deliver at a simpler, but working and bug-free, version. > Andreas > > >> ------------------------------------------------------------- >> >> GSoC proposal >> >> Abstract >> -------- >> >> This project aims to develop an all-Java implementation of a multiple >> sequence alignment (MSA) algorithm to be added to the Biojava toolkit, >> using the progressive algorithm described in the CLUSTALW paper [1]. 
>>
>> The Importance
>> --------------
>>
>> Multiple sequence alignment is a frequently performed task in sequence
>> analysis with the goal to identify new members of protein families and
>> infer phylogenetic relationships between proteins and genes. At the
>> present there is no Java-only implementation for this algorithm. As
>> such the number of already existing and Java related BioInformatics
>> tools and web sites would benefit from this implementation and
>> sequence analysis could be more easily performed by the end-user.
>>
>> About Me
>> --------
>>
>> I am a graduate student at University of São Paulo (Brazil), I got my
>> undergraduate degree from the same university with a major in Computer
>> Science and a minor in Biology. I have been involved with
>> Bioinformatics for 5 years, always with sequence analysis with
>> particular interest in the MSA problem. Also, in my undergraduate
>> final project I developed a lossless filter (pruning algorithm) for
>> the MSA problem, the work is published in [3] and there is an online
>> implementation of the algorithm in [4]. Finally, I have experience
>> with the C, C++, Java, Python and Ruby programming languages; Git and
>> SVN version control systems.
>>
>> Project Plan
>> ------------
>>
>> The project is divided in four main steps, at the end of each step a
>> completely functional and bug-free new algorithm will be added to the
>> Biojava code base. It should be noticed that each step has a strong
>> dependence on the previous one, so before move to the next step a
>> careful testing will be done.
>>
>> The four steps are described below, estimated times for accomplishment
>> of each step are also given and in some steps extra enhancements are
>> described, they will be implemented if there is some time remaining
>> after all steps are completed.
>>
>> ** 1. Study the Biojava pairwise alignment code and update it to be
>> compliant with Biojava 3.
>>
>> The pairwise alignment will play an important role in the MSA
>> algorithm. This step is also important for me to get used to the
>> Biojava coding standards and get in touch with the Biojava dev
>> community.
>>
>> ETA: 2 weeks.
>>
>> ** 2. Implement the algorithm to build the distance matrix.
>>
>> This is done using the pairwise alignment for each pair of sequence
>> in the set to be aligned.
>>
>> ETA: 1 week.
>>
>> EXTRA: Enhance the basic algorithm to use parallel strategies, use
>> several threads to calculate the pairwise alignment for different
>> pairs in the sequence set.
>>
>> ** 3. Implement the algorithm to build the guide tree.
>>
>> The guide tree is based on the distance matrix built in the last
>> step, the tree construction strategy adopted will be the Neighbor
>> Joining Algorithm.
>>
>> ETA: 2 weeks.
>>
>> ** 4. Implement the algorithm for progressive MSA using the guide tree.
>>
>> This is certainly the most difficult part of the project, so to make
>> sure we are going to deliver a fully functional MSA algorithm, a safer
>> approach is going to be taken. In the first place, a dynamic
>> programming algorithm described in [2] will be implemented. Once this
>> get successfully done and the code fully integrated to the Biojava
>> code base, the features described in [1] are going to be incrementally
>> added (and tested) in order to implement the full dynamic programming
>> algorithm.
>>
>> ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks.
>>
>> EXTRA: Implement some benchmark technique to measure the final
>> alignment quality.
>> >> References >> ---------- >> >> [1] http://www.ncbi.nlm.nih.gov/pubmed/7984417 >> [2] http://www.ncbi.nlm.nih.gov/pubmed/3243435 >> [3] http://www.almob.org/content/4/1/3 >> [4] http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu >> >> >> >> On Tue, Apr 6, 2010 at 6:27 PM, Andreas Prlic wrote: >>> Hi Gustavo, >>> >>> In principle I agree to all, see details below: >>> >>> >>> I think my question wasn't very clear, my intention in this project is >>>> >>>> to follow the approach (with the tree steps) outlined in the project's >>>> page. Using the classical progressive alignment heuristic: build the >>>> distance matrix, build the guide tree and using this tree >>>> progressively align more sequences together. >>> >>> yes >>> >>>> >>>> What I propose for the third step is a first implementation using the >>>> (more simple) dynamic programming described in the first CLUSTAL paper >>>> (I thinks it's from 1988) and incrementally improving the algorithm to >>>> get closer to the one described in CLUSTALW paper (from 1994). Is this >>>> more or less what you had in mind? >>> >>> yes, sounds good. >>> >>>> >>>> About parallel strategies, I think a relative easy way we could use it >>>> is in the distance matrix construction, we could have several threads >>>> calculating the pairwise alignment for different pairs of sequence in >>>> the set. >>> >>> Correct. Probably a first implementation would be for a single machine/ >>> multi CPU. More advanced implementations could provide support e.g. for >>> Map/Reduce, JPPF, or something like that... >>> >>>> Now, the alignment quality measures is a tougher issue. The CLUSTALW >>>> paper doesn't give any way to measure the quality of the result, they >>>> consider a good alignment the one that is hard to improve by eye (But >>>> they claim that for sequences sufficient similar, no pair less than >>>> 35% identical, the results are good). 
Can I do the same as in CLUSTALW
>>>> paper and leave the quality measure to the user? How concerned should
>>>> I be with that in this project?
>>>
>>> Getting an overall core-algorithm that works should be priority. The
>>> benchmarking part is not mandatory, but something to keep in mind... I have
>>> plenty of material for that, once we get to that stage...
>>>
>>>> I will try send to this mailing list a proposal draft until tomorrow
>>>> to have some feedback from you.
>>>
>>> Excellent, looking forward to it.
>>>
>>> Andreas
>>>
>>> --
>>> -----------------------------------------------------------------------
>>> Dr. Andreas Prlic
>>> Senior Scientist, RCSB PDB Protein Data Bank
>>> University of California, San Diego
>>> (+1) 858.246.0526
>>> -----------------------------------------------------------------------
>>>
>>
>
>
>
> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------
>
From andreas at sdsc.edu Thu Apr 8 13:26:03 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Thu, 8 Apr 2010 10:26:03 -0700
Subject: [Biojava-l] GSoC Application
In-Reply-To: <4BBDD050.6090208@cs.wisc.edu>
References: <4BBDD050.6090208@cs.wisc.edu>
Message-ID:

Hi Mark,

looks pretty good,

* The time schedule feels tight. Where do you see possible
difficulties and risks. What might take longer than expected?

* I would like to be able to use 3D structure alignment information to
guide the final alignment. This should increase reliability of the
final alignment for remote sequence similarities. Any thoughts on how
to accomplish this?

Andreas

On Thu, Apr 8, 2010 at 5:47 AM, Mark Chapman wrote:
> I would appreciate any feedback on my proposal from mentors or other
> developers.
Check it out at:
> http://socghop.appspot.com/gsoc/student_proposal/show/google/gsoc2010/mark_chapman/t127055148817
>
> Thanks in advance,
> Mark
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

--
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------
From andreas at sdsc.edu Thu Apr 8 13:36:56 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Thu, 8 Apr 2010 10:36:56 -0700
Subject: [Biojava-l] GSoC project on MSA
In-Reply-To:
References:
Message-ID:

Looks pretty good.

One issue during the progressive alignment build up: 3D structure
alignments can increase the reliability of the sequence alignments,
particularly if the sequences are only distantly related. Having a way
to incorporate the 3D structure info would be nice...

Andreas

On Thu, Apr 8, 2010 at 9:26 AM, Gustavo Akio Tominaga Sacomoto wrote:
> Hi Andreas,
>
> On Wed, Apr 7, 2010 at 4:12 PM, Andreas Prlic wrote:
>> Hi Gustavo,
>>
>> here my 0.02$:
>>
>> * For some of your steps there is already code available in BioJava.
>> Might be good to take a look at what is already there... (look at
>> the alignment and phylo modules for dynamic programming and
>> Neighbour-Joining)
>>
>> * What about risks? Where do you expect difficulties and how to work
>> around them?
>>
>> * Step 4: Can you add more details? How do you plan to approach this?
>> E.g. Clustalw has a number of rules implemented at this stage. Do you
>> plan to support multiple rules as well and how to do this technically.
>> Something nice would be the possibility to use structure alignments to
>> guide the sequence alignments. (structure module)
>
> Based on it I rewrote the step 4 and add a "Main Risks" section.
>
> I pasted just the new version of step 4 and the new section at the end
> of this e-mail.
>
> Thank you very much for your feedback.
>
> gustavo
>
>
>
> -------------------------------------------------------------------------------------------
>
> ** 4. Implement the algorithm for progressive MSA and the MSA wrapper.
>
> A progressive MSA is a heuristic approach for the MSA problem, at
> each step a pairwise alignment between two sequences, a sequence and
> an alignment or between two alignments is done. So, the multiple
> alignment is built incrementally, at each iteration more sequences are
> aligned together. The guide tree gives an order for this incremental
> alignment, in a bottom-up (in the tree) fashion sequences (or groups
> of sequences) with greater similarity are aligned first. Therefore, in
> order to have a more flexible and reusable code, the code design will
> allow any binary tree of the sequences to be used as a guide tree, not
> only the one built in the last step. This will allow a priori
> phylogenetic or tertiary similarity (structural similarity) knowledge
> be used to guide the multiple alignment order.
>
> This is certainly the most difficult part of the project, so to make
> sure we are going to deliver a fully functional MSA algorithm, a safer
> approach is going to be taken. In the first place, a basic algorithm
> described in [2] will be implemented. Once this get successfully done
> and the code fully integrated to the Biojava code base, the features
> described in [1] are going to be incrementally added (and tested) in
> order to implement the full algorithm. This step is further divided in
> substeps.
>
> *** 4.1 Implement a first simpler dynamic programming (DP) algorithm.
>
> This is the generalized pairwise alignment used in each iteration of
> the progressive MSA.
Gaps already present in one of the alignments
> (profiles) remain fixed, gap opening penalties remain unchanged, this
> means that opening new gaps inside existent gaps will be fully
> penalized. The code for this algorithm is similar to, the already
> present in Biojava, code for regular pairwise alignment.
>
> *** 4.2 Implement the basic progressive MSA algorithm.
>
> In this substep is going to be implemented the incremental algorithm
> to build the MSA, traversing a guide tree (parameter, could be the
> one built in step 3 or any other one) in a bottom-up fashion and using
> the algorithm from substep 4.1 at each iteration.
>
> *** 4.3 Implement the MSA wrapper.
>
> The MSA wrapper is going to be a method that wraps steps 2, 3 and
> 4.2, giving a simple method (for the final user) to calculate the MSA.
> Receiving as parameters the set of sequences to be aligned, the gap
> opening penalty, gap extend penalty and residue matrix. Returning the
> MSA for the sequence set.
> At the end of this substep, we get a basic fully functional MSA
> algorithm, using the progressive heuristic.
>
> *** 4.4 Implement gaps penalties rescaling and parameter default values.
>
> Gap penalties to open a new gap and extend an existing one (the affine
> gap weight model) are user defined parameters. This substep will
> define default values, based on the residue matrix, for these
> parameters and implement global rescaling rules (based on sequences
> sizes) for these parameters.
>
> *** 4.5 Enhance the DP algorithm to use different sequences weight.
>
> Based on the guide tree, for each sequence a different weight
> (divergent sequences receive high values) is calculated and used in
> the scoring scheme of the generalized DP algorithm.
>
> *** 4.6 Enhance the DP algorithm to use position based gap penalties.
>
> The DP algorithm from substep 4.1 uses globally defined gap opening
> penalty.
In this substep, the algorithm is going to be modified do use > position based penalty, this is simple, once is known an array of > opening penalties for each sequence position. This array is calculated > based on several hierarchical (only apply the first one that fits, if > any) rules, those are rescaling rules and the array is initialized > with the original gap opening penalty. > > Given the hierarchical nature of the rules, they can be implemented in > a incremental way, from the highest priority rule to the lowest, the > algorithm of each step being a refinement of the previous one. I am > omitting the detailed description of each rule. However, to verify if > a given rule apply to a given position, all that is necessary is to > check at most 16 adjacent positions and the same position in the other > already aligned sequences. > > At the end of each of the following steps we a have functional > algorithm, and after 4.6.4 the full CLUSTALW algorithm is complete. > > **** 4.6.1 Lowered gap opening penalties at existing gaps. > **** 4.6.2 Increased gap opening penalties near existing gaps. > **** 4.6.3 Reduced gap opening penalties in hydrophilic stretches. > **** 4.6.4 Residue specific gap penalties. > > ?ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks. > > ?EXTRA: Implement some benchmark technique to measure the final > alignment quality. > > Main Risks > ---------- > > The main risk to this project is the intrinsic complexity of the MSA > progressive algorithm. To deal with that we decided to break the > implementation in a large number of small and manageable steps, and > the steps are designed in a way that, at the end of each of them, we > will have a complete and testable new function (or a modification of > an existing one). 
Besides that, to be extra careful the project aims > to produce a simple full functional MSA algorithm as early as > possible, the estimated time is 8 weeks, this way we guarantee to > deliver at a simpler, but working and bug-free, version. > > > > >> Andreas >> >> >>> ------------------------------------------------------------- >>> >>> GSoC proposal >>> >>> Abstract >>> -------- >>> >>> This project aims to develop an all-Java implementation of a multiple >>> sequence alignment (MSA) algorithm to be added to the Biojava toolkit, >>> using the progressive algorithm described in the CLUSTALW paper [1]. >>> >>> The Importance >>> -------------- >>> >>> Multiple sequence alignment is a frequently performed task in sequence >>> analysis with the goal to identify new members of protein families and >>> infer phylogenetic relationships between proteins and genes. At the >>> present there is no Java-only implementation for this algorithm. As >>> such the number of already existing and Java related BioInformatics >>> tools and web sites would benefit from this implementation and >>> sequence analysis could be more easily performed by the end-user. >>> >>> About Me >>> -------- >>> >>> I am a graduate student at University of S?o Paulo (Brazil), I got my >>> undergraduate degree from the same university with a major in Computer >>> Science and a minor in Biology. I have been involved with >>> Bioinformatics for 5 years, always with sequence analysis with >>> particular interest in the MSA problem. Also, in my undergraduate >>> final project I developed a lossless filter (pruning algorithm) for >>> the MSA problem, the work is published in [3] and there is an online >>> implementation of the algorithm in [4]. Finally, I have experience >>> with the C, C++, Java, Python and Ruby programming languages; Git and >>> SVN version control systems. 
>>> >>> Project Plan >>> ------------ >>> >>> The project is divided in four main steps, at the end of each step a >>> completely functional and bug-free new algorithm will be added to the >>> Biojava code base. It should be noticed that each step has a strong >>> dependence on the previous one, so before move to the next step a >>> careful testing will be done. >>> >>> The four steps are described below, estimated times for accomplishment >>> of each step are also given and in some steps extra enhancements are >>> described, they will be implemented if there is some time remaining >>> after all steps are completed. >>> >>> ** 1. Study the Biojava pairwise alignment code and update it to be >>> compliant with Biojava 3. >>> >>> ?The pairwise alignment will play an important role in the MSA >>> algorithm. This step is also important for me to get used to the >>> Biojava coding standards and get in touch with the Biojava dev >>> community. >>> >>> ?ETA: 2 weeks. >>> >>> ** 2. Implement the algorithm to build the distance matrix. >>> >>> ?This is done using the pairwise alignment for each pair of sequence >>> in the set to be aligned. >>> >>> ?ETA: 1 week. >>> >>> ?EXTRA: Enhance the basic algorithm to use parallel strategies, use >>> several threads to calculate the pairwise alignment for different >>> pairs in the sequence set. >>> >>> ** 3. Implement the algorithm to build the guide tree. >>> >>> ?The guide tree is based on the distance matrix built in the last >>> step, the tree construction strategy adopted will be the Neighbor >>> Joining Algorithm. >>> >>> ?ETA: 2 weeks. >>> >>> ** 4. Implement the algorithm for progressive MSA using the guide tree. >>> >>> ?This is certainly the most difficult part of the project, so to make >>> sure we are going to deliver a fully functional MSA algorithm, a safer >>> approach is going to be taken. In the first place, a dynamic >>> programming algorithm described in [2] will be implemented. 
Once this >>> get successfully done and the code fully integrated to the Biojava >>> code base, the features described in [1] are going to be incrementally >>> added (and tested) in order to implement the full dynamic programming >>> algorithm. >>> >>> ?ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks. >>> >>> ?EXTRA: Implement some benchmark technique to measure the final >>> alignment quality. >>> >>> References >>> ---------- >>> >>> [1] http://www.ncbi.nlm.nih.gov/pubmed/7984417 >>> [2] http://www.ncbi.nlm.nih.gov/pubmed/3243435 >>> [3] http://www.almob.org/content/4/1/3 >>> [4] http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu >>> >>> >>> >>> On Tue, Apr 6, 2010 at 6:27 PM, Andreas Prlic wrote: >>>> Hi Gustavo, >>>> >>>> In principle I agree to all, see details below: >>>> >>>> >>>> I think my question wasn't very clear, my intention in this project is >>>>> >>>>> to follow the approach (with the tree steps) outlined in the project's >>>>> page. Using the classical progressive alignment heuristic: build the >>>>> distance matrix, build the guide tree and using this tree >>>>> progressively align more sequences together. >>>> >>>> yes >>>> >>>>> >>>>> What I propose for the third step is a first implementation using the >>>>> (more simple) dynamic programming described in the first CLUSTAL paper >>>>> (I thinks it's from 1988) and incrementally improving the algorithm to >>>>> get closer to the one described in CLUSTALW paper (from 1994). Is this >>>>> more or less what you had in mind? >>>> >>>> yes, sounds good. >>>> >>>>> >>>>> About parallel strategies, I think a relative easy way we could use it >>>>> is in the distance matrix construction, we could have several threads >>>>> calculating the pairwise alignment for different pairs of sequence in >>>>> the set. >>>> >>>> Correct. Probably a first implementation would be for a single machine/ >>>> multi CPU. More advanced implementations could provide support e.g. 
for >>>> Map/Reduce, JPPF, or something like that... >>>> >>>>> Now, the alignment quality measures is a tougher issue. The CLUSTALW >>>>> paper doesn't give any way to measure the quality of the result, they >>>>> consider a good alignment the one that is hard to improve by eye (But >>>>> they claim that for sequences sufficient similar, no pair less than >>>>> 35% identical, the results are good). Can I do the same as in CLUSTALW >>>>> paper and leave the quality measure to the user? How concerned should >>>>> I be with that in this project? >>>> >>>> Getting an overall core-algorithm that works should be priority. The >>>> benchmarking part is not mandatory, but something to keep in mind... I have >>>> plenty of material for that, once we get to that stage... >>>> >>>>> I will try send to this mailing list a proposal draft until tomorrow >>>>> to have some feedback from you. >>>> >>>> Excellent, looking forward to it. >>>> >>>> Andreas >>>> >>>> -- >>>> ----------------------------------------------------------------------- >>>> Dr. Andreas Prlic >>>> Senior Scientist, RCSB PDB Protein Data Bank >>>> University of California, San Diego >>>> (+1) 858.246.0526 >>>> ----------------------------------------------------------------------- >>>> >>> >> >> >> >> -- >> ----------------------------------------------------------------------- >> Dr. Andreas Prlic >> Senior Scientist, RCSB PDB Protein Data Bank >> University of California, San Diego >> (+1) 858.246.0526 >> ----------------------------------------------------------------------- >> > -- ----------------------------------------------------------------------- Dr. 
Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------
From chapman at cs.wisc.edu Thu Apr 8 16:45:21 2010
From: chapman at cs.wisc.edu (Mark Chapman)
Date: Thu, 08 Apr 2010 15:45:21 -0500
Subject: [Biojava-l] GSoC Application
In-Reply-To:
References: <4BBDD050.6090208@cs.wisc.edu>
Message-ID: <4BBE4061.3000204@cs.wisc.edu>

Hi Andreas,

Thanks for the feedback.

Difficulties and risks:
By viewing progressive multiple sequence alignment as four separate
stages, I believe the pieces become easier to manage. However, I also
expect a few of my ideas to prove quite challenging to implement. One of
these challenges will be efficient parallelization. Instead of spending
all summer finding the optimal approach, I plan to make routines which
are called in sequence in a simple implementation and in parallel in a
separate one. Later work could then extend the parallelism to a
distributed computing framework such as Hadoop or Condor. Another
difficult aspect is to make a general interface for choosing anchors in
profile-profile alignment. The Myers-Miller algorithm chooses optimal
midpoints as anchors in an internal decision process. I hope to
generalize this to allow external identification of candidate anchors,
as well.

Structural alignment integration:
At least three options exist for inserting structural information into
the multiple sequence alignment task: pairwise scoring, anchoring, and
profile scoring. First, scores from pairwise structural alignments could
be used to construct the similarity matrix. This would create a guide
tree that aligns sequences with similar structures earlier in the
progressive alignment. Second, structural alignment could identify
possible anchors. The profile-profile alignments would then conserve
known structures when two profiles share some anchor candidates. Both of
these options are in my plan.
The third option would follow the consistency method of profile-profile
alignment which replaces scoring from a substitution matrix with a
consistency score. This technique is used in T-Coffee and ProbCons. The
consistency score comes from how often residues in each profile aligned
when combining information from pairwise alignments. If these were
structural pairwise alignments, then the multiple sequence alignment
would preserve structural information. Later work could implement this
method as an alternative profile-profile alignment.

I'll try to incorporate these ideas when I revise my application later
tonight. And thanks again for your input.

Mark

On 4/8/2010 12:26 PM, Andreas Prlic wrote:
> Hi Mark,
>
> looks pretty good,
>
> * The time schedule feels tight. Where do you see possible
> difficulties and risks. What might take longer than expected?
>
> * I would like to be able to use 3D structure alignment information to
> guide the final alignment. This should increase reliability of the
> final alignment for remote sequence similarities. Any thoughts on how
> to accomplish this?
>
> Andreas
>
>
>
>
> On Thu, Apr 8, 2010 at 5:47 AM, Mark Chapman wrote:
>> I would appreciate any feedback on my proposal from mentors or other
>> developers. Check it out at:
>> http://socghop.appspot.com/gsoc/student_proposal/show/google/gsoc2010/mark_chapman/t127055148817
>>
>> Thanks in advance,
>> Mark
>> _______________________________________________
>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
>
>
>
From sacomoto at gmail.com Thu Apr 8 20:36:27 2010
From: sacomoto at gmail.com (Gustavo Akio Tominaga Sacomoto)
Date: Thu, 8 Apr 2010 21:36:27 -0300
Subject: [Biojava-l] GSoC project on MSA
In-Reply-To:
References:
Message-ID:

Hi Andreas,

On Thu, Apr 8, 2010 at 2:36 PM, Andreas Prlic wrote:
> Looks pretty good.
>
> One issue during the progressive alignment build up: 3D structure
> alignments can increase the reliability of the sequence alignments,
> particularly if the sequences are only distantly related. Having a way
> to incorporate the 3D structure info would be nice...

A first idea for incorporating information from a 3D structure alignment
is to extract matching substrings from it, i.e. the sequence substrings
that correspond to the superimposed points in the 3D alignment, and then
force the final MSA to contain those same aligned substrings. To do
that, the DP algorithm of step 4.1 should be modified in the way
described here [ http://www.ncbi.nlm.nih.gov/pubmed/9018604 ].

Thanks again.

gustavo

> Andreas
>
> On Thu, Apr 8, 2010 at 9:26 AM, Gustavo Akio Tominaga Sacomoto
> wrote:
>> Hi Andreas,
>>
>> On Wed, Apr 7, 2010 at 4:12 PM, Andreas Prlic wrote:
>>> Hi Gustavo,
>>>
>>> here my 0.02$:
>>>
>>> * For some of your steps there is already code available in BioJava.
>>> Might be good to take a look at what is already there... (look at
>>> the alignment and phylo modules for dynamic programming and
>>> Neighbour-Joining)
>>>
>>> * What about risks? Where do you expect difficulties and how to work
>>> around them?
>>>
>>> * Step 4: Can you add more details? How do you plan to approach this?
>>> E.g. Clustalw has a number of rules implemented at this stage. Do you
>>> plan to support multiple rules as well and how to do this technically.
>>> Something nice would be the possibility to use structure alignments to
>>> guide the sequence alignments. (structure module)
>>
>> Based on it I rewrote the step 4 and add a "Main Risks" section.
>>
>> I pasted just the new version of step 4 and the new section at the end
>> of this e-mail.
>>
>> Thank you very much for your feedback.
>>
>> gustavo
>>
>>
>>
>> -------------------------------------------------------------------------------------------
>>
>> ** 4.
Implement the algorithm for progressive MSA and the MSA wrapper. >> >> ?A progressive MSA is a heuristic approach for the MSA problem, at >> each step a pairwise alignment between two sequences, a sequence and >> an alignment or between two alignments is done. So, the multiple >> alignment is built incrementally, at each iteration more sequences are >> aligned together. The guide tree gives an order for this incremental >> alignment, in a bottom-up (in the tree) fashion sequences (or groups >> of sequences) with greater similarity are aligned first. Therefore, in >> order to have a more flexible and reusable code, the code design will >> allow any binary tree of the sequences to be used as a guide tree, not >> only the one built in the last step. This will allow a priori >> phylogenetic or tertiary similarity (structural similarity) knowledge >> be used to guide the multiple alignment order. >> >> ?This is certainly the most difficult part of the project, so to make >> sure we are going to deliver a fully functional MSA algorithm, a safer >> approach is going to be taken. In the first place, a a basic algorithm >> described in [2] will be implemented. Once this get successfully done >> and the code fully integrated to the Biojava code base, the features >> described in [1] are going to be incrementally added (and tested) in >> order to implement the full algorithm. This step is further divided in >> substeps. >> >> *** 4.1 Implement a first simpler dynamic programming (DP) algorithm. >> >> ?This is the generalized pairwise alignment used in each iteration of >> the progressive MSA. Gaps ?already presents in one of the alignments >> (profiles) remain fixed, gap opening penalties remain unchanged, this >> means that opening new gaps inside existent gaps will be fully >> penalized. The code for this algorithm is similar to, the already >> present in Biojava, code for regular pairwise alignment. >> >> *** 4.2 Implement the basic progressive MSA algorithm. 
>> >> ?In this substep is going to be implemented the incremental algorithm >> to built the MSA, transversing a guide tree (parameter, could be the >> one built in step 3 or any other one) in a bottom-up fashion and using >> the algorithm from substep 4.1 at each iteration. >> >> *** 4.3 Implement the MSA wrapper. >> >> ?The MSA wrapper is going to be a method that wraps steps 2, 3 and >> 4.2, giving a simple method (for the final user) to calculate the MSA. >> Receiving as parameters the set of sequences to be aligned, the gap >> opening penalty, gap extend penalty and residue matrix. Returning the >> MSA for the sequence set. >> ?At the end of this substep, we get a basic fully functional MSA >> algorithm, using the progressive heuristic. >> >> *** 4.4 Implement gaps penalties rescaling and parameter default values. >> >> ?Gap penalties to open a new gap an extend a existing one (the affine >> gap weight model) are user defined parameters. This substep will >> define default values, based on the residue matrix, for this >> parameters and implement global rescaling rules (based on sequences >> sizes) for this parameters. >> >> *** 4.5 Enhance the DP algorithm to use different sequences weight. >> >> ?Based on the guide tree, for each sequence a different weight >> (divergent sequences receive high values) is calculated and used in >> the scoring scheme of the generalized DP algorithm. >> >> *** 4.6 Enhance the DP algorithm to use position based gap penalties. >> >> ?The DP algorithm from substep 4.1 uses globally defined gap opening >> penalty. In this substep, the algorithm is going to be modified do use >> position based penalty, this is simple, once is known an array of >> opening penalties for each sequence position. This array is calculated >> based on several hierarchical (only apply the first one that fits, if >> any) rules, those are rescaling rules and the array is initialized >> with the original gap opening penalty. 
>> >> Given the hierarchical nature of the rules, they can be implemented in >> a incremental way, from the highest priority rule to the lowest, the >> algorithm of each step being a refinement of the previous one. I am >> omitting the detailed description of each rule. However, to verify if >> a given rule apply to a given position, all that is necessary is to >> check at most 16 adjacent positions and the same position in the other >> already aligned sequences. >> >> At the end of each of the following steps we a have functional >> algorithm, and after 4.6.4 the full CLUSTALW algorithm is complete. >> >> **** 4.6.1 Lowered gap opening penalties at existing gaps. >> **** 4.6.2 Increased gap opening penalties near existing gaps. >> **** 4.6.3 Reduced gap opening penalties in hydrophilic stretches. >> **** 4.6.4 Residue specific gap penalties. >> >> ?ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks. >> >> ?EXTRA: Implement some benchmark technique to measure the final >> alignment quality. >> >> Main Risks >> ---------- >> >> The main risk to this project is the intrinsic complexity of the MSA >> progressive algorithm. To deal with that we decided to break the >> implementation in a large number of small and manageable steps, and >> the steps are designed in a way that, at the end of each of them, we >> will have a complete and testable new function (or a modification of >> an existing one). Besides that, to be extra careful the project aims >> to produce a simple full functional MSA algorithm as early as >> possible, the estimated time is 8 weeks, this way we guarantee to >> deliver at a simpler, but working and bug-free, version. 
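As a rough illustration of substep 4.2 in the proposal quoted above, the bottom-up walk over a binary guide tree can be written as a post-order recursion: both children are aligned first, and the two resulting profiles are then merged. This is only a sketch under stated assumptions, not the proposal's actual design. The names `ProgressiveMsaSketch`, `GuideNode`, `align` and `padToSameLength` are hypothetical, and `padToSameLength` is a stand-in that merely pads with trailing gaps where a real implementation would plug in the dynamic programming profile alignment of substep 4.1.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.BinaryOperator;

class ProgressiveMsaSketch {

    /** Binary guide tree node: a leaf holds one sequence, an inner node holds two subtrees. */
    static final class GuideNode {
        final String sequence;   // non-null only for leaves
        final GuideNode left;
        final GuideNode right;

        GuideNode(String sequence) {
            this.sequence = sequence;
            this.left = null;
            this.right = null;
        }

        GuideNode(GuideNode left, GuideNode right) {
            this.sequence = null;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return sequence != null;
        }
    }

    /**
     * Post-order traversal: align both subtrees first, then merge the two
     * resulting profiles with the supplied profile aligner.
     */
    static List<String> align(GuideNode node, BinaryOperator<List<String>> profileAligner) {
        if (node.isLeaf()) {
            return Collections.singletonList(node.sequence);
        }
        List<String> leftProfile = align(node.left, profileAligner);
        List<String> rightProfile = align(node.right, profileAligner);
        return profileAligner.apply(leftProfile, rightProfile);
    }

    /**
     * Placeholder aligner (an assumption for this sketch): it only pads all
     * rows with trailing gaps to a common width. It is NOT a real alignment.
     */
    static List<String> padToSameLength(List<String> a, List<String> b) {
        List<String> all = new ArrayList<>(a);
        all.addAll(b);
        int width = 0;
        for (String row : all) {
            width = Math.max(width, row.length());
        }
        List<String> padded = new ArrayList<>();
        for (String row : all) {
            StringBuilder sb = new StringBuilder(row);
            while (sb.length() < width) {
                sb.append('-');
            }
            padded.add(sb.toString());
        }
        return padded;
    }

    public static void main(String[] args) {
        // Guide tree ((ACGT, ACG), A): the two closest sequences are merged first.
        GuideNode tree = new GuideNode(
                new GuideNode(new GuideNode("ACGT"), new GuideNode("ACG")),
                new GuideNode("A"));
        System.out.println(align(tree, ProgressiveMsaSketch::padToSameLength));
    }
}
```

Because the aligner is passed in as a parameter, any binary tree and any profile-profile alignment routine could be combined, which is the flexibility the proposal asks for.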
>>
>>
>>
>>
>>> Andreas
>>>
>>>
>>>> -------------------------------------------------------------
>>>>
>>>> GSoC proposal
>>>>
>>>> Abstract
>>>> --------
>>>>
>>>> This project aims to develop an all-Java implementation of a multiple
>>>> sequence alignment (MSA) algorithm to be added to the Biojava toolkit,
>>>> using the progressive algorithm described in the CLUSTALW paper [1].
>>>>
>>>> The Importance
>>>> --------------
>>>>
>>>> Multiple sequence alignment is a frequently performed task in sequence
>>>> analysis with the goal to identify new members of protein families and
>>>> infer phylogenetic relationships between proteins and genes. At the
>>>> present there is no Java-only implementation for this algorithm. As
>>>> such the number of already existing and Java related BioInformatics
>>>> tools and web sites would benefit from this implementation and
>>>> sequence analysis could be more easily performed by the end-user.
>>>>
>>>> About Me
>>>> --------
>>>>
>>>> I am a graduate student at University of São Paulo (Brazil), I got my
>>>> undergraduate degree from the same university with a major in Computer
>>>> Science and a minor in Biology. I have been involved with
>>>> Bioinformatics for 5 years, always with sequence analysis with
>>>> particular interest in the MSA problem. Also, in my undergraduate
>>>> final project I developed a lossless filter (pruning algorithm) for
>>>> the MSA problem, the work is published in [3] and there is an online
>>>> implementation of the algorithm in [4]. Finally, I have experience
>>>> with the C, C++, Java, Python and Ruby programming languages; Git and
>>>> SVN version control systems.
>>>>
>>>> Project Plan
>>>> ------------
>>>>
>>>> The project is divided in four main steps, at the end of each step a
>>>> completely functional and bug-free new algorithm will be added to the
>>>> Biojava code base.
It should be noticed that each step has a strong >>>> dependence on the previous one, so before move to the next step a >>>> careful testing will be done. >>>> >>>> The four steps are described below, estimated times for accomplishment >>>> of each step are also given and in some steps extra enhancements are >>>> described, they will be implemented if there is some time remaining >>>> after all steps are completed. >>>> >>>> ** 1. Study the Biojava pairwise alignment code and update it to be >>>> compliant with Biojava 3. >>>> >>>> ?The pairwise alignment will play an important role in the MSA >>>> algorithm. This step is also important for me to get used to the >>>> Biojava coding standards and get in touch with the Biojava dev >>>> community. >>>> >>>> ?ETA: 2 weeks. >>>> >>>> ** 2. Implement the algorithm to build the distance matrix. >>>> >>>> ?This is done using the pairwise alignment for each pair of sequence >>>> in the set to be aligned. >>>> >>>> ?ETA: 1 week. >>>> >>>> ?EXTRA: Enhance the basic algorithm to use parallel strategies, use >>>> several threads to calculate the pairwise alignment for different >>>> pairs in the sequence set. >>>> >>>> ** 3. Implement the algorithm to build the guide tree. >>>> >>>> ?The guide tree is based on the distance matrix built in the last >>>> step, the tree construction strategy adopted will be the Neighbor >>>> Joining Algorithm. >>>> >>>> ?ETA: 2 weeks. >>>> >>>> ** 4. Implement the algorithm for progressive MSA using the guide tree. >>>> >>>> ?This is certainly the most difficult part of the project, so to make >>>> sure we are going to deliver a fully functional MSA algorithm, a safer >>>> approach is going to be taken. In the first place, a dynamic >>>> programming algorithm described in [2] will be implemented. 
Once this >>>> get successfully done and the code fully integrated to the Biojava >>>> code base, the features described in [1] are going to be incrementally >>>> added (and tested) in order to implement the full dynamic programming >>>> algorithm. >>>> >>>> ?ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks. >>>> >>>> ?EXTRA: Implement some benchmark technique to measure the final >>>> alignment quality. >>>> >>>> References >>>> ---------- >>>> >>>> [1] http://www.ncbi.nlm.nih.gov/pubmed/7984417 >>>> [2] http://www.ncbi.nlm.nih.gov/pubmed/3243435 >>>> [3] http://www.almob.org/content/4/1/3 >>>> [4] http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu >>>> >>>> >>>> >>>> On Tue, Apr 6, 2010 at 6:27 PM, Andreas Prlic wrote: >>>>> Hi Gustavo, >>>>> >>>>> In principle I agree to all, see details below: >>>>> >>>>> >>>>> I think my question wasn't very clear, my intention in this project is >>>>>> >>>>>> to follow the approach (with the tree steps) outlined in the project's >>>>>> page. Using the classical progressive alignment heuristic: build the >>>>>> distance matrix, build the guide tree and using this tree >>>>>> progressively align more sequences together. >>>>> >>>>> yes >>>>> >>>>>> >>>>>> What I propose for the third step is a first implementation using the >>>>>> (more simple) dynamic programming described in the first CLUSTAL paper >>>>>> (I thinks it's from 1988) and incrementally improving the algorithm to >>>>>> get closer to the one described in CLUSTALW paper (from 1994). Is this >>>>>> more or less what you had in mind? >>>>> >>>>> yes, sounds good. >>>>> >>>>>> >>>>>> About parallel strategies, I think a relative easy way we could use it >>>>>> is in the distance matrix construction, we could have several threads >>>>>> calculating the pairwise alignment for different pairs of sequence in >>>>>> the set. >>>>> >>>>> Correct. Probably a first implementation would be for a single machine/ >>>>> multi CPU. 
More advanced implementations could provide support e.g. for >>>>> Map/Reduce, JPPF, or something like that... >>>>> >>>>>> Now, the alignment quality measures is a tougher issue. The CLUSTALW >>>>>> paper doesn't give any way to measure the quality of the result, they >>>>>> consider a good alignment the one that is hard to improve by eye (But >>>>>> they claim that for sequences sufficient similar, no pair less than >>>>>> 35% identical, the results are good). Can I do the same as in CLUSTALW >>>>>> paper and leave the quality measure to the user? How concerned should >>>>>> I be with that in this project? >>>>> >>>>> Getting an overall core-algorithm that works should be priority. The >>>>> benchmarking part is not mandatory, but something to keep in mind... I have >>>>> plenty of material for that, once we get to that stage... >>>>> >>>>>> I will try send to this mailing list a proposal draft until tomorrow >>>>>> to have some feedback from you. >>>>> >>>>> Excellent, looking forward to it. >>>>> >>>>> Andreas >>>>> >>>>> -- >>>>> ----------------------------------------------------------------------- >>>>> Dr. Andreas Prlic >>>>> Senior Scientist, RCSB PDB Protein Data Bank >>>>> University of California, San Diego >>>>> (+1) 858.246.0526 >>>>> ----------------------------------------------------------------------- >>>>> >>>> >>> >>> >>> >>> -- >>> ----------------------------------------------------------------------- >>> Dr. Andreas Prlic >>> Senior Scientist, RCSB PDB Protein Data Bank >>> University of California, San Diego >>> (+1) 858.246.0526 >>> ----------------------------------------------------------------------- >>> >> > > > > -- > ----------------------------------------------------------------------- > Dr. 
Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------
>

From sheoran143 at gmail.com Sun Apr 11 15:16:29 2010
From: sheoran143 at gmail.com (Deepak Sheoran)
Date: Sun, 11 Apr 2010 14:16:29 -0500
Subject: [Biojava-l] Issue with SimpleNCBITaxon class
Message-ID: <4BC2200D.8000109@gmail.com>

Hi,

There is a very fundamental issue in the SimpleNCBITaxon class, because of which it produces the wrong taxonomy hierarchy. I am explaining what I have found; let me know what you think of it, and please suggest how to fix it.

1) The columns in the taxon table are (taxon_id, ncbi_taxon_id, parent_taxon_id, nodeRank, geneticCode, mitoGeneticCode, leftValue, rightValue).
2) In the class SimpleNCBITaxon we assume that "parent_taxon_id" holds the ncbi_taxon_id of the parent of the current entry, but that is not true. The value that "parent_taxon_id" holds is the "taxon_id" of the row that carries the parent's ncbi_taxon_id.

----- it's not correct; the column parent_taxon_id stores the taxon_id of the row that has the parent's ncbi_taxon_id for the current entry

Thanks
Deepak Sheoran

From holland at eaglegenomics.com Sun Apr 11 15:53:06 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Sun, 11 Apr 2010 20:53:06 +0100
Subject: [Biojava-l] Issue with SimpleNCBITaxon class
In-Reply-To: <4BC2200D.8000109@gmail.com>
References: <4BC2200D.8000109@gmail.com>
Message-ID:

I'm sorry but I don't understand your example. Could you provide a real example of correct values for each column from a sample taxon entry in NCBI, plus an example of what BioJava is doing wrong? (i.e. give a sample record to use as reference, then point out the correct value of parent_taxon_id, and point out what value BioJava is using instead).
thanks,
Richard

On 11 Apr 2010, at 20:16, Deepak Sheoran wrote:

> Hi,
>
> There is a very fundamental issue in the SimpleNCBITaxon class, because of which it produces the wrong taxonomy hierarchy. I am explaining what I have found; let me know what you think of it, and please suggest how to fix it.
>
> 1) The columns in the taxon table are (taxon_id, ncbi_taxon_id, parent_taxon_id, nodeRank, geneticCode, mitoGeneticCode, leftValue, rightValue).
> 2) In the class SimpleNCBITaxon we assume that "parent_taxon_id" holds the ncbi_taxon_id of the parent of the current entry, but that is not true. The value that "parent_taxon_id" holds is the "taxon_id" of the row that carries the parent's ncbi_taxon_id.
>
> ----- it's not correct; the column parent_taxon_id stores the taxon_id of the row that has the parent's ncbi_taxon_id for the current entry
>
> Thanks
> Deepak Sheoran
>
>

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From sheoran143 at gmail.com Sun Apr 11 17:08:22 2010
From: sheoran143 at gmail.com (Deepak Sheoran)
Date: Sun, 11 Apr 2010 16:08:22 -0500
Subject: [Biojava-l] Issue with SimpleNCBITaxon class
In-Reply-To:
References: <4BC2200D.8000109@gmail.com>
Message-ID: <4BC23A46.7090304@gmail.com>

I am using the same table with the BioJava and BioPerl taxon programs, and the output I get is below:

*Biojava:*
For example, for ncbi_taxon_id = 11876 (Avian sarcoma virus), the lineage I get is
Rhamnus; Platanus occidentalis; Suillus placidus; Diadasia australis; Arnicastrum guerrerense; Labiduridae; Oreostemma alpigenum var. haydenii.

Biojava process of finding names:
11876==>301952==>50447==>176516==>143975==>48579==>4403==>3609==>276240 (wrong way of doing things)

*Bioperl:*
For example, for ncbi_taxon_id = 11876 (Avian sarcoma virus), the lineage I get is
Retroviridae; Orthoretrovirinae; Alpharetrovirus; unclassified Alpharetrovirus.
Bioperl process of finding names:
11876==>353825==>153057==>327045==>11632 (right way of doing things)

Hint: BioJava searches the ncbi_taxon_id column with a value from parent_taxon_id, whereas BioPerl searches the taxon_id column with a value from parent_taxon_id.

*Taxon and taxon_name table content relevant to this discussion:*

taxon_id  ncbi_taxon_id  parent_taxon_id  node_rank  name                                name_class
2901      3609           276240           genus      Rhamnus                             scientific name
3610      4403           3609             species    Platanus occidentalis               scientific name
29052     48579          4403             species    Suillus placidus                    scientific name
114412    143975         48579            species    Diadasia australis                  scientific name
143976    176516         143975           species    Arnicastrum guerrerense             scientific name
30680     50447          176516           family     Labiduridae                         scientific name
254757    301952         50447            varietas   Oreostemma alpigenum var. haydenii  scientific name
9394      11632          17394            family     Retroviridae                        scientific name
277861    327045         9394             subfamily  Orthoretrovirinae                   scientific name
122448    153057         277861           genus      Alpharetrovirus                     scientific name
301952    353825         122448           no rank    unclassified Alpharetrovirus        scientific name
9584      11876          301952           species    Avian sarcoma virus                 scientific name

Thanks
Deepak

On 4/11/2010 2:53 PM, Richard Holland wrote:
> I'm sorry but I don't understand your example. Could you provide a real example of correct values for each column from a sample taxon entry in NCBI, plus an example of what BioJava is doing wrong? (i.e. give a sample record to use as reference, then point out the correct value of parent_taxon_id, and point out what value BioJava is using instead).
>
> thanks,
> Richard
>
> On 11 Apr 2010, at 20:16, Deepak Sheoran wrote:
>
>
>> Hi,
>>
>> There is a very fundamental issue in the SimpleNCBITaxon class, because of which it produces the wrong taxonomy hierarchy. I am explaining what I have found; let me know what you think of it, and please suggest how to fix it.
>>
>> 1) The columns in the taxon table are (taxon_id, ncbi_taxon_id, parent_taxon_id, nodeRank, geneticCode, mitoGeneticCode, leftValue, rightValue).
>> 2) In the class SimpleNCBITaxon we assume that "parent_taxon_id" holds the ncbi_taxon_id of the parent of the current entry, but that is not true. The value that "parent_taxon_id" holds is the "taxon_id" of the row that carries the parent's ncbi_taxon_id.
>>
>> ----- it's not correct; the column parent_taxon_id stores the taxon_id of the row that has the parent's ncbi_taxon_id for the current entry
>>
>> Thanks
>> Deepak Sheoran
>>
>>
>>
> --
> Richard Holland, BSc MBCS
> Operations and Delivery Director, Eagle Genomics Ltd
> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/
>
>

From sheoran143 at gmail.com Sun Apr 11 18:48:00 2010
From: sheoran143 at gmail.com (Deepak Sheoran)
Date: Sun, 11 Apr 2010 17:48:00 -0500
Subject: [Biojava-l] Issue with SimpleNCBITaxon class
In-Reply-To: <4BC23A46.7090304@gmail.com>
References: <4BC2200D.8000109@gmail.com> <4BC23A46.7090304@gmail.com>
Message-ID: <4BC251A0.4090602@gmail.com>

If we don't want to change the current code in BioJava and still want to fix this bug, I have found a way:

1) We can do this by changing one of the Hibernate files, "Taxon.hbm.xml", and replacing the line

with

By changing the above Hibernate setting I am able to get the correct lineage for ncbi_taxon_id = 11876 (Avian sarcoma virus), which is
Viruses; Retro-transcribing viruses; Retroviridae; Orthoretrovirinae; Alpharetrovirus; unclassified Alpharetrovirus.

2) A possible issue is the taxonomy loader class, which wants to insert a value for the parent taxon_id into the taxon table; I think that won't be possible if we make this change to the Hibernate config file.
Deepak Sheoran On 4/11/2010 4:08 PM, Deepak Sheoran wrote: > I am using same table with biojava and bioperl taxon program and the > output I get is below: > > *Biojava:* > For example for ncbi_taxon_id = 11876 (Avian sarcoma virus), the > lineage i get is > Rhamnus; Platanus occidentalis; Suillus placidus; Diadasia > australis; Arnicastrum guerrerense; Labiduridae; Oreostemma alpigenum > var. haydenii. > > Biojava process of finding names: > 11876==>3019252==>50447==>176516==>143975==>48579==>4403==>3609==>276240 > (wrong way of doing things) > > *Bioperl:* > For example for ncbi_taxon_id = 11876 (Avian sarcoma virus), the > lineage i get is > Retroviridae; Orthoretrovirinae; Alpharetrovirus; > unclassified Alpharetrovirus. > > Bioperl process of finding names: > 11876==>353825==>153057==>327045==>11632 (Right way of doing things) > > Hint: biojava search ncbi_taxon_id column with a value from > parent_taxon_id where bioperl search taxon_id column with a value from > parent_taxon_id. > > *Taxon and Taxon_name Table content which is being relevant in > discussion:* > > taxon_id ncbi_taxon_id parent_taxon_id node_rank name name_class > 2901 3609 276240 genus Rhamnus scientific name > 3610 4403 3609 species Platanus occidentalis scientific name > 29052 48579 4403 species Suillus placidus scientific name > 114412 143975 48579 species Diadasia australis scientific name > 143976 176516 143975 species Arnicastrum guerrerense scientific name > 30680 50447 176516 family Labiduridae scientific name > 254757 301952 50447 varietas Oreostemma alpigenum var. 
haydenii > scientific name > 9394 11632 17394 family Retroviridae scientific name > 277861 327045 9394 subfamily Orthoretrovirinae scientific name > 122448 153057 277861 genus Alpharetrovirus scientific name > 301952 353825 122448 no rank unclassified Alpharetrovirus > scientific name > 9584 > 11876 > 301952 > species > Avian sarcoma virus > scientifice name > > > Thanks > Deepak > > On 4/11/2010 2:53 PM, Richard Holland wrote: >> I'm sorry but I don't understand your example. Could you provide a real example of correct values for each column from a sample taxon entry in NCBI, plus an example of what BioJava is doing wrong? (i.e. give a sample record to use as reference, then point out the correct value of parent_taxon_id, and point out what value BioJava is using instead). >> >> thanks, >> Richard >> >> On 11 Apr 2010, at 20:16, Deepak Sheoran wrote: >> >> >>> Hi, >>> >>> Their is very fundamental issue in SimpleNCBITaxon class becuase of which it is producing wrong taxonomy hierarchy. I am explaing what I have found let me what you guys think of it, and me suggest how to fix it. >>> >>> 1) Columns in taxon table are (taxon_id, ncbi_taxon_id, parent_taxon_id, nodeRank, geneticCode, mitoGeneticCode, leftValue, rightValue) >>> 2) In the class SimpleNCBITaxon we are thinking "parent_taxon_id" to have parent ncbi_taxon_id for current ncbi_taxon_id value, but its not true. The value which "parent_taxon_id" have is "taxon_id" which have parent_ncbi_taxon_id of current ncbi_taxon_id. 
>>> >>> >>> >>> >>> >>> >>> >>> ----- its not correct column parent_taxon_id stores the taxon_id which have parent_ncbi_taxon_id for current entry >>> >>> Thanks >>> Deepak Sheoran >>> >>> >>> >> -- >> Richard Holland, BSc MBCS >> Operations and Delivery Director, Eagle Genomics Ltd >> T: +44 (0)1223 654481 ext 3 | E:holland at eaglegenomics.com >> http://www.eaglegenomics.com/ >> >> > From holland at eaglegenomics.com Mon Apr 12 02:57:57 2010 From: holland at eaglegenomics.com (Richard Holland) Date: Mon, 12 Apr 2010 07:57:57 +0100 Subject: [Biojava-l] Issue with SimpleNCBITaxon class In-Reply-To: <4BC23A46.7090304@gmail.com> References: <4BC2200D.8000109@gmail.com> <4BC23A46.7090304@gmail.com> Message-ID: Thanks Deepak. I've had a look at the code and I believe its due to the different ways in which BioJava and BioPerl load the taxon table. BioJava sets the ncbi_taxon_id and parent_taxon_id columns based on the values from the NCBI taxonomy file. The taxon_id column in BioJava is a meaningless auto-generated value that is never used. BioPerl however is generating taxon_id values and linking them by setting parent_taxon_id to the generated value. The parent value from the NCBI taxonomy file is therefore replaced with the BioPerl generated parent ID, meaning that instead of linking from parent_taxon_id to ncbi_taxon_id as per BioJava, the link is to taxon_id instead. (I'm basing this comment on looking at load_ncbi_taxonomy.pl from the BioSQL archives.) I believe if you load the taxonomy table using BioJava, you should see BioJava giving correct behaviour. Likewise if you load it using BioPerl, BioPerl will behave correctly. But if you load with one then query with the other, you'll get incorrect results. This sounds like a case for discussion on both lists - a matter of standardisation between the two projects. Not quickly/easily solvable for now. 
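The difference between the two parent lookups can be shown with a small self-contained sketch. It is not BioJava's or BioPerl's actual code: the class `TaxonLineageSketch`, the holder class `Taxon` and the method names are hypothetical, and the rows are simply the twelve rows quoted in this thread. One method resolves parent_taxon_id against the taxon_id column, the other resolves it against the ncbi_taxon_id column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class TaxonLineageSketch {

    /** One taxon row joined with its scientific name (hypothetical holder class). */
    static final class Taxon {
        final int taxonId;        // key generated by the loader
        final int ncbiTaxonId;    // identifier assigned by NCBI
        final int parentTaxonId;  // its meaning depends on the loader that filled the table
        final String name;

        Taxon(int taxonId, int ncbiTaxonId, int parentTaxonId, String name) {
            this.taxonId = taxonId;
            this.ncbiTaxonId = ncbiTaxonId;
            this.parentTaxonId = parentTaxonId;
            this.name = name;
        }
    }

    /** The rows quoted in this thread. */
    static final List<Taxon> ROWS = Arrays.asList(
            new Taxon(2901, 3609, 276240, "Rhamnus"),
            new Taxon(3610, 4403, 3609, "Platanus occidentalis"),
            new Taxon(29052, 48579, 4403, "Suillus placidus"),
            new Taxon(114412, 143975, 48579, "Diadasia australis"),
            new Taxon(143976, 176516, 143975, "Arnicastrum guerrerense"),
            new Taxon(30680, 50447, 176516, "Labiduridae"),
            new Taxon(254757, 301952, 50447, "Oreostemma alpigenum var. haydenii"),
            new Taxon(9394, 11632, 17394, "Retroviridae"),
            new Taxon(277861, 327045, 9394, "Orthoretrovirinae"),
            new Taxon(122448, 153057, 277861, "Alpharetrovirus"),
            new Taxon(301952, 353825, 122448, "unclassified Alpharetrovirus"),
            new Taxon(9584, 11876, 301952, "Avian sarcoma virus"));

    /**
     * Walks from the given NCBI taxon towards the root. parentKey selects the
     * column that parent_taxon_id is resolved against. Returns the ancestor
     * names, root first, joined with "; ".
     */
    static String lineage(int ncbiTaxonId, Function<Taxon, Integer> parentKey) {
        Map<Integer, Taxon> index = new HashMap<>();
        Taxon start = null;
        for (Taxon t : ROWS) {
            index.put(parentKey.apply(t), t);
            if (t.ncbiTaxonId == ncbiTaxonId) {
                start = t;
            }
        }
        if (start == null) {
            return "";
        }
        List<String> names = new ArrayList<>();
        Set<Integer> visited = new HashSet<>();   // guards against cycles in bad data
        visited.add(start.taxonId);
        Taxon parent = index.get(start.parentTaxonId);
        while (parent != null && visited.add(parent.taxonId)) {
            names.add(parent.name);
            parent = index.get(parent.parentTaxonId);
        }
        Collections.reverse(names);
        return String.join("; ", names);
    }

    /** parent_taxon_id points at taxon_id (how load_ncbi_taxonomy.pl links rows). */
    static String lineageViaTaxonId(int ncbiTaxonId) {
        return lineage(ncbiTaxonId, t -> t.taxonId);
    }

    /** parent_taxon_id points at ncbi_taxon_id (what BioJava assumes). */
    static String lineageViaNcbiTaxonId(int ncbiTaxonId) {
        return lineage(ncbiTaxonId, t -> t.ncbiTaxonId);
    }

    public static void main(String[] args) {
        System.out.println(lineageViaTaxonId(11876));
        System.out.println(lineageViaNcbiTaxonId(11876));
    }
}
```

On these rows the first call yields the Retroviridae lineage and the second yields the Rhamnus lineage, matching the two outputs reported for ncbi_taxon_id 11876. The walk stops where a parent is missing from the twelve rows, so the higher ranks above Retroviridae do not appear.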
cheers, Richard On 11 Apr 2010, at 22:08, Deepak Sheoran wrote: > I am using same table with biojava and bioperl taxon program and the output I get is below: > > Biojava: > For example for ncbi_taxon_id = 11876 (Avian sarcoma virus), the lineage i get is > Rhamnus; Platanus occidentalis; Suillus placidus; Diadasia australis; Arnicastrum guerrerense; Labiduridae; Oreostemma alpigenum var. haydenii. > > Biojava process of finding names: 11876==>3019252==>50447==>176516==>143975==>48579==>4403==>3609==>276240 (wrong way of doing things) > > Bioperl: > For example for ncbi_taxon_id = 11876 (Avian sarcoma virus), the lineage i get is > Retroviridae; Orthoretrovirinae; Alpharetrovirus; unclassified Alpharetrovirus. > > Bioperl process of finding names: 11876==>353825==>153057==>327045==>11632 (Right way of doing things) > > Hint: biojava search ncbi_taxon_id column with a value from parent_taxon_id where bioperl search taxon_id column with a value from parent_taxon_id. > > Taxon and Taxon_name Table content which is being relevant in discussion: > > taxon_id ncbi_taxon_id parent_taxon_id node_rank name name_class > 2901 3609 276240 genus Rhamnus scientific name > 3610 4403 3609 species Platanus occidentalis scientific name > 29052 48579 4403 species Suillus placidus scientific name > 114412 143975 48579 species Diadasia australis scientific name > 143976 176516 143975 species Arnicastrum guerrerense scientific name > 30680 50447 176516 family Labiduridae scientific name > 254757 301952 50447 varietas Oreostemma alpigenum var. 
haydenii scientific name > 9394 11632 17394 family Retroviridae scientific name > 277861 327045 9394 subfamily Orthoretrovirinae scientific name > 122448 153057 277861 genus Alpharetrovirus scientific name > 301952 353825 122448 no rank unclassified Alpharetrovirus scientific name > 9584 > 11876 > 301952 > species > Avian sarcoma virus > scientifice name > > Thanks > Deepak > > On 4/11/2010 2:53 PM, Richard Holland wrote: >> I'm sorry but I don't understand your example. Could you provide a real example of correct values for each column from a sample taxon entry in NCBI, plus an example of what BioJava is doing wrong? (i.e. give a sample record to use as reference, then point out the correct value of parent_taxon_id, and point out what value BioJava is using instead). >> >> thanks, >> Richard >> >> On 11 Apr 2010, at 20:16, Deepak Sheoran wrote: >> >> >> >>> Hi, >>> >>> Their is very fundamental issue in SimpleNCBITaxon class becuase of which it is producing wrong taxonomy hierarchy. I am explaing what I have found let me what you guys think of it, and me suggest how to fix it. >>> >>> 1) Columns in taxon table are (taxon_id, ncbi_taxon_id, parent_taxon_id, nodeRank, geneticCode, mitoGeneticCode, leftValue, rightValue) >>> 2) In the class SimpleNCBITaxon we are thinking "parent_taxon_id" to have parent ncbi_taxon_id for current ncbi_taxon_id value, but its not true. The value which "parent_taxon_id" have is "taxon_id" which have parent_ncbi_taxon_id of current ncbi_taxon_id. 
>>> >>> >>> >>> >>> >>> >>> >>> ----- its not correct column parent_taxon_id stores the taxon_id which have parent_ncbi_taxon_id for current entry >>> >>> Thanks >>> Deepak Sheoran >>> >>> >>> >>> >> -- >> Richard Holland, BSc MBCS >> Operations and Delivery Director, Eagle Genomics Ltd >> T: +44 (0)1223 654481 ext 3 | E: >> holland at eaglegenomics.com >> http://www.eaglegenomics.com/ >> >> >> >> > -- Richard Holland, BSc MBCS Operations and Delivery Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From holland at eaglegenomics.com Mon Apr 12 03:07:55 2010 From: holland at eaglegenomics.com (Richard Holland) Date: Mon, 12 Apr 2010 08:07:55 +0100 Subject: [Biojava-l] Issue with SimpleNCBITaxon class In-Reply-To: References: <4BC2200D.8000109@gmail.com> <4BC23A46.7090304@gmail.com> Message-ID: Incidentally, BioJava's approach matches the description in the BioSQL docs at: http://biosql.org/wiki/Schema_Overview#TAXON.2C_TAXON_NAME (first example SQL statement - find the taxon id of the parent taxon for 'Homo sapiens' using a self-join) The BioPerl/BioSQL load_ncbi_taxonomy.pl script however does not match this description. cheers, Richard On 12 Apr 2010, at 07:57, Richard Holland wrote: > Thanks Deepak. > > I've had a look at the code and I believe its due to the different ways in which BioJava and BioPerl load the taxon table. > > BioJava sets the ncbi_taxon_id and parent_taxon_id columns based on the values from the NCBI taxonomy file. The taxon_id column in BioJava is a meaningless auto-generated value that is never used. > > BioPerl however is generating taxon_id values and linking them by setting parent_taxon_id to the generated value. The parent value from the NCBI taxonomy file is therefore replaced with the BioPerl generated parent ID, meaning that instead of linking from parent_taxon_id to ncbi_taxon_id as per BioJava, the link is to taxon_id instead. 
(I'm basing this comment on looking at load_ncbi_taxonomy.pl from the BioSQL archives.) > > I believe if you load the taxonomy table using BioJava, you should see BioJava giving correct behaviour. Likewise if you load it using BioPerl, BioPerl will behave correctly. But if you load with one then query with the other, you'll get incorrect results. > > This sounds like a case for discussion on both lists - a matter of standardisation between the two projects. Not quickly/easily solvable for now. > > cheers, > Richard > > On 11 Apr 2010, at 22:08, Deepak Sheoran wrote: > >> I am using same table with biojava and bioperl taxon program and the output I get is below: >> >> Biojava: >> For example for ncbi_taxon_id = 11876 (Avian sarcoma virus), the lineage i get is >> Rhamnus; Platanus occidentalis; Suillus placidus; Diadasia australis; Arnicastrum guerrerense; Labiduridae; Oreostemma alpigenum var. haydenii. >> >> Biojava process of finding names: 11876==>3019252==>50447==>176516==>143975==>48579==>4403==>3609==>276240 (wrong way of doing things) >> >> Bioperl: >> For example for ncbi_taxon_id = 11876 (Avian sarcoma virus), the lineage i get is >> Retroviridae; Orthoretrovirinae; Alpharetrovirus; unclassified Alpharetrovirus. >> >> Bioperl process of finding names: 11876==>353825==>153057==>327045==>11632 (Right way of doing things) >> >> Hint: biojava search ncbi_taxon_id column with a value from parent_taxon_id where bioperl search taxon_id column with a value from parent_taxon_id. 
>> >> Taxon and Taxon_name Table content which is being relevant in discussion: >> >> taxon_id ncbi_taxon_id parent_taxon_id node_rank name name_class >> 2901 3609 276240 genus Rhamnus scientific name >> 3610 4403 3609 species Platanus occidentalis scientific name >> 29052 48579 4403 species Suillus placidus scientific name >> 114412 143975 48579 species Diadasia australis scientific name >> 143976 176516 143975 species Arnicastrum guerrerense scientific name >> 30680 50447 176516 family Labiduridae scientific name >> 254757 301952 50447 varietas Oreostemma alpigenum var. haydenii scientific name >> 9394 11632 17394 family Retroviridae scientific name >> 277861 327045 9394 subfamily Orthoretrovirinae scientific name >> 122448 153057 277861 genus Alpharetrovirus scientific name >> 301952 353825 122448 no rank unclassified Alpharetrovirus scientific name >> 9584 >> 11876 >> 301952 >> species >> Avian sarcoma virus >> scientifice name >> >> Thanks >> Deepak >> >> On 4/11/2010 2:53 PM, Richard Holland wrote: >>> I'm sorry but I don't understand your example. Could you provide a real example of correct values for each column from a sample taxon entry in NCBI, plus an example of what BioJava is doing wrong? (i.e. give a sample record to use as reference, then point out the correct value of parent_taxon_id, and point out what value BioJava is using instead). >>> >>> thanks, >>> Richard >>> >>> On 11 Apr 2010, at 20:16, Deepak Sheoran wrote: >>> >>> >>> >>>> Hi, >>>> >>>> Their is very fundamental issue in SimpleNCBITaxon class becuase of which it is producing wrong taxonomy hierarchy. I am explaing what I have found let me what you guys think of it, and me suggest how to fix it. >>>> >>>> 1) Columns in taxon table are (taxon_id, ncbi_taxon_id, parent_taxon_id, nodeRank, geneticCode, mitoGeneticCode, leftValue, rightValue) >>>> 2) In the class SimpleNCBITaxon we are thinking "parent_taxon_id" to have parent ncbi_taxon_id for current ncbi_taxon_id value, but its not true. 
The value which "parent_taxon_id" have is "taxon_id" which have parent_ncbi_taxon_id of current ncbi_taxon_id.
>>>>
>>>> ----- its not correct column parent_taxon_id stores the taxon_id which have parent_ncbi_taxon_id for current entry
>>>>
>>>> Thanks
>>>> Deepak Sheoran
>>>>
>>>>
>>>>
>>>>
>>> --
>>> Richard Holland, BSc MBCS
>>> Operations and Delivery Director, Eagle Genomics Ltd
>>> T: +44 (0)1223 654481 ext 3 | E:
>>> holland at eaglegenomics.com
>>> http://www.eaglegenomics.com/
>>>
>>>
>>>
>>>
>>
>
> --
> Richard Holland, BSc MBCS
> Operations and Delivery Director, Eagle Genomics Ltd
> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/
>
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From mara.axiom at gmail.com Tue Apr 13 10:55:50 2010
From: mara.axiom at gmail.com (Mara Axiom)
Date: Tue, 13 Apr 2010 10:55:50 -0400
Subject: [Biojava-l] BioJava implementation of a phylogenetic tree reconstruction algorithm
Message-ID:

Hello all,

Does anyone have a BioJava implementation of a phylogenetic tree reconstruction algorithm other than neighbor-joining or UPGMA? I need this for a research project. We already have neighbor-joining and UPGMA implementations, and we want to look at other algorithms.

I am new to BioJava, so any information will help. Here is what we want:

1 - Compare sequences in a FASTA file, and find sequences that are similar to each other.
2 - Construct the tree.
3 - Output the tree in Newick (XML will work too) format.
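For step 3, writing a tree in Newick format needs only a short recursion. The sketch below makes several assumptions: the classes `NewickSketch` and `Node` are hypothetical and are not BioJava types, taxon names are assumed to be plain tokens that need no quoting, and a negative branch length is used here to mean "no length recorded".

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class NewickSketch {

    /** Minimal rooted tree node (hypothetical, not a BioJava class). */
    static final class Node {
        final String name;          // empty string for unnamed internal nodes
        final double branchLength;  // length of the branch to the parent; negative means "not set"
        final List<Node> children;

        Node(String name, double branchLength, Node... children) {
            this.name = name;
            this.branchLength = branchLength;
            this.children = Arrays.asList(children);
        }
    }

    /** Serialises the tree rooted at the given node as a single Newick string. */
    static String toNewick(Node root) {
        StringBuilder sb = new StringBuilder();
        write(root, sb);
        return sb.append(';').toString();
    }

    private static void write(Node node, StringBuilder sb) {
        if (!node.children.isEmpty()) {
            sb.append('(');
            for (int i = 0; i < node.children.size(); i++) {
                if (i > 0) {
                    sb.append(',');
                }
                write(node.children.get(i), sb);
            }
            sb.append(')');
        }
        sb.append(node.name);
        if (node.branchLength >= 0) {
            // Locale.ROOT keeps the decimal separator a dot on every machine.
            sb.append(':').append(String.format(Locale.ROOT, "%.3f", node.branchLength));
        }
    }

    public static void main(String[] args) {
        Node root = new Node("", -1,
                new Node("", 0.05, new Node("A", 0.1), new Node("B", 0.2)),
                new Node("C", 0.3));
        System.out.println(toNewick(root));
    }
}
```

The same recursion works for any tree reconstruction method, because it depends only on the finished tree and not on how the tree was built.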
In particular, we are interested in implementations of the BNNP (http://www.cs.cmu.edu/~guyb/papers/SDBHRS06.pdf) and Align Free (http://www.math.ucla.edu/~roch/research_files/align-free.pdf) algorithms, but we are open to other algorithms too.

Please do not recommend a P-tree reconstruction tool. We are only interested in source code that meets our specific purpose.

Thanks in advance,
Mara

From biopython at maubp.freeserve.co.uk Thu Apr 15 13:54:56 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Apr 2010 18:54:56 +0100
Subject: [Biojava-l] Issue with SimpleNCBITaxon class
In-Reply-To:
References: <4BC2200D.8000109@gmail.com> <4BC23A46.7090304@gmail.com>
Message-ID:

Hi,

I've CC'd this to the BioSQL mailing list for cross-project discussion.

On Mon, Apr 12, 2010 at 7:57 AM, Richard Holland wrote:
> Thanks Deepak.
>
> I've had a look at the code and I believe its due to the
> different ways in which BioJava and BioPerl load the
> taxon table.
>
> BioJava sets the ncbi_taxon_id and parent_taxon_id
> columns based on the values from the NCBI taxonomy
> file. The taxon_id column in BioJava is a meaningless
> auto-generated value that is never used.
>
> BioPerl however is generating taxon_id values and
> linking them by setting parent_taxon_id to the
> generated value. The parent value from the NCBI
> taxonomy file is therefore replaced with the BioPerl
> generated parent ID, meaning that instead of linking
> from parent_taxon_id to ncbi_taxon_id as per BioJava,
> the link is to taxon_id instead. (I'm basing this
> comment on looking at load_ncbi_taxonomy.pl from
> the BioSQL archives.)

Note that old versions of load_ncbi_taxonomy.pl (which is part of BioSQL, not part of BioPerl) would set taxon_id equal to ncbi_taxon_id, see:

http://bugzilla.open-bio.org/show_bug.cgi?id=2470

This may help explain the confusion.

> I believe if you load the taxonomy table using BioJava,
> you should see BioJava giving correct behaviour.
> Likewise if you load it using BioPerl, BioPerl will > behave correctly. But if you load with one then query > with the other, you'll get incorrect results. > > This sounds like a case for discussion on both lists - > a matter of standardisation between the two projects. > Not quickly/easily solvable for now. Its not just two projects (BioPerl & BioJava) (grin). Its at least five projects (BioSQL itself plus BioRuby and Biopython). I'm not sure about BioRuby's implementation, but currently I think BioJava is the odd one out - BioPerl, Biopython, and the BioSQL's load_ncbi_taxonomy.pl all make entries in parent_taxon_id reference the automatically generated taxon_id (please correct me if I am wrong). My personal view is that bioperl-db is the reference implementation and should be followed in the event of any ambiguity within BioSQL. In this particular case, there is actually a BioSQL script to check against too (load_ncbi_taxonomy.pl). Hopefully Hilmar can give us an official verdict... Peter From andreas.draeger at uni-tuebingen.de Wed Apr 7 09:22:26 2010 From: andreas.draeger at uni-tuebingen.de (=?ISO-8859-1?Q?Andreas_Dr=E4ger?=) Date: Wed, 07 Apr 2010 15:22:26 +0200 Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava alignment lead"] Message-ID: <4BBC8712.90907@uni-tuebingen.de> Hi all, This e-mail is just for your information about somebody new, who'd like to contribute to our project. Cheers Andreas -- Dipl.-Bioinform. Andreas Dr?ger Eberhard Karls University T?bingen Center for Bioinformatics (ZBIT) Sand 1 72076 T?bingen Germany Phone: +49-7071-29-70436 Fax: +49-7071-29-5091 -------------- next part -------------- An embedded message was scrubbed... 
From: Andreas Dräger
Subject: Re: Fwd: Proposing a project on "Biojava alignment lead"
Date: Wed, 07 Apr 2010 09:27:13 +0200
Size: 4779
URL:

From jbdundas at gmail.com Fri Apr 16 09:57:41 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Fri, 16 Apr 2010 19:27:41 +0530
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava alignment lead"]
In-Reply-To: <4BBD820D.9070200@uni-tuebingen.de>
References: <4BBD820D.9070200@uni-tuebingen.de>
Message-ID:

Dear Sir,

I am very interested in contributing to this project.

I am looking for a good problem, more on the research side. I can also
help with coding (I also work as a software engineer:
J2EE/Eclipse/JBoss/Tomcat).

Anything that I could work on...

Regards,
Jitesh Dundas

On 4/8/10, Andreas Dräger wrote:
> Hi all,
>
> This e-mail is just for your information about somebody new, who'd like
> to contribute to our project.
>
> Cheers
> Andreas
>
>
> Subject:
> Re: Fwd: Proposing a project on "Biojava alignment lead"
> From:
> Andreas Dräger
> Date:
> Wed, 07 Apr 2010 09:27:13 +0200
> To:
> Cai Shaojiang
>
> Hi Cai Shaojiang,
>
> Thank you for your e-mail! I don't know what happened to the e-mail list.
> Sometimes it takes a while due to the spam filters, I guess.
>
> > I am a PhD student from National University of Singapore. My major
> > research area is local alignment algorithms and data structures for SNP
> > identification. And I have used Java and Eclipse for years for software
> > development. I am very interested in your GSoC programme. I find that
> > there is a module called "biojava-alignment lead" whose mentor is you. I
> > want to propose a new project on this module. I have several questions
> > about this module.
>
> Yes, that's me. So great to get your support.
>
> > 1. It seems that pairwise alignment is to find similarity between two
> > short sequences. Existing pairwise alignment is based on dynamic
> > programming, is it the Smith-Waterman algorithm?
>
> So, currently, BioJava contains three different alignment approaches.
> There are two deterministic algorithms, i.e., Smith-Waterman for local
> alignment and Needleman-Wunsch for global alignment. Third, there is the
> possibility to apply Hidden Markov Models for alignment. An example of
> the latter approach should be in the cookbook.
>
> > 2. What is the exact task of "refactoring of underlying data structures"?
>
> Yes, this is something I did last week already, but it could still be
> improved. The problem was that the alignment algorithms actually
> produced a kind of string that looks similar to the output of BLAST.
> This string contained the score, the computation time, the length of the
> alignment etc. The problem was that people wanted to perform
> higher-level computation on the score value or evaluate some other
> information. Now, the alignment will produce a data structure that
> contains all the information and can, in addition to that, also produce
> such a BLAST-like output. There is, however, still the following
> problem: The data structure requires both sequences in the pair-wise
> alignment to have an identical length. In case of local alignment this
> is especially stupid (actually), because gaps are inserted to fill the
> sequences. And then the data structure tries to keep the old sequence
> coordinates, leading to the effect that the numbers "query start",
> "query end", "subject start", and "subject end" are required to shift
> the sequences against each other when displaying the output. So, you
> cannot easily print the sequences below each other, you first have to
> shift them. Please check out the latest version of this package via
> anonymous svn and have a look ;-)
>
> > 3. My existing research area is aiming to deal with aligning short
> > reads (10s~100s bp) against extremely long sequences (e.g., human
> > genome). As far as I know, there are no such alignment tools
> > implemented in Java. Would you consider this direction?
>
> See, this would be very nice to include. But this requires that we no
> longer fill the short sequence with many, many gap symbols (just a waste
> of memory), but improve the data structure. There is already an
> UnequalLengthAlignment (just a data structure, no algorithm) and I think
> we could use this as a starting point. Then your algorithm should only
> produce such a data structure and this would be fine.
>
> > 4. It seems that the existing tools are just lacking some
> > refactoring and representation interfaces. Any more underlying tasks?
>
> Hm. Yes: With the release of BioJava 3 data structures have changed
> again. So maybe there's also some adaptation to the new structure required.
>
> > I have been keeping an eye on GSoC since last month, but I am sorry to
> > find out that I sent the initial email to the mailing list before I
> > subscribed to it...
>
> Ok. Sounds good. Thanks for your interest. So I suggest: Download the
> latest trunk, have a look, play around and if you can improve something
> we'll put it into the trunk and write your name into the authors' tag.
>
> Cheers
> Andreas

From chapman at cs.wisc.edu Fri Apr 16 13:28:33 2010
From: chapman at cs.wisc.edu (Mark Chapman)
Date: Fri, 16 Apr 2010 12:28:33 -0500
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava alignment lead"]
In-Reply-To:
References: <4BBD820D.9070200@uni-tuebingen.de>
Message-ID: <4BC89E41.4030009@cs.wisc.edu>

A great place to start finding ideas is the wiki.
Both http://biojava.org/wiki/BioJava:Modules and http://biojava.org/wiki/BioJava3_Proposal list the next steps planned/desired for BioJava. What research area did you have in mind? Have fun, Mark On 4/16/2010 8:57 AM, jitesh dundas wrote: > Dear Sir, > > I am very interested in contributing to this project. > > I am looking for a good problem,more on the research side. I can also > help in coding (I also work as a software > engineer-j2ee/eclipse/jboss/tomcat .. > > Anything that I could work on... > > Regards, > Jitesh Dundas > > On 4/8/10, Andreas Dr?ger wrote: >> Hi all, >> >> This e-mail is just for your information about somebody new, who'd like >> to contribute to our project. >> >> Cheers >> Andreas >> >> >> Subject: >> Re: Fwd: Proposing a project on "Biojava alignment lead" >> From: >> Andreas Dr?ger >> Date: >> Wed, 07 Apr 2010 09:27:13 +0200 >> To: >> Cai Shaojiang >> >> Hi Cai Shaojiang, >> >> Thank you for you e-mail! I don't know what happened to the e-mail list. >> Sometimes it takes a while due to the spam filters, I guess. >> >> > I am a PhD student from National University of Singapore. My major >> research area is local alignment algorithms and data structures for SNP >> identification. And I have used Java and Eclipse for years for software >> development. I am very interested in your GSoC programme. I find that >> there is a module called "biojava-alignment lead" whose mentor is you. I >> want to propose a new project on this module. I have several questions >> about this module. >> >> Yes, that's me. So great to get your support. >> >> > 1. It seems that pairwise alignment is to find similarity between two >> short sequences. Existing pairwise alignment is based on dynamic >> programming, is it Smith-Waterman algorithm? >> >> So, currently, BioJava contains three different alignment approaches. >> There are two deterministic algorithms, i.e., Smith-Waterman for local >> alignment and Needleman-Wunsch for global alignment. 
Third, there is the >> possibility to apply Hidden Markov Models for alignment. An example of >> the latter approach should be in the cookbook. >> >> > 2. What is the exact task of "refactoring of underlying data structures"? >> >> Yes, this is something, I did last week already but it could still be >> improved. The problem was that the alignment algorithms actually >> produced a kind of string that looks similar to the output of BLAST. >> This string contained the score, the computation time, the length of the >> alignment etc. The problem was that people wanted to perform >> higher-level computation on the score value or evaluate some other >> information. Now, the alignment will produce a data structure that >> contains all the information and can, in addition to that, also produce >> such a BLAST-like output. There is, however, still the following >> problem: The data structure requires both sequences in the pair-wise >> alignment to have an identical length. In case of local alignment this >> is especially stupid (actually), because gaps are inserted to fill the >> sequences. And then the data structure tries to keep the old sequence >> coordinates, leading to the effect that the numbers "query start", >> "query end", "subject start", and "subject end" are required to shift >> the sequences against each other when displaying the output. So, you >> cannot easily print the sequences below of each other, you first have to >> shift them. Please check out the latest version of this package via >> anonymeous svn and have a look ;-) >> >> > 3. My existing research area is aiming to deal with aligning short >> read (10s~100s bp) against extremely long sequences (e.g., human >> genome). Af far as I know, there is not existing such alignment tools >> implemented in Java. Would you consider this direction? >> >> See, this would be very nice to include. 
But this requires that we no >> longer fill the short sequence with many, many gap symbols (just a waist >> of memory), but improve the data structure. There is already an >> UnequalLenghtAlignment (just a data structure, no algorithm) and I think >> we could use this as a starting point. Then your algorithm should only >> produce such a data structure and this would be fine. >> >> > 4. It seems that the existing tools is just lacking of some >> refactoring and representation interfaces. Any more underlying tasks? >> >> Hm. Yes: With the release of BioJava 3 data structures have changed >> again. So maybe there's also some adaptation to the new structure required. >> >> > I am keeping an eye on GSoC from last month, but sorry to find out >> that I sent the initial email to the mailing list before I subscribe it... >> >> Ok. Sounds good. Thanks for your interest. So I suggest: Download the >> latest trunk, have a look, play around and if you can improve something >> we'll put it into the trunk and write your name into the authors' tag. >> >> Cheers >> Andreas >> >> -- >> Dipl.-Bioinform. 
Andreas Dr?ger >> Eberhard Karls University T?bingen >> Center for Bioinformatics (ZBIT) >> Sand 1 >> 72076 T?bingen >> Germany >> >> Phone: +49-7071-29-70436 >> Fax: +49-7071-29-5091 >> _______________________________________________ >> Biojava-l mailing list - Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l >> > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l From sheoran143 at gmail.com Fri Apr 16 14:43:59 2010 From: sheoran143 at gmail.com (Deepak Sheoran) Date: Fri, 16 Apr 2010 13:43:59 -0500 Subject: [Biojava-l] Issue with SimpleNCBITaxon class In-Reply-To: References: <4BC2200D.8000109@gmail.com> <4BC23A46.7090304@gmail.com> Message-ID: <4BC8AFEF.70107@gmail.com> What my experience says on this issue we should make use of taxon_id because its a unique key in a local instance of biosql. ncbi_taxon_id should only be used for mapping purpose only so that a person can map his local taxon_id to a ncbi_taxon_id otherwise it defeat the sole purpose of having taxon_id as primary key in taxon table. The main goal which I think when biosql is designed is to make it independent of any other organization like genbank or NCBI but its a feature so that we can map a number(ncbi_taxon_id) given by a know authority to a local number (taxon_id). Deepak Sheoran On 4/15/2010 12:54 PM, Peter wrote: > Hi, > > I've CC'd this to the BioSQL mailing list for cross project > discussion. > > On Mon, Apr 12, 2010 at 7:57 AM, Richard Holland wrote: > >> Thanks Deepak. >> >> I've had a look at the code and I believe its due to the >> different ways in which BioJava and BioPerl load the >> taxon table. >> >> BioJava sets the ncbi_taxon_id and parent_taxon_id >> columns based on the values from the NCBI taxonomy >> file. The taxon_id column in BioJava is a meaningless >> auto-generated value that is never used. 
>> >> BioPerl however is generating taxon_id values and >> linking them by setting parent_taxon_id to the >> generated value. The parent value from the NCBI >> taxonomy file is therefore replaced with the BioPerl >> generated parent ID, meaning that instead of linking >> from parent_taxon_id to ncbi_taxon_id as per BioJava, >> the link is to taxon_id instead. (I'm basing this >> comment on looking at load_ncbi_taxonomy.pl from >> the BioSQL archives.) >> > Note that old versions of load_ncbi_taxonomy.pl > (which is part of BioSQL, not part of BioPerl) would > set taxon_id equal to ncbi_taxon_id, see: > http://bugzilla.open-bio.org/show_bug.cgi?id=2470 > > This may help explain the confusion. > > >> I believe if you load the taxonomy table using BioJava, >> you should see BioJava giving correct behaviour. >> Likewise if you load it using BioPerl, BioPerl will >> behave correctly. But if you load with one then query >> with the other, you'll get incorrect results. >> >> This sounds like a case for discussion on both lists - >> a matter of standardisation between the two projects. >> Not quickly/easily solvable for now. >> > Its not just two projects (BioPerl& BioJava) (grin). > Its at least five projects (BioSQL itself plus BioRuby > and Biopython). > > I'm not sure about BioRuby's implementation, but > currently I think BioJava is the odd one out - BioPerl, > Biopython, and the BioSQL's load_ncbi_taxonomy.pl > all make entries in parent_taxon_id reference the > automatically generated taxon_id (please correct > me if I am wrong). > > My personal view is that bioperl-db is the reference > implementation and should be followed in the event > of any ambiguity within BioSQL. In this particular > case, there is actually a BioSQL script to check > against too (load_ncbi_taxonomy.pl). > > Hopefully Hilmar can give us an official verdict... 
> > Peter > From jbdundas at gmail.com Fri Apr 16 22:20:12 2010 From: jbdundas at gmail.com (jitesh dundas) Date: Sat, 17 Apr 2010 07:50:12 +0530 Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava alignment lead"] In-Reply-To: <4BC89E41.4030009@cs.wisc.edu> References: <4BBD820D.9070200@uni-tuebingen.de> <4BC89E41.4030009@cs.wisc.edu> Message-ID: Hi Everyone, I went throug the URLs sent by Dr Chapman. Interesting work that you are doing here.:)... I was wondering if there is anyone who could consider on these. I would like to also be a part of the research work being carried out using Biojava( especially in sequence alignment, miRNA signature Analysis (especially for cancers)...) 1) A set of tools for converting flat data (e.g. sequence strings, taxononmy strings) into BioJava-like objects (e.g. SymbolLists, NCBITaxon). These BioJava-like objects could then be used for more advanced applications. A set of tools for manipulating the BioJava-like objects. 2) Module?: biojava-ws-blast Module?: biojava-ws-biolit Proposed Module: biojava-j2ee Lead: Mark Schreiber - This would probably take the form of SessionBeans and WebServices that can be deployed to Glassfish/ JBoss etc to provide biological services for people who want to make client server or SOA apps. 3) I also liked what Mr. Gang Wu is working on(I read the discussions). I was wondering if I could do something of that sort... May I request the leads to tell me how I could chip in... Regards, Jitesh Dundas On 4/16/10, Mark Chapman wrote: > A great place to start finding ideas is the wiki. > Both http://biojava.org/wiki/BioJava:Modules > and http://biojava.org/wiki/BioJava3_Proposal > list the next steps planned/desired for BioJava. > > What research area did you have in mind? > > Have fun, > Mark > > > On 4/16/2010 8:57 AM, jitesh dundas wrote: >> Dear Sir, >> >> I am very interested in contributing to this project. >> >> I am looking for a good problem,more on the research side. 
I can also >> help in coding (I also work as a software >> engineer-j2ee/eclipse/jboss/tomcat .. >> >> Anything that I could work on... >> >> Regards, >> Jitesh Dundas >> >> On 4/8/10, Andreas Dr?ger wrote: >>> Hi all, >>> >>> This e-mail is just for your information about somebody new, who'd like >>> to contribute to our project. >>> >>> Cheers >>> Andreas >>> >>> >>> Subject: >>> Re: Fwd: Proposing a project on "Biojava alignment lead" >>> From: >>> Andreas Dr?ger >>> Date: >>> Wed, 07 Apr 2010 09:27:13 +0200 >>> To: >>> Cai Shaojiang >>> >>> Hi Cai Shaojiang, >>> >>> Thank you for you e-mail! I don't know what happened to the e-mail list. >>> Sometimes it takes a while due to the spam filters, I guess. >>> >>> > I am a PhD student from National University of Singapore. My major >>> research area is local alignment algorithms and data structures for SNP >>> identification. And I have used Java and Eclipse for years for software >>> development. I am very interested in your GSoC programme. I find that >>> there is a module called "biojava-alignment lead" whose mentor is you. I >>> want to propose a new project on this module. I have several questions >>> about this module. >>> >>> Yes, that's me. So great to get your support. >>> >>> > 1. It seems that pairwise alignment is to find similarity between >>> two >>> short sequences. Existing pairwise alignment is based on dynamic >>> programming, is it Smith-Waterman algorithm? >>> >>> So, currently, BioJava contains three different alignment approaches. >>> There are two deterministic algorithms, i.e., Smith-Waterman for local >>> alignment and Needleman-Wunsch for global alignment. Third, there is the >>> possibility to apply Hidden Markov Models for alignment. An example of >>> the latter approach should be in the cookbook. >>> >>> > 2. What is the exact task of "refactoring of underlying data >>> structures"? >>> >>> Yes, this is something, I did last week already but it could still be >>> improved. 
The problem was that the alignment algorithms actually >>> produced a kind of string that looks similar to the output of BLAST. >>> This string contained the score, the computation time, the length of the >>> alignment etc. The problem was that people wanted to perform >>> higher-level computation on the score value or evaluate some other >>> information. Now, the alignment will produce a data structure that >>> contains all the information and can, in addition to that, also produce >>> such a BLAST-like output. There is, however, still the following >>> problem: The data structure requires both sequences in the pair-wise >>> alignment to have an identical length. In case of local alignment this >>> is especially stupid (actually), because gaps are inserted to fill the >>> sequences. And then the data structure tries to keep the old sequence >>> coordinates, leading to the effect that the numbers "query start", >>> "query end", "subject start", and "subject end" are required to shift >>> the sequences against each other when displaying the output. So, you >>> cannot easily print the sequences below of each other, you first have to >>> shift them. Please check out the latest version of this package via >>> anonymeous svn and have a look ;-) >>> >>> > 3. My existing research area is aiming to deal with aligning short >>> read (10s~100s bp) against extremely long sequences (e.g., human >>> genome). Af far as I know, there is not existing such alignment tools >>> implemented in Java. Would you consider this direction? >>> >>> See, this would be very nice to include. But this requires that we no >>> longer fill the short sequence with many, many gap symbols (just a waist >>> of memory), but improve the data structure. There is already an >>> UnequalLenghtAlignment (just a data structure, no algorithm) and I think >>> we could use this as a starting point. Then your algorithm should only >>> produce such a data structure and this would be fine. >>> >>> > 4. 
It seems that the existing tools is just lacking of some >>> refactoring and representation interfaces. Any more underlying tasks? >>> >>> Hm. Yes: With the release of BioJava 3 data structures have changed >>> again. So maybe there's also some adaptation to the new structure >>> required. >>> >>> > I am keeping an eye on GSoC from last month, but sorry to find out >>> that I sent the initial email to the mailing list before I subscribe >>> it... >>> >>> Ok. Sounds good. Thanks for your interest. So I suggest: Download the >>> latest trunk, have a look, play around and if you can improve something >>> we'll put it into the trunk and write your name into the authors' tag. >>> >>> Cheers >>> Andreas >>> >>> -- >>> Dipl.-Bioinform. Andreas Dr?ger >>> Eberhard Karls University T?bingen >>> Center for Bioinformatics (ZBIT) >>> Sand 1 >>> 72076 T?bingen >>> Germany >>> >>> Phone: +49-7071-29-70436 >>> Fax: +49-7071-29-5091 >>> _______________________________________________ >>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>> >> >> _______________________________________________ >> Biojava-l mailing list - Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l > From jbdundas at gmail.com Fri Apr 16 22:31:46 2010 From: jbdundas at gmail.com (jitesh dundas) Date: Sat, 17 Apr 2010 08:01:46 +0530 Subject: [Biojava-l] Analytical Tool- Prediction of Unknown Protein's location on an a Predicted pathway Message-ID: Dear All, I wanted to propose an analytical tool in BioJava. For e.g.) if we have a large datasets with complete pathway information and the related information(e.g. p53 pathway will have all the genes,proteins,miRNA s involved,etc ) mentioned, could we find the location of a specific unknown (and just predicted protein) protein/gene on a predicted pathway. 
This was a suggestion on the possible t ings on the analytical side that we could do.Could we think of doing something of this sort for BioJava (or atleast make it capable to handle such aspects) Any ideas / comments are most welcome... Regards, Jitesh Dundas On 4/17/10, jitesh dundas wrote: > Hi Everyone, > > I went throug the URLs sent by Dr Chapman. Interesting work that you > are doing here.:)... > > I was wondering if there is anyone who could consider on these. I > would like to also be a part of the research work being carried out > using Biojava( especially in sequence alignment, miRNA signature > Analysis (especially for cancers)...) > > 1) A set of tools for converting flat data (e.g. sequence strings, > taxononmy strings) into BioJava-like objects (e.g. SymbolLists, > NCBITaxon). These BioJava-like objects could then be used for more > advanced applications. > A set of tools for manipulating the BioJava-like objects. > > 2) Module?: biojava-ws-blast Module?: biojava-ws-biolit > Proposed Module: biojava-j2ee Lead: Mark Schreiber > > - This would probably take the form of SessionBeans and WebServices > that can be deployed to Glassfish/ JBoss etc to provide biological > services for people who want to make client server or SOA apps. > > 3) I also liked what Mr. Gang Wu is working on(I read the > discussions). I was wondering if I could > do something of that sort... > > May I request the leads to tell me how I could chip in... > > Regards, > Jitesh Dundas > > > > On 4/16/10, Mark Chapman wrote: >> A great place to start finding ideas is the wiki. >> Both http://biojava.org/wiki/BioJava:Modules >> and http://biojava.org/wiki/BioJava3_Proposal >> list the next steps planned/desired for BioJava. >> >> What research area did you have in mind? >> >> Have fun, >> Mark >> >> >> On 4/16/2010 8:57 AM, jitesh dundas wrote: >>> Dear Sir, >>> >>> I am very interested in contributing to this project. >>> >>> I am looking for a good problem,more on the research side. 
I can also >>> help in coding (I also work as a software >>> engineer-j2ee/eclipse/jboss/tomcat .. >>> >>> Anything that I could work on... >>> >>> Regards, >>> Jitesh Dundas >>> >>> On 4/8/10, Andreas Dr?ger wrote: >>>> Hi all, >>>> >>>> This e-mail is just for your information about somebody new, who'd like >>>> to contribute to our project. >>>> >>>> Cheers >>>> Andreas >>>> >>>> >>>> Subject: >>>> Re: Fwd: Proposing a project on "Biojava alignment lead" >>>> From: >>>> Andreas Dr?ger >>>> Date: >>>> Wed, 07 Apr 2010 09:27:13 +0200 >>>> To: >>>> Cai Shaojiang >>>> >>>> Hi Cai Shaojiang, >>>> >>>> Thank you for you e-mail! I don't know what happened to the e-mail >>>> list. >>>> Sometimes it takes a while due to the spam filters, I guess. >>>> >>>> > I am a PhD student from National University of Singapore. My major >>>> research area is local alignment algorithms and data structures for SNP >>>> identification. And I have used Java and Eclipse for years for software >>>> development. I am very interested in your GSoC programme. I find that >>>> there is a module called "biojava-alignment lead" whose mentor is you. >>>> I >>>> want to propose a new project on this module. I have several questions >>>> about this module. >>>> >>>> Yes, that's me. So great to get your support. >>>> >>>> > 1. It seems that pairwise alignment is to find similarity between >>>> two >>>> short sequences. Existing pairwise alignment is based on dynamic >>>> programming, is it Smith-Waterman algorithm? >>>> >>>> So, currently, BioJava contains three different alignment approaches. >>>> There are two deterministic algorithms, i.e., Smith-Waterman for local >>>> alignment and Needleman-Wunsch for global alignment. Third, there is >>>> the >>>> possibility to apply Hidden Markov Models for alignment. An example of >>>> the latter approach should be in the cookbook. >>>> >>>> > 2. What is the exact task of "refactoring of underlying data >>>> structures"? 
>>>> >>>> Yes, this is something, I did last week already but it could still be >>>> improved. The problem was that the alignment algorithms actually >>>> produced a kind of string that looks similar to the output of BLAST. >>>> This string contained the score, the computation time, the length of >>>> the >>>> alignment etc. The problem was that people wanted to perform >>>> higher-level computation on the score value or evaluate some other >>>> information. Now, the alignment will produce a data structure that >>>> contains all the information and can, in addition to that, also produce >>>> such a BLAST-like output. There is, however, still the following >>>> problem: The data structure requires both sequences in the pair-wise >>>> alignment to have an identical length. In case of local alignment this >>>> is especially stupid (actually), because gaps are inserted to fill the >>>> sequences. And then the data structure tries to keep the old sequence >>>> coordinates, leading to the effect that the numbers "query start", >>>> "query end", "subject start", and "subject end" are required to shift >>>> the sequences against each other when displaying the output. So, you >>>> cannot easily print the sequences below of each other, you first have >>>> to >>>> shift them. Please check out the latest version of this package via >>>> anonymeous svn and have a look ;-) >>>> >>>> > 3. My existing research area is aiming to deal with aligning short >>>> read (10s~100s bp) against extremely long sequences (e.g., human >>>> genome). Af far as I know, there is not existing such alignment tools >>>> implemented in Java. Would you consider this direction? >>>> >>>> See, this would be very nice to include. But this requires that we no >>>> longer fill the short sequence with many, many gap symbols (just a >>>> waist >>>> of memory), but improve the data structure. 
From jbdundas at gmail.com Sat Apr 17 09:34:20 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Sat, 17 Apr 2010 19:04:20 +0530
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava alignment lead"]
In-Reply-To: <4BBD820D.9070200@uni-tuebingen.de>
References: <4BBD820D.9070200@uni-tuebingen.de>
Message-ID:

Dear Sir,

Could anyone tell me where I could start? Is there any lead who might need
my help with software development and research-oriented aspects?

Any comments on my previous emails would be most welcome.

Regards,
Jitesh Dundas

On 4/8/10, Andreas Dräger wrote:
>
> Hi all,
>
> This e-mail is just for your information about somebody new who'd like to
> contribute to our project.
>
> Cheers
> Andreas
>
>
> Subject: Re: Fwd: Proposing a project on "Biojava alignment lead"
> From: Andreas Dräger
> Date: Wed, 07 Apr 2010 09:27:13 +0200
> To: Cai Shaojiang
>
> Hi Cai Shaojiang,
>
> Thank you for your e-mail! I don't know what happened to the e-mail list.
> Sometimes it takes a while due to the spam filters, I guess.
>
> > I am a PhD student from National University of Singapore. My major
> > research area is local alignment algorithms and data structures for SNP
> > identification. I have used Java and Eclipse for years for software
> > development. I am very interested in your GSoC programme. I find that
> > there is a module called "biojava-alignment lead" whose mentor is you.
> > I want to propose a new project on this module. I have several
> > questions about this module.
>
> Yes, that's me. So great to get your support.
>
> > 1. It seems that pairwise alignment is to find similarity between two
> > short sequences. Existing pairwise alignment is based on dynamic
> > programming; is it the Smith-Waterman algorithm?
>
> So, currently, BioJava contains three different alignment approaches.
> There are two deterministic algorithms, i.e., Smith-Waterman for local
> alignment and Needleman-Wunsch for global alignment. Third, there is the
> possibility to apply Hidden Markov Models for alignment. An example of the
> latter approach should be in the cookbook.
>
> > 2. What is the exact task of "refactoring of underlying data structures"?
>
> Yes, this is something I did last week already, but it could still be
> improved. The problem was that the alignment algorithms actually produced a
> kind of string that looks similar to the output of BLAST. This string
> contained the score, the computation time, the length of the alignment etc.
> The problem was that people wanted to perform higher-level computation on
> the score value or evaluate some other information. Now, the alignment will
> produce a data structure that contains all the information and can, in
> addition to that, also produce such a BLAST-like output. There is, however,
> still the following problem: The data structure requires both sequences in
> the pair-wise alignment to have an identical length. In case of local
> alignment this is especially stupid (actually), because gaps are inserted to
> fill the sequences. And then the data structure tries to keep the old
> sequence coordinates, leading to the effect that the numbers "query start",
> "query end", "subject start", and "subject end" are required to shift the
> sequences against each other when displaying the output. So, you cannot
> easily print the sequences below each other; you first have to shift
> them. Please check out the latest version of this package via anonymous svn
> and have a look ;-)
>
> > 3. My existing research area is aiming to deal with aligning short reads
> > (10s~100s bp) against extremely long sequences (e.g., human genome). As
> > far as I know, there is no such alignment tool implemented in Java.
> > Would you consider this direction?
>
> See, this would be very nice to include. But this requires that we no
> longer fill the short sequence with many, many gap symbols (just a waste of
> memory), but improve the data structure. There is already an
> UnequalLengthAlignment (just a data structure, no algorithm) and I think we
> could use this as a starting point. Then your algorithm should only produce
> such a data structure and this would be fine.
>
> > 4. It seems that the existing tools are just lacking some refactoring
> > and representation interfaces. Any more underlying tasks?
>
> Hm. Yes: With the release of BioJava 3 data structures have changed again.
> So maybe there's also some adaptation to the new structure required.
>
> > I have been keeping an eye on GSoC since last month, but I am sorry to
> > find out that I sent the initial email to the mailing list before I
> > subscribed to it...
>
> Ok. Sounds good. Thanks for your interest. So I suggest: Download the
> latest trunk, have a look, play around and if you can improve something
> we'll put it into the trunk and write your name into the authors' tag.
>
> Cheers
> Andreas
>
> --
> Dipl.-Bioinform. Andreas Dräger
> Eberhard Karls University Tübingen
> Center for Bioinformatics (ZBIT)
> Sand 1
> 72076 Tübingen
> Germany
>
> Phone: +49-7071-29-70436
> Fax: +49-7071-29-5091
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From caishaojiang at gmail.com Sun Apr 18 23:16:39 2010
From: caishaojiang at gmail.com (Cai Shaojiang)
Date: Sun, 18 Apr 2010 20:16:39 -0700
Subject: [Biojava-l] [Fwd: Re: GSoC project on MSA]
In-Reply-To: <4BC84CD5.7030703@uni-tuebingen.de>
References: <4BBC80A8.5000608@uni-tuebingen.de> <4BBDCFD2.3000507@uni-tuebingen.de> <4BC84CD5.7030703@uni-tuebingen.de>
Message-ID:

Sorry to disturb you again. When I tried to modify my proposal in GSoC, I got
the error "This page is inactive at this time." So we cannot modify the
proposal now? Could you help me?

Thanks.
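A side note on the data-structure problem described in the quoted reply above (padding both sequences to identical length versus keeping "query start", "query end", "subject start" and "subject end"): the following is only a minimal plain-Java sketch of the second idea, where a local alignment stores its own coordinates and just the aligned fragment. The class `LocalAlignmentView` and all of its members are hypothetical names chosen for illustration; they are not part of BioJava and are not meant to describe `UnequalLengthAlignment`.

```java
/** Hypothetical holder for one local pairwise alignment; not a BioJava class. */
final class LocalAlignmentView {
    final String alignedQuery;   // gapped query fragment, '-' marks a gap
    final String alignedSubject; // gapped subject fragment, same length as the query row
    final int queryStart;        // 1-based start position in the ungapped query
    final int subjectStart;      // 1-based start position in the ungapped subject

    LocalAlignmentView(String alignedQuery, int queryStart,
                       String alignedSubject, int subjectStart) {
        if (alignedQuery.length() != alignedSubject.length()) {
            throw new IllegalArgumentException("aligned rows must have equal length");
        }
        this.alignedQuery = alignedQuery;
        this.queryStart = queryStart;
        this.alignedSubject = alignedSubject;
        this.subjectStart = subjectStart;
    }

    /** 1-based end coordinate: start plus the number of non-gap symbols, minus one. */
    static int end(String alignedRow, int start) {
        int residues = 0;
        for (int i = 0; i < alignedRow.length(); i++) {
            if (alignedRow.charAt(i) != '-') {
                residues++;
            }
        }
        return start + residues - 1;
    }

    int queryEnd() {
        return end(alignedQuery, queryStart);
    }

    int subjectEnd() {
        return end(alignedSubject, subjectStart);
    }

    /** Renders both rows below each other with their coordinates, BLAST-style. */
    String format() {
        return String.format("Query  %6d %s %d%nSbjct  %6d %s %d%n",
                queryStart, alignedQuery, queryEnd(),
                subjectStart, alignedSubject, subjectEnd());
    }

    public static void main(String[] args) {
        // A 6-column fragment; the subject may be arbitrarily long, but only
        // the aligned part and its offsets are stored.
        LocalAlignmentView a = new LocalAlignmentView("ACG-TA", 3, "ACGTTA", 1041);
        System.out.print(a.format());
    }
}
```

Because only the aligned fragment is kept, memory use depends on the alignment length rather than on the length of the longer sequence, which is the property the short-read use case in question 3 would need.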
From andreas at sdsc.edu Sun Apr 18 23:58:05 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Sun, 18 Apr 2010 20:58:05 -0700
Subject: [Biojava-l] Fwd: Biojava3-genetics
In-Reply-To: <33AFFE3255BCA043AF09514A6F6BFBAED04C3C@scomp0039.wurnet.nl>
References: <4BC806F4.3090302@wur.nl> <33AFFE3255BCA043AF09514A6F6BFBAED04C3C@scomp0039.wurnet.nl>
Message-ID:

Hi Richard,

I am forwarding your message to the mailing list, since that is the best
place to meet other people interested in genetics applications.

The BioJava source code is available via anonymous svn or the download page
on the wiki.

Andreas

---------- Forwarded message ----------
From: Finkers, Richard
Date: Sat, Apr 17, 2010 at 12:46 AM
Subject: RE: Biojava3-genetics
To: Andreas Prlic

Hi Andreas,

To start with, associations with e.g. sequence variation (454) and phenotype
data within larger sets of genetically different individuals. This will be
code which I will have to write in the coming year for one of my projects.
I am planning to use this in combination with the sequence and phylogeny
based BioJava modules.

I also might consider migrating some of my current code to this module. This
includes graphical representations of genetic data but also some statistical
analysis, for which we use the package R for the calculations while the rest
of the data handling / formatting is done in Java.

Some of the functionality that I am thinking about is available from other
packages, but I did not find the (Java) source code.

Richard

-----Original Message-----
From: andreas.prlic at gmail.com on behalf of Andreas Prlic
Sent: Fri 2010-04-16 19:39
To: Finkers, Richard
Cc: biojava-dev at lists.open-bio.org
Subject: Re: Biojava3-genetics

Hi Richard,

any contribution is welcome. What do you have in mind in particular? Perhaps
there is already something there along those lines...

Andreas

On Thu, Apr 15, 2010 at 11:43 PM, Richard Finkers wrote:

> Dear List,
>
> I would be interested in adding a module for genetic analysis to the
> biojava3 project. Are there others who are interested in this as well and
> with whom should I discuss this further?
>
> Thanks,
> Richard
>
> --
> Dr. Richard Finkers
> Researcher Plant Breeding
> Wageningen UR Plant Breeding
> P.O. Box 16, 6700 AA, Wageningen, The Netherlands
> Wageningen Campus, Building 107, Droevendaalsesteeg 1, 6708 PB
> Wageningen, The Netherlands
> Tel. +31-317-484165 Fax +31-317-418094
> http://www.plantbreeding.wur.nl/
> https://www.eu-sol.wur.nl/
> https://cbsgdbase.wur.nl/
> http://solgenomics.wur.nl/
> http://www.disclaimer-uk.wur.nl/

--
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From andreas at sdsc.edu Mon Apr 19 00:14:24 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Sun, 18 Apr 2010 21:14:24 -0700
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava alignment lead"]
In-Reply-To:
References: <4BBD820D.9070200@uni-tuebingen.de>
Message-ID:

Hi Jitesh,

BioJava is an open source project with the goal to support bioinformatics
applications. While we are always happy about any contribution, be it
documentation, bug fixes or email support on the mailing list, for a
research-related project it is probably easier to team up with your local
university and do an internship there.

Andreas

--
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From jbdundas at gmail.com Mon Apr 19 04:33:57 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Mon, 19 Apr 2010 14:03:57 +0530
Subject: [Biojava-l] Fwd: Biojava3-genetics
In-Reply-To:
References: <4BC806F4.3090302@wur.nl> <33AFFE3255BCA043AF09514A6F6BFBAED04C3C@scomp0039.wurnet.nl>
Message-ID:

Dear Sir,

I would like to work on this module. How can I help?

Regards,
Jitesh Dundas

From andreas.draeger at uni-tuebingen.de Tue Apr 20 23:17:05 2010
From: andreas.draeger at uni-tuebingen.de (Andreas Dräger)
Date: Wed, 21 Apr 2010 12:17:05 +0900
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava alignment lead"]
In-Reply-To:
References: <4BBD820D.9070200@uni-tuebingen.de>
Message-ID: <4BCE6E31.70504@uni-tuebingen.de>

Hi Jitesh,

Thanks for your interest in contributing to our BioJava project! In the
alignment package, lots of help is required. What would be very nice is a
versatile visual representation of the alignment data structures that can
be included into graphical user interfaces with little effort. To this end,
it should be very flexible and abstract. Would you be interested?

Cheers
Andreas

--
Dipl.-Bioinform. Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 Tübingen
Germany

Phone: +49-7071-29-70436
Fax: +49-7071-29-5091

From mitlox at op.pl Wed Apr 21 06:46:22 2010
From: mitlox at op.pl (xyz)
Date: Wed, 21 Apr 2010 20:46:22 +1000
Subject: [Biojava-l] Reading and writting Fastq files
In-Reply-To: <5EBA99CE-1DAC-442A-B7FD-2E738C7F586B@eaglegenomics.com>
References: <20100330215047.084f6b00@wp01> <20100408213013.63a99b8c@wp01> <5EBA99CE-1DAC-442A-B7FD-2E738C7F586B@eaglegenomics.com>
Message-ID: <20100421204622.68f9ac1b@wp01>

On Thu, 8 Apr 2010 12:36:36 +0100
Richard Holland wrote:

> You haven't included the two import static lines in your code. See
> first two lines of Michael's example code (expanding the ellipses to
> the full classpath).
>

Thank you, it was enough to include

    import static org.biojavax.bio.seq.RichSequence.Tools.createRichSequence;

Usually NetBeans solves this kind of problem for me, but this time there was
no help from the IDE.

From mitlox at op.pl Wed Apr 21 07:18:24 2010
From: mitlox at op.pl (xyz)
Date: Wed, 21 Apr 2010 21:18:24 +1000
Subject: [Biojava-l] readFasta problem
In-Reply-To: <5B252A84-6822-4978-BBD1-83A8BA09EBC5@eaglegenomics.com>
References: <20100408213052.662beb8e@wp01> <5B252A84-6822-4978-BBD1-83A8BA09EBC5@eaglegenomics.com>
Message-ID: <20100421211824.75b7ada2@wp01>

On Thu, 8 Apr 2010 12:41:25 +0100
Richard Holland wrote:

> You have passed null into the tokenizer parameter of
> RichSequence.IOTools.readFasta() - this is not allowed. The parser
> cannot guess the type of sequence, it must be told what to expect by
> specifying the tokenizer to use. (Importantly this also means that
> you cannot mix different types of sequence within the same file to be
> parsed.)
>

Thank you.

Q1:
Does RichSequenceIterator read the complete file into memory and then I
retrieve each read from memory? Or does it read the file line by line as I
get each read?

Q2:
Why am I not able to retrieve the header from the following fasta file:

    >1
    atccccc
    >2
    atccccctttttt
    >3
    atccccccccccccccccctttt
    >4
    tttttttccccccccccccccccccccccc
    >5
    tttttttcccccccccccccccccccccca

with the following code:

    import java.io.BufferedReader;
    import java.io.FileNotFoundException;
    import java.io.FileReader;
    import org.biojava.bio.BioException;
    import org.biojava.bio.seq.io.SymbolTokenization;
    import org.biojava.bio.symbol.AlphabetManager;
    import org.biojavax.bio.seq.RichSequence;
    import org.biojavax.bio.seq.RichSequenceIterator;

    public class SortFasta {

        public static void main(String[] args) throws FileNotFoundException,
                BioException {

            BufferedReader br = new BufferedReader(
                    new FileReader("sortFasta.fasta"));
            String type = "DNA";
            SymbolTokenization toke = AlphabetManager.alphabetForName(type)
                    .getTokenization("token");

            RichSequenceIterator rsi =
                    RichSequence.IOTools.readFasta(br, toke, null);

            while (rsi.hasNext()) {
                RichSequence rs = rsi.nextRichSequence();
                System.out.println(rs.getDescription());
                System.out.println(rs.seqString());
            }
        }
    }

What did I do wrong in trying to retrieve the header?

From holland at eaglegenomics.com Wed Apr 21 07:29:57 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 21 Apr 2010 12:29:57 +0100
Subject: [Biojava-l] readFasta problem
In-Reply-To: <20100421211824.75b7ada2@wp01>
References: <20100408213052.662beb8e@wp01> <5B252A84-6822-4978-BBD1-83A8BA09EBC5@eaglegenomics.com> <20100421211824.75b7ada2@wp01>
Message-ID:

On 21 Apr 2010, at 12:18, xyz wrote:

> Q1:
> Does RichSequenceIterator read the complete file into memory and then I
> retrieve each read from memory? Or does it read the file line by line as
> I get each read?

Line by line.

> Q2:
> Why am I not able to retrieve the header from the following fasta file:
> [...]
> What did I do wrong in trying to retrieve the header?

Try the other methods on RichSequence - getName() for instance.

cheers,
Richard

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From mitlox at op.pl Wed Apr 21 08:40:48 2010
From: mitlox at op.pl (xyz)
Date: Wed, 21 Apr 2010 22:40:48 +1000
Subject: [Biojava-l] NCBI Accession Number prefixes
Message-ID: <20100421224048.1848c2f2@wp01>

Hello,
is it possible to download GenBank entries (AC) with BioJava?

Thank you in advance.

Best regards,

From holland at eaglegenomics.com Wed Apr 21 08:44:16 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 21 Apr 2010 13:44:16 +0100
Subject: [Biojava-l] NCBI Accession Number prefixes
In-Reply-To: <20100421224048.1848c2f2@wp01>
References: <20100421224048.1848c2f2@wp01>
Message-ID: <577294DB-EABD-48DF-A55A-5DA9629AC352@eaglegenomics.com>

See http://www.biojava.org/docs/api/org/biojavax/bio/db/ncbi/GenbankRichSequenceDB.html

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From jbdundas at gmail.com Wed Apr 21 09:45:00 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Wed, 21 Apr 2010 19:15:00 +0530
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava alignment lead"]
In-Reply-To: <4BCE6E31.70504@uni-tuebingen.de>
References: <4BBD820D.9070200@uni-tuebingen.de> <4BCE6E31.70504@uni-tuebingen.de>
Message-ID:

Yes Sir, I will be very interested. Please send me the details. I will be
working on weekends though, as office work is taking my time right now.

Regards,
jd

From er.indupandey at gmail.com Fri Apr 23 04:11:05 2010
From: er.indupandey at gmail.com (indu pandey)
Date: Fri, 23 Apr 2010 01:11:05 -0700
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava alignment lead"]
In-Reply-To:
References: <4BBD820D.9070200@uni-tuebingen.de> <4BCE6E31.70504@uni-tuebingen.de>
Message-ID:

Hi all,

Can anybody help me with writing code in BioJava for converting a DNA
sequence to the corresponding amino acid sequence?

Regards,
Indu

From genjasp at gmail.com Fri Apr 23 04:26:10 2010
From: genjasp at gmail.com (Alessandro Cipriani)
Date: Fri, 23 Apr 2010 10:26:10 +0200
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava alignment lead"]
In-Reply-To:
References: <4BBD820D.9070200@uni-tuebingen.de> <4BCE6E31.70504@uni-tuebingen.de>
Message-ID:

Hi,

Follow this link:
http://www.biojava.org/wiki/BioJava:CookBook#Translation
I think it could be useful.

Regards,
Ale

--
Alessandro Cipriani
(+39) 3206009509
http://www.cipriania.it
skype:genjasp at gmail.com
msn:jaspzz

From thomascramera at dnastar.com Fri Apr 23 18:58:05 2010
From: thomascramera at dnastar.com (Andy Thomas-Cramer)
Date: Fri, 23 Apr 2010 17:58:05 -0500
Subject: [Biojava-l] PDBFileParser and Atom element symbol
Message-ID:

Is there an easy way to identify the type of atom referenced by an Atom
object?

For example, if Atom.getName() is "CA", is the element calcium or the
atom carbon alpha?

If not, would it be feasible to add a method providing this in Atom,
AtomImpl, and parsing it in PDBFileParser, using the columns defined at
http://www.wwpdb.org/documentation/format32/sect9.html#ATOM?

From andreas at sdsc.edu Fri Apr 23 19:52:15 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Fri, 23 Apr 2010 16:52:15 -0700
Subject: [Biojava-l] PDBFileParser and Atom element symbol
In-Reply-To:
References:
Message-ID:

Hi Andy,

you could check with Atom.getFullName(), which contains the space characters
from the PDB file, e.g. C-alpha: " CA ", calcium: "CA  ". In addition, the
parent group of a C-alpha atom is usually an AminoAcid, and for calcium it
is a Hetatom group...

Andreas

--
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From mitlox at op.pl Sun Apr 25 01:19:25 2010
From: mitlox at op.pl (xyz)
Date: Sun, 25 Apr 2010 15:19:25 +1000
Subject: [Biojava-l] readFasta problem
In-Reply-To:
References: <20100408213052.662beb8e@wp01> <5B252A84-6822-4978-BBD1-83A8BA09EBC5@eaglegenomics.com> <20100421211824.75b7ada2@wp01>
Message-ID: <20100425151925.1c5c9a03@wp01>

On Wed, 21 Apr 2010 12:29:57 +0100
Richard Holland wrote:

> > Q1:
> > Does RichSequenceIterator read the complete file in memory and then
> > I retrieve each read from memory? Or does it read the file line by
> > line and I get each read?
>
> Line by line.

That saves memory.

> > Q2:
> > Why am I not able to retrieve the header from the following fasta
> > file:
> > >1
> > atccccc
> > >2
> > atccccctttttt
> > >3
> > atccccccccccccccccctttt
> > >4
> > tttttttccccccccccccccccccccccc
> > >5
> > tttttttcccccccccccccccccccccca
>
> Try the other methods on RichSequence - getName() for instance.

Thank you, getName() works.
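
To make the "line by line" exchange above concrete, here is a minimal plain-Java sketch of a streaming FASTA reader that keeps only one record in memory at a time and can write each record out as soon as it has been read. It deliberately uses no BioJava calls: `FastaStream`, `Record`, `next` and `write` are hypothetical names made up for this illustration, and the code is not a description of how `RichSequenceIterator` or `RichSequence.IOTools` are implemented.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

/** Hypothetical streaming FASTA helper; a sketch, not BioJava's implementation. */
final class FastaStream {

    /** One FASTA entry: the header text without the leading '>' and the sequence. */
    static final class Record {
        final String name;
        final String sequence;

        Record(String name, String sequence) {
            this.name = name;
            this.sequence = sequence;
        }
    }

    private final BufferedReader in;
    private String pendingHeader; // header line already consumed for the next record

    FastaStream(BufferedReader in) {
        this.in = in;
    }

    /** Reads the next record, or returns null once the input is exhausted. */
    Record next() throws IOException {
        String line;
        // Skip ahead to the first header line if none is pending yet.
        while (pendingHeader == null && (line = in.readLine()) != null) {
            if (line.startsWith(">")) {
                pendingHeader = line;
            }
        }
        if (pendingHeader == null) {
            return null;
        }
        String name = pendingHeader.substring(1).trim();
        pendingHeader = null;
        StringBuilder seq = new StringBuilder();
        // Collect sequence lines until the next header or end of input.
        while ((line = in.readLine()) != null) {
            if (line.startsWith(">")) {
                pendingHeader = line;
                break;
            }
            seq.append(line.trim());
        }
        return new Record(name, seq.toString());
    }

    /** Writes a single record in FASTA format. */
    static void write(Writer out, Record r) throws IOException {
        out.write(">" + r.name + "\n" + r.sequence + "\n");
    }

    public static void main(String[] args) throws IOException {
        String input = ">1\natccccc\n>2\natccccctttttt\n";
        FastaStream reader = new FastaStream(new BufferedReader(new StringReader(input)));
        StringWriter out = new StringWriter();
        for (Record r = reader.next(); r != null; r = reader.next()) {
            write(out, r); // each record is written before the next one is read
        }
        System.out.print(out);
    }
}
```

In this sketch the header of each entry ends up in `name`, which mirrors why `getName()` rather than `getDescription()` returned the `1`, `2`, ... identifiers in the example file.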

I have tried to write a fasta file line by line with IOTools, but I got the
following error:

    Exception in thread "main" java.lang.RuntimeException: Uncompilable source code 1
            at SortFasta.main(SortFasta.java:31)
    atccccc
    Java Result: 1

Here is the complete code:

    import java.io.BufferedReader;
    import java.io.FileNotFoundException;
    import java.io.FileOutputStream;
    import java.io.FileReader;
    import org.biojava.bio.BioException;
    import org.biojava.bio.seq.io.SymbolTokenization;
    import org.biojava.bio.symbol.AlphabetManager;
    import org.biojavax.bio.seq.RichSequence;
    import org.biojavax.bio.seq.RichSequenceIterator;

    public class SortFasta {

        public static void main(String[] args) throws FileNotFoundException,
                BioException {

            BufferedReader br = new BufferedReader(
                    new FileReader("sortFasta.fasta"));
            String type = "DNA";
            SymbolTokenization toke = AlphabetManager.alphabetForName(type)
                    .getTokenization("token");

            FileOutputStream outputFasta = new FileOutputStream("test.fasta");

            RichSequenceIterator rsi =
                    RichSequence.IOTools.readFasta(br, toke, null);

            while (rsi.hasNext()) {
                RichSequence rs = rsi.nextRichSequence();
                System.out.println(rs.getName());
                System.out.println(rs.seqString());
                RichSequence.IOTools.writeFasta(outputFasta, rs.seqString(),
                        null, rs.getName() + "1");
            }
        }
    }

How is it possible to write fasta files line by line?

From holland at eaglegenomics.com Sun Apr 25 04:21:22 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Sun, 25 Apr 2010 09:21:22 +0100
Subject: [Biojava-l] readFasta problem
In-Reply-To: <20100425151925.1c5c9a03@wp01>
References: <20100408213052.662beb8e@wp01> <5B252A84-6822-4978-BBD1-83A8BA09EBC5@eaglegenomics.com> <20100421211824.75b7ada2@wp01> <20100425151925.1c5c9a03@wp01>
Message-ID: <316097DC-6011-4205-83BC-9A24398D034D@eaglegenomics.com>

Hi.

You are calling a non-existent version of writeFasta. I'm surprised your
code even compiles! Have a look at the JavaDocs to find out what you can
actually do with writeFasta. For a start, it takes Sequence and FastaHeader
objects as parameters, not Strings as you are trying to do.

http://www.biojava.org/docs/api17/org/biojavax/bio/seq/RichSequence.IOTools.html

cheers,
Richard

On 25 Apr 2010, at 06:19, xyz wrote:

> On Wed, 21 Apr 2010 12:29:57 +0100
> Richard Holland wrote:
>
>>> Q1:
>>> Does RichSequenceIterator read the complete file in memory and then
>>> I retrieve each read from memory? Or does it read the file line by
>>> line and I get each read?
>>
>> Line by line.
>
> That save memory.
>
>>> Q2:
>>> Why am I not able to retrieve the header from the following fasta
>>> file:
>>> >1
>>> atccccc
>>> >2
>>> atccccctttttt
>>> >3
>>> atccccccccccccccccctttt
>>> >4
>>> tttttttccccccccccccccccccccccc
>>> >5
>>> tttttttcccccccccccccccccccccca
>>
>> Try the other methods on RichSequence - getName() for instance.
>
> Thank you getName() works.
>
> I have tried to write fasta file line by line with IOTools, but I have
> got the following error:
> Exception in thread "main" java.lang.RuntimeException: Uncompilable
> source code 1
> at SortFasta.main(SortFasta.java:31)
> atccccc
> Java Result: 1
>
> Here is the complete code:
>
> import java.io.BufferedReader;
> import java.io.FileNotFoundException;
> import java.io.FileOutputStream;
> import java.io.FileReader;
> import org.biojava.bio.BioException;
> import org.biojava.bio.seq.io.SymbolTokenization;
> import org.biojava.bio.symbol.AlphabetManager;
> import org.biojavax.bio.seq.RichSequence;
> import org.biojavax.bio.seq.RichSequenceIterator;
>
> public class SortFasta {
>
> public static void main(String[] args) throws FileNotFoundException,
> BioException {
>
>
> BufferedReader br = new BufferedReader(new
> FileReader("sortFasta.fasta")); String type = "DNA";
> SymbolTokenization toke = AlphabetManager.alphabetForName(type)
> .getTokenization("token");
>
> FileOutputStream outputFasta = new FileOutputStream("test.fasta");
>
> RichSequenceIterator rsi = RichSequence.IOTools.readFasta(br,
toke, > null); > > while (rsi.hasNext()) { > RichSequence rs = rsi.nextRichSequence(); > System.out.println(rs.getName()); > System.out.println(rs.seqString()); > > RichSequence.IOTools.writeFasta(outputFasta, rs.seqString(), null, > rs.getName() + "1"); > } > } > } > > How is it possible to write fasta files line by line? -- Richard Holland, BSc MBCS Operations and Delivery Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From andreas.draeger at uni-tuebingen.de Sun Apr 25 21:04:44 2010 From: andreas.draeger at uni-tuebingen.de (=?ISO-8859-1?Q?Andreas_Dr=E4ger?=) Date: Mon, 26 Apr 2010 10:04:44 +0900 Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava alignment lead"] In-Reply-To: References: <4BBD820D.9070200@uni-tuebingen.de> <4BCE6E31.70504@uni-tuebingen.de> Message-ID: <4BD4E6AC.8030901@uni-tuebingen.de> Dear Indu, If you have a question regarding to BioJava, please do not just reply to some previous e-mail. In this case, your question appears in the e-mail tree related to the BioJava alignment lead. However, you have a question related to working and manipulating symbols. Therefore, you should better open a new thread. Sorry for telling you that but this is necessary to keep an overview about all the e-mails. Best wishes Andreas -- Dipl.-Bioinform. 
Andreas Dräger Eberhard Karls University Tübingen Center for Bioinformatics (ZBIT) Sand 1 72076 Tübingen Germany Phone: +49-7071-29-70436 Fax: +49-7071-29-5091 From asidhu at biomap.org Mon Apr 26 02:27:30 2010 From: asidhu at biomap.org (Amandeep Sidhu) Date: Mon, 26 Apr 2010 14:27:30 +0800 Subject: [Biojava-l] CFP: 23rd IEEE International Symposium on Computer-Based Medical Systems 2010 Message-ID: IEEE CBMS 2010 23rd IEEE International Symposium on Computer-Based Medical Systems 2010 Perth, Australia, 12-15 October 2010 http://www.cbms2010.curtin.edu.au/ The 23rd IEEE International Symposium on Computer-Based Medical Systems (CBMS 2010) is intended to provide an international forum for discussing the latest results in the field of computational medicine. The scientific program of CBMS 2010 will consist of invited keynote talks given by leading scientists in the field, and regular and special track sessions that cover a broad array of issues which relate computing to medicine. RELEVANT TOPICS Network and Telemedicine Systems Medical Databases & Information Systems Computer-Aided Diagnosis Medical Devices with Embedded Computers Bioinformatics in Medicine Software Systems in Medicine Pervasive Health Systems and Services Web-based Delivery of Medical Information Medical Image Segmentation & Compression Content Analysis of Biomedical Image Data Knowledge-Based & Decision Support Systems Hand-held Computing Applications in Medicine Knowledge Discovery & Data Mining Signal and Image Processing in Medicine Multimedia Biomedical Databases CBMS 2010 invites original previously unpublished contributions that are not submitted concurrently to a journal or another conference. Many of the above listed topics are represented by corresponding Special Tracks, while others are solely covered by the general CBMS track.
Prospective authors are expected to submit their contributions to one of the corresponding Special Tracks or to the general track if none of the special tracks is relevant. SPECIAL TRACKS ST1: Computational Proteomics and Genomics ST2: Knowledge Discovery and Decision Systems in Biomedicine ST3: Ontologies for Biomedical Systems ST4: HealthGrid & Cloud Computing ST5: Technology Enhanced Learning in Medical Education ST6: Intelligent Patient Management ST7: Data Streams in Healthcare ST8: Supporting Collaboration among Healthcare Workers ST9: Telemedicine ST10: Computer-Based Systems for Mental Health ST11: Image Informatics in Biomedical Research and Clinical Medicine ST12: e-Health SUBMISSION GUIDELINES Papers should be submitted electronically using EasyChair online submission system. The papers must be prepared following the IEEE two-column format and should not exceed the length of 6 (six) Letter-sized pages. LaTeX or Microsoft Word templates can be used when preparing the papers. Please, note that only PDF format of submissions is allowed. Submission web site: http://www.easychair.org/conferences/?conf=cbms2010 All submissions will be peer-reviewed by at least three reviewers. The proceedings will be published by the IEEE Computer Society Press. At least one of the authors of accepted papers is required to register and present the work at the conference; otherwise their papers will be removed from the digital library after the conference. IMPORTANT DATES Submission deadline for regular papers: 24 June 2010 Deadline for tutorial submission: 24 June 2010 Notification of acceptation for papers and tutorials: 2 Aug 2010 Final camera ready due: 2 Sep 2010 Author registration: 2 Sep 2010 INTENDED AUDIENCE Engineers, scientists, clinicians and managers involved in medical computing projects are encouraged to submit papers to the symposium and/or attend the symposium. 
The symposium provides its attendees with an opportunity to experience state-of-the-art research and development in a variety of topics directly and indirectly related to their own work. In addition to research papers, keynote speakers and tutorial sessions it provides participants with an opportunity to come up-to-date on important technological issues. The symposium encourages the participation of students engaged in research/development in computer-based medical systems. Organizing Committee GENERAL CHAIRS Tharam Dillon, Curtin University of Technology, Australia Daniel Rubin, National Center for Biomedical Ontologies, USA William Gallagher, University College Dublin, Ireland PROGRAM CHAIRS Amandeep Sidhu, Curtin University of Technology, Australia Alexey Tsymbal, Siemens, Germany PUBLICATION CHAIRS Mykola Pechenizkiy, Eindhoven University of Technology, Netherlands Tony Hu, Drexel University, USA SPECIAL TRACK CHAIRS Maja Hadzic, Curtin University of Technology, Australia Jake Chen, Indiana University, USA TUTORIAL CHAIRS Phoebe Chen, La Trobe University, Australia Xiaofang Zhou, University of Queensland, Australia PUBLICITY CHAIRS Carolyn McGregor, University of Ontario Institute of Technology, Canada Meifania Chen, Curtin University of Technology, Australia From thomascramera at dnastar.com Mon Apr 26 10:51:23 2010 From: thomascramera at dnastar.com (Andy Thomas-Cramer) Date: Mon, 26 Apr 2010 09:51:23 -0500 Subject: [Biojava-l] PDBFileParser and Atom element symbol In-Reply-To: References: Message-ID: Thank you. I had not noticed the pattern that columns 13-14 at least sometimes contain the element symbol, whether one- or two-character. Questions: * Is this pattern documented in the PDB specification? * If this pattern can be relied on, why are columns 77-78 also dedicated to the element symbol? * Should reliance on the pattern be hidden behind a BioJava method? 
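The two places discussed above, the atom name field and the separate element field, can be read directly from a fixed-width ATOM/HETATM line. What follows is only a minimal sketch, assuming the column layout of the wwPDB format description (atom name in columns 13-16, element symbol right-justified in columns 77-78). The class and method names are hypothetical and are not part of BioJava; the sample records are built in code purely so that the columns line up.

```java
import java.util.Locale;

// Hypothetical helper, not a BioJava class: reads two fixed-width fields
// from a PDB ATOM/HETATM record.
class AtomRecordFields {

    /** Atom name exactly as it appears in columns 13-16, spaces preserved. */
    static String fullAtomName(String line) {
        return line.substring(12, 16);
    }

    /**
     * Element symbol from columns 77-78, trimmed. Returns an empty string
     * when the line is too short to contain those columns.
     */
    static String elementSymbol(String line) {
        if (line.length() < 78) {
            return "";
        }
        return line.substring(76, 78).trim();
    }

    /** Builds an 80-column record so the example does not depend on hand-counted spaces. */
    static String atomLine(String record, int serial, String name4, String resName,
                           char chain, int resSeq, double x, double y, double z,
                           String element) {
        return String.format(Locale.ROOT,
                "%-6s%5d %4s %3s %c%4d    %8.3f%8.3f%8.3f%6.2f%6.2f          %2s  ",
                record, serial, name4, resName, chain, resSeq, x, y, z, 1.0, 0.0, element);
    }

    public static void main(String[] args) {
        // Carbon alpha of an alanine: name " CA ", element "C".
        String calpha = atomLine("ATOM", 2, " CA ", "ALA", 'A', 1, 11.104, 6.134, -6.504, "C");
        // Calcium ion: name "CA  ", element "CA".
        String calcium = atomLine("HETATM", 1001, "CA  ", "CA", 'A', 201, 1.0, 2.0, 3.0, "CA");

        System.out.println("[" + fullAtomName(calpha) + "] " + elementSymbol(calpha));
        System.out.println("[" + fullAtomName(calcium) + "] " + elementSymbol(calcium));
    }
}
```

With both fields available, "CA" as a name is no longer ambiguous: the element field says carbon in the first record and calcium in the second, without relying on the spacing of the name.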
________________________________ From: andreas.prlic at gmail.com [mailto:andreas.prlic at gmail.com] On Behalf Of Andreas Prlic Sent: Friday, April 23, 2010 6:52 PM To: Andy Thomas-Cramer Cc: biojava-l at lists.open-bio.org Subject: Re: [Biojava-l] PDBFileParser and Atom element symbol Hi Andy, you could check with Atom.getFullname(), which contains the space characters from the PDB file: e.g Calpha: " CA ", Calcium "CA " in addition the parent group of a Calpha atom is usually an AminoAcid and for Calciums it is a Hetatom group... Andreas On Fri, Apr 23, 2010 at 3:58 PM, Andy Thomas-Cramer wrote: Is there an easy way to identify the type of atom referenced by an Atom object? For example, if Atom.getName() is "CA", is the element calcium or the atom carbon alpha? If not, would it be feasible to add a method providing this in Atom, AtomImpl, and parsing it in PDBFileParser, using the columns defined at http://www.wwpdb.org/documentation/format32/sect9.html#ATOM? _______________________________________________ Biojava-l mailing list - Biojava-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biojava-l -- ----------------------------------------------------------------------- Dr. Andreas Prlic Senior Scientist, RCSB PDB Protein Data Bank University of California, San Diego (+1) 858.246.0526 ----------------------------------------------------------------------- From andreas at sdsc.edu Mon Apr 26 21:07:53 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Mon, 26 Apr 2010 18:07:53 -0700 Subject: [Biojava-l] PDBFileParser and Atom element symbol In-Reply-To: References: Message-ID: Hi Andy Questions: > * Is this pattern documented in the PDB specification? > see here: http://www.wwpdb.org/documentation/format23/sect9.html#ATOM > * If this pattern can be relied on, why are columns 77-78 also dedicated to > the element symbol? 
> That is the atom's element symbol (as given in the periodic table), in contrast to the first name, which contains numbering information. * Should reliance on the pattern be hidden behind a BioJava method? > If you think that is important we could probably provide an enum for all atom types. There are two categories though: the periodic table symbol and the one that is related to the position in an amino acid.... Andreas > > > > ------------------------------ > > *From:* andreas.prlic at gmail.com [mailto:andreas.prlic at gmail.com] *On > Behalf Of *Andreas Prlic > *Sent:* Friday, April 23, 2010 6:52 PM > *To:* Andy Thomas-Cramer > *Cc:* biojava-l at lists.open-bio.org > *Subject:* Re: [Biojava-l] PDBFileParser and Atom element symbol > > > > Hi Andy, > > you could check with Atom.getFullname(), which contains the space > characters from the PDB file: > e.g Calpha: " CA ", Calcium "CA " > > in addition the parent group of a Calpha atom is usually an AminoAcid and > for Calciums it is a Hetatom group... > > Andreas > > On Fri, Apr 23, 2010 at 3:58 PM, Andy Thomas-Cramer < > thomascramera at dnastar.com> wrote: > > > > Is there an easy way to identify the type of atom referenced by an Atom > object? > > For example, if Atom.getName() is "CA", is the element calcium or the > atom carbon alpha? > > If not, would it be feasible to add a method providing this in Atom, > AtomImpl, and parsing it in PDBFileParser, using the columns defined at > http://www.wwpdb.org/documentation/format32/sect9.html#ATOM? > > > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > > > -- > ----------------------------------------------------------------------- > Dr. 
Andreas Prlic > Senior Scientist, RCSB PDB Protein Data Bank > University of California, San Diego > (+1) 858.246.0526 > ----------------------------------------------------------------------- > -- ----------------------------------------------------------------------- Dr. Andreas Prlic Senior Scientist, RCSB PDB Protein Data Bank University of California, San Diego (+1) 858.246.0526 ----------------------------------------------------------------------- From rmb32 at cornell.edu Mon Apr 26 18:02:11 2010 From: rmb32 at cornell.edu (Robert Buels) Date: Mon, 26 Apr 2010 15:02:11 -0700 Subject: [Biojava-l] Google Summer of Code - accepted students Message-ID: <4BD60D63.1040400@cornell.edu> Hi all, I'm pleased to announce the acceptance of OBF's 2010 Google Summer of Code students, listed in alphabetical order with their project titles and primary mentors: Mark Chapman (PM Andreas Prlic) - Improvements to BioJava including Implementation of Multiple Sequence Alignment Algorithms Jianjiong Gao (PM Peter Rose) - BioJava Packages for Identification, Classification, and Visualization of Posttranslational Modification of Proteins Kazuhiro Hayashi (PM Naohisa Goto) - Ruby 1.9.2 support of BioRuby Sara Rayburn (PM Christian Zmasek) - Implementing Speciation & Duplication Inference Algorithm for Binary and Non-binary Species Tree Joao Pedro Garcia Lopes Maia Rodrigues (PM Eric Talevich) - Extending Bio.PDB: broadening the usefulness of BioPython's Structural Biology module Jun Yin (PM Chris Fields) - BioPerl Alignment Subsystem Refactoring Congratulations to our accepted students! All told, we had 52 applications submitted for the 6 slots (5 originally assigned, plus 1 extra) allotted to us by Google. Proposals were extremely competitive: 6 out of 52 translates to an 11.5% acceptance rate. We received a lot of really excellent proposals, the decisions were not easy. Thanks very much to all the students who applied, we very much appreciate your hard work. 
Here's to a great 2010 Summer of Code, I'm sure these students will do some wonderful work. Rob Buels OBF GSoC 2010 Administrator From andreas at sdsc.edu Tue Apr 27 01:33:51 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Mon, 26 Apr 2010 22:33:51 -0700 Subject: [Biojava-l] accepted GSoC projects Message-ID: Dear all, Google has released the results for GSoC: Congratulations to Mark Chapman and Jianjiong Gao for having been accepted to work on the MSA and PTM projects for BioJava! Let's start the "community bonding" process ( http://en.flossmanuals.net/GSoCMentoring/MindtheGap ) and we are all looking forward to working with you on this during the summer. The Mentors and co-mentors will be Peter Rose for the PTM and Scooter Willis and Kyle Ellrott for the MSA project (and me). I want to thank all of you who submitted proposals or showed interest in other ways for the Google Summer of Code. We hope you are not too disappointed if your application did not get accepted this time. We had a large number (52) of applications and the overall quality of the submissions was very high. We would like to stay in touch with you and we hope that you are interested in BioJava also beyond the scope of GSoC. There are a number of different ways to contribute: We are always looking for people who provide code and patches to further improve our library, help out with the documentation on the Wiki page, or answer questions on the mailing lists. Let's all give Mark and Jianjiong a warm welcome to the BioJava community. For those of you who are interested in following the progress of the projects, as usual, the development related discussions are going to be on the biojava-dev list. Happy coding!
Andreas From jianjiong.gao at gmail.com Tue Apr 27 15:13:12 2010 From: jianjiong.gao at gmail.com (Jianjiong Gao) Date: Tue, 27 Apr 2010 14:13:12 -0500 Subject: [Biojava-l] [Biojava-dev] accepted GSoC projects In-Reply-To: References: Message-ID: Dear Dr. Prlic and Everyone, Thanks for the warm welcome.
I am so glad that I have the chance to work with the BioJava community this summer. I would like to briefly introduce myself. My name is Jianjiong (JJ) Gao. I am a PhD student in Computer Science at University of Missouri, Columbia. My study is focusing on Bioinformatics, specifically computational proteomics and PTMs. I came across BioJava about two years ago when I was working on a plugin for Cytoscape, and was attracted by the idea of providing generic Java API for bioinformatics applications. I was thinking maybe someday I could do some coding for BioJava. And now I got the chance :) Best Regards, -JJ On Tue, Apr 27, 2010 at 12:33 AM, Andreas Prlic wrote: > Dear all, > > Google has released the results for GSoC: Congratulations to Mark Chapman > and Jianjiong Gao for having been accepted to work on the MSA and PTM > projects for BioJava! Let's start the "community bonding" process ( > http://en.flossmanuals.net/GSoCMentoring/MindtheGap ) and we all are > looking forward to work with you on this during the summer. The Mentors and > co-mentors will be Peter Rose for the PTM and Scooter Willis and Kyle > Ellrott for the MSA project (and me). > > I want to thank all of of you who submitted proposals or showed interest in > other ways for the Google Summer of Code. We hope you are not too > disappointed if your application did not get accepted this time. We had a > large number (52) applications and the the overall quality of the > submissions was very high. We would like to stay in touch with you and we > hope that you are interested in BioJava also beyond the scope of GSoC. There > are a number of different ways how to contribute: We are always looking for > people who provide code and patches to further improve our library, help out > with the documentation on the Wiki page, or answer questions on the mailing > lists. > > Let's all give Mark and Jianjiong a warm welcome to the BioJava community.
> For those of you who are interested in following the progress of the > projects, as usually, the development related discussions are going to be on > the biojava-dev list. > > Happy coding! > > Andreas > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev > From chapman at cs.wisc.edu Wed Apr 28 00:18:25 2010 From: chapman at cs.wisc.edu (Mark Chapman) Date: Tue, 27 Apr 2010 23:18:25 -0500 Subject: [Biojava-l] accepted GSoC projects In-Reply-To: References: Message-ID: <4BD7B711.9090108@cs.wisc.edu> Hi all, Thank you to Google, Open Bioinformatics Foundation, BioJava, and my mentors for this opportunity. As a short introduction, I am Mark Chapman, a graduate student in Computer Sciences at the University of Wisconsin - Madison. My focus is in artificial intelligence and bioinformatics. This summer, I will add a Multiple Sequence Alignment module to BioJava. My first task will be to update the alignment module to BioJava3 and to design the interface for MSA. My second goal is to implement a progressive MSA styled after clustalw. After that, I will add alternative routines for each step. Any ideas for the MSA project as well as more sources of programming wisdom are quite welcome. For example, Andreas suggested a series about Java parallelism and lazy execution (http://apocalisp.wordpress.com/2008/06/18/parallel-strategies-and-the-callable-monad/). I also noted a useful tip for iterative development (http://en.flossmanuals.net/GSoCMentoring/Workflow). Thanks again, Mark On 4/27/2010 12:33 AM, Andreas Prlic wrote: > Dear all, > > Google has released the results for GSoC: Congratulations to Mark > Chapman and Jianjiong Gao for having been accepted to work on the MSA > and PTM projects for BioJava! 
Let's start the "community bonding" > process ( http://en.flossmanuals.net/GSoCMentoring/MindtheGap ) and we > all are looking forward to work with you on this during the summer. The > Mentors and co-mentors will be Peter Rose for the PTM and Scooter Willis > and Kyle Ellrott for the MSA project (and me). > > I want to thank all of of you who submitted proposals or showed interest > in other ways for the Google Summer of Code. We hope you are not too > disappointed if your application did not get accepted this time. We had > a large number (52) applications and the the overall quality of the > submissions was very high. We would like to stay in touch with you and > we hope that you are interested in BioJava also beyond the scope of > GSoC. There are a number of different ways how to contribute: We are > always looking for people who provide code and patches to further > improve our library, help out with the documentation on the Wiki page, > or answer questions on the mailing lists. > > Let's all give Mark and Jianjiong a warm welcome to the BioJava > community. For those of you who are interested in following the > progress of the projects, as usually, the development related > discussions are going to be on the biojava-dev list. > > Happy coding! > > Andreas > > From bernd.jagla at pasteur.fr Wed Apr 28 03:25:05 2010 From: bernd.jagla at pasteur.fr (Bernd Jagla) Date: Wed, 28 Apr 2010 09:25:05 +0200 Subject: [Biojava-l] DAS client: how to retrieve features for a sequence region Message-ID: <7F6FFC6512D64E0E98DF74DB837C9610@zillumina> Hi there, I am trying to retrieve information (features) from the UCSC genome browser using the DAS interface. I am looking at the org.biojava.bio.program.das sources. I can retrieve all top level entry points with DASSequenceDB(dbURL) (Apparently the last entry from the returned XML object gives a [Fatal Error] :1:1: Content is not allowed in prolog. Which I am ignoring...)
and also the DSN entries using: DAS das = new DAS(); das.addDasURL(new URL(dbURLString)); for(Iterator i = das.getReferenceServers().iterator(); i.hasNext(); ) {.... When I try to access features for a top level entry point, i.e. a reference sequence, I have the impression that first all features for a given reference sequence are being downloaded. My questions: How can I access only the features of a specific region? I guess in DAS terms I want to specify the segment part of the URL (http://genome.ucsc.edu/cgi-bin/das/hg17/features?segment=22:15000000,16000000). I would also like to get the list of available features. How can I achieve this? From Wireshark output I can see that this is being retrieved somehow behind the scenes. How can I access this information? I am looking at TestDAS*.java; are there any other examples around that I can use to learn from? Thanks a lot for your kind support, Best, Bernd From er.indupandey at gmail.com Wed Apr 28 12:22:10 2010 From: er.indupandey at gmail.com (indu pandey) Date: Wed, 28 Apr 2010 09:22:10 -0700 Subject: [Biojava-l] regarding errors Message-ID: Hi, when I try to run this code package javaapplication10; import org.biojava.bio.symbol.*; import org.biojava.bio.seq.*; public class TranscribeDNAtoRNA { public static void main(String[] args) { try { //make a DNA SymbolList SymbolList symL = DNATools.createDNA("ATGTAAGGCCAGTGT"); //transcribe it to RNA (after BioJava 1.4 this method is deprecated) symL = RNATools.transcribe(symL); //(after BioJava 1.4 use this method instead) symL = DNATools.toRNA(symL); //just to prove it worked System.out.println(symL.seqString()); } catch (IllegalSymbolException ex) { //this will happen if you try and make the DNA seq using non IUB symbols ex.printStackTrace(); }catch (IllegalAlphabetException ex) { //this will happen if you try and transcribe a non DNA SymbolList ex.printStackTrace(); } } } I get the following error:
org.biojava.bio.symbol.IllegalAlphabetException: The source alphabet and translation table source alphabets don't match: RNA and DNA
at org.biojava.bio.symbol.TranslatedSymbolList.<init>(TranslatedSymbolList.java:75)
at org.biojava.bio.symbol.SymbolListViews.translate(SymbolListViews.java:125)
at org.biojava.bio.seq.DNATools.toRNA(DNATools.java:490)
at javaapplication10.TranscribeDNAtoRNA.main(TranscribeDNAtoRNA.java:23)
From andreas at sdsc.edu Wed Apr 28 13:31:58 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Wed, 28 Apr 2010 10:31:58 -0700 Subject: [Biojava-l] accepted GSoC projects In-Reply-To: <4BD7B711.9090108@cs.wisc.edu> References: <4BD7B711.9090108@cs.wisc.edu> Message-ID: > Any ideas for the MSA project as well as more sources of programming wisdom > are quite welcome. For example, Andreas suggested a series about Java > parallelism and lazy execution ( > http://apocalisp.wordpress.com/2008/06/18/parallel-strategies-and-the-callable-monad/). > credits for the links go to Scooter, who recommended those ;-) My general recommendation is to read Joshua Bloch's "Effective Java". http://java.sun.com/docs/books/effective/ It is a collection of rules that should help in avoiding some frequently made mistakes... Andreas > I also noted a useful tip for iterative development ( > http://en.flossmanuals.net/GSoCMentoring/Workflow). > > Thanks again, > Mark > > > > On 4/27/2010 12:33 AM, Andreas Prlic wrote: > >> Dear all, >> >> Google has released the results for GSoC: Congratulations to Mark >> Chapman and Jianjiong Gao for having been accepted to work on the MSA >> and PTM projects for BioJava! Let's start the "community bonding" >> process ( http://en.flossmanuals.net/GSoCMentoring/MindtheGap ) and we >> all are looking forward to work with you on this during the summer. The >> Mentors and co-mentors will be Peter Rose for the PTM and Scooter Willis >> and Kyle Ellrott for the MSA project (and me).
>> >> I want to thank all of of you who submitted proposals or showed interest >> in other ways for the Google Summer of Code. We hope you are not too >> disappointed if your application did not get accepted this time. We had >> a large number (52) applications and the the overall quality of the >> submissions was very high. We would like to stay in touch with you and >> we hope that you are interested in BioJava also beyond the scope of >> GSoC. There are a number of different ways how to contribute: We are >> always looking for people who provide code and patches to further >> improve our library, help out with the documentation on the Wiki page, >> or answer questions on the mailing lists. >> >> Let's all give Mark and Jianjiong a warm welcome to the BioJava >> community. For those of you who are interested in following the >> progress of the projects, as usually, the development related >> discussions are going to be on the biojava-dev list. >> >> Happy coding! >> >> Andreas >> >> >> -- ----------------------------------------------------------------------- Dr. Andreas Prlic Senior Scientist, RCSB PDB Protein Data Bank University of California, San Diego (+1) 858.246.0526 ----------------------------------------------------------------------- From jw12 at sanger.ac.uk Wed Apr 28 16:21:13 2010 From: jw12 at sanger.ac.uk (Jonathan Warren) Date: Wed, 28 Apr 2010 21:21:13 +0100 Subject: [Biojava-l] DAS client: how to retrieve features for a sequence region In-Reply-To: <7F6FFC6512D64E0E98DF74DB837C9610@zillumina> References: <7F6FFC6512D64E0E98DF74DB837C9610@zillumina> Message-ID: <58A5F4B6-C3E3-45FA-9C15-6676BB953C4F@sanger.ac.uk> Hi Bernd For the UCSC you need to filter on types. 
see http://genome.ucsc.edu/FAQ/FAQdownloads.html#downloads
there is a section called "Downloading data from the UCSC DAS server"

for DAS libraries you can see a tutorial here http://www.biodas.org/wiki/DASWorkshop2010#Day_2

the one you would be most interested in is the Dasobert tutorial (http://www.ebi.ac.uk/~rafael/dokuwiki/doku.php?id=das:courses:dasobert) for DAS client creation, but there is also a good JavaScript library called JSDas.

If you need any more info, don't hesitate to ask.

Jonathan.

On 28 Apr 2010, at 08:25, Bernd Jagla wrote:

> Hi there,
>
> I am trying to retrieve information (features) from the UCSC genome browser
> using the DAS interface.
> I am looking at the org.biojava.bio.program.das sources. I can retrieve all
> top level entry points with
> DASSequenceDB(dbURL)
> (Apperently the last entry from the return XML object gives a
> [Fatal Error] :1:1: Content is not allowed in prolog.
> Which I am ignoring...)
>
> and also the DSN entries using:
> DAS das = new DAS();
> das.addDasURL(new URL(dbURLString));
> for(Iterator i = das.getReferenceServers().iterator(); i.hasNext(); )
> {....
>
> When I try to access features for a top level entry point, i.e. a reference
> sequence I have the impression that first all features for a given reference
> sequence are being downloaded.
>
> My questions:
>
> How can I access only the features of a specific region? I guess in DAS
> terms I want to specify the segment part of the URL
> (http://genome.ucsc.edu/cgi-bin/das/hg17/features?segment=22:15000000,16000000).
>
> I would also like to get the list of available features. How can I achieve
> this? From a wireshark output I can see that this is being retrieved somehow
> behind the scene. How can I access this information?
>
> I am looking at TestDAS*.java; are there any other examples around that I
> can use to learn from?
>
> Thanks a lot for your kind support,
>
> Best,
>
> Bernd
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

Jonathan Warren
Senior Developer and DAS coordinator
jw12 at sanger.ac.uk
Ext: 2314
Telephone: 01223 492314

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

From chapman at cs.wisc.edu Wed Apr 28 21:09:07 2010
From: chapman at cs.wisc.edu (Mark Chapman)
Date: Wed, 28 Apr 2010 20:09:07 -0500
Subject: [Biojava-l] [Biojava-dev] accepted GSoC projects
In-Reply-To: <6C3A102F-AF2B-4E29-9C84-BB6B881BD083@scripps.edu>
References: <4BD7B711.9090108@cs.wisc.edu> <6C3A102F-AF2B-4E29-9C84-BB6B881BD083@scripps.edu>
Message-ID: <4BD8DC33.7010607@cs.wisc.edu>

Here is a summary of the concurrency lessons I learned that are useful with or without the functional programming paradigm --

1: implement Callable to submit tasks for concurrent/parallel/lazy execution
   - call() methods just wrap a call to the computation-intensive method
2: share a fixed-size thread pool with a task queue to avoid
   - overhead of thread creation/destruction,
   - too many simultaneous threads, and
   - most blocking issues
3: place thread-blocking Future.get() calls within tasks later in the queue
   - while(!Future.isDone()) Thread.yield(); may also help keep the pool active
4: execution in a task queue also enables easier logging and progress listening

There are two obvious places concurrent execution will fit in the MSA module --

1: building the distance matrix
   - queue pairwise alignment/scoring tasks in a loop over all sequence pairs
2: progressive alignment
   - queue profile-profile alignment tasks in a postfix traversal of the guide tree (from leaves to root)

All our library copies of "Effective Java" are checked out, so I ordered a copy for my personal library. The sample chapter on generics sold me.

Mark

On 4/28/2010 12:57 PM, Scooter Willis wrote:
> Andreas
>
> Those links were sent to me by Mark Southern who sits a couple doors down and a past BioJava contributor for the sequence viewer. We should avoid bringing in any external parallel frameworks but at minimum give ourselves enough abstraction with a backend multi-threaded job-processing approach to take advantage of a multi-processor box and a cluster via Terracotta. If the abstraction of the jobs and the mapping of resources is generic enough then that allows different implementations in various cluster environments for those who have found the next best thing in parallel computing!
>
> Scooter
>
> On Apr 28, 2010, at 1:31 PM, Andreas Prlic wrote:
>
>>> Any ideas for the MSA project as well as more sources of programming wisdom
>>> are quite welcome. For example, Andreas suggested a series about Java
>>> parallelism and lazy execution (
>>> http://apocalisp.wordpress.com/2008/06/18/parallel-strategies-and-the-callable-monad/).
>>>
>>
>> credits for the links go to Scooter, who recommended those ;-) My general
>> recommendation is to read Joshua Bloch's "Effective Java".
>> http://java.sun.com/docs/books/effective/ It is a collection of rules that
>> should help in avoiding some frequently made mistakes...
>>
>> Andreas
>>
>>> I also noted a useful tip for iterative development (
>>> http://en.flossmanuals.net/GSoCMentoring/Workflow).
>>>
>>> Thanks again,
>>> Mark
>>>
>>> On 4/27/2010 12:33 AM, Andreas Prlic wrote:
>>>
>>>> Dear all,
>>>>
>>>> Google has released the results for GSoC: Congratulations to Mark
>>>> Chapman and Jianjiong Gao for having been accepted to work on the MSA
>>>> and PTM projects for BioJava!
Let's start the "community bonding" >>>> process ( http://en.flossmanuals.net/GSoCMentoring/MindtheGap ) and we >>>> all are looking forward to work with you on this during the summer. The >>>> Mentors and co-mentors will be Peter Rose for the PTM and Scooter Willis >>>> and Kyle Ellrott for the MSA project (and me). >>>> >>>> I want to thank all of of you who submitted proposals or showed interest >>>> in other ways for the Google Summer of Code. We hope you are not too >>>> disappointed if your application did not get accepted this time. We had >>>> a large number (52) applications and the the overall quality of the >>>> submissions was very high. We would like to stay in touch with you and >>>> we hope that you are interested in BioJava also beyond the scope of >>>> GSoC. There are a number of different ways how to contribute: We are >>>> always looking for people who provide code and patches to further >>>> improve our library, help out with the documentation on the Wiki page, >>>> or answer questions on the mailing lists. >>>> >>>> Let's all give Mark and Jianjiong a warm welcome to the BioJava >>>> community. For those of you who are interested in following the >>>> progress of the projects, as usually, the development related >>>> discussions are going to be on the biojava-dev list. >>>> >>>> Happy coding! >>>> >>>> Andreas >>>> >>>> >>>> >> >> >> -- >> ----------------------------------------------------------------------- >> Dr. 
Andreas Prlic >> Senior Scientist, RCSB PDB Protein Data Bank >> University of California, San Diego >> (+1) 858.246.0526 >> ----------------------------------------------------------------------- >> _______________________________________________ >> biojava-dev mailing list >> biojava-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-dev > From bernd.jagla at pasteur.fr Thu Apr 29 02:30:03 2010 From: bernd.jagla at pasteur.fr (Bernd Jagla) Date: Thu, 29 Apr 2010 08:30:03 +0200 Subject: [Biojava-l] DAS client: how to retrieve features for a sequence region In-Reply-To: <58A5F4B6-C3E3-45FA-9C15-6676BB953C4F@sanger.ac.uk> References: <7F6FFC6512D64E0E98DF74DB837C9610@zillumina> <58A5F4B6-C3E3-45FA-9C15-6676BB953C4F@sanger.ac.uk> Message-ID: Hi Jonathan, Just to clarify, I need to write my own das client? I was hoping to be able to use most of the functionality especially for the parsing of the XML and creating the URLs by means of functions/methods that are already around. I am now going into debug mode for the DAS package in biojava to look for the XML parsing, if you any further pointers on specific methods I should be looking at it would mean a lot to me. In short, I think I can create the URLs from scratch with not much effort. I don't currently know how to put the XML into a data structure and how this data structure should look like. Thanks for your kind help, Bernd _____ From: Jonathan Warren [mailto:jw12 at sanger.ac.uk] Sent: Wednesday, April 28, 2010 10:21 PM To: Bernd Jagla Cc: biojava-l at lists.open-bio.org Subject: Re: [Biojava-l] DAS client: how to retrieve features for a sequence region Hi Bernd For the UCSC you need to filter on types. 
see http://genome.ucsc.edu/FAQ/FAQdownloads.html#downloads there is a section called "Downloading data from the UCSC DAS server" for DAS libraries you can see a tutorial here http://www.biodas.org/wiki/DASWorkshop2010#Day_2 the one you would be most interested in is the Dasobert tutorial (http://www.ebi.ac.uk/~rafael/dokuwiki/doku.php?id=das:courses:dasobert) for DAS client creation, but there is a also a good javascript library as well called JSDas. Any more info then don't hesitate to ask. Jonathan. On 28 Apr 2010, at 08:25, Bernd Jagla wrote: Hi there, I am trying to retrieve information (features) from the UCSC genome browser using the DAS interface. I am looking at the org.biojava.bio.program.das sources. I can retrieve all top level entry points with DASSequenceDB(dbURL) (Apperently the last entry from the return XML object gives a [Fatal Error] :1:1: Content is not allowed in prolog. Which I am ignoring...) and also the DSN entries using: DAS das = new DAS(); das.addDasURL(new URL(dbURLString)); for(Iterator i = das.getReferenceServers().iterator(); i.hasNext(); ) {.... When I try to access features for a top level entry point, i.e. a reference sequence I have the impression that first all features for a given reference sequence are being downloaded. My questions: How can I access only the features of a specific region? I guess in DAS terms I want to specify the segment part of the URL (http://genome.ucsc.edu/cgi-bin/das/hg17/features?segment=22:15000000,160000 00). I would also like to get the list of available features. How can I achieve this? From a wireshark output I can see that this is being retrieved somehow behind the scene. How can I access this information? I am looking at TestDAS*.java; are there any other examples around that I can use to learn from? 
Thanks a lot for your kind support,

Best,

Bernd

_______________________________________________
Biojava-l mailing list - Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l

Jonathan Warren
Senior Developer and DAS coordinator
jw12 at sanger.ac.uk
Ext: 2314
Telephone: 01223 492314

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

From jw12 at sanger.ac.uk Thu Apr 29 04:26:40 2010
From: jw12 at sanger.ac.uk (Jonathan Warren)
Date: Thu, 29 Apr 2010 09:26:40 +0100
Subject: [Biojava-l] DAS client: how to retrieve features for a sequence region
In-Reply-To:
References: <7F6FFC6512D64E0E98DF74DB837C9610@zillumina> <58A5F4B6-C3E3-45FA-9C15-6676BB953C4F@sanger.ac.uk>
Message-ID:

The link I gave you http://www.ebi.ac.uk/~rafael/dokuwiki/doku.php?id=das:courses:dasobert shows examples of how to connect to 'European' style DAS sources. For the UCSC and GBrowse type DAS sources you may have to play around with the URLs to get the info you want, as they work slightly differently from other DAS data sources and use the types to filter data. I would suggest contacting the UCSC for more info.

The dasobert library is what you should use. The DASSequenceDB.java that you are currently looking at in BioJava is old and not really supported anymore.

> I was hoping to be able to use most of the functionality especially
> for the parsing of the XML and creating the URLs by means of
> functions/methods that are already around.

this is what the dasobert library is for ;)

On 29 Apr 2010, at 07:30, Bernd Jagla wrote:

> Hi Jonathan,
>
> Just to clarify, I need to write my own das client? I was hoping to
> be able to use most of the functionality especially for the parsing
> of the XML and creating the URLs by means of functions/methods that
> are already around.
> I am now going into debug mode for the DAS package in biojava to
> look for the XML parsing, if you any further pointers on specific
> methods I should be looking at it would mean a lot to me.
> In short, I think I can create the URLs from scratch with not much
> effort. I don't currently know how to put the XML into a data
> structure and how this data structure should look like.
>
> Thanks for your kind help,
>
> Bernd
>
> From: Jonathan Warren [mailto:jw12 at sanger.ac.uk]
> Sent: Wednesday, April 28, 2010 10:21 PM
> To: Bernd Jagla
> Cc: biojava-l at lists.open-bio.org
> Subject: Re: [Biojava-l] DAS client: how to retrieve features for a
> sequence region
>
> Hi Bernd
>
> For the UCSC you need to filter on types. see http://genome.ucsc.edu/FAQ/FAQdownloads.html#downloads
> there is a section called "Downloading data from the UCSC DAS server"
>
> for DAS libraries you can see a tutorial here http://www.biodas.org/wiki/DASWorkshop2010#Day_2
>
> the one you would be most interested in is the Dasobert tutorial
> (http://www.ebi.ac.uk/~rafael/dokuwiki/doku.php?id=das:courses:dasobert)
> for DAS client creation, but there is a also a good javascript
> library as well called JSDas.
>
> Any more info then don't hesitate to ask.
>
> Jonathan.
>
> On 28 Apr 2010, at 08:25, Bernd Jagla wrote:
>
> Hi there,
>
> I am trying to retrieve information (features) from the UCSC genome browser
> using the DAS interface.
> I am looking at the org.biojava.bio.program.das sources. I can retrieve all
> top level entry points with
> DASSequenceDB(dbURL)
> (Apperently the last entry from the return XML object gives a
> [Fatal Error] :1:1: Content is not allowed in prolog.
> Which I am ignoring...)
>
> and also the DSN entries using:
> DAS das = new DAS();
> das.addDasURL(new URL(dbURLString));
> for(Iterator i = das.getReferenceServers().iterator(); i.hasNext(); )
> {....
>
> When I try to access features for a top level entry point, i.e.
a > reference > sequence I have the impression that first all features for a given > reference > sequence are being downloaded. > > My questions: > > How can I access only the features of a specific region? I guess in > DAS > terms I want to specify the segment part of the URL > (http://genome.ucsc.edu/cgi-bin/das/hg17/features?segment=22:15000000,160000 > 00). > > I would also like to get the list of available features. How can I > achieve > this? From a wireshark output I can see that this is being retrieved > somehow > behind the scene. How can I access this information? > > I am looking at TestDAS*.java; are there any other examples around > that I > can use to learn from? > > Thanks a lot for your kind support, > > Best, > > Bernd > > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > > Jonathan Warren > Senior Developer and DAS coordinator > jw12 at sanger.ac.uk > Ext: 2314 > Telephone: 01223 492314 > > > > > > > -- The Wellcome Trust Sanger Institute is operated by Genome > Research Limited, a charity registered in England with number > 1021457 and a company registered in England with number 2742969, > whose registered office is 215 Euston Road, London, NW1 2BE. Jonathan Warren Senior Developer and DAS coordinator jw12 at sanger.ac.uk Ext: 2314 Telephone: 01223 492314 -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From ayates at ebi.ac.uk Thu Apr 29 04:51:23 2010 From: ayates at ebi.ac.uk (Andy Yates) Date: Thu, 29 Apr 2010 09:51:23 +0100 Subject: [Biojava-l] regarding errors In-Reply-To: References: Message-ID: I believe your problem is that you are attempting to transcribe the DNA to RNA twice. 
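A toy sketch of why the second call fails, in plain Java with no BioJava dependency. The Seq class, the Alphabet enum and the toRna method are hypothetical stand-ins, not BioJava's API. They only mimic a conversion that accepts DNA input and rejects anything else, which is the kind of check the IllegalAlphabetException in the quoted stack trace points to.

```java
// Toy model only: these types are hypothetical and are not part of BioJava.
final class DoubleTranscribeSketch {

    enum Alphabet { DNA, RNA }

    /** A sequence string tagged with the alphabet it is written in. */
    static final class Seq {
        final Alphabet alphabet;
        final String symbols;

        Seq(Alphabet alphabet, String symbols) {
            this.alphabet = alphabet;
            this.symbols = symbols;
        }
    }

    /** Converts DNA to RNA (T becomes U); refuses input that is not DNA. */
    static Seq toRna(Seq in) {
        if (in.alphabet != Alphabet.DNA) {
            throw new IllegalArgumentException(
                "source alphabet is " + in.alphabet + " but the conversion expects DNA");
        }
        return new Seq(Alphabet.RNA, in.symbols.replace('T', 'U'));
    }

    public static void main(String[] args) {
        Seq dna = new Seq(Alphabet.DNA, "ATGTAAGGCCAGTGT");
        Seq rna = toRna(dna);        // first conversion succeeds
        System.out.println(rna.symbols);
        try {
            toRna(rna);              // second conversion is handed RNA and is rejected
        } catch (IllegalArgumentException ex) {
            System.out.println("second call failed: " + ex.getMessage());
        }
    }
}
```

The point of the sketch is only that the result of the first conversion is no longer DNA, so feeding it into a second DNA-to-RNA conversion cannot pass an alphabet check.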
If you comment out the line:

//symL = RNATools.transcribe(symL);

then you should find the code will work.

Regards,

Andy

On 28 Apr 2010, at 17:22, indu pandey wrote:

> hi
>
> When i m trying to run this code
>
> package javaapplication10;
> import org.biojava.bio.symbol.*;
> import org.biojava.bio.seq.*;
>
> public class TranscribeDNAtoRNA {
>   public static void main(String[] args) {
>     try {
>       //make a DNA SymbolList
>       SymbolList symL = DNATools.createDNA("ATGTAAGGCCAGTGT");
>       //transcribe it to RNA (after BioJava 1.4 this method is deprecated)
>       symL = RNATools.transcribe(symL);
>       //(after BioJava 1.4 use this method instead)
>       symL = DNATools.toRNA(symL);
>       //just to prove it worked
>       System.out.println(symL.seqString());
>     }
>     catch (IllegalSymbolException ex) {
>       //this will happen if you try and make the DNA seq using non IUB symbols
>       ex.printStackTrace();
>     }catch (IllegalAlphabetException ex) {
>       //this will happen if you try and transcribe a non DNA SymbolList
>       ex.printStackTrace();
>     }
>   }
> }
>
> i get following errors:
>
> *org.biojava.bio.symbol.IllegalAlphabetException: The source alphabet and
> translation table source alphabets don't match: RNA and DNA
>   at org.biojava.bio.symbol.TranslatedSymbolList.<init>(TranslatedSymbolList.java:75)
>   at org.biojava.bio.symbol.SymbolListViews.translate(SymbolListViews.java:125)
>   at org.biojava.bio.seq.DNATools.toRNA(DNATools.java:490)
>   at javaapplication10.TranscribeDNAtoRNA.main(TranscribeDNAtoRNA.java:23)
> *
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

--
Andrew Yates
Ensembl Genomes Engineer
EMBL-EBI                          Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus      Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK            http://www.ensemblgenomes.org/

From bernd.jagla at pasteur.fr Thu Apr 29 05:57:58 2010
From: bernd.jagla at pasteur.fr (Bernd Jagla)
Date: Thu, 29 Apr 2010 11:57:58 +0200
Subject: [Biojava-l] DAS client: how to retrieve features for a sequence region
In-Reply-To:
References: <7F6FFC6512D64E0E98DF74DB837C9610@zillumina> <58A5F4B6-C3E3-45FA-9C15-6676BB953C4F@sanger.ac.uk>
Message-ID:

Great, that is very helpful.

One more question: should I be using the Das1 or Das2 implementations? The demo I am looking at uses Das2 (I think), but I am running into problems. By modifying things in the Das2SourceHandler I can now get Ids (instead of using uri). Is this the right way of approaching this, or should I be looking somewhere else?

When you say I have to play around with the URLs, can you give me an example? Is the problem described above part of this? (This is not the URL but rather the XML...)

Sorry for these questions, but I find it extremely difficult to get my head around all these different versions (DAS1/2; dasobert/programs.das; European/Rest; ...)

Thanks a lot,

Bernd

PS. I guess I should have attended the recent meeting.
;( _____ From: Jonathan Warren [mailto:jw12 at sanger.ac.uk] Sent: Thursday, April 29, 2010 10:27 AM To: Bernd Jagla Cc: biojava-l at lists.open-bio.org Subject: Re: [Biojava-l] DAS client: how to retrieve features for a sequence region The link I gave you http://www.ebi.ac.uk/~rafael/dokuwiki/doku.php?id=das:courses:dasobert shows examples of how to connect to 'European' style das sources. For the UCSC and GBrowse type DAS sources you may have to play around with the urls to get the info you want as they work slightly differently to other DAS data sources and use the types to filter data. I would suggest contacting the UCSC for more info. The dasobert library is what you should use- the DASSequenceDB.java that you are currently looking at in biojava are old and not really supported anymore. I was hoping to be able to use most of the functionality especially for the parsing of the XML and creating the URLs by means of functions/methods that are already around. this is what the dasobert library is for ;) On 29 Apr 2010, at 07:30, Bernd Jagla wrote: Hi Jonathan, Just to clarify, I need to write my own das client? I was hoping to be able to use most of the functionality especially for the parsing of the XML and creating the URLs by means of functions/methods that are already around. I am now going into debug mode for the DAS package in biojava to look for the XML parsing, if you any further pointers on specific methods I should be looking at it would mean a lot to me. In short, I think I can create the URLs from scratch with not much effort. I don't currently know how to put the XML into a data structure and how this data structure should look like. Thanks for your kind help, Bernd _____ From: Jonathan Warren [mailto:jw12 at sanger.ac.uk] Sent: Wednesday, April 28, 2010 10:21 PM To: Bernd Jagla Cc: biojava-l at lists.open-bio.org Subject: Re: [Biojava-l] DAS client: how to retrieve features for a sequence region Hi Bernd For the UCSC you need to filter on types. 
see http://genome.ucsc.edu/FAQ/FAQdownloads.html#downloads there is a section called "Downloading data from the UCSC DAS server" for DAS libraries you can see a tutorial here http://www.biodas.org/wiki/DASWorkshop2010#Day_2 the one you would be most interested in is the Dasobert tutorial (http://www.ebi.ac.uk/~rafael/dokuwiki/doku.php?id=das:courses:dasobert) for DAS client creation, but there is a also a good javascript library as well called JSDas. Any more info then don't hesitate to ask. Jonathan. On 28 Apr 2010, at 08:25, Bernd Jagla wrote: Hi there, I am trying to retrieve information (features) from the UCSC genome browser using the DAS interface. I am looking at the org.biojava.bio.program.das sources. I can retrieve all top level entry points with DASSequenceDB(dbURL) (Apperently the last entry from the return XML object gives a [Fatal Error] :1:1: Content is not allowed in prolog. Which I am ignoring...) and also the DSN entries using: DAS das = new DAS(); das.addDasURL(new URL(dbURLString)); for(Iterator i = das.getReferenceServers().iterator(); i.hasNext(); ) {.... When I try to access features for a top level entry point, i.e. a reference sequence I have the impression that first all features for a given reference sequence are being downloaded. My questions: How can I access only the features of a specific region? I guess in DAS terms I want to specify the segment part of the URL (http://genome.ucsc.edu/cgi-bin/das/hg17/features?segment=22:15000000,160000 00). I would also like to get the list of available features. How can I achieve this? From a wireshark output I can see that this is being retrieved somehow behind the scene. How can I access this information? I am looking at TestDAS*.java; are there any other examples around that I can use to learn from? 
Thanks a lot for your kind support, Best, Bernd _______________________________________________ Biojava-l mailing list - Biojava-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biojava-l Jonathan Warren Senior Developer and DAS coordinator jw12 at sanger.ac.uk Ext: 2314 Telephone: 01223 492314 -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. Jonathan Warren Senior Developer and DAS coordinator jw12 at sanger.ac.uk Ext: 2314 Telephone: 01223 492314 -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From thomascramera at dnastar.com Thu Apr 29 14:14:27 2010 From: thomascramera at dnastar.com (Andy Thomas-Cramer) Date: Thu, 29 Apr 2010 13:14:27 -0500 Subject: [Biojava-l] PDBFileParser and Atom element symbol In-Reply-To: References: Message-ID: Yes, I would like to have direct access to the element symbol data that's in the file. Otherwise, anyone that needs the element type has to create rules for interpreting it from the "atom name" field. It feels wrong to attempt to deduce data when it is provided explicitly. These PDB remediation project notes suggest using the element symbol specified in 77-78 http://nar.oxfordjournals.org/cgi/content/full/36/suppl_1/D426#SEC3 "Atom types are provided for every atom (i.e. ATOM record columns 77-78), so prior atom name justification conventions should no longer be assumed in reading atom names." JMOL uses the PDB element symbol if present, else interprets from the "atom name" field. http://wiki.jmol.org/index.php/AtomSets "On PDB format, Jmol will identify the element from columns 77-78 (element symbol, right-justified). 
If this is absent, then it will interpret the "atom name" field (columns 13-14) to deduce the element identity."

JMOL is LGPL. If interpretation is desirable, BioJava could start with Jmol's current approach. Personally, I would be happy just with access to the data in the file.

________________________________
From: andreas.prlic at gmail.com [mailto:andreas.prlic at gmail.com] On Behalf Of Andreas Prlic
Sent: Monday, April 26, 2010 8:08 PM
To: Andy Thomas-Cramer
Cc: biojava-l at lists.open-bio.org
Subject: Re: [Biojava-l] PDBFileParser and Atom element symbol

Hi Andy

Questions:

* Is this pattern documented in the PDB specification?

see here: http://www.wwpdb.org/documentation/format23/sect9.html#ATOM

* If this pattern can be relied on, why are columns 77-78 also dedicated to the element symbol?

That is the atom's element symbol (as given in the periodic table), in contrast to the first name, which contains numbering information.

* Should reliance on the pattern be hidden behind a BioJava method?

If you think that is important we could probably provide an enum for all atom types. There are two categories though: the periodic table symbol and the one that is related to the position in an amino acid....

Andreas

________________________________
From: andreas.prlic at gmail.com [mailto:andreas.prlic at gmail.com] On Behalf Of Andreas Prlic
Sent: Friday, April 23, 2010 6:52 PM
To: Andy Thomas-Cramer
Cc: biojava-l at lists.open-bio.org
Subject: Re: [Biojava-l] PDBFileParser and Atom element symbol

Hi Andy,

you could check with Atom.getFullname(), which contains the space characters from the PDB file:
e.g Calpha: " CA ", Calcium "CA  "

in addition the parent group of a Calpha atom is usually an AminoAcid and for Calciums it is a Hetatom group...

Andreas

On Fri, Apr 23, 2010 at 3:58 PM, Andy Thomas-Cramer wrote:

Is there an easy way to identify the type of atom referenced by an Atom object?
For example, if Atom.getName() is "CA", is the element calcium or the atom carbon alpha?
If not, would it be feasible to add a method providing this in Atom, AtomImpl, and parsing it in PDBFileParser, using the columns defined at http://www.wwpdb.org/documentation/format32/sect9.html#ATOM? _______________________________________________ Biojava-l mailing list - Biojava-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biojava-l -- ----------------------------------------------------------------------- Dr. Andreas Prlic Senior Scientist, RCSB PDB Protein Data Bank University of California, San Diego (+1) 858.246.0526 ----------------------------------------------------------------------- -- ----------------------------------------------------------------------- Dr. Andreas Prlic Senior Scientist, RCSB PDB Protein Data Bank University of California, San Diego (+1) 858.246.0526 ----------------------------------------------------------------------- From pwrose at ucsd.edu Thu Apr 29 15:53:33 2010 From: pwrose at ucsd.edu (Peter Rose) Date: Thu, 29 Apr 2010 12:53:33 -0700 Subject: [Biojava-l] PDBFileParser and Atom element symbol In-Reply-To: References: Message-ID: <002f01cae7d5$a673fcf0$f35bf6d0$@edu> Since there was a request to be able to access element information, I've added an Element enum to the org.biojava.bio.structure package that I had developed for another application. Each element has a number of properties such as atomic number, mass, min and max valence, electronegativity, etc. that should be useful. The AtomImpl class now has a getter and setter for Element. Also, the PDB parser now populates the Element in the Atom class. By default the PDB parser tries to parse the element from columns 77-78. As a fallback for mis-formatted PDB files that don't contain an element column, the element is parsed from the atom name. We'll also add element support for the cif parser soon. -Peter ________________________________________________ Peter Rose, Ph.D. 
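The lookup order described in the message above (columns 77-78 first, atom name as a fallback) can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the BioJava parser: the class and method names are hypothetical, the fallback only looks at columns 13-14 of the atom name, and the sample records are built by a helper so that the fixed-width columns are guaranteed to line up.

```java
import java.util.Locale;

// Sketch only: hypothetical helper, not the code that was added to BioJava.
final class PdbElementSketch {

    /** Element symbol from columns 77-78, else guessed from columns 13-14 of the atom name. */
    static String elementSymbol(String record) {
        if (record.length() >= 78) {
            String column = record.substring(76, 78).trim();
            if (!column.isEmpty()) {
                return capitalize(column);
            }
        }
        // Fallback. Simplistic: a hydrogen named "HG11" would be misread as mercury.
        if (record.length() < 14) {
            return "";
        }
        String letters = record.substring(12, 14).replaceAll("[^A-Za-z]", "");
        return capitalize(letters);
    }

    /** "CA" becomes "Ca", "C" stays "C", the empty string stays empty. */
    static String capitalize(String s) {
        if (s.isEmpty()) {
            return "";
        }
        return s.substring(0, 1).toUpperCase(Locale.ROOT)
             + s.substring(1).toLowerCase(Locale.ROOT);
    }

    /** Builds a 78-column ATOM/HETATM line; pass "" as element to leave columns 77-78 blank. */
    static String atomRecord(String recordName, String atomName, String resName, String element) {
        return String.format(Locale.ROOT,
            "%-6s%5d %-4s %3s A%4d%4s%8s%8s%8s%6s%6s%10s%2s",
            recordName, 1, atomName, resName, 1, "",
            "11.104", "6.134", "-6.504", "1.00", "0.00", "", element);
    }

    public static void main(String[] args) {
        // Same atom name letters, different elements: C-alpha versus calcium.
        System.out.println(elementSymbol(atomRecord("ATOM", " CA ", "ALA", "C")));
        System.out.println(elementSymbol(atomRecord("HETATM", "CA  ", " CA", "CA")));
        // Element column left blank: the guess falls back to the atom name columns.
        System.out.println(elementSymbol(atomRecord("ATOM", " CA ", "ALA", "")));
        System.out.println(elementSymbol(atomRecord("HETATM", "CA  ", " CA", "")));
    }
}
```

Reading the explicit column first keeps the name-based guess as a last resort, which matters because the guess is ambiguous for some atom names.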
Scientific Lead RCSB Protein Data Bank (www.pdb.org) San Diego Supercomputer Center (SDSC) and Skaggs School of Pharmacy and Pharmaceutical Sciences Pharmaceutical Sciences Building University of California San Diego -----Original Message----- From: biojava-l-bounces at lists.open-bio.org [mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of biojava-l-request at lists.open-bio.org Sent: Tuesday, April 27, 2010 9:00 AM To: biojava-l at lists.open-bio.org Subject: Biojava-l Digest, Vol 87, Issue 26 Send Biojava-l mailing list submissions to biojava-l at lists.open-bio.org To subscribe or unsubscribe via the World Wide Web, visit http://lists.open-bio.org/mailman/listinfo/biojava-l or, via email, send a message with subject or body 'help' to biojava-l-request at lists.open-bio.org You can reach the person managing the list at biojava-l-owner at lists.open-bio.org When replying, please edit your Subject line so it is more specific than "Re: Contents of Biojava-l digest..." Today's Topics: 1. Re: PDBFileParser and Atom element symbol (Andreas Prlic) 2. Google Summer of Code - accepted students (Robert Buels) 3. accepted GSoC projects (Andreas Prlic) 4. Google Summer of Code - accepted students (Robert Buels) ---------------------------------------------------------------------- Message: 1 Date: Mon, 26 Apr 2010 18:07:53 -0700 From: Andreas Prlic Subject: Re: [Biojava-l] PDBFileParser and Atom element symbol To: Andy Thomas-Cramer Cc: biojava-l at lists.open-bio.org Message-ID: Content-Type: text/plain; charset=ISO-8859-1 Hi Andy Questions: > * Is this pattern documented in the PDB specification? > see here: http://www.wwpdb.org/documentation/format23/sect9.html#ATOM > * If this pattern can be relied on, why are columns 77-78 also dedicated to > the element symbol? > That is the atom's element symbol (as given in the periodic table), in contrast to the first name, which contains numbering information. 
* Should reliance on the pattern be hidden behind a BioJava method? > If you think that is important we could probably provide an enum for all atom types. There are two categories though: the periodic table symbol and the one that is related to the position in an amino acid.... Andreas > > > > ------------------------------ > > *From:* andreas.prlic at gmail.com [mailto:andreas.prlic at gmail.com] *On > Behalf Of *Andreas Prlic > *Sent:* Friday, April 23, 2010 6:52 PM > *To:* Andy Thomas-Cramer > *Cc:* biojava-l at lists.open-bio.org > *Subject:* Re: [Biojava-l] PDBFileParser and Atom element symbol > > > > Hi Andy, > > you could check with Atom.getFullname(), which contains the space > characters from the PDB file: > e.g Calpha: " CA ", Calcium "CA " > > in addition the parent group of a Calpha atom is usually an AminoAcid and > for Calciums it is a Hetatom group... > > Andreas > > On Fri, Apr 23, 2010 at 3:58 PM, Andy Thomas-Cramer < > thomascramera at dnastar.com> wrote: > > > > Is there an easy way to identify the type of atom referenced by an Atom > object? > > For example, if Atom.getName() is "CA", is the element calcium or the > atom carbon alpha? > > If not, would it be feasible to add a method providing this in Atom, > AtomImpl, and parsing it in PDBFileParser, using the columns defined at > http://www.wwpdb.org/documentation/format32/sect9.html#ATOM? > > > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > > > -- > ----------------------------------------------------------------------- > Dr. Andreas Prlic > Senior Scientist, RCSB PDB Protein Data Bank > University of California, San Diego > (+1) 858.246.0526 > ----------------------------------------------------------------------- > -- ----------------------------------------------------------------------- Dr. 
Andreas Prlic Senior Scientist, RCSB PDB Protein Data Bank University of California, San Diego (+1) 858.246.0526 ----------------------------------------------------------------------- ------------------------------ Message: 2 Date: Mon, 26 Apr 2010 15:02:11 -0700 From: Robert Buels Subject: [Biojava-l] Google Summer of Code - accepted students To: rmb32 at cornell.edu Message-ID: <4BD60D63.1040400 at cornell.edu> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Hi all, I'm pleased to announce the acceptance of OBF's 2010 Google Summer of Code students, listed in alphabetical order with their project titles and primary mentors: Mark Chapman (PM Andreas Prlic) - Improvements to BioJava including Implementation of Multiple Sequence Alignment Algorithms Jianjiong Gao (PM Peter Rose) - BioJava Packages for Identification, Classification, and Visualization of Posttranslational Modification of Proteins Kazuhiro Hayashi (PM Naohisa Goto) - Ruby 1.9.2 support of BioRuby Sara Rayburn (PM Christian Zmasek) - Implementing Speciation & Duplication Inference Algorithm for Binary and Non-binary Species Tree Joao Pedro Garcia Lopes Maia Rodrigues (PM Eric Talevich) - Extending Bio.PDB: broadening the usefulness of BioPython's Structural Biology module Jun Yin (PM Chris Fields) - BioPerl Alignment Subsystem Refactoring Congratulations to our accepted students! All told, we had 52 applications submitted for the 6 slots (5 originally assigned, plus 1 extra) allotted to us by Google. Proposals were extremely competitive: 6 out of 52 translates to an 11.5% acceptance rate. We received a lot of really excellent proposals, the decisions were not easy. Thanks very much to all the students who applied, we very much appreciate your hard work. Here's to a great 2010 Summer of Code, I'm sure these students will do some wonderful work. 
Rob Buels OBF GSoC 2010 Administrator ------------------------------ Message: 3 Date: Mon, 26 Apr 2010 22:33:51 -0700 From: Andreas Prlic Subject: [Biojava-l] accepted GSoC projects To: Jianjiong Gao , Mark Chapman , Biojava , biojava-dev Cc: "Rose, Peter" , Scooter Willis , Kyle Ellrott Message-ID: Content-Type: text/plain; charset=ISO-8859-1 Dear all, Google has released the results for GSoC: Congratulations to Mark Chapman and Jianjiong Gao for having been accepted to work on the MSA and PTM projects for BioJava! Let's start the "community bonding" process ( http://en.flossmanuals.net/GSoCMentoring/MindtheGap ) and we all are looking forward to work with you on this during the summer. The Mentors and co-mentors will be Peter Rose for the PTM and Scooter Willis and Kyle Ellrott for the MSA project (and me). I want to thank all of of you who submitted proposals or showed interest in other ways for the Google Summer of Code. We hope you are not too disappointed if your application did not get accepted this time. We had a large number (52) applications and the the overall quality of the submissions was very high. We would like to stay in touch with you and we hope that you are interested in BioJava also beyond the scope of GSoC. There are a number of different ways how to contribute: We are always looking for people who provide code and patches to further improve our library, help out with the documentation on the Wiki page, or answer questions on the mailing lists. Let's all give Mark and Jianjiong a warm welcome to the BioJava community. For those of you who are interested in following the progress of the projects, as usually, the development related discussions are going to be on the biojava-dev list. Happy coding! 
Andreas ------------------------------ Message: 4 Date: Mon, 26 Apr 2010 22:52:57 -0700 From: Robert Buels Subject: [Biojava-l] Google Summer of Code - accepted students To: BioPerl List , BioPython List , BioJava List , BioRuby List , BioSQL List , BioLib List , Open-Bio List , BioDAS List Message-ID: <4BD67BB9.3000804 at cornell.edu> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Hi all, I'm pleased to announce the acceptance of OBF's 2010 Google Summer of Code students, listed in alphabetical order with their project titles and primary mentors: Mark Chapman (PM Andreas Prlic) - Improvements to BioJava including Implementation of Multiple Sequence Alignment Algorithms Jianjiong Gao (PM Peter Rose) - BioJava Packages for Identification, Classification, and Visualization of Posttranslational Modification of Proteins Kazuhiro Hayashi (PM Naohisa Goto) - Ruby 1.9.2 support of BioRuby Sara Rayburn (PM Christian Zmasek) - Implementing Speciation & Duplication Inference Algorithm for Binary and Non-binary Species Tree Joao Pedro Garcia Lopes Maia Rodrigues (PM Eric Talevich) - Extending Bio.PDB: broadening the usefulness of BioPython's Structural Biology module Jun Yin (PM Chris Fields) - BioPerl Alignment Subsystem Refactoring Congratulations to our accepted students! All told, we had 52 applications submitted for the 6 slots (5 originally assigned, plus 1 extra) allotted to us by Google. Proposals were extremely competitive: 6 out of 52 translates to an 11.5% acceptance rate. We received a lot of really excellent proposals, the decisions were not easy. Thanks very much to all the students who applied, we very much appreciate your hard work. Here's to a great 2010 Summer of Code, I'm sure these students will do some wonderful work. 
Rob Buels
OBF GSoC 2010 Administrator

------------------------------

_______________________________________________
Biojava-l mailing list - Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l

End of Biojava-l Digest, Vol 87, Issue 26
*****************************************

From marcel.huntemann at gmail.com Thu Apr 29 20:49:10 2010
From: marcel.huntemann at gmail.com (Marcel Huntemann)
Date: Thu, 29 Apr 2010 17:49:10 -0700
Subject: [Biojava-l] Error during genbank parsing
Message-ID: <4BDA2906.20801@Gmail.com>

Hi!

I get the following error during the parsing of a genbank file:

Exception in thread "main" org.biojava.bio.BioException: Could not read sequence
    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
    at gov.doe.jgi.img.pangenomes.Controller.createGeneMap(Controller.java:303)
    at gov.doe.jgi.img.pangenomes.Controller.start(Controller.java:197)
    at gov.doe.jgi.img.pangenomes.Main.createAndStartController(Main.java:105)
    at gov.doe.jgi.img.pangenomes.Main.main(Main.java:35)
Caused by: org.biojava.bio.seq.io.ParseException: A Exception Has Occurred During Parsing. Please submit the details that follow to biojava-l at biojava.org or post a bug report to http://bugzilla.open-bio.org/
Format_object=org.biojavax.bio.seq.io.GenbankFormat
Accession=null
Id=null
Comments=Bad locus line
Parse_block=LOCUS NC_008711 4597686 bp DNA circular 17-DEC-2009
Stack trace follows ....
    at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:322)
    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
    ... 4 more

No matter which genbank file I use, I always get this error (of course with a different LOCUS line). The strange thing is that this used to work about 1/2 - 1 year ago. Now I wanted to use my program again and I always get this error, although I didn't really change anything in that code.
The only thing I can think of that's different, since the last time I used it (when it worked), is that I switched from a 32bit Linux to a 64bit Linux machine. But can that really cause it?

Here's my code and how I use it:

    for ( String taxonId : givenTaxonIds ) {
        gbkFile = new File( dirPath + taxonId + gbkSuffix );
        if ( ! gbkFile.exists() ) {
            logr.fatal( "Couldn't find genbank file for taxonOID " + taxonId + "!\nI tried " + gbkFile.getPath() + ", but it doesn't exist!" );
            System.exit( 0 );
        }
        BufferedReader br = new BufferedReader( new FileReader( gbkFile ) );
        Namespace ns = RichObjectFactory.getDefaultNamespace();
        RichSequenceIterator seqs = RichSequence.IOTools.readGenbankDNA( br, ns );
        numberInGenome = 0;
        while ( seqs.hasNext() ) {
            RichSequence contig = seqs.nextRichSequence();
            // Get genes and their positions
            Set<Feature> features = contig.getFeatureSet();
            positions = new ArrayList();
            geneIds = new ArrayList();
            for ( Feature richFeature : features ) {
                if ( richFeature.getType().equals( "CDS" ) ) {
                    RichLocation loc = (RichLocation) richFeature.getLocation();
                    position = new int[3];
                    position[0] = loc.getMin();
                    position[1] = loc.getMax();
                    position[2] = loc.getStrand().intValue();
                    Annotation a = richFeature.getAnnotation();
                    split = a.getProperty( "note" ).toString().split( "=" );
                    geneIds.add( split[1].trim() );
                    positions.add( position );
                } else if ( richFeature.getType().equals( "gene" ) ) {
                    Annotation a = richFeature.getAnnotation();
                    if ( a.containsProperty( "pseudo" ) ) {
                        RichLocation loc = (RichLocation) richFeature.getLocation();
                        position = new int[3];
                        position[0] = loc.getMin();
                        position[1] = loc.getMax();
                        position[2] = loc.getStrand().intValue();
                        split = a.getProperty( "note" ).toString().split( "=" );
                        geneIds.add( split[1].trim() );
                        positions.add( position );
                    }
                }
            }

Thanks 4 the help,
Marcel

P.S.: Also the info on some of the biojava pages seems outdated.
I got the latest version from your svn trunk, and the GetStarted page says that one just has to call ant to build it. But there's no build.xml in the biojava folder. Instead there's a pom.xml, so I guess you switched to maven. I bet a lot of people don't know how to deal with that and have no clue what to do when the ant command doesn't work...

From narciso at cnpaf.embrapa.br Fri Apr 30 17:32:02 2010
From: narciso at cnpaf.embrapa.br (Marcelo Goncalves Narciso (Pesquisador))
Date: Fri, 30 Apr 2010 19:32:02 -0200
Subject: [Biojava-l] problems with intallation of biojava in windows 7
In-Reply-To: <20100430184758.M13673@cnpaf.embrapa.br>
References: <20100430184758.M13673@cnpaf.embrapa.br>
Message-ID: <20100430212950.M75279@cnpaf.embrapa.br>

Hi, people,

I need your help. When I try to install biojava in windows 7, this happens:

> C:\Users\narciso\biojava>java -jar biojava-1.7.1-all.jar
> Failed to load Main-Class manifest attribute from
> biojava-1.7.1-all.jar

How can I fix it?

Thanks a lot
Marcelo

From heuermh at acm.org Thu Apr 1 03:56:42 2010
From: heuermh at acm.org (Michael Heuer)
Date: Wed, 31 Mar 2010 23:56:42 -0400 (EDT)
Subject: [Biojava-l] Reading and writting Fastq files
In-Reply-To: <20100330215047.084f6b00@wp01>
Message-ID:

xyz wrote:

> Thank you it works, but after I extended the code with
> RichSequence.IOTools.writeFasta(outputFasta, trimSeq, ns,
> fastq.getDescription());
> in order to get also a trimmed fasta file I got the following error:
>
> Fastq2Fasta.java:51: cannot find symbol
> symbol  : method writeFasta(java.io.FileOutputStream,java.lang.String,org.biojavax.SimpleNamespace,java.lang.String)
> location: class org.biojavax.bio.seq.RichSequence.IOTools
> RichSequence.IOTools.writeFasta(outputFasta, trimSeq, ns,
> fastq.getDescription());
> 1 error

The fastq package has not yet been integrated with biojava core or the biojavax packages.
If you would like to use RichSequence.IOTools, you would need to create a RichSequence from each Fastq object before writing. Something like

    import static ...RichSequence.Tools.*;
    import static ...RichSequence.IOTools.*;

    Fastq fastq = ...;
    Namespace namespace = ...;
    RichSequence richSequence = createRichSequence(
        namespace, fastq.getDescription(), fastq.getSequence(), DNATools.getDNA());
    writeFasta(outputStream, richSequence, namespace);

may work.

> Suggestions:
> 1)
> After I trimmed the fastq files the header information for quality
> is empty
>
> @HWI-EAS406:5:1:0:1390#0/1
> GGGTGATGGCCGCTGCCGATGGCGTCAAAA
> +
> OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
>
> this reduced the size of the files but is it compatible with
> SOAP and TopHat?

Sorry, not sure what you are asking here.

> 2)
> I was using fastq files up to 6 GBytes and I have not run any benchmarks
> with different Buffer/stream combination on big text files and therefore
> I am not sure that is enough to use just FileInputStream or
> FileOutputStream. BioJavaX is using BufferedReader br = new
> BufferedReader(new FileReader()) are there any speed difference?

AbstractFastqReader.read(InputStream) uses a BufferedReader, and all the other read methods pass through that one.

michael

From huijieqiao at gmail.com Fri Apr 2 03:02:37 2010
From: huijieqiao at gmail.com (Huijie Qiao)
Date: Fri, 2 Apr 2010 11:02:37 +0800
Subject: [Biojava-l] A bug in Class "org.biojavax.bio.seq.io.GenbankFormat"
Message-ID:

version 1.7.1

line 361
else if (sectionKey.equals(SOURCE_TAG)) {
      // ignore - can get all this from the first feature

actually the content in the SOURCE_TAG and the first feature are different in some gb file.
For example, the example file in http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchseq_help.html http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=5&rettype=gb The Source TAG is SOURCE Bos taurus (cattle) ORGANISM Bos taurus Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Laurasiatheria; Cetartiodactyla; Ruminantia; Pecora; Bovidae; Bovinae; Bos. and the first feature tag is FEATURES Location/Qualifiers source 1..1136 /organism="Bos taurus" /mol_type="mRNA" /db_xref="taxon:9913" /clone="pBB2I" /tissue_type="liver" I can't get the hierarchy info through the follow codes. NCBITaxon taxon = seq.getTaxon(); System.out.println(taxon.getNameHierarchy()); output is "." From holland at eaglegenomics.com Fri Apr 2 07:38:44 2010 From: holland at eaglegenomics.com (Richard Holland) Date: Fri, 2 Apr 2010 08:38:44 +0100 Subject: [Biojava-l] A bug in Class "org.biojavax.bio.seq.io.GenbankFormat" In-Reply-To: References: Message-ID: <8319D21D-9548-438A-BCB4-0CB9C5B7F568@eaglegenomics.com> The parsers don't load the hiearachy from Genbank because it is redundant information separately available from NCBI taxonomy. Also it tends to be buggy and can differ between Genbank files for the same organism. If you want the hierarchy. you need to be using BioJava in conjunction with BioSQL and load the NCBI taxonomy into your BioSQL instance ( http://www.biojava.org/wiki/BioJava:BioJavaXDocs#NCBI_Taxonomy_data ), from where BioJava can then retrieve it using the sample code you show in your email. thanks, Richard On 2 Apr 2010, at 04:02, Huijie Qiao wrote: > version 1.7.1 > > line 361 > else if (sectionKey.equals(SOURCE_TAG)) { > // ignore - can get all this from the first feature > > actually the content in the SOURCE_TAG and the first feature are different > in some gb file. 
> > For example, the example file in > http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchseq_help.html > http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=5&rettype=gb > > The Source TAG is > SOURCE Bos taurus (cattle) > ORGANISM Bos taurus > Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; > Euteleostomi; > Mammalia; Eutheria; Laurasiatheria; Cetartiodactyla; Ruminantia; > Pecora; Bovidae; Bovinae; Bos. > > and the first feature tag is > FEATURES Location/Qualifiers > source 1..1136 > /organism="Bos taurus" > /mol_type="mRNA" > /db_xref="taxon:9913" > /clone="pBB2I" > /tissue_type="liver" > > I can't get the hierarchy info through the follow codes. > NCBITaxon taxon = seq.getTaxon(); > System.out.println(taxon.getNameHierarchy()); output is "." > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l -- Richard Holland, BSc MBCS Operations and Delivery Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From martin.jones at ed.ac.uk Fri Apr 2 11:23:21 2010 From: martin.jones at ed.ac.uk (Martin Jones) Date: Fri, 2 Apr 2010 12:23:21 +0100 Subject: [Biojava-l] A bug in Class "org.biojavax.bio.seq.io.GenbankFormat" In-Reply-To: <8319D21D-9548-438A-BCB4-0CB9C5B7F568@eaglegenomics.com> References: <8319D21D-9548-438A-BCB4-0CB9C5B7F568@eaglegenomics.com> Message-ID: You can also get the hierarchy directly from the NCBI taxonomy dump... 
this is in Groovy but gives you the idea:

    HashMap taxid2node = [:]
    HashMap child2parent = [:]
    def nodePattern = ~/^(\d+)\t\|\t(\d+)\t\|\t(.+?)\t\|/
    def count=0
    new File("/home/martin/nodes.dmp").eachLine{ line ->
        count++
        def matcher = (line =~ nodePattern)
        // lookingAt() rather than matches(): the pattern only covers the
        // first three fields of a nodes.dmp line, not the whole line
        if (matcher.lookingAt()){
            Integer myId = matcher[0][1].toInteger()
            Integer parentId = matcher[0][2].toInteger()
            String myRank = matcher[0][3]
            def node = new TreeNode(taxid : myId, rank:myRank)
            taxid2node[(myId)] = node
            child2parent[(myId)] = parentId
        }
    }
    // do something with the hash

-Martin

On 2 April 2010 08:38, Richard Holland wrote:
> The parsers don't load the hiearachy from Genbank because it is redundant information separately available from NCBI taxonomy. Also it tends to be buggy and can differ between Genbank files for the same organism.
>
> If you want the hierarchy. you need to be using BioJava in conjunction with BioSQL and load the NCBI taxonomy into your BioSQL instance ( http://www.biojava.org/wiki/BioJava:BioJavaXDocs#NCBI_Taxonomy_data ), from where BioJava can then retrieve it using the sample code you show in your email.
>
> thanks,
> Richard
>
> On 2 Apr 2010, at 04:02, Huijie Qiao wrote:
>
>> version 1.7.1
>>
>> line 361
>> else if (sectionKey.equals(SOURCE_TAG)) {
>>       // ignore - can get all this from the first feature
>>
>> actually the content in the SOURCE_TAG and the first feature are different
>> in some gb file.
>>
>> For example, the example file in
>> http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchseq_help.html
>> http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=5&rettype=gb
>>
>> The Source TAG is
>> SOURCE      Bos taurus (cattle)
>>   ORGANISM  Bos taurus
>>             Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;
>> Euteleostomi;
>>             Mammalia; Eutheria; Laurasiatheria; Cetartiodactyla; Ruminantia;
>>             Pecora; Bovidae; Bovinae; Bos.
>>
>> and the first feature tag is
>> FEATURES             Location/Qualifiers
>>
source ? ? ? ? ?1..1136 >> ? ? ? ? ? ? ? ? ? ? /organism="Bos taurus" >> ? ? ? ? ? ? ? ? ? ? /mol_type="mRNA" >> ? ? ? ? ? ? ? ? ? ? /db_xref="taxon:9913" >> ? ? ? ? ? ? ? ? ? ? /clone="pBB2I" >> ? ? ? ? ? ? ? ? ? ? /tissue_type="liver" >> >> I can't get the hierarchy info through the follow codes. >> NCBITaxon taxon = seq.getTaxon(); >> System.out.println(taxon.getNameHierarchy()); output is "." >> _______________________________________________ >> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l > > -- > Richard Holland, BSc MBCS > Operations and Delivery Director, Eagle Genomics Ltd > T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com > http://www.eaglegenomics.com/ > > > _______________________________________________ > Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > > From andreas.prlic at gmail.com Sat Apr 3 15:08:57 2010 From: andreas.prlic at gmail.com (Andreas Prlic) Date: Sat, 3 Apr 2010 08:08:57 -0700 Subject: [Biojava-l] Anonymous svn down Message-ID: <2AF9846C-41A0-4CEB-9589-6DB698B2107A@gmail.com> Hi, the anonymous svn server seems to be down again. I have already contacted support @ obf, but not recieved back a response, when it should be back up. In the meanwhile, is anybody volunteering to set up a failback mirror at github? Andreas From rmb32 at cornell.edu Sat Apr 3 20:09:27 2010 From: rmb32 at cornell.edu (Robert Buels) Date: Sat, 03 Apr 2010 13:09:27 -0700 Subject: [Biojava-l] Google Summer of Code is *ON* for OBF projects! Message-ID: <4BB7A077.4070802@cornell.edu> Hi all, Reminder: GSoC student proposals must be submitted to Google by April 9th, 19:00 UTC. That's less than a week away. Students: you should ALREADY be working with mentors on the project mailing lists, they can help you get your proposal into shape. So far, we have 5 proposals submitted to our org in Google's web app. 
Keep them coming, and let's see some really good ones! Rob Buels OBF GSoC 2010 Administrator From jianjiong.gao at gmail.com Sun Apr 4 06:33:15 2010 From: jianjiong.gao at gmail.com (Jianjiong Gao) Date: Sun, 4 Apr 2010 01:33:15 -0500 Subject: [Biojava-l] GSoC project question Message-ID: Hello, My name is Jianjiong Gao, a graduate student in Computer Science Department at University of Missouri-Columbia. I am very interested in applying for your GSoC project "Identification and Classification of Posttranslational Modification of Proteins". This project is highly related to my dissertation topic "Bioinformatic analysis and prediction of phosphorylation and other PTMs." Although I have not touched the structural part of PTM till now, I am really interested in learning and expanding my research on this field. After reading the project description on the idea page (http://biojava.org/wiki/Google_Summer_of_Code), I have several questions regarding the *approach* section: > 1. Establish a list of known PTMs and write code to locate these PTMs in a 3D protein structure. Q1: There are many different types of PTMs. Do you have list of PTMs of interest? Do you have priorities on different PTMs? Q2: Is there any available algorithm to locate the PTMs in a 3D protein structure? What is the difficulty on this task? Q3: The PDB file contains annotations of residue modifications such as HETATM AND MODRES. Can we utilized this information for localizing the PTMs? > 2. Determine the protein residues that carry PTMs based on distance thresholds. > 3. Traverse the sugar molecules and establish their link pattern based on connectivity. Q4: Is this task to determine the types of glycosylation, i.e., N-linked glycosylation, O-N-acetylgalactosamine, O-glucose, etc? Q5: Is there any available algorithm to do this? What is the difficulty in this task? It looks complicated with so many different types of glycosylation and structure isomers. > 4. 
Present the PTMs as text in a linear notation and 2D graphical representations if time permits. Q6: Can we used the SMILES format (http://en.wikipedia.org/wiki/Simplified_molecular_input_line_entry_specification) here? Or do we have any other better options? Thanks very much for your time. I am looking forward to hearing from you. Best Regards, -JJ From rmb32 at cornell.edu Sun Apr 4 04:37:38 2010 From: rmb32 at cornell.edu (Robert Buels) Date: Sat, 03 Apr 2010 21:37:38 -0700 Subject: [Biojava-l] Reminder: GSoC student applications due April 9, 19:00 UTC Message-ID: <4BB81792.8060001@cornell.edu> Hi all, Sending this again with a different subject line, just in case. GSoC student proposals must be submitted to Google through their web application by *April 9th, 19:00 UTC*. That's less than a week away. Students: you should ALREADY be working with mentors on the project mailing lists, they can help you get your proposal into shape. So far, we have 6 proposals submitted to our org in Google's web app. Keep them coming, and keep them good! 
Rob Buels
OBF GSoC 2010 Administrator

From nagendravns at gmail.com Sun Apr 4 16:12:11 2010
From: nagendravns at gmail.com (nagendra kumar)
Date: Sun, 4 Apr 2010 21:42:11 +0530
Subject: [Biojava-l] how to add api
Message-ID:

sir i want bio java develop one project please give me detail how bio java api install in system

From chapman at cs.wisc.edu Sun Apr 4 17:54:59 2010
From: chapman at cs.wisc.edu (Mark Chapman)
Date: Sun, 04 Apr 2010 12:54:59 -0500
Subject: [Biojava-l] how to add api
In-Reply-To:
References:
Message-ID: <4BB8D273.7080601@cs.wisc.edu>

Everything you need is at: http://biojava.org/wiki/BioJava:Download

On 4/4/2010 11:12 AM, nagendra kumar wrote:
> sir i want bio java develop one project please give me detail how bio java
> api install in system
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

From anantpossible at gmail.com Sun Apr 4 17:58:15 2010
From: anantpossible at gmail.com (Anant Jain)
Date: Sun, 4 Apr 2010 23:28:15 +0530
Subject: [Biojava-l] how to add api
In-Reply-To:
References:
Message-ID:

On 4/4/10, nagendra kumar wrote:
>
> sir i want bio java develop one project please give me detail how bio java
> api install in system
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

Hi,

To use the BioJava API, all you need to do is download the BioJava jar and perform the following steps:

1. Extract the jar; you will get some more jars and files.
2. Copy these jars to the following location: "C:\Program Files\Java\jre6\lib\ext", if Java is installed on your C drive.
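
A quick way to confirm that the jars are actually visible to Java is to try loading one BioJava class by name. The following is only a sketch under assumptions: `ClasspathCheck` is a made-up helper name, and `org.biojava.bio.seq.DNATools` is assumed to be present in the 1.7.x jars; pass any other class name as the first argument if your version differs.

```java
// Sketch only: ClasspathCheck is a hypothetical helper, not part of BioJava.
final class ClasspathCheck {

    /** Returns true if the named class can be located by this class's class loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize = false: only locate the class, do not run its static initialisers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // not found, or found but not loadable (e.g. a dependency is missing)
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed probe class; override it with the first command-line argument.
        String probe = args.length > 0 ? args[0] : "org.biojava.bio.seq.DNATools";
        String verdict = isOnClasspath(probe) ? " found" : " NOT found on the classpath";
        System.out.println(probe + verdict);
    }
}
```

Compile it with `javac ClasspathCheck.java` and run it with the BioJava jars on the classpath (for example via `java -cp`), not with `java -jar`: a library jar has no Main-Class entry in its manifest, so `java -jar` has nothing to start.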
--
Anant Jain
B.Tech Bioinformatics, RHCE

From sacomoto at gmail.com Tue Apr 6 05:29:23 2010
From: sacomoto at gmail.com (Gustavo Akio Tominaga Sacomoto)
Date: Tue, 6 Apr 2010 02:29:23 -0300
Subject: [Biojava-l] GSoC project on MSA
Message-ID:

Hello,

I'm currently a graduate student at University of São Paulo (Brazil) and I'm quite interested in applying for the all-Java MSA project. I'm already familiar with the multiple sequence alignment problem, I developed a lossless filter for this problem as my undergraduate final project, the work is described here [http://www.almob.org/content/4/1/3] and there is an online version of the algorithm here [http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu].

Now, regarding the project, just to make it clear, when you say in the "straightforward approach for building up the MSA progressively", you mean the standard dynamic programming approach for pairwise alignment following the guide tree built in the second step, right?

One last question, should I send my proposal direct to the Google's web app or here first?

Thanks,

Gustavo Sacomoto

From andreas at sdsc.edu Tue Apr 6 17:46:16 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 6 Apr 2010 10:46:16 -0700
Subject: [Biojava-l] GSoC project on MSA
In-Reply-To:
References:
Message-ID:

Hi Gustavo,

With straightforward I meant that we only have 3 months for this project and we should not try to solve all problems at the same time. Probably a realistic approach is to start with trying to keep things modular and simple (think interfaces and implementations) and stick to standard solutions that have been shown to work elsewhere. If there is more time in the project one can then replace some of the implementations with technically more advanced ones.

Since we are doing things in Java I am interested in having support for parallelisation wherever possible. Another issue is how to verify that the created alignments are meaningful. One could e.g.
use the biojava structure modules to calculate protein structure alignments to verify the quality of the obtained multiple sequence alignments. All applications have to be made via Google. We are providing comments on drafts of proposals and try to work together with applicants to improve the submissions. Note: The application deadline is soon and speed is important now. Andreas On Mon, Apr 5, 2010 at 10:29 PM, Gustavo Akio Tominaga Sacomoto < sacomoto at gmail.com> wrote: > Hello, > > I'm currently a graduate student at University of S?o Paulo (Brazil) > and I'm quite interested in applying for the all-Java MSA project. I'm > already familiar with the multiple sequence alignment problem, I > developed a lossless filter for this problem as my undergraduate final > project, the work is described here > [http://www.almob.org/content/4/1/3] and there is an online version of > the algorithm here > [http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu]. > > Now, regarding the project, just to make it clear, when you say in the > "straightforward approach for building up the MSA progressively", you > mean the standard dynamic programming approach for pairwise alignment > following the guide tree built in the second step, right? > > One last question, should I send my proposal direct to the Google's > web app or here first? > > Thanks, > > Gustavo Sacomoto > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > -- ----------------------------------------------------------------------- Dr. 
Andreas Prlic Senior Scientist, RCSB PDB Protein Data Bank University of California, San Diego (+1) 858.246.0526 ----------------------------------------------------------------------- From sacomoto at gmail.com Tue Apr 6 18:53:04 2010 From: sacomoto at gmail.com (Gustavo Akio Tominaga Sacomoto) Date: Tue, 6 Apr 2010 15:53:04 -0300 Subject: [Biojava-l] GSoC project on MSA In-Reply-To: References: Message-ID: Hello Andreas, On Tue, Apr 6, 2010 at 2:46 PM, Andreas Prlic wrote: > Hi Gustavo, > > With straightforward I meant that we only have 3 months for this project and > we should not try to solve all problems at the same time. Probably a > realistic approach is to start with trying to keep things modular and simple > (think interfaces and implementations) and stick to standard solutions that > have been shown to work elsewhere. If there is more time in the project one > can then replace some of the implementations with technically more advanced > ones. I think my question wasn't very clear, my intention in this project is to follow the approach (with the tree steps) outlined in the project's page. Using the classical progressive alignment heuristic: build the distance matrix, build the guide tree and using this tree progressively align more sequences together. What I propose for the third step is a first implementation using the (more simple) dynamic programming described in the first CLUSTAL paper (I thinks it's from 1988) and incrementally improving the algorithm to get closer to the one described in CLUSTALW paper (from 1994). Is this more or less what you had in mind? > Since we are doing things in Java I am interested in having support for > parallelisation wherever possible. Another issue is how to verify that the > created alignments are meaningful. One could e.g. use the biojava structure > modules to calculate protein structure alignments to verify the quality of > the obtained multiple sequence alignments. 
About parallel strategies, I think a relative easy way we could use it is in the distance matrix construction, we could have several threads calculating the pairwise alignment for different pairs of sequence in the set. Now, the alignment quality measures is a tougher issue. The CLUSTALW paper doesn't give any way to measure the quality of the result, they consider a good alignment the one that is hard to improve by eye (But they claim that for sequences sufficient similar, no pair less than 35% identical, the results are good). Can I do the same as in CLUSTALW paper and leave the quality measure to the user? How concerned should I be with that in this project? > All applications have to be made via Google. We are providing comments? on > drafts of proposals and try to work together with applicants to improve the > submissions. Note: The application deadline is soon and speed is important > now. I will try send to this mailing list a proposal draft until tomorrow to have some feedback from you. > Andreas > > > > On Mon, Apr 5, 2010 at 10:29 PM, Gustavo Akio Tominaga Sacomoto > wrote: >> >> Hello, >> >> I'm currently a graduate student at University of S?o Paulo (Brazil) >> and I'm quite interested in applying for the all-Java MSA project. I'm >> already familiar with the multiple sequence alignment problem, I >> developed a lossless filter for this problem as my undergraduate final >> project, the work is described here >> [http://www.almob.org/content/4/1/3] and there is an online version of >> the algorithm here >> [http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu]. >> >> Now, regarding the project, just to make it clear, when you say in the >> "straightforward approach for building up the MSA progressively", you >> mean the standard dynamic programming approach for pairwise alignment >> following the guide tree built in the second step, right? >> >> One last question, should I send my proposal direct to the Google's >> web app or here first? 
>> >> Thanks, >> >> Gustavo Sacomoto >> >> _______________________________________________ >> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l > > > > -- > ----------------------------------------------------------------------- > Dr. Andreas Prlic > Senior Scientist, RCSB PDB Protein Data Bank > University of California, San Diego > (+1) 858.246.0526 > ----------------------------------------------------------------------- > Thanks for your help. gustavo From andreas at sdsc.edu Tue Apr 6 21:27:15 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Tue, 6 Apr 2010 14:27:15 -0700 Subject: [Biojava-l] GSoC project on MSA In-Reply-To: References: Message-ID: Hi Gustavo, In principle I agree to all, see details below: I think my question wasn't very clear, my intention in this project is > to follow the approach (with the tree steps) outlined in the project's > page. Using the classical progressive alignment heuristic: build the > distance matrix, build the guide tree and using this tree > progressively align more sequences together. > yes > > What I propose for the third step is a first implementation using the > (more simple) dynamic programming described in the first CLUSTAL paper > (I thinks it's from 1988) and incrementally improving the algorithm to > get closer to the one described in CLUSTALW paper (from 1994). Is this > more or less what you had in mind? > yes, sounds good. > > About parallel strategies, I think a relative easy way we could use it > is in the distance matrix construction, we could have several threads > calculating the pairwise alignment for different pairs of sequence in > the set. > Correct. Probably a first implementation would be for a single machine/ multi CPU. More advanced implementations could provide support e.g. for Map/Reduce, JPPF, or something like that... Now, the alignment quality measures is a tougher issue. 
The CLUSTALW > paper doesn't give any way to measure the quality of the result, they > consider a good alignment the one that is hard to improve by eye (But > they claim that for sequences sufficient similar, no pair less than > 35% identical, the results are good). Can I do the same as in CLUSTALW > paper and leave the quality measure to the user? How concerned should > I be with that in this project? > Getting an overall core-algorithm that works should be priority. The benchmarking part is not mandatory, but something to keep in mind... I have plenty of material for that, once we get to that stage... I will try send to this mailing list a proposal draft until tomorrow > to have some feedback from you. > Excellent, looking forward to it. Andreas -- ----------------------------------------------------------------------- Dr. Andreas Prlic Senior Scientist, RCSB PDB Protein Data Bank University of California, San Diego (+1) 858.246.0526 ----------------------------------------------------------------------- From sacomoto at gmail.com Wed Apr 7 05:29:31 2010 From: sacomoto at gmail.com (Gustavo Akio Tominaga Sacomoto) Date: Wed, 7 Apr 2010 02:29:31 -0300 Subject: [Biojava-l] GSoC project on MSA In-Reply-To: References: Message-ID: Hi Andreas, My proposal is pasted at the end of this e-mail. I'm waiting for your feedback. Thanks, gustavo ------------------------------------------------------------- GSoC proposal Abstract -------- This project aims to develop an all-Java implementation of a multiple sequence alignment (MSA) algorithm to be added to the Biojava toolkit, using the progressive algorithm described in the CLUSTALW paper [1]. The Importance -------------- Multiple sequence alignment is a frequently performed task in sequence analysis with the goal to identify new members of protein families and infer phylogenetic relationships between proteins and genes. At the present there is no Java-only implementation for this algorithm. 
As such, the many existing Java-based bioinformatics tools and web sites would benefit from this implementation, and end users could perform sequence analysis more easily.

About Me
--------

I am a graduate student at the University of São Paulo (Brazil). I got my undergraduate degree from the same university, with a major in Computer Science and a minor in Biology. I have been involved with bioinformatics for 5 years, always in sequence analysis and with a particular interest in the MSA problem. In my undergraduate final project I developed a lossless filter (pruning algorithm) for the MSA problem; the work is published in [3] and there is an online implementation of the algorithm in [4]. Finally, I have experience with the C, C++, Java, Python and Ruby programming languages, and with the Git and SVN version control systems.

Project Plan
------------

The project is divided into four main steps. At the end of each step a completely functional and bug-free new algorithm will be added to the Biojava code base. Each step depends strongly on the previous one, so careful testing will be done before moving on to the next step.

The four steps are described below, each with an estimated time. Some steps also list extra enhancements, which will be implemented if time remains after all steps are completed.

** 1. Study the Biojava pairwise alignment code and update it to be compliant with Biojava 3.

The pairwise alignment will play an important role in the MSA algorithm. This step is also important for me to get used to the Biojava coding standards and get in touch with the Biojava dev community.

ETA: 2 weeks.

** 2. Implement the algorithm to build the distance matrix.

This is done using the pairwise alignment for each pair of sequences in the set to be aligned.

ETA: 1 week.
EXTRA: Enhance the basic algorithm to use parallel strategies, use several threads to calculate the pairwise alignment for different pairs in the sequence set. ** 3. Implement the algorithm to build the guide tree. The guide tree is based on the distance matrix built in the last step, the tree construction strategy adopted will be the Neighbor Joining Algorithm. ETA: 2 weeks. ** 4. Implement the algorithm for progressive MSA using the guide tree. This is certainly the most difficult part of the project, so to make sure we are going to deliver a fully functional MSA algorithm, a safer approach is going to be taken. In the first place, a dynamic programming algorithm described in [2] will be implemented. Once this get successfully done and the code fully integrated to the Biojava code base, the features described in [1] are going to be incrementally added (and tested) in order to implement the full dynamic programming algorithm. ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks. EXTRA: Implement some benchmark technique to measure the final alignment quality. References ---------- [1] http://www.ncbi.nlm.nih.gov/pubmed/7984417 [2] http://www.ncbi.nlm.nih.gov/pubmed/3243435 [3] http://www.almob.org/content/4/1/3 [4] http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu On Tue, Apr 6, 2010 at 6:27 PM, Andreas Prlic wrote: > Hi Gustavo, > > In principle I agree to all, see details below: > > > I think my question wasn't very clear, my intention in this project is >> >> to follow the approach (with the tree steps) outlined in the project's >> page. Using the classical progressive alignment heuristic: build the >> distance matrix, build the guide tree and using this tree >> progressively align more sequences together. 
> > yes > >> >> What I propose for the third step is a first implementation using the >> (more simple) dynamic programming described in the first CLUSTAL paper >> (I thinks it's from 1988) and incrementally improving the algorithm to >> get closer to the one described in CLUSTALW paper (from 1994). Is this >> more or less what you had in mind? > > yes, sounds good. > >> >> About parallel strategies, I think a relative easy way we could use it >> is in the distance matrix construction, we could have several threads >> calculating the pairwise alignment for different pairs of sequence in >> the set. > > Correct. Probably a first implementation would be for a single machine/ > multi CPU. More advanced implementations could provide support e.g. for > Map/Reduce, JPPF, or something like that... > >> Now, the alignment quality measures is a tougher issue. The CLUSTALW >> paper doesn't give any way to measure the quality of the result, they >> consider a good alignment the one that is hard to improve by eye (But >> they claim that for sequences sufficient similar, no pair less than >> 35% identical, the results are good). Can I do the same as in CLUSTALW >> paper and leave the quality measure to the user? How concerned should >> I be with that in this project? > > Getting an overall core-algorithm that works should be priority. The > benchmarking part is not mandatory, but something to keep in mind... I have > plenty of material for that, once we get to that stage... > >> I will try send to this mailing list a proposal draft until tomorrow >> to have some feedback from you. > > Excellent, looking forward to it. > > Andreas > > -- > ----------------------------------------------------------------------- > Dr. 
Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------
>

From sma.hmc at gmail.com Wed Apr 7 07:52:34 2010
From: sma.hmc at gmail.com (Singer Ma)
Date: Wed, 7 Apr 2010 00:52:34 -0700
Subject: [Biojava-l] Questions about Summer of Code Project
Message-ID:

I had previously sent this, but was not part of the mailing list, so I can only assume it got lost in a spam loop.

I was interested in applying for the All-Java Multiple Sequence Alignment Google Summer of Code project. I wanted to create a project plan but had some questions about the package as it stands now.

1. What exactly has changed with the transition to BioJava 3? From what I've read on the BioJava 3 proposal page, it seems that the changes are to the organization of the code. Additionally, there are some new standards to follow. Java 6 usage is desired, but I am unsure which of the new features could be used in modifying pairwise sequence alignments.

2. Is the Neighbor Joining Algorithm really the best for this? Are other multiple alignment implementations desired? I have implemented the neighbor joining algorithm very inefficiently in Python; it was not particularly difficult. This step seems like it will not take very long. Additionally, regarding parallelism: I have no experience with it in Java and will only have some experience with it in C. Will that be an issue?

3. Is there a specific paper with the exact algorithm that should be implemented here?

General: Will use cases be provided? Will test data be provided? These would both be useful in coding the test cases, which seem to be coded first.

Additionally, I have access to my current Windows machine as well as a Linux machine for testing, but no Mac.
In theory, Java code that works on one platform works on another, and code that works on Linux in particular should be fine on Mac. Should I still be worried about strange peculiarities?

Thanks,
Singer Ma
Harvey Mudd College 2011

From ayates at ebi.ac.uk Wed Apr 7 11:27:27 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 7 Apr 2010 12:27:27 +0100
Subject: [Biojava-l] Anonymous svn down
In-Reply-To: <2AF9846C-41A0-4CEB-9589-6DB698B2107A@gmail.com>
References: <2AF9846C-41A0-4CEB-9589-6DB698B2107A@gmail.com>
Message-ID: <36985994-D7D3-4228-90C7-0C594DA0A706@ebi.ac.uk>

By the looks of things this is quite a simple process to do:

http://github.com/guides/import-from-subversion
http://blog.woobling.org/2009/06/git-svn-abandon.html
http://blog.johngoulah.com/2009/11/migrating-svn-to-git/

The difficult part seems to be providing an SVN -> GitHub user mapping. Apart from that, it's a question of how much space the import will take up.

Andy

On 3 Apr 2010, at 16:08, Andreas Prlic wrote:

> Hi,
>
> the anonymous svn server seems to be down again. I have already contacted support @ obf, but not received back a response, when it should be back up. In the meanwhile, is anybody volunteering to set up a failback mirror at github?
>
> Andreas
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

--
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/

From Stefan.Bleckmann at uni-duesseldorf.de Wed Apr 7 12:08:45 2010
From: Stefan.Bleckmann at uni-duesseldorf.de (Stefan Bleckmann)
Date: Wed, 07 Apr 2010 14:08:45 +0200
Subject: [Biojava-l] SubstitutionMatrix
Message-ID: <0DAB7300-372D-43D1-BFC6-2D8DB1C7CCC1@uni-duesseldorf.de>

Hi all!
I have a problem reading the NUC4.2 and 4.4 matrix files with the SubstitutionMatrix class included in BioJava 1.7.1.

A small example:

    File d = new File("/Users/-----/Desktop/NUC");
    FiniteAlphabet alphabet = (FiniteAlphabet) AlphabetManager.alphabetForName("DNA");
    try {
        @SuppressWarnings("unused")
        final SubstitutionMatrix matrix = new SubstitutionMatrix(alphabet, d);
    } catch (NumberFormatException e) {
        e.printStackTrace();
    } catch (NoSuchElementException e) {
        e.printStackTrace();
    } catch (BioException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }

Thrown exception:

    Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 0
        at java.lang.String.charAt(String.java:686)
        at org.biojava.bio.alignment.SubstitutionMatrix.parseMatrix(SubstitutionMatrix.java:304)
        at org.biojava.bio.alignment.SubstitutionMatrix.<init>(SubstitutionMatrix.java:100)
        at MatrixTest.main(MatrixTest.java:30)

All BLOSUM matrix files I have downloaded work, so I don't think the problem is a wrong encoding or something similar.

Does anybody have an idea?

Cheers
Stefan

From andreas.draeger at uni-tuebingen.de Wed Apr 7 13:32:23 2010
From: andreas.draeger at uni-tuebingen.de (Andreas Dräger)
Date: Wed, 07 Apr 2010 15:32:23 +0200
Subject: [Biojava-l] SubstitutionMatrix
In-Reply-To: <0DAB7300-372D-43D1-BFC6-2D8DB1C7CCC1@uni-duesseldorf.de>
References: <0DAB7300-372D-43D1-BFC6-2D8DB1C7CCC1@uni-duesseldorf.de>
Message-ID: <20100407153223.20121fzzwyubkr53@webmail.uni-tuebingen.de>

Hi Stefan,

Thank you for this hint. I don't know what the problem is. I tested it recently and it worked. I'll have a look at it tomorrow and come back to you with an answer pretty soon!

Cheers
Andreas

Dipl.-Bioinform.
Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 Tübingen
Germany

Phone: +49-7071-29-70436
Fax: +49-7071-29-5091

From holland at eaglegenomics.com Wed Apr 7 13:48:21 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 7 Apr 2010 14:48:21 +0100
Subject: [Biojava-l] SubstitutionMatrix
In-Reply-To: <20100407153223.20121fzzwyubkr53@webmail.uni-tuebingen.de>
References: <0DAB7300-372D-43D1-BFC6-2D8DB1C7CCC1@uni-duesseldorf.de> <20100407153223.20121fzzwyubkr53@webmail.uni-tuebingen.de>
Message-ID: <20ACD602-7575-46DB-AFD7-348AEB37CF68@eaglegenomics.com>

I've found the problem already - the SubstitutionMatrix class has a few inconsistencies in the use of trimmed and untrimmed versions of lines. The guessAlphabet() method in this case is falling over because of an unchecked blank line in the matrix file.

I've submitted a patch to trunk which fixes all the inconsistencies and should also fix this problem with the NUC files.

On 7 Apr 2010, at 14:32, Andreas Dräger wrote:

> Hi Stefan,
>
> Thank you for this hint. I don't know what the problem is. Recently, I tested it and it worked. I'll have a look on it tomorrow and come back to you with an answer pretty soon!
>
> Cheers
> Andreas
>
> Dipl.-Bioinform.
Andreas Dräger
> Eberhard Karls University Tübingen
> Center for Bioinformatics (ZBIT)
> Sand 1
> 72076 Tübingen
> Germany
>
> Phone: +49-7071-29-70436
> Fax: +49-7071-29-5091
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From Stefan.Bleckmann at uni-duesseldorf.de Wed Apr 7 14:01:04 2010
From: Stefan.Bleckmann at uni-duesseldorf.de (Stefan Bleckmann)
Date: Wed, 07 Apr 2010 16:01:04 +0200
Subject: [Biojava-l] SubstitutionMatrix
Message-ID: <512EA47A-6F40-4A38-B69D-5990D273C9DD@uni-duesseldorf.de>

Hi Richard,

Thanks for your fast reply. I found the same solution. Two additional line breaks in the file were the problem, which I didn't see in the editor I used to check the file.

Cheers
Stefan

From andreas.prlic at gmail.com Wed Apr 7 15:13:04 2010
From: andreas.prlic at gmail.com (Andreas Prlic)
Date: Wed, 7 Apr 2010 08:13:04 -0700
Subject: [Biojava-l] Anonymous svn down
In-Reply-To: <36985994-D7D3-4228-90C7-0C594DA0A706@ebi.ac.uk>
References: <2AF9846C-41A0-4CEB-9589-6DB698B2107A@gmail.com> <36985994-D7D3-4228-90C7-0C594DA0A706@ebi.ac.uk>
Message-ID:

Hi Andy,

In the meantime Kyle Ellrott has already set up a first GitHub clone...

http://github.com/biojava/biojava

We are just monitoring it a bit to make sure it works properly...

Is the user mapping important? We have some 50+ users, so that might be painful...

Andreas

On Wed, Apr 7, 2010 at 4:27 AM, Andy Yates wrote:
> By the looks of things this is quite a simple process to do:
>
> http://github.com/guides/import-from-subversion
>
> http://blog.woobling.org/2009/06/git-svn-abandon.html
>
> http://blog.johngoulah.com/2009/11/migrating-svn-to-git/
>
> The difficult things seem to be providing a SVN -> GitHub user mapping.
Apart from that it's a question of how much space will the import take up
>
> Andy
>
> On 3 Apr 2010, at 16:08, Andreas Prlic wrote:
>
>> Hi,
>>
>> the anonymous svn server seems to be down again. I have already contacted support @ obf, but not received back a response, when it should be back up. In the meanwhile, is anybody volunteering to set up a failback mirror at github?
>>
>> Andreas
>> _______________________________________________
>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>

--
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From ayates at ebi.ac.uk Wed Apr 7 15:17:27 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 7 Apr 2010 16:17:27 +0100
Subject: [Biojava-l] Anonymous svn down
In-Reply-To:
References: <2AF9846C-41A0-4CEB-9589-6DB698B2107A@gmail.com> <36985994-D7D3-4228-90C7-0C594DA0A706@ebi.ac.uk>
Message-ID: <647FD3F8-5222-487C-872F-DF00B693C809@ebi.ac.uk>

Hey Andreas,

The user mapping file only matters if we want a coherent link between our SVN users and those who have a GitHub account. For example, any commit of mine appears as ayates; however, it would probably be more useful to link to my GitHub user, since that would have more information about what I'm doing with the repo, e.g. writing some snazzy new BJ3 code :).

Andy

On 7 Apr 2010, at 16:13, Andreas Prlic wrote:

> Hi Andy,
>
> In the meanwhile Kyle Ellrott already has set up a first github clone...
> > http://github.com/biojava/biojava > > We are just monitoring it a bit to make sure it works properly... > > Is the usermapping important? We have some 50+ users so that might be > painful... > > Andreas > > On Wed, Apr 7, 2010 at 4:27 AM, Andy Yates wrote: >> By the looks of things this is quite a simple process to do: >> >> http://github.com/guides/import-from-subversion >> >> http://blog.woobling.org/2009/06/git-svn-abandon.html >> >> http://blog.johngoulah.com/2009/11/migrating-svn-to-git/ >> >> The difficult things seem to be providing a SVN -> GitHub user mapping. Apart from that it's a question of how much space will the import take up >> >> Andy >> >> On 3 Apr 2010, at 16:08, Andreas Prlic wrote: >> >>> Hi, >>> >>> the anonymous svn server seems to be down again. I have already contacted support @ obf, but not recieved back a response, when it should be back up. In the meanwhile, is anybody volunteering to set up a failback mirror at github? >>> >>> Andreas >>> _______________________________________________ >>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biojava-l >> >> -- >> Andrew Yates Ensembl Genomes Engineer >> EMBL-EBI Tel: +44-(0)1223-492538 >> Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 >> Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ >> >> >> >> >> > > > > -- > ----------------------------------------------------------------------- > Dr. 
Andreas Prlic > Senior Scientist, RCSB PDB Protein Data Bank > University of California, San Diego > (+1) 858.246.0526 > ----------------------------------------------------------------------- -- Andrew Yates Ensembl Genomes Engineer EMBL-EBI Tel: +44-(0)1223-492538 Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ From andreas at sdsc.edu Wed Apr 7 19:12:27 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Wed, 7 Apr 2010 12:12:27 -0700 Subject: [Biojava-l] GSoC project on MSA In-Reply-To: References: Message-ID: Hi Gustavo, here my 0.02$: * For some of your steps there is already code available in BioJava. MIght be good to take a look at what is already there... (look at the alignment and phylo modules for dynamic programming and Neighbour-Joining) * What about risks? Where do you expect difficulties and how to work around them? * Step 4: Can you add more details? How do you plan to approach this? E.g. Clustalw has a number of rules implemented at this stage. Do you plan to support multiple rules as well and how to do this technically. Something nice would be the possibility to use structure alignments to guide the sequence alignments. (structure module) Andreas > ------------------------------------------------------------- > > GSoC proposal > > Abstract > -------- > > This project aims to develop an all-Java implementation of a multiple > sequence alignment (MSA) algorithm to be added to the Biojava toolkit, > using the progressive algorithm described in the CLUSTALW paper [1]. > > The Importance > -------------- > > Multiple sequence alignment is a frequently performed task in sequence > analysis with the goal to identify new members of protein families and > infer phylogenetic relationships between proteins and genes. At the > present there is no Java-only implementation for this algorithm. 
As > such the number of already existing and Java related BioInformatics > tools and web sites would benefit from this implementation and > sequence analysis could be more easily performed by the end-user. > > About Me > -------- > > I am a graduate student at University of S?o Paulo (Brazil), I got my > undergraduate degree from the same university with a major in Computer > Science and a minor in Biology. I have been involved with > Bioinformatics for 5 years, always with sequence analysis with > particular interest in the MSA problem. Also, in my undergraduate > final project I developed a lossless filter (pruning algorithm) for > the MSA problem, the work is published in [3] and there is an online > implementation of the algorithm in [4]. Finally, I have experience > with the C, C++, Java, Python and Ruby programming languages; Git and > SVN version control systems. > > Project Plan > ------------ > > The project is divided in four main steps, at the end of each step a > completely functional and bug-free new algorithm will be added to the > Biojava code base. It should be noticed that each step has a strong > dependence on the previous one, so before move to the next step a > careful testing will be done. > > The four steps are described below, estimated times for accomplishment > of each step are also given and in some steps extra enhancements are > described, they will be implemented if there is some time remaining > after all steps are completed. > > ** 1. Study the Biojava pairwise alignment code and update it to be > compliant with Biojava 3. > > ?The pairwise alignment will play an important role in the MSA > algorithm. This step is also important for me to get used to the > Biojava coding standards and get in touch with the Biojava dev > community. > > ?ETA: 2 weeks. > > ** 2. Implement the algorithm to build the distance matrix. > > ?This is done using the pairwise alignment for each pair of sequence > in the set to be aligned. > > ?ETA: 1 week. 
> > ?EXTRA: Enhance the basic algorithm to use parallel strategies, use > several threads to calculate the pairwise alignment for different > pairs in the sequence set. > > ** 3. Implement the algorithm to build the guide tree. > > ?The guide tree is based on the distance matrix built in the last > step, the tree construction strategy adopted will be the Neighbor > Joining Algorithm. > > ?ETA: 2 weeks. > > ** 4. Implement the algorithm for progressive MSA using the guide tree. > > ?This is certainly the most difficult part of the project, so to make > sure we are going to deliver a fully functional MSA algorithm, a safer > approach is going to be taken. In the first place, a dynamic > programming algorithm described in [2] will be implemented. Once this > get successfully done and the code fully integrated to the Biojava > code base, the features described in [1] are going to be incrementally > added (and tested) in order to implement the full dynamic programming > algorithm. > > ?ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks. > > ?EXTRA: Implement some benchmark technique to measure the final > alignment quality. > > References > ---------- > > [1] http://www.ncbi.nlm.nih.gov/pubmed/7984417 > [2] http://www.ncbi.nlm.nih.gov/pubmed/3243435 > [3] http://www.almob.org/content/4/1/3 > [4] http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu > > > > On Tue, Apr 6, 2010 at 6:27 PM, Andreas Prlic wrote: >> Hi Gustavo, >> >> In principle I agree to all, see details below: >> >> >> I think my question wasn't very clear, my intention in this project is >>> >>> to follow the approach (with the tree steps) outlined in the project's >>> page. Using the classical progressive alignment heuristic: build the >>> distance matrix, build the guide tree and using this tree >>> progressively align more sequences together. 
>> >> yes >> >>> >>> What I propose for the third step is a first implementation using the >>> (more simple) dynamic programming described in the first CLUSTAL paper >>> (I thinks it's from 1988) and incrementally improving the algorithm to >>> get closer to the one described in CLUSTALW paper (from 1994). Is this >>> more or less what you had in mind? >> >> yes, sounds good. >> >>> >>> About parallel strategies, I think a relative easy way we could use it >>> is in the distance matrix construction, we could have several threads >>> calculating the pairwise alignment for different pairs of sequence in >>> the set. >> >> Correct. Probably a first implementation would be for a single machine/ >> multi CPU. More advanced implementations could provide support e.g. for >> Map/Reduce, JPPF, or something like that... >> >>> Now, the alignment quality measures is a tougher issue. The CLUSTALW >>> paper doesn't give any way to measure the quality of the result, they >>> consider a good alignment the one that is hard to improve by eye (But >>> they claim that for sequences sufficient similar, no pair less than >>> 35% identical, the results are good). Can I do the same as in CLUSTALW >>> paper and leave the quality measure to the user? How concerned should >>> I be with that in this project? >> >> Getting an overall core-algorithm that works should be priority. The >> benchmarking part is not mandatory, but something to keep in mind... I have >> plenty of material for that, once we get to that stage... >> >>> I will try send to this mailing list a proposal draft until tomorrow >>> to have some feedback from you. >> >> Excellent, looking forward to it. >> >> Andreas >> >> -- >> ----------------------------------------------------------------------- >> Dr. 
Andreas Prlic >> Senior Scientist, RCSB PDB Protein Data Bank >> University of California, San Diego >> (+1) 858.246.0526 >> ----------------------------------------------------------------------- >> > -- ----------------------------------------------------------------------- Dr. Andreas Prlic Senior Scientist, RCSB PDB Protein Data Bank University of California, San Diego (+1) 858.246.0526 ----------------------------------------------------------------------- From andreas at sdsc.edu Wed Apr 7 19:30:19 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Wed, 7 Apr 2010 12:30:19 -0700 Subject: [Biojava-l] Questions about Summer of Code Project In-Reply-To: References: Message-ID: Hi Singer, > I had previously sent this, but was not part of the mailing list, so I > can only assume it got lost in a spam loop. You need to be subscribed in order to be able to post... > I was interested in applying for the All-Java Multiple Sequence > Alignment Google Summer of Code project. Several students have expressed their interest in this project. Depending on how the funding situation will be, at maximum one will be able to work on this... There is also a 2nd BioJava related project or you could propose your own ideas... http://biojava.org/wiki/Google_Summer_of_Code I wanted to create a project > plan but had some questions about the package as it stands now. > > 1. What exactly has changed with the transition to BioJava 3? From > what I've read on the BioJava 3 proposal page, it seems like that the > changes are to the organization of the code. Additionally there are > some new standards to follow. Java 6 usage is desired, but I am unsure > of what of the new features could be used in modifying pairwise > sequence alignments. BioJava is more modular in version 3. There is a new module for working with sequences. The current alignment module is still based on the old version of BioJava though. > > 2. Is the Neighbor Joining Algorithm really the best for this? 
Are
> other multiple alignments implementations desired? I have implemented
> the neighbor joining algorithm very inefficiently in python, it was
> not particularly difficult.

NJ is a clustering technique, but there are also others.
http://en.wikipedia.org/wiki/Neighbor-joining

Another online lecture that might be useful is:
http://www.mbio.ncsu.edu/MB451/lecture/trees/lecture.html

> This step seems like it will not take very
> long. Additionally, parallelism, I have no experience with parallelism
> in Java and will only have some experience with it in C, will that be
> an issue?

I have never written multithreaded code in C, but I would guess it is much, much easier in Java...

> 3. Is there a specific paper with the exact algorithm that should be
> implemented here?

We have only 3 months for this project, so having a modular core algorithm that can be extended would be a priority. I recommend reading the ClustalW, T-Coffee and MUSCLE papers.

> General: Will use cases be provided? Will test data be provided? These
> would both be useful in coding the test cases which seem to be coded
> first.

I can provide plenty of data for that.

> Additionally, I have access to my current windows machine as well as
> as Linux machine for testing, but no Mac. While in theory with java,
> if it works on one, then it works on another, and especially with if
> it works on Linux, it should be fine on Mac, should I be worried about
> strange peculiarities?

In my experience Java works fine on any platform. There might be issues with user interfaces that require testing, but we are not going to do user interfaces here...

Andreas

>
> Thanks,
> Singer Ma
> Harvey Mudd College 2011
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

--
-----------------------------------------------------------------------
Dr.
Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From andreas.draeger at uni-tuebingen.de Thu Apr 8 07:13:17 2010
From: andreas.draeger at uni-tuebingen.de (Andreas Dräger)
Date: Thu, 08 Apr 2010 09:13:17 +0200
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava alignment lead"]
Message-ID: <4BBD820D.9070200@uni-tuebingen.de>

Hi all,

This e-mail is just for your information about somebody new who'd like to contribute to our project.

Cheers
Andreas

Subject: Re: Fwd: Proposing a project on "Biojava alignment lead"
From: Andreas Dräger
Date: Wed, 07 Apr 2010 09:27:13 +0200
To: Cai Shaojiang

Hi Cai Shaojiang,

Thank you for your e-mail! I don't know what happened to the e-mail list. Sometimes it takes a while due to the spam filters, I guess.

> I am a PhD student from National University of Singapore. My major research area is local alignment algorithms and data structures for SNP identification. And I have used Java and Eclipse for years for software development. I am very interested in your GSoC programme. I find that there is a module called "biojava-alignment lead" whose mentor is you. I want to propose a new project on this module. I have several questions about this module.

Yes, that's me. So great to get your support.

> 1. It seems that pairwise alignment is to find similarity between two short sequences. Existing pairwise alignment is based on dynamic programming, is it Smith-Waterman algorithm?

Currently, BioJava contains three different alignment approaches. There are two deterministic algorithms, i.e., Smith-Waterman for local alignment and Needleman-Wunsch for global alignment. Third, there is the possibility to apply Hidden Markov Models for alignment. An example of the latter approach should be in the cookbook.

> 2.
What is the exact task of "refactoring of underlying data structures"?

Yes, this is something I already did last week, but it could still be improved. The problem was that the alignment algorithms actually produced a kind of string that looks similar to the output of BLAST. This string contained the score, the computation time, the length of the alignment, etc. People, however, wanted to perform higher-level computations on the score value or evaluate some of the other information. Now the alignment produces a data structure that contains all the information and can, in addition, also produce such a BLAST-like output. There is, however, still the following problem: the data structure requires both sequences in the pairwise alignment to have identical length. In the case of local alignment this is especially stupid (actually), because gaps are inserted to fill the sequences. The data structure then tries to keep the old sequence coordinates, with the effect that the numbers "query start", "query end", "subject start", and "subject end" are required to shift the sequences against each other when displaying the output. So you cannot easily print the sequences one below the other; you first have to shift them. Please check out the latest version of this package via anonymous svn and have a look ;-)

> 3. My existing research area is aiming to deal with aligning short read (10s~100s bp) against extremely long sequences (e.g., human genome). As far as I know, there are no such alignment tools implemented in Java. Would you consider this direction?

See, this would be very nice to include. But this requires that we no longer fill the short sequence with many, many gap symbols (just a waste of memory), but improve the data structure. There is already an UnequalLengthAlignment (just a data structure, no algorithm) and I think we could use this as a starting point.
Then your algorithm should only produce such a data structure and this would be fine.

> 4. It seems that the existing tools is just lacking of some refactoring and representation interfaces. Any more underlying tasks?

Hm. Yes: With the release of BioJava 3 data structures have changed again. So maybe there's also some adaptation to the new structure required.

> I am keeping an eye on GSoC from last month, but sorry to find out that I sent the initial email to the mailing list before I subscribe it...

Ok. Sounds good. Thanks for your interest. So I suggest: Download the latest trunk, have a look, play around and if you can improve something we'll put it into the trunk and write your name into the authors' tag. Cheers Andreas

-- Dipl.-Bioinform. Andreas Dräger Eberhard Karls University Tübingen Center for Bioinformatics (ZBIT) Sand 1 72076 Tübingen Germany Phone: +49-7071-29-70436 Fax: +49-7071-29-5091

From ayates at ebi.ac.uk Thu Apr 8 10:23:06 2010 From: ayates at ebi.ac.uk (Andy Yates) Date: Thu, 8 Apr 2010 11:23:06 +0100 Subject: [Biojava-l] Questions about Summer of Code Project In-Reply-To: References: Message-ID: <7EE9A2D1-0DC6-4D52-BC49-0D219AB0D88E@ebi.ac.uk>

Hi Singer, To add a bit more information to Andreas' comments: Java has a very mature concurrent execution library (java.util.concurrent), which was introduced in version 1.5. BioJava is a 1.6 project, so I would expect any concurrent code to use it. Extensions are available for this, most notably the Google guava project, the Actor model found in Scala (with more pure Java implementations available) and the Map/Reduce paradigm first white-papered by Google. 
The big rules about concurrency are:

* Mutable objects are the work of the devil & should be avoided
* Tasks & Futures are quite lightweight things to produce; threads are not
* Multiple tasks can be given to a queue to be processed by a number of threads in a pool
* Assume a non-linear execution pipeline and attempt to pass messages/jobs into queues when data is processed
* Assume that things will fail
* Write your program with a view to be concurrent; do not force concurrency on an already written program

Concurrent programs are very hard things to write and normally fail because what they attempt to do is too complex or too simple. Getting the balance right is hard but do-able. I can also recommend Brian Goetz's Java Concurrency in Practice (http://www.javaconcurrencyinpractice.com/). Andy

On 7 Apr 2010, at 20:30, Andreas Prlic wrote: > Hi Singer, > >> I had previously sent this, but was not part of the mailing list, so I >> can only assume it got lost in a spam loop. > > You need to be subscribed in order to be able to post... > >> I was interested in applying for the All-Java Multiple Sequence >> Alignment Google Summer of Code project. > > Several students have expressed their interest in this project. > Depending on how the funding situation will be, at maximum one will be > able to work on this... There is also a 2nd BioJava related project or > you could propose your own ideas... > http://biojava.org/wiki/Google_Summer_of_Code > > > I wanted to create a project >> plan but had some questions about the package as it stands now. >> >> 1. What exactly has changed with the transition to BioJava 3? From >> what I've read on the BioJava 3 proposal page, it seems like that the >> changes are to the organization of the code. Additionally there are >> some new standards to follow. Java 6 usage is desired, but I am unsure >> of what of the new features could be used in modifying pairwise >> sequence alignments. > > BioJava is more modular in version 3. 
There is a new module for > working with sequences. The current alignment module is still based on > the old version of BioJava though. > >> >> 2. Is the Neighbor Joining Algorithm really the best for this? Are >> other multiple alignments implementations desired? I have implemented >> the neighbor joining algorithm very inefficiently in python, it was >> not particularly difficult. > > NJ is a clustering technique, but there are also others. > http://en.wikipedia.org/wiki/Neighbor-joining > Another online lecture that might be useful is: > http://www.mbio.ncsu.edu/MB451/lecture/trees/lecture.html > > This step seems like it will not take very >> long. Additionally, parallelism, I have no experience with parallelism >> in Java and will only have some experience with it in C, will that be >> an issue? > > I have never written multi threaded code in C, but I would guess it is > much much easier in Java... > >> 3. Is there a specific paper with the exact algorithm that should be >> implemented here? > > We have only 3 months for this project so having a modular core > algorithm that can be extended would be a priority. I recommend > reading the Clustalw, T-Coffee and Muscle papers. > >> General: Will use cases be provided? Will test data be provided? These >> would both be useful in coding the test cases which seem to be coded >> first. > > I can provide plenty of data for that. > > >> Additionally, I have access to my current windows machine as well as >> as Linux machine for testing, but no Mac. While in theory with java, >> if it works on one, then it works on another, and especially with if >> it works on Linux, it should be fine on Mac, should I be worried about >> strange peculiarities? > >> From my experience Java works pretty fine on any platform. There might > be issues with user interfaces that require testing, but we are not > going to do user interfaces here... 
> > Andreas > > >> >> Thanks, >> Singer Ma >> Harvey Mudd College 2011 >> _______________________________________________ >> Biojava-l mailing list - Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l >> > > > > -- > ----------------------------------------------------------------------- > Dr. Andreas Prlic > Senior Scientist, RCSB PDB Protein Data Bank > University of California, San Diego > (+1) 858.246.0526 > ----------------------------------------------------------------------- > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l

-- Andrew Yates Ensembl Genomes Engineer EMBL-EBI Tel: +44-(0)1223-492538 Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/

From sma.hmc at gmail.com Thu Apr 8 10:38:41 2010 From: sma.hmc at gmail.com (Singer Ma) Date: Thu, 8 Apr 2010 03:38:41 -0700 Subject: [Biojava-l] Questions about Summer of Code Project In-Reply-To: <7EE9A2D1-0DC6-4D52-BC49-0D219AB0D88E@ebi.ac.uk> References: <7EE9A2D1-0DC6-4D52-BC49-0D219AB0D88E@ebi.ac.uk> Message-ID:

So, my questions were generated from looking past just the Summer of Code proposal and into what BioJava 3 is supposed to do. BioJava 3, as part of its proposal, lists: Make methods parallel-aware and take advantage of this when possible, and provide a global variable to specify how much parallelisation can take place. on http://www.biojava.org/wiki/BioJava3_Proposal How important is this to incorporate into the Summer of Code project? Obviously anything that is already concurrent can remain that way, but for the new code in multiple sequence alignment, does this need to be parallel-aware? Clearly, in a multiple sequence alignment, certain things can be made parallel such as the initial distance matrix calculation, parts of the neighbor joining algorithm, etc. 
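(A minimal sketch of the distance-matrix case, not BioJava code: it assumes plain strings as sequences and a made-up `distance` function, and only uses the standard `java.util.concurrent` interfaces mentioned earlier in the thread. One lightweight task is submitted per matrix row to a fixed pool; the class and method names are hypothetical.)

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelDistanceMatrix {

    // Hypothetical stand-in for a real alignment score: the fraction of
    // positions that differ, counting any length overhang as mismatches.
    static double distance(String a, String b) {
        int shared = Math.min(a.length(), b.length());
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 0.0;
        }
        int mismatches = longest - shared;
        for (int i = 0; i < shared; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                mismatches++;
            }
        }
        return (double) mismatches / longest;
    }

    // Submits one task per matrix row to a fixed-size pool and collects
    // the finished rows through Futures.
    static double[][] compute(List<String> sequences, int threads) {
        // Private copy: the tasks only read data that nobody else can change.
        final List<String> seqs = new ArrayList<String>(sequences);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<double[]>> rows = new ArrayList<Future<double[]>>();
            for (int i = 0; i < seqs.size(); i++) {
                final int row = i;
                rows.add(pool.submit(new Callable<double[]>() {
                    @Override
                    public double[] call() {
                        double[] out = new double[seqs.size()];
                        for (int j = 0; j < out.length; j++) {
                            out[j] = distance(seqs.get(row), seqs.get(j));
                        }
                        return out;
                    }
                }));
            }
            double[][] matrix = new double[seqs.size()][];
            for (int i = 0; i < matrix.length; i++) {
                matrix[i] = rows.get(i).get(); // blocks until row i is ready
            }
            return matrix;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a row", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a row task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        double[][] m = compute(Arrays.asList("ACGT", "ACGA", "TTTT"), 2);
        for (double[] row : m) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

(The sketch computes every cell, including the symmetric half, to keep the tasks trivially independent; a real implementation would likely compute only one triangle and plug in an actual pairwise aligner.)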
If I were to contribute, I would want to uphold the agreed upon standards as much as possible. I am just unsure of my capability to make multiple sequence alignment parallel-aware. Singer

From ayates at ebi.ac.uk Thu Apr 8 10:46:15 2010 From: ayates at ebi.ac.uk (Andy Yates) Date: Thu, 8 Apr 2010 11:46:15 +0100 Subject: [Biojava-l] Questions about Summer of Code Project In-Reply-To: References: <7EE9A2D1-0DC6-4D52-BC49-0D219AB0D88E@ebi.ac.uk> Message-ID: <91C9DF16-E6EF-4B7A-ADC4-E781275514EB@ebi.ac.uk>

Ahhh okay. So when we wrote this section it was with a view towards being able to do things in a concurrent manner as & when that framework appears. BioJava3 is still in an incubation phase; a lot of code is in place but we are all having to do this along with work commitments (which in my case is working on a Perl project so my work/BJ contributions are very limited). Anyway to go back to the question about being "framework" standard. The MSA algorithm would be the first case we would have to make concurrent (as far as I am aware but Scooter is a better person to confirm this) and so the framework of building a concurrent application would come from this project. If the code is written using the standard concurrent library interfaces then it should be possible to transplant it into any concurrent Java framework and that's really the important thing here. Andy

-- Andrew Yates Ensembl Genomes Engineer EMBL-EBI Tel: +44-(0)1223-492538 Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/

From mitlox at op.pl Thu Apr 8 11:30:13 2010 From: mitlox at op.pl (xyz) Date: Thu, 8 Apr 2010 21:30:13 +1000 Subject: [Biojava-l] Reading and writting Fastq files In-Reply-To: References: <20100330215047.084f6b00@wp01> Message-ID: <20100408213013.63a99b8c@wp01>

On Wed, 31 Mar 2010 23:56:42 -0400 (EDT) Michael Heuer wrote: > import static ...RichSequence.Tools.*; > import static ...RichSequence.IOTools.*; > > Fastq fastq = ...; > Namespace namepace = ...; > RichSequence richSequence = createRichSequence( > namespace, > fastq.getDescription(), > fastq.getSequence(), > DNATools.getDNA()); > > writeFasta(outputStream, richSequence, namespace); I have tried this but I got this error: Fastq2Fasta.java:52: cannot find symbol symbol : method createRichSequence(org.biojavax.SimpleNamespace,java.lang.String,java.lang.String,org.biojava.bio.symbol.FiniteAlphabet) location: class Fastq2Fasta RichSequence richSequence = createRichSequence(ns, 1 error The complete code looks now : import java.io.FileInputStream; import java.io.FileNotFoundException; import java.io.FileOutputStream; import java.io.IOException; import org.biojava.bio.program.fastq.Fastq; import org.biojava.bio.program.fastq.FastqBuilder; import 
org.biojava.bio.program.fastq.FastqReader;
import org.biojava.bio.program.fastq.FastqVariant;
import org.biojava.bio.program.fastq.FastqWriter;
import org.biojava.bio.program.fastq.IlluminaFastqReader;
import org.biojava.bio.program.fastq.IlluminaFastqWriter;
import org.biojava.bio.seq.DNATools;
import org.biojavax.SimpleNamespace;
import org.biojavax.bio.seq.RichSequence;

public class Fastq2Fasta {

    public static void main(String[] args) throws FileNotFoundException, IOException {

        FileInputStream inputFastq = new FileInputStream("fastq2fasta.fastq");
        FastqReader qReader = new IlluminaFastqReader();

        FileOutputStream outputFastq = new FileOutputStream("fastq2fastaTrim.fastq");
        FastqWriter qWriter = new IlluminaFastqWriter();

        //SimpleNamespace ns = new SimpleNamespace("biojava");

        FileOutputStream outputFasta = new FileOutputStream("fastq2fastaTrim.fasta");

        for (Fastq fastq : qReader.read(inputFastq)) {
            System.out.println(fastq.getDescription());
            System.out.println(fastq.getSequence());
            String trimSeq = fastq.getSequence().substring(0, fastq.getSequence().length() - 6);
            System.out.println(trimSeq);
            System.out.println(fastq.getQuality());
            String trimQual = fastq.getQuality().substring(0, fastq.getQuality().length() - 6);
            System.out.println(trimQual);

            FastqBuilder trimFastq = new FastqBuilder();
            trimFastq.withVariant(FastqVariant.FASTQ_ILLUMINA);
            trimFastq.withDescription(fastq.getDescription());
            trimFastq.appendSequence(trimSeq);
            trimFastq.appendQuality(trimQual);

            qWriter.write(outputFastq, trimFastq.build());

            SimpleNamespace ns = new SimpleNamespace("biojava");
            RichSequence richSequence = createRichSequence(ns, fastq.getDescription(), trimSeq, DNATools.getDNA());
            RichSequence.IOTools.writeFasta(outputFasta, richSequence, ns);
        }
    }
}

What did I wrong?
>
> > Suggestions:
> > 1)
> > After I trimmed the fastq files the header information for quality
> > is empty
> >
> > @HWI-EAS406:5:1:0:1390#0/1
> > GGGTGATGGCCGCTGCCGATGGCGTCAAAA
> > +
> > OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
> >
> > this reduced the size of the files but is it compatible with
> > SOAP and TopHat?
>
> Sorry, not sure what you are asking here.
>

Usually the @-header and the +-header are equal, e.g.

@HWI-EAS406:5:1:0:1390#0/1
+HWI-EAS406:5:1:0:1390#0/1

but after trimming and writing to a fastq file I got this

@HWI-EAS406:5:1:0:1390#0/1
+

The +-header is empty. Is this ok like this and standard compatible?

Best regards,

From mitlox at op.pl Thu Apr 8 11:30:52 2010 From: mitlox at op.pl (xyz) Date: Thu, 8 Apr 2010 21:30:52 +1000 Subject: [Biojava-l] readFasta problem Message-ID: <20100408213052.662beb8e@wp01>

Hello, I would like to read a fasta file without specifying in the code whether it is DNA, RNA or protein, and I wrote this code:

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import org.biojava.bio.BioException;
import org.biojavax.SimpleNamespace;
import org.biojavax.bio.seq.RichSequence;
import org.biojavax.bio.seq.RichSequenceIterator;

public class SortFasta {

    public static void main(String[] args) throws FileNotFoundException, BioException {

        BufferedReader br = new BufferedReader(new FileReader("sortFasta.fasta"));
        SimpleNamespace ns = new SimpleNamespace("biojava");

        // You can use any of the convenience methods found in the BioJava 1.6 API
        //RichSequenceIterator rsi = RichSequence.IOTools.readFastaDNA(br, ns);
        RichSequenceIterator rsi = RichSequence.IOTools.readFasta(br, null, ns);

        // Since a single file can contain more than a sequence, you need
        // to iterate over rsi to get the information.
        while (rsi.hasNext()) {
            RichSequence rs = rsi.nextRichSequence();
            System.out.println(rs.getComments());
            System.out.println(rs.seqString());
        }
    }
}

but unfortunately I got the following error:

it the details that follow to biojava-l at biojava.org or post a bug report to http://bugzilla.open-bio.org/

Format_object=org.biojavax.bio.seq.io.FastaFormat
Accession=
Id=
Comments=problem parsing symbols
Parse_block=atccccc
Stack trace follows ....

    at org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:222)
    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
    ... 1 more
Caused by: java.lang.NullPointerException
    at org.biojava.bio.symbol.SimpleSymbolList.<init>(SimpleSymbolList.java:165)
    at org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:213)
    ... 2 more
Java Result: 1

What did I wrong? Thank you in advance. Best regards,

From holland at eaglegenomics.com Thu Apr 8 11:41:25 2010 From: holland at eaglegenomics.com (Richard Holland) Date: Thu, 8 Apr 2010 12:41:25 +0100 Subject: [Biojava-l] readFasta problem In-Reply-To: <20100408213052.662beb8e@wp01> References: <20100408213052.662beb8e@wp01> Message-ID: <5B252A84-6822-4978-BBD1-83A8BA09EBC5@eaglegenomics.com>

You have passed null into the tokenizer parameter of RichSequence.IOTools.readFasta() - this is not allowed. The parser cannot guess the type of sequence, it must be told what to expect by specifying the tokenizer to use. (Importantly this also means that you cannot mix different types of sequence within the same file to be parsed.)
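
(Since the parser has to be told the sequence type, one possible workaround is to peek at the file first and then pick the reader. The sketch below is not part of BioJava: `FastaAlphabetGuess`, `Kind` and `guess` are hypothetical names, and the heuristic is deliberately crude. It treats anything made only of A, C, G, T, U, N as nucleotide, so IUPAC ambiguity codes or a protein consisting solely of those letters would be misclassified.)

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class FastaAlphabetGuess {

    enum Kind { DNA, RNA, PROTEIN }

    // Scans the sequence lines of FASTA text and guesses the alphabet:
    // any residue outside ACGTUN means protein; otherwise a U means RNA.
    static Kind guess(Reader source) {
        boolean sawU = false;
        try {
            BufferedReader in = new BufferedReader(source);
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(">")) {
                    continue; // header line, carries no residues
                }
                for (int i = 0; i < line.length(); i++) {
                    char c = Character.toUpperCase(line.charAt(i));
                    if (Character.isWhitespace(c) || c == '-' || c == '*') {
                        continue; // skip gaps, stop symbols and blanks
                    }
                    if ("ACGTUN".indexOf(c) < 0) {
                        return Kind.PROTEIN;
                    }
                    if (c == 'U') {
                        sawU = true;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sawU ? Kind.RNA : Kind.DNA;
    }

    public static void main(String[] args) {
        System.out.println(guess(new StringReader(">seq1\natccccc\n")));
        System.out.println(guess(new StringReader(">seq2\nauggcu\n")));
        System.out.println(guess(new StringReader(">seq3\nMKVLAT\n")));
    }
}
```

(With the guess in hand, the file would be reopened and passed to the matching reader, for example the readFastaDNA call that is commented out in the original code, or the RNA and protein counterparts if the BioJava version in use provides them.)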

-- Richard Holland, BSc MBCS Operations and Delivery Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/

From holland at eaglegenomics.com Thu Apr 8 11:36:36 2010 From: holland at eaglegenomics.com (Richard Holland) Date: Thu, 8 Apr 2010 12:36:36 +0100 Subject: [Biojava-l] Reading and writting Fastq files In-Reply-To: <20100408213013.63a99b8c@wp01> References: <20100330215047.084f6b00@wp01> <20100408213013.63a99b8c@wp01> Message-ID: <5EBA99CE-1DAC-442A-B7FD-2E738C7F586B@eaglegenomics.com>

You haven't included the two import static lines in your code. See first two lines of Michael's example code (expanding the ellipses to the full classpath).

-- Richard Holland, BSc MBCS Operations and Delivery Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/

From chapman at cs.wisc.edu Thu Apr 8 12:47:12 2010 From: chapman at cs.wisc.edu (Mark Chapman) Date: Thu, 08 Apr 2010 07:47:12 -0500 Subject: [Biojava-l] GSoC Application Message-ID: <4BBDD050.6090208@cs.wisc.edu>

I would appreciate any feedback on my proposal from mentors or other developers. Check it out at: http://socghop.appspot.com/gsoc/student_proposal/show/google/gsoc2010/mark_chapman/t127055148817 Thanks in advance, Mark

From caishaojiang at gmail.com Thu Apr 8 13:28:11 2010 From: caishaojiang at gmail.com (Cai Shaojiang) Date: Thu, 8 Apr 2010 06:28:11 -0700 Subject: [Biojava-l] [Fwd: Re: GSoC project on MSA] In-Reply-To: <4BBDCFD2.3000507@uni-tuebingen.de> References: <4BBC80A8.5000608@uni-tuebingen.de> <4BBDCFD2.3000507@uni-tuebingen.de> Message-ID:

Dear Sir: I have submitted the proposal through Google. Cheers. On Thu, Apr 8, 2010 at 5:45 AM, Andreas Dräger < andreas.draeger at uni-tuebingen.de> wrote: > Hi Cai, > > Oh yes, it is in the alignment package. 
But it is only an interface. It
> already has two sub-types: AbstractULAlignment and this has the
> implementation SubULAlignment. We should check first if we can already use
> these data structures to easily produce a paired alignment. Can you see how
> the AlignmentPair is produced by the alignment algorithms in the alignment
> package? We should do something similar but with this different data
> structure, I suggest.
>
>
> Cheers
> Andreas
>
> --
> Dipl.-Bioinform. Andreas Dräger
> Eberhard Karls University Tübingen
> Center for Bioinformatics (ZBIT)
> Sand 1
> 72076 Tübingen
> Germany
>
> Phone: +49-7071-29-70436
> Fax: +49-7071-29-5091
>

--
Cai Shaojiang
Department of Information Systems,
School of Computing,
National University of Singapore
Telephone: +65 93-4870-93
Email: caishaojiang at gmail.com; shaoj at comp.nus.edu.sg

From sacomoto at gmail.com Thu Apr 8 16:26:55 2010
From: sacomoto at gmail.com (Gustavo Akio Tominaga Sacomoto)
Date: Thu, 8 Apr 2010 13:26:55 -0300
Subject: [Biojava-l] GSoC project on MSA
In-Reply-To:
References:
Message-ID:

Hi Andreas,

On Wed, Apr 7, 2010 at 4:12 PM, Andreas Prlic wrote:
> Hi Gustavo,
>
> here my 0.02$:
>
> * For some of your steps there is already code available in BioJava.
> Might be good to take a look at what is already there... (look at
> the alignment and phylo modules for dynamic programming and
> Neighbour-Joining)
>
> * What about risks? Where do you expect difficulties and how to work
> around them?
>
> * Step 4: Can you add more details? How do you plan to approach this?
> E.g. Clustalw has a number of rules implemented at this stage. Do you
> plan to support multiple rules as well and how to do this technically.
> Something nice would be the possibility to use structure alignments to
> guide the sequence alignments. (structure module)

Based on it, I rewrote step 4 and added a "Main Risks" section.

I have pasted just the new version of step 4 and the new section at the end of this e-mail.
Thank you very much for your feedback.

gustavo

-------------------------------------------------------------------------------------------

** 4. Implement the algorithm for progressive MSA and the MSA wrapper.

Progressive MSA is a heuristic approach to the MSA problem: at each step a pairwise alignment is performed between two sequences, between a sequence and an alignment, or between two alignments. The multiple alignment is thus built incrementally, with more sequences aligned together at each iteration. The guide tree gives the order for this incremental alignment: working bottom-up in the tree, sequences (or groups of sequences) with greater similarity are aligned first. Therefore, to keep the code flexible and reusable, the design will allow any binary tree over the sequences to be used as a guide tree, not only the one built in the previous step. This will allow a priori phylogenetic or tertiary (structural) similarity knowledge to be used to guide the order of the multiple alignment.

This is certainly the most difficult part of the project, so to make sure we deliver a fully functional MSA algorithm, a safer approach is going to be taken. First, the basic algorithm described in [2] will be implemented. Once this is done successfully and the code is fully integrated into the Biojava code base, the features described in [1] will be added (and tested) incrementally in order to implement the full algorithm. This step is further divided into substeps.

*** 4.1 Implement a first, simpler dynamic programming (DP) algorithm.

This is the generalized pairwise alignment used in each iteration of the progressive MSA. Gaps already present in one of the alignments (profiles) remain fixed and gap opening penalties remain unchanged, which means that opening new gaps inside existing gaps will be fully penalized. The code for this algorithm is similar to the code for regular pairwise alignment already present in Biojava.
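As a rough illustration of the kind of DP that substep 4.1 generalizes, here is a minimal sketch of global pairwise alignment scoring with affine gap penalties (the three-matrix recurrence usually attributed to Gotoh). It aligns two plain sequences, not profiles, and it is not the Biojava implementation: the class name `AffineGapAlignment`, the method `score`, and the simple match/mismatch scoring used in place of a residue matrix are all assumptions made for this example.

```java
import java.util.Arrays;

/** Hypothetical sketch (not a Biojava class): global alignment score with affine gap penalties. */
final class AffineGapAlignment {

    // Large negative sentinel, halved so that subtracting a penalty cannot overflow.
    private static final int NEG_INF = Integer.MIN_VALUE / 2;

    /**
     * gapOpen is the cost of the first position of a gap,
     * gapExtend the cost of every further position in the same gap.
     */
    static int score(String a, String b, int match, int mismatch, int gapOpen, int gapExtend) {
        int n = a.length();
        int m = b.length();
        int[][] mat = new int[n + 1][m + 1];  // best score ending with a[i] aligned to b[j]
        int[][] gapB = new int[n + 1][m + 1]; // best score ending with a[i] aligned to a gap
        int[][] gapA = new int[n + 1][m + 1]; // best score ending with b[j] aligned to a gap
        for (int i = 0; i <= n; i++) {
            Arrays.fill(mat[i], NEG_INF);
            Arrays.fill(gapB[i], NEG_INF);
            Arrays.fill(gapA[i], NEG_INF);
        }
        mat[0][0] = 0;
        for (int i = 1; i <= n; i++) {
            gapB[i][0] = -(gapOpen + (i - 1) * gapExtend);
        }
        for (int j = 1; j <= m; j++) {
            gapA[0][j] = -(gapOpen + (j - 1) * gapExtend);
        }
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int s = a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch;
                mat[i][j] = s + max3(mat[i - 1][j - 1], gapB[i - 1][j - 1], gapA[i - 1][j - 1]);
                // Extending an existing gap is cheaper than opening a new one.
                gapB[i][j] = max3(mat[i - 1][j] - gapOpen, gapB[i - 1][j] - gapExtend, gapA[i - 1][j] - gapOpen);
                gapA[i][j] = max3(mat[i][j - 1] - gapOpen, gapA[i][j - 1] - gapExtend, gapB[i][j - 1] - gapOpen);
            }
        }
        return max3(mat[n][m], gapB[n][m], gapA[n][m]);
    }

    private static int max3(int x, int y, int z) {
        return Math.max(x, Math.max(y, z));
    }
}
```

A profile version would presumably replace the per-character comparison with a column-against-column score and keep the same recurrence, but that extension is not shown here.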

*** 4.2 Implement the basic progressive MSA algorithm.

This substep implements the incremental algorithm that builds the MSA by traversing a guide tree in a bottom-up fashion, using the algorithm from substep 4.1 at each iteration. The guide tree is a parameter: it can be the one built in step 3 or any other tree.

*** 4.3 Implement the MSA wrapper.

The MSA wrapper is going to be a method that wraps steps 2, 3 and 4.2, giving the final user a simple way to calculate the MSA. It receives as parameters the set of sequences to be aligned, the gap opening penalty, the gap extension penalty and the residue matrix, and returns the MSA for the sequence set.

At the end of this substep, we have a basic, fully functional MSA algorithm using the progressive heuristic.

*** 4.4 Implement gap penalty rescaling and default parameter values.

The penalties for opening a new gap and for extending an existing one (the affine gap weight model) are user-defined parameters. This substep will define default values for these parameters, based on the residue matrix, and implement global rescaling rules for them, based on sequence lengths.

*** 4.5 Enhance the DP algorithm to use different sequence weights.

Based on the guide tree, a different weight is calculated for each sequence (divergent sequences receive high values) and used in the scoring scheme of the generalized DP algorithm.

*** 4.6 Enhance the DP algorithm to use position-based gap penalties.

The DP algorithm from substep 4.1 uses a globally defined gap opening penalty. In this substep, the algorithm is going to be modified to use position-based penalties. This is simple once an array of opening penalties for each sequence position is known. This array is initialized with the original gap opening penalty and then calculated from several hierarchical rescaling rules (only the first rule that fits, if any, is applied).
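The per-position penalty array described in substep 4.6 could be computed along the following lines. This is only a sketch under stated assumptions: the proposal does not spell out the rules, so the two rules used here (lower the penalty at columns that already contain a gap, otherwise raise it for columns close to such a column) and their scaling factors are illustrative placeholders, and the names `PositionGapPenalties` and `gapOpenPenalties` are hypothetical. All rows are assumed to have the same length, with '-' marking a gap.

```java
/** Hypothetical sketch (not a Biojava class): position-specific gap opening penalties. */
final class PositionGapPenalties {

    /**
     * Returns one gap opening penalty per alignment column. Each column starts
     * with baseGapOpen, and only the first rule that fits (if any) is applied.
     */
    static double[] gapOpenPenalties(String[] alignedRows, double baseGapOpen, int window) {
        int rows = alignedRows.length;
        int columns = alignedRows[0].length();

        // Count the residues (non-gap characters) in every column.
        int[] nonGapCount = new int[columns];
        for (String row : alignedRows) {
            for (int c = 0; c < columns; c++) {
                if (row.charAt(c) != '-') {
                    nonGapCount[c]++;
                }
            }
        }

        double[] penalties = new double[columns];
        for (int c = 0; c < columns; c++) {
            penalties[c] = baseGapOpen;
            if (nonGapCount[c] < rows) {
                // Rule 1 (assumed factor 0.3): cheaper to open a gap where one already exists.
                penalties[c] = baseGapOpen * 0.3 * nonGapCount[c] / rows;
            } else {
                int d = distanceToNearestGap(nonGapCount, rows, c, window);
                if (d > 0) {
                    // Rule 2 (assumed formula): more expensive the closer an existing gap is.
                    penalties[c] = baseGapOpen * (2.0 + (window - d) * 2.0 / window);
                }
            }
        }
        return penalties;
    }

    /** Distance to the nearest gapped column within the window, or -1 if there is none. */
    private static int distanceToNearestGap(int[] nonGapCount, int rows, int column, int window) {
        for (int d = 1; d <= window; d++) {
            int left = column - d;
            int right = column + d;
            if (left >= 0 && nonGapCount[left] < rows) {
                return d;
            }
            if (right < nonGapCount.length && nonGapCount[right] < rows) {
                return d;
            }
        }
        return -1;
    }
}
```

Further rules would slot in as additional branches in priority order, which matches the incremental, one-rule-per-substep plan.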

Given the hierarchical nature of the rules, they can be implemented incrementally, from the highest priority rule to the lowest, with the algorithm of each step being a refinement of the previous one. I am omitting the detailed description of each rule. However, to verify whether a given rule applies to a given position, all that is necessary is to check at most 16 adjacent positions and the same position in the other, already aligned sequences.

At the end of each of the following steps we have a functional algorithm, and after 4.6.4 the full CLUSTALW algorithm is complete.

**** 4.6.1 Lowered gap opening penalties at existing gaps.
**** 4.6.2 Increased gap opening penalties near existing gaps.
**** 4.6.3 Reduced gap opening penalties in hydrophilic stretches.
**** 4.6.4 Residue specific gap penalties.

ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks.

EXTRA: Implement some benchmark technique to measure the final alignment quality.

Main Risks
----------

The main risk to this project is the intrinsic complexity of the progressive MSA algorithm. To deal with that, we decided to break the implementation into a large number of small, manageable steps, designed so that at the end of each of them we will have a complete and testable new function (or a modification of an existing one). Besides that, to be extra careful, the project aims to produce a simple, fully functional MSA algorithm as early as possible; the estimated time is 8 weeks. This way we guarantee to deliver at least a simpler, but working and bug-free, version.

> Andreas
>
>
>> -------------------------------------------------------------
>>
>> GSoC proposal
>>
>> Abstract
>> --------
>>
>> This project aims to develop an all-Java implementation of a multiple
>> sequence alignment (MSA) algorithm to be added to the Biojava toolkit,
>> using the progressive algorithm described in the CLUSTALW paper [1].
>> >> The Importance >> -------------- >> >> Multiple sequence alignment is a frequently performed task in sequence >> analysis with the goal to identify new members of protein families and >> infer phylogenetic relationships between proteins and genes. At the >> present there is no Java-only implementation for this algorithm. As >> such the number of already existing and Java related BioInformatics >> tools and web sites would benefit from this implementation and >> sequence analysis could be more easily performed by the end-user. >> >> About Me >> -------- >> >> I am a graduate student at University of S?o Paulo (Brazil), I got my >> undergraduate degree from the same university with a major in Computer >> Science and a minor in Biology. I have been involved with >> Bioinformatics for 5 years, always with sequence analysis with >> particular interest in the MSA problem. Also, in my undergraduate >> final project I developed a lossless filter (pruning algorithm) for >> the MSA problem, the work is published in [3] and there is an online >> implementation of the algorithm in [4]. Finally, I have experience >> with the C, C++, Java, Python and Ruby programming languages; Git and >> SVN version control systems. >> >> Project Plan >> ------------ >> >> The project is divided in four main steps, at the end of each step a >> completely functional and bug-free new algorithm will be added to the >> Biojava code base. It should be noticed that each step has a strong >> dependence on the previous one, so before move to the next step a >> careful testing will be done. >> >> The four steps are described below, estimated times for accomplishment >> of each step are also given and in some steps extra enhancements are >> described, they will be implemented if there is some time remaining >> after all steps are completed. >> >> ** 1. Study the Biojava pairwise alignment code and update it to be >> compliant with Biojava 3. 
>> >> ?The pairwise alignment will play an important role in the MSA >> algorithm. This step is also important for me to get used to the >> Biojava coding standards and get in touch with the Biojava dev >> community. >> >> ?ETA: 2 weeks. >> >> ** 2. Implement the algorithm to build the distance matrix. >> >> ?This is done using the pairwise alignment for each pair of sequence >> in the set to be aligned. >> >> ?ETA: 1 week. >> >> ?EXTRA: Enhance the basic algorithm to use parallel strategies, use >> several threads to calculate the pairwise alignment for different >> pairs in the sequence set. >> >> ** 3. Implement the algorithm to build the guide tree. >> >> ?The guide tree is based on the distance matrix built in the last >> step, the tree construction strategy adopted will be the Neighbor >> Joining Algorithm. >> >> ?ETA: 2 weeks. >> >> ** 4. Implement the algorithm for progressive MSA using the guide tree. >> >> ?This is certainly the most difficult part of the project, so to make >> sure we are going to deliver a fully functional MSA algorithm, a safer >> approach is going to be taken. In the first place, a dynamic >> programming algorithm described in [2] will be implemented. Once this >> get successfully done and the code fully integrated to the Biojava >> code base, the features described in [1] are going to be incrementally >> added (and tested) in order to implement the full dynamic programming >> algorithm. >> >> ?ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks. >> >> ?EXTRA: Implement some benchmark technique to measure the final >> alignment quality. 
>> >> References >> ---------- >> >> [1] http://www.ncbi.nlm.nih.gov/pubmed/7984417 >> [2] http://www.ncbi.nlm.nih.gov/pubmed/3243435 >> [3] http://www.almob.org/content/4/1/3 >> [4] http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu >> >> >> >> On Tue, Apr 6, 2010 at 6:27 PM, Andreas Prlic wrote: >>> Hi Gustavo, >>> >>> In principle I agree to all, see details below: >>> >>> >>> I think my question wasn't very clear, my intention in this project is >>>> >>>> to follow the approach (with the tree steps) outlined in the project's >>>> page. Using the classical progressive alignment heuristic: build the >>>> distance matrix, build the guide tree and using this tree >>>> progressively align more sequences together. >>> >>> yes >>> >>>> >>>> What I propose for the third step is a first implementation using the >>>> (more simple) dynamic programming described in the first CLUSTAL paper >>>> (I thinks it's from 1988) and incrementally improving the algorithm to >>>> get closer to the one described in CLUSTALW paper (from 1994). Is this >>>> more or less what you had in mind? >>> >>> yes, sounds good. >>> >>>> >>>> About parallel strategies, I think a relative easy way we could use it >>>> is in the distance matrix construction, we could have several threads >>>> calculating the pairwise alignment for different pairs of sequence in >>>> the set. >>> >>> Correct. Probably a first implementation would be for a single machine/ >>> multi CPU. More advanced implementations could provide support e.g. for >>> Map/Reduce, JPPF, or something like that... >>> >>>> Now, the alignment quality measures is a tougher issue. The CLUSTALW >>>> paper doesn't give any way to measure the quality of the result, they >>>> consider a good alignment the one that is hard to improve by eye (But >>>> they claim that for sequences sufficient similar, no pair less than >>>> 35% identical, the results are good). 
Can I do the same as in CLUSTALW >>>> paper and leave the quality measure to the user? How concerned should >>>> I be with that in this project? >>> >>> Getting an overall core-algorithm that works should be priority. The >>> benchmarking part is not mandatory, but something to keep in mind... I have >>> plenty of material for that, once we get to that stage... >>> >>>> I will try send to this mailing list a proposal draft until tomorrow >>>> to have some feedback from you. >>> >>> Excellent, looking forward to it. >>> >>> Andreas >>> >>> -- >>> ----------------------------------------------------------------------- >>> Dr. Andreas Prlic >>> Senior Scientist, RCSB PDB Protein Data Bank >>> University of California, San Diego >>> (+1) 858.246.0526 >>> ----------------------------------------------------------------------- >>> >> > > > > -- > ----------------------------------------------------------------------- > Dr. Andreas Prlic > Senior Scientist, RCSB PDB Protein Data Bank > University of California, San Diego > (+1) 858.246.0526 > ----------------------------------------------------------------------- > From andreas at sdsc.edu Thu Apr 8 17:26:03 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Thu, 8 Apr 2010 10:26:03 -0700 Subject: [Biojava-l] GSoC Application In-Reply-To: <4BBDD050.6090208@cs.wisc.edu> References: <4BBDD050.6090208@cs.wisc.edu> Message-ID: Hi Mark, looks pretty good, * The time schedule feels tight. Where do you see possible difficulties and risks. What might take longer than expected? * I would like to be able to use 3D structure alignment information to guide the final alignment. This should increase reliability of the final alignment for remote sequence similarities. Any thoughts on how to accomplish this? Andreas On Thu, Apr 8, 2010 at 5:47 AM, Mark Chapman wrote: > I would appreciate any feedback on my proposal from mentors or other > developers. 
Check it out at:
> http://socghop.appspot.com/gsoc/student_proposal/show/google/gsoc2010/mark_chapman/t127055148817
>
> Thanks in advance,
> Mark
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

--
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From andreas at sdsc.edu Thu Apr 8 17:36:56 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Thu, 8 Apr 2010 10:36:56 -0700
Subject: [Biojava-l] GSoC project on MSA
In-Reply-To:
References:
Message-ID:

Looks pretty good. One issue during the progressive alignment build up: 3D structure alignments can increase the reliability of the sequence alignments, particularly if the sequences are only distantly related. Having a way to incorporate the 3D structure info would be nice...

Andreas

On Thu, Apr 8, 2010 at 9:26 AM, Gustavo Akio Tominaga Sacomoto wrote:
> Hi Andreas,
>
> On Wed, Apr 7, 2010 at 4:12 PM, Andreas Prlic wrote:
>> Hi Gustavo,
>>
>> here my 0.02$:
>>
>> * For some of your steps there is already code available in BioJava.
>> Might be good to take a look at what is already there... (look at
>> the alignment and phylo modules for dynamic programming and
>> Neighbour-Joining)
>>
>> * What about risks? Where do you expect difficulties and how to work
>> around them?
>>
>> * Step 4: Can you add more details? How do you plan to approach this?
>> E.g. Clustalw has a number of rules implemented at this stage. Do you
>> plan to support multiple rules as well and how to do this technically.
>> Something nice would be the possibility to use structure alignments to
>> guide the sequence alignments. (structure module)
>
> Based on it, I rewrote step 4 and added a "Main Risks" section.
> > I pasted just the new version of step 4 and the new section at the end > of this e-mal. > > Thank you very much for your feedback. > > gustavo > > > > ------------------------------------------------------------------------------------------- > > ** 4. Implement the algorithm for progressive MSA and the MSA wrapper. > > ?A progressive MSA is a heuristic approach for the MSA problem, at > each step a pairwise alignment between two sequences, a sequence and > an alignment or between two alignments is done. So, the multiple > alignment is built incrementally, at each iteration more sequences are > aligned together. The guide tree gives an order for this incremental > alignment, in a bottom-up (in the tree) fashion sequences (or groups > of sequences) with greater similarity are aligned first. Therefore, in > order to have a more flexible and reusable code, the code design will > allow any binary tree of the sequences to be used as a guide tree, not > only the one built in the last step. This will allow a priori > phylogenetic or tertiary similarity (structural similarity) knowledge > be used to guide the multiple alignment order. > > ?This is certainly the most difficult part of the project, so to make > sure we are going to deliver a fully functional MSA algorithm, a safer > approach is going to be taken. In the first place, a a basic algorithm > described in [2] will be implemented. Once this get successfully done > and the code fully integrated to the Biojava code base, the features > described in [1] are going to be incrementally added (and tested) in > order to implement the full algorithm. This step is further divided in > substeps. > > *** 4.1 Implement a first simpler dynamic programming (DP) algorithm. > > ?This is the generalized pairwise alignment used in each iteration of > the progressive MSA. 
Gaps ?already presents in one of the alignments > (profiles) remain fixed, gap opening penalties remain unchanged, this > means that opening new gaps inside existent gaps will be fully > penalized. The code for this algorithm is similar to, the already > present in Biojava, code for regular pairwise alignment. > > *** 4.2 Implement the basic progressive MSA algorithm. > > ?In this substep is going to be implemented the incremental algorithm > to built the MSA, transversing a guide tree (parameter, could be the > one built in step 3 or any other one) in a bottom-up fashion and using > the algorithm from substep 4.1 at each iteration. > > *** 4.3 Implement the MSA wrapper. > > ?The MSA wrapper is going to be a method that wraps steps 2, 3 and > 4.2, giving a simple method (for the final user) to calculate the MSA. > Receiving as parameters the set of sequences to be aligned, the gap > opening penalty, gap extend penalty and residue matrix. Returning the > MSA for the sequence set. > ?At the end of this substep, we get a basic fully functional MSA > algorithm, using the progressive heuristic. > > *** 4.4 Implement gaps penalties rescaling and parameter default values. > > ?Gap penalties to open a new gap an extend a existing one (the affine > gap weight model) are user defined parameters. This substep will > define default values, based on the residue matrix, for this > parameters and implement global rescaling rules (based on sequences > sizes) for this parameters. > > *** 4.5 Enhance the DP algorithm to use different sequences weight. > > ?Based on the guide tree, for each sequence a different weight > (divergent sequences receive high values) is calculated and used in > the scoring scheme of the generalized DP algorithm. > > *** 4.6 Enhance the DP algorithm to use position based gap penalties. > > ?The DP algorithm from substep 4.1 uses globally defined gap opening > penalty. 
In this substep, the algorithm is going to be modified do use > position based penalty, this is simple, once is known an array of > opening penalties for each sequence position. This array is calculated > based on several hierarchical (only apply the first one that fits, if > any) rules, those are rescaling rules and the array is initialized > with the original gap opening penalty. > > Given the hierarchical nature of the rules, they can be implemented in > a incremental way, from the highest priority rule to the lowest, the > algorithm of each step being a refinement of the previous one. I am > omitting the detailed description of each rule. However, to verify if > a given rule apply to a given position, all that is necessary is to > check at most 16 adjacent positions and the same position in the other > already aligned sequences. > > At the end of each of the following steps we a have functional > algorithm, and after 4.6.4 the full CLUSTALW algorithm is complete. > > **** 4.6.1 Lowered gap opening penalties at existing gaps. > **** 4.6.2 Increased gap opening penalties near existing gaps. > **** 4.6.3 Reduced gap opening penalties in hydrophilic stretches. > **** 4.6.4 Residue specific gap penalties. > > ?ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks. > > ?EXTRA: Implement some benchmark technique to measure the final > alignment quality. > > Main Risks > ---------- > > The main risk to this project is the intrinsic complexity of the MSA > progressive algorithm. To deal with that we decided to break the > implementation in a large number of small and manageable steps, and > the steps are designed in a way that, at the end of each of them, we > will have a complete and testable new function (or a modification of > an existing one). 
Besides that, to be extra careful the project aims > to produce a simple full functional MSA algorithm as early as > possible, the estimated time is 8 weeks, this way we guarantee to > deliver at a simpler, but working and bug-free, version. > > > > >> Andreas >> >> >>> ------------------------------------------------------------- >>> >>> GSoC proposal >>> >>> Abstract >>> -------- >>> >>> This project aims to develop an all-Java implementation of a multiple >>> sequence alignment (MSA) algorithm to be added to the Biojava toolkit, >>> using the progressive algorithm described in the CLUSTALW paper [1]. >>> >>> The Importance >>> -------------- >>> >>> Multiple sequence alignment is a frequently performed task in sequence >>> analysis with the goal to identify new members of protein families and >>> infer phylogenetic relationships between proteins and genes. At the >>> present there is no Java-only implementation for this algorithm. As >>> such the number of already existing and Java related BioInformatics >>> tools and web sites would benefit from this implementation and >>> sequence analysis could be more easily performed by the end-user. >>> >>> About Me >>> -------- >>> >>> I am a graduate student at University of S?o Paulo (Brazil), I got my >>> undergraduate degree from the same university with a major in Computer >>> Science and a minor in Biology. I have been involved with >>> Bioinformatics for 5 years, always with sequence analysis with >>> particular interest in the MSA problem. Also, in my undergraduate >>> final project I developed a lossless filter (pruning algorithm) for >>> the MSA problem, the work is published in [3] and there is an online >>> implementation of the algorithm in [4]. Finally, I have experience >>> with the C, C++, Java, Python and Ruby programming languages; Git and >>> SVN version control systems. 
>>> >>> Project Plan >>> ------------ >>> >>> The project is divided in four main steps, at the end of each step a >>> completely functional and bug-free new algorithm will be added to the >>> Biojava code base. It should be noticed that each step has a strong >>> dependence on the previous one, so before move to the next step a >>> careful testing will be done. >>> >>> The four steps are described below, estimated times for accomplishment >>> of each step are also given and in some steps extra enhancements are >>> described, they will be implemented if there is some time remaining >>> after all steps are completed. >>> >>> ** 1. Study the Biojava pairwise alignment code and update it to be >>> compliant with Biojava 3. >>> >>> ?The pairwise alignment will play an important role in the MSA >>> algorithm. This step is also important for me to get used to the >>> Biojava coding standards and get in touch with the Biojava dev >>> community. >>> >>> ?ETA: 2 weeks. >>> >>> ** 2. Implement the algorithm to build the distance matrix. >>> >>> ?This is done using the pairwise alignment for each pair of sequence >>> in the set to be aligned. >>> >>> ?ETA: 1 week. >>> >>> ?EXTRA: Enhance the basic algorithm to use parallel strategies, use >>> several threads to calculate the pairwise alignment for different >>> pairs in the sequence set. >>> >>> ** 3. Implement the algorithm to build the guide tree. >>> >>> ?The guide tree is based on the distance matrix built in the last >>> step, the tree construction strategy adopted will be the Neighbor >>> Joining Algorithm. >>> >>> ?ETA: 2 weeks. >>> >>> ** 4. Implement the algorithm for progressive MSA using the guide tree. >>> >>> ?This is certainly the most difficult part of the project, so to make >>> sure we are going to deliver a fully functional MSA algorithm, a safer >>> approach is going to be taken. In the first place, a dynamic >>> programming algorithm described in [2] will be implemented. 
Once this >>> get successfully done and the code fully integrated to the Biojava >>> code base, the features described in [1] are going to be incrementally >>> added (and tested) in order to implement the full dynamic programming >>> algorithm. >>> >>> ?ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks. >>> >>> ?EXTRA: Implement some benchmark technique to measure the final >>> alignment quality. >>> >>> References >>> ---------- >>> >>> [1] http://www.ncbi.nlm.nih.gov/pubmed/7984417 >>> [2] http://www.ncbi.nlm.nih.gov/pubmed/3243435 >>> [3] http://www.almob.org/content/4/1/3 >>> [4] http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu >>> >>> >>> >>> On Tue, Apr 6, 2010 at 6:27 PM, Andreas Prlic wrote: >>>> Hi Gustavo, >>>> >>>> In principle I agree to all, see details below: >>>> >>>> >>>> I think my question wasn't very clear, my intention in this project is >>>>> >>>>> to follow the approach (with the tree steps) outlined in the project's >>>>> page. Using the classical progressive alignment heuristic: build the >>>>> distance matrix, build the guide tree and using this tree >>>>> progressively align more sequences together. >>>> >>>> yes >>>> >>>>> >>>>> What I propose for the third step is a first implementation using the >>>>> (more simple) dynamic programming described in the first CLUSTAL paper >>>>> (I thinks it's from 1988) and incrementally improving the algorithm to >>>>> get closer to the one described in CLUSTALW paper (from 1994). Is this >>>>> more or less what you had in mind? >>>> >>>> yes, sounds good. >>>> >>>>> >>>>> About parallel strategies, I think a relative easy way we could use it >>>>> is in the distance matrix construction, we could have several threads >>>>> calculating the pairwise alignment for different pairs of sequence in >>>>> the set. >>>> >>>> Correct. Probably a first implementation would be for a single machine/ >>>> multi CPU. More advanced implementations could provide support e.g. 
for >>>> Map/Reduce, JPPF, or something like that... >>>> >>>>> Now, the alignment quality measures is a tougher issue. The CLUSTALW >>>>> paper doesn't give any way to measure the quality of the result, they >>>>> consider a good alignment the one that is hard to improve by eye (But >>>>> they claim that for sequences sufficient similar, no pair less than >>>>> 35% identical, the results are good). Can I do the same as in CLUSTALW >>>>> paper and leave the quality measure to the user? How concerned should >>>>> I be with that in this project? >>>> >>>> Getting an overall core-algorithm that works should be priority. The >>>> benchmarking part is not mandatory, but something to keep in mind... I have >>>> plenty of material for that, once we get to that stage... >>>> >>>>> I will try send to this mailing list a proposal draft until tomorrow >>>>> to have some feedback from you. >>>> >>>> Excellent, looking forward to it. >>>> >>>> Andreas >>>> >>>> -- >>>> ----------------------------------------------------------------------- >>>> Dr. Andreas Prlic >>>> Senior Scientist, RCSB PDB Protein Data Bank >>>> University of California, San Diego >>>> (+1) 858.246.0526 >>>> ----------------------------------------------------------------------- >>>> >>> >> >> >> >> -- >> ----------------------------------------------------------------------- >> Dr. Andreas Prlic >> Senior Scientist, RCSB PDB Protein Data Bank >> University of California, San Diego >> (+1) 858.246.0526 >> ----------------------------------------------------------------------- >> > -- ----------------------------------------------------------------------- Dr. 
Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From chapman at cs.wisc.edu Thu Apr 8 20:45:21 2010
From: chapman at cs.wisc.edu (Mark Chapman)
Date: Thu, 08 Apr 2010 15:45:21 -0500
Subject: [Biojava-l] GSoC Application
In-Reply-To:
References: <4BBDD050.6090208@cs.wisc.edu>
Message-ID: <4BBE4061.3000204@cs.wisc.edu>

Hi Andreas,

Thanks for the feedback.

Difficulties and risks:
By viewing progressive multiple sequence alignment as four separate stages, I believe the pieces become easier to manage. However, I also expect a few of my ideas to prove quite challenging to implement. One of these challenges will be efficient parallelization. Instead of spending all summer finding the optimal approach, I plan to make routines which are called in sequence in a simple implementation and in parallel in a separate one. Later work could then extend the parallelism to a distributed computing framework such as Hadoop or Condor. Another difficult aspect is to make a general interface for choosing anchors in profile-profile alignment. The Myers-Miller algorithm chooses optimal midpoints as anchors in an internal decision process. I hope to generalize this to allow external identification of candidate anchors, as well.

Structural alignment integration:
At least three options exist for inserting structural information into the multiple sequence alignment task: pairwise scoring, anchoring, and profile scoring. First, scores from pairwise structural alignments could be used to construct the similarity matrix. This would create a guide tree that aligns sequences with similar structures earlier in the progressive alignment. Second, structural alignment could identify possible anchors. The profile-profile alignments would then conserve known structures when two profiles share some anchor candidates. Both of these options are in my plan. The third option would follow the consistency method of profile-profile alignment which replaces scoring from a substitution matrix with a consistency score. This technique is used in T-Coffee and ProbCons. The consistency score comes from how often residues in each profile aligned when combining information from pairwise alignments. If these were structural pairwise alignments, then the multiple sequence alignment would preserve structural information. Later work could implement this method as an alternative profile-profile alignment.

I'll try to incorporate these ideas when I revise my application later tonight. And thanks again for your input.

Mark

On 4/8/2010 12:26 PM, Andreas Prlic wrote:
> Hi Mark,
>
> looks pretty good,
>
> * The time schedule feels tight. Where do you see possible
> difficulties and risks. What might take longer than expected?
>
> * I would like to be able to use 3D structure alignment information to
> guide the final alignment. This should increase reliability of the
> final alignment for remote sequence similarities. Any thoughts on how
> to accomplish this?
>
> Andreas
>
> On Thu, Apr 8, 2010 at 5:47 AM, Mark Chapman wrote:
>> I would appreciate any feedback on my proposal from mentors or other
>> developers. Check it out at:
>> http://socghop.appspot.com/gsoc/student_proposal/show/google/gsoc2010/mark_chapman/t127055148817
>>
>> Thanks in advance,
>> Mark

From sacomoto at gmail.com Fri Apr 9 00:36:27 2010
From: sacomoto at gmail.com (Gustavo Akio Tominaga Sacomoto)
Date: Thu, 8 Apr 2010 21:36:27 -0300
Subject: [Biojava-l] GSoC project on MSA
In-Reply-To:
References:
Message-ID:

Hi Andreas,

On Thu, Apr 8, 2010 at 2:36 PM, Andreas Prlic wrote:
> Looks pretty good.
>
> One issue during the progressive alignment build up: 3D structure
> alignments can increase the reliability of the sequence alignments,
> particularly if the sequences are only distantly related. Having a way
> to incorporate the 3D structure info would be nice...

A first idea for incorporating information from a 3D structure alignment is to extract matching substrings from it, i.e. the sequence substrings that correspond to the superimposed points in the 3D alignment, and then force the final MSA to contain those same aligned substrings. To do that, the DP algorithm of step 4.1 should be modified in the way described here [ http://www.ncbi.nlm.nih.gov/pubmed/9018604 ].

Thanks again.

gustavo

> Andreas
>
> On Thu, Apr 8, 2010 at 9:26 AM, Gustavo Akio Tominaga Sacomoto
> wrote:
>> Hi Andreas,
>>
>> On Wed, Apr 7, 2010 at 4:12 PM, Andreas Prlic wrote:
>>> Hi Gustavo,
>>>
>>> here my 0.02$:
>>>
>>> * For some of your steps there is already code available in BioJava.
>>> Might be good to take a look at what is already there... (look at
>>> the alignment and phylo modules for dynamic programming and
>>> Neighbour-Joining)
>>>
>>> * What about risks? Where do you expect difficulties and how to work
>>> around them?
>>>
>>> * Step 4: Can you add more details? How do you plan to approach this?
>>> E.g. Clustalw has a number of rules implemented at this stage. Do you
>>> plan to support multiple rules as well and how to do this technically.
>>> Something nice would be the possibility to use structure alignments to
>>> guide the sequence alignments. (structure module)
>>
>> Based on it I rewrote step 4 and added a "Main Risks" section.
>>
>> I pasted just the new version of step 4 and the new section at the end
>> of this e-mail.
>>
>> Thank you very much for your feedback.
>>
>> gustavo
>>
>> -------------------------------------------------------------------------------------------
>>
>> ** 4. Implement the algorithm for progressive MSA and the MSA wrapper.
>>
>> A progressive MSA is a heuristic approach for the MSA problem: at
>> each step a pairwise alignment between two sequences, a sequence and
>> an alignment or between two alignments is done. So, the multiple
>> alignment is built incrementally, at each iteration more sequences are
>> aligned together. The guide tree gives an order for this incremental
>> alignment: in a bottom-up (in the tree) fashion, sequences (or groups
>> of sequences) with greater similarity are aligned first. Therefore, in
>> order to have a more flexible and reusable code, the code design will
>> allow any binary tree of the sequences to be used as a guide tree, not
>> only the one built in the last step. This will allow a priori
>> phylogenetic or tertiary similarity (structural similarity) knowledge
>> to be used to guide the multiple alignment order.
>>
>> This is certainly the most difficult part of the project, so to make
>> sure we are going to deliver a fully functional MSA algorithm, a safer
>> approach is going to be taken. In the first place, a basic algorithm
>> described in [2] will be implemented. Once this gets successfully done
>> and the code fully integrated to the Biojava code base, the features
>> described in [1] are going to be incrementally added (and tested) in
>> order to implement the full algorithm. This step is further divided in
>> substeps.
>>
>> *** 4.1 Implement a first simpler dynamic programming (DP) algorithm.
>>
>> This is the generalized pairwise alignment used in each iteration of
>> the progressive MSA. Gaps already present in one of the alignments
>> (profiles) remain fixed, gap opening penalties remain unchanged, this
>> means that opening new gaps inside existent gaps will be fully
>> penalized. The code for this algorithm is similar to the code for
>> regular pairwise alignment already present in Biojava.
>>
>> *** 4.2 Implement the basic progressive MSA algorithm.
>>
>> This substep implements the incremental algorithm to build the MSA,
>> traversing a guide tree (a parameter: it could be the one built in
>> step 3 or any other one) in a bottom-up fashion and using the
>> algorithm from substep 4.1 at each iteration.
>>
>> *** 4.3 Implement the MSA wrapper.
>>
>> The MSA wrapper is going to be a method that wraps steps 2, 3 and
>> 4.2, giving a simple method (for the final user) to calculate the MSA.
>> Receiving as parameters the set of sequences to be aligned, the gap
>> opening penalty, gap extend penalty and residue matrix. Returning the
>> MSA for the sequence set.
>> At the end of this substep, we get a basic fully functional MSA
>> algorithm, using the progressive heuristic.
>>
>> *** 4.4 Implement gap penalty rescaling and parameter default values.
>>
>> Gap penalties to open a new gap and extend an existing one (the affine
>> gap weight model) are user defined parameters. This substep will
>> define default values, based on the residue matrix, for these
>> parameters and implement global rescaling rules (based on sequence
>> sizes) for these parameters.
>>
>> *** 4.5 Enhance the DP algorithm to use different sequence weights.
>>
>> Based on the guide tree, for each sequence a different weight
>> (divergent sequences receive high values) is calculated and used in
>> the scoring scheme of the generalized DP algorithm.
>>
>> *** 4.6 Enhance the DP algorithm to use position based gap penalties.
>>
>> The DP algorithm from substep 4.1 uses a globally defined gap opening
>> penalty. In this substep, the algorithm is going to be modified to use
>> position based penalties; this is simple once an array of opening
>> penalties for each sequence position is known. This array is calculated
>> based on several hierarchical (only apply the first one that fits, if
>> any) rules, those are rescaling rules and the array is initialized
>> with the original gap opening penalty.
>>
>> Given the hierarchical nature of the rules, they can be implemented in
>> an incremental way, from the highest priority rule to the lowest, the
>> algorithm of each step being a refinement of the previous one. I am
>> omitting the detailed description of each rule. However, to verify if
>> a given rule applies to a given position, all that is necessary is to
>> check at most 16 adjacent positions and the same position in the other
>> already aligned sequences.
>>
>> At the end of each of the following steps we have a functional
>> algorithm, and after 4.6.4 the full CLUSTALW algorithm is complete.
>>
>> **** 4.6.1 Lowered gap opening penalties at existing gaps.
>> **** 4.6.2 Increased gap opening penalties near existing gaps.
>> **** 4.6.3 Reduced gap opening penalties in hydrophilic stretches.
>> **** 4.6.4 Residue specific gap penalties.
>>
>> ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks.
>>
>> EXTRA: Implement some benchmark technique to measure the final
>> alignment quality.
>>
>> Main Risks
>> ----------
>>
>> The main risk to this project is the intrinsic complexity of the MSA
>> progressive algorithm. To deal with that we decided to break the
>> implementation in a large number of small and manageable steps, and
>> the steps are designed in a way that, at the end of each of them, we
>> will have a complete and testable new function (or a modification of
>> an existing one). Besides that, to be extra careful the project aims
>> to produce a simple fully functional MSA algorithm as early as
>> possible, the estimated time is 8 weeks, this way we guarantee to
>> deliver at least a simpler, but working and bug-free, version.

From sheoran143 at gmail.com Sun Apr 11 19:16:29 2010
From: sheoran143 at gmail.com (Deepak Sheoran)
Date: Sun, 11 Apr 2010 14:16:29 -0500
Subject: [Biojava-l] Issue with SimpleNCBITaxon class
Message-ID: <4BC2200D.8000109@gmail.com>

Hi,

There is a very fundamental issue in the SimpleNCBITaxon class, because of which it is producing a wrong taxonomy hierarchy. I am explaining what I have found; let me know what you think of it, and suggest how to fix it.

1) Columns in the taxon table are (taxon_id, ncbi_taxon_id, parent_taxon_id, nodeRank, geneticCode, mitoGeneticCode, leftValue, rightValue)
2) In the class SimpleNCBITaxon we assume that "parent_taxon_id" holds the parent's ncbi_taxon_id for the current ncbi_taxon_id value, but that is not true. The value that "parent_taxon_id" holds is the "taxon_id" of the row that has the parent's ncbi_taxon_id for the current ncbi_taxon_id.

----- it's not correct: column parent_taxon_id stores the taxon_id of the row that has the parent's ncbi_taxon_id for the current entry

Thanks
Deepak Sheoran

From holland at eaglegenomics.com Sun Apr 11 19:53:06 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Sun, 11 Apr 2010 20:53:06 +0100
Subject: [Biojava-l] Issue with SimpleNCBITaxon class
In-Reply-To: <4BC2200D.8000109@gmail.com>
References: <4BC2200D.8000109@gmail.com>
Message-ID:

I'm sorry but I don't understand your example. Could you provide a real example of correct values for each column from a sample taxon entry in NCBI, plus an example of what BioJava is doing wrong? (i.e. give a sample record to use as reference, then point out the correct value of parent_taxon_id, and point out what value BioJava is using instead).

thanks,
Richard

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From sheoran143 at gmail.com Sun Apr 11 21:08:22 2010
From: sheoran143 at gmail.com (Deepak Sheoran)
Date: Sun, 11 Apr 2010 16:08:22 -0500
Subject: [Biojava-l] Issue with SimpleNCBITaxon class
In-Reply-To:
References: <4BC2200D.8000109@gmail.com>
Message-ID: <4BC23A46.7090304@gmail.com>

I am using the same table with the BioJava and BioPerl taxon programs, and the output I get is below:

*Biojava:*
For example, for ncbi_taxon_id = 11876 (Avian sarcoma virus), the lineage I get is
Rhamnus; Platanus occidentalis; Suillus placidus; Diadasia australis; Arnicastrum guerrerense; Labiduridae; Oreostemma alpigenum var. haydenii.

Biojava process of finding names: 11876==>301952==>50447==>176516==>143975==>48579==>4403==>3609==>276240 (wrong way of doing things)

*Bioperl:*
For example, for ncbi_taxon_id = 11876 (Avian sarcoma virus), the lineage I get is
Retroviridae; Orthoretrovirinae; Alpharetrovirus; unclassified Alpharetrovirus.

Bioperl process of finding names: 11876==>353825==>153057==>327045==>11632 (right way of doing things)

Hint: BioJava searches the ncbi_taxon_id column with a value from parent_taxon_id, whereas BioPerl searches the taxon_id column with a value from parent_taxon_id.

*Taxon and taxon_name table content relevant to this discussion:*

taxon_id  ncbi_taxon_id  parent_taxon_id  node_rank  name                                name_class
2901      3609           276240           genus      Rhamnus                             scientific name
3610      4403           3609             species    Platanus occidentalis               scientific name
29052     48579          4403             species    Suillus placidus                    scientific name
114412    143975         48579            species    Diadasia australis                  scientific name
143976    176516         143975           species    Arnicastrum guerrerense             scientific name
30680     50447          176516           family     Labiduridae                         scientific name
254757    301952         50447            varietas   Oreostemma alpigenum var. haydenii  scientific name
9394      11632          17394            family     Retroviridae                        scientific name
277861    327045         9394             subfamily  Orthoretrovirinae                   scientific name
122448    153057         277861           genus      Alpharetrovirus                     scientific name
301952    353825         122448           no rank    unclassified Alpharetrovirus        scientific name
9584      11876          301952           species    Avian sarcoma virus                 scientific name

Thanks
Deepak

From sheoran143 at gmail.com Sun Apr 11 22:48:00 2010
From: sheoran143 at gmail.com (Deepak Sheoran)
Date: Sun, 11 Apr 2010 17:48:00 -0500
Subject: [Biojava-l] Issue with SimpleNCBITaxon class
In-Reply-To: <4BC23A46.7090304@gmail.com>
References: <4BC2200D.8000109@gmail.com> <4BC23A46.7090304@gmail.com>
Message-ID: <4BC251A0.4090602@gmail.com>

If we don't want to change the current code in BioJava and still want to fix this bug, I have found a way:

1) We can do this by changing one of the Hibernate files, called "Taxon.hbm.xml", and replace the line with

By changing the above setting in Hibernate I am able to get the correct lineage for ncbi_taxon_id = 11876 (Avian sarcoma virus), which is
Viruses; Retro-transcribing viruses; Retroviridae; Orthoretrovirinae; Alpharetrovirus; unclassified Alpharetrovirus.

2) But the possible issue we might get is with the taxonomy loader class, which wants to insert something for the parent taxon_id into the taxon table; I think that won't be possible if we make this change to the Hibernate config file.

Deepak Sheoran

From holland at eaglegenomics.com Mon Apr 12 06:57:57 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Mon, 12 Apr 2010 07:57:57 +0100
Subject: [Biojava-l] Issue with SimpleNCBITaxon class
In-Reply-To: <4BC23A46.7090304@gmail.com>
References: <4BC2200D.8000109@gmail.com> <4BC23A46.7090304@gmail.com>
Message-ID:

Thanks Deepak.

I've had a look at the code and I believe it's due to the different ways in which BioJava and BioPerl load the taxon table.

BioJava sets the ncbi_taxon_id and parent_taxon_id columns based on the values from the NCBI taxonomy file. The taxon_id column in BioJava is a meaningless auto-generated value that is never used.

BioPerl however is generating taxon_id values and linking them by setting parent_taxon_id to the generated value. The parent value from the NCBI taxonomy file is therefore replaced with the BioPerl generated parent ID, meaning that instead of linking from parent_taxon_id to ncbi_taxon_id as per BioJava, the link is to taxon_id instead. (I'm basing this comment on looking at load_ncbi_taxonomy.pl from the BioSQL archives.)

I believe if you load the taxonomy table using BioJava, you should see BioJava giving correct behaviour. Likewise if you load it using BioPerl, BioPerl will behave correctly. But if you load with one then query with the other, you'll get incorrect results.

This sounds like a case for discussion on both lists - a matter of standardisation between the two projects. Not quickly/easily solvable for now.
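
To make the two linking conventions concrete, here is a minimal sketch in plain Java. It is an illustration only: TaxonParentWalk, Row, find and ancestors are hypothetical names and not part of BioJava or BioPerl, and the rows are simply the sample taxon/taxon_name values posted earlier in this thread (which look as if they were loaded BioPerl-style). The same parent_taxon_id value is resolved once against ncbi_taxon_id and once against taxon_id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of the taxon rows quoted in this thread (hypothetical, not BioJava API). */
final class TaxonParentWalk {

    /** One taxon row joined with its scientific name. */
    static final class Row {
        final int taxonId;
        final int ncbiTaxonId;
        final int parentTaxonId;
        final String name;

        Row(int taxonId, int ncbiTaxonId, int parentTaxonId, String name) {
            this.taxonId = taxonId;
            this.ncbiTaxonId = ncbiTaxonId;
            this.parentTaxonId = parentTaxonId;
            this.name = name;
        }
    }

    // Sample rows copied from the table posted in the thread.
    static final List<Row> ROWS = Arrays.asList(
        new Row(2901, 3609, 276240, "Rhamnus"),
        new Row(3610, 4403, 3609, "Platanus occidentalis"),
        new Row(29052, 48579, 4403, "Suillus placidus"),
        new Row(114412, 143975, 48579, "Diadasia australis"),
        new Row(143976, 176516, 143975, "Arnicastrum guerrerense"),
        new Row(30680, 50447, 176516, "Labiduridae"),
        new Row(254757, 301952, 50447, "Oreostemma alpigenum var. haydenii"),
        new Row(9394, 11632, 17394, "Retroviridae"),
        new Row(277861, 327045, 9394, "Orthoretrovirinae"),
        new Row(122448, 153057, 277861, "Alpharetrovirus"),
        new Row(301952, 353825, 122448, "unclassified Alpharetrovirus"),
        new Row(9584, 11876, 301952, "Avian sarcoma virus"));

    /** Finds a row by taxon_id (matchOnTaxonId = true) or by ncbi_taxon_id (false). */
    static Row find(int key, boolean matchOnTaxonId) {
        for (Row r : ROWS) {
            int candidate = matchOnTaxonId ? r.taxonId : r.ncbiTaxonId;
            if (candidate == key) {
                return r;
            }
        }
        return null;
    }

    /**
     * Returns ancestor names, most distant ancestor first, for the given ncbi_taxon_id.
     * parentIsTaxonId = true resolves parent_taxon_id against taxon_id,
     * false resolves it against ncbi_taxon_id.
     */
    static List<String> ancestors(int ncbiTaxonId, boolean parentIsTaxonId) {
        List<String> names = new ArrayList<>();
        Row current = find(ncbiTaxonId, false);
        // The size check is only a guard against cycles in malformed data.
        while (current != null && names.size() < ROWS.size()) {
            Row parent = find(current.parentTaxonId, parentIsTaxonId);
            if (parent == null || parent == current) {
                break;
            }
            names.add(0, parent.name);
            current = parent;
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("parent_taxon_id -> ncbi_taxon_id: " + ancestors(11876, false));
        System.out.println("parent_taxon_id -> taxon_id:      " + ancestors(11876, true));
    }
}
```

On these sample rows, resolving against ncbi_taxon_id walks the Rhamnus ... Oreostemma chain, while resolving against taxon_id walks Retroviridae, Orthoretrovirinae, Alpharetrovirus, unclassified Alpharetrovirus. Neither lookup is wrong in itself; each only gives a sensible lineage on a table that was loaded with the matching convention.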

cheers,
Richard

On 11 Apr 2010, at 22:08, Deepak Sheoran wrote:

> I am using the same table with the BioJava and BioPerl taxon programs, and the output I get is below:
>
> BioJava:
> For example, for ncbi_taxon_id = 11876 (Avian sarcoma virus), the lineage I get is
> Rhamnus; Platanus occidentalis; Suillus placidus; Diadasia australis; Arnicastrum guerrerense; Labiduridae; Oreostemma alpigenum var. haydenii.
>
> BioJava process of finding names: 11876==>301952==>50447==>176516==>143975==>48579==>4403==>3609==>276240 (wrong way of doing things)
>
> BioPerl:
> For example, for ncbi_taxon_id = 11876 (Avian sarcoma virus), the lineage I get is
> Retroviridae; Orthoretrovirinae; Alpharetrovirus; unclassified Alpharetrovirus.
>
> BioPerl process of finding names: 11876==>353825==>153057==>327045==>11632 (right way of doing things)
>
> Hint: BioJava searches the ncbi_taxon_id column with a value from parent_taxon_id, whereas BioPerl searches the taxon_id column with a value from parent_taxon_id.
>
> Taxon and taxon_name table content relevant to this discussion:
>
> taxon_id  ncbi_taxon_id  parent_taxon_id  node_rank  name                                name_class
> 2901      3609           276240           genus      Rhamnus                             scientific name
> 3610      4403           3609             species    Platanus occidentalis               scientific name
> 29052     48579          4403             species    Suillus placidus                    scientific name
> 114412    143975         48579            species    Diadasia australis                  scientific name
> 143976    176516         143975           species    Arnicastrum guerrerense             scientific name
> 30680     50447          176516           family     Labiduridae                         scientific name
> 254757    301952         50447            varietas   Oreostemma alpigenum var. haydenii  scientific name
> 9394      11632          17394            family     Retroviridae                        scientific name
> 277861    327045         9394             subfamily  Orthoretrovirinae                   scientific name
> 122448    153057         277861           genus      Alpharetrovirus                     scientific name
> 301952    353825         122448           no rank    unclassified Alpharetrovirus        scientific name
> 9584      11876          301952           species    Avian sarcoma virus                 scientific name
>
> Thanks
> Deepak
>
> On 4/11/2010 2:53 PM, Richard Holland wrote:
>> I'm sorry but I don't understand your example. Could you provide a real example of correct values for each column from a sample taxon entry in NCBI, plus an example of what BioJava is doing wrong? (i.e. give a sample record to use as reference, then point out the correct value of parent_taxon_id, and point out what value BioJava is using instead).
>>
>> thanks,
>> Richard
>>
>> On 11 Apr 2010, at 20:16, Deepak Sheoran wrote:
>>
>>> Hi,
>>>
>>> There is a very fundamental issue in the SimpleNCBITaxon class, because of which it is producing the wrong taxonomy hierarchy. I am explaining what I have found; let me know what you guys think of it, and suggest how to fix it.
>>>
>>> 1) Columns in taxon table are (taxon_id, ncbi_taxon_id, parent_taxon_id, nodeRank, geneticCode, mitoGeneticCode, leftValue, rightValue)
>>> 2) In the class SimpleNCBITaxon we are thinking "parent_taxon_id" to have parent ncbi_taxon_id for current ncbi_taxon_id value, but it's not true. The value which "parent_taxon_id" has is the "taxon_id" which has the parent ncbi_taxon_id of the current ncbi_taxon_id.
>>>
>>> ----- it's not correct; column parent_taxon_id stores the taxon_id which has the parent ncbi_taxon_id for the current entry
>>>
>>> Thanks
>>> Deepak Sheoran
>>>
>

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From holland at eaglegenomics.com Mon Apr 12 07:07:55 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Mon, 12 Apr 2010 08:07:55 +0100
Subject: [Biojava-l] Issue with SimpleNCBITaxon class
In-Reply-To:
References: <4BC2200D.8000109@gmail.com> <4BC23A46.7090304@gmail.com>
Message-ID:

Incidentally, BioJava's approach matches the description in the BioSQL docs at:

http://biosql.org/wiki/Schema_Overview#TAXON.2C_TAXON_NAME

(first example SQL statement - find the taxon id of the parent taxon for 'Homo sapiens' using a self-join)

The BioPerl/BioSQL load_ncbi_taxonomy.pl script however does not match this description.

cheers,
Richard

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From mara.axiom at gmail.com Tue Apr 13 14:55:50 2010
From: mara.axiom at gmail.com (Mara Axiom)
Date: Tue, 13 Apr 2010 10:55:50 -0400
Subject: [Biojava-l] BioJava implementation of a phylogenetic tree reconstruction algorithm
Message-ID:

Hello all,

Does anyone have a BioJava implementation of a phylogenetic tree reconstruction algorithm other than neighbor-joining or UPGMA? I need this for research. We already have neighbor-joining and UPGMA implementations, and we want to look at other algorithms. I am new to BioJava, so any information will help.

Here is what we want:

1 - Compare sequences in a FASTA file, and find sequences that are similar to each other.
2 - Construct the tree.
3 - Output the tree in Newick (XML will work too) format.
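
[Editor's sketch, not part of the original message.] Step 3 in the list is independent of whichever reconstruction algorithm is chosen for step 2. The following is a minimal illustration in plain Java, not a BioJava API: the NewickSketch class, its Node type and the toNewick method are hypothetical names. It assumes labels contain no characters that would need quoting in Newick, and uses a negative branch length to mean "no length recorded".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration: serialises a rooted tree to a Newick string. */
final class NewickSketch {

    /** A tree node; internal nodes may have an empty label. */
    static final class Node {
        final String label;
        final double branchLength; // negative means "no length recorded"
        final List<Node> children = new ArrayList<>();

        Node(String label, double branchLength) {
            this.label = label;
            this.branchLength = branchLength;
        }

        /** Adds a child and returns this node so calls can be chained. */
        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Returns the Newick representation of the tree, terminated by ';'. */
    static String toNewick(Node root) {
        StringBuilder sb = new StringBuilder();
        write(root, sb);
        return sb.append(';').toString();
    }

    private static void write(Node node, StringBuilder sb) {
        if (!node.children.isEmpty()) {
            // Children are written first, comma-separated, inside parentheses.
            sb.append('(');
            for (int i = 0; i < node.children.size(); i++) {
                if (i > 0) {
                    sb.append(',');
                }
                write(node.children.get(i), sb);
            }
            sb.append(')');
        }
        sb.append(node.label);
        if (node.branchLength >= 0) {
            sb.append(':').append(node.branchLength);
        }
    }

    public static void main(String[] args) {
        Node inner = new Node("", 0.5)
            .add(new Node("B", 0.2))
            .add(new Node("C", 0.3));
        Node root = new Node("", -1)
            .add(new Node("A", 0.1))
            .add(inner);
        System.out.println(toNewick(root));
    }
}
```

Any tree-building code only needs to produce the Node structure; the writer does not care how the topology or branch lengths were obtained.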

In particular we are interested in implementations of the BNNP ( http://www.cs.cmu.edu/~guyb/papers/SDBHRS06.pdf ) and Align Free ( http://www.math.ucla.edu/~roch/research_files/align-free.pdf ) algorithms, but we are open to other algorithms too.

Please do not recommend a P-tree reconstruction tool. We are only interested in source code that meets our specific purpose.

Thanks in advance,
Mara

From biopython at maubp.freeserve.co.uk Thu Apr 15 17:54:56 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Apr 2010 18:54:56 +0100
Subject: [Biojava-l] Issue with SimpleNCBITaxon class
In-Reply-To:
References: <4BC2200D.8000109@gmail.com> <4BC23A46.7090304@gmail.com>
Message-ID:

Hi,

I've CC'd this to the BioSQL mailing list for cross project discussion.

On Mon, Apr 12, 2010 at 7:57 AM, Richard Holland wrote:
> Thanks Deepak.
>
> I've had a look at the code and I believe its due to the
> different ways in which BioJava and BioPerl load the
> taxon table.
>
> BioJava sets the ncbi_taxon_id and parent_taxon_id
> columns based on the values from the NCBI taxonomy
> file. The taxon_id column in BioJava is a meaningless
> auto-generated value that is never used.
>
> BioPerl however is generating taxon_id values and
> linking them by setting parent_taxon_id to the
> generated value. The parent value from the NCBI
> taxonomy file is therefore replaced with the BioPerl
> generated parent ID, meaning that instead of linking
> from parent_taxon_id to ncbi_taxon_id as per BioJava,
> the link is to taxon_id instead. (I'm basing this
> comment on looking at load_ncbi_taxonomy.pl from
> the BioSQL archives.)

Note that old versions of load_ncbi_taxonomy.pl (which is part of BioSQL, not part of BioPerl) would set taxon_id equal to ncbi_taxon_id, see:
http://bugzilla.open-bio.org/show_bug.cgi?id=2470

This may help explain the confusion.

> I believe if you load the taxonomy table using BioJava,
> you should see BioJava giving correct behaviour.
> Likewise if you load it using BioPerl, BioPerl will
> behave correctly. But if you load with one then query
> with the other, you'll get incorrect results.
>
> This sounds like a case for discussion on both lists -
> a matter of standardisation between the two projects.
> Not quickly/easily solvable for now.

It's not just two projects (BioPerl & BioJava) (grin). It's at least five projects (BioSQL itself plus BioRuby and Biopython).

I'm not sure about BioRuby's implementation, but currently I think BioJava is the odd one out - BioPerl, Biopython, and BioSQL's load_ncbi_taxonomy.pl all make entries in parent_taxon_id reference the automatically generated taxon_id (please correct me if I am wrong).

My personal view is that bioperl-db is the reference implementation and should be followed in the event of any ambiguity within BioSQL. In this particular case, there is actually a BioSQL script to check against too (load_ncbi_taxonomy.pl).

Hopefully Hilmar can give us an official verdict...

Peter

From andreas.draeger at uni-tuebingen.de Wed Apr 7 13:22:26 2010
From: andreas.draeger at uni-tuebingen.de (Andreas Dräger)
Date: Wed, 07 Apr 2010 15:22:26 +0200
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava alignment lead"]
Message-ID: <4BBC8712.90907@uni-tuebingen.de>

Hi all,

This e-mail is just for your information about somebody new, who'd like to contribute to our project.

Cheers
Andreas

--
Dipl.-Bioinform. Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 Tübingen
Germany

Phone: +49-7071-29-70436
Fax: +49-7071-29-5091
-------------- next part --------------
An embedded message was scrubbed...
From: Andreas Dräger
Subject: Re: Fwd: Proposing a project on "Biojava alignment lead"
Date: Wed, 07 Apr 2010 09:27:13 +0200
Size: 4779
URL:

From jbdundas at gmail.com Fri Apr 16 13:57:41 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Fri, 16 Apr 2010 19:27:41 +0530
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava alignment lead"]
In-Reply-To: <4BBD820D.9070200@uni-tuebingen.de>
References: <4BBD820D.9070200@uni-tuebingen.de>
Message-ID:

Dear Sir,

I am very interested in contributing to this project.

I am looking for a good problem, more on the research side. I can also help in coding (I also work as a software engineer: J2EE/Eclipse/JBoss/Tomcat).

Anything that I could work on...

Regards,
Jitesh Dundas

On 4/8/10, Andreas Dräger wrote:
> Hi all,
>
> This e-mail is just for your information about somebody new, who'd like
> to contribute to our project.
>
> Cheers
> Andreas
>
>
> Subject: Re: Fwd: Proposing a project on "Biojava alignment lead"
> From: Andreas Dräger
> Date: Wed, 07 Apr 2010 09:27:13 +0200
> To: Cai Shaojiang
>
> Hi Cai Shaojiang,
>
> Thank you for your e-mail! I don't know what happened to the e-mail list.
> Sometimes it takes a while due to the spam filters, I guess.
>
> > I am a PhD student from National University of Singapore. My major
> > research area is local alignment algorithms and data structures for SNP
> > identification. And I have used Java and Eclipse for years for software
> > development. I am very interested in your GSoC programme. I find that
> > there is a module called "biojava-alignment lead" whose mentor is you. I
> > want to propose a new project on this module. I have several questions
> > about this module.
>
> Yes, that's me. So great to get your support.
>
> > 1. It seems that pairwise alignment is to find similarity between two
> > short sequences. Existing pairwise alignment is based on dynamic
> > programming, is it Smith-Waterman algorithm?
>
> So, currently, BioJava contains three different alignment approaches.
> There are two deterministic algorithms, i.e., Smith-Waterman for local
> alignment and Needleman-Wunsch for global alignment. Third, there is the
> possibility to apply Hidden Markov Models for alignment. An example of
> the latter approach should be in the cookbook.
>
> > 2. What is the exact task of "refactoring of underlying data structures"?
>
> Yes, this is something I did last week already, but it could still be
> improved. The problem was that the alignment algorithms actually
> produced a kind of string that looks similar to the output of BLAST.
> This string contained the score, the computation time, the length of the
> alignment etc. The problem was that people wanted to perform
> higher-level computation on the score value or evaluate some other
> information. Now, the alignment will produce a data structure that
> contains all the information and can, in addition to that, also produce
> such a BLAST-like output. There is, however, still the following
> problem: The data structure requires both sequences in the pair-wise
> alignment to have an identical length. In case of local alignment this
> is especially stupid (actually), because gaps are inserted to fill the
> sequences. And then the data structure tries to keep the old sequence
> coordinates, leading to the effect that the numbers "query start",
> "query end", "subject start", and "subject end" are required to shift
> the sequences against each other when displaying the output. So, you
> cannot easily print the sequences below each other, you first have to
> shift them. Please check out the latest version of this package via
> anonymous svn and have a look ;-)
>
> > 3. My existing research area is aiming to deal with aligning short
> > reads (10s~100s bp) against extremely long sequences (e.g., human
> > genome). As far as I know, there are no existing alignment tools of
> > this kind implemented in Java. Would you consider this direction?
>
> See, this would be very nice to include. But this requires that we no
> longer fill the short sequence with many, many gap symbols (just a waste
> of memory), but improve the data structure. There is already an
> UnequalLengthAlignment (just a data structure, no algorithm) and I think
> we could use this as a starting point. Then your algorithm should only
> produce such a data structure and this would be fine.
>
> > 4. It seems that the existing tools are just lacking some
> > refactoring and representation interfaces. Any more underlying tasks?
>
> Hm. Yes: With the release of BioJava 3 data structures have changed
> again. So maybe there's also some adaptation to the new structure required.
>
> > I am keeping an eye on GSoC from last month, but sorry to find out
> > that I sent the initial email to the mailing list before I subscribed to it...
>
> Ok. Sounds good. Thanks for your interest. So I suggest: Download the
> latest trunk, have a look, play around and if you can improve something
> we'll put it into the trunk and write your name into the authors' tag.
>
> Cheers
> Andreas
>

From chapman at cs.wisc.edu Fri Apr 16 17:28:33 2010
From: chapman at cs.wisc.edu (Mark Chapman)
Date: Fri, 16 Apr 2010 12:28:33 -0500
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava alignment lead"]
In-Reply-To:
References: <4BBD820D.9070200@uni-tuebingen.de>
Message-ID: <4BC89E41.4030009@cs.wisc.edu>

A great place to start finding ideas is the wiki.
Both http://biojava.org/wiki/BioJava:Modules and http://biojava.org/wiki/BioJava3_Proposal list the next steps planned/desired for BioJava. What research area did you have in mind? Have fun, Mark On 4/16/2010 8:57 AM, jitesh dundas wrote: > Dear Sir, > > I am very interested in contributing to this project. > > I am looking for a good problem,more on the research side. I can also > help in coding (I also work as a software > engineer-j2ee/eclipse/jboss/tomcat .. > > Anything that I could work on... > > Regards, > Jitesh Dundas > > On 4/8/10, Andreas Dr?ger wrote: >> Hi all, >> >> This e-mail is just for your information about somebody new, who'd like >> to contribute to our project. >> >> Cheers >> Andreas >> >> >> Subject: >> Re: Fwd: Proposing a project on "Biojava alignment lead" >> From: >> Andreas Dr?ger >> Date: >> Wed, 07 Apr 2010 09:27:13 +0200 >> To: >> Cai Shaojiang >> >> Hi Cai Shaojiang, >> >> Thank you for you e-mail! I don't know what happened to the e-mail list. >> Sometimes it takes a while due to the spam filters, I guess. >> >> > I am a PhD student from National University of Singapore. My major >> research area is local alignment algorithms and data structures for SNP >> identification. And I have used Java and Eclipse for years for software >> development. I am very interested in your GSoC programme. I find that >> there is a module called "biojava-alignment lead" whose mentor is you. I >> want to propose a new project on this module. I have several questions >> about this module. >> >> Yes, that's me. So great to get your support. >> >> > 1. It seems that pairwise alignment is to find similarity between two >> short sequences. Existing pairwise alignment is based on dynamic >> programming, is it Smith-Waterman algorithm? >> >> So, currently, BioJava contains three different alignment approaches. >> There are two deterministic algorithms, i.e., Smith-Waterman for local >> alignment and Needleman-Wunsch for global alignment. 
Third, there is the >> possibility to apply Hidden Markov Models for alignment. An example of >> the latter approach should be in the cookbook. >> >> > 2. What is the exact task of "refactoring of underlying data structures"? >> >> Yes, this is something, I did last week already but it could still be >> improved. The problem was that the alignment algorithms actually >> produced a kind of string that looks similar to the output of BLAST. >> This string contained the score, the computation time, the length of the >> alignment etc. The problem was that people wanted to perform >> higher-level computation on the score value or evaluate some other >> information. Now, the alignment will produce a data structure that >> contains all the information and can, in addition to that, also produce >> such a BLAST-like output. There is, however, still the following >> problem: The data structure requires both sequences in the pair-wise >> alignment to have an identical length. In case of local alignment this >> is especially stupid (actually), because gaps are inserted to fill the >> sequences. And then the data structure tries to keep the old sequence >> coordinates, leading to the effect that the numbers "query start", >> "query end", "subject start", and "subject end" are required to shift >> the sequences against each other when displaying the output. So, you >> cannot easily print the sequences below of each other, you first have to >> shift them. Please check out the latest version of this package via >> anonymeous svn and have a look ;-) >> >> > 3. My existing research area is aiming to deal with aligning short >> read (10s~100s bp) against extremely long sequences (e.g., human >> genome). Af far as I know, there is not existing such alignment tools >> implemented in Java. Would you consider this direction? >> >> See, this would be very nice to include. 
But this requires that we no >> longer fill the short sequence with many, many gap symbols (just a waist >> of memory), but improve the data structure. There is already an >> UnequalLenghtAlignment (just a data structure, no algorithm) and I think >> we could use this as a starting point. Then your algorithm should only >> produce such a data structure and this would be fine. >> >> > 4. It seems that the existing tools is just lacking of some >> refactoring and representation interfaces. Any more underlying tasks? >> >> Hm. Yes: With the release of BioJava 3 data structures have changed >> again. So maybe there's also some adaptation to the new structure required. >> >> > I am keeping an eye on GSoC from last month, but sorry to find out >> that I sent the initial email to the mailing list before I subscribe it... >> >> Ok. Sounds good. Thanks for your interest. So I suggest: Download the >> latest trunk, have a look, play around and if you can improve something >> we'll put it into the trunk and write your name into the authors' tag. >> >> Cheers >> Andreas >> >> -- >> Dipl.-Bioinform. 
Andreas Dr?ger >> Eberhard Karls University T?bingen >> Center for Bioinformatics (ZBIT) >> Sand 1 >> 72076 T?bingen >> Germany >> >> Phone: +49-7071-29-70436 >> Fax: +49-7071-29-5091 >> _______________________________________________ >> Biojava-l mailing list - Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l >> > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l From sheoran143 at gmail.com Fri Apr 16 18:43:59 2010 From: sheoran143 at gmail.com (Deepak Sheoran) Date: Fri, 16 Apr 2010 13:43:59 -0500 Subject: [Biojava-l] Issue with SimpleNCBITaxon class In-Reply-To: References: <4BC2200D.8000109@gmail.com> <4BC23A46.7090304@gmail.com> Message-ID: <4BC8AFEF.70107@gmail.com> What my experience says on this issue we should make use of taxon_id because its a unique key in a local instance of biosql. ncbi_taxon_id should only be used for mapping purpose only so that a person can map his local taxon_id to a ncbi_taxon_id otherwise it defeat the sole purpose of having taxon_id as primary key in taxon table. The main goal which I think when biosql is designed is to make it independent of any other organization like genbank or NCBI but its a feature so that we can map a number(ncbi_taxon_id) given by a know authority to a local number (taxon_id). Deepak Sheoran On 4/15/2010 12:54 PM, Peter wrote: > Hi, > > I've CC'd this to the BioSQL mailing list for cross project > discussion. > > On Mon, Apr 12, 2010 at 7:57 AM, Richard Holland wrote: > >> Thanks Deepak. >> >> I've had a look at the code and I believe its due to the >> different ways in which BioJava and BioPerl load the >> taxon table. >> >> BioJava sets the ncbi_taxon_id and parent_taxon_id >> columns based on the values from the NCBI taxonomy >> file. The taxon_id column in BioJava is a meaningless >> auto-generated value that is never used. 
>> >> BioPerl however is generating taxon_id values and >> linking them by setting parent_taxon_id to the >> generated value. The parent value from the NCBI >> taxonomy file is therefore replaced with the BioPerl >> generated parent ID, meaning that instead of linking >> from parent_taxon_id to ncbi_taxon_id as per BioJava, >> the link is to taxon_id instead. (I'm basing this >> comment on looking at load_ncbi_taxonomy.pl from >> the BioSQL archives.) >> > Note that old versions of load_ncbi_taxonomy.pl > (which is part of BioSQL, not part of BioPerl) would > set taxon_id equal to ncbi_taxon_id, see: > http://bugzilla.open-bio.org/show_bug.cgi?id=2470 > > This may help explain the confusion. > > >> I believe if you load the taxonomy table using BioJava, >> you should see BioJava giving correct behaviour. >> Likewise if you load it using BioPerl, BioPerl will >> behave correctly. But if you load with one then query >> with the other, you'll get incorrect results. >> >> This sounds like a case for discussion on both lists - >> a matter of standardisation between the two projects. >> Not quickly/easily solvable for now. >> > Its not just two projects (BioPerl& BioJava) (grin). > Its at least five projects (BioSQL itself plus BioRuby > and Biopython). > > I'm not sure about BioRuby's implementation, but > currently I think BioJava is the odd one out - BioPerl, > Biopython, and the BioSQL's load_ncbi_taxonomy.pl > all make entries in parent_taxon_id reference the > automatically generated taxon_id (please correct > me if I am wrong). > > My personal view is that bioperl-db is the reference > implementation and should be followed in the event > of any ambiguity within BioSQL. In this particular > case, there is actually a BioSQL script to check > against too (load_ncbi_taxonomy.pl). > > Hopefully Hilmar can give us an official verdict... 
> > Peter > From jbdundas at gmail.com Sat Apr 17 02:20:12 2010 From: jbdundas at gmail.com (jitesh dundas) Date: Sat, 17 Apr 2010 07:50:12 +0530 Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava alignment lead"] In-Reply-To: <4BC89E41.4030009@cs.wisc.edu> References: <4BBD820D.9070200@uni-tuebingen.de> <4BC89E41.4030009@cs.wisc.edu> Message-ID: Hi Everyone, I went throug the URLs sent by Dr Chapman. Interesting work that you are doing here.:)... I was wondering if there is anyone who could consider on these. I would like to also be a part of the research work being carried out using Biojava( especially in sequence alignment, miRNA signature Analysis (especially for cancers)...) 1) A set of tools for converting flat data (e.g. sequence strings, taxononmy strings) into BioJava-like objects (e.g. SymbolLists, NCBITaxon). These BioJava-like objects could then be used for more advanced applications. A set of tools for manipulating the BioJava-like objects. 2) Module?: biojava-ws-blast Module?: biojava-ws-biolit Proposed Module: biojava-j2ee Lead: Mark Schreiber - This would probably take the form of SessionBeans and WebServices that can be deployed to Glassfish/ JBoss etc to provide biological services for people who want to make client server or SOA apps. 3) I also liked what Mr. Gang Wu is working on(I read the discussions). I was wondering if I could do something of that sort... May I request the leads to tell me how I could chip in... Regards, Jitesh Dundas On 4/16/10, Mark Chapman wrote: > A great place to start finding ideas is the wiki. > Both http://biojava.org/wiki/BioJava:Modules > and http://biojava.org/wiki/BioJava3_Proposal > list the next steps planned/desired for BioJava. > > What research area did you have in mind? > > Have fun, > Mark > > > On 4/16/2010 8:57 AM, jitesh dundas wrote: >> Dear Sir, >> >> I am very interested in contributing to this project. >> >> I am looking for a good problem,more on the research side. 
I can also >> help in coding (I also work as a software >> engineer-j2ee/eclipse/jboss/tomcat .. >> >> Anything that I could work on... >> >> Regards, >> Jitesh Dundas >> >> On 4/8/10, Andreas Dr?ger wrote: >>> Hi all, >>> >>> This e-mail is just for your information about somebody new, who'd like >>> to contribute to our project. >>> >>> Cheers >>> Andreas >>> >>> >>> Subject: >>> Re: Fwd: Proposing a project on "Biojava alignment lead" >>> From: >>> Andreas Dr?ger >>> Date: >>> Wed, 07 Apr 2010 09:27:13 +0200 >>> To: >>> Cai Shaojiang >>> >>> Hi Cai Shaojiang, >>> >>> Thank you for you e-mail! I don't know what happened to the e-mail list. >>> Sometimes it takes a while due to the spam filters, I guess. >>> >>> > I am a PhD student from National University of Singapore. My major >>> research area is local alignment algorithms and data structures for SNP >>> identification. And I have used Java and Eclipse for years for software >>> development. I am very interested in your GSoC programme. I find that >>> there is a module called "biojava-alignment lead" whose mentor is you. I >>> want to propose a new project on this module. I have several questions >>> about this module. >>> >>> Yes, that's me. So great to get your support. >>> >>> > 1. It seems that pairwise alignment is to find similarity between >>> two >>> short sequences. Existing pairwise alignment is based on dynamic >>> programming, is it Smith-Waterman algorithm? >>> >>> So, currently, BioJava contains three different alignment approaches. >>> There are two deterministic algorithms, i.e., Smith-Waterman for local >>> alignment and Needleman-Wunsch for global alignment. Third, there is the >>> possibility to apply Hidden Markov Models for alignment. An example of >>> the latter approach should be in the cookbook. >>> >>> > 2. What is the exact task of "refactoring of underlying data >>> structures"? >>> >>> Yes, this is something, I did last week already but it could still be >>> improved. 
The problem was that the alignment algorithms actually >>> produced a kind of string that looks similar to the output of BLAST. >>> This string contained the score, the computation time, the length of the >>> alignment etc. The problem was that people wanted to perform >>> higher-level computation on the score value or evaluate some other >>> information. Now, the alignment will produce a data structure that >>> contains all the information and can, in addition to that, also produce >>> such a BLAST-like output. There is, however, still the following >>> problem: The data structure requires both sequences in the pair-wise >>> alignment to have an identical length. In case of local alignment this >>> is especially stupid (actually), because gaps are inserted to fill the >>> sequences. And then the data structure tries to keep the old sequence >>> coordinates, leading to the effect that the numbers "query start", >>> "query end", "subject start", and "subject end" are required to shift >>> the sequences against each other when displaying the output. So, you >>> cannot easily print the sequences below of each other, you first have to >>> shift them. Please check out the latest version of this package via >>> anonymeous svn and have a look ;-) >>> >>> > 3. My existing research area is aiming to deal with aligning short >>> read (10s~100s bp) against extremely long sequences (e.g., human >>> genome). Af far as I know, there is not existing such alignment tools >>> implemented in Java. Would you consider this direction? >>> >>> See, this would be very nice to include. But this requires that we no >>> longer fill the short sequence with many, many gap symbols (just a waist >>> of memory), but improve the data structure. There is already an >>> UnequalLenghtAlignment (just a data structure, no algorithm) and I think >>> we could use this as a starting point. Then your algorithm should only >>> produce such a data structure and this would be fine. >>> >>> > 4. 
It seems that the existing tools are just lacking some >>> refactoring and representation interfaces. Any more underlying tasks? >>> >>> Hm. Yes: With the release of BioJava 3 data structures have changed >>> again. So maybe there's also some adaptation to the new structure >>> required. >>> >>> > I have been keeping an eye on GSoC since last month, but I am sorry to find >>> that I sent the initial email to the mailing list before I subscribed to >>> it... >>> >>> Ok. Sounds good. Thanks for your interest. So I suggest: Download the >>> latest trunk, have a look, play around and if you can improve something >>> we'll put it into the trunk and write your name into the authors' tag. >>> >>> Cheers >>> Andreas >>> >>> -- >>> Dipl.-Bioinform. Andreas Dräger >>> Eberhard Karls University Tübingen >>> Center for Bioinformatics (ZBIT) >>> Sand 1 >>> 72076 Tübingen >>> Germany >>> >>> Phone: +49-7071-29-70436 >>> Fax: +49-7071-29-5091 >>> _______________________________________________ >>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>> >> >> _______________________________________________ >> Biojava-l mailing list - Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l > From jbdundas at gmail.com Sat Apr 17 02:31:46 2010 From: jbdundas at gmail.com (jitesh dundas) Date: Sat, 17 Apr 2010 08:01:46 +0530 Subject: [Biojava-l] Analytical Tool- Prediction of Unknown Protein's location on an a Predicted pathway Message-ID: Dear All, I wanted to propose an analytical tool in BioJava. For example, if we have large datasets with complete pathway information and the related information (e.g., the p53 pathway will have all the genes, proteins, miRNAs involved, etc.), could we find the location of a specific unknown (only predicted) protein/gene on a predicted pathway?
This was a suggestion on the possible things on the analytical side that we could do. Could we think of doing something of this sort for BioJava (or at least make it capable of handling such aspects)? Any ideas / comments are most welcome... Regards, Jitesh Dundas On 4/17/10, jitesh dundas wrote: > Hi Everyone, > > I went through the URLs sent by Dr Chapman. Interesting work that you > are doing here. :) > > I was wondering if there is anyone who could consider these. I > would also like to be a part of the research work being carried out > using BioJava (especially in sequence alignment, miRNA signature > analysis (especially for cancers)...) > > 1) A set of tools for converting flat data (e.g. sequence strings, > taxonomy strings) into BioJava-like objects (e.g. SymbolLists, > NCBITaxon). These BioJava-like objects could then be used for more > advanced applications. > A set of tools for manipulating the BioJava-like objects. > > 2) Module?: biojava-ws-blast Module?: biojava-ws-biolit > Proposed Module: biojava-j2ee Lead: Mark Schreiber > > - This would probably take the form of SessionBeans and WebServices > that can be deployed to Glassfish/JBoss etc. to provide biological > services for people who want to make client-server or SOA apps. > > 3) I also liked what Mr. Gang Wu is working on (I read the > discussions). I was wondering if I could > do something of that sort... > > May I request the leads to tell me how I could chip in... > > Regards, > Jitesh Dundas > > > > On 4/16/10, Mark Chapman wrote: >> A great place to start finding ideas is the wiki. >> Both http://biojava.org/wiki/BioJava:Modules >> and http://biojava.org/wiki/BioJava3_Proposal >> list the next steps planned/desired for BioJava. >> >> What research area did you have in mind? >> >> Have fun, >> Mark >> >> >> On 4/16/2010 8:57 AM, jitesh dundas wrote: >>> Dear Sir, >>> >>> I am very interested in contributing to this project. >>> >>> I am looking for a good problem, more on the research side.
I can also >>> help in coding (I also work as a software >>> engineer-j2ee/eclipse/jboss/tomcat .. >>> >>> Anything that I could work on... >>> >>> Regards, >>> Jitesh Dundas >> > From jbdundas at gmail.com Sat Apr 17 13:34:20 2010 From: jbdundas at gmail.com (jitesh dundas) Date: Sat, 17 Apr 2010 19:04:20 +0530 Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava alignment lead"] In-Reply-To: <4BBD820D.9070200@uni-tuebingen.de> References: <4BBD820D.9070200@uni-tuebingen.de> Message-ID: Dear SIr, Could anyone tell me where I could start?
Is there any lead who might need my help in Software Development and research-oriebted aspects? Any comments on my previous emails would be most welcomed... Regards, JItesh Dundas From caishaojiang at gmail.com Mon Apr 19 03:16:39 2010 From: caishaojiang at gmail.com (Cai Shaojiang) Date: Sun, 18 Apr 2010 20:16:39 -0700 Subject: [Biojava-l] [Fwd: Re: GSoC project on MSA] In-Reply-To: <4BC84CD5.7030703@uni-tuebingen.de> References: <4BBC80A8.5000608@uni-tuebingen.de> <4BBDCFD2.3000507@uni-tuebingen.de> <4BC84CD5.7030703@uni-tuebingen.de> Message-ID: Sorry to disturb you again. But when i wanted to modify my proposal in GSOC, i got the error "This page is inactive at this time." So we cannot modify the proposal now? Could you help me? Thanks.
From andreas at sdsc.edu Mon Apr 19 03:58:05 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Sun, 18 Apr 2010 20:58:05 -0700 Subject: [Biojava-l] Fwd: Biojava3-genetics In-Reply-To: <33AFFE3255BCA043AF09514A6F6BFBAED04C3C@scomp0039.wurnet.nl> References: <4BC806F4.3090302@wur.nl> <33AFFE3255BCA043AF09514A6F6BFBAED04C3C@scomp0039.wurnet.nl> Message-ID: Hi Richard, I am forwarding your message to the mailing list, since that is the best place to meet other people interested in genetics application. The BioJava source code is available via anonymous svn or the download page on the wiki. Andreas ---------- Forwarded message ---------- From: Finkers, Richard Date: Sat, Apr 17, 2010 at 12:46 AM Subject: RE: Biojava3-genetics To: Andreas Prlic Hi Andreas, To start with, associations with e.g. sequence variation (454) and phenotype data within larger sets of genetically different individuals. This will be code which I will have to write the coming year for one of my projects. I am planning to use this in combination the sequence and phylogeny based biojava modules. I also might consider migrating some of my current code to this module. This includes graphical representations of genetic data but also some statistical analysis for which we use the package R for the calculations but the rest of the data handling / formatting is done in Java. Some of the functionality, that I am thinking about, is available from other packages but I did not find the (java) source code. Richard -----Original Message----- From: andreas.prlic at gmail.com on behalf of Andreas Prlic Sent: Fri 2010-04-16 19:39 To: Finkers, Richard Cc: biojava-dev at lists.open-bio.org Subject: Re: Biojava3-genetics Hi Richard, any contribution is welcome. What do you have in mind in particular? Perhaps there is already something there along those lines... 
Andreas On Thu, Apr 15, 2010 at 11:43 PM, Richard Finkers wrote: > Dear List, > > I would be interested in adding a module for genetic analysis to the > biojava3 project. Are there others who are interested in this as well and > with whom should I discuss this further? > > Thanks, > Richard > > > -- > Dr. Richard Finkers > Researcher Plant Breeding > Wageningen UR Plant Breeding > P.O. Box 16, 6700 AA, Wageningen, The Netherlands > Wageningen Campus, Building 107, Droevendaalsesteeg 1, 6708 PB > Wageningen, The Netherlands > Tel. +31-317-484165 Fax +31-317-418094 > http://www.plantbreeding.wur.nl/ > https://www.eu-sol.wur.nl/ > https://cbsgdbase.wur.nl/ > http://solgenomics.wur.nl/ > http://www.disclaimer-uk.wur.nl/ > > -- ----------------------------------------------------------------------- Dr. Andreas Prlic Senior Scientist, RCSB PDB Protein Data Bank University of California, San Diego (+1) 858.246.0526 ----------------------------------------------------------------------- From andreas at sdsc.edu Mon Apr 19 04:14:24 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Sun, 18 Apr 2010 21:14:24 -0700 Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava alignment lead"] In-Reply-To: References: <4BBD820D.9070200@uni-tuebingen.de> Message-ID: Hi Jitesh, BioJava is an open source project with the goal of supporting Bioinformatics applications. While we are always happy about any contribution, be it documentation, bug fixes or email support on the mailing list, for a research-related project it is probably easier to team up with your local university and do an internship there.
Andreas On Sat, Apr 17, 2010 at 6:34 AM, jitesh dundas wrote: > Dear SIr, > > Could anyone tell me where I could start? Is there any lead who might need > my help in Software Development and research-oriebted aspects? > > Any comments on my previous emails would be most welcomed... > > Regards, > JItesh Dundas > -- ----------------------------------------------------------------------- Dr.
Andreas Prlic Senior Scientist, RCSB PDB Protein Data Bank University of California, San Diego (+1) 858.246.0526 ----------------------------------------------------------------------- From jbdundas at gmail.com Mon Apr 19 08:33:57 2010 From: jbdundas at gmail.com (jitesh dundas) Date: Mon, 19 Apr 2010 14:03:57 +0530 Subject: [Biojava-l] Fwd: Biojava3-genetics In-Reply-To: References: <4BC806F4.3090302@wur.nl> <33AFFE3255BCA043AF09514A6F6BFBAED04C3C@scomp0039.wurnet.nl> Message-ID: Dear Sir, I would like to work on this module. How can I help? Regards, Jitesh Dundas From andreas.draeger at uni-tuebingen.de Wed Apr 21 03:17:05 2010 From: andreas.draeger at uni-tuebingen.de (=?ISO-8859-1?Q?Andreas_Dr=E4ger?=) Date: Wed, 21 Apr 2010 12:17:05 +0900 Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava alignment lead"] In-Reply-To: References: <4BBD820D.9070200@uni-tuebingen.de> Message-ID: <4BCE6E31.70504@uni-tuebingen.de> Hi Jitesh, Thanks for your interest to contribute to our BioJava project! In the alignment package, lots of help is required. What would be very nice, is a verstatile visual representation of the alignment data structures that can be included into graphical user interfaces with little effort. To this end, it should be very flexible and abstract. Would you be interested? Cheers Andreas -- Dipl.-Bioinform. Andreas Dr?ger Eberhard Karls University T?bingen Center for Bioinformatics (ZBIT) Sand 1 72076 T?bingen Germany Phone: +49-7071-29-70436 Fax: +49-7071-29-5091 From mitlox at op.pl Wed Apr 21 10:46:22 2010 From: mitlox at op.pl (xyz) Date: Wed, 21 Apr 2010 20:46:22 +1000 Subject: [Biojava-l] Reading and writting Fastq files In-Reply-To: <5EBA99CE-1DAC-442A-B7FD-2E738C7F586B@eaglegenomics.com> References: <20100330215047.084f6b00@wp01> <20100408213013.63a99b8c@wp01> <5EBA99CE-1DAC-442A-B7FD-2E738C7F586B@eaglegenomics.com> Message-ID: <20100421204622.68f9ac1b@wp01> On Thu, 8 Apr 2010 12:36:36 +0100 Richard Holland wrote: > You haven't included the two import static lines in your code. See > first two lines of Michael's example code (expanding the ellipses to > the full classpath).
> Thank you it was enough to include import static org.biojavax.bio.seq.RichSequence.Tools.createRichSequence; Usually Netbeans solve this kind of problems for me, but this time was no help from the IDE. From mitlox at op.pl Wed Apr 21 11:18:24 2010 From: mitlox at op.pl (xyz) Date: Wed, 21 Apr 2010 21:18:24 +1000 Subject: [Biojava-l] readFasta problem In-Reply-To: <5B252A84-6822-4978-BBD1-83A8BA09EBC5@eaglegenomics.com> References: <20100408213052.662beb8e@wp01> <5B252A84-6822-4978-BBD1-83A8BA09EBC5@eaglegenomics.com> Message-ID: <20100421211824.75b7ada2@wp01> On Thu, 8 Apr 2010 12:41:25 +0100 Richard Holland wrote: > You have passed null into the tokenizer parameter of > RichSequence.IOTools.readFasta() - this is not allowed. The parser > cannot guess the type of sequence, it must be told what to expect by > specifying the tokenizer to use. (Importantly this also means that > you cannot mix different types of sequence within the same file to be > parsed.) > Thank you. Q1: Does RichSequenceIterator read the complete file in memory and then I retrieve each read from memory? Or does it read the file line by line and I get each read? 
Q2: Why am I not able to retrieve the header from the following FASTA file:

>1
atccccc
>2
atccccctttttt
>3
atccccccccccccccccctttt
>4
tttttttccccccccccccccccccccccc
>5
tttttttcccccccccccccccccccccca

with the following code:

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import org.biojava.bio.BioException;
import org.biojava.bio.seq.io.SymbolTokenization;
import org.biojava.bio.symbol.AlphabetManager;
import org.biojavax.bio.seq.RichSequence;
import org.biojavax.bio.seq.RichSequenceIterator;

public class SortFasta {

    public static void main(String[] args) throws FileNotFoundException, BioException {
        BufferedReader br = new BufferedReader(new FileReader("sortFasta.fasta"));
        String type = "DNA";
        SymbolTokenization toke = AlphabetManager.alphabetForName(type).getTokenization("token");

        RichSequenceIterator rsi = RichSequence.IOTools.readFasta(br, toke, null);

        while (rsi.hasNext()) {
            RichSequence rs = rsi.nextRichSequence();
            System.out.println(rs.getDescription());
            System.out.println(rs.seqString());
        }
    }
}

What did I do wrong when trying to retrieve the header?

From holland at eaglegenomics.com Wed Apr 21 11:29:57 2010 From: holland at eaglegenomics.com (Richard Holland) Date: Wed, 21 Apr 2010 12:29:57 +0100 Subject: [Biojava-l] readFasta problem In-Reply-To: <20100421211824.75b7ada2@wp01> References: <20100408213052.662beb8e@wp01> <5B252A84-6822-4978-BBD1-83A8BA09EBC5@eaglegenomics.com> <20100421211824.75b7ada2@wp01> Message-ID: On 21 Apr 2010, at 12:18, xyz wrote: > On Thu, 8 Apr 2010 12:41:25 +0100 > Richard Holland wrote: > >> You have passed null into the tokenizer parameter of >> RichSequence.IOTools.readFasta() - this is not allowed. The parser >> cannot guess the type of sequence, it must be told what to expect by >> specifying the tokenizer to use. (Importantly this also means that >> you cannot mix different types of sequence within the same file to be >> parsed.) >> > > Thank you.
> > Q1: > Does RichSequenceIterator read the complete file in memory and then I > retrieve each read from memory? Or does it read the file line by line > and I get each read? Line by line. > Q2: > Why am I not able to retrieve the header from the following fasta file: >> 1 > atccccc >> 2 > atccccctttttt >> 3 > atccccccccccccccccctttt >> 4 > tttttttccccccccccccccccccccccc >> 5 > tttttttcccccccccccccccccccccca > > with the following code: > > import java.io.BufferedReader; > import java.io.FileNotFoundException; > import java.io.FileReader; > import org.biojava.bio.BioException; > import org.biojava.bio.seq.io.SymbolTokenization; > import org.biojava.bio.symbol.AlphabetManager; > import org.biojavax.bio.seq.RichSequence; > import org.biojavax.bio.seq.RichSequenceIterator; > > public class SortFasta { > > public static void main(String[] args) throws FileNotFoundException, > BioException { > > > BufferedReader br = new BufferedReader(new > FileReader("sortFasta.fasta")); String type = "DNA"; > SymbolTokenization toke = AlphabetManager.alphabetForName(type) > .getTokenization("token"); > > > RichSequenceIterator rsi = RichSequence.IOTools.readFasta(br, toke, > null); > > while (rsi.hasNext()) { > RichSequence rs = rsi.nextRichSequence(); > System.out.println(rs.getDescription()); > System.out.println(rs.seqString()); > } > } > } > > What did I wrong in order to retrieve the header? Try the other methods on RichSequence - getName() for instance. cheers, Richard -- Richard Holland, BSc MBCS Operations and Delivery Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From mitlox at op.pl Wed Apr 21 12:40:48 2010 From: mitlox at op.pl (xyz) Date: Wed, 21 Apr 2010 22:40:48 +1000 Subject: [Biojava-l] NCBI Accession Number prefixes Message-ID: <20100421224048.1848c2f2@wp01> Hello, is it possible to download GenBank entries (AC) with BioJava? Thank you in advance. 
Best regards,

From holland at eaglegenomics.com Wed Apr 21 12:44:16 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 21 Apr 2010 13:44:16 +0100
Subject: [Biojava-l] NCBI Accession Number prefixes
In-Reply-To: <20100421224048.1848c2f2@wp01>
References: <20100421224048.1848c2f2@wp01>
Message-ID: <577294DB-EABD-48DF-A55A-5DA9629AC352@eaglegenomics.com>

See http://www.biojava.org/docs/api/org/biojavax/bio/db/ncbi/GenbankRichSequenceDB.html

On 21 Apr 2010, at 13:40, xyz wrote:

> Hello,
> is it possible to download GenBank entries (AC) with BioJava?
>
> Thank you in advance.
>
> Best regards,
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From jbdundas at gmail.com Wed Apr 21 13:45:00 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Wed, 21 Apr 2010 19:15:00 +0530
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava alignment lead"]
In-Reply-To: <4BCE6E31.70504@uni-tuebingen.de>
References: <4BBD820D.9070200@uni-tuebingen.de> <4BCE6E31.70504@uni-tuebingen.de>
Message-ID:

Yes Sir, I will be very interested. Please send me the details. I will be
working on weekends, though, as office work is taking up my time right now.

Regards,
jd

On Wed, Apr 21, 2010 at 8:47 AM, Andreas Dräger <andreas.draeger at uni-tuebingen.de> wrote:

> Hi Jitesh,
>
> Thanks for your interest in contributing to our BioJava project! In the
> alignment package, lots of help is required. What would be very nice is a
> versatile visual representation of the alignment data structures that can
> be included in graphical user interfaces with little effort. To this end,
> it should be very flexible and abstract. Would you be interested?
>
>
> Cheers
> Andreas
>
> --
> Dipl.-Bioinform. Andreas Dräger
> Eberhard Karls University Tübingen
> Center for Bioinformatics (ZBIT)
> Sand 1
> 72076 Tübingen
> Germany
>
> Phone: +49-7071-29-70436
> Fax: +49-7071-29-5091
>

From er.indupandey at gmail.com Fri Apr 23 08:11:05 2010
From: er.indupandey at gmail.com (indu pandey)
Date: Fri, 23 Apr 2010 01:11:05 -0700
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava alignment lead"]
In-Reply-To:
References: <4BBD820D.9070200@uni-tuebingen.de> <4BCE6E31.70504@uni-tuebingen.de>
Message-ID:

Hi all,

can anybody help me write BioJava code for converting a DNA sequence to
the corresponding amino acid sequence?

Regards,
indu

On 4/21/10, jitesh dundas wrote:
>
> Yes Sir, I will be very interested. Please send me the details. I will be
> working on weekends, though, as office work is taking up my time right now.
>
> Regards,
> jd
>
> On Wed, Apr 21, 2010 at 8:47 AM, Andreas Dräger <andreas.draeger at uni-tuebingen.de> wrote:
>
> > Hi Jitesh,
> >
> > Thanks for your interest in contributing to our BioJava project! In the
> > alignment package, lots of help is required. What would be very nice is a
> > versatile visual representation of the alignment data structures that can
> > be included in graphical user interfaces with little effort. To this end,
> > it should be very flexible and abstract. Would you be interested?
> >
> >
> > Cheers
> > Andreas
> >
> > --
> > Dipl.-Bioinform. Andreas Dräger
> > Eberhard Karls University Tübingen
> > Center for Bioinformatics (ZBIT)
> > Sand 1
> > 72076 Tübingen
> > Germany
> >
> > Phone: +49-7071-29-70436
> > Fax: +49-7071-29-5091
> >
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From genjasp at gmail.com Fri Apr 23 08:26:10 2010
From: genjasp at gmail.com (Alessandro Cipriani)
Date: Fri, 23 Apr 2010 10:26:10 +0200
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava alignment lead"]
In-Reply-To:
References: <4BBD820D.9070200@uni-tuebingen.de> <4BCE6E31.70504@uni-tuebingen.de>
Message-ID:

Hi,

follow this link:
http://www.biojava.org/wiki/BioJava:CookBook#Translation

I think it could be useful.

Regards,
ale

2010/4/23 indu pandey

> Hi all,
>
> can anybody help me write BioJava code for converting a DNA sequence to
> the corresponding amino acid sequence?
>
> Regards,
> indu
>
> On 4/21/10, jitesh dundas wrote:
> >
> > Yes Sir, I will be very interested. Please send me the details. I will be
> > working on weekends, though, as office work is taking up my time right now.
> >
> > Regards,
> > jd
> >
> > On Wed, Apr 21, 2010 at 8:47 AM, Andreas Dräger <andreas.draeger at uni-tuebingen.de> wrote:
> >
> > > Hi Jitesh,
> > >
> > > Thanks for your interest in contributing to our BioJava project! In the
> > > alignment package, lots of help is required. What would be very nice is a
> > > versatile visual representation of the alignment data structures that can
> > > be included in graphical user interfaces with little effort. To this end,
> > > it should be very flexible and abstract. Would you be interested?
> > >
> > >
> > > Cheers
> > > Andreas
> > >
> > > --
> > > Dipl.-Bioinform. Andreas Dräger
> > > Eberhard Karls University Tübingen
> > > Center for Bioinformatics (ZBIT)
> > > Sand 1
> > > 72076 Tübingen
> > > Germany
> > >
> > > Phone: +49-7071-29-70436
> > > Fax: +49-7071-29-5091
> > >
> >
> > _______________________________________________
> > Biojava-l mailing list - Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

--
Alessandro Cipriani
(+39) 3206009509
http://www.cipriania.it
skype:genjasp at gmail.com
msn:jaspzz

From thomascramera at dnastar.com Fri Apr 23 22:58:05 2010
From: thomascramera at dnastar.com (Andy Thomas-Cramer)
Date: Fri, 23 Apr 2010 17:58:05 -0500
Subject: [Biojava-l] PDBFileParser and Atom element symbol
Message-ID:

Is there an easy way to identify the type of atom referenced by an Atom
object?

For example, if Atom.getName() is "CA", is the element calcium or the
atom carbon alpha?

If not, would it be feasible to add a method providing this in Atom,
AtomImpl, and parsing it in PDBFileParser, using the columns defined at
http://www.wwpdb.org/documentation/format32/sect9.html#ATOM?

From andreas at sdsc.edu Fri Apr 23 23:52:15 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Fri, 23 Apr 2010 16:52:15 -0700
Subject: [Biojava-l] PDBFileParser and Atom element symbol
In-Reply-To:
References:
Message-ID:

Hi Andy,

you could check with Atom.getFullName(), which keeps the space characters
of the four-column atom name field from the PDB file:
e.g. C-alpha: " CA ", calcium: "CA  "

In addition, the parent group of a C-alpha atom is usually an AminoAcid,
and for calcium it is a Hetatom group...

Andreas

On Fri, Apr 23, 2010 at 3:58 PM, Andy Thomas-Cramer <thomascramera at dnastar.com> wrote:

>
>
> Is there an easy way to identify the type of atom referenced by an Atom
> object?
>
> For example, if Atom.getName() is "CA", is the element calcium or the
> atom carbon alpha?
>
> If not, would it be feasible to add a method providing this in Atom,
> AtomImpl, and parsing it in PDBFileParser, using the columns defined at
> http://www.wwpdb.org/documentation/format32/sect9.html#ATOM?
>
>
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

--
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From mitlox at op.pl Sun Apr 25 05:19:25 2010
From: mitlox at op.pl (xyz)
Date: Sun, 25 Apr 2010 15:19:25 +1000
Subject: [Biojava-l] readFasta problem
In-Reply-To:
References: <20100408213052.662beb8e@wp01> <5B252A84-6822-4978-BBD1-83A8BA09EBC5@eaglegenomics.com> <20100421211824.75b7ada2@wp01>
Message-ID: <20100425151925.1c5c9a03@wp01>

On Wed, 21 Apr 2010 12:29:57 +0100
Richard Holland wrote:

> > Q1:
> > Does RichSequenceIterator read the complete file in memory and then
> > I retrieve each read from memory? Or does it read the file line by
> > line and I get each read?
>
>
> Line by line.

That saves memory.

> > Q2:
> > Why am I not able to retrieve the header from the following fasta
> > file:
> > >1
> > atccccc
> > >2
> > atccccctttttt
> > >3
> > atccccccccccccccccctttt
> > >4
> > tttttttccccccccccccccccccccccc
> > >5
> > tttttttcccccccccccccccccccccca
>
> Try the other methods on RichSequence - getName() for instance.

Thank you, getName() works.
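
A side note on Q1 and Q2: the idea of handling one FASTA record at a time, with the header text kept separate from the residues, can be sketched in plain Java. This is only an illustration under stated assumptions. It is not BioJava code and not how RichSequenceIterator is implemented; the class name FastaStreamSketch and the method copyRecords are hypothetical, and the suffix parameter merely mimics appending "1" to each name.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

/** Hypothetical plain-Java helper (not a BioJava class): streams FASTA records one at a time. */
final class FastaStreamSketch {

    private FastaStreamSketch() {}

    /**
     * Copies every record from in to out, appending suffix to each header.
     * Only the current record is held in memory. Returns the number of records written.
     */
    static int copyRecords(BufferedReader in, Writer out, String suffix) {
        int records = 0;
        String header = null;                      // header of the record being collected
        StringBuilder residues = new StringBuilder();
        try {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue;                      // skip blank lines
                }
                if (line.startsWith(">")) {
                    if (header != null) {          // previous record is complete: write it out
                        writeRecord(out, header + suffix, residues);
                        records++;
                    }
                    header = line.substring(1).trim();
                    residues.setLength(0);
                } else if (header != null) {
                    residues.append(line);         // sequence may span several lines
                }
            }
            if (header != null) {                  // flush the last record
                writeRecord(out, header + suffix, residues);
                records++;
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    private static void writeRecord(Writer out, String header, CharSequence residues)
            throws IOException {
        out.write('>');
        out.write(header);
        out.write('\n');
        out.write(residues.toString());
        out.write('\n');
    }

    public static void main(String[] args) {
        String input = ">1\natccccc\n>2\natccccctttttt\n";
        StringWriter output = new StringWriter();
        int n = copyRecords(new BufferedReader(new StringReader(input)), output, "1");
        System.out.print(output);
        System.out.println(n + " records copied");
    }
}
```

Because a record is written as soon as the next header is seen, memory use depends on the longest single record rather than on the size of the file.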
I have tried to write a FASTA file record by record with IOTools, but I
got the following error:

Exception in thread "main" java.lang.RuntimeException: Uncompilable source code
1
        at SortFasta.main(SortFasta.java:31)
atccccc
Java Result: 1

Here is the complete code:

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileReader;
import org.biojava.bio.BioException;
import org.biojava.bio.seq.io.SymbolTokenization;
import org.biojava.bio.symbol.AlphabetManager;
import org.biojavax.bio.seq.RichSequence;
import org.biojavax.bio.seq.RichSequenceIterator;

public class SortFasta {

  public static void main(String[] args) throws FileNotFoundException,
      BioException {

    BufferedReader br = new BufferedReader(new FileReader("sortFasta.fasta"));
    String type = "DNA";
    SymbolTokenization toke = AlphabetManager.alphabetForName(type)
        .getTokenization("token");

    FileOutputStream outputFasta = new FileOutputStream("test.fasta");

    RichSequenceIterator rsi = RichSequence.IOTools.readFasta(br, toke, null);

    while (rsi.hasNext()) {
      RichSequence rs = rsi.nextRichSequence();
      System.out.println(rs.getName());
      System.out.println(rs.seqString());

      RichSequence.IOTools.writeFasta(outputFasta, rs.seqString(), null,
          rs.getName() + "1");
    }
  }
}

How is it possible to write FASTA files record by record?

From holland at eaglegenomics.com Sun Apr 25 08:21:22 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Sun, 25 Apr 2010 09:21:22 +0100
Subject: [Biojava-l] readFasta problem
In-Reply-To: <20100425151925.1c5c9a03@wp01>
References: <20100408213052.662beb8e@wp01> <5B252A84-6822-4978-BBD1-83A8BA09EBC5@eaglegenomics.com> <20100421211824.75b7ada2@wp01> <20100425151925.1c5c9a03@wp01>
Message-ID: <316097DC-6011-4205-83BC-9A24398D034D@eaglegenomics.com>

Hi.

You are calling a nonexistent version of writeFasta. I'm surprised your
code even compiles! Have a look at the JavaDocs to find out what you can
actually do with writeFasta.
For a start, it takes Sequence and FastaHeader objects as parameters, not
Strings as you are trying to do.

http://www.biojava.org/docs/api17/org/biojavax/bio/seq/RichSequence.IOTools.html

cheers,
Richard

On 25 Apr 2010, at 06:19, xyz wrote:

> On Wed, 21 Apr 2010 12:29:57 +0100
> Richard Holland wrote:
>
>>> Q1:
>>> Does RichSequenceIterator read the complete file in memory and then
>>> I retrieve each read from memory? Or does it read the file line by
>>> line and I get each read?
>>
>>
>> Line by line.
>
> That save memory.
>
>>> Q2:
>>> Why am I not able to retrieve the header from the following fasta
>>> file:
>>> >1
>>> atccccc
>>> >2
>>> atccccctttttt
>>> >3
>>> atccccccccccccccccctttt
>>> >4
>>> tttttttccccccccccccccccccccccc
>>> >5
>>> tttttttcccccccccccccccccccccca
>>
>> Try the other methods on RichSequence - getName() for instance.
>
> Thank you getName() works.
>
> I have tried to write fasta file line by line with IOTools, but I have
> got the following error:
> Exception in thread "main" java.lang.RuntimeException: Uncompilable source code
> 1
>         at SortFasta.main(SortFasta.java:31)
> atccccc
> Java Result: 1
>
> Here is the complete code:
>
> import java.io.BufferedReader;
> import java.io.FileNotFoundException;
> import java.io.FileOutputStream;
> import java.io.FileReader;
> import org.biojava.bio.BioException;
> import org.biojava.bio.seq.io.SymbolTokenization;
> import org.biojava.bio.symbol.AlphabetManager;
> import org.biojavax.bio.seq.RichSequence;
> import org.biojavax.bio.seq.RichSequenceIterator;
>
> public class SortFasta {
>
>   public static void main(String[] args) throws FileNotFoundException,
>       BioException {
>
>     BufferedReader br = new BufferedReader(new FileReader("sortFasta.fasta"));
>     String type = "DNA";
>     SymbolTokenization toke = AlphabetManager.alphabetForName(type)
>         .getTokenization("token");
>
>     FileOutputStream outputFasta = new FileOutputStream("test.fasta");
>
>     RichSequenceIterator rsi = RichSequence.IOTools.readFasta(br, toke, null);
>
>     while (rsi.hasNext()) {
>       RichSequence rs = rsi.nextRichSequence();
>       System.out.println(rs.getName());
>       System.out.println(rs.seqString());
>
>       RichSequence.IOTools.writeFasta(outputFasta, rs.seqString(), null,
>           rs.getName() + "1");
>     }
>   }
> }
>
> How is it possible to write fasta files line by line?

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From andreas.draeger at uni-tuebingen.de Mon Apr 26 01:04:44 2010
From: andreas.draeger at uni-tuebingen.de (Andreas Dräger)
Date: Mon, 26 Apr 2010 10:04:44 +0900
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava alignment lead"]
In-Reply-To:
References: <4BBD820D.9070200@uni-tuebingen.de> <4BCE6E31.70504@uni-tuebingen.de>
Message-ID: <4BD4E6AC.8030901@uni-tuebingen.de>

Dear Indu,

If you have a question regarding BioJava, please do not just reply to
some previous e-mail. In this case, your question appears in the e-mail
thread related to the BioJava alignment lead, although it is about
working with and manipulating symbols. It would therefore be better to
open a new thread. Sorry for having to tell you this, but it is necessary
to keep an overview of all the e-mails.

Best wishes
Andreas

--
Dipl.-Bioinform.
Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 Tübingen
Germany

Phone: +49-7071-29-70436
Fax: +49-7071-29-5091

From asidhu at biomap.org Mon Apr 26 06:27:30 2010
From: asidhu at biomap.org (Amandeep Sidhu)
Date: Mon, 26 Apr 2010 14:27:30 +0800
Subject: [Biojava-l] CFP: 23rd IEEE International Symposium on Computer-Based Medical Systems 2010
Message-ID:

IEEE CBMS 2010
23rd IEEE International Symposium on Computer-Based Medical Systems 2010
Perth, Australia, 12-15 October 2010
http://www.cbms2010.curtin.edu.au/

The 23rd IEEE International Symposium on Computer-Based Medical Systems
(CBMS 2010) is intended to provide an international forum for discussing
the latest results in the field of computational medicine. The scientific
program of CBMS 2010 will consist of invited keynote talks given by
leading scientists in the field, and regular and special track sessions
that cover a broad array of issues which relate computing to medicine.

RELEVANT TOPICS

Network and Telemedicine Systems
Medical Databases & Information Systems
Computer-Aided Diagnosis
Medical Devices with Embedded Computers
Bioinformatics in Medicine
Software Systems in Medicine
Pervasive Health Systems and Services
Web-based Delivery of Medical Information
Medical Image Segmentation & Compression
Content Analysis of Biomedical Image Data
Knowledge-Based & Decision Support Systems
Hand-held Computing Applications in Medicine
Knowledge Discovery & Data Mining
Signal and Image Processing in Medicine
Multimedia Biomedical Databases

CBMS 2010 invites original, previously unpublished contributions that are
not submitted concurrently to a journal or another conference. Many of
the topics listed above are represented by corresponding Special Tracks,
while others are covered only by the general CBMS track.
Prospective authors are expected to submit their contributions to one of
the corresponding Special Tracks, or to the general track if none of the
special tracks is relevant.

SPECIAL TRACKS

ST1: Computational Proteomics and Genomics
ST2: Knowledge Discovery and Decision Systems in Biomedicine
ST3: Ontologies for Biomedical Systems
ST4: HealthGrid & Cloud Computing
ST5: Technology Enhanced Learning in Medical Education
ST6: Intelligent Patient Management
ST7: Data Streams in Healthcare
ST8: Supporting Collaboration among Healthcare Workers
ST9: Telemedicine
ST10: Computer-Based Systems for Mental Health
ST11: Image Informatics in Biomedical Research and Clinical Medicine
ST12: e-Health

SUBMISSION GUIDELINES

Papers should be submitted electronically using the EasyChair online
submission system. The papers must be prepared following the IEEE
two-column format and should not exceed 6 (six) Letter-sized pages.
LaTeX or Microsoft Word templates can be used when preparing the papers.
Please note that submissions are accepted in PDF format only.

Submission web site: http://www.easychair.org/conferences/?conf=cbms2010

All submissions will be peer-reviewed by at least three reviewers. The
proceedings will be published by the IEEE Computer Society Press. At
least one of the authors of each accepted paper is required to register
and present the work at the conference; otherwise the paper will be
removed from the digital library after the conference.

IMPORTANT DATES

Submission deadline for regular papers: 24 June 2010
Deadline for tutorial submission: 24 June 2010
Notification of acceptance for papers and tutorials: 2 Aug 2010
Final camera ready due: 2 Sep 2010
Author registration: 2 Sep 2010

INTENDED AUDIENCE

Engineers, scientists, clinicians and managers involved in medical
computing projects are encouraged to submit papers to the symposium
and/or attend the symposium.
The symposium provides its attendees with an opportunity to experience
state-of-the-art research and development in a variety of topics directly
and indirectly related to their own work. In addition to research papers,
keynote speakers and tutorial sessions, it provides participants with an
opportunity to get up to date on important technological issues. The
symposium encourages the participation of students engaged in
research/development in computer-based medical systems.

Organizing Committee

GENERAL CHAIRS
Tharam Dillon, Curtin University of Technology, Australia
Daniel Rubin, National Center for Biomedical Ontologies, USA
William Gallagher, University College Dublin, Ireland

PROGRAM CHAIRS
Amandeep Sidhu, Curtin University of Technology, Australia
Alexey Tsymbal, Siemens, Germany

PUBLICATION CHAIRS
Mykola Pechenizkiy, Eindhoven University of Technology, Netherlands
Tony Hu, Drexel University, USA

SPECIAL TRACK CHAIRS
Maja Hadzic, Curtin University of Technology, Australia
Jake Chen, Indiana University, USA

TUTORIAL CHAIRS
Phoebe Chen, La Trobe University, Australia
Xiaofang Zhou, University of Queensland, Australia

PUBLICITY CHAIRS
Carolyn McGregor, University of Ontario Institute of Technology, Canada
Meifania Chen, Curtin University of Technology, Australia

From thomascramera at dnastar.com Mon Apr 26 14:51:23 2010
From: thomascramera at dnastar.com (Andy Thomas-Cramer)
Date: Mon, 26 Apr 2010 09:51:23 -0500
Subject: [Biojava-l] PDBFileParser and Atom element symbol
In-Reply-To:
References:
Message-ID:

Thank you. I had not noticed the pattern that columns 13-14 at least
sometimes contain the element symbol, whether one- or two-character.

Questions:
* Is this pattern documented in the PDB specification?
* If this pattern can be relied on, why are columns 77-78 also dedicated
  to the element symbol?
* Should reliance on the pattern be hidden behind a BioJava method?
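
The two places where an ATOM/HETATM record can carry the element (the explicit symbol in columns 77-78, and by convention columns 13-14 of the atom name field) can be compared with a small plain-Java sketch. It is a hedged illustration only, not a BioJava API: the class PdbElementSketch and its methods are hypothetical names, the records in main are built by hand rather than taken from a real PDB entry, and the fallback to columns 13-14 is a heuristic that can misfire for hydrogen names that fill all four columns.

```java
import java.util.Locale;

/** Hypothetical helper, not part of BioJava: reads the element of a PDB ATOM/HETATM record. */
final class PdbElementSketch {

    private PdbElementSketch() {}

    /** Element symbol from columns 77-78, else a heuristic guess from columns 13-14. */
    static String elementSymbol(String record) {
        // PDB columns are 1-based, so columns 77-78 are substring(76, 78).
        if (record.length() >= 78) {
            String explicit = record.substring(76, 78).trim();
            if (!explicit.isEmpty()) {
                return explicit.toUpperCase(Locale.ROOT);
            }
        }
        // Fallback (assumption): the element is right-justified in columns 13-14
        // of the atom name field; digits and blanks are dropped.
        if (record.length() >= 14) {
            String guess = record.substring(12, 14).replaceAll("[^A-Za-z]", "");
            return guess.toUpperCase(Locale.ROOT);
        }
        return "";
    }

    /** True for a C-alpha atom: atom name field " CA " with carbon as the element. */
    static boolean isCAlpha(String record) {
        return record.length() >= 16
                && " CA ".equals(record.substring(12, 16))
                && "C".equals(elementSymbol(record));
    }

    public static void main(String[] args) {
        // Columns 1-12, atom name (13-16), columns 17-76, element (77-78, right-justified).
        String cAlpha = String.format("%-12s%-4s%-60s%2s", "ATOM      2", " CA ", " ALA A   1", "C");
        String calcium = String.format("%-12s%-4s%-60s%2s", "HETATM  900", "CA  ", "  CA A 201", "CA");
        System.out.println(elementSymbol(cAlpha) + " " + isCAlpha(cAlpha));   // prints "C true"
        System.out.println(elementSymbol(calcium) + " " + isCAlpha(calcium)); // prints "CA false"
    }
}
```

Preferring columns 77-78 and treating the atom name only as a fallback keeps the sketch usable for older files in which the element columns are blank.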
________________________________

From: andreas.prlic at gmail.com [mailto:andreas.prlic at gmail.com] On Behalf Of Andreas Prlic
Sent: Friday, April 23, 2010 6:52 PM
To: Andy Thomas-Cramer
Cc: biojava-l at lists.open-bio.org
Subject: Re: [Biojava-l] PDBFileParser and Atom element symbol

Hi Andy,

you could check with Atom.getFullName(), which keeps the space characters
of the four-column atom name field from the PDB file:
e.g. C-alpha: " CA ", calcium: "CA  "

In addition, the parent group of a C-alpha atom is usually an AminoAcid,
and for calcium it is a Hetatom group...

Andreas

On Fri, Apr 23, 2010 at 3:58 PM, Andy Thomas-Cramer wrote:

Is there an easy way to identify the type of atom referenced by an Atom
object?

For example, if Atom.getName() is "CA", is the element calcium or the
atom carbon alpha?

If not, would it be feasible to add a method providing this in Atom,
AtomImpl, and parsing it in PDBFileParser, using the columns defined at
http://www.wwpdb.org/documentation/format32/sect9.html#ATOM?

_______________________________________________
Biojava-l mailing list - Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l

--
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From andreas at sdsc.edu Tue Apr 27 01:07:53 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Mon, 26 Apr 2010 18:07:53 -0700
Subject: [Biojava-l] PDBFileParser and Atom element symbol
In-Reply-To:
References:
Message-ID:

Hi Andy

> Questions:
> * Is this pattern documented in the PDB specification?

See here: http://www.wwpdb.org/documentation/format23/sect9.html#ATOM

> * If this pattern can be relied on, why are columns 77-78 also dedicated to
> the element symbol?

That is the atom's element symbol (as given in the periodic table), in
contrast to the atom name field (columns 13-16), which also carries
position and numbering information.

> * Should reliance on the pattern be hidden behind a BioJava method?

If you think that is important, we could probably provide an enum for all
atom types. There are two categories though: the periodic table symbol
and the name that is related to the position in an amino acid....

Andreas

>
> ------------------------------
>
> *From:* andreas.prlic at gmail.com [mailto:andreas.prlic at gmail.com] *On Behalf Of *Andreas Prlic
> *Sent:* Friday, April 23, 2010 6:52 PM
> *To:* Andy Thomas-Cramer
> *Cc:* biojava-l at lists.open-bio.org
> *Subject:* Re: [Biojava-l] PDBFileParser and Atom element symbol
>
>
>
> Hi Andy,
>
> you could check with Atom.getFullName(), which keeps the space characters
> of the four-column atom name field from the PDB file:
> e.g. C-alpha: " CA ", calcium: "CA  "
>
> In addition, the parent group of a C-alpha atom is usually an AminoAcid,
> and for calcium it is a Hetatom group...
>
> Andreas
>
> On Fri, Apr 23, 2010 at 3:58 PM, Andy Thomas-Cramer <thomascramera at dnastar.com> wrote:
>
>
>
> Is there an easy way to identify the type of atom referenced by an Atom
> object?
>
> For example, if Atom.getName() is "CA", is the element calcium or the
> atom carbon alpha?
>
> If not, would it be feasible to add a method providing this in Atom,
> AtomImpl, and parsing it in PDBFileParser, using the columns defined at
> http://www.wwpdb.org/documentation/format32/sect9.html#ATOM?
>
>
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
>
>
>
> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------
>

--
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From rmb32 at cornell.edu Mon Apr 26 22:02:11 2010
From: rmb32 at cornell.edu (Robert Buels)
Date: Mon, 26 Apr 2010 15:02:11 -0700
Subject: [Biojava-l] Google Summer of Code - accepted students
Message-ID: <4BD60D63.1040400@cornell.edu>

Hi all,

I'm pleased to announce the acceptance of OBF's 2010 Google Summer of
Code students, listed in alphabetical order with their project titles and
primary mentors:

Mark Chapman (PM Andreas Prlic) - Improvements to BioJava including
Implementation of Multiple Sequence Alignment Algorithms

Jianjiong Gao (PM Peter Rose) - BioJava Packages for Identification,
Classification, and Visualization of Posttranslational Modification of
Proteins

Kazuhiro Hayashi (PM Naohisa Goto) - Ruby 1.9.2 support of BioRuby

Sara Rayburn (PM Christian Zmasek) - Implementing Speciation &
Duplication Inference Algorithm for Binary and Non-binary Species Tree

Joao Pedro Garcia Lopes Maia Rodrigues (PM Eric Talevich) - Extending
Bio.PDB: broadening the usefulness of BioPython's Structural Biology
module

Jun Yin (PM Chris Fields) - BioPerl Alignment Subsystem Refactoring

Congratulations to our accepted students!

All told, we had 52 applications submitted for the 6 slots (5 originally
assigned, plus 1 extra) allotted to us by Google. Proposals were
extremely competitive: 6 out of 52 translates to an 11.5% acceptance
rate. We received a lot of really excellent proposals; the decisions were
not easy.

Thanks very much to all the students who applied. We very much appreciate
your hard work.

Here's to a great 2010 Summer of Code. I'm sure these students will do
some wonderful work.

Rob Buels
OBF GSoC 2010 Administrator

From andreas at sdsc.edu Tue Apr 27 05:33:51 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Mon, 26 Apr 2010 22:33:51 -0700
Subject: [Biojava-l] accepted GSoC projects
Message-ID:

Dear all,

Google has released the results for GSoC: Congratulations to Mark Chapman
and Jianjiong Gao for having been accepted to work on the MSA and PTM
projects for BioJava! Let's start the "community bonding" process
( http://en.flossmanuals.net/GSoCMentoring/MindtheGap ); we are all
looking forward to working with you on this during the summer. The
mentors and co-mentors will be Peter Rose for the PTM project and Scooter
Willis and Kyle Ellrott for the MSA project (and me).

I want to thank all of you who submitted proposals or showed interest in
other ways for the Google Summer of Code. We hope you are not too
disappointed if your application did not get accepted this time. We had a
large number (52) of applications and the overall quality of the
submissions was very high. We would like to stay in touch with you and we
hope that you are interested in BioJava also beyond the scope of GSoC.
There are a number of different ways to contribute: we are always looking
for people who provide code and patches to further improve our library,
help out with the documentation on the Wiki page, or answer questions on
the mailing lists.

Let's all give Mark and Jianjiong a warm welcome to the BioJava
community. For those of you who are interested in following the progress
of the projects: as usual, the development-related discussions are going
to be on the biojava-dev list.

Happy coding!
Andreas

From jianjiong.gao at gmail.com Tue Apr 27 19:13:12 2010
From: jianjiong.gao at gmail.com (Jianjiong Gao)
Date: Tue, 27 Apr 2010 14:13:12 -0500
Subject: [Biojava-l] [Biojava-dev] accepted GSoC projects
In-Reply-To:
References:
Message-ID:

Dear Dr. Prlic and Everyone,

Thanks for the warm welcome.
I am so glad that I have the chance to work with the BioJava community this summer. I would like to briefly introduce myself. My name is Jianjiong (JJ) Gao. I am a PhD student in Computer Science at University of Missouri, Columbia. My study is focusing on Bioinformatics, specifically computational proteomics and PTMs. I came across BioJava about two years ago when I was working on a plugin for Cytoscape, and was attracted by the idea of providing generic Java API for bioinformatics applications. I was thinking maybe someday I could do some coding for BioJava. And now I got the chance :) Best Regards, -JJ On Tue, Apr 27, 2010 at 12:33 AM, Andreas Prlic wrote: > Dear all, > > Google has released the results for GSoC: Congratulations to Mark Chapman > and Jianjiong Gao for having been accepted to work on the MSA and PTM > projects for BioJava! Let's start the "community bonding" process ( > http://en.flossmanuals.net/GSoCMentoring/MindtheGap ) ?and we all are > looking forward to work with you on this during the summer. The Mentors and > co-mentors will be Peter Rose for the PTM and Scooter Willis and Kyle > Ellrott for the MSA project (and me). > > I want to thank all of of you who submitted proposals or showed interest in > other ways for the Google Summer of Code. We hope you are not too > disappointed if your application did not get accepted this time. We had a > large number (52) applications and the the overall quality of the > submissions was very high. We would like to stay in touch with you and we > hope that you are interested in BioJava also beyond the scope of GSoC. There > are a number of different ways how to contribute: ?We are always looking for > people who provide code and patches to further improve our library, help out > with the documentation on the Wiki page, or answer questions on the mailing > lists. > > Let's all give Mark and Jianjiong ?a warm welcome to the BioJava community. 
> For those of you who are interested in following the progress of the > projects, as usually, the development related discussions are going to be on > the biojava-dev list. > > Happy coding! > > Andreas > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev > From chapman at cs.wisc.edu Wed Apr 28 04:18:25 2010 From: chapman at cs.wisc.edu (Mark Chapman) Date: Tue, 27 Apr 2010 23:18:25 -0500 Subject: [Biojava-l] accepted GSoC projects In-Reply-To: References: Message-ID: <4BD7B711.9090108@cs.wisc.edu> Hi all, Thank you to Google, Open Bioinformatics Foundation, BioJava, and my mentors for this opportunity. As a short introduction, I am Mark Chapman, a graduate student in Computer Sciences at the University of Wisconsin - Madison. My focus is in artificial intelligence and bioinformatics. This summer, I will add a Multiple Sequence Alignment module to BioJava. My first task will be to update the alignment module to BioJava3 and to design the interface for MSA. My second goal is to implement a progressive MSA styled after clustalw. After that, I will add alternative routines for each step. Any ideas for the MSA project as well as more sources of programming wisdom are quite welcome. For example, Andreas suggested a series about Java parallelism and lazy execution (http://apocalisp.wordpress.com/2008/06/18/parallel-strategies-and-the-callable-monad/). I also noted a useful tip for iterative development (http://en.flossmanuals.net/GSoCMentoring/Workflow). Thanks again, Mark On 4/27/2010 12:33 AM, Andreas Prlic wrote: > Dear all, > > Google has released the results for GSoC: Congratulations to Mark > Chapman and Jianjiong Gao for having been accepted to work on the MSA > and PTM projects for BioJava! 
Let's start the "community bonding"
> process ( http://en.flossmanuals.net/GSoCMentoring/MindtheGap ) and we
> all are looking forward to work with you on this during the summer. The
> Mentors and co-mentors will be Peter Rose for the PTM and Scooter Willis
> and Kyle Ellrott for the MSA project (and me).
>
> I want to thank all of you who submitted proposals or showed interest
> in other ways for the Google Summer of Code. We hope you are not too
> disappointed if your application did not get accepted this time. We had
> a large number (52) applications and the overall quality of the
> submissions was very high. We would like to stay in touch with you and
> we hope that you are interested in BioJava also beyond the scope of
> GSoC. There are a number of different ways how to contribute: We are
> always looking for people who provide code and patches to further
> improve our library, help out with the documentation on the Wiki page,
> or answer questions on the mailing lists.
>
> Let's all give Mark and Jianjiong a warm welcome to the BioJava
> community. For those of you who are interested in following the
> progress of the projects, as usually, the development related
> discussions are going to be on the biojava-dev list.
>
> Happy coding!
>
> Andreas
>
>

From bernd.jagla at pasteur.fr  Wed Apr 28 07:25:05 2010
From: bernd.jagla at pasteur.fr (Bernd Jagla)
Date: Wed, 28 Apr 2010 09:25:05 +0200
Subject: [Biojava-l] DAS client: how to retrieve features for a sequence region
Message-ID: <7F6FFC6512D64E0E98DF74DB837C9610@zillumina>

Hi there,

I am trying to retrieve information (features) from the UCSC genome browser
using the DAS interface.
I am looking at the org.biojava.bio.program.das sources. I can retrieve all
top level entry points with
DASSequenceDB(dbURL)
(Apparently the last entry from the returned XML object gives a
[Fatal Error] :1:1: Content is not allowed in prolog.
Which I am ignoring...)
and also the DSN entries using:

    DAS das = new DAS();
    das.addDasURL(new URL(dbURLString));
    for(Iterator i = das.getReferenceServers().iterator(); i.hasNext(); )
    {....

When I try to access features for a top level entry point, i.e. a reference
sequence I have the impression that first all features for a given reference
sequence are being downloaded.

My questions:

How can I access only the features of a specific region? I guess in DAS
terms I want to specify the segment part of the URL
(http://genome.ucsc.edu/cgi-bin/das/hg17/features?segment=22:15000000,16000000).

I would also like to get the list of available features. How can I achieve
this? From a wireshark output I can see that this is being retrieved somehow
behind the scene. How can I access this information?

I am looking at TestDAS*.java; are there any other examples around that I
can use to learn from?

Thanks a lot for your kind support,

Best,

Bernd

From er.indupandey at gmail.com  Wed Apr 28 16:22:10 2010
From: er.indupandey at gmail.com (indu pandey)
Date: Wed, 28 Apr 2010 09:22:10 -0700
Subject: [Biojava-l] regarding errors
Message-ID:

hi

When i m trying to run this code

package javaapplication10;
import org.biojava.bio.symbol.*;
import org.biojava.bio.seq.*;

public class TranscribeDNAtoRNA {
  public static void main(String[] args) {
    try {
      //make a DNA SymbolList
      SymbolList symL = DNATools.createDNA("ATGTAAGGCCAGTGT");
      //transcribe it to RNA (after BioJava 1.4 this method is deprecated)
      symL = RNATools.transcribe(symL);
      //(after BioJava 1.4 use this method instead)
      symL = DNATools.toRNA(symL);
      //just to prove it worked
      System.out.println(symL.seqString());
    }
    catch (IllegalSymbolException ex) {
      //this will happen if you try and make the DNA seq using non IUB symbols
      ex.printStackTrace();
    }catch (IllegalAlphabetException ex) {
      //this will happen if you try and transcribe a non DNA SymbolList
      ex.printStackTrace();
    }
  }
}

i get following errors:.
*org.biojava.bio.symbol.IllegalAlphabetException: The source alphabet and
translation table source alphabets don't match: RNA and DNA
        at org.biojava.bio.symbol.TranslatedSymbolList.<init>(TranslatedSymbolList.java:75)
        at org.biojava.bio.symbol.SymbolListViews.translate(SymbolListViews.java:125)
        at org.biojava.bio.seq.DNATools.toRNA(DNATools.java:490)
        at javaapplication10.TranscribeDNAtoRNA.main(TranscribeDNAtoRNA.java:23)
*

From andreas at sdsc.edu  Wed Apr 28 17:31:58 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 28 Apr 2010 10:31:58 -0700
Subject: [Biojava-l] accepted GSoC projects
In-Reply-To: <4BD7B711.9090108@cs.wisc.edu>
References: <4BD7B711.9090108@cs.wisc.edu>
Message-ID:

> Any ideas for the MSA project as well as more sources of programming wisdom
> are quite welcome. For example, Andreas suggested a series about Java
> parallelism and lazy execution (
> http://apocalisp.wordpress.com/2008/06/18/parallel-strategies-and-the-callable-monad/).
>

Credits for the links go to Scooter, who recommended those ;-) My general
recommendation is to read Joshua Bloch's "Effective Java".
http://java.sun.com/docs/books/effective/ It is a collection of rules that
should help in avoiding some frequently made mistakes...

Andreas

> I also noted a useful tip for iterative development (
> http://en.flossmanuals.net/GSoCMentoring/Workflow).
>
> Thanks again,
> Mark
>
>
>
> On 4/27/2010 12:33 AM, Andreas Prlic wrote:
>
>> Dear all,
>>
>> Google has released the results for GSoC: Congratulations to Mark
>> Chapman and Jianjiong Gao for having been accepted to work on the MSA
>> and PTM projects for BioJava! Let's start the "community bonding"
>> process ( http://en.flossmanuals.net/GSoCMentoring/MindtheGap ) and we
>> all are looking forward to work with you on this during the summer. The
>> Mentors and co-mentors will be Peter Rose for the PTM and Scooter Willis
>> and Kyle Ellrott for the MSA project (and me).
>> >> I want to thank all of of you who submitted proposals or showed interest >> in other ways for the Google Summer of Code. We hope you are not too >> disappointed if your application did not get accepted this time. We had >> a large number (52) applications and the the overall quality of the >> submissions was very high. We would like to stay in touch with you and >> we hope that you are interested in BioJava also beyond the scope of >> GSoC. There are a number of different ways how to contribute: We are >> always looking for people who provide code and patches to further >> improve our library, help out with the documentation on the Wiki page, >> or answer questions on the mailing lists. >> >> Let's all give Mark and Jianjiong a warm welcome to the BioJava >> community. For those of you who are interested in following the >> progress of the projects, as usually, the development related >> discussions are going to be on the biojava-dev list. >> >> Happy coding! >> >> Andreas >> >> >> -- ----------------------------------------------------------------------- Dr. Andreas Prlic Senior Scientist, RCSB PDB Protein Data Bank University of California, San Diego (+1) 858.246.0526 ----------------------------------------------------------------------- From jw12 at sanger.ac.uk Wed Apr 28 20:21:13 2010 From: jw12 at sanger.ac.uk (Jonathan Warren) Date: Wed, 28 Apr 2010 21:21:13 +0100 Subject: [Biojava-l] DAS client: how to retrieve features for a sequence region In-Reply-To: <7F6FFC6512D64E0E98DF74DB837C9610@zillumina> References: <7F6FFC6512D64E0E98DF74DB837C9610@zillumina> Message-ID: <58A5F4B6-C3E3-45FA-9C15-6676BB953C4F@sanger.ac.uk> Hi Bernd For the UCSC you need to filter on types. 
see http://genome.ucsc.edu/FAQ/FAQdownloads.html#downloads there is a section called "Downloading data from the UCSC DAS server" for DAS libraries you can see a tutorial here http://www.biodas.org/wiki/DASWorkshop2010#Day_2 the one you would be most interested in is the Dasobert tutorial (http://www.ebi.ac.uk/~rafael/dokuwiki/doku.php?id=das:courses:dasobert ) for DAS client creation, but there is a also a good javascript library as well called JSDas. Any more info then don't hesitate to ask. Jonathan. On 28 Apr 2010, at 08:25, Bernd Jagla wrote: > Hi there, > > I am trying to retrieve information (features) from the UCSC genome > browser > using the DAS interface. > I am looking at the org.biojava.bio.program.das sources. I can > retrieve all > top level entry points with > DASSequenceDB(dbURL) > (Apperently the last entry from the return XML object gives a > [Fatal Error] :1:1: Content is not allowed in prolog. > Which I am ignoring...) > > and also the DSN entries using: > DAS das = new DAS(); > das.addDasURL(new URL(dbURLString)); > for(Iterator i = das.getReferenceServers().iterator(); > i.hasNext(); ) > {.... > > When I try to access features for a top level entry point, i.e. a > reference > sequence I have the impression that first all features for a given > reference > sequence are being downloaded. > > My questions: > > How can I access only the features of a specific region? I guess in > DAS > terms I want to specify the segment part of the URL > (http://genome.ucsc.edu/cgi-bin/das/hg17/features?segment=22:15000000,160000 > 00). > > I would also like to get the list of available features. How can I > achieve > this? From a wireshark output I can see that this is being retrieved > somehow > behind the scene. How can I access this information? > > I am looking at TestDAS*.java; are there any other examples around > that I > can use to learn from? 
>
> Thanks a lot for your kind support,
>
> Best,
>
> Bernd
>
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

Jonathan Warren
Senior Developer and DAS coordinator
jw12 at sanger.ac.uk
Ext: 2314
Telephone: 01223 492314

--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

From chapman at cs.wisc.edu  Thu Apr 29 01:09:07 2010
From: chapman at cs.wisc.edu (Mark Chapman)
Date: Wed, 28 Apr 2010 20:09:07 -0500
Subject: [Biojava-l] [Biojava-dev] accepted GSoC projects
In-Reply-To: <6C3A102F-AF2B-4E29-9C84-BB6B881BD083@scripps.edu>
References: <4BD7B711.9090108@cs.wisc.edu> <6C3A102F-AF2B-4E29-9C84-BB6B881BD083@scripps.edu>
Message-ID: <4BD8DC33.7010607@cs.wisc.edu>

Here is a summary of the concurrency lessons I learned that are useful
with or without the functional programming paradigm --
1: implement Callable to submit tasks for concurrent/parallel/lazy execution
   - call() methods just wrap a call to the computation intensive method
2: share a fixed size thread pool with task queue to avoid
   - overhead of thread creation/destruction,
   - too many simultaneous threads, and
   - most blocking issues
3: place thread blocking Future.get() calls within tasks later in the queue
   - while(!Future.isDone()) Thread.yield(); may also help keep the pool active
4: execution in a task queue also enables easier logging and progress listening

There are two obvious places concurrent execution will fit in the MSA
module --
1: building the distance matrix
   - queue pairwise alignment/scoring tasks in loop over all sequence pairs
2: progressive alignment
   - queue profile-profile alignment tasks in postfix traversal of guide
     tree (from leaves to root)

All our library copies of "Effective Java" are checked out, so I ordered a
copy for my personal library. The sample chapter on generics sold me.

Mark

On 4/28/2010 12:57 PM, Scooter Willis wrote:
> Andreas
>
> Those links were sent to me by Mark Southern who sits a couple doors down and a past BioJava contributor for the sequence viewer. We should avoid bringing in any external parallel frameworks but at minimum give ourselves enough abstraction with a backend multi-threaded job-processing approach to take advantage of a multi-processor box and a cluster via Terracotta. If the abstraction of the jobs and the mapping of resources is generic enough then that allows different implementations in various cluster environments for those who have found the next best thing in parallel computing!
>
> Scooter
>
> On Apr 28, 2010, at 1:31 PM, Andreas Prlic wrote:
>
>>> Any ideas for the MSA project as well as more sources of programming wisdom
>>> are quite welcome. For example, Andreas suggested a series about Java
>>> parallelism and lazy execution (
>>> http://apocalisp.wordpress.com/2008/06/18/parallel-strategies-and-the-callable-monad/).
>>>
>>
>> credits for the links go to Scooter, who recommended those ;-) My general
>> recommendation is to read Joshua Bloch's "Effective Java".
>> http://java.sun.com/docs/books/effective/ It is a collection of rules that
>> should help in avoiding some frequently made mistakes...
>>
>> Andreas
>>
>>> I also noted a useful tip for iterative development (
>>> http://en.flossmanuals.net/GSoCMentoring/Workflow).
>>>
>>> Thanks again,
>>> Mark
>>>
>>>
>>> On 4/27/2010 12:33 AM, Andreas Prlic wrote:
>>>
>>>> Dear all,
>>>>
>>>> Google has released the results for GSoC: Congratulations to Mark
>>>> Chapman and Jianjiong Gao for having been accepted to work on the MSA
>>>> and PTM projects for BioJava!
Let's start the "community bonding" >>>> process ( http://en.flossmanuals.net/GSoCMentoring/MindtheGap ) and we >>>> all are looking forward to work with you on this during the summer. The >>>> Mentors and co-mentors will be Peter Rose for the PTM and Scooter Willis >>>> and Kyle Ellrott for the MSA project (and me). >>>> >>>> I want to thank all of of you who submitted proposals or showed interest >>>> in other ways for the Google Summer of Code. We hope you are not too >>>> disappointed if your application did not get accepted this time. We had >>>> a large number (52) applications and the the overall quality of the >>>> submissions was very high. We would like to stay in touch with you and >>>> we hope that you are interested in BioJava also beyond the scope of >>>> GSoC. There are a number of different ways how to contribute: We are >>>> always looking for people who provide code and patches to further >>>> improve our library, help out with the documentation on the Wiki page, >>>> or answer questions on the mailing lists. >>>> >>>> Let's all give Mark and Jianjiong a warm welcome to the BioJava >>>> community. For those of you who are interested in following the >>>> progress of the projects, as usually, the development related >>>> discussions are going to be on the biojava-dev list. >>>> >>>> Happy coding! >>>> >>>> Andreas >>>> >>>> >>>> >> >> >> -- >> ----------------------------------------------------------------------- >> Dr. 
Andreas Prlic >> Senior Scientist, RCSB PDB Protein Data Bank >> University of California, San Diego >> (+1) 858.246.0526 >> ----------------------------------------------------------------------- >> _______________________________________________ >> biojava-dev mailing list >> biojava-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-dev > From bernd.jagla at pasteur.fr Thu Apr 29 06:30:03 2010 From: bernd.jagla at pasteur.fr (Bernd Jagla) Date: Thu, 29 Apr 2010 08:30:03 +0200 Subject: [Biojava-l] DAS client: how to retrieve features for a sequence region In-Reply-To: <58A5F4B6-C3E3-45FA-9C15-6676BB953C4F@sanger.ac.uk> References: <7F6FFC6512D64E0E98DF74DB837C9610@zillumina> <58A5F4B6-C3E3-45FA-9C15-6676BB953C4F@sanger.ac.uk> Message-ID: Hi Jonathan, Just to clarify, I need to write my own das client? I was hoping to be able to use most of the functionality especially for the parsing of the XML and creating the URLs by means of functions/methods that are already around. I am now going into debug mode for the DAS package in biojava to look for the XML parsing, if you any further pointers on specific methods I should be looking at it would mean a lot to me. In short, I think I can create the URLs from scratch with not much effort. I don't currently know how to put the XML into a data structure and how this data structure should look like. Thanks for your kind help, Bernd _____ From: Jonathan Warren [mailto:jw12 at sanger.ac.uk] Sent: Wednesday, April 28, 2010 10:21 PM To: Bernd Jagla Cc: biojava-l at lists.open-bio.org Subject: Re: [Biojava-l] DAS client: how to retrieve features for a sequence region Hi Bernd For the UCSC you need to filter on types. 
see http://genome.ucsc.edu/FAQ/FAQdownloads.html#downloads there is a section called "Downloading data from the UCSC DAS server" for DAS libraries you can see a tutorial here http://www.biodas.org/wiki/DASWorkshop2010#Day_2 the one you would be most interested in is the Dasobert tutorial (http://www.ebi.ac.uk/~rafael/dokuwiki/doku.php?id=das:courses:dasobert) for DAS client creation, but there is a also a good javascript library as well called JSDas. Any more info then don't hesitate to ask. Jonathan. On 28 Apr 2010, at 08:25, Bernd Jagla wrote: Hi there, I am trying to retrieve information (features) from the UCSC genome browser using the DAS interface. I am looking at the org.biojava.bio.program.das sources. I can retrieve all top level entry points with DASSequenceDB(dbURL) (Apperently the last entry from the return XML object gives a [Fatal Error] :1:1: Content is not allowed in prolog. Which I am ignoring...) and also the DSN entries using: DAS das = new DAS(); das.addDasURL(new URL(dbURLString)); for(Iterator i = das.getReferenceServers().iterator(); i.hasNext(); ) {.... When I try to access features for a top level entry point, i.e. a reference sequence I have the impression that first all features for a given reference sequence are being downloaded. My questions: How can I access only the features of a specific region? I guess in DAS terms I want to specify the segment part of the URL (http://genome.ucsc.edu/cgi-bin/das/hg17/features?segment=22:15000000,160000 00). I would also like to get the list of available features. How can I achieve this? From a wireshark output I can see that this is being retrieved somehow behind the scene. How can I access this information? I am looking at TestDAS*.java; are there any other examples around that I can use to learn from? 
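The segment question above can be made concrete with a small helper. This is only a sketch: it assumes the DAS 1 URL layout seen in the example URL (base, data source name, then "features?segment=ref:start,stop"), an optional ";type=" filter and a "types" request as described for DAS 1 servers. The class DasUrls and its method names are hypothetical and are not part of BioJava or dasobert, and "refGene" is only a sample type name.

```java
public class DasUrls {

    /** Builds a DAS 1 style "features" request for one segment; type may be null. */
    static String featuresUrl(String dasBase, String dsn, String ref,
                              long start, long stop, String type) {
        if (start < 1 || stop < start) {
            throw new IllegalArgumentException("need 1 <= start <= stop");
        }
        StringBuilder url = new StringBuilder(dasBase);
        if (!dasBase.endsWith("/")) {
            url.append('/');
        }
        url.append(dsn).append("/features?segment=")
           .append(ref).append(':').append(start).append(',').append(stop);
        if (type != null && !type.isEmpty()) {
            // Assumed filter syntax: restrict the response to one feature type.
            url.append(";type=").append(type);
        }
        return url.toString();
    }

    /** Builds the request that lists the feature types a data source offers. */
    static String typesUrl(String dasBase, String dsn) {
        String base = dasBase.endsWith("/") ? dasBase : dasBase + "/";
        return base + dsn + "/types";
    }

    public static void main(String[] args) {
        String base = "http://genome.ucsc.edu/cgi-bin/das";
        System.out.println(featuresUrl(base, "hg17", "22", 15000000L, 16000000L, null));
        System.out.println(featuresUrl(base, "hg17", "22", 15000000L, 16000000L, "refGene"));
        System.out.println(typesUrl(base, "hg17"));
    }
}
```

The helper only assembles strings; fetching and parsing the XML response is left to whichever DAS client library is used.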
Thanks a lot for your kind support,

Best,

Bernd

_______________________________________________
Biojava-l mailing list - Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l

Jonathan Warren
Senior Developer and DAS coordinator
jw12 at sanger.ac.uk
Ext: 2314
Telephone: 01223 492314

--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

From jw12 at sanger.ac.uk  Thu Apr 29 08:26:40 2010
From: jw12 at sanger.ac.uk (Jonathan Warren)
Date: Thu, 29 Apr 2010 09:26:40 +0100
Subject: [Biojava-l] DAS client: how to retrieve features for a sequence region
In-Reply-To:
References: <7F6FFC6512D64E0E98DF74DB837C9610@zillumina> <58A5F4B6-C3E3-45FA-9C15-6676BB953C4F@sanger.ac.uk>
Message-ID:

The link I gave you
http://www.ebi.ac.uk/~rafael/dokuwiki/doku.php?id=das:courses:dasobert
shows examples of how to connect to 'European' style das sources. For the
UCSC and GBrowse type DAS sources you may have to play around with the
urls to get the info you want as they work slightly differently to other
DAS data sources and use the types to filter data. I would suggest
contacting the UCSC for more info.
The dasobert library is what you should use; the DASSequenceDB.java that
you are currently looking at in biojava is old and not really supported
anymore.

> I was hoping to be able to use most of the functionality especially
> for the parsing of the XML and creating the URLs by means of
> functions/methods that are already around.

this is what the dasobert library is for ;)

On 29 Apr 2010, at 07:30, Bernd Jagla wrote:

> Hi Jonathan,
>
> Just to clarify, I need to write my own das client? I was hoping to
> be able to use most of the functionality especially for the parsing
> of the XML and creating the URLs by means of functions/methods that
> are already around.
> I am now going into debug mode for the DAS package in biojava to
> look for the XML parsing, if you any further pointers on specific
> methods I should be looking at it would mean a lot to me.
> In short, I think I can create the URLs from scratch with not much
> effort. I don't currently know how to put the XML into a data
> structure and how this data structure should look like.
>
> Thanks for your kind help,
>
> Bernd
>
> From: Jonathan Warren [mailto:jw12 at sanger.ac.uk]
> Sent: Wednesday, April 28, 2010 10:21 PM
> To: Bernd Jagla
> Cc: biojava-l at lists.open-bio.org
> Subject: Re: [Biojava-l] DAS client: how to retrieve features for a
> sequence region
>
> Hi Bernd
>
> For the UCSC you need to filter on types. see http://genome.ucsc.edu/FAQ/FAQdownloads.html#downloads
> there is a section called "Downloading data from the UCSC DAS server"
>
> for DAS libraries you can see a tutorial here http://www.biodas.org/wiki/DASWorkshop2010#Day_2
>
> the one you would be most interested in is the Dasobert tutorial (http://www.ebi.ac.uk/~rafael/dokuwiki/doku.php?id=das:courses:dasobert
> ) for DAS client creation, but there is a also a good javascript
> library as well called JSDas.
>
> Any more info then don't hesitate to ask.
>
> Jonathan.
>
>
> On 28 Apr 2010, at 08:25, Bernd Jagla wrote:
>
>
> Hi there,
>
> I am trying to retrieve information (features) from the UCSC genome
> browser
> using the DAS interface.
> I am looking at the org.biojava.bio.program.das sources. I can
> retrieve all
> top level entry points with
> DASSequenceDB(dbURL)
> (Apperently the last entry from the return XML object gives a
> [Fatal Error] :1:1: Content is not allowed in prolog.
> Which I am ignoring...)
>
> and also the DSN entries using:
> DAS das = new DAS();
> das.addDasURL(new URL(dbURLString));
> for(Iterator i = das.getReferenceServers().iterator();
> i.hasNext(); )
> {....
>
> When I try to access features for a top level entry point, i.e.
a > reference > sequence I have the impression that first all features for a given > reference > sequence are being downloaded. > > My questions: > > How can I access only the features of a specific region? I guess in > DAS > terms I want to specify the segment part of the URL > (http://genome.ucsc.edu/cgi-bin/das/hg17/features?segment=22:15000000,160000 > 00). > > I would also like to get the list of available features. How can I > achieve > this? From a wireshark output I can see that this is being retrieved > somehow > behind the scene. How can I access this information? > > I am looking at TestDAS*.java; are there any other examples around > that I > can use to learn from? > > Thanks a lot for your kind support, > > Best, > > Bernd > > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > > Jonathan Warren > Senior Developer and DAS coordinator > jw12 at sanger.ac.uk > Ext: 2314 > Telephone: 01223 492314 > > > > > > > -- The Wellcome Trust Sanger Institute is operated by Genome > Research Limited, a charity registered in England with number > 1021457 and a company registered in England with number 2742969, > whose registered office is 215 Euston Road, London, NW1 2BE. Jonathan Warren Senior Developer and DAS coordinator jw12 at sanger.ac.uk Ext: 2314 Telephone: 01223 492314 -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From ayates at ebi.ac.uk Thu Apr 29 08:51:23 2010 From: ayates at ebi.ac.uk (Andy Yates) Date: Thu, 29 Apr 2010 09:51:23 +0100 Subject: [Biojava-l] regarding errors In-Reply-To: References: Message-ID: I believe your problem is that you are attempting to transcribe the DNA to RNA twice. 
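For readers without BioJava at hand, the failure mode can be reproduced in plain Java. The sketch below is not BioJava code: Alphabet, Seq and toRna are hypothetical stand-ins for a converter that checks the alphabet of its input, which is what the exception message from DNATools.toRNA suggests is happening. The first conversion succeeds; the second is rejected because its input is already RNA.

```java
public class DoubleTranscription {

    enum Alphabet { DNA, RNA }

    /** Hypothetical stand-in for a SymbolList: residues plus their alphabet. */
    static final class Seq {
        final Alphabet alphabet;
        final String residues;

        Seq(Alphabet alphabet, String residues) {
            this.alphabet = alphabet;
            this.residues = residues;
        }
    }

    /** Hypothetical DNA -> RNA converter that accepts DNA input only. */
    static Seq toRna(Seq in) {
        if (in.alphabet != Alphabet.DNA) {
            throw new IllegalArgumentException(
                "source alphabet is " + in.alphabet + ", expected DNA");
        }
        return new Seq(Alphabet.RNA, in.residues.replace('T', 'U'));
    }

    public static void main(String[] args) {
        Seq dna = new Seq(Alphabet.DNA, "ATGTAAGGCCAGTGT");
        Seq rna = toRna(dna);            // first conversion: input is DNA
        System.out.println(rna.residues);
        try {
            toRna(rna);                  // second conversion: input is already RNA
        } catch (IllegalArgumentException ex) {
            System.out.println("rejected: " + ex.getMessage());
        }
    }
}
```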
If you comment out the line:

//symL = RNATools.transcribe(symL);

Then you should find the code will work

Regards,

Andy

On 28 Apr 2010, at 17:22, indu pandey wrote:

> hi
>
> When i m trying to run this code
>
> package javaapplication10;
> import org.biojava.bio.symbol.*;
> import org.biojava.bio.seq.*;
>
> public class TranscribeDNAtoRNA {
>   public static void main(String[] args) {
>     try {
>       //make a DNA SymbolList
>       SymbolList symL = DNATools.createDNA("ATGTAAGGCCAGTGT");
>       //transcribe it to RNA (after BioJava 1.4 this method is deprecated)
>       symL = RNATools.transcribe(symL);
>       //(after BioJava 1.4 use this method instead)
>       symL = DNATools.toRNA(symL);
>       //just to prove it worked
>       System.out.println(symL.seqString());
>     }
>     catch (IllegalSymbolException ex) {
>       //this will happen if you try and make the DNA seq using non IUB
>       //symbols
>       ex.printStackTrace();
>     }catch (IllegalAlphabetException ex) {
>       //this will happen if you try and transcribe a non DNA SymbolList
>       ex.printStackTrace();
>     }
>   }
> }
>
>
> i get following errors:.
>
> *org.biojava.bio.symbol.IllegalAlphabetException: The source alphabet and
> translation table source alphabets don't match: RNA and DNA
>        at org.biojava.bio.symbol.TranslatedSymbolList.<init>(TranslatedSymbolList.java:75)
>        at org.biojava.bio.symbol.SymbolListViews.translate(SymbolListViews.java:125)
>        at org.biojava.bio.seq.DNATools.toRNA(DNATools.java:490)
>        at javaapplication10.TranscribeDNAtoRNA.main(TranscribeDNAtoRNA.java:23)
> *
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

--
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/

From bernd.jagla at pasteur.fr  Thu Apr 29 09:57:58 2010
From: bernd.jagla at pasteur.fr (Bernd Jagla)
Date: Thu, 29 Apr 2010 11:57:58 +0200
Subject: [Biojava-l] DAS client: how to retrieve features for a sequence region
In-Reply-To:
References: <7F6FFC6512D64E0E98DF74DB837C9610@zillumina> <58A5F4B6-C3E3-45FA-9C15-6676BB953C4F@sanger.ac.uk>
Message-ID:

Great, that is very helpful.

One more question:

Should I be using the Das1 or Das2 implementations? The demo I am looking
at uses Das2 (I think), but I am running into problems. By modifying things
in the Das2SourceHandler I can now get Ids (instead of using uri). Is this
the right way of approaching this or should I be looking somewhere else?

When you say I have to play around with the URLs can you give me an
example? Is the problem described above part of this? (this is not the URL
but rather the XML...)

Sorry for these questions, but I find it extremely difficult to get my head
around all these different versions (DAS1/2; dasobert/programs.das;
European/Rest; ...)

Thanks a lot,

Bernd

PS. I guess I should have attended the recent meeting.
;( _____ From: Jonathan Warren [mailto:jw12 at sanger.ac.uk] Sent: Thursday, April 29, 2010 10:27 AM To: Bernd Jagla Cc: biojava-l at lists.open-bio.org Subject: Re: [Biojava-l] DAS client: how to retrieve features for a sequence region The link I gave you http://www.ebi.ac.uk/~rafael/dokuwiki/doku.php?id=das:courses:dasobert shows examples of how to connect to 'European' style das sources. For the UCSC and GBrowse type DAS sources you may have to play around with the urls to get the info you want as they work slightly differently to other DAS data sources and use the types to filter data. I would suggest contacting the UCSC for more info. The dasobert library is what you should use- the DASSequenceDB.java that you are currently looking at in biojava are old and not really supported anymore. I was hoping to be able to use most of the functionality especially for the parsing of the XML and creating the URLs by means of functions/methods that are already around. this is what the dasobert library is for ;) On 29 Apr 2010, at 07:30, Bernd Jagla wrote: Hi Jonathan, Just to clarify, I need to write my own das client? I was hoping to be able to use most of the functionality especially for the parsing of the XML and creating the URLs by means of functions/methods that are already around. I am now going into debug mode for the DAS package in biojava to look for the XML parsing, if you any further pointers on specific methods I should be looking at it would mean a lot to me. In short, I think I can create the URLs from scratch with not much effort. I don't currently know how to put the XML into a data structure and how this data structure should look like. Thanks for your kind help, Bernd _____ From: Jonathan Warren [mailto:jw12 at sanger.ac.uk] Sent: Wednesday, April 28, 2010 10:21 PM To: Bernd Jagla Cc: biojava-l at lists.open-bio.org Subject: Re: [Biojava-l] DAS client: how to retrieve features for a sequence region Hi Bernd For the UCSC you need to filter on types. 
see http://genome.ucsc.edu/FAQ/FAQdownloads.html#downloads there is a section called "Downloading data from the UCSC DAS server" for DAS libraries you can see a tutorial here http://www.biodas.org/wiki/DASWorkshop2010#Day_2 the one you would be most interested in is the Dasobert tutorial (http://www.ebi.ac.uk/~rafael/dokuwiki/doku.php?id=das:courses:dasobert) for DAS client creation, but there is a also a good javascript library as well called JSDas. Any more info then don't hesitate to ask. Jonathan. On 28 Apr 2010, at 08:25, Bernd Jagla wrote: Hi there, I am trying to retrieve information (features) from the UCSC genome browser using the DAS interface. I am looking at the org.biojava.bio.program.das sources. I can retrieve all top level entry points with DASSequenceDB(dbURL) (Apperently the last entry from the return XML object gives a [Fatal Error] :1:1: Content is not allowed in prolog. Which I am ignoring...) and also the DSN entries using: DAS das = new DAS(); das.addDasURL(new URL(dbURLString)); for(Iterator i = das.getReferenceServers().iterator(); i.hasNext(); ) {.... When I try to access features for a top level entry point, i.e. a reference sequence I have the impression that first all features for a given reference sequence are being downloaded. My questions: How can I access only the features of a specific region? I guess in DAS terms I want to specify the segment part of the URL (http://genome.ucsc.edu/cgi-bin/das/hg17/features?segment=22:15000000,160000 00). I would also like to get the list of available features. How can I achieve this? From a wireshark output I can see that this is being retrieved somehow behind the scene. How can I access this information? I am looking at TestDAS*.java; are there any other examples around that I can use to learn from? 
Thanks a lot for your kind support, Best, Bernd _______________________________________________ Biojava-l mailing list - Biojava-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biojava-l Jonathan Warren Senior Developer and DAS coordinator jw12 at sanger.ac.uk Ext: 2314 Telephone: 01223 492314 -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From thomascramera at dnastar.com Thu Apr 29 18:14:27 2010 From: thomascramera at dnastar.com (Andy Thomas-Cramer) Date: Thu, 29 Apr 2010 13:14:27 -0500 Subject: [Biojava-l] PDBFileParser and Atom element symbol In-Reply-To: References: Message-ID: Yes, I would like to have direct access to the element symbol data that's in the file. Otherwise, anyone that needs the element type has to create rules for interpreting it from the "atom name" field. It feels wrong to attempt to deduce data when it is provided explicitly. These PDB remediation project notes suggest using the element symbol specified in 77-78 http://nar.oxfordjournals.org/cgi/content/full/36/suppl_1/D426#SEC3 "Atom types are provided for every atom (i.e. ATOM record columns 77-78), so prior atom name justification conventions should no longer be assumed in reading atom names." JMOL uses the PDB element symbol if present, else interprets from the "atom name" field. http://wiki.jmol.org/index.php/AtomSets "On PDB format, Jmol will identify the element from columns 77-78 (element symbol, right-justified).
If this is absent, then it will interpret the "atom name" field (columns 13-14) to deduce the element identity." JMOL is LGPL. If interpretation is desirable, BioJava could start with Jmol's current approach. Personally, I would be happy just with access to the data in the file. ________________________________ From: andreas.prlic at gmail.com [mailto:andreas.prlic at gmail.com] On Behalf Of Andreas Prlic Sent: Monday, April 26, 2010 8:08 PM To: Andy Thomas-Cramer Cc: biojava-l at lists.open-bio.org Subject: Re: [Biojava-l] PDBFileParser and Atom element symbol Hi Andy Questions: * Is this pattern documented in the PDB specification? See here: http://www.wwpdb.org/documentation/format23/sect9.html#ATOM * If this pattern can be relied on, why are columns 77-78 also dedicated to the element symbol? That is the atom's element symbol (as given in the periodic table), in contrast to the first name, which contains numbering information. * Should reliance on the pattern be hidden behind a BioJava method? If you think that is important we could probably provide an enum for all atom types. There are two categories though: the periodic table symbol and the one that is related to the position in an amino acid.... Andreas ________________________________ From: andreas.prlic at gmail.com [mailto:andreas.prlic at gmail.com] On Behalf Of Andreas Prlic Sent: Friday, April 23, 2010 6:52 PM To: Andy Thomas-Cramer Cc: biojava-l at lists.open-bio.org Subject: Re: [Biojava-l] PDBFileParser and Atom element symbol Hi Andy, you could check with Atom.getFullname(), which contains the space characters from the PDB file: e.g. Calpha: " CA ", Calcium: "CA ". In addition, the parent group of a Calpha atom is usually an AminoAcid and for Calciums it is a Hetatom group... Andreas On Fri, Apr 23, 2010 at 3:58 PM, Andy Thomas-Cramer wrote: Is there an easy way to identify the type of atom referenced by an Atom object? For example, if Atom.getName() is "CA", is the element calcium or the atom carbon alpha?
If not, would it be feasible to add a method providing this in Atom, AtomImpl, and parsing it in PDBFileParser, using the columns defined at http://www.wwpdb.org/documentation/format32/sect9.html#ATOM? _______________________________________________ Biojava-l mailing list - Biojava-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biojava-l -- ----------------------------------------------------------------------- Dr. Andreas Prlic Senior Scientist, RCSB PDB Protein Data Bank University of California, San Diego (+1) 858.246.0526 ----------------------------------------------------------------------- From pwrose at ucsd.edu Thu Apr 29 19:53:33 2010 From: pwrose at ucsd.edu (Peter Rose) Date: Thu, 29 Apr 2010 12:53:33 -0700 Subject: [Biojava-l] PDBFileParser and Atom element symbol In-Reply-To: References: Message-ID: <002f01cae7d5$a673fcf0$f35bf6d0$@edu> Since there was a request to be able to access element information, I've added an Element enum to the org.biojava.bio.structure package that I had developed for another application. Each element has a number of properties such as atomic number, mass, min and max valence, electronegativity, etc. that should be useful. The AtomImpl class now has a getter and setter for Element. Also, the PDB parser now populates the Element in the Atom class. By default the PDB parser tries to parse the element from columns 77-78. As a fallback for mis-formatted PDB files that don't contain an element column, the element is parsed from the atom name. We'll also add element support for the cif parser soon. -Peter ________________________________________________ Peter Rose, Ph.D.
Scientific Lead RCSB Protein Data Bank (www.pdb.org) San Diego Supercomputer Center (SDSC) and Skaggs School of Pharmacy and Pharmaceutical Sciences Pharmaceutical Sciences Building University of California San Diego -----Original Message----- From: biojava-l-bounces at lists.open-bio.org [mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of biojava-l-request at lists.open-bio.org Sent: Tuesday, April 27, 2010 9:00 AM To: biojava-l at lists.open-bio.org Subject: Biojava-l Digest, Vol 87, Issue 26 ---------------------------------------------------------------------- Message: 2 Date: Mon, 26 Apr 2010 15:02:11 -0700 From: Robert Buels Subject: [Biojava-l] Google Summer of Code - accepted students To: rmb32 at cornell.edu Message-ID: <4BD60D63.1040400 at cornell.edu> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Hi all, I'm pleased to announce the acceptance of OBF's 2010 Google Summer of Code students, listed in alphabetical order with their project titles and primary mentors: Mark Chapman (PM Andreas Prlic) - Improvements to BioJava including Implementation of Multiple Sequence Alignment Algorithms Jianjiong Gao (PM Peter Rose) - BioJava Packages for Identification, Classification, and Visualization of Posttranslational Modification of Proteins Kazuhiro Hayashi (PM Naohisa Goto) - Ruby 1.9.2 support of BioRuby Sara Rayburn (PM Christian Zmasek) - Implementing Speciation & Duplication Inference Algorithm for Binary and Non-binary Species Tree Joao Pedro Garcia Lopes Maia Rodrigues (PM Eric Talevich) - Extending Bio.PDB: broadening the usefulness of BioPython's Structural Biology module Jun Yin (PM Chris Fields) - BioPerl Alignment Subsystem Refactoring Congratulations to our accepted students! All told, we had 52 applications submitted for the 6 slots (5 originally assigned, plus 1 extra) allotted to us by Google. Proposals were extremely competitive: 6 out of 52 translates to an 11.5% acceptance rate. We received a lot of really excellent proposals, the decisions were not easy. Thanks very much to all the students who applied, we very much appreciate your hard work. Here's to a great 2010 Summer of Code, I'm sure these students will do some wonderful work.
Rob Buels OBF GSoC 2010 Administrator ------------------------------ Message: 3 Date: Mon, 26 Apr 2010 22:33:51 -0700 From: Andreas Prlic Subject: [Biojava-l] accepted GSoC projects To: Jianjiong Gao , Mark Chapman , Biojava , biojava-dev Cc: "Rose, Peter" , Scooter Willis , Kyle Ellrott Message-ID: Content-Type: text/plain; charset=ISO-8859-1 Dear all, Google has released the results for GSoC: Congratulations to Mark Chapman and Jianjiong Gao for having been accepted to work on the MSA and PTM projects for BioJava! Let's start the "community bonding" process ( http://en.flossmanuals.net/GSoCMentoring/MindtheGap ); we are all looking forward to working with you on this during the summer. The mentors and co-mentors will be Peter Rose for the PTM project and Scooter Willis and Kyle Ellrott for the MSA project (and me). I want to thank all of you who submitted proposals or showed interest in other ways for the Google Summer of Code. We hope you are not too disappointed if your application did not get accepted this time. We had a large number (52) of applications and the overall quality of the submissions was very high. We would like to stay in touch with you and we hope that you are interested in BioJava also beyond the scope of GSoC. There are a number of different ways to contribute: We are always looking for people who provide code and patches to further improve our library, help out with the documentation on the Wiki page, or answer questions on the mailing lists. Let's all give Mark and Jianjiong a warm welcome to the BioJava community. For those of you who are interested in following the progress of the projects, as usual, the development related discussions are going to be on the biojava-dev list. Happy coding!
Andreas ------------------------------ _______________________________________________ Biojava-l mailing list - Biojava-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biojava-l End of Biojava-l Digest, Vol 87, Issue 26 ***************************************** From marcel.huntemann at gmail.com Fri Apr 30 00:49:10 2010 From: marcel.huntemann at gmail.com (Marcel Huntemann) Date: Thu, 29 Apr 2010 17:49:10 -0700 Subject: [Biojava-l] Error during genbank parsing Message-ID: <4BDA2906.20801@Gmail.com> Hi! I get the following error during the parsing of a genbank file:
Exception in thread "main" org.biojava.bio.BioException: Could not read sequence
    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
    at gov.doe.jgi.img.pangenomes.Controller.createGeneMap(Controller.java:303)
    at gov.doe.jgi.img.pangenomes.Controller.start(Controller.java:197)
    at gov.doe.jgi.img.pangenomes.Main.createAndStartController(Main.java:105)
    at gov.doe.jgi.img.pangenomes.Main.main(Main.java:35)
Caused by: org.biojava.bio.seq.io.ParseException: A Exception Has Occurred During Parsing. Please submit the details that follow to biojava-l at biojava.org or post a bug report to http://bugzilla.open-bio.org/
Format_object=org.biojavax.bio.seq.io.GenbankFormat
Accession=null
Id=null
Comments=Bad locus line
Parse_block=LOCUS NC_008711 4597686 bp DNA circular 17-DEC-2009
Stack trace follows ....
    at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:322)
    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
    ... 4 more
No matter which genbank file I use, I always get this error (for sure with a different LOCUS line). The strange thing is that this used to work about 1/2 - 1 year ago. Now I wanted to use my program again and always get this error, although I didn't really change anything in that code.
The only thing I can think of that's different, since the last time I used it (when it worked), is that I switched from a 32bit Linux to a 64bit Linux machine. But can that really cause it? Here's my code and how I use it:
for ( String taxonId : givenTaxonIds ) {
    gbkFile = new File( dirPath + taxonId + gbkSuffix );
    if ( ! gbkFile.exists() ) {
        logr.fatal( "Couldn't find genbank file for taxonOID " + taxonId + "!\nI tried " + gbkFile.getPath() + ", but it doesn't exist!" );
        System.exit( 0 );
    }
    BufferedReader br = new BufferedReader( new FileReader( gbkFile ) );
    Namespace ns = RichObjectFactory.getDefaultNamespace();
    RichSequenceIterator seqs = RichSequence.IOTools.readGenbankDNA( br, ns );
    numberInGenome = 0;
    while ( seqs.hasNext() ) {
        RichSequence contig = seqs.nextRichSequence();
        // Get genes and their positions
        Set<Feature> features = contig.getFeatureSet();
        positions = new ArrayList<int[]>();
        geneIds = new ArrayList<String>();
        for ( Feature richFeature : features ) {
            if ( richFeature.getType().equals( "CDS" ) ) {
                RichLocation loc = (RichLocation) richFeature.getLocation();
                position = new int[3];
                position[0] = loc.getMin();
                position[1] = loc.getMax();
                position[2] = loc.getStrand().intValue();
                Annotation a = richFeature.getAnnotation();
                split = a.getProperty( "note" ).toString().split( "=" );
                geneIds.add( split[1].trim() );
                positions.add( position );
            } else if ( richFeature.getType().equals( "gene" ) ) {
                Annotation a = richFeature.getAnnotation();
                if ( a.containsProperty( "pseudo" ) ) {
                    RichLocation loc = (RichLocation) richFeature.getLocation();
                    position = new int[3];
                    position[0] = loc.getMin();
                    position[1] = loc.getMax();
                    position[2] = loc.getStrand().intValue();
                    split = a.getProperty( "note" ).toString().split( "=" );
                    geneIds.add( split[1].trim() );
                    positions.add( position );
                }
            }
        }
Thanks 4 the help, Marcel P.S.: Also the info on some of the biojava pages seems outdated.
I got the latest version from your svn trunk and on the GetStarted page it says that one just has to call ant to build it. But there's no build.xml in the biojava folder. Instead there's a pom.xml, so I guess you switched to Maven. I bet a lot of people don't know how to deal with that and have no clue what to do when the ant command doesn't work... From narciso at cnpaf.embrapa.br Fri Apr 30 21:32:02 2010 From: narciso at cnpaf.embrapa.br (Marcelo Goncalves Narciso (Pesquisador)) Date: Fri, 30 Apr 2010 19:32:02 -0200 Subject: [Biojava-l] problems with intallation of biojava in windows 7 In-Reply-To: <20100430184758.M13673@cnpaf.embrapa.br> References: <20100430184758.M13673@cnpaf.embrapa.br> Message-ID: <20100430212950.M75279@cnpaf.embrapa.br> Hi, people, I need your help. When I try to install biojava in Windows 7, this happens: > C:\Users\narciso\biojava>java -jar biojava-1.7.1-all.jar > Failed to load Main-Class manifest attribute from > biojava-1.7.1-all.jar How can I fix it? Thanks a lot Marcelo
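A note on the error above: "Failed to load Main-Class manifest attribute" is what java -jar reports when a jar's manifest names no Main-Class, which is expected for a library jar that is meant to go on the classpath rather than be started directly. The sketch below uses only the standard java.util.jar API to check a jar for that attribute. The class name MainClassCheck is an illustrative assumption, not part of BioJava.

```java
import java.io.File;
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

public class MainClassCheck {

    /** Returns the Main-Class manifest attribute of a jar, or null if it has none. */
    static String mainClassOf(File jar) throws IOException {
        try (JarFile jf = new JarFile(jar)) {
            Manifest mf = jf.getManifest(); // null when the jar has no manifest at all
            if (mf == null) {
                return null;
            }
            return mf.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java MainClassCheck <file.jar>");
            return;
        }
        String mainClass = mainClassOf(new File(args[0]));
        if (mainClass == null) {
            System.out.println("No Main-Class: use this jar with -cp, not with -jar");
        } else {
            System.out.println("Main-Class: " + mainClass);
        }
    }
}
```

If the check prints no Main-Class for biojava-1.7.1-all.jar, there is nothing to install: compile and run your own program against the jar, for example javac -cp biojava-1.7.1-all.jar MyProgram.java followed by java -cp "biojava-1.7.1-all.jar;." MyProgram on Windows, where MyProgram stands for your own class.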