From huijieqiao at gmail.com  Thu Apr  1 23:02:37 2010
From: huijieqiao at gmail.com (Huijie Qiao)
Date: Fri, 2 Apr 2010 11:02:37 +0800
Subject: [Biojava-l] A bug in Class "org.biojavax.bio.seq.io.GenbankFormat"
Message-ID: <n2ld081ddc11004012002i5eec2478x81716b3f03a7997@mail.gmail.com>

version 1.7.1

line 361
else if (sectionKey.equals(SOURCE_TAG)) {
      // ignore - can get all this from the first feature

Actually, the content of the SOURCE_TAG section and the first feature differ
in some GenBank files.

For example, the example file in
http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchseq_help.html
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=5&rettype=gb

The Source TAG is
SOURCE      Bos taurus (cattle)
  ORGANISM  Bos taurus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Laurasiatheria; Cetartiodactyla; Ruminantia;
            Pecora; Bovidae; Bovinae; Bos.

and the first feature tag is
FEATURES             Location/Qualifiers
     source          1..1136
                     /organism="Bos taurus"
                     /mol_type="mRNA"
                     /db_xref="taxon:9913"
                     /clone="pBB2I"
                     /tissue_type="liver"

I can't get the hierarchy information with the following code:
NCBITaxon taxon = seq.getTaxon();
System.out.println(taxon.getNameHierarchy()); // output is "."
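
For reference, the lineage lines under ORGANISM can also be read straight from the flat-file text. The following is only a minimal sketch in plain Java, not BioJava's parser: the class and method names are hypothetical, and it assumes that the organism name fits on one line and that lineage lines are indented by 12 spaces, as in the record above.

```java
import java.util.ArrayList;
import java.util.List;

final class GenbankLineageSketch {

    /** Returns the taxon names listed under ORGANISM, outermost first. */
    static List<String> parse(String record) {
        StringBuilder lineage = new StringBuilder();
        boolean inOrganism = false;
        for (String line : record.split("\\r?\\n")) {
            if (line.startsWith("  ORGANISM")) {
                inOrganism = true; // this line holds the organism name itself
                continue;
            }
            if (inOrganism) {
                // assumption: lineage continuation lines start with 12 spaces
                if (!line.startsWith("            ")) {
                    break; // next keyword (e.g. REFERENCE or FEATURES) reached
                }
                lineage.append(line.trim()).append(' ');
            }
        }
        String joined = lineage.toString().trim();
        if (joined.endsWith(".")) {
            joined = joined.substring(0, joined.length() - 1); // drop final period
        }
        List<String> taxa = new ArrayList<>();
        for (String name : joined.split(";")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                taxa.add(trimmed);
            }
        }
        return taxa;
    }
}
```

Calling GenbankLineageSketch.parse on the text of the record above would return the names from Eukaryota down to Bos. As the reply below explains, this lineage can differ between files for the same organism, so it is best treated as a convenience rather than as authoritative taxonomy.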

From holland at eaglegenomics.com  Fri Apr  2 03:38:44 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Fri, 2 Apr 2010 08:38:44 +0100
Subject: [Biojava-l] A bug in Class
	"org.biojavax.bio.seq.io.GenbankFormat"
In-Reply-To: <n2ld081ddc11004012002i5eec2478x81716b3f03a7997@mail.gmail.com>
References: <n2ld081ddc11004012002i5eec2478x81716b3f03a7997@mail.gmail.com>
Message-ID: <8319D21D-9548-438A-BCB4-0CB9C5B7F568@eaglegenomics.com>

The parsers don't load the hierarchy from GenBank because it is redundant information that is separately available from the NCBI taxonomy. It also tends to be buggy and can differ between GenBank files for the same organism.

If you want the hierarchy, you need to use BioJava in conjunction with BioSQL and load the NCBI taxonomy into your BioSQL instance ( http://www.biojava.org/wiki/BioJava:BioJavaXDocs#NCBI_Taxonomy_data ), from where BioJava can then retrieve it using the sample code you show in your email.

thanks,
Richard


--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From martin.jones at ed.ac.uk  Fri Apr  2 07:23:21 2010
From: martin.jones at ed.ac.uk (Martin Jones)
Date: Fri, 2 Apr 2010 12:23:21 +0100
Subject: [Biojava-l] A bug in Class
	"org.biojavax.bio.seq.io.GenbankFormat"
In-Reply-To: <8319D21D-9548-438A-BCB4-0CB9C5B7F568@eaglegenomics.com>
References: <n2ld081ddc11004012002i5eec2478x81716b3f03a7997@mail.gmail.com>
	<8319D21D-9548-438A-BCB4-0CB9C5B7F568@eaglegenomics.com>
Message-ID: <v2reb55ec041004020423m7353150enb4631654abb31463@mail.gmail.com>

You can also get the hierarchy directly from the NCBI taxonomy dump...
this is in Groovy but gives you the idea:

// Minimal node type, assumed here because the original message does not show it
class TreeNode {
    Integer taxid
    String rank
}

HashMap<Integer, TreeNode> taxid2node = [:]
HashMap<Integer, Integer> child2parent = [:]

// Leading nodes.dmp fields, separated by "\t|\t": tax_id, parent tax_id, rank
def nodePattern = ~/^(\d+)\t\|\t(\d+)\t\|\t(.+?)\t\|/

new File("/home/martin/nodes.dmp").eachLine { line ->
    def matcher = (line =~ nodePattern)
    if (matcher.find()) {
        Integer myId = matcher.group(1).toInteger()
        Integer parentId = matcher.group(2).toInteger()
        String myRank = matcher.group(3)

        // remember the node itself and the link to its parent
        taxid2node[myId] = new TreeNode(taxid: myId, rank: myRank)
        child2parent[myId] = parentId
    }
}
// do something with the hash
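
One thing to do with the hash is to walk from a taxon up to the root, which gives the hierarchy. A minimal Java sketch of that idea follows. The class and method names are hypothetical, and it assumes the nodes.dmp layout matched by the pattern above (fields separated by tab, pipe, tab) and that the root node lists itself as its own parent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class TaxonomyLineage {

    /** Parses one nodes.dmp line into child -> parent; malformed lines are skipped. */
    static void addNode(String line, Map<Integer, Integer> childToParent) {
        // fields are delimited by "\t|\t"; the line itself ends with "\t|"
        String[] fields = line.split("\\t\\|\\t?");
        if (fields.length < 2) {
            return;
        }
        try {
            int child = Integer.parseInt(fields[0].trim());
            int parent = Integer.parseInt(fields[1].trim());
            childToParent.put(child, parent);
        } catch (NumberFormatException e) {
            // not a node line; ignore it
        }
    }

    /** Returns the ids from taxId up to the root (taxId first). */
    static List<Integer> lineage(int taxId, Map<Integer, Integer> childToParent) {
        List<Integer> path = new ArrayList<>();
        Integer current = taxId;
        // stop at an unknown parent, or when an id repeats (root is its own parent)
        while (current != null && !path.contains(current)) {
            path.add(current);
            current = childToParent.get(current);
        }
        return path;
    }
}
```

To turn the ids into names, the same kind of loop could be run over names.dmp from the same dump; that part is left out here.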


-Martin




From andreas.prlic at gmail.com  Sat Apr  3 11:08:57 2010
From: andreas.prlic at gmail.com (Andreas Prlic)
Date: Sat, 3 Apr 2010 08:08:57 -0700
Subject: [Biojava-l] Anonymous svn down
Message-ID: <2AF9846C-41A0-4CEB-9589-6DB698B2107A@gmail.com>

Hi,

The anonymous svn server seems to be down again. I have already
contacted support @ obf, but have not received a response about when it
will be back up. In the meantime, is anybody volunteering to set up
a fallback mirror at GitHub?

Andreas

From rmb32 at cornell.edu  Sat Apr  3 16:09:27 2010
From: rmb32 at cornell.edu (Robert Buels)
Date: Sat, 03 Apr 2010 13:09:27 -0700
Subject: [Biojava-l] Google Summer of Code is *ON* for OBF projects!
Message-ID: <4BB7A077.4070802@cornell.edu>

Hi all,

Reminder:  GSoC student proposals must be submitted to Google by April 
9th, 19:00 UTC.  That's less than a week away.

Students: you should ALREADY be working with mentors on the project 
mailing lists, they can help you get your proposal into shape.

So far, we have 5 proposals submitted to our org in Google's web app. 
Keep them coming, and let's see some really good ones!

Rob Buels
OBF GSoC 2010 Administrator


From jianjiong.gao at gmail.com  Sun Apr  4 02:33:15 2010
From: jianjiong.gao at gmail.com (Jianjiong Gao)
Date: Sun, 4 Apr 2010 01:33:15 -0500
Subject: [Biojava-l] GSoC project question
Message-ID: <g2zc82264f51004032333hc75e197bwd085f55ce901ea3e@mail.gmail.com>

Hello,

My name is Jianjiong Gao, a graduate student in the Computer Science
Department at the University of Missouri-Columbia. I am very interested in
applying for your GSoC project "Identification and Classification of
Posttranslational Modification of Proteins". This project is highly
related to my dissertation topic, "Bioinformatic analysis and
prediction of phosphorylation and other PTMs." Although I have not
worked on the structural side of PTMs so far, I am really interested in
learning and expanding my research into this field.

After reading the project description on the idea page
(http://biojava.org/wiki/Google_Summer_of_Code), I have several
questions regarding the *approach* section:

> 1. Establish a list of known PTMs and write code to locate these PTMs in a 3D protein structure.

Q1: There are many different types of PTMs. Do you have a list of PTMs
of interest? Do you have priorities among the different PTMs?
Q2: Is there any available algorithm to locate the PTMs in a 3D
protein structure? What is the difficulty in this task?
Q3: The PDB file contains annotations of residue modifications such as
HETATM and MODRES. Can we utilize this information for localizing the
PTMs?

> 2. Determine the protein residues that carry PTMs based on distance thresholds.
> 3. Traverse the sugar molecules and establish their link pattern based on connectivity.

Q4: Is this task to determine the types of glycosylation, i.e.,
N-linked glycosylation, O-N-acetylgalactosamine, O-glucose, etc.?
Q5: Is there any available algorithm to do this? What is the
difficulty in this task? It looks complicated with so many different
types of glycosylation and structural isomers.

> 4. Present the PTMs as text in a linear notation and 2D graphical representations if time permits.

Q6: Can we use the SMILES format
(http://en.wikipedia.org/wiki/Simplified_molecular_input_line_entry_specification)
here? Or are there any better options?

Thanks very much for your time. I am looking forward to hearing from you.

Best Regards,
-JJ

From rmb32 at cornell.edu  Sun Apr  4 00:37:38 2010
From: rmb32 at cornell.edu (Robert Buels)
Date: Sat, 03 Apr 2010 21:37:38 -0700
Subject: [Biojava-l] Reminder: GSoC student applications due April 9,
	19:00 UTC
Message-ID: <4BB81792.8060001@cornell.edu>

Hi all,

Sending this again with a different subject line, just in case.

GSoC student proposals must be submitted to Google through their web 
application by *April 9th, 19:00 UTC*.  That's less than a week away.

Students: you should ALREADY be working with mentors on the project
mailing lists, they can help you get your proposal into shape.

So far, we have 6 proposals submitted to our org in Google's web app.
Keep them coming, and keep them good!

Rob Buels
OBF GSoC 2010 Administrator


From nagendravns at gmail.com  Sun Apr  4 12:12:11 2010
From: nagendravns at gmail.com (nagendra kumar)
Date: Sun, 4 Apr 2010 21:42:11 +0530
Subject: [Biojava-l] how to add api
Message-ID: <m2hb48b10f61004040912s7da77636vce5e2426dae7c936@mail.gmail.com>

Sir, I want to develop a project with BioJava. Please give me details on how
to install the BioJava API on my system.

From chapman at cs.wisc.edu  Sun Apr  4 13:54:59 2010
From: chapman at cs.wisc.edu (Mark Chapman)
Date: Sun, 04 Apr 2010 12:54:59 -0500
Subject: [Biojava-l] how to add api
In-Reply-To: <m2hb48b10f61004040912s7da77636vce5e2426dae7c936@mail.gmail.com>
References: <m2hb48b10f61004040912s7da77636vce5e2426dae7c936@mail.gmail.com>
Message-ID: <4BB8D273.7080601@cs.wisc.edu>

Everything you need is at:
http://biojava.org/wiki/BioJava:Download

On 4/4/2010 11:12 AM, nagendra kumar wrote:
> Sir, I want to develop a project with BioJava. Please give me details on how
> to install the BioJava API on my system.
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

From anantpossible at gmail.com  Sun Apr  4 13:58:15 2010
From: anantpossible at gmail.com (Anant Jain)
Date: Sun, 4 Apr 2010 23:28:15 +0530
Subject: [Biojava-l] how to add api
In-Reply-To: <m2hb48b10f61004040912s7da77636vce5e2426dae7c936@mail.gmail.com>
References: <m2hb48b10f61004040912s7da77636vce5e2426dae7c936@mail.gmail.com>
Message-ID: <h2pbd096e3c1004041058h48b770c7gfbfe787d2972b141@mail.gmail.com>

On 4/4/10, nagendra kumar <nagendravns at gmail.com> wrote:
>
> Sir, I want to develop a project with BioJava. Please give me details on how
> to install the BioJava API on my system.
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


Hi,

To use the BioJava API, you need to download the BioJava JAR and perform the
following steps:

1. Extract the JAR; you will get several more JARs and files.
2. Copy these JARs into "C:\Program Files\Java\jre6\lib\ext",
if Java is installed on your C drive.


-- 
Anant Jain
B.Tech Bioinformatics, RHCE

From sacomoto at gmail.com  Tue Apr  6 01:29:23 2010
From: sacomoto at gmail.com (Gustavo Akio Tominaga Sacomoto)
Date: Tue, 6 Apr 2010 02:29:23 -0300
Subject: [Biojava-l] GSoC project on MSA
Message-ID: <w2r6a8f5b081004052229u7d9369f5q38f0f827a0c3e9a6@mail.gmail.com>

Hello,

I'm currently a graduate student at the University of São Paulo (Brazil)
and I'm quite interested in applying for the all-Java MSA project. I'm
already familiar with the multiple sequence alignment problem: I
developed a lossless filter for this problem as my undergraduate final
project. The work is described here
[http://www.almob.org/content/4/1/3] and there is an online version of
the algorithm here
[http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu].

Now, regarding the project, just to make it clear: when you say
"straightforward approach for building up the MSA progressively", you
mean the standard dynamic programming approach for pairwise alignment,
following the guide tree built in the second step, right?

One last question: should I send my proposal directly to Google's
web app or here first?

Thanks,

Gustavo Sacomoto


From andreas at sdsc.edu  Tue Apr  6 13:46:16 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 6 Apr 2010 10:46:16 -0700
Subject: [Biojava-l] GSoC project on MSA
In-Reply-To: <w2r6a8f5b081004052229u7d9369f5q38f0f827a0c3e9a6@mail.gmail.com>
References: <w2r6a8f5b081004052229u7d9369f5q38f0f827a0c3e9a6@mail.gmail.com>
Message-ID: <l2m59a41c431004061046mdc4c4270r825341932e59dfcc@mail.gmail.com>

Hi Gustavo,

By "straightforward" I meant that we only have 3 months for this project and
we should not try to solve all problems at the same time. Probably a
realistic approach is to start by keeping things modular and simple
(think interfaces and implementations) and to stick to standard solutions that
have been shown to work elsewhere. If there is more time in the project, one
can then replace some of the implementations with technically more advanced
ones.
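
To illustrate the "interfaces and implementations" idea, here is a minimal Java sketch of how the progressive MSA steps might be split so that each one can be replaced later. All names are hypothetical and are not part of any existing BioJava API; sequences are plain strings only to keep the example short.

```java
import java.util.List;

final class MsaPipelineSketch {

    /** Step 1: pairwise distances between all input sequences. */
    interface DistanceMatrixBuilder {
        double[][] build(List<String> sequences);
    }

    /** Step 2: guide tree from the distance matrix; T is whatever tree type is chosen. */
    interface GuideTreeBuilder<T> {
        T build(double[][] distances);
    }

    /** Step 3: progressive alignment that follows the guide tree. */
    interface ProgressiveAligner<T> {
        List<String> align(List<String> sequences, T guideTree);
    }

    /** Wires the three steps together; each one can be swapped independently. */
    static <T> List<String> run(List<String> sequences,
                                DistanceMatrixBuilder distances,
                                GuideTreeBuilder<T> trees,
                                ProgressiveAligner<T> aligner) {
        double[][] matrix = distances.build(sequences);
        T tree = trees.build(matrix);
        return aligner.align(sequences, tree);
    }
}
```

With this split, a simple first implementation of any step can later be exchanged for a more advanced one without touching the other two.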

Since we are doing things in Java I am interested in having support for
parallelisation wherever possible. Another issue is how to verify that the
created alignments are meaningful. One could e.g. use the biojava structure
modules to calculate protein structure alignments to verify the quality of
the obtained multiple sequence alignments.

All applications have to be made via Google. We are providing comments on
drafts of proposals and try to work together with applicants to improve the
submissions. Note: The application deadline is soon and speed is important
now.

Andreas




-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From sacomoto at gmail.com  Tue Apr  6 14:53:04 2010
From: sacomoto at gmail.com (Gustavo Akio Tominaga Sacomoto)
Date: Tue, 6 Apr 2010 15:53:04 -0300
Subject: [Biojava-l] GSoC project on MSA
In-Reply-To: <l2m59a41c431004061046mdc4c4270r825341932e59dfcc@mail.gmail.com>
References: <w2r6a8f5b081004052229u7d9369f5q38f0f827a0c3e9a6@mail.gmail.com> 
	<l2m59a41c431004061046mdc4c4270r825341932e59dfcc@mail.gmail.com>
Message-ID: <j2r6a8f5b081004061153la171784dx7fbbf933d7dc7326@mail.gmail.com>

Hello Andreas,

On Tue, Apr 6, 2010 at 2:46 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
> Hi Gustavo,
>
> With straightforward I meant that we only have 3 months for this project and
> we should not try to solve all problems at the same time. Probably a
> realistic approach is to start with trying to keep things modular and simple
> (think interfaces and implementations) and stick to standard solutions that
> have been shown to work elsewhere. If there is more time in the project one
> can then replace some of the implementations with technically more advanced
> ones.

I think my question wasn't very clear. My intention in this project is
to follow the approach (with the three steps) outlined in the project's
page, using the classical progressive alignment heuristic: build the
distance matrix, build the guide tree, and use this tree to
progressively align more sequences together.

What I propose for the third step is a first implementation using the
simpler dynamic programming described in the first CLUSTAL paper
(I think it's from 1988), then incrementally improving the algorithm to
get closer to the one described in the CLUSTALW paper (from 1994). Is this
more or less what you had in mind?
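
As a point of reference for the dynamic programming mentioned here, the sketch below computes a global (Needleman-Wunsch style) alignment score for two sequences with a linear gap penalty. It is only the pairwise building block under simple assumptions, not the profile alignment of either CLUSTAL paper; the class name and the scoring parameters are hypothetical.

```java
final class GlobalAlignmentScore {

    /**
     * Best global alignment score of a and b.
     * match/mismatch score aligned residue pairs; gap (usually negative)
     * is added once per gap position.
     */
    static int score(String a, String b, int match, int mismatch, int gap) {
        int[] previous = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j * gap; // empty prefix of a against b[0..j)
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] current = new int[b.length() + 1];
            current[0] = i * gap; // a[0..i) against empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int pair = a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch;
                int diagonal = previous[j - 1] + pair; // align a[i-1] with b[j-1]
                int gapInB = previous[j] + gap;        // a[i-1] against a gap
                int gapInA = current[j - 1] + gap;     // b[j-1] against a gap
                current[j] = Math.max(diagonal, Math.max(gapInB, gapInA));
            }
            previous = current;
        }
        return previous[b.length()];
    }
}
```

Only two rows of the matrix are kept because this version returns the score alone; recovering the alignment itself would need the full matrix or a traceback structure.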

> Since we are doing things in Java I am interested in having support for
> parallelisation wherever possible. Another issue is how to verify that the
> created alignments are meaningful. One could e.g. use the biojava structure
> modules to calculate protein structure alignments to verify the quality of
> the obtained multiple sequence alignments.

About parallel strategies, I think a relatively easy place to use them
is the distance matrix construction: we could have several threads
calculating the pairwise alignments for different pairs of sequences in
the set.
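
A minimal Java sketch of that idea, assuming a fixed thread pool on a single machine: each pair of sequences becomes one task, and each task writes only its own two matrix cells. The class name is hypothetical, and the pairwise distance function is passed in as a parameter because the real one would come from the pairwise alignment code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToDoubleBiFunction;

final class ParallelDistanceMatrix {

    /** Fills a symmetric matrix with pairDistance(seq_i, seq_j), one task per pair. */
    static double[][] compute(List<String> sequences,
                              ToDoubleBiFunction<String, String> pairDistance,
                              int threads) {
        int n = sequences.size();
        double[][] matrix = new double[n][n];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                for (int j = i + 1; j < n; j++) {
                    final int a = i;
                    final int b = j;
                    pending.add(pool.submit(() -> {
                        double value = pairDistance.applyAsDouble(
                                sequences.get(a), sequences.get(b));
                        matrix[a][b] = value; // no two tasks share a cell
                        matrix[b][a] = value;
                    }));
                }
            }
            for (Future<?> task : pending) {
                task.get(); // wait, and surface any failure from the task
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while computing distances", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a pairwise distance task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return matrix;
    }
}
```

For many short sequences, one task per pair may be too fine-grained; batching a whole row per task is one possible refinement.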

Now, measuring alignment quality is a tougher issue. The CLUSTALW
paper doesn't give any way to measure the quality of the result; they
consider a good alignment to be one that is hard to improve by eye (but
they claim that for sufficiently similar sequences, with no pair less than
35% identical, the results are good). Can I do the same as in the CLUSTALW
paper and leave the quality measure to the user? How concerned should
I be with that in this project?

> All applications have to be made via Google. We are providing comments on
> drafts of proposals and try to work together with applicants to improve the
> submissions. Note: The application deadline is soon and speed is important
> now.

I will try to send a proposal draft to this mailing list by tomorrow
to get some feedback from you.


Thanks for your help.

gustavo


From andreas at sdsc.edu  Tue Apr  6 17:27:15 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 6 Apr 2010 14:27:15 -0700
Subject: [Biojava-l] GSoC project on MSA
In-Reply-To: <j2r6a8f5b081004061153la171784dx7fbbf933d7dc7326@mail.gmail.com>
References: <w2r6a8f5b081004052229u7d9369f5q38f0f827a0c3e9a6@mail.gmail.com>
	<l2m59a41c431004061046mdc4c4270r825341932e59dfcc@mail.gmail.com>
	<j2r6a8f5b081004061153la171784dx7fbbf933d7dc7326@mail.gmail.com>
Message-ID: <g2r59a41c431004061427kd3c3d8d4j45a253e7e50ad66e@mail.gmail.com>

Hi Gustavo,

In principle I agree to all, see details below:


> I think my question wasn't very clear. My intention in this project is
> to follow the approach (with the three steps) outlined in the project's
> page, using the classical progressive alignment heuristic: build the
> distance matrix, build the guide tree, and use this tree to
> progressively align more sequences together.
>

yes


>
> What I propose for the third step is a first implementation using the
> simpler dynamic programming described in the first CLUSTAL paper
> (I think it's from 1988), then incrementally improving the algorithm to
> get closer to the one described in the CLUSTALW paper (from 1994). Is this
> more or less what you had in mind?
>

yes, sounds good.


>
> About parallel strategies, I think a relatively easy place to use them
> is the distance matrix construction: we could have several threads
> calculating the pairwise alignments for different pairs of sequences in
> the set.
>

Correct. Probably a first implementation would be for a single machine/
multi CPU. More advanced implementations could provide support e.g. for
Map/Reduce, JPPF, or something like that...

> Now, measuring alignment quality is a tougher issue. The CLUSTALW
> paper doesn't give any way to measure the quality of the result; they
> consider a good alignment to be one that is hard to improve by eye (but
> they claim that for sufficiently similar sequences, with no pair less than
> 35% identical, the results are good). Can I do the same as in the CLUSTALW
> paper and leave the quality measure to the user? How concerned should
> I be with that in this project?
>

Getting an overall core algorithm that works should be the priority. The
benchmarking part is not mandatory, but something to keep in mind... I have
plenty of material for that, once we get to that stage...

> I will try to send a proposal draft to this mailing list by tomorrow
> to get some feedback from you.
>

Excellent, looking forward to it.

Andreas

-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From sacomoto at gmail.com  Wed Apr  7 01:29:31 2010
From: sacomoto at gmail.com (Gustavo Akio Tominaga Sacomoto)
Date: Wed, 7 Apr 2010 02:29:31 -0300
Subject: [Biojava-l] GSoC project on MSA
In-Reply-To: <g2r59a41c431004061427kd3c3d8d4j45a253e7e50ad66e@mail.gmail.com>
References: <w2r6a8f5b081004052229u7d9369f5q38f0f827a0c3e9a6@mail.gmail.com> 
	<l2m59a41c431004061046mdc4c4270r825341932e59dfcc@mail.gmail.com> 
	<j2r6a8f5b081004061153la171784dx7fbbf933d7dc7326@mail.gmail.com> 
	<g2r59a41c431004061427kd3c3d8d4j45a253e7e50ad66e@mail.gmail.com>
Message-ID: <q2q6a8f5b081004062229ja299ca51j7d43daa93c155e0b@mail.gmail.com>

Hi Andreas,

My proposal is pasted at the end of this e-mail.

I'm waiting for your feedback.

Thanks,

gustavo


-------------------------------------------------------------

GSoC proposal

Abstract
--------

This project aims to develop an all-Java implementation of a multiple
sequence alignment (MSA) algorithm to be added to the Biojava toolkit,
using the progressive algorithm described in the CLUSTALW paper [1].

The Importance
--------------

Multiple sequence alignment is a frequently performed task in sequence
analysis, with the goal of identifying new members of protein families and
inferring phylogenetic relationships between proteins and genes. At
present there is no Java-only implementation of this algorithm. As
such, a number of existing Java-related bioinformatics
tools and web sites would benefit from this implementation, and
sequence analysis could be performed more easily by the end user.

About Me
--------

I am a graduate student at the University of São Paulo (Brazil). I got my
undergraduate degree from the same university with a major in Computer
Science and a minor in Biology. I have been involved with
bioinformatics for 5 years, always in sequence analysis, with
particular interest in the MSA problem. In my undergraduate
final project I developed a lossless filter (pruning algorithm) for
the MSA problem; the work is published in [3] and there is an online
implementation of the algorithm in [4]. Finally, I have experience
with the C, C++, Java, Python and Ruby programming languages, and with
the Git and SVN version control systems.

Project Plan
------------

The project is divided into four main steps. At the end of each step, a
completely functional and bug-free new algorithm will be added to the
Biojava code base. Note that each step depends strongly on the
previous one, so careful testing will be done before moving to the
next step.

The four steps are described below, with an estimated time for
each. Some steps also describe extra enhancements, which will be
implemented if there is time remaining after all steps are
completed.

** 1. Study the Biojava pairwise alignment code and update it to be
compliant with Biojava 3.

 The pairwise alignment will play an important role in the MSA
algorithm. This step is also important for me to get used to the
Biojava coding standards and get in touch with the Biojava dev
community.

 ETA: 2 weeks.

** 2. Implement the algorithm to build the distance matrix.

 This is done using the pairwise alignment for each pair of sequences
in the set to be aligned.

 ETA: 1 week.

 EXTRA: Enhance the basic algorithm to use parallel strategies, use
several threads to calculate the pairwise alignment for different
pairs in the sequence set.
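
 One common way to turn a pairwise alignment into a distance is to use
one minus the fraction of identical positions, ignoring columns that
contain a gap. The sketch below assumes that convention (the proposal
itself does not fix the distance measure); the class name is
hypothetical and the inputs are two already aligned strings with '-'
as the gap character.

```java
final class IdentityDistance {

    /** Returns 1 - (identical columns / compared columns), skipping gapped columns. */
    static double distance(String alignedA, String alignedB) {
        if (alignedA.length() != alignedB.length()) {
            throw new IllegalArgumentException("aligned sequences must have equal length");
        }
        int compared = 0;
        int identical = 0;
        for (int i = 0; i < alignedA.length(); i++) {
            char x = alignedA.charAt(i);
            char y = alignedB.charAt(i);
            if (x == '-' || y == '-') {
                continue; // gap columns are not counted
            }
            compared++;
            if (x == y) {
                identical++;
            }
        }
        if (compared == 0) {
            return 1.0; // nothing comparable: treat as maximally distant
        }
        return 1.0 - (double) identical / compared;
    }
}
```

 The full matrix is then this value for every pair of sequences, which
is also what makes the step easy to run in parallel.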

** 3. Implement the algorithm to build the guide tree.

 The guide tree is based on the distance matrix built in the last
step, the tree construction strategy adopted will be the Neighbor
Joining Algorithm.

 ETA: 2 weeks.
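
 For illustration, one iteration of Neighbor Joining can be sketched as
follows: compute Q(i,j) = (n-2)*d(i,j) - r_i - r_j, where r_i is the
sum of row i, join the pair with the smallest Q, and compute the
distance from the new node to every remaining taxon. The class and
method names are hypothetical, and this is only the selection step,
not a complete tree builder.

```java
final class NeighborJoiningStep {

    /** Returns {i, j}, i < j, minimising Q(i,j) = (n-2)*d[i][j] - r_i - r_j. */
    static int[] pairToJoin(double[][] d) {
        int n = d.length;
        if (n < 3) {
            throw new IllegalArgumentException("need at least 3 taxa");
        }
        double[] rowSum = new double[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                rowSum[i] += d[i][j];
            }
        }
        int bestI = -1;
        int bestJ = -1;
        double bestQ = Double.POSITIVE_INFINITY;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double q = (n - 2) * d[i][j] - rowSum[i] - rowSum[j];
                if (q < bestQ) { // ties keep the first pair found
                    bestQ = q;
                    bestI = i;
                    bestJ = j;
                }
            }
        }
        return new int[] {bestI, bestJ};
    }

    /** Distance from the node that joins i and j to another taxon k. */
    static double distanceToNewNode(double[][] d, int i, int j, int k) {
        return 0.5 * (d[i][k] + d[j][k] - d[i][j]);
    }
}
```

 Repeating this step on the shrinking matrix until three nodes remain
gives the full guide tree; the straightforward version costs O(n^3)
overall.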

** 4. Implement the algorithm for progressive MSA using the guide tree.

 This is certainly the most difficult part of the project, so to make
sure we deliver a fully functional MSA algorithm, a safer
approach will be taken. First, the dynamic
programming algorithm described in [2] will be implemented. Once this
is successfully done and the code is fully integrated into the Biojava
code base, the features described in [1] will be incrementally
added (and tested) in order to implement the full dynamic programming
algorithm.

 ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks.

 EXTRA: Implement some benchmark technique to measure the final
alignment quality.

References
----------

[1] http://www.ncbi.nlm.nih.gov/pubmed/7984417
[2] http://www.ncbi.nlm.nih.gov/pubmed/3243435
[3] http://www.almob.org/content/4/1/3
[4] http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu


On Tue, Apr 6, 2010 at 6:27 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
> Hi Gustavo,
>
> In principle I agree to all, see details below:
>
>
> I think my question wasn't very clear, my intention in this project is
>>
>> to follow the approach (with the tree steps) outlined in the project's
>> page. Using the classical progressive alignment heuristic: build the
>> distance matrix, build the guide tree and using this tree
>> progressively align more sequences together.
>
> yes
>
>>
>> What I propose for the third step is a first implementation using the
>> (more simple) dynamic programming described in the first CLUSTAL paper
>> (I thinks it's from 1988) and incrementally improving the algorithm to
>> get closer to the one described in CLUSTALW paper (from 1994). Is this
>> more or less what you had in mind?
>
> yes, sounds good.
>
>>
>> About parallel strategies, I think a relative easy way we could use it
>> is in the distance matrix construction, we could have several threads
>> calculating the pairwise alignment for different pairs of sequence in
>> the set.
>
> Correct. Probably a first implementation would be for a single machine/
> multi CPU. More advanced implementations could provide support e.g. for
> Map/Reduce, JPPF, or something like that...
>
>> Now, the alignment quality measures is a tougher issue. The CLUSTALW
>> paper doesn't give any way to measure the quality of the result, they
>> consider a good alignment the one that is hard to improve by eye (But
>> they claim that for sequences sufficient similar, no pair less than
>> 35% identical, the results are good). Can I do the same as in CLUSTALW
>> paper and leave the quality measure to the user? How concerned should
>> I be with that in this project?
>
> Getting an overall core-algorithm that works should be priority. The
> benchmarking part is not mandatory, but something to keep in mind... I have
> plenty of material for that, once we get to that stage...
>
>> I will try to send a proposal draft to this mailing list by tomorrow
>> to get some feedback from you.
>
> Excellent, looking forward to it.
>
> Andreas
>
> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------
>


From sma.hmc at gmail.com  Wed Apr  7 03:52:34 2010
From: sma.hmc at gmail.com (Singer Ma)
Date: Wed, 7 Apr 2010 00:52:34 -0700
Subject: [Biojava-l] Questions about Summer of Code Project
Message-ID: <s2w62ed8c081004070052z19ac68c0i2ffa200d65d92d75@mail.gmail.com>

I had previously sent this, but was not part of the mailing list, so I
can only assume it got lost in a spam loop.

I was interested in applying for the All-Java Multiple Sequence
Alignment Google Summer of Code project. I wanted to create a project
plan but had some questions about the package as it stands now.

1. What exactly has changed with the transition to BioJava 3? From
what I've read on the BioJava 3 proposal page, it seems that the
changes are to the organization of the code. Additionally, there are
some new standards to follow. Java 6 usage is desired, but I am unsure
which of the new features could be used in modifying pairwise
sequence alignments.

2. Is the Neighbor Joining Algorithm really the best for this? Are
other multiple alignment implementations desired? I have implemented
the neighbor joining algorithm (very inefficiently) in Python, and it was
not particularly difficult, so this step seems like it will not take very
long. Additionally, regarding parallelism: I have no experience with
parallelism in Java and will only have some experience with it in C. Will
that be an issue?

3. Is there a specific paper with the exact algorithm that should be
implemented here?

General: Will use cases be provided? Will test data be provided? These
would both be useful in coding the test cases which seem to be coded
first.

Additionally, I have access to my current Windows machine as well as
a Linux machine for testing, but no Mac. In theory, with Java,
if it works on one platform it works on another, and especially if
it works on Linux it should be fine on Mac, but should I be worried about
strange peculiarities?

Thanks,
Singer Ma
Harvey Mudd College 2011

From ayates at ebi.ac.uk  Wed Apr  7 07:27:27 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 7 Apr 2010 12:27:27 +0100
Subject: [Biojava-l] Anonymous svn down
In-Reply-To: <2AF9846C-41A0-4CEB-9589-6DB698B2107A@gmail.com>
References: <2AF9846C-41A0-4CEB-9589-6DB698B2107A@gmail.com>
Message-ID: <36985994-D7D3-4228-90C7-0C594DA0A706@ebi.ac.uk>

By the looks of things this is quite a simple process to do:

http://github.com/guides/import-from-subversion

http://blog.woobling.org/2009/06/git-svn-abandon.html

http://blog.johngoulah.com/2009/11/migrating-svn-to-git/

The difficult part seems to be providing an SVN -> GitHub user mapping. Apart from that, it's a question of how much space the import will take up.

Andy

On 3 Apr 2010, at 16:08, Andreas Prlic wrote:

> Hi,
> 
> the anonymous svn server seems to be down again. I have already contacted support @ obf, but have not received a response about when it should be back up. In the meantime, is anybody volunteering to set up a fallback mirror at github?
> 
> Andreas
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From Stefan.Bleckmann at uni-duesseldorf.de  Wed Apr  7 08:08:45 2010
From: Stefan.Bleckmann at uni-duesseldorf.de (Stefan Bleckmann)
Date: Wed, 07 Apr 2010 14:08:45 +0200
Subject: [Biojava-l] SubstitutionMatrix
Message-ID: <0DAB7300-372D-43D1-BFC6-2D8DB1C7CCC1@uni-duesseldorf.de>

Hi all!

I have a problem reading the NUC4.2 and 4.4 matrix files with the SubstitutionMatrix class included in BioJava 1.7.1.
A small example:


		File d = new File("/Users/-----/Desktop/NUC");
		FiniteAlphabet alphabet = (FiniteAlphabet) AlphabetManager.alphabetForName("DNA");
		try {
			@SuppressWarnings("unused")
			final SubstitutionMatrix matrix = new SubstitutionMatrix(alphabet,d);
		} catch (NumberFormatException e) {
			e.printStackTrace();
		} catch (NoSuchElementException e) {
			e.printStackTrace();
		} catch (BioException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}


Thrown exception:

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 0
	at java.lang.String.charAt(String.java:686)
	at org.biojava.bio.alignment.SubstitutionMatrix.parseMatrix(SubstitutionMatrix.java:304)
	at org.biojava.bio.alignment.SubstitutionMatrix.<init>(SubstitutionMatrix.java:100)
	at MatrixTest.main(MatrixTest.java:30)


All BLOSUM matrix files I have downloaded work, so I don't think the problem is a wrong encoding or something similar.
Does anybody have an idea?

Cheers Stefan


From andreas.draeger at uni-tuebingen.de  Wed Apr  7 09:32:23 2010
From: andreas.draeger at uni-tuebingen.de (Andreas Dräger)
Date: Wed, 07 Apr 2010 15:32:23 +0200
Subject: [Biojava-l] SubstitutionMatrix
In-Reply-To: <0DAB7300-372D-43D1-BFC6-2D8DB1C7CCC1@uni-duesseldorf.de>
References: <0DAB7300-372D-43D1-BFC6-2D8DB1C7CCC1@uni-duesseldorf.de>
Message-ID: <20100407153223.20121fzzwyubkr53@webmail.uni-tuebingen.de>

Hi Stefan,

Thank you for this hint. I don't know what the problem is. I tested it
recently and it worked. I'll have a look at it tomorrow and get back
to you with an answer soon!

Cheers
Andreas

Dipl.-Bioinform. Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 Tübingen
Germany

Phone: +49-7071-29-70436
Fax:   +49-7071-29-5091


From holland at eaglegenomics.com  Wed Apr  7 09:48:21 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 7 Apr 2010 14:48:21 +0100
Subject: [Biojava-l] SubstitutionMatrix
In-Reply-To: <20100407153223.20121fzzwyubkr53@webmail.uni-tuebingen.de>
References: <0DAB7300-372D-43D1-BFC6-2D8DB1C7CCC1@uni-duesseldorf.de>
	<20100407153223.20121fzzwyubkr53@webmail.uni-tuebingen.de>
Message-ID: <20ACD602-7575-46DB-AFD7-348AEB37CF68@eaglegenomics.com>

I've found the problem already - the SubstitutionMatrix class has a few inconsistencies in the use of trimmed and untrimmed versions of lines. The guessAlphabet() method in this case is falling over because of an unchecked blank line in the matrix file.

I've submitted a patch to trunk which fixes all the inconsistencies and should also fix this problem with the NUC files.
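
For readers hitting the same exception, here is a minimal sketch of the kind of guard that avoids it. This is not the patch committed to trunk; the class and method names (MatrixLineFilter, dataLines) are hypothetical, and the assumption that comment lines start with '#' is taken from the usual NCBI matrix file layout. The point is only that each line is trimmed once and checked for emptiness before charAt(0) is called.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; this is NOT the BioJava parser.
final class MatrixLineFilter {

    /** Returns the trimmed data lines of a matrix file, skipping blank and '#' comment lines. */
    static List<String> dataLines(String fileContent) {
        List<String> lines = new ArrayList<String>();
        for (String raw : fileContent.split("\r?\n")) {
            String line = raw.trim();
            // Test for emptiness BEFORE touching charAt(0): calling charAt(0)
            // on an empty string throws StringIndexOutOfBoundsException.
            if (line.length() == 0 || line.charAt(0) == '#') {
                continue;
            }
            lines.add(line);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Toy content with stray blank lines, similar in spirit to the NUC files.
        String toyFile = "# toy nucleotide matrix\n\n   A   T\nA  5  -4\n\nT -4   5\n\n";
        for (String line : dataLines(toyFile)) {
            System.out.println(line);
        }
    }
}
```

Using the same trimmed string for both the emptiness test and the later parsing is what keeps the trimmed and untrimmed views from drifting apart.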



--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From Stefan.Bleckmann at uni-duesseldorf.de  Wed Apr  7 10:01:04 2010
From: Stefan.Bleckmann at uni-duesseldorf.de (Stefan Bleckmann)
Date: Wed, 07 Apr 2010 16:01:04 +0200
Subject: [Biojava-l] SubstitutionMatrix
Message-ID: <512EA47A-6F40-4A38-B69D-5990D273C9DD@uni-duesseldorf.de>

Hi Richard,

Thanks for your fast reply. I found the same solution. The problem was two additional line breaks in the file, which I didn't see in the editor I used to check the file.


Cheers Stefan


From andreas.prlic at gmail.com  Wed Apr  7 11:13:04 2010
From: andreas.prlic at gmail.com (Andreas Prlic)
Date: Wed, 7 Apr 2010 08:13:04 -0700
Subject: [Biojava-l] Anonymous svn down
In-Reply-To: <36985994-D7D3-4228-90C7-0C594DA0A706@ebi.ac.uk>
References: <2AF9846C-41A0-4CEB-9589-6DB698B2107A@gmail.com>
	<36985994-D7D3-4228-90C7-0C594DA0A706@ebi.ac.uk>
Message-ID: <k2v59a41c431004070813w746f184ch4fa5a2b2b45c0bca@mail.gmail.com>

Hi Andy,

In the meantime, Kyle Ellrott has already set up a first GitHub clone...

http://github.com/biojava/biojava

We are just monitoring it a bit to make sure it works properly...

Is the user mapping important? We have some 50+ users, so that might be
painful...

Andreas



-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From ayates at ebi.ac.uk  Wed Apr  7 11:17:27 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 7 Apr 2010 16:17:27 +0100
Subject: [Biojava-l] Anonymous svn down
In-Reply-To: <k2v59a41c431004070813w746f184ch4fa5a2b2b45c0bca@mail.gmail.com>
References: <2AF9846C-41A0-4CEB-9589-6DB698B2107A@gmail.com>
	<36985994-D7D3-4228-90C7-0C594DA0A706@ebi.ac.uk>
	<k2v59a41c431004070813w746f184ch4fa5a2b2b45c0bca@mail.gmail.com>
Message-ID: <647FD3F8-5222-487C-872F-DF00B693C809@ebi.ac.uk>

Hey Andreas,

The user mapping file only matters if we want a coherent link between our SVN users and those who have a GitHub account. For example, any commit of mine appears as ayates; however, it would probably be more useful to link to my GitHub user, since that would have more information about what I'm doing with the repo, e.g. writing some snazzy new BJ3 code :).

Andy


-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From andreas at sdsc.edu  Wed Apr  7 15:12:27 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 7 Apr 2010 12:12:27 -0700
Subject: [Biojava-l] GSoC project on MSA
In-Reply-To: <q2q6a8f5b081004062229ja299ca51j7d43daa93c155e0b@mail.gmail.com>
References: <w2r6a8f5b081004052229u7d9369f5q38f0f827a0c3e9a6@mail.gmail.com>
	<l2m59a41c431004061046mdc4c4270r825341932e59dfcc@mail.gmail.com>
	<j2r6a8f5b081004061153la171784dx7fbbf933d7dc7326@mail.gmail.com>
	<g2r59a41c431004061427kd3c3d8d4j45a253e7e50ad66e@mail.gmail.com>
	<q2q6a8f5b081004062229ja299ca51j7d43daa93c155e0b@mail.gmail.com>
Message-ID: <q2r59a41c431004071212j3697b470g448df70d9b239993@mail.gmail.com>

Hi Gustavo,

Here are my $0.02:

* For some of your steps there is already code available in BioJava.
It might be good to take a look at what is already there (look at
the alignment and phylo modules for dynamic programming and
Neighbour-Joining).

* What about risks? Where do you expect difficulties, and how would you
work around them?

* Step 4: Can you add more details? How do you plan to approach this?
E.g. ClustalW has a number of rules implemented at this stage. Do you
plan to support multiple rules as well, and how would you do this technically?
Something nice would be the possibility to use structure alignments to
guide the sequence alignments (structure module).

Andreas


> -------------------------------------------------------------
>
> GSoC proposal
>
> Abstract
> --------
>
> This project aims to develop an all-Java implementation of a multiple
> sequence alignment (MSA) algorithm to be added to the Biojava toolkit,
> using the progressive algorithm described in the CLUSTALW paper [1].
>
> The Importance
> --------------
>
> Multiple sequence alignment is a frequently performed task in sequence
> analysis with the goal to identify new members of protein families and
> infer phylogenetic relationships between proteins and genes. At the
> present there is no Java-only implementation for this algorithm. As
> such the number of already existing and Java related BioInformatics
> tools and web sites would benefit from this implementation and
> sequence analysis could be more easily performed by the end-user.
>
> About Me
> --------
>
> I am a graduate student at University of São Paulo (Brazil), I got my
> undergraduate degree from the same university with a major in Computer
> Science and a minor in Biology. I have been involved with
> Bioinformatics for 5 years, always with sequence analysis with
> particular interest in the MSA problem. Also, in my undergraduate
> final project I developed a lossless filter (pruning algorithm) for
> the MSA problem, the work is published in [3] and there is an online
> implementation of the algorithm in [4]. Finally, I have experience
> with the C, C++, Java, Python and Ruby programming languages; Git and
> SVN version control systems.
>
> Project Plan
> ------------
>
> The project is divided into four main steps; at the end of each step a
> completely functional and bug-free new algorithm will be added to the
> Biojava code base. Note that each step depends strongly
> on the previous one, so careful testing will be done before moving to
> the next step.
>
> The four steps are described below, estimated times for accomplishment
> of each step are also given and in some steps extra enhancements are
> described, they will be implemented if there is some time remaining
> after all steps are completed.
>
> ** 1. Study the Biojava pairwise alignment code and update it to be
> compliant with Biojava 3.
>
>  The pairwise alignment will play an important role in the MSA
> algorithm. This step is also important for me to get used to the
> Biojava coding standards and get in touch with the Biojava dev
> community.
>
>  ETA: 2 weeks.
>
> ** 2. Implement the algorithm to build the distance matrix.
>
>  This is done using the pairwise alignment for each pair of sequences
> in the set to be aligned.
>
>  ETA: 1 week.
>
>  EXTRA: Enhance the basic algorithm to use parallel strategies, using
> several threads to calculate the pairwise alignments for different
> pairs in the sequence set.
>
> ** 3. Implement the algorithm to build the guide tree.
>
>  The guide tree is based on the distance matrix built in the previous
> step; the tree construction strategy adopted will be the Neighbor
> Joining Algorithm.
>
>  ETA: 2 weeks.
>
> ** 4. Implement the algorithm for progressive MSA using the guide tree.
>
>  This is certainly the most difficult part of the project, so to make
> sure we are going to deliver a fully functional MSA algorithm, a safer
> approach is going to be taken. First, the dynamic
> programming algorithm described in [2] will be implemented. Once this
> is done and the code is fully integrated into the Biojava
> code base, the features described in [1] are going to be incrementally
> added (and tested) in order to implement the full dynamic programming
> algorithm.
>
>  ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks.
>
>  EXTRA: Implement some benchmark technique to measure the final
> alignment quality.
>
> References
> ----------
>
> [1] http://www.ncbi.nlm.nih.gov/pubmed/7984417
> [2] http://www.ncbi.nlm.nih.gov/pubmed/3243435
> [3] http://www.almob.org/content/4/1/3
> [4] http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From andreas at sdsc.edu  Wed Apr  7 15:30:19 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 7 Apr 2010 12:30:19 -0700
Subject: [Biojava-l] Questions about Summer of Code Project
In-Reply-To: <s2w62ed8c081004070052z19ac68c0i2ffa200d65d92d75@mail.gmail.com>
References: <s2w62ed8c081004070052z19ac68c0i2ffa200d65d92d75@mail.gmail.com>
Message-ID: <n2y59a41c431004071230yfd1ef60eq900e5cbc6cd13401@mail.gmail.com>

Hi Singer,

> I had previously sent this, but was not part of the mailing list, so I
> can only assume it got lost in a spam loop.

You need to be subscribed in order to be able to post...

> I was interested in applying for the All-Java Multiple Sequence
> Alignment Google Summer of Code project.

Several students have expressed their interest in this project.
Depending on the funding situation, at most one will be
able to work on it... There is also a second BioJava-related project, or
you could propose your own ideas...
http://biojava.org/wiki/Google_Summer_of_Code


> I wanted to create a project
> plan but had some questions about the package as it stands now.
>
> 1. What exactly has changed with the transition to BioJava 3? From
> what I've read on the BioJava 3 proposal page, it seems like that the
> changes are to the organization of the code. Additionally there are
> some new standards to follow. Java 6 usage is desired, but I am unsure
> of what of the new features could be used in modifying pairwise
> sequence alignments.

BioJava is more modular in version 3. There is a new module for
working with sequences. The current alignment module is still based on
the old version of BioJava though.

>
> 2. Is the Neighbor Joining Algorithm really the best for this? Are
> other multiple alignments implementations desired? I have implemented
> the neighbor joining algorithm very inefficiently in python, it was
> not particularly difficult.

NJ is a clustering technique, but there are also others.
http://en.wikipedia.org/wiki/Neighbor-joining
Another online lecture that might be useful is:
http://www.mbio.ncsu.edu/MB451/lecture/trees/lecture.html
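
To make the clustering step a bit more concrete, here is a minimal sketch of the pair-selection step of neighbor joining: given a symmetric distance matrix d over n taxa, it picks the pair (i, j) that minimises Q(i,j) = (n-2)*d[i][j] - sum_k d[i][k] - sum_k d[j][k]. The class and method names are hypothetical and this is not the code in the BioJava phylo module; a full implementation would also compute branch lengths and shrink the matrix after each join.

```java
// Hypothetical sketch of one neighbor-joining step; not BioJava code.
final class NeighborJoiningStep {

    /** Returns {i, j} with i < j minimising Q(i,j) = (n-2)*d[i][j] - r[i] - r[j]. */
    static int[] pickPair(double[][] d) {
        int n = d.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two taxa");
        }
        // r[i] is the sum of distances from taxon i to all other taxa.
        double[] rowSum = new double[n];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                rowSum[i] += d[i][k];
            }
        }
        int bestI = 0;
        int bestJ = 1;
        double bestQ = Double.POSITIVE_INFINITY;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double q = (n - 2) * d[i][j] - rowSum[i] - rowSum[j];
                if (q < bestQ) {
                    bestQ = q;
                    bestI = i;
                    bestJ = j;
                }
            }
        }
        return new int[] { bestI, bestJ };
    }

    public static void main(String[] args) {
        // Five-taxon example (a, b, c, d, e) as used on the Wikipedia page linked above.
        double[][] d = {
            { 0, 5, 9, 9, 8 },
            { 5, 0, 10, 10, 9 },
            { 9, 10, 0, 8, 7 },
            { 9, 10, 8, 0, 3 },
            { 8, 9, 7, 3, 0 }
        };
        int[] pair = pickPair(d);
        System.out.println("join taxa " + pair[0] + " and " + pair[1]);
    }
}
```

Other clustering techniques (UPGMA, for example) differ mainly in how this selection criterion and the subsequent matrix update are defined, so keeping this step behind a small interface would make the guide-tree stage easy to swap.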

> This step seems like it will not take very
> long. Additionally, parallelism, I have no experience with parallelism
> in Java and will only have some experience with it in C, will that be
> an issue?

I have never written multithreaded code in C, but I would guess it is
much, much easier in Java...

> 3. Is there a specific paper with the exact algorithm that should be
> implemented here?

We have only 3 months for this project so having a modular core
algorithm that can be extended would be a priority. I recommend
reading the Clustalw, T-Coffee and Muscle papers.

> General: Will use cases be provided? Will test data be provided? These
> would both be useful in coding the test cases which seem to be coded
> first.

I can provide plenty of data for that.


> Additionally, I have access to my current windows machine as well as
> as Linux machine for testing, but no Mac. While in theory with java,
> if it works on one, then it works on another, and especially with if
> it works on Linux, it should be fine on Mac, should I be worried about
> strange peculiarities?

In my experience, Java works pretty well on any platform. There might
be issues with user interfaces that require testing, but we are not
going to build user interfaces here...

Andreas


>
> Thanks,
> Singer Ma
> Harvey Mudd College 2011
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From andreas.draeger at uni-tuebingen.de  Thu Apr  8 03:13:17 2010
From: andreas.draeger at uni-tuebingen.de (Andreas Dräger)
Date: Thu, 08 Apr 2010 09:13:17 +0200
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava
	alignment lead"]
Message-ID: <4BBD820D.9070200@uni-tuebingen.de>

Hi all,

This e-mail is just for your information about somebody new, who'd like 
to contribute to our project.

Cheers
Andreas


Subject:
Re: Fwd: Proposing a project on "Biojava alignment lead"
From:
Andreas Dräger <andreas.draeger at uni-tuebingen.de>
Date:
Wed, 07 Apr 2010 09:27:13 +0200
To:
Cai Shaojiang <caishaojiang at gmail.com>

Hi Cai Shaojiang,

Thank you for your e-mail! I don't know what happened to the e-mail list.
Sometimes it takes a while due to the spam filters, I guess.

> I am a PhD student from National University of Singapore. My major
> research area is local alignment algorithms and data structures for SNP
> identification. And I have used Java and Eclipse for years for software
> development. I am very interested in your GSoC programme. I find that
> there is a module called "biojava-alignment lead" whose mentor is you. I
> want to propose a new project on this module. I have several questions
> about this module.

Yes, that's me. So great to get your support.

> 1. It seems that pairwise alignment is to find similarity between two
> short sequences. Existing pairwise alignment is based on dynamic
> programming; is it the Smith-Waterman algorithm?

So, currently, BioJava contains three different alignment approaches. 
There are two deterministic algorithms, i.e., Smith-Waterman for local 
alignment and Needleman-Wunsch for global alignment. Third, there is the 
possibility to apply Hidden Markov Models for alignment. An example of 
the latter approach should be in the cookbook.
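
As a reminder of what the local variant computes, here is a minimal sketch of the Smith-Waterman scoring recurrence with a linear gap penalty. It is not the BioJava implementation: the class name, the method signature, and the scoring values used in main are assumptions chosen for illustration, and it returns only the best score without a traceback.

```java
// Hypothetical sketch of Smith-Waterman scoring; not the BioJava alignment code.
final class LocalAlignmentScore {

    /**
     * Best local alignment score of a and b.
     * match should be positive; mismatch and gap should be negative.
     */
    static int smithWaterman(String a, String b, int match, int mismatch, int gap) {
        int[][] h = new int[a.length() + 1][b.length() + 1]; // row 0 and column 0 stay 0
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int diag = h[i - 1][j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch);
                int up = h[i - 1][j] + gap;
                int left = h[i][j - 1] + gap;
                // The floor at 0 is what makes the alignment local:
                // a bad prefix is dropped instead of dragging the score down.
                h[i][j] = Math.max(0, Math.max(diag, Math.max(up, left)));
                best = Math.max(best, h[i][j]);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(smithWaterman("TTACGTTT", "GGACGGG", 2, -1, -2));
    }
}
```

Needleman-Wunsch uses the same recurrence without the floor at 0, with the first row and column initialised to cumulative gap penalties, and reads the score from the bottom-right cell.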

> 2. What is the exact task of "refactoring of underlying data structures"?

Yes, this is something I already did last week, but it could still be
improved. The problem was that the alignment algorithms produced
a kind of string that looks similar to the output of BLAST.
This string contained the score, the computation time, the length of the
alignment, etc., but people wanted to perform
higher-level computation on the score value or evaluate some other
information. Now the alignment produces a data structure that
contains all the information and can, in addition, also produce
such a BLAST-like output. There is, however, still the following
problem: the data structure requires both sequences in the pairwise
alignment to have identical length. In the case of local alignment this
is especially awkward, because gaps are inserted to pad the
sequences. The data structure also tries to keep the old sequence
coordinates, with the effect that the numbers "query start",
"query end", "subject start", and "subject end" are required to shift
the sequences against each other when displaying the output. So you
cannot simply print the sequences one below the other; you first have to
shift them. Please check out the latest version of this package via
anonymous svn and have a look ;-)
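
To illustrate the idea of a structured result that can still render text on demand, here is a minimal sketch. It is not the class in the alignment package: PairwiseAlignmentResult, its fields, and the output layout are all hypothetical. It stores only the aligned region of each sequence together with its coordinates, so a local alignment needs no gap padding.

```java
// Hypothetical result holder; names and layout are illustrative, not BioJava's API.
final class PairwiseAlignmentResult {
    private final int score;
    private final int queryStart;   // 1-based, inclusive, in the ungapped query
    private final int queryEnd;
    private final int subjectStart; // 1-based, inclusive, in the ungapped subject
    private final int subjectEnd;
    private final String alignedQuery;   // gapped rows covering only the aligned region
    private final String alignedSubject;

    PairwiseAlignmentResult(int score, int queryStart, int queryEnd,
            int subjectStart, int subjectEnd,
            String alignedQuery, String alignedSubject) {
        if (alignedQuery.length() != alignedSubject.length()) {
            throw new IllegalArgumentException("aligned rows must have equal length");
        }
        this.score = score;
        this.queryStart = queryStart;
        this.queryEnd = queryEnd;
        this.subjectStart = subjectStart;
        this.subjectEnd = subjectEnd;
        this.alignedQuery = alignedQuery;
        this.alignedSubject = alignedSubject;
    }

    /** The score as a number, so callers can compute with it directly. */
    int getScore() {
        return score;
    }

    int getLength() {
        return alignedQuery.length();
    }

    /** Renders a BLAST-like text block; the text is derived, not stored. */
    String toBlastLikeString() {
        StringBuilder sb = new StringBuilder();
        sb.append("Score: ").append(score)
          .append(", Length: ").append(getLength()).append('\n');
        sb.append("Query  ").append(queryStart).append(' ')
          .append(alignedQuery).append(' ').append(queryEnd).append('\n');
        sb.append("Sbjct  ").append(subjectStart).append(' ')
          .append(alignedSubject).append(' ').append(subjectEnd).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        PairwiseAlignmentResult r =
                new PairwiseAlignmentResult(6, 3, 5, 3, 5, "ACG", "ACG");
        System.out.print(r.toBlastLikeString());
    }
}
```

Because the two rows cover only the aligned region, they can be printed directly below each other, and the start and end coordinates say where that region sits in each original sequence.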

> 3. My existing research area aims to deal with aligning short
> reads (10s~100s bp) against extremely long sequences (e.g., human
> genome). As far as I know, there are no such alignment tools
> implemented in Java. Would you consider this direction?

See, this would be very nice to include. But this requires that we no
longer fill the short sequence with many, many gap symbols (just a waste
of memory), but improve the data structure. There is already an
UnequalLengthAlignment (just a data structure, no algorithm) and I think
we could use this as a starting point. Then your algorithm would only need to
produce such a data structure and this would be fine.

> 4. It seems that the existing tools are just lacking some
> refactoring and representation interfaces. Any more underlying tasks?

Hm. Yes: With the release of BioJava 3 data structures have changed 
again. So maybe there's also some adaptation to the new structure required.

> I have been keeping an eye on GSoC since last month, but I am sorry to find out
> that I sent the initial email to the mailing list before I subscribed to it...

Ok. Sounds good. Thanks for your interest. So I suggest: Download the 
latest trunk, have a look, play around and if you can improve something 
we'll put it into the trunk and write your name into the authors' tag.

Cheers
Andreas

-- 
Dipl.-Bioinform. Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 Tübingen
Germany

Phone: +49-7071-29-70436
Fax:   +49-7071-29-5091

From ayates at ebi.ac.uk  Thu Apr  8 06:23:06 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Thu, 8 Apr 2010 11:23:06 +0100
Subject: [Biojava-l] Questions about Summer of Code Project
In-Reply-To: <n2y59a41c431004071230yfd1ef60eq900e5cbc6cd13401@mail.gmail.com>
References: <s2w62ed8c081004070052z19ac68c0i2ffa200d65d92d75@mail.gmail.com>
	<n2y59a41c431004071230yfd1ef60eq900e5cbc6cd13401@mail.gmail.com>
Message-ID: <7EE9A2D1-0DC6-4D52-BC49-0D219AB0D88E@ebi.ac.uk>

Hi Singer,

To add a bit more information to Andreas' comments: Java has a very mature concurrency library (java.util.concurrent), which was introduced in version 1.5. BioJava is a 1.6 project, so I would expect any concurrent code to use it. Extensions and alternatives are available, most notably the Google Guava project, the Actor model found in Scala (with more pure Java implementations available), and the Map/Reduce paradigm first white-papered by Google. The big rules about concurrency are:

* Mutable objects are the work of the devil & should be avoided
* Tasks & Futures are quite lightweight things to produce; threads are not
* Multiple tasks can be given to a queue to be processed by a number of threads in a pool
* Assume a non-linear execution pipeline and attempt to pass messages/jobs into queues when data is processed
* Assume that things will fail
* Write your program with a view to be concurrent; do not force concurrency on an already written program

Concurrent programs are very hard things to write and normally fail because what they attempt to do is too complex or too simple. Getting the balance right is hard but do-able. I can also recommend Brian Goetz's Java Concurrency in Practice (http://www.javaconcurrencyinpractice.com/). 
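
As a rough sketch of how those rules could apply to the distance matrix step of the MSA project: each pair of sequences becomes one small task over immutable inputs, the tasks go to a fixed-size pool, and only the submitting thread writes into the result matrix. Everything here is an assumption for illustration, including the class name and the toy distance function (a mismatch fraction standing in for a real pairwise alignment score).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; not part of BioJava.
final class ParallelDistanceMatrix {

    /** Placeholder distance: fraction of differing positions, counting any length difference. */
    static double distance(String a, String b) {
        int shared = Math.min(a.length(), b.length());
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 0.0;
        }
        int diff = longest - shared;
        for (int i = 0; i < shared; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                diff++;
            }
        }
        return (double) diff / longest;
    }

    /** Computes the symmetric distance matrix, one task per sequence pair. */
    static double[][] compute(List<String> seqs, int threads) {
        int n = seqs.size();
        double[][] d = new double[n][n];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Double>> futures = new ArrayList<Future<Double>>();
            for (int i = 0; i < n; i++) {
                for (int j = i + 1; j < n; j++) {
                    // Strings are immutable, so tasks share no mutable state.
                    final String a = seqs.get(i);
                    final String b = seqs.get(j);
                    futures.add(pool.submit(new Callable<Double>() {
                        public Double call() {
                            return distance(a, b);
                        }
                    }));
                }
            }
            // Collect in submission order; only this thread writes to d.
            int k = 0;
            for (int i = 0; i < n; i++) {
                for (int j = i + 1; j < n; j++) {
                    double v = futures.get(k++).get();
                    d[i][j] = v;
                    d[j][i] = v;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for results", e);
        } catch (ExecutionException e) {
            // "Assume that things will fail": surface the failing task's cause.
            throw new IllegalStateException("a pairwise task failed", e);
        } finally {
            pool.shutdown();
        }
        return d;
    }

    public static void main(String[] args) {
        List<String> seqs = new ArrayList<String>();
        seqs.add("ACGT");
        seqs.add("ACGA");
        seqs.add("TTTT");
        double[][] d = compute(seqs, 2);
        System.out.println(d[0][1] + " " + d[0][2] + " " + d[1][2]);
    }
}
```

The anonymous Callable keeps the sketch compatible with Java 6; the thread count is left as a parameter so callers can match it to the available CPUs.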

Andy

On 7 Apr 2010, at 20:30, Andreas Prlic wrote:

> Hi Singer,
> 
>> I had previously sent this, but was not part of the mailing list, so I
>> can only assume it got lost in a spam loop.
> 
> You need to be subscribed in order to be able to post...
> 
>> I was interested in applying for the All-Java Multiple Sequence
>> Alignment Google Summer of Code project.
> 
> Several students have expressed their interest  in this project.
> Depending on how the funding situation will be, at maximum one will be
> able to work on this... There is also a 2nd BioJava related project or
> you could propose your own ideas...
> http://biojava.org/wiki/Google_Summer_of_Code
> 
> 
>> I wanted to create a project
>> plan but had some questions about the package as it stands now.
>> 
>> 1. What exactly has changed with the transition to BioJava 3? From
>> what I've read on the BioJava 3 proposal page, it seems like that the
>> changes are to the organization of the code. Additionally there are
>> some new standards to follow. Java 6 usage is desired, but I am unsure
>> of what of the new features could be used in modifying pairwise
>> sequence alignments.
> 
> BioJava is more modular in version 3. There is a new module for
> working with sequences. The current alignment module is still based on
> the old version of BioJava though.
> 
>> 
>> 2. Is the Neighbor Joining Algorithm really the best for this? Are
>> other multiple alignments implementations desired? I have implemented
>> the neighbor joining algorithm very inefficiently in python, it was
>> not particularly difficult.
> 
> NJ is a clustering technique, but there are also others.
> http://en.wikipedia.org/wiki/Neighbor-joining
> Another online lecture that might be useful is:
> http://www.mbio.ncsu.edu/MB451/lecture/trees/lecture.html
> 
>> This step seems like it will not take very
>> long. Additionally, parallelism, I have no experience with parallelism
>> in Java and will only have some experience with it in C, will that be
>> an issue?
> 
> I have never written multi threaded code in C, but I would guess it is
> much much easier in Java...
> 
>> 3. Is there a specific paper with the exact algorithm that should be
>> implemented here?
> 
> We have only 3 months for this project so having a modular core
> algorithm that can be extended would be a priority. I recommend
> reading the Clustalw, T-Coffee and Muscle papers.
> 
>> General: Will use cases be provided? Will test data be provided? These
>> would both be useful in coding the test cases which seem to be coded
>> first.
> 
> I can provide plenty of data for that.
> 
> 
>> Additionally, I have access to my current windows machine as well as
>> as Linux machine for testing, but no Mac. While in theory with java,
>> if it works on one, then it works on another, and especially with if
>> it works on Linux, it should be fine on Mac, should I be worried about
>> strange peculiarities?
> 
> From my experience Java works pretty fine on any platform. There might
> be issues with user interfaces that require testing, but we are not
> going to do  user interfaces here...
> 
> Andreas
> 
> 
>> 
>> Thanks,
>> Singer Ma
>> Harvey Mudd College 2011
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>> 
> 
> 
> 
> -- 
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------
> 

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From sma.hmc at gmail.com  Thu Apr  8 06:38:41 2010
From: sma.hmc at gmail.com (Singer Ma)
Date: Thu, 8 Apr 2010 03:38:41 -0700
Subject: [Biojava-l] Questions about Summer of Code Project
In-Reply-To: <7EE9A2D1-0DC6-4D52-BC49-0D219AB0D88E@ebi.ac.uk>
References: <s2w62ed8c081004070052z19ac68c0i2ffa200d65d92d75@mail.gmail.com>
	<n2y59a41c431004071230yfd1ef60eq900e5cbc6cd13401@mail.gmail.com>
	<7EE9A2D1-0DC6-4D52-BC49-0D219AB0D88E@ebi.ac.uk>
Message-ID: <h2k62ed8c081004080338xb1ee5a27k8253c38bb2b13fec@mail.gmail.com>

So, my questions were generated from looking past just the Summer of
Code proposal and into what BioJava 3 is supposed to do. BioJava 3, as
part of its proposal, lists:

Make methods parallel-aware and take advantage of this when possible,
and provide a global variable to specify how much parallelisation can
take place.

on http://www.biojava.org/wiki/BioJava3_Proposal

How important is this to incorporate into the Summer of Code project?
Obviously anything that is already concurrent can remain that way, but
for the new code in multiple sequence alignment, does this need to be
parallel-aware? Clearly, in a multiple sequence alignment, certain
things can be made parallel such as the initial distance matrix
calculation, parts of the neighbor joining algorithm, etc. If I were
to contribute, I would want to uphold the agreed upon standards as
much as possible. I am just unsure of my capability to make multiple
sequence alignment parallel-aware.
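
To make the distance matrix case concrete, here is a rough sketch of how the pairwise step could be split into one task per row, with the degree of parallelism passed in as a plain parameter (standing in for the "global variable" the proposal mentions). The class name ParallelDistanceMatrix and the mismatch-fraction distance are placeholders chosen for illustration, not anything from BioJava or from a real MSA method; the lambdas assume Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelDistanceMatrix {

    /** Stand-in distance: mismatched positions divided by the longer length. */
    static double distance(String a, String b) {
        int shared = Math.min(a.length(), b.length());
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 0.0;
        }
        int mismatches = longest - shared; // overhang counts as mismatches
        for (int i = 0; i < shared; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                mismatches++;
            }
        }
        return (double) mismatches / longest;
    }

    /** Computes the symmetric matrix using at most {@code threads} workers (>= 1). */
    static double[][] compute(List<String> seqs, int threads) {
        final int n = seqs.size();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // One task per row; each task fills only its own private array.
            List<Future<double[]>> rows = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                final int row = i;
                rows.add(pool.submit(() -> {
                    double[] d = new double[n];
                    for (int j = row + 1; j < n; j++) {
                        d[j] = distance(seqs.get(row), seqs.get(j));
                    }
                    return d;
                }));
            }
            // Only the calling thread writes to the shared matrix.
            double[][] matrix = new double[n][n];
            for (int i = 0; i < n; i++) {
                double[] d = rows.get(i).get();
                for (int j = i + 1; j < n; j++) {
                    matrix[i][j] = d[j];
                    matrix[j][i] = d[j];
                }
            }
            return matrix;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("distance task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because each pair is independent, the result should be the same for any thread count; only the wall-clock time changes.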

Singer



From ayates at ebi.ac.uk  Thu Apr  8 06:46:15 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Thu, 8 Apr 2010 11:46:15 +0100
Subject: [Biojava-l] Questions about Summer of Code Project
In-Reply-To: <h2k62ed8c081004080338xb1ee5a27k8253c38bb2b13fec@mail.gmail.com>
References: <s2w62ed8c081004070052z19ac68c0i2ffa200d65d92d75@mail.gmail.com>
	<n2y59a41c431004071230yfd1ef60eq900e5cbc6cd13401@mail.gmail.com>
	<7EE9A2D1-0DC6-4D52-BC49-0D219AB0D88E@ebi.ac.uk>
	<h2k62ed8c081004080338xb1ee5a27k8253c38bb2b13fec@mail.gmail.com>
Message-ID: <91C9DF16-E6EF-4B7A-ADC4-E781275514EB@ebi.ac.uk>

Ahhh okay. So when we wrote this section it was with a view towards being able to do things in a concurrent manner as & when that framework appears. BioJava3 is still in an incubation phase; a lot of code is in place but we are all having to do this along with work commitments (which in my case is working on a Perl project so my work/BJ contributions are very limited). 

Anyway, to go back to the question about being "framework" standard: the MSA algorithm would be the first case we would have to make concurrent (as far as I am aware, but Scooter is a better person to confirm this), so the framework for building a concurrent application would come from this project. If the code is written against the standard concurrency library interfaces (java.util.concurrent), then it should be possible to transplant it into any concurrent Java framework, and that's really the important thing here.
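
One way to read "written using the standard concurrent library interfaces" is that the algorithm accepts an ExecutorService from its caller instead of creating threads itself. The sketch below assumes that reading; MatchCounter and its position-wise match count are invented for illustration and are not part of BioJava. The lambda assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class MatchCounter {
    // The executor is injected; this class never decides how many threads exist.
    private final ExecutorService executor;

    MatchCounter(ExecutorService executor) {
        this.executor = executor;
    }

    /** Counts positions where both strings hold the same character. */
    static int countMatches(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int matches = 0;
        for (int i = 0; i < n; i++) {
            if (a.charAt(i) == b.charAt(i)) {
                matches++;
            }
        }
        return matches;
    }

    /** Sums the matches of every sequence against the reference. */
    int totalMatches(String reference, List<String> others) {
        List<Callable<Integer>> tasks = new ArrayList<>();
        for (String other : others) {
            tasks.add(() -> countMatches(reference, other));
        }
        try {
            int total = 0;
            for (Future<Integer> f : executor.invokeAll(tasks)) {
                total += f.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("match task failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        List<String> others = new ArrayList<>();
        others.add("ACGA");
        others.add("TTTT");
        ExecutorService single = Executors.newSingleThreadExecutor();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // Same algorithm, two different execution strategies.
            System.out.println(new MatchCounter(single).totalMatches("ACGT", others));
            System.out.println(new MatchCounter(pool).totalMatches("ACGT", others));
        } finally {
            single.shutdown();
            pool.shutdown();
        }
    }
}
```

The caller owns the executor's lifecycle, so a surrounding framework could supply its own pool without any change to the algorithm code.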

Andy


-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From mitlox at op.pl  Thu Apr  8 07:30:13 2010
From: mitlox at op.pl (xyz)
Date: Thu, 8 Apr 2010 21:30:13 +1000
Subject: [Biojava-l] Reading and writting Fastq files
In-Reply-To: <Pine.GSO.4.44.1003312334350.18726-100000@shell3.shore.net>
References: <20100330215047.084f6b00@wp01>
	<Pine.GSO.4.44.1003312334350.18726-100000@shell3.shore.net>
Message-ID: <20100408213013.63a99b8c@wp01>

On Wed, 31 Mar 2010 23:56:42 -0400 (EDT)
Michael Heuer wrote:

> import static ...RichSequence.Tools.*;
> import static ...RichSequence.IOTools.*;
> 
> Fastq fastq = ...;
> Namespace namespace = ...;
> RichSequence richSequence = createRichSequence(
>   namespace,
>   fastq.getDescription(),
>   fastq.getSequence(),
>   DNATools.getDNA());
> 
> writeFasta(outputStream, richSequence, namespace);

I have tried this but I got this error:
Fastq2Fasta.java:52: cannot find symbol
symbol  : method createRichSequence(org.biojavax.SimpleNamespace,java.lang.String,java.lang.String,org.biojava.bio.symbol.FiniteAlphabet)
location: class Fastq2Fasta
      RichSequence richSequence = createRichSequence(ns,
1 error

The complete code now looks like this:

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import org.biojava.bio.program.fastq.Fastq;
import org.biojava.bio.program.fastq.FastqBuilder;
import org.biojava.bio.program.fastq.FastqReader;
import org.biojava.bio.program.fastq.FastqVariant;
import org.biojava.bio.program.fastq.FastqWriter;
import org.biojava.bio.program.fastq.IlluminaFastqReader;
import org.biojava.bio.program.fastq.IlluminaFastqWriter;
import org.biojava.bio.seq.DNATools;
import org.biojavax.SimpleNamespace;
import org.biojavax.bio.seq.RichSequence;


public class Fastq2Fasta {

  public static void main(String[] args) throws FileNotFoundException,
  IOException {

    FileInputStream inputFastq = new FileInputStream("fastq2fasta.fastq"); 
    FastqReader qReader = new IlluminaFastqReader();

    FileOutputStream outputFastq = new FileOutputStream("fastq2fastaTrim.fastq"); 
    FastqWriter qWriter = new IlluminaFastqWriter();

    //SimpleNamespace ns = new SimpleNamespace("biojava");

    FileOutputStream outputFasta = new FileOutputStream("fastq2fastaTrim.fasta");


    for (Fastq fastq : qReader.read(inputFastq)) {
      System.out.println(fastq.getDescription());
      System.out.println(fastq.getSequence());
      String trimSeq = fastq.getSequence().substring(0,
          fastq.getSequence().length() - 6);
      System.out.println(trimSeq);
      System.out.println(fastq.getQuality());
      String trimQual = fastq.getQuality().substring(0,
          fastq.getQuality().length() - 6);
      System.out.println(trimQual);

      FastqBuilder trimFastq = new FastqBuilder();
      trimFastq.withVariant(FastqVariant.FASTQ_ILLUMINA);
      trimFastq.withDescription(fastq.getDescription());
      trimFastq.appendSequence(trimSeq);
      trimFastq.appendQuality(trimQual);

      qWriter.write(outputFastq, trimFastq.build());


      SimpleNamespace ns = new SimpleNamespace("biojava");
      RichSequence richSequence = createRichSequence(ns,
              fastq.getDescription(), trimSeq, DNATools.getDNA());
      RichSequence.IOTools.writeFasta(outputFasta, richSequence, ns);
    }
  }
}

What did I do wrong?


> 
> > Suggestions:
> > 1)
> > After I trimmed the fastq files the header information for quality
> > is empty
> >
> > @HWI-EAS406:5:1:0:1390#0/1
> > GGGTGATGGCCGCTGCCGATGGCGTCAAAA
> > +
> > OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
> >
> > this reduced the size of the files but is it compatible with
> > SOAP and TopHat?
> 
> Sorry, not sure what you are asking here.
> 
Usually the @-header and the +-header are equal, e.g.
@HWI-EAS406:5:1:0:1390#0/1
+HWI-EAS406:5:1:0:1390#0/1
but after trimming and writing to a fastq file I got this
@HWI-EAS406:5:1:0:1390#0/1
+
The +-header is empty. Is this OK, and is it compatible with the standard?
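
For reference, the commonly used description of the FASTQ format treats the text after "+" as an optional repeat of the title, so a bare "+" line is a legal record. Whether SOAP or TopHat accept it is best checked against those tools' own documentation. The sketch below is a small stand-alone check written under that assumption; FastqRecordCheck is a hypothetical helper, not part of the BioJava fastq package.

```java
final class FastqRecordCheck {

    /**
     * Returns true if the four lines form a well-formed record:
     * '@' title, sequence, '+' with an empty or identical title,
     * and a quality string as long as the sequence.
     */
    static boolean isWellFormed(String titleLine, String sequence,
                                String plusLine, String quality) {
        if (!titleLine.startsWith("@") || !plusLine.startsWith("+")) {
            return false;
        }
        String title = titleLine.substring(1);
        String repeated = plusLine.substring(1);
        // A bare '+' and a '+' that repeats the title are both accepted.
        if (!repeated.isEmpty() && !repeated.equals(title)) {
            return false;
        }
        return sequence.length() == quality.length();
    }

    /** Drops the last n characters, or everything if the string is shorter. */
    static String trimEnd(String s, int n) {
        return s.substring(0, Math.max(0, s.length() - n));
    }

    public static void main(String[] args) {
        String seq = trimEnd("ACGTACGTAC", 6);
        String qual = trimEnd("IIIIIIIIII", 6);
        // Sequence and quality are trimmed by the same amount, so they stay in step.
        System.out.println(isWellFormed("@read1", seq, "+", qual));
        System.out.println(isWellFormed("@read1", seq, "+read1", qual));
    }
}
```

The important invariant when trimming is that the sequence and quality strings are cut by the same number of characters; the "+" line plays no part in that.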

Best regards,

From mitlox at op.pl  Thu Apr  8 07:30:52 2010
From: mitlox at op.pl (xyz)
Date: Thu, 8 Apr 2010 21:30:52 +1000
Subject: [Biojava-l] readFasta problem
Message-ID: <20100408213052.662beb8e@wp01>

Hello,
I would like to read a FASTA file without specifying in code whether it is
DNA, RNA or protein, so I wrote this code:

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import org.biojava.bio.BioException;
import org.biojavax.SimpleNamespace;
import org.biojavax.bio.seq.RichSequence;
import org.biojavax.bio.seq.RichSequenceIterator;

public class SortFasta {

  public static void main(String[] args) throws FileNotFoundException,
  BioException {


    BufferedReader br = new BufferedReader(new FileReader("sortFasta.fasta"));
    SimpleNamespace ns = new SimpleNamespace("biojava");

    // You can use any of the convenience methods found in the BioJava 1.6 API 
    //RichSequenceIterator rsi = RichSequence.IOTools.readFastaDNA(br,  ns); 
    RichSequenceIterator rsi = RichSequence.IOTools.readFasta(br, null, ns);

    // Since a single file can contain more than a sequence, you need
    // to iterate over rsi to get the information.
    while (rsi.hasNext()) {
      RichSequence rs = rsi.nextRichSequence();
      System.out.println(rs.getComments());
      System.out.println(rs.seqString());
    }
  }
}
but unfortunately I got the following error:
it the details that follow to biojava-l at biojava.org or post a bug
    report to http://bugzilla.open-bio.org/ 

Format_object=org.biojavax.bio.seq.io.FastaFormat
Accession=
Id=
Comments=problem parsing symbols
Parse_block=atccccc
Stack trace follows ....


        at org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:222)
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
        ... 1 more
Caused by: java.lang.NullPointerException
        at org.biojava.bio.symbol.SimpleSymbolList.<init>(SimpleSymbolList.java:165)
        at org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:213)
        ... 2 more
Java Result: 1

What did I do wrong?

Thank you in advance.

Best regards,

From holland at eaglegenomics.com  Thu Apr  8 07:41:25 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Thu, 8 Apr 2010 12:41:25 +0100
Subject: [Biojava-l] readFasta problem
In-Reply-To: <20100408213052.662beb8e@wp01>
References: <20100408213052.662beb8e@wp01>
Message-ID: <5B252A84-6822-4978-BBD1-83A8BA09EBC5@eaglegenomics.com>

You have passed null into the tokenizer parameter of RichSequence.IOTools.readFasta() - this is not allowed. The parser cannot guess the type of sequence, it must be told what to expect by specifying the tokenizer to use. (Importantly this also means that you cannot mix different types of sequence within the same file to be parsed.)
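
If the sequence type really is unknown up front, one workaround is for the calling code to look at the residues of the first record and pick the tokenizer itself. The sketch below shows such a guess in plain Java; AlphabetGuess is a hypothetical helper, not BioJava API, and the heuristic is fallible by design (a short peptide made only of A, C, G and T would be reported as DNA, and gap or ambiguity symbols other than N fall through to PROTEIN).

```java
final class AlphabetGuess {

    enum Kind { DNA, RNA, PROTEIN }

    /** Guesses the alphabet from the residue letters alone (heuristic only). */
    static Kind guess(String residues) {
        boolean sawT = false;
        boolean sawU = false;
        for (char raw : residues.toCharArray()) {
            switch (Character.toUpperCase(raw)) {
                case 'A':
                case 'C':
                case 'G':
                case 'N':
                    break;
                case 'T':
                    sawT = true;
                    break;
                case 'U':
                    sawU = true;
                    break;
                default:
                    // Any other letter is taken to be an amino acid code.
                    return Kind.PROTEIN;
            }
        }
        if (sawU && sawT) {
            return Kind.PROTEIN; // T and U together fit neither nucleotide alphabet
        }
        return sawU ? Kind.RNA : Kind.DNA;
    }

    public static void main(String[] args) {
        System.out.println(guess("atccccc"));
        System.out.println(guess("AUGGCU"));
        System.out.println(guess("MKVLA"));
    }
}
```

The guess would then decide which tokenizer or convenience method to call, such as the readFastaDNA call that is commented out in the posted code. It does not lift the restriction that one file holds one type of sequence.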



--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From holland at eaglegenomics.com  Thu Apr  8 07:36:36 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Thu, 8 Apr 2010 12:36:36 +0100
Subject: [Biojava-l] Reading and writting Fastq files
In-Reply-To: <20100408213013.63a99b8c@wp01>
References: <20100330215047.084f6b00@wp01>
	<Pine.GSO.4.44.1003312334350.18726-100000@shell3.shore.net>
	<20100408213013.63a99b8c@wp01>
Message-ID: <5EBA99CE-1DAC-442A-B7FD-2E738C7F586B@eaglegenomics.com>

You haven't included the two import static lines in your code. See the first two lines of Michael's example code (expanding the ellipses to the fully qualified class names).
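
The mechanism can be shown with a JDK class so that it compiles without BioJava: a static method of a nested class is only visible by its simple name after a static import. Here Character.UnicodeBlock.of stands in for RichSequence.Tools.createRichSequence, and StaticImportDemo is a made-up class name. For the posted program the corresponding lines would presumably be "import static org.biojavax.bio.seq.RichSequence.Tools.*;" and "import static org.biojavax.bio.seq.RichSequence.IOTools.*;", a package path assumed from the existing import of org.biojavax.bio.seq.RichSequence.

```java
// Analogous to: import static <package>.RichSequence.Tools.createRichSequence;
import static java.lang.Character.UnicodeBlock.of;

final class StaticImportDemo {

    static boolean isBasicLatin(char c) {
        // "of" resolves only because of the static import above; without it
        // javac reports "cannot find symbol", as in the error that was posted.
        return of(c) == Character.UnicodeBlock.BASIC_LATIN;
    }

    static boolean isBasicLatinQualified(char c) {
        // Alternative that needs no static import: qualify the call with the
        // nested class name, as the posted code already does for
        // RichSequence.IOTools.writeFasta(...).
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.BASIC_LATIN;
    }

    public static void main(String[] args) {
        System.out.println(isBasicLatin('A'));
        System.out.println(isBasicLatinQualified('A'));
    }
}
```

Either form should work; the qualified call is more verbose but keeps it obvious where the method comes from.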

On 8 Apr 2010, at 12:30, xyz wrote:

> On Wed, 31 Mar 2010 23:56:42 -0400 (EDT)
> Michael Heuer wrote:
> 
>> import static ...RichSequence.Tools.*;
>> import static ...RichSequence.IOTools.*;
>> 
>> Fastq fastq = ...;
>> Namespace namepace = ...;
>> RichSequence richSequence = createRichSequence(
>>  namespace,
>>  fastq.getDescription(),
>>  fastq.getSequence(),
>>  DNATools.getDNA());
>> 
>> writeFasta(outputStream, richSequence, namespace);
> 
> I have tried this but I got this error:
> Fastq2Fasta.java:52: cannot find symbol
> symbol  : method
> createRichSequence(org.biojavax.SimpleNamespace,java.lang.String,java.lang.String,org.biojava.bio.symbol.FiniteAlphabet)
> location: class Fastq2Fasta RichSequence richSequence =
> createRichSequence(ns, 
> 1 error
> 
> The complete code looks now :
> 
> import java.io.FileInputStream;
> import java.io.FileNotFoundException;
> import java.io.FileOutputStream;
> import java.io.IOException;
> import org.biojava.bio.program.fastq.Fastq;
> import org.biojava.bio.program.fastq.FastqBuilder;
> import org.biojava.bio.program.fastq.FastqReader;
> import org.biojava.bio.program.fastq.FastqVariant;
> import org.biojava.bio.program.fastq.FastqWriter;
> import org.biojava.bio.program.fastq.IlluminaFastqReader;
> import org.biojava.bio.program.fastq.IlluminaFastqWriter;
> import org.biojava.bio.seq.DNATools;
> import org.biojavax.SimpleNamespace;
> import org.biojavax.bio.seq.RichSequence;
> 
> 
> public class Fastq2Fasta {
> 
>  public static void main(String[] args) throws FileNotFoundException,
>  IOException {
> 
>    FileInputStream inputFastq = new FileInputStream("fastq2fasta.fastq"); 
>    FastqReader qReader = new IlluminaFastqReader();
> 
>    FileOutputStream outputFastq = new FileOutputStream("fastq2fastaTrim.fastq"); 
>    FastqWriter qWriter = new IlluminaFastqWriter();
> 
>    //SimpleNamespace ns = new SimpleNamespace("biojava");
> 
>    FileOutputStream outputFasta = new FileOutputStream("fastq2fastaTrim.fasta");
> 
> 
>    for (Fastq fastq : qReader.read(inputFastq)) {
>      System.out.println(fastq.getDescription());
>      System.out.println(fastq.getSequence());
>      String trimSeq = fastq.getSequence().substring(0,
>      		fastq.getSequence().length() - 6); 
>      System.out.println(trimSeq);
>      System.out.println(fastq.getQuality());
>      String trimQual = fastq.getQuality().substring(0,
>    		fastq.getQuality().length() - 6);
>      System.out.println(trimQual);
> 
>      FastqBuilder trimFastq = new FastqBuilder();
>      trimFastq.withVariant(FastqVariant.FASTQ_ILLUMINA);
>      trimFastq.withDescription(fastq.getDescription());
>      trimFastq.appendSequence(trimSeq);
>      trimFastq.appendQuality(trimQual);
> 
>      qWriter.write(outputFastq, trimFastq.build());
> 
> 
>      SimpleNamespace ns = new SimpleNamespace("biojava");
>      RichSequence richSequence = createRichSequence(ns,
>              fastq.getDescription(), trimSeq, DNATools.getDNA());
>      RichSequence.IOTools.writeFasta(outputFasta, richSequence, ns);
>    }
>  }
> }
> 
> What did I do wrong?
> 
> 
>> 
>>> Suggestions:
>>> 1)
>>> After I trimmed the fastq files the header information for quality
>>> is empty
>>> 
>>> @HWI-EAS406:5:1:0:1390#0/1
>>> GGGTGATGGCCGCTGCCGATGGCGTCAAAA
>>> +
>>> OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
>>> 
>>> this reduced the size of the files but is it compatible with
>>> SOAP and TopHat?
>> 
>> Sorry, not sure what you are asking here.
>> 
> Usually the @-header and the +-header are equal, e.g.
> @HWI-EAS406:5:1:0:1390#0/1
> +HWI-EAS406:5:1:0:1390#0/1
> but after trimming and writing to a fastq file I got this:
> @HWI-EAS406:5:1:0:1390#0/1
> +
> The +-header is empty. Is this OK and compatible with the standard?
> 
> Best regards,
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From chapman at cs.wisc.edu  Thu Apr  8 08:47:12 2010
From: chapman at cs.wisc.edu (Mark Chapman)
Date: Thu, 08 Apr 2010 07:47:12 -0500
Subject: [Biojava-l] GSoC Application
Message-ID: <4BBDD050.6090208@cs.wisc.edu>

I would appreciate any feedback on my proposal from mentors or other developers. 
  Check it out at: 
http://socghop.appspot.com/gsoc/student_proposal/show/google/gsoc2010/mark_chapman/t127055148817

Thanks in advance,
Mark

From caishaojiang at gmail.com  Thu Apr  8 09:28:11 2010
From: caishaojiang at gmail.com (Cai Shaojiang)
Date: Thu, 8 Apr 2010 06:28:11 -0700
Subject: [Biojava-l] [Fwd: Re:  GSoC project on MSA]
In-Reply-To: <4BBDCFD2.3000507@uni-tuebingen.de>
References: <4BBC80A8.5000608@uni-tuebingen.de>
	<v2j927e071e1004072144t557b480au27666262c79094e2@mail.gmail.com>
	<4BBDCFD2.3000507@uni-tuebingen.de>
Message-ID: <r2p927e071e1004080628hfdce95c2y1081153aeeaaecef@mail.gmail.com>

Dear Sir:

I have submitted the proposal through Google.

Cheers.

On Thu, Apr 8, 2010 at 5:45 AM, Andreas Dräger <
andreas.draeger at uni-tuebingen.de> wrote:

> Hi Cai,
>
> Oh yes, it is in the alignment package. But it is only an interface. It
> already has two sub-types: AbstractULAlignment and this has the
> implementation SubULAlignment. We should check first if we can already use
> these data structures to easily produce a paired alignment. Can you see how
> the AlignmentPair is produced by the alignment algorithms in the alignment
> package? We should do something similar but with this different data
> structure, I suggest.
>
>
> Cheers
> Andreas
>
> --
> Dipl.-Bioinform. Andreas Dräger
> Eberhard Karls University Tübingen
> Center for Bioinformatics (ZBIT)
> Sand 1
> 72076 Tübingen
> Germany
>
> Phone: +49-7071-29-70436
> Fax:   +49-7071-29-5091
>


-- 
Cai Shaojiang
Department of Information Systems,
School of Computing,
National University of Singapore
Telephone: +65 93-4870-93
Email: caishaojiang at gmail.com; shaoj at comp.nus.edu.sg


From sacomoto at gmail.com  Thu Apr  8 12:26:55 2010
From: sacomoto at gmail.com (Gustavo Akio Tominaga Sacomoto)
Date: Thu, 8 Apr 2010 13:26:55 -0300
Subject: [Biojava-l] GSoC project on MSA
In-Reply-To: <q2r59a41c431004071212j3697b470g448df70d9b239993@mail.gmail.com>
References: <w2r6a8f5b081004052229u7d9369f5q38f0f827a0c3e9a6@mail.gmail.com> 
	<l2m59a41c431004061046mdc4c4270r825341932e59dfcc@mail.gmail.com> 
	<j2r6a8f5b081004061153la171784dx7fbbf933d7dc7326@mail.gmail.com> 
	<g2r59a41c431004061427kd3c3d8d4j45a253e7e50ad66e@mail.gmail.com> 
	<q2q6a8f5b081004062229ja299ca51j7d43daa93c155e0b@mail.gmail.com> 
	<q2r59a41c431004071212j3697b470g448df70d9b239993@mail.gmail.com>
Message-ID: <x2w6a8f5b081004080926k21ce1ff5o21a7999761fd99ec@mail.gmail.com>

Hi Andreas,

On Wed, Apr 7, 2010 at 4:12 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
> Hi Gustavo,
>
> here my 0.02$:
>
> * For some of your steps there is already code available in BioJava.
> Might be good to take a look at what is already there... (look at
> the alignment and phylo modules for dynamic programming and
> Neighbour-Joining)
>
> * What about risks? Where do you expect difficulties and how to work
> around them?
>
> * Step 4: Can you add more details? How do you plan to approach this?
> E.g. Clustalw has a number of rules implemented at this stage. Do you
> plan to support multiple rules as well, and how to do this technically?
> Something nice would be the possibility to use structure alignments to
> guide the sequence alignments. (structure module)

Based on your feedback I rewrote step 4 and added a "Main Risks" section.

I have pasted just the new version of step 4 and the new section at the
end of this e-mail.

Thank you very much for your feedback.

gustavo


-------------------------------------------------------------------------------------------

** 4. Implement the algorithm for progressive MSA and the MSA wrapper.

 Progressive MSA is a heuristic approach to the MSA problem. At each
step a pairwise alignment is computed between two sequences, between a
sequence and an alignment, or between two alignments, so the multiple
alignment is built incrementally, with more sequences aligned together
at each iteration. The guide tree gives the order for this incremental
alignment: the tree is processed bottom-up, so sequences (or groups of
sequences) with greater similarity are aligned first. To keep the code
flexible and reusable, the design will allow any binary tree over the
sequences to be used as a guide tree, not only the one built in the
previous step. This will allow a priori phylogenetic or tertiary
(structural) similarity knowledge to be used to guide the order of the
multiple alignment.

 This is certainly the most difficult part of the project, so to make
sure we deliver a fully functional MSA algorithm, a safer approach will
be taken. First, the basic algorithm described in [2] will be
implemented. Once this is done and the code is fully integrated into
the Biojava code base, the features described in [1] will be added
(and tested) incrementally in order to implement the full algorithm.
This step is further divided into substeps.

*** 4.1 Implement a first simpler dynamic programming (DP) algorithm.

  This is the generalized pairwise alignment used in each iteration of
the progressive MSA. Gaps already present in one of the alignments
(profiles) remain fixed and gap opening penalties remain unchanged,
which means that opening new gaps inside existing gaps will be fully
penalized. The code for this algorithm is similar to the code for
regular pairwise alignment already present in Biojava.
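
  As a rough illustration of such a generalized pairwise alignment, the
sketch below aligns two profiles column by column with a
Needleman-Wunsch style recurrence. It is only a sketch under stated
assumptions: the class name ProfileAligner, the toy match/mismatch
scores, the linear (not affine) gap cost and the choice to score a gap
facing a gap as zero are all illustrative and are not taken from
Biojava or from [2].

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of profile-profile alignment; all names and scores here are illustrative. */
final class ProfileAligner {
    static final int MATCH = 1, MISMATCH = -1, GAP = -2; // assumed toy scores

    /** Sum-of-pairs score of two symbols; a gap facing a gap costs nothing. */
    static int pairScore(char x, char y) {
        if (x == '-' && y == '-') return 0;
        if (x == '-' || y == '-') return GAP;
        return x == y ? MATCH : MISMATCH;
    }

    /** Score of column i of profile a against column j of profile b; index -1 means an all-gap column. */
    static int columnScore(List<String> a, int i, List<String> b, int j) {
        int total = 0;
        for (String rowA : a) {
            char x = i < 0 ? '-' : rowA.charAt(i);
            for (String rowB : b) {
                char y = j < 0 ? '-' : rowB.charAt(j);
                total += pairScore(x, y);
            }
        }
        return total;
    }

    /** Global alignment of two profiles; columns are never split, so existing gaps stay fixed. */
    static List<String> align(List<String> a, List<String> b) {
        int n = a.get(0).length(), m = b.get(0).length();
        int[][] score = new int[n + 1][m + 1];
        for (int i = 1; i <= n; i++) score[i][0] = score[i - 1][0] + columnScore(a, i - 1, b, -1);
        for (int j = 1; j <= m; j++) score[0][j] = score[0][j - 1] + columnScore(a, -1, b, j - 1);
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int diag = score[i - 1][j - 1] + columnScore(a, i - 1, b, j - 1);
                int up = score[i - 1][j] + columnScore(a, i - 1, b, -1);
                int left = score[i][j - 1] + columnScore(a, -1, b, j - 1);
                score[i][j] = Math.max(diag, Math.max(up, left));
            }
        }
        // Trace back from the bottom-right corner, emitting merged columns in reverse order.
        StringBuilder[] rows = new StringBuilder[a.size() + b.size()];
        for (int r = 0; r < rows.length; r++) rows[r] = new StringBuilder();
        int i = n, j = m;
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0 && score[i][j] == score[i - 1][j - 1] + columnScore(a, i - 1, b, j - 1)) {
                appendColumn(rows, a, --i, b, --j);
            } else if (i > 0 && score[i][j] == score[i - 1][j] + columnScore(a, i - 1, b, -1)) {
                appendColumn(rows, a, --i, b, -1);
            } else {
                appendColumn(rows, a, -1, b, --j);
            }
        }
        List<String> merged = new ArrayList<>();
        for (StringBuilder row : rows) merged.add(row.reverse().toString());
        return merged;
    }

    /** Appends one merged column: rows of a first, then rows of b; index -1 writes gaps. */
    private static void appendColumn(StringBuilder[] rows, List<String> a, int i, List<String> b, int j) {
        int r = 0;
        for (String rowA : a) rows[r++].append(i < 0 ? '-' : rowA.charAt(i));
        for (String rowB : b) rows[r++].append(j < 0 ? '-' : rowB.charAt(j));
    }
}
```

  Because whole columns are the unit of alignment, a gap that already
exists in a profile can never be removed or shifted. A single sequence
is simply a profile with one row, so the same method covers the
sequence-sequence, sequence-alignment and alignment-alignment cases.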

*** 4.2 Implement the basic progressive MSA algorithm.

  This substep implements the incremental algorithm that builds the
MSA by traversing a guide tree (a parameter; it could be the one built
in step 3 or any other) bottom-up, using the algorithm from substep
4.1 at each iteration.
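
  The bottom-up traversal can be sketched as a post-order walk over a
binary tree. This is a minimal sketch, assuming hypothetical GuideNode
and ProgressiveAligner classes (neither exists in Biojava) and a
profile represented as a list of gapped strings; the profile-profile
aligner from substep 4.1 is passed in as a plain function so that any
implementation can be plugged in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;

/** Hypothetical guide-tree node: a leaf holds one sequence, an inner node has two children. */
final class GuideNode {
    final String sequence;       // non-null only for leaves
    final GuideNode left, right; // non-null only for inner nodes

    GuideNode(String sequence) {
        this.sequence = sequence;
        this.left = null;
        this.right = null;
    }

    GuideNode(GuideNode left, GuideNode right) {
        this.sequence = null;
        this.left = left;
        this.right = right;
    }

    boolean isLeaf() {
        return sequence != null;
    }
}

final class ProgressiveAligner {
    /**
     * Post-order traversal: both subtrees are aligned first, then the two
     * resulting profiles are merged by the supplied profile-profile aligner.
     */
    static List<String> align(GuideNode node, BinaryOperator<List<String>> profileAligner) {
        if (node.isLeaf()) {
            List<String> profile = new ArrayList<>();
            profile.add(node.sequence); // a single sequence is a one-row profile
            return profile;
        }
        List<String> left = align(node.left, profileAligner);
        List<String> right = align(node.right, profileAligner);
        return profileAligner.apply(left, right);
    }
}
```

  Since the tree is just a parameter, a tree derived from structural
or phylogenetic knowledge can be used in place of the one from step 3
without touching the traversal code.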

*** 4.3 Implement the MSA wrapper.

  The MSA wrapper is going to be a method that wraps steps 2, 3 and
4.2, giving the end user a simple method to calculate the MSA. It
receives as parameters the set of sequences to be aligned, the gap
opening penalty, the gap extension penalty and the residue matrix, and
returns the MSA for the sequence set.
  At the end of this substep we have a basic, fully functional MSA
algorithm using the progressive heuristic.

*** 4.4 Implement gap penalty rescaling and parameter default values.

  The penalties for opening a new gap and extending an existing one
(the affine gap weight model) are user-defined parameters. This substep
will define default values for these parameters, based on the residue
matrix, and implement global rescaling rules for them (based on
sequence lengths).
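
  A length-based rescaling could look like the sketch below. The two
formulas are modelled on the length-dependent rules as the CLUSTALW
paper [1] is usually summarized; their exact form, the use of the
natural logarithm, and the names GapPenaltyRescaler, avgMismatchScore
and identityScale are assumptions that should be checked against the
paper before being relied on.

```java
/** Sketch of length-based gap penalty rescaling; formulas are assumptions to verify against [1]. */
final class GapPenaltyRescaler {

    /**
     * Assumed rule: GOP' = (GOP + ln(min(N, M))) * avgMismatchScore * identityScale,
     * where N and M are the lengths of the two sequences being aligned.
     */
    static double rescaleOpen(double gapOpen, int n, int m,
                              double avgMismatchScore, double identityScale) {
        return (gapOpen + Math.log(Math.min(n, m))) * avgMismatchScore * identityScale;
    }

    /**
     * Assumed rule: GEP' = GEP * (1 + |ln(N / M)|), so extending a gap
     * becomes more expensive as the two lengths diverge.
     */
    static double rescaleExtend(double gapExtend, int n, int m) {
        return gapExtend * (1.0 + Math.abs(Math.log((double) n / m)));
    }
}
```

  With equal lengths the extension penalty is returned unchanged, which
gives a simple sanity check for any implementation of the rule.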

*** 4.5 Enhance the DP algorithm to use different sequence weights.

  Based on the guide tree, a different weight is calculated for each
sequence (divergent sequences receive high values) and used in the
scoring scheme of the generalized DP algorithm.
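
  One common way to derive such weights from a rooted tree with branch
lengths is to sum, along the path from a leaf to the root, each branch
length divided by the number of leaves that share that branch. The
sketch below assumes that scheme; WeightNode and SequenceWeights are
hypothetical names, and weight normalization is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical rooted tree node carrying the length of the branch to its parent. */
final class WeightNode {
    final String name;           // sequence name, non-null only for leaves
    final double branchLength;   // length of the branch above this node (0 for the root)
    final WeightNode left, right;

    WeightNode(String name, double branchLength) {
        this.name = name;
        this.branchLength = branchLength;
        this.left = null;
        this.right = null;
    }

    WeightNode(WeightNode left, WeightNode right, double branchLength) {
        this.name = null;
        this.branchLength = branchLength;
        this.left = left;
        this.right = right;
    }

    boolean isLeaf() {
        return left == null;
    }

    /** Number of sequences below (or at) this node. */
    int leafCount() {
        return isLeaf() ? 1 : left.leafCount() + right.leafCount();
    }
}

final class SequenceWeights {
    /** Returns one weight per leaf, keyed by sequence name, in left-to-right leaf order. */
    static Map<String, Double> compute(WeightNode root) {
        Map<String, Double> weights = new LinkedHashMap<>();
        collect(root, 0.0, weights);
        return weights;
    }

    /** Each branch contributes its length divided by the number of leaves sharing it. */
    private static void collect(WeightNode node, double inherited, Map<String, Double> out) {
        double share = inherited + node.branchLength / node.leafCount();
        if (node.isLeaf()) {
            out.put(node.name, share);
        } else {
            collect(node.left, share, out);
            collect(node.right, share, out);
        }
    }
}
```

  For the tree ((A:0.1,B:0.2):0.3,C:0.5) this gives A = 0.1 + 0.3/2 =
0.25, B = 0.2 + 0.3/2 = 0.35 and C = 0.5, so the isolated sequence C
receives the highest weight while the closely related A and B share
the weight of their common branch.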

*** 4.6 Enhance the DP algorithm to use position-based gap penalties.

  The DP algorithm from substep 4.1 uses a globally defined gap opening
penalty. In this substep the algorithm is modified to use
position-based penalties, which is simple once an array of opening
penalties, one per sequence position, is known. The array is
initialized with the original gap opening penalty and then rescaled by
several hierarchical rules (only the first rule that fits, if any, is
applied).

Given the hierarchical nature of the rules, they can be implemented
incrementally, from the highest-priority rule to the lowest, the
algorithm of each step being a refinement of the previous one. I am
omitting the detailed description of each rule. However, to verify
whether a given rule applies to a given position, all that is necessary
is to check at most 16 adjacent positions and the same position in the
other, already aligned sequences.

At the end of each of the following steps we have a functional
algorithm, and after 4.6.4 the full CLUSTALW algorithm is complete.

**** 4.6.1 Lowered gap opening penalties at existing gaps.
**** 4.6.2 Increased gap opening penalties near existing gaps.
**** 4.6.3 Reduced gap opening penalties in hydrophilic stretches.
**** 4.6.4 Residue specific gap penalties.
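
The rule hierarchy can be sketched as an if / else-if chain that fills
the penalty array, here with only the first two rules (4.6.1 and
4.6.2). This is a sketch under stated assumptions: the class name
GapOpenPenalties, the scaling factors 0.5 and 2.0 and the window of 8
positions on each side (16 adjacent positions in total) are
placeholders chosen for illustration, not the values used by CLUSTALW.

```java
import java.util.List;

/** Sketch of position-specific gap opening penalties for one profile. */
final class GapOpenPenalties {
    static final double LOWER_FACTOR = 0.5; // hypothetical factor for rule 4.6.1
    static final double RAISE_FACTOR = 2.0; // hypothetical factor for rule 4.6.2
    static final int WINDOW = 8;            // assumed: 8 positions on each side

    /** True if any row of the profile has a gap at this column. */
    static boolean hasGap(List<String> profile, int pos) {
        for (String row : profile) {
            if (row.charAt(pos) == '-') return true;
        }
        return false;
    }

    /** True if a different column within WINDOW positions contains a gap. */
    static boolean nearGap(List<String> profile, int pos) {
        int len = profile.get(0).length();
        int from = Math.max(0, pos - WINDOW);
        int to = Math.min(len - 1, pos + WINDOW);
        for (int p = from; p <= to; p++) {
            if (p != pos && hasGap(profile, p)) return true;
        }
        return false;
    }

    /** Starts from the global penalty and applies only the first rule that fits. */
    static double[] compute(List<String> profile, double gapOpen) {
        int len = profile.get(0).length();
        double[] penalties = new double[len];
        for (int pos = 0; pos < len; pos++) {
            penalties[pos] = gapOpen;                        // initial value
            if (hasGap(profile, pos)) {
                penalties[pos] = gapOpen * LOWER_FACTOR;     // 4.6.1: lowered at existing gaps
            } else if (nearGap(profile, pos)) {
                penalties[pos] = gapOpen * RAISE_FACTOR;     // 4.6.2: increased near existing gaps
            }
            // Lower-priority rules (4.6.3, 4.6.4) would be added as further else-if branches.
        }
        return penalties;
    }
}
```

Adding a rule then means adding one more branch at the right priority,
which matches the incremental plan: each substep refines the previous
array without changing the DP code that consumes it.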

 ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks.

 EXTRA: Implement some benchmark technique to measure the final
alignment quality.

Main Risks
----------

The main risk to this project is the intrinsic complexity of the
progressive MSA algorithm. To deal with that, the implementation is
broken into a large number of small, manageable steps, designed so
that at the end of each of them we will have a complete and testable
new function (or a modification of an existing one). Besides that, to
be extra careful, the project aims to produce a simple, fully
functional MSA algorithm as early as possible; the estimated time is 8
weeks. This way we are guaranteed to deliver at least a simpler, but
working and bug-free, version.


> Andreas
>
>
>> -------------------------------------------------------------
>>
>> GSoC proposal
>>
>> Abstract
>> --------
>>
>> This project aims to develop an all-Java implementation of a multiple
>> sequence alignment (MSA) algorithm to be added to the Biojava toolkit,
>> using the progressive algorithm described in the CLUSTALW paper [1].
>>
>> The Importance
>> --------------
>>
>> Multiple sequence alignment is a frequently performed task in sequence
>> analysis with the goal to identify new members of protein families and
>> infer phylogenetic relationships between proteins and genes. At
>> present there is no Java-only implementation of this algorithm, so
>> the many existing Java-based bioinformatics tools and web sites
>> would benefit from this implementation, and sequence analysis could
>> be performed more easily by the end user.
>>
>> About Me
>> --------
>>
>> I am a graduate student at the University of São Paulo (Brazil). I got
>> my undergraduate degree from the same university with a major in
>> Computer Science and a minor in Biology. I have been involved with
>> bioinformatics for 5 years, always in sequence analysis and with a
>> particular interest in the MSA problem. In my undergraduate final
>> project I developed a lossless filter (pruning algorithm) for the MSA
>> problem; the work is published in [3] and there is an online
>> implementation of the algorithm at [4]. Finally, I have experience
>> with the C, C++, Java, Python and Ruby programming languages and with
>> the Git and SVN version control systems.
>>
>> Project Plan
>> ------------
>>
>> The project is divided into four main steps. At the end of each step a
>> completely functional and bug-free new algorithm will be added to the
>> Biojava code base. Note that each step depends strongly on the
>> previous one, so careful testing will be done before moving on to the
>> next step.
>>
>> The four steps are described below, with an estimated time for each.
>> Some steps also describe extra enhancements, which will be implemented
>> if time remains after all steps are completed.
>>
>> ** 1. Study the Biojava pairwise alignment code and update it to be
>> compliant with Biojava 3.
>>
>>  The pairwise alignment will play an important role in the MSA
>> algorithm. This step is also important for me to get used to the
>> Biojava coding standards and get in touch with the Biojava dev
>> community.
>>
>>  ETA: 2 weeks.
>>
>> ** 2. Implement the algorithm to build the distance matrix.
>>
>>  This is done using the pairwise alignment of each pair of sequences
>> in the set to be aligned.
>>
>>  ETA: 1 week.
>>
>>  EXTRA: Enhance the basic algorithm to use parallel strategies, using
>> several threads to calculate the pairwise alignments for different
>> pairs in the sequence set.
>>
>> ** 3. Implement the algorithm to build the guide tree.
>>
>>  The guide tree is based on the distance matrix built in the previous
>> step; the tree construction strategy adopted will be the Neighbor
>> Joining algorithm.
>>
>>  ETA: 2 weeks.
>>
>> ** 4. Implement the algorithm for progressive MSA using the guide tree.
>>
>>  This is certainly the most difficult part of the project, so to make
>> sure we deliver a fully functional MSA algorithm, a safer approach
>> will be taken. First, a dynamic programming algorithm described in
>> [2] will be implemented. Once this is done and the code is fully
>> integrated into the Biojava code base, the features described in [1]
>> will be added (and tested) incrementally in order to implement the
>> full dynamic programming algorithm.
>>
>>  ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks.
>>
>>  EXTRA: Implement some benchmark technique to measure the final
>> alignment quality.
>>
>> References
>> ----------
>>
>> [1] http://www.ncbi.nlm.nih.gov/pubmed/7984417
>> [2] http://www.ncbi.nlm.nih.gov/pubmed/3243435
>> [3] http://www.almob.org/content/4/1/3
>> [4] http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu
>>
>>
>>
>> On Tue, Apr 6, 2010 at 6:27 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
>>> Hi Gustavo,
>>>
>>> In principle I agree to all, see details below:
>>>
>>>
>>>> I think my question wasn't very clear. My intention in this project
>>>> is to follow the approach (with the three steps) outlined in the
>>>> project's page, using the classical progressive alignment heuristic:
>>>> build the distance matrix, build the guide tree and, using this tree,
>>>> progressively align more sequences together.
>>>
>>> yes
>>>
>>>>
>>>> What I propose for the third step is a first implementation using the
>>>> simpler dynamic programming described in the first CLUSTAL paper
>>>> (I think it's from 1988), incrementally improving the algorithm to
>>>> get closer to the one described in the CLUSTALW paper (from 1994). Is
>>>> this more or less what you had in mind?
>>>
>>> yes, sounds good.
>>>
>>>>
>>>> About parallel strategies, I think a relatively easy place to use
>>>> them is the distance matrix construction: we could have several
>>>> threads calculating the pairwise alignments for different pairs of
>>>> sequences in the set.
>>>
>>> Correct. Probably a first implementation would be for a single machine/
>>> multi CPU. More advanced implementations could provide support e.g. for
>>> Map/Reduce, JPPF, or something like that...
>>>
>>>> Now, measuring alignment quality is a tougher issue. The CLUSTALW
>>>> paper doesn't give any way to measure the quality of the result; they
>>>> consider a good alignment one that is hard to improve by eye (but
>>>> they claim that for sufficiently similar sequences, with no pair less
>>>> than 35% identical, the results are good). Can I do the same as in the
>>>> CLUSTALW paper and leave the quality measure to the user? How
>>>> concerned should I be with that in this project?
>>>
>>> Getting an overall core-algorithm that works should be priority. The
>>> benchmarking part is not mandatory, but something to keep in mind... I have
>>> plenty of material for that, once we get to that stage...
>>>
>>>> I will try to send a proposal draft to this mailing list by tomorrow
>>>> to get some feedback from you.
>>>
>>> Excellent, looking forward to it.
>>>
>>> Andreas
>>>
>>> --
>>> -----------------------------------------------------------------------
>>> Dr. Andreas Prlic
>>> Senior Scientist, RCSB PDB Protein Data Bank
>>> University of California, San Diego
>>> (+1) 858.246.0526
>>> -----------------------------------------------------------------------
>>>
>>
>
>
>
> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------
>


From andreas at sdsc.edu  Thu Apr  8 13:26:03 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Thu, 8 Apr 2010 10:26:03 -0700
Subject: [Biojava-l] GSoC Application
In-Reply-To: <4BBDD050.6090208@cs.wisc.edu>
References: <4BBDD050.6090208@cs.wisc.edu>
Message-ID: <x2t59a41c431004081026s4eb39908gf7fb8cc30e99d483@mail.gmail.com>

Hi Mark,

looks pretty good,

* The time schedule feels tight. Where do you see possible
difficulties and risks? What might take longer than expected?

* I would like to be able to use 3D structure alignment information to
guide the final alignment. This should increase reliability of the
final alignment for remote sequence similarities. Any thoughts on how
to accomplish this?

Andreas


On Thu, Apr 8, 2010 at 5:47 AM, Mark Chapman <chapman at cs.wisc.edu> wrote:
> I would appreciate any feedback on my proposal from mentors or other
> developers. Check it out at:
> http://socghop.appspot.com/gsoc/student_proposal/show/google/gsoc2010/mark_chapman/t127055148817
>
> Thanks in advance,
> Mark
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From andreas at sdsc.edu  Thu Apr  8 13:36:56 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Thu, 8 Apr 2010 10:36:56 -0700
Subject: [Biojava-l] GSoC project on MSA
In-Reply-To: <x2w6a8f5b081004080926k21ce1ff5o21a7999761fd99ec@mail.gmail.com>
References: <w2r6a8f5b081004052229u7d9369f5q38f0f827a0c3e9a6@mail.gmail.com>
	<l2m59a41c431004061046mdc4c4270r825341932e59dfcc@mail.gmail.com>
	<j2r6a8f5b081004061153la171784dx7fbbf933d7dc7326@mail.gmail.com>
	<g2r59a41c431004061427kd3c3d8d4j45a253e7e50ad66e@mail.gmail.com>
	<q2q6a8f5b081004062229ja299ca51j7d43daa93c155e0b@mail.gmail.com>
	<q2r59a41c431004071212j3697b470g448df70d9b239993@mail.gmail.com>
	<x2w6a8f5b081004080926k21ce1ff5o21a7999761fd99ec@mail.gmail.com>
Message-ID: <w2n59a41c431004081036o535f1696qbbe13f59f6f6f56b@mail.gmail.com>

Looks pretty good.

One issue during the progressive alignment build up: 3D structure
alignments can increase the reliability of the sequence alignments,
particularly if the sequences are only distantly related. Having a way
to incorporate the 3D structure info would be nice...

Andreas

On Thu, Apr 8, 2010 at 9:26 AM, Gustavo Akio Tominaga Sacomoto
<sacomoto at gmail.com> wrote:
> Hi Andreas,
>
> On Wed, Apr 7, 2010 at 4:12 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
>> Hi Gustavo,
>>
>> here my 0.02$:
>>
>> * For some of your steps there is already code available in BioJava.
>> MIght be good to take a look at what is already there... ? (look at
>> the alignment and phylo modules for dynamic programming and
>> Neighbour-Joining)
>>
>> * What about risks? Where do you expect difficulties and how to work
>> around them?
>>
>> * Step 4: Can you add more details? How do you plan to approach this?
>> E.g. Clustalw has a number of rules implemented at this stage. Do you
>> plan to support multiple rules as well and how to do this technically.
>> Something nice would be the possibility to use structure alignments to
>> guide the sequence alignments. (structure module)
>
> Based on it I rewrote the step 4 and add a "Main Risks" section.
>
> I pasted just the new version of step 4 and the new section at the end
> of this e-mal.
>
> Thank you very much for your feedback.
>
> gustavo
>
>
>
> -------------------------------------------------------------------------------------------
>
> ** 4. Implement the algorithm for progressive MSA and the MSA wrapper.
>
> ?A progressive MSA is a heuristic approach for the MSA problem, at
> each step a pairwise alignment between two sequences, a sequence and
> an alignment or between two alignments is done. So, the multiple
> alignment is built incrementally, at each iteration more sequences are
> aligned together. The guide tree gives an order for this incremental
> alignment, in a bottom-up (in the tree) fashion sequences (or groups
> of sequences) with greater similarity are aligned first. Therefore, in
> order to have a more flexible and reusable code, the code design will
> allow any binary tree of the sequences to be used as a guide tree, not
> only the one built in the last step. This will allow a priori
> phylogenetic or tertiary similarity (structural similarity) knowledge
> be used to guide the multiple alignment order.
>
> ?This is certainly the most difficult part of the project, so to make
> sure we are going to deliver a fully functional MSA algorithm, a safer
> approach is going to be taken. In the first place, a a basic algorithm
> described in [2] will be implemented. Once this get successfully done
> and the code fully integrated to the Biojava code base, the features
> described in [1] are going to be incrementally added (and tested) in
> order to implement the full algorithm. This step is further divided in
> substeps.
>
> *** 4.1 Implement a first simpler dynamic programming (DP) algorithm.
>
> ?This is the generalized pairwise alignment used in each iteration of
> the progressive MSA. Gaps ?already presents in one of the alignments
> (profiles) remain fixed, gap opening penalties remain unchanged, this
> means that opening new gaps inside existent gaps will be fully
> penalized. The code for this algorithm is similar to, the already
> present in Biojava, code for regular pairwise alignment.
>
> *** 4.2 Implement the basic progressive MSA algorithm.
>
> ?In this substep is going to be implemented the incremental algorithm
> to built the MSA, transversing a guide tree (parameter, could be the
> one built in step 3 or any other one) in a bottom-up fashion and using
> the algorithm from substep 4.1 at each iteration.
>
> *** 4.3 Implement the MSA wrapper.
>
> ?The MSA wrapper is going to be a method that wraps steps 2, 3 and
> 4.2, giving a simple method (for the final user) to calculate the MSA.
> Receiving as parameters the set of sequences to be aligned, the gap
> opening penalty, gap extend penalty and residue matrix. Returning the
> MSA for the sequence set.
> ?At the end of this substep, we get a basic fully functional MSA
> algorithm, using the progressive heuristic.
>
> *** 4.4 Implement gaps penalties rescaling and parameter default values.
>
> ?Gap penalties to open a new gap an extend a existing one (the affine
> gap weight model) are user defined parameters. This substep will
> define default values, based on the residue matrix, for this
> parameters and implement global rescaling rules (based on sequences
> sizes) for this parameters.
>
> *** 4.5 Enhance the DP algorithm to use different sequences weight.
>
> ?Based on the guide tree, for each sequence a different weight
> (divergent sequences receive high values) is calculated and used in
> the scoring scheme of the generalized DP algorithm.
>
> *** 4.6 Enhance the DP algorithm to use position based gap penalties.
>
> ?The DP algorithm from substep 4.1 uses globally defined gap opening
> penalty. In this substep, the algorithm is going to be modified do use
> position based penalty, this is simple, once is known an array of
> opening penalties for each sequence position. This array is calculated
> based on several hierarchical (only apply the first one that fits, if
> any) rules, those are rescaling rules and the array is initialized
> with the original gap opening penalty.
>
> Given the hierarchical nature of the rules, they can be implemented in
> a incremental way, from the highest priority rule to the lowest, the
> algorithm of each step being a refinement of the previous one. I am
> omitting the detailed description of each rule. However, to verify if
> a given rule apply to a given position, all that is necessary is to
> check at most 16 adjacent positions and the same position in the other
> already aligned sequences.
>
> At the end of each of the following steps we a have functional
> algorithm, and after 4.6.4 the full CLUSTALW algorithm is complete.
>
> **** 4.6.1 Lowered gap opening penalties at existing gaps.
> **** 4.6.2 Increased gap opening penalties near existing gaps.
> **** 4.6.3 Reduced gap opening penalties in hydrophilic stretches.
> **** 4.6.4 Residue specific gap penalties.
>
> ?ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks.
>
> ?EXTRA: Implement some benchmark technique to measure the final
> alignment quality.
>
> Main Risks
> ----------
>
> The main risk to this project is the intrinsic complexity of the MSA
> progressive algorithm. To deal with that we decided to break the
> implementation in a large number of small and manageable steps, and
> the steps are designed in a way that, at the end of each of them, we
> will have a complete and testable new function (or a modification of
> an existing one). Besides that, to be extra careful the project aims
> to produce a simple full functional MSA algorithm as early as
> possible, the estimated time is 8 weeks, this way we guarantee to
> deliver at a simpler, but working and bug-free, version.
>
>
>
>
>> Andreas
>>
>>
>>> -------------------------------------------------------------
>>>
>>> GSoC proposal
>>>
>>> Abstract
>>> --------
>>>
>>> This project aims to develop an all-Java implementation of a multiple
>>> sequence alignment (MSA) algorithm to be added to the Biojava toolkit,
>>> using the progressive algorithm described in the CLUSTALW paper [1].
>>>
>>> The Importance
>>> --------------
>>>
>>> Multiple sequence alignment is a frequently performed task in sequence
>>> analysis with the goal to identify new members of protein families and
>>> infer phylogenetic relationships between proteins and genes. At the
>>> present there is no Java-only implementation for this algorithm. As
>>> such the number of already existing and Java related BioInformatics
>>> tools and web sites would benefit from this implementation and
>>> sequence analysis could be more easily performed by the end-user.
>>>
>>> About Me
>>> --------
>>>
>>> I am a graduate student at University of S?o Paulo (Brazil), I got my
>>> undergraduate degree from the same university with a major in Computer
>>> Science and a minor in Biology. I have been involved with
>>> Bioinformatics for 5 years, always with sequence analysis with
>>> particular interest in the MSA problem. Also, in my undergraduate
>>> final project I developed a lossless filter (pruning algorithm) for
>>> the MSA problem, the work is published in [3] and there is an online
>>> implementation of the algorithm in [4]. Finally, I have experience
>>> with the C, C++, Java, Python and Ruby programming languages; Git and
>>> SVN version control systems.
>>>
>>> Project Plan
>>> ------------
>>>
>>> The project is divided in four main steps, at the end of each step a
>>> completely functional and bug-free new algorithm will be added to the
>>> Biojava code base. It should be noticed that each step has a strong
>>> dependence on the previous one, so before move to the next step a
>>> careful testing will be done.
>>>
>>> The four steps are described below, estimated times for accomplishment
>>> of each step are also given and in some steps extra enhancements are
>>> described, they will be implemented if there is some time remaining
>>> after all steps are completed.
>>>
>>> ** 1. Study the Biojava pairwise alignment code and update it to be
>>> compliant with Biojava 3.
>>>
>>> ?The pairwise alignment will play an important role in the MSA
>>> algorithm. This step is also important for me to get used to the
>>> Biojava coding standards and get in touch with the Biojava dev
>>> community.
>>>
>>> ?ETA: 2 weeks.
>>>
>>> ** 2. Implement the algorithm to build the distance matrix.
>>>
>>> ?This is done using the pairwise alignment for each pair of sequence
>>> in the set to be aligned.
>>>
>>> ?ETA: 1 week.
>>>
>>> ?EXTRA: Enhance the basic algorithm to use parallel strategies, use
>>> several threads to calculate the pairwise alignment for different
>>> pairs in the sequence set.
>>>
>>> ** 3. Implement the algorithm to build the guide tree.
>>>
>>> ?The guide tree is based on the distance matrix built in the last
>>> step, the tree construction strategy adopted will be the Neighbor
>>> Joining Algorithm.
>>>
>>> ?ETA: 2 weeks.
>>>
>>> ** 4. Implement the algorithm for progressive MSA using the guide tree.
>>>
>>> ?This is certainly the most difficult part of the project, so to make
>>> sure we are going to deliver a fully functional MSA algorithm, a safer
>>> approach is going to be taken. In the first place, a dynamic
>>> programming algorithm described in [2] will be implemented. Once this
>>> get successfully done and the code fully integrated to the Biojava
>>> code base, the features described in [1] are going to be incrementally
>>> added (and tested) in order to implement the full dynamic programming
>>> algorithm.
>>>
>>> ?ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks.
>>>
>>> ?EXTRA: Implement some benchmark technique to measure the final
>>> alignment quality.
>>>
>>> References
>>> ----------
>>>
>>> [1] http://www.ncbi.nlm.nih.gov/pubmed/7984417
>>> [2] http://www.ncbi.nlm.nih.gov/pubmed/3243435
>>> [3] http://www.almob.org/content/4/1/3
>>> [4] http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu
>>>
>>>
>>>
>>> On Tue, Apr 6, 2010 at 6:27 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>> Hi Gustavo,
>>>>
>>>> In principle I agree to all, see details below:
>>>>
>>>>
>>>> I think my question wasn't very clear, my intention in this project is
>>>>>
>>>>> to follow the approach (with the tree steps) outlined in the project's
>>>>> page. Using the classical progressive alignment heuristic: build the
>>>>> distance matrix, build the guide tree and using this tree
>>>>> progressively align more sequences together.
>>>>
>>>> yes
>>>>
>>>>>
>>>>> What I propose for the third step is a first implementation using the
>>>>> (more simple) dynamic programming described in the first CLUSTAL paper
>>>>> (I thinks it's from 1988) and incrementally improving the algorithm to
>>>>> get closer to the one described in CLUSTALW paper (from 1994). Is this
>>>>> more or less what you had in mind?
>>>>
>>>> yes, sounds good.
>>>>
>>>>>
>>>>> About parallel strategies, I think a relatively easy way we could use them
>>>>> is in the distance matrix construction: we could have several threads
>>>>> calculating the pairwise alignments for different pairs of sequences in
>>>>> the set.
>>>>
>>>> Correct. Probably a first implementation would be for a single machine/
>>>> multi CPU. More advanced implementations could provide support e.g. for
>>>> Map/Reduce, JPPF, or something like that...
>>>>
>>>>> Now, the alignment quality measure is a tougher issue. The CLUSTALW
>>>>> paper doesn't give any way to measure the quality of the result; they
>>>>> consider a good alignment to be one that is hard to improve by eye (but
>>>>> they claim that for sufficiently similar sequences, with no pair less than
>>>>> 35% identical, the results are good). Can I do the same as in the CLUSTALW
>>>>> paper and leave the quality measure to the user? How concerned should
>>>>> I be with that in this project?
>>>>
>>>> Getting an overall core-algorithm that works should be priority. The
>>>> benchmarking part is not mandatory, but something to keep in mind... I have
>>>> plenty of material for that, once we get to that stage...
>>>>
>>>>> I will try to send a proposal draft to this mailing list by tomorrow
>>>>> to get some feedback from you.
>>>>
>>>> Excellent, looking forward to it.
>>>>
>>>> Andreas
>>>>
>>>> --
>>>> -----------------------------------------------------------------------
>>>> Dr. Andreas Prlic
>>>> Senior Scientist, RCSB PDB Protein Data Bank
>>>> University of California, San Diego
>>>> (+1) 858.246.0526
>>>> -----------------------------------------------------------------------
>>>>
>>>
>>
>>
>>
>> --
>> -----------------------------------------------------------------------
>> Dr. Andreas Prlic
>> Senior Scientist, RCSB PDB Protein Data Bank
>> University of California, San Diego
>> (+1) 858.246.0526
>> -----------------------------------------------------------------------
>>
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From chapman at cs.wisc.edu  Thu Apr  8 16:45:21 2010
From: chapman at cs.wisc.edu (Mark Chapman)
Date: Thu, 08 Apr 2010 15:45:21 -0500
Subject: [Biojava-l] GSoC Application
In-Reply-To: <x2t59a41c431004081026s4eb39908gf7fb8cc30e99d483@mail.gmail.com>
References: <4BBDD050.6090208@cs.wisc.edu>
	<x2t59a41c431004081026s4eb39908gf7fb8cc30e99d483@mail.gmail.com>
Message-ID: <4BBE4061.3000204@cs.wisc.edu>

Hi Andreas,

Thanks for the feedback.

Difficulties and risks:
By viewing progressive multiple sequence alignment as four separate stages, I 
believe the pieces become easier to manage.  However, I also expect a few of my 
ideas to prove quite challenging to implement.  One of these challenges will be 
efficient parallelization.  Instead of spending all summer finding the optimal 
approach, I plan to make routines which are called in sequence in a simple 
implementation and in parallel in a separate one.  Later work could then extend 
the parallelism to a distributed computing framework such as Hadoop or Condor. 
Another difficult aspect is to make a general interface for choosing anchors in 
profile-profile alignment.  The Myers-Miller algorithm chooses optimal midpoints 
as anchors in an internal decision process.  I hope to generalize this to allow 
external identification of candidate anchors, as well.
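
The sequential/parallel split described above could look roughly like the 
following Java sketch. Everything in it is an assumption made for illustration: 
the class name PairwiseScoreSketch, the method names, and especially the 
placeholder score() routine, which only counts identical positions and stands in 
for a real pairwise alignment. The point is only that one routine is shared by a 
sequential driver and a thread-pool driver, and that both fill the same matrix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PairwiseScoreSketch {

    // Placeholder scorer (assumption): fraction of identical positions over the
    // shorter length. A real implementation would run a pairwise alignment here.
    static double score(String a, String b) {
        int n = Math.min(a.length(), b.length());
        if (n == 0) {
            return 0.0;
        }
        int same = 0;
        for (int i = 0; i < n; i++) {
            if (a.charAt(i) == b.charAt(i)) {
                same++;
            }
        }
        return (double) same / n;
    }

    // Simple implementation: the routine is called once per pair, in sequence.
    static double[][] sequential(List<String> seqs) {
        int n = seqs.size();
        double[][] m = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                double s = score(seqs.get(i), seqs.get(j));
                m[i][j] = s;
                m[j][i] = s;
            }
        }
        return m;
    }

    // Separate implementation: the same routine, one task per pair on a thread pool.
    static double[][] parallel(List<String> seqs, int threads) {
        int n = seqs.size();
        double[][] m = new double[n][n];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                for (int j = i; j < n; j++) {
                    final int a = i;
                    final int b = j;
                    // Each task writes only its own two cells, so tasks never collide.
                    pending.add(pool.submit(() -> {
                        double s = score(seqs.get(a), seqs.get(b));
                        m[a][b] = s;
                        m[b][a] = s;
                    }));
                }
            }
            for (Future<?> f : pending) {
                f.get(); // wait for the task; its writes are visible afterwards
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
        return m;
    }

    public static void main(String[] args) {
        List<String> seqs = Arrays.asList("ACGT", "ACGA", "TTTT");
        System.out.println(Arrays.deepToString(sequential(seqs)));
        System.out.println(Arrays.deepToString(parallel(seqs, 2)));
    }
}
```

Because both drivers go through the same score() call, a distributed version 
would only need to replace the driver, not the scoring routine.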

Structural alignment integration:
At least three options exist for inserting structural information into the 
multiple sequence alignment task: pairwise scoring, anchoring, and profile 
scoring.  First, scores from pairwise structural alignments could be used to 
construct the similarity matrix.  This would create a guide tree that aligns 
sequences with similar structures earlier in the progressive alignment.  Second, 
structural alignment could identify possible anchors.  The profile-profile 
alignments would then conserve known structures when two profiles share some 
anchor candidates.  Both of these options are in my plan.  The third option 
would follow the consistency method of profile-profile alignment which replaces 
scoring from a substitution matrix with a consistency score.  This technique is 
used in T-Coffee and ProbCons.  The consistency score comes from how often 
residues in each profile aligned when combining information from pairwise 
alignments.  If these were structural pairwise alignments, then the multiple 
sequence alignment would preserve structural information.  Later work could 
implement this method as an alternative profile-profile alignment.

I'll try to incorporate these ideas when I revise my application later tonight. 
And thanks again for your input.

Mark


On 4/8/2010 12:26 PM, Andreas Prlic wrote:
> Hi Mark,
>
> looks pretty good,
>
> * The time schedule feels tight. Where do you see possible
> difficulties and risks. What might take longer than expected?
>
> * I would like to be able to use 3D structure alignment information to
> guide the final alignment. This should increase reliability of the
> final alignment for remote sequence similarities. Any thoughts on how
> to accomplish this?
>
> Andreas
>
>
>
>
> On Thu, Apr 8, 2010 at 5:47 AM, Mark Chapman<chapman at cs.wisc.edu>  wrote:
>> I would appreciate any feedback on my proposal from mentors or other
>> developers.  Check it out at:
>> http://socghop.appspot.com/gsoc/student_proposal/show/google/gsoc2010/mark_chapman/t127055148817
>>
>> Thanks in advance,
>> Mark
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
>
>
>

From sacomoto at gmail.com  Thu Apr  8 20:36:27 2010
From: sacomoto at gmail.com (Gustavo Akio Tominaga Sacomoto)
Date: Thu, 8 Apr 2010 21:36:27 -0300
Subject: [Biojava-l] GSoC project on MSA
In-Reply-To: <w2n59a41c431004081036o535f1696qbbe13f59f6f6f56b@mail.gmail.com>
References: <w2r6a8f5b081004052229u7d9369f5q38f0f827a0c3e9a6@mail.gmail.com> 
	<l2m59a41c431004061046mdc4c4270r825341932e59dfcc@mail.gmail.com> 
	<j2r6a8f5b081004061153la171784dx7fbbf933d7dc7326@mail.gmail.com> 
	<g2r59a41c431004061427kd3c3d8d4j45a253e7e50ad66e@mail.gmail.com> 
	<q2q6a8f5b081004062229ja299ca51j7d43daa93c155e0b@mail.gmail.com> 
	<q2r59a41c431004071212j3697b470g448df70d9b239993@mail.gmail.com> 
	<x2w6a8f5b081004080926k21ce1ff5o21a7999761fd99ec@mail.gmail.com> 
	<w2n59a41c431004081036o535f1696qbbe13f59f6f6f56b@mail.gmail.com>
Message-ID: <n2t6a8f5b081004081736jb6894b71ub3cfd6649b5a7b8d@mail.gmail.com>

Hi Andreas,

On Thu, Apr 8, 2010 at 2:36 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
> Looks pretty good.
>
> One issue during the progressive alignment build up: 3D structure
> alignments can increase the reliability of the sequence alignments,
> particularly if the sequences are only distantly related. Having a way
> to incorporate the 3D structure info would be nice...

A first idea to incorporate some information about 3D structure
alignment is to extract some matching substrings from this alignment,
i.e. obtain the sequence substrings that correspond to the
superimposed points in the 3D alignment, and then force the final MSA
to contain those same aligned substrings. To do that, the DP
algorithm of step 4.1 should be modified in the way described here:
http://www.ncbi.nlm.nih.gov/pubmed/9018604
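
To make the idea of forcing aligned positions concrete, here is a minimal Java 
sketch. It is a simplification under stated assumptions and not necessarily the 
modification described in the reference above: anchors are single residue pairs 
(i, j) that must end up in the same column, the two sequences are cut at the 
anchors, and each stretch between anchors is aligned independently with a plain 
Needleman-Wunsch routine. The class name, the scoring constants (match 1, 
mismatch -1, linear gap -1) and the two-sequence setting are all hypothetical; 
the real step 4.1 would work on profiles.

```java
final class AnchoredAlignmentSketch {

    static final int MATCH = 1;
    static final int MISMATCH = -1;
    static final int GAP = -1; // linear gap penalty, chosen only for the sketch

    // Plain global alignment (Needleman-Wunsch) of two strings.
    static String[] globalAlign(String a, String b) {
        int n = a.length();
        int m = b.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 1; i <= n; i++) {
            d[i][0] = i * GAP;
        }
        for (int j = 1; j <= m; j++) {
            d[0][j] = j * GAP;
        }
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? MATCH : MISMATCH;
                d[i][j] = Math.max(d[i - 1][j - 1] + sub,
                        Math.max(d[i - 1][j] + GAP, d[i][j - 1] + GAP));
            }
        }
        // Traceback from the bottom-right corner, preferring the diagonal move.
        StringBuilder ra = new StringBuilder();
        StringBuilder rb = new StringBuilder();
        int i = n;
        int j = m;
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? MATCH : MISMATCH;
                if (d[i][j] == d[i - 1][j - 1] + sub) {
                    ra.append(a.charAt(--i));
                    rb.append(b.charAt(--j));
                    continue;
                }
            }
            if (i > 0 && d[i][j] == d[i - 1][j] + GAP) {
                ra.append(a.charAt(--i));
                rb.append('-');
            } else {
                ra.append('-');
                rb.append(b.charAt(--j));
            }
        }
        return new String[] { ra.reverse().toString(), rb.reverse().toString() };
    }

    // anchors: 0-based {i, j} pairs, strictly increasing in both coordinates.
    // Each pair is forced into one column; the segments in between are aligned freely.
    static String[] anchoredAlign(String a, String b, int[][] anchors) {
        StringBuilder ra = new StringBuilder();
        StringBuilder rb = new StringBuilder();
        int pa = 0;
        int pb = 0;
        for (int[] anchor : anchors) {
            String[] seg = globalAlign(a.substring(pa, anchor[0]), b.substring(pb, anchor[1]));
            ra.append(seg[0]).append(a.charAt(anchor[0]));
            rb.append(seg[1]).append(b.charAt(anchor[1]));
            pa = anchor[0] + 1;
            pb = anchor[1] + 1;
        }
        String[] tail = globalAlign(a.substring(pa), b.substring(pb));
        ra.append(tail[0]);
        rb.append(tail[1]);
        return new String[] { ra.toString(), rb.toString() };
    }

    public static void main(String[] args) {
        String[] free = anchoredAlign("ACGT", "AGT", new int[0][]);
        String[] forced = anchoredAlign("ACGT", "AGT", new int[][] { { 1, 1 } });
        System.out.println(free[0] + "\n" + free[1]);
        System.out.println(forced[0] + "\n" + forced[1]);
    }
}
```

In the usage above the unconstrained call is free to place the gap wherever the 
score is best, while the anchored call must keep position 1 of both sequences in 
the same column even though that costs a mismatch.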

Thanks again.

gustavo

> Andreas
>
> On Thu, Apr 8, 2010 at 9:26 AM, Gustavo Akio Tominaga Sacomoto
> <sacomoto at gmail.com> wrote:
>> Hi Andreas,
>>
>> On Wed, Apr 7, 2010 at 4:12 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
>>> Hi Gustavo,
>>>
>>> here are my $0.02:
>>>
>>> * For some of your steps there is already code available in BioJava.
>>> It might be good to take a look at what is already there (look at
>>> the alignment and phylo modules for dynamic programming and
>>> Neighbour-Joining).
>>>
>>> * What about risks? Where do you expect difficulties and how to work
>>> around them?
>>>
>>> * Step 4: Can you add more details? How do you plan to approach this?
>>> E.g. Clustalw has a number of rules implemented at this stage. Do you
>>> plan to support multiple rules as well and how to do this technically.
>>> Something nice would be the possibility to use structure alignments to
>>> guide the sequence alignments. (structure module)
>>
>> Based on your feedback I rewrote step 4 and added a "Main Risks" section.
>>
>> I pasted just the new version of step 4 and the new section at the end
>> of this e-mail.
>>
>> Thank you very much for your feedback.
>>
>> gustavo
>>
>>
>>
>> -------------------------------------------------------------------------------------------
>>
>> ** 4. Implement the algorithm for progressive MSA and the MSA wrapper.
>>
>> A progressive MSA is a heuristic approach to the MSA problem: at
>> each step a pairwise alignment is done between two sequences, between a
>> sequence and an alignment, or between two alignments. So the multiple
>> alignment is built incrementally, and at each iteration more sequences are
>> aligned together. The guide tree gives an order for this incremental
>> alignment: working bottom-up in the tree, sequences (or groups
>> of sequences) with greater similarity are aligned first. Therefore, in
>> order to have more flexible and reusable code, the code design will
>> allow any binary tree of the sequences to be used as a guide tree, not
>> only the one built in the last step. This will allow a priori
>> phylogenetic or tertiary similarity (structural similarity) knowledge
>> to be used to guide the multiple alignment order.
>>
>> This is certainly the most difficult part of the project, so to make
>> sure we are going to deliver a fully functional MSA algorithm, a safer
>> approach is going to be taken. In the first place, a basic algorithm
>> described in [2] will be implemented. Once this is successfully done
>> and the code fully integrated into the Biojava code base, the features
>> described in [1] are going to be incrementally added (and tested) in
>> order to implement the full algorithm. This step is further divided into
>> substeps.
>>
>> *** 4.1 Implement a first simpler dynamic programming (DP) algorithm.
>>
>> This is the generalized pairwise alignment used in each iteration of
>> the progressive MSA. Gaps already present in one of the alignments
>> (profiles) remain fixed and gap opening penalties remain unchanged, which
>> means that opening new gaps inside existing gaps will be fully
>> penalized. The code for this algorithm is similar to the code for
>> regular pairwise alignment already present in Biojava.
>>
>> *** 4.2 Implement the basic progressive MSA algorithm.
>>
>> In this substep the incremental algorithm that builds the MSA is going
>> to be implemented, traversing a guide tree (a parameter, which could be the
>> one built in step 3 or any other one) in a bottom-up fashion and using
>> the algorithm from substep 4.1 at each iteration.
>>
>> *** 4.3 Implement the MSA wrapper.
>>
>> The MSA wrapper is going to be a method that wraps steps 2, 3 and
>> 4.2, giving the final user a simple method to calculate the MSA. It
>> receives as parameters the set of sequences to be aligned, the gap
>> opening penalty, the gap extension penalty and the residue matrix, and
>> returns the MSA for the sequence set.
>> At the end of this substep, we get a basic, fully functional MSA
>> algorithm using the progressive heuristic.
>>
>> *** 4.4 Implement gap penalty rescaling and parameter default values.
>>
>> The gap penalties to open a new gap and to extend an existing one (the affine
>> gap weight model) are user-defined parameters. This substep will
>> define default values, based on the residue matrix, for these
>> parameters and implement global rescaling rules (based on sequence
>> sizes) for them.
>>
>> *** 4.5 Enhance the DP algorithm to use different sequence weights.
>>
>> Based on the guide tree, a different weight is calculated for each sequence
>> (divergent sequences receive high values) and used in
>> the scoring scheme of the generalized DP algorithm.
>>
>> *** 4.6 Enhance the DP algorithm to use position based gap penalties.
>>
>> The DP algorithm from substep 4.1 uses a globally defined gap opening
>> penalty. In this substep, the algorithm is going to be modified to use
>> position-based penalties. This is simple once an array of
>> opening penalties for each sequence position is known. This array is
>> initialized with the original gap opening penalty and then rescaled
>> according to several hierarchical rules (only the first one that fits,
>> if any, is applied).
>>
>> Given the hierarchical nature of the rules, they can be implemented in
>> an incremental way, from the highest priority rule to the lowest, the
>> algorithm of each step being a refinement of the previous one. I am
>> omitting the detailed description of each rule. However, to verify whether
>> a given rule applies to a given position, all that is necessary is to
>> check at most 16 adjacent positions and the same position in the other
>> already aligned sequences.
>>
>> At the end of each of the following steps we have a functional
>> algorithm, and after 4.6.4 the full CLUSTALW algorithm is complete.
>>
>> **** 4.6.1 Lowered gap opening penalties at existing gaps.
>> **** 4.6.2 Increased gap opening penalties near existing gaps.
>> **** 4.6.3 Reduced gap opening penalties in hydrophilic stretches.
>> **** 4.6.4 Residue specific gap penalties.
>>
>> ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks.
>>
>> EXTRA: Implement some benchmark technique to measure the final
>> alignment quality.
>>
>> Main Risks
>> ----------
>>
>> The main risk to this project is the intrinsic complexity of the
>> progressive MSA algorithm. To deal with that we decided to break the
>> implementation into a large number of small and manageable steps, and
>> the steps are designed in a way that, at the end of each of them, we
>> will have a complete and testable new function (or a modification of
>> an existing one). Besides that, to be extra careful the project aims
>> to produce a simple, fully functional MSA algorithm as early as
>> possible; the estimated time is 8 weeks. This way we guarantee to
>> deliver at least a simpler, but working and bug-free, version.
>>
>>
>>
>>
>>> Andreas
>>>
>>>
>>>> -------------------------------------------------------------
>>>>
>>>> GSoC proposal
>>>>
>>>> Abstract
>>>> --------
>>>>
>>>> This project aims to develop an all-Java implementation of a multiple
>>>> sequence alignment (MSA) algorithm to be added to the Biojava toolkit,
>>>> using the progressive algorithm described in the CLUSTALW paper [1].
>>>>
>>>> The Importance
>>>> --------------
>>>>
>>>> Multiple sequence alignment is a frequently performed task in sequence
>>>> analysis with the goal to identify new members of protein families and
>>>> infer phylogenetic relationships between proteins and genes. At the
>>>> present there is no Java-only implementation for this algorithm. As
>>>> such the number of already existing and Java related BioInformatics
>>>> tools and web sites would benefit from this implementation and
>>>> sequence analysis could be more easily performed by the end-user.
>>>>
>>>> About Me
>>>> --------
>>>>
>>>> I am a graduate student at University of São Paulo (Brazil), I got my
>>>> undergraduate degree from the same university with a major in Computer
>>>> Science and a minor in Biology. I have been involved with
>>>> Bioinformatics for 5 years, always with sequence analysis with
>>>> particular interest in the MSA problem. Also, in my undergraduate
>>>> final project I developed a lossless filter (pruning algorithm) for
>>>> the MSA problem, the work is published in [3] and there is an online
>>>> implementation of the algorithm in [4]. Finally, I have experience
>>>> with the C, C++, Java, Python and Ruby programming languages; Git and
>>>> SVN version control systems.
>>>>
>>>> Project Plan
>>>> ------------
>>>>
>>>> The project is divided into four main steps; at the end of each step a
>>>> completely functional and bug-free new algorithm will be added to the
>>>> Biojava code base. It should be noted that each step has a strong
>>>> dependence on the previous one, so careful testing will be done before
>>>> moving to the next step.
>>>>
>>>> The four steps are described below, estimated times for accomplishment
>>>> of each step are also given and in some steps extra enhancements are
>>>> described, they will be implemented if there is some time remaining
>>>> after all steps are completed.
>>>>
>>>> ** 1. Study the Biojava pairwise alignment code and update it to be
>>>> compliant with Biojava 3.
>>>>
>>>> The pairwise alignment will play an important role in the MSA
>>>> algorithm. This step is also important for me to get used to the
>>>> Biojava coding standards and get in touch with the Biojava dev
>>>> community.
>>>>
>>>> ETA: 2 weeks.
>>>>
>>>> ** 2. Implement the algorithm to build the distance matrix.
>>>>
>>>> This is done using the pairwise alignment for each pair of sequences
>>>> in the set to be aligned.
>>>>
>>>> ETA: 1 week.
>>>>
>>>> EXTRA: Enhance the basic algorithm to use parallel strategies, using
>>>> several threads to calculate the pairwise alignments for different
>>>> pairs in the sequence set.
>>>>
>>>> ** 3. Implement the algorithm to build the guide tree.
>>>>
>>>> The guide tree is based on the distance matrix built in the last
>>>> step; the tree construction strategy adopted will be the Neighbor
>>>> Joining algorithm.
>>>>
>>>> ETA: 2 weeks.
>>>>
>>>> ** 4. Implement the algorithm for progressive MSA using the guide tree.
>>>>
>>>> This is certainly the most difficult part of the project, so to make
>>>> sure we are going to deliver a fully functional MSA algorithm, a safer
>>>> approach is going to be taken. In the first place, a dynamic
>>>> programming algorithm described in [2] will be implemented. Once this
>>>> is successfully done and the code fully integrated into the Biojava
>>>> code base, the features described in [1] are going to be incrementally
>>>> added (and tested) in order to implement the full dynamic programming
>>>> algorithm.
>>>>
>>>> ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks.
>>>>
>>>> EXTRA: Implement some benchmark technique to measure the final
>>>> alignment quality.
>>>>
>>>> References
>>>> ----------
>>>>
>>>> [1] http://www.ncbi.nlm.nih.gov/pubmed/7984417
>>>> [2] http://www.ncbi.nlm.nih.gov/pubmed/3243435
>>>> [3] http://www.almob.org/content/4/1/3
>>>> [4] http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu
>>>>
>>>>
>>>>
>>>> On Tue, Apr 6, 2010 at 6:27 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>> Hi Gustavo,
>>>>>
>>>>> In principle I agree to all, see details below:
>>>>>
>>>>>
>>>>>> I think my question wasn't very clear; my intention in this project is
>>>>>> to follow the approach (with the three steps) outlined in the project's
>>>>>> page, using the classical progressive alignment heuristic: build the
>>>>>> distance matrix, build the guide tree and, using this tree,
>>>>>> progressively align more sequences together.
>>>>>
>>>>> yes
>>>>>
>>>>>>
>>>>>> What I propose for the third step is a first implementation using the
>>>>>> (simpler) dynamic programming described in the first CLUSTAL paper
>>>>>> (I think it's from 1988) and incrementally improving the algorithm to
>>>>>> get closer to the one described in the CLUSTALW paper (from 1994). Is this
>>>>>> more or less what you had in mind?
>>>>>
>>>>> yes, sounds good.
>>>>>
>>>>>>
>>>>>> About parallel strategies, I think a relatively easy way we could use them
>>>>>> is in the distance matrix construction: we could have several threads
>>>>>> calculating the pairwise alignments for different pairs of sequences in
>>>>>> the set.
>>>>>
>>>>> Correct. Probably a first implementation would be for a single machine/
>>>>> multi CPU. More advanced implementations could provide support e.g. for
>>>>> Map/Reduce, JPPF, or something like that...
>>>>>
>>>>>> Now, the alignment quality measure is a tougher issue. The CLUSTALW
>>>>>> paper doesn't give any way to measure the quality of the result; they
>>>>>> consider a good alignment to be one that is hard to improve by eye (but
>>>>>> they claim that for sufficiently similar sequences, with no pair less than
>>>>>> 35% identical, the results are good). Can I do the same as in the CLUSTALW
>>>>>> paper and leave the quality measure to the user? How concerned should
>>>>>> I be with that in this project?
>>>>>
>>>>> Getting an overall core-algorithm that works should be priority. The
>>>>> benchmarking part is not mandatory, but something to keep in mind... I have
>>>>> plenty of material for that, once we get to that stage...
>>>>>
>>>>>> I will try to send a proposal draft to this mailing list by tomorrow
>>>>>> to get some feedback from you.
>>>>>
>>>>> Excellent, looking forward to it.
>>>>>
>>>>> Andreas
>>>>>
>>>>> --
>>>>> -----------------------------------------------------------------------
>>>>> Dr. Andreas Prlic
>>>>> Senior Scientist, RCSB PDB Protein Data Bank
>>>>> University of California, San Diego
>>>>> (+1) 858.246.0526
>>>>> -----------------------------------------------------------------------
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> -----------------------------------------------------------------------
>>> Dr. Andreas Prlic
>>> Senior Scientist, RCSB PDB Protein Data Bank
>>> University of California, San Diego
>>> (+1) 858.246.0526
>>> -----------------------------------------------------------------------
>>>
>>
>
>
>
> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------
>


From sheoran143 at gmail.com  Sun Apr 11 15:16:29 2010
From: sheoran143 at gmail.com (Deepak Sheoran)
Date: Sun, 11 Apr 2010 14:16:29 -0500
Subject: [Biojava-l] Issue with SimpleNCBITaxon class
Message-ID: <4BC2200D.8000109@gmail.com>

Hi,

There is a very fundamental issue in the SimpleNCBITaxon class, because of 
which it is producing the wrong taxonomy hierarchy. I am explaining what I 
have found; let me know what you guys think of it, and suggest how to fix it.

1) Columns in the taxon table are (taxon_id, ncbi_taxon_id, parent_taxon_id, 
nodeRank, geneticCode, mitoGeneticCode, leftValue, rightValue)
2) In the class SimpleNCBITaxon we assume that "parent_taxon_id" holds the 
parent's ncbi_taxon_id for the current ncbi_taxon_id value, but that is not 
true. The value that "parent_taxon_id" holds is the "taxon_id" of the row whose 
ncbi_taxon_id is the parent of the current ncbi_taxon_id.

<property name="NCBITaxID" column="ncbi_taxon_id" node="@NCBITaxId"/>
<property name="nodeRank" column="node_rank"/>
<property name="geneticCode" column="genetic_code"/>
<property name="mitoGeneticCode" column="mito_genetic_code"/>
<property name="leftValue" column="left_value"/>
<property name="rightValue" column="right_value"/>
<property name="parentNCBITaxID" column="parent_taxon_id"/>      ----- 
this is not correct: the column parent_taxon_id stores the taxon_id of the row 
that holds the parent's ncbi_taxon_id for the current entry

Thanks
Deepak Sheoran


From holland at eaglegenomics.com  Sun Apr 11 15:53:06 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Sun, 11 Apr 2010 20:53:06 +0100
Subject: [Biojava-l] Issue with SimpleNCBITaxon class
In-Reply-To: <4BC2200D.8000109@gmail.com>
References: <4BC2200D.8000109@gmail.com>
Message-ID: <B2DECC5B-650E-434E-8955-5F02DB4297AC@eaglegenomics.com>

I'm sorry but I don't understand your example. Could you provide a real example of correct values for each column from a sample taxon entry in NCBI, plus an example of what BioJava is doing wrong? (i.e. give a sample record to use as reference, then point out the correct value of parent_taxon_id, and point out what value BioJava is using instead).

thanks,
Richard

On 11 Apr 2010, at 20:16, Deepak Sheoran wrote:

> Hi,
> 
> There is a very fundamental issue in the SimpleNCBITaxon class, because of which it is producing the wrong taxonomy hierarchy. I am explaining what I have found; let me know what you guys think of it, and suggest how to fix it.
> 
> 1) Columns in taxon table are (taxon_id, ncbi_taxon_id, parent_taxon_id, nodeRank, geneticCode, mitoGeneticCode, leftValue, rightValue)
> 2) In the class SimpleNCBITaxon we assume that "parent_taxon_id" holds the parent's ncbi_taxon_id for the current ncbi_taxon_id value, but that is not true. The value that "parent_taxon_id" holds is the "taxon_id" of the row whose ncbi_taxon_id is the parent of the current ncbi_taxon_id.
> 
> <property name="NCBITaxID" column="ncbi_taxon_id" node="@NCBITaxId"/>
> <property name="nodeRank" column="node_rank"/>
> <property name="geneticCode" column="genetic_code"/>
> <property name="mitoGeneticCode" column="mito_genetic_code"/>
> <property name="leftValue" column="left_value"/>
> <property name="rightValue" column="right_value"/>
> <property name="parentNCBITaxID" column="parent_taxon_id"/>      ----- its not correct column parent_taxon_id stores the taxon_id which have parent_ncbi_taxon_id for current entry
> 
> Thanks
> Deepak Sheoran
> 
> 

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From sheoran143 at gmail.com  Sun Apr 11 17:08:22 2010
From: sheoran143 at gmail.com (Deepak Sheoran)
Date: Sun, 11 Apr 2010 16:08:22 -0500
Subject: [Biojava-l] Issue with SimpleNCBITaxon class
In-Reply-To: <B2DECC5B-650E-434E-8955-5F02DB4297AC@eaglegenomics.com>
References: <4BC2200D.8000109@gmail.com>
	<B2DECC5B-650E-434E-8955-5F02DB4297AC@eaglegenomics.com>
Message-ID: <4BC23A46.7090304@gmail.com>

I am using the same table with the biojava and bioperl taxon programs, and the 
output I get is below:

*Biojava:*
For example, for ncbi_taxon_id = 11876 (Avian sarcoma virus), the lineage 
I get is
             Rhamnus; Platanus occidentalis; Suillus placidus;
             Diadasia australis; Arnicastrum guerrerense; Labiduridae;
             Oreostemma alpigenum var. haydenii.

Biojava process of finding names: 
11876==>301952==>50447==>176516==>143975==>48579==>4403==>3609==>276240   
(wrong way of doing things)

*Bioperl:*
For example, for ncbi_taxon_id = 11876 (Avian sarcoma virus), the lineage 
I get is
           Retroviridae; Orthoretrovirinae; Alpharetrovirus; 
unclassified Alpharetrovirus.

Bioperl process of finding names: 
11876==>353825==>153057==>327045==>11632   (Right way of doing things)

Hint: biojava searches the ncbi_taxon_id column with a value from 
parent_taxon_id, whereas bioperl searches the taxon_id column with a value from 
parent_taxon_id.

*Taxon and Taxon_name table content relevant to this discussion:*

taxon_id  ncbi_taxon_id  parent_taxon_id  node_rank  name                                name_class
2901      3609           276240           genus      Rhamnus                             scientific name
3610      4403           3609             species    Platanus occidentalis               scientific name
29052     48579          4403             species    Suillus placidus                    scientific name
114412    143975         48579            species    Diadasia australis                  scientific name
143976    176516         143975           species    Arnicastrum guerrerense             scientific name
30680     50447          176516           family     Labiduridae                         scientific name
254757    301952         50447            varietas   Oreostemma alpigenum var. haydenii  scientific name
9394      11632          17394            family     Retroviridae                        scientific name
277861    327045         9394             subfamily  Orthoretrovirinae                   scientific name
122448    153057         277861           genus      Alpharetrovirus                     scientific name
301952    353825         122448           no rank    unclassified Alpharetrovirus        scientific name
9584      11876          301952           species    Avian sarcoma virus                 scientific name
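
The two lookups in the hint can be replayed on the rows listed above with a 
small, self-contained Java sketch. It is only an illustration and is not the 
SimpleNCBITaxon or BioPerl code: the class TaxonWalkSketch, its Row type and the 
lineage() method are hypothetical names, and the table is held in memory instead 
of being read through Hibernate. The boolean flag selects which column the 
parent_taxon_id value is matched against.

```java
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

final class TaxonWalkSketch {

    /** One joined taxon / taxon_name row from the table above. */
    static final class Row {
        final int taxonId;
        final int ncbiTaxonId;
        final int parentTaxonId;
        final String name;

        Row(int taxonId, int ncbiTaxonId, int parentTaxonId, String name) {
            this.taxonId = taxonId;
            this.ncbiTaxonId = ncbiTaxonId;
            this.parentTaxonId = parentTaxonId;
            this.name = name;
        }
    }

    // Sample data copied from the table in this message.
    static final Row[] ROWS = {
        new Row(2901, 3609, 276240, "Rhamnus"),
        new Row(3610, 4403, 3609, "Platanus occidentalis"),
        new Row(29052, 48579, 4403, "Suillus placidus"),
        new Row(114412, 143975, 48579, "Diadasia australis"),
        new Row(143976, 176516, 143975, "Arnicastrum guerrerense"),
        new Row(30680, 50447, 176516, "Labiduridae"),
        new Row(254757, 301952, 50447, "Oreostemma alpigenum var. haydenii"),
        new Row(9394, 11632, 17394, "Retroviridae"),
        new Row(277861, 327045, 9394, "Orthoretrovirinae"),
        new Row(122448, 153057, 277861, "Alpharetrovirus"),
        new Row(301952, 353825, 122448, "unclassified Alpharetrovirus"),
        new Row(9584, 11876, 301952, "Avian sarcoma virus"),
    };

    /**
     * Returns the ancestor names of the given taxon, root-most first.
     * parentIsTaxonId = true  : parent_taxon_id is matched against taxon_id.
     * parentIsTaxonId = false : parent_taxon_id is matched against ncbi_taxon_id.
     */
    static List<String> lineage(int ncbiTaxonId, boolean parentIsTaxonId) {
        Map<Integer, Row> byTaxonId = new HashMap<>();
        Map<Integer, Row> byNcbiId = new HashMap<>();
        for (Row r : ROWS) {
            byTaxonId.put(r.taxonId, r);
            byNcbiId.put(r.ncbiTaxonId, r);
        }
        Map<Integer, Row> parentIndex = parentIsTaxonId ? byTaxonId : byNcbiId;

        LinkedList<String> names = new LinkedList<>();
        Row current = byNcbiId.get(ncbiTaxonId);
        // The size check is a guard against cycles (e.g. a root that is its own parent).
        while (current != null && names.size() < ROWS.length) {
            Row parent = parentIndex.get(current.parentTaxonId);
            if (parent != null) {
                names.addFirst(parent.name);
            }
            current = parent;
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("matched on ncbi_taxon_id: " + lineage(11876, false));
        System.out.println("matched on taxon_id:      " + lineage(11876, true));
    }
}
```

With the flag set to false the walk reproduces the plant and insect names from 
the Biojava output; with the flag set to true it reproduces the retrovirus 
lineage from the Bioperl output. Both walks stop where the sample rows run out.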


Thanks
Deepak

On 4/11/2010 2:53 PM, Richard Holland wrote:
> I'm sorry but I don't understand your example. Could you provide a real example of correct values for each column from a sample taxon entry in NCBI, plus an example of what BioJava is doing wrong? (i.e. give a sample record to use as reference, then point out the correct value of parent_taxon_id, and point out what value BioJava is using instead).
>
> thanks,
> Richard
>
> On 11 Apr 2010, at 20:16, Deepak Sheoran wrote:
>
>    
>> Hi,
>>
>> There is a very fundamental issue in the SimpleNCBITaxon class, because of which it is producing the wrong taxonomy hierarchy. I am explaining what I have found; let me know what you guys think of it, and suggest how to fix it.
>>
>> 1) Columns in taxon table are (taxon_id, ncbi_taxon_id, parent_taxon_id, nodeRank, geneticCode, mitoGeneticCode, leftValue, rightValue)
>> 2) In the class SimpleNCBITaxon we assume that "parent_taxon_id" holds the parent's ncbi_taxon_id for the current ncbi_taxon_id value, but that is not true. The value that "parent_taxon_id" holds is the "taxon_id" of the row whose ncbi_taxon_id is the parent of the current ncbi_taxon_id.
>>
>> <property name="NCBITaxID" column="ncbi_taxon_id" node="@NCBITaxId"/>
>> <property name="nodeRank" column="node_rank"/>
>> <property name="geneticCode" column="genetic_code"/>
>> <property name="mitoGeneticCode" column="mito_genetic_code"/>
>> <property name="leftValue" column="left_value"/>
>> <property name="rightValue" column="right_value"/>
>> <property name="parentNCBITaxID" column="parent_taxon_id"/>       ----- its not correct column parent_taxon_id stores the taxon_id which have parent_ncbi_taxon_id for current entry
>>
>> Thanks
>> Deepak Sheoran
>>
>>
>>      
> --
> Richard Holland, BSc MBCS
> Operations and Delivery Director, Eagle Genomics Ltd
> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/
>
>    


From sheoran143 at gmail.com  Sun Apr 11 18:48:00 2010
From: sheoran143 at gmail.com (Deepak Sheoran)
Date: Sun, 11 Apr 2010 17:48:00 -0500
Subject: [Biojava-l] Issue with SimpleNCBITaxon class
In-Reply-To: <4BC23A46.7090304@gmail.com>
References: <4BC2200D.8000109@gmail.com>
	<B2DECC5B-650E-434E-8955-5F02DB4297AC@eaglegenomics.com>
	<4BC23A46.7090304@gmail.com>
Message-ID: <4BC251A0.4090602@gmail.com>

If we don't want to change the current code in biojava and still want to 
fix this bug, I have found a way:
1) We can do this by changing one of the hibernate files, called 
"Taxon.hbm.xml", replacing the line
<property name="parentNCBITaxID" column="parent_taxon_id"/>
     with
<property name="parentNCBITaxID" formula="(select tax.ncbi_taxon_id from 
taxon tax where tax.taxon_id = parent_taxon_id)"/>

By changing the above hibernate setting I am able to get the 
correct lineage for ncbi_taxon_id = 11876 (Avian sarcoma virus), which is
              Viruses; Retro-transcribing viruses; Retroviridae; 
Orthoretrovirinae; Alpharetrovirus; unclassified Alpharetrovirus.

2) But a possible issue we might get is with the taxonomy loader 
class, which wants to insert something for the parent taxon_id into the taxon 
table, which I think won't be possible if we make this change to the hibernate 
config file.

Deepak Sheoran


On 4/11/2010 4:08 PM, Deepak Sheoran wrote:
> I am using same table with biojava and bioperl taxon program and the 
> output I get is below:
>
> *Biojava:*
> For example for ncbi_taxon_id = 11876 (Avian sarcoma virus), the 
> lineage i get is
>             Rhamnus; Platanus occidentalis; Suillus placidus; Diadasia 
> australis; Arnicastrum guerrerense; Labiduridae; Oreostemma alpigenum 
> var. haydenii.
>
> Biojava process of finding names: 
> 11876==>301952==>50447==>176516==>143975==>48579==>4403==>3609==>276240   
> (wrong way of doing things)
>
> *Bioperl:*
> For example for ncbi_taxon_id = 11876 (Avian sarcoma virus), the 
> lineage i get is
>           Retroviridae; Orthoretrovirinae; Alpharetrovirus; 
> unclassified  Alpharetrovirus.
>
> Bioperl process of finding names: 
> 11876==>353825==>153057==>327045==>11632   (Right way of doing things)
>
> Hint: biojava search ncbi_taxon_id column with a value from 
> parent_taxon_id where bioperl search taxon_id column with a value from 
> parent_taxon_id.
>
> *Taxon and Taxon_name Table content which is being relevant  in 
> discussion:*
>
> taxon_id 	ncbi_taxon_id 	parent_taxon_id 	node_rank 	name 	name_class
> 2901 	3609 	276240 	genus 	Rhamnus 	scientific name
> 3610 	4403 	3609 	species 	Platanus occidentalis 	scientific name
> 29052 	48579 	4403 	species 	Suillus placidus 	scientific name
> 114412 	143975 	48579 	species 	Diadasia australis 	scientific name
> 143976 	176516 	143975 	species 	Arnicastrum guerrerense 	scientific name
> 30680 	50447 	176516 	family 	Labiduridae 	scientific name
> 254757 	301952 	50447 	varietas 	Oreostemma alpigenum var. haydenii 	scientific name
> 9394 	11632 	17394 	family 	Retroviridae 	scientific name
> 277861 	327045 	9394 	subfamily 	Orthoretrovirinae 	scientific name
> 122448 	153057 	277861 	genus 	Alpharetrovirus 	scientific name
> 301952 	353825 	122448 	no rank 	unclassified Alpharetrovirus 	scientific name
> 9584 	11876 	301952 	species 	Avian sarcoma virus 	scientific name
>
>
> Thanks
> Deepak
>
> On 4/11/2010 2:53 PM, Richard Holland wrote:
>> I'm sorry but I don't understand your example. Could you provide a real example of correct values for each column from a sample taxon entry in NCBI, plus an example of what BioJava is doing wrong? (i.e. give a sample record to use as reference, then point out the correct value of parent_taxon_id, and point out what value BioJava is using instead).
>>
>> thanks,
>> Richard
>>
>> On 11 Apr 2010, at 20:16, Deepak Sheoran wrote:
>>
>>    
>>> Hi,
>>>
>>> There is a very fundamental issue in the SimpleNCBITaxon class because of which it is producing a wrong taxonomy hierarchy. I am explaining what I have found; let me know what you guys think of it, and let me suggest how to fix it.
>>>
>>> 1) Columns in taxon table are (taxon_id, ncbi_taxon_id, parent_taxon_id, nodeRank, geneticCode, mitoGeneticCode, leftValue, rightValue)
>>> 2) In the class SimpleNCBITaxon we assume that "parent_taxon_id" holds the parent's ncbi_taxon_id for the current ncbi_taxon_id, but that is not true. The value that "parent_taxon_id" holds is the "taxon_id" of the row whose ncbi_taxon_id is the parent of the current ncbi_taxon_id.
>>>
>>> <property name="NCBITaxID" column="ncbi_taxon_id" node="@NCBITaxId"/>
>>> <property name="nodeRank" column="node_rank"/>
>>> <property name="geneticCode" column="genetic_code"/>
>>> <property name="mitoGeneticCode" column="mito_genetic_code"/>
>>> <property name="leftValue" column="left_value"/>
>>> <property name="rightValue" column="right_value"/>
>>> <property name="parentNCBITaxID" column="parent_taxon_id"/>       ----- this is not correct: the parent_taxon_id column stores the taxon_id of the row that holds the parent's ncbi_taxon_id for the current entry
>>>
>>> Thanks
>>> Deepak Sheoran
>>>
>>>
>>>      
>> --
>> Richard Holland, BSc MBCS
>> Operations and Delivery Director, Eagle Genomics Ltd
>> T: +44 (0)1223 654481 ext 3 | E:holland at eaglegenomics.com
>> http://www.eaglegenomics.com/
>>
>>    
>


From holland at eaglegenomics.com  Mon Apr 12 02:57:57 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Mon, 12 Apr 2010 07:57:57 +0100
Subject: [Biojava-l] Issue with SimpleNCBITaxon class
In-Reply-To: <4BC23A46.7090304@gmail.com>
References: <4BC2200D.8000109@gmail.com>
	<B2DECC5B-650E-434E-8955-5F02DB4297AC@eaglegenomics.com>
	<4BC23A46.7090304@gmail.com>
Message-ID: <D75549FE-6866-4397-ACEE-A897C719C441@eaglegenomics.com>

Thanks Deepak. 

I've had a look at the code and I believe it's due to the different ways in which BioJava and BioPerl load the taxon table. 

BioJava sets the ncbi_taxon_id and parent_taxon_id columns based on the values from the NCBI taxonomy file. The taxon_id column in BioJava is a meaningless auto-generated value that is never used.

BioPerl however is generating taxon_id values and linking them by setting parent_taxon_id to the generated value. The parent value from the NCBI taxonomy file is therefore replaced with the BioPerl generated parent ID, meaning that instead of linking from parent_taxon_id to ncbi_taxon_id as per BioJava, the link is to taxon_id instead. (I'm basing this comment on looking at load_ncbi_taxonomy.pl from the BioSQL archives.)

I believe if you load the taxonomy table using BioJava, you should see BioJava giving correct behaviour. Likewise if you load it using BioPerl, BioPerl will behave correctly. But if you load with one then query with the other, you'll get incorrect results.
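
The load-with-one, query-with-the-other mismatch can be reproduced on the rows Deepak posted. The following is a minimal sketch under stated assumptions, not BioJava or BioPerl code: the class TaxonLineageDemo, the Row type and the lineage method are hypothetical names made up for illustration, and the table is held in memory instead of being read from BioSQL. The only thing that changes between the two runs is which column parent_taxon_id is matched against.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; nothing here is part of the BioJava API.
class TaxonLineageDemo {

    /** One joined taxon/taxon_name row, as posted earlier in this thread. */
    static final class Row {
        final int taxonId;
        final int ncbiTaxonId;
        final int parentTaxonId;
        final String name;

        Row(int taxonId, int ncbiTaxonId, int parentTaxonId, String name) {
            this.taxonId = taxonId;
            this.ncbiTaxonId = ncbiTaxonId;
            this.parentTaxonId = parentTaxonId;
            this.name = name;
        }
    }

    // Rows copied from the table in the thread (taxon_id, ncbi_taxon_id, parent_taxon_id, name).
    static final List<Row> ROWS = Arrays.asList(
        new Row(2901, 3609, 276240, "Rhamnus"),
        new Row(3610, 4403, 3609, "Platanus occidentalis"),
        new Row(29052, 48579, 4403, "Suillus placidus"),
        new Row(114412, 143975, 48579, "Diadasia australis"),
        new Row(143976, 176516, 143975, "Arnicastrum guerrerense"),
        new Row(30680, 50447, 176516, "Labiduridae"),
        new Row(254757, 301952, 50447, "Oreostemma alpigenum var. haydenii"),
        new Row(9394, 11632, 17394, "Retroviridae"),
        new Row(277861, 327045, 9394, "Orthoretrovirinae"),
        new Row(122448, 153057, 277861, "Alpharetrovirus"),
        new Row(301952, 353825, 122448, "unclassified Alpharetrovirus"),
        new Row(9584, 11876, 301952, "Avian sarcoma virus"));

    /**
     * Collects ancestor names, root first, for the row with the given ncbi_taxon_id.
     * matchOnNcbiId = true  : parent_taxon_id is looked up in ncbi_taxon_id.
     * matchOnNcbiId = false : parent_taxon_id is looked up in taxon_id.
     */
    static List<String> lineage(int ncbiTaxonId, boolean matchOnNcbiId) {
        Map<Integer, Row> byNcbiId = new HashMap<>();
        Map<Integer, Row> byTaxonId = new HashMap<>();
        for (Row r : ROWS) {
            byNcbiId.put(r.ncbiTaxonId, r);
            byTaxonId.put(r.taxonId, r);
        }
        Map<Integer, Row> index = matchOnNcbiId ? byNcbiId : byTaxonId;

        List<String> names = new ArrayList<>();
        Row start = byNcbiId.get(ncbiTaxonId);
        if (start == null) {
            return names;
        }
        Row parent = index.get(start.parentTaxonId);
        int steps = 0; // bounded so that a cyclic table cannot loop forever
        while (parent != null && steps++ < ROWS.size()) {
            names.add(parent.name);
            parent = index.get(parent.parentTaxonId);
        }
        Collections.reverse(names);
        return names;
    }

    public static void main(String[] args) {
        System.out.println("match on ncbi_taxon_id: " + String.join("; ", lineage(11876, true)));
        System.out.println("match on taxon_id:      " + String.join("; ", lineage(11876, false)));
    }
}
```

Run against the posted rows, the first call walks into unrelated plant and insect entries, while the second returns the Retroviridae lineage, which is the behaviour Deepak reported for a table whose parent_taxon_id values point at taxon_id.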

This sounds like a case for discussion on both lists - a matter of standardisation between the two projects. Not quickly/easily solvable for now.

cheers,
Richard


--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From holland at eaglegenomics.com  Mon Apr 12 03:07:55 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Mon, 12 Apr 2010 08:07:55 +0100
Subject: [Biojava-l] Issue with SimpleNCBITaxon class
In-Reply-To: <D75549FE-6866-4397-ACEE-A897C719C441@eaglegenomics.com>
References: <4BC2200D.8000109@gmail.com>
	<B2DECC5B-650E-434E-8955-5F02DB4297AC@eaglegenomics.com>
	<4BC23A46.7090304@gmail.com>
	<D75549FE-6866-4397-ACEE-A897C719C441@eaglegenomics.com>
Message-ID: <E7FB88D1-52D9-496C-86FA-738419FFF579@eaglegenomics.com>

Incidentally, BioJava's approach matches the description in the BioSQL docs at:

 http://biosql.org/wiki/Schema_Overview#TAXON.2C_TAXON_NAME

(first example SQL statement - find the taxon id of the parent taxon for 'Homo sapiens' using a self-join)

The BioPerl/BioSQL load_ncbi_taxonomy.pl script however does not match this description.

cheers,
Richard


--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From mara.axiom at gmail.com  Tue Apr 13 10:55:50 2010
From: mara.axiom at gmail.com (Mara Axiom)
Date: Tue, 13 Apr 2010 10:55:50 -0400
Subject: [Biojava-l] BioJava implementation of a phylogenetic tree
	reconstruction algorithm
Message-ID: <z2z6375ed361004130755oc61dc936j140bc4515b0270fc@mail.gmail.com>

Hello all,

Does anyone have a BioJava implementation of a phylogenetic tree
reconstruction algorithm other than neighbor-joining or UPGMA? I need this
for a research project. We already have neighbor-joining and UPGMA
implementations, and we want to look at other algorithms. I am new to
BioJava, so any information will help.

Here is what we want.

1 - Compare sequences in a FASTA file, and find sequences that are similar
to each other.
2 - Construct the tree.
3 - Output the tree in Newick (XML will work too) format.
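
For step 3, writing Newick is a short recursive traversal once a tree exists, whichever reconstruction algorithm produced it. The following is only a sketch under assumptions: NewickSketch and its Node type are hypothetical names, not BioJava classes, and labels are written as-is without the quoting that Newick requires for names containing spaces or punctuation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of BioJava.
class NewickSketch {

    /** A tree node with an optional label and the length of the branch to its parent. */
    static final class Node {
        final String label;
        final double branchLength;
        final List<Node> children = new ArrayList<>();

        Node(String label, double branchLength) {
            this.label = label;
            this.branchLength = branchLength;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Serializes the tree rooted at root as a single Newick string ending in ';'. */
    static String toNewick(Node root) {
        StringBuilder sb = new StringBuilder();
        write(root, sb, true);
        return sb.append(';').toString();
    }

    private static void write(Node node, StringBuilder sb, boolean isRoot) {
        if (!node.children.isEmpty()) {
            sb.append('(');
            for (int i = 0; i < node.children.size(); i++) {
                if (i > 0) {
                    sb.append(',');
                }
                write(node.children.get(i), sb, false);
            }
            sb.append(')');
        }
        sb.append(node.label);
        if (!isRoot) {
            // The root has no parent branch, so no length is written for it.
            sb.append(':').append(node.branchLength);
        }
    }

    public static void main(String[] args) {
        Node inner = new Node("", 0.3).add(new Node("B", 0.2)).add(new Node("C", 0.25));
        Node root = new Node("", 0.0).add(new Node("A", 0.1)).add(inner);
        System.out.println(toNewick(root));
    }
}
```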

In particular we are interested in implementation of BNNP (
http://www.cs.cmu.edu/~guyb/papers/SDBHRS06.pdf) and Align Free (
http://www.math.ucla.edu/~roch/research_files/align-free.pdf) algorithms,
but we are open to other algorithms too.

Please do not recommend a P-tree reconstruction tool. We are only interested
in source code that meets our specific purpose.

Thanks in advance,
Mara

From biopython at maubp.freeserve.co.uk  Thu Apr 15 13:54:56 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Apr 2010 18:54:56 +0100
Subject: [Biojava-l] Issue with SimpleNCBITaxon class
In-Reply-To: <D75549FE-6866-4397-ACEE-A897C719C441@eaglegenomics.com>
References: <4BC2200D.8000109@gmail.com>
	<B2DECC5B-650E-434E-8955-5F02DB4297AC@eaglegenomics.com>
	<4BC23A46.7090304@gmail.com>
	<D75549FE-6866-4397-ACEE-A897C719C441@eaglegenomics.com>
Message-ID: <m2o320fb6e01004151054rcb57a28fvad135dffbe35d5fa@mail.gmail.com>

Hi,

I've CC'd this to the BioSQL mailing list for cross project
discussion.

On Mon, Apr 12, 2010 at 7:57 AM, Richard Holland  wrote:
> Thanks Deepak.
>
> I've had a look at the code and I believe it's due to the
> different ways in which BioJava and BioPerl load the
> taxon table.
>
> BioJava sets the ncbi_taxon_id and parent_taxon_id
> columns based on the values from the NCBI taxonomy
> file. The taxon_id column in BioJava is a meaningless
> auto-generated value that is never used.
>
> BioPerl however is generating taxon_id values and
> linking them by setting parent_taxon_id to the
> generated value. The parent value from the NCBI
> taxonomy file is therefore replaced with the BioPerl
> generated parent ID, meaning that instead of linking
> from parent_taxon_id to ncbi_taxon_id as per BioJava,
> the link is to taxon_id instead. (I'm basing this
> comment on looking at load_ncbi_taxonomy.pl from
> the BioSQL archives.)

Note that old versions of load_ncbi_taxonomy.pl
(which is part of BioSQL, not part of BioPerl) would
set taxon_id equal to ncbi_taxon_id, see:
http://bugzilla.open-bio.org/show_bug.cgi?id=2470

This may help explain the confusion.

> I believe if you load the taxonomy table using BioJava,
> you should see BioJava giving correct behaviour.
> Likewise if you load it using BioPerl, BioPerl will
> behave correctly. But if you load with one then query
> with the other, you'll get incorrect results.
>
> This sounds like a case for discussion on both lists -
> a matter of standardisation between the two projects.
> Not quickly/easily solvable for now.

It's not just two projects (BioPerl & BioJava) (grin).
It's at least five projects (BioSQL itself plus BioRuby
and Biopython).

I'm not sure about BioRuby's implementation, but
currently I think BioJava is the odd one out - BioPerl,
Biopython, and the BioSQL's load_ncbi_taxonomy.pl
all make entries in parent_taxon_id reference the
automatically generated taxon_id (please correct
me if I am wrong).

My personal view is that bioperl-db is the reference
implementation and should be followed in the event
of any ambiguity within BioSQL. In this particular
case, there is actually a BioSQL script to check
against too (load_ncbi_taxonomy.pl).

Hopefully Hilmar can give us an official verdict...

Peter

From andreas.draeger at uni-tuebingen.de  Wed Apr  7 09:22:26 2010
From: andreas.draeger at uni-tuebingen.de (=?ISO-8859-1?Q?Andreas_Dr=E4ger?=)
Date: Wed, 07 Apr 2010 15:22:26 +0200
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava
	alignment lead"]
Message-ID: <4BBC8712.90907@uni-tuebingen.de>

Hi all,

This e-mail is just for your information about somebody new, who'd like 
to contribute to our project.

Cheers
Andreas

-- 
Dipl.-Bioinform. Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 Tübingen
Germany

Phone: +49-7071-29-70436
Fax:   +49-7071-29-5091
-------------- next part --------------
An embedded message was scrubbed...
From: =?ISO-8859-1?Q?Andreas_Dr=E4ger?=
 <andreas.draeger at uni-tuebingen.de>
Subject: Re: Fwd: Proposing a project on "Biojava alignment lead"
Date: Wed, 07 Apr 2010 09:27:13 +0200
Size: 4779
URL: <http://lists.open-bio.org/pipermail/biojava-l/attachments/20100407/6a3f0bf8/attachment.eml>

From jbdundas at gmail.com  Fri Apr 16 09:57:41 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Fri, 16 Apr 2010 19:27:41 +0530
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava
	alignment lead"]
In-Reply-To: <4BBD820D.9070200@uni-tuebingen.de>
References: <4BBD820D.9070200@uni-tuebingen.de>
Message-ID: <j2n326ea8621004160657ga9002ed9w52a2646d5befd22@mail.gmail.com>

Dear Sir,

I am very interested in contributing to this project.

I am looking for a good problem, more on the research side. I can also
help with coding (I also work as a software engineer with
J2EE/Eclipse/JBoss/Tomcat).

Anything that I could work on...

Regards,
Jitesh Dundas

On 4/8/10, Andreas Dräger <andreas.draeger at uni-tuebingen.de> wrote:
> Hi all,
>
> This e-mail is just for your information about somebody new, who'd like
> to contribute to our project.
>
> Cheers
> Andreas
>
>
> Subject:
> Re: Fwd: Proposing a project on "Biojava alignment lead"
> From:
> Andreas Dräger <andreas.draeger at uni-tuebingen.de>
> Date:
> Wed, 07 Apr 2010 09:27:13 +0200
> To:
> Cai Shaojiang <caishaojiang at gmail.com>
>
> Hi Cai Shaojiang,
>
> Thank you for your e-mail! I don't know what happened to the e-mail list.
> Sometimes it takes a while due to the spam filters, I guess.
>
>  > I am a PhD student from National University of Singapore. My major
> research area is local alignment algorithms and data structures for SNP
> identification. And I have used Java and Eclipse for years for software
> development. I am very interested in your GSoC programme. I find that
> there is a module called "biojava-alignment lead" whose mentor is you. I
> want to propose a new project on this module. I have several questions
> about this module.
>
> Yes, that's me. So great to get your support.
>
>  > 1. It seems that pairwise alignment is to find similarity between two
> short sequences. Existing pairwise alignment is based on dynamic
> programming, is it Smith-Waterman algorithm?
>
> So, currently, BioJava contains three different alignment approaches.
> There are two deterministic algorithms, i.e., Smith-Waterman for local
> alignment and Needleman-Wunsch for global alignment. Third, there is the
> possibility to apply Hidden Markov Models for alignment. An example of
> the latter approach should be in the cookbook.
>
>  > 2. What is the exact task of "refactoring of underlying data structures"?
>
> Yes, this is something I did last week already, but it could still be
> improved. The problem was that the alignment algorithms actually
> produced a kind of string that looks similar to the output of BLAST.
> This string contained the score, the computation time, the length of the
> alignment etc. The problem was that people wanted to perform
> higher-level computation on the score value or evaluate some other
> information. Now, the alignment will produce a data structure that
> contains all the information and can, in addition to that, also produce
> such a BLAST-like output. There is, however, still the following
> problem: The data structure requires both sequences in the pair-wise
> alignment to have an identical length. In case of local alignment this
> is especially stupid (actually), because gaps are inserted to fill the
> sequences. And then the data structure tries to keep the old sequence
> coordinates, leading to the effect that the numbers "query start",
> "query end", "subject start", and "subject end" are required to shift
> the sequences against each other when displaying the output. So, you
> cannot easily print the sequences one below the other; you first have to
> shift them. Please check out the latest version of this package via
> anonymous svn and have a look ;-)
>
>  > 3. My existing research area aims to deal with aligning short
> reads (10s~100s bp) against extremely long sequences (e.g., human
> genome). As far as I know, there are no existing alignment tools of this
> kind implemented in Java. Would you consider this direction?
>
> See, this would be very nice to include. But this requires that we no
> longer fill the short sequence with many, many gap symbols (just a waste
> of memory), but improve the data structure. There is already an
> UnequalLengthAlignment (just a data structure, no algorithm) and I think
> we could use this as a starting point. Then your algorithm should only
> produce such a data structure and this would be fine.
>
>  > 4. It seems that the existing tools are just lacking some
> refactoring and representation interfaces. Any more underlying tasks?
>
> Hm. Yes: With the release of BioJava 3 data structures have changed
> again. So maybe there's also some adaptation to the new structure required.
>
>  > I have been keeping an eye on GSoC since last month, but I am sorry to
> find that I sent the initial email to the mailing list before I subscribed to it...
>
> Ok. Sounds good. Thanks for your interest. So I suggest: Download the
> latest trunk, have a look, play around and if you can improve something
> we'll put it into the trunk and write your name into the authors' tag.
>
> Cheers
> Andreas
>
> --
> Dipl.-Bioinform. Andreas Dräger
> Eberhard Karls University Tübingen
> Center for Bioinformatics (ZBIT)
> Sand 1
> 72076 Tübingen
> Germany
>
> Phone: +49-7071-29-70436
> Fax:   +49-7071-29-5091
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


From chapman at cs.wisc.edu  Fri Apr 16 13:28:33 2010
From: chapman at cs.wisc.edu (Mark Chapman)
Date: Fri, 16 Apr 2010 12:28:33 -0500
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on
 "Biojava	alignment lead"]
In-Reply-To: <j2n326ea8621004160657ga9002ed9w52a2646d5befd22@mail.gmail.com>
References: <4BBD820D.9070200@uni-tuebingen.de>
	<j2n326ea8621004160657ga9002ed9w52a2646d5befd22@mail.gmail.com>
Message-ID: <4BC89E41.4030009@cs.wisc.edu>

A great place to start finding ideas is the wiki.
Both http://biojava.org/wiki/BioJava:Modules
and http://biojava.org/wiki/BioJava3_Proposal
list the next steps planned/desired for BioJava.

What research area did you have in mind?

Have fun,
Mark



From sheoran143 at gmail.com  Fri Apr 16 14:43:59 2010
From: sheoran143 at gmail.com (Deepak Sheoran)
Date: Fri, 16 Apr 2010 13:43:59 -0500
Subject: [Biojava-l] Issue with SimpleNCBITaxon class
In-Reply-To: <m2o320fb6e01004151054rcb57a28fvad135dffbe35d5fa@mail.gmail.com>
References: <4BC2200D.8000109@gmail.com>	
	<B2DECC5B-650E-434E-8955-5F02DB4297AC@eaglegenomics.com>	
	<4BC23A46.7090304@gmail.com>	
	<D75549FE-6866-4397-ACEE-A897C719C441@eaglegenomics.com>
	<m2o320fb6e01004151054rcb57a28fvad135dffbe35d5fa@mail.gmail.com>
Message-ID: <4BC8AFEF.70107@gmail.com>

In my experience we should use taxon_id, because it is the unique key
in a local instance of BioSQL. ncbi_taxon_id should be used only for
mapping, so that a person can map his local taxon_id to an
ncbi_taxon_id; otherwise it defeats the purpose of having taxon_id as
the primary key of the taxon table. I think the main goal when BioSQL
was designed was to make it independent of any other organization such
as GenBank or NCBI, while still offering, as a feature, a way to map a
number (ncbi_taxon_id) assigned by a known authority to a local number
(taxon_id).
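
To make the difference concrete, here is a minimal sketch of the two
linking conventions discussed in this thread. The classes TaxonRow and
TaxonLinkDemo are hypothetical and use an in-memory list instead of a
real BioSQL database; the row values are illustrative only. The table is
loaded so that parent_taxon_id refers to the local taxon_id, and is then
queried once with that convention and once assuming parent_taxon_id
refers to ncbi_taxon_id.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for one row of the BioSQL taxon table. */
final class TaxonRow {
    final int taxonId;           // local auto-generated key
    final int ncbiTaxonId;       // identifier assigned by NCBI
    final Integer parentTaxonId; // null for the top of this small table
    final String name;

    TaxonRow(int taxonId, int ncbiTaxonId, Integer parentTaxonId, String name) {
        this.taxonId = taxonId;
        this.ncbiTaxonId = ncbiTaxonId;
        this.parentTaxonId = parentTaxonId;
        this.name = name;
    }
}

final class TaxonLinkDemo {
    /** Which column parent_taxon_id is assumed to reference. */
    enum ParentKey { LOCAL_TAXON_ID, NCBI_TAXON_ID }

    static TaxonRow findParent(List<TaxonRow> table, TaxonRow child, ParentKey key) {
        if (child.parentTaxonId == null) {
            return null;
        }
        for (TaxonRow row : table) {
            int candidate = (key == ParentKey.LOCAL_TAXON_ID) ? row.taxonId : row.ncbiTaxonId;
            if (candidate == child.parentTaxonId) {
                return row;
            }
        }
        return null; // parent not found under this convention
    }

    /** Names from the topmost reachable ancestor down to the start row. */
    static List<String> lineage(List<TaxonRow> table, TaxonRow start, ParentKey key) {
        List<String> names = new ArrayList<>();
        TaxonRow current = start;
        // The size check guards against cycles in malformed data.
        while (current != null && names.size() <= table.size()) {
            names.add(0, current.name);
            current = findParent(table, current, key);
        }
        return names;
    }

    /** Sample rows loaded with parent_taxon_id pointing at the local taxon_id. */
    static List<TaxonRow> sampleTable() {
        List<TaxonRow> table = new ArrayList<>();
        table.add(new TaxonRow(1, 9903, null, "Bos"));
        table.add(new TaxonRow(2, 9913, 1, "Bos taurus"));
        return table;
    }

    public static void main(String[] args) {
        List<TaxonRow> table = sampleTable();
        TaxonRow cow = table.get(1);
        System.out.println(lineage(table, cow, ParentKey.LOCAL_TAXON_ID));
        System.out.println(lineage(table, cow, ParentKey.NCBI_TAXON_ID));
    }
}
```

With the matching convention the lineage contains both names; with the
other convention the parent lookup fails and the hierarchy stops at the
starting row, which is the kind of mismatch you get when loading with one
toolkit and querying with another.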

Deepak Sheoran

On 4/15/2010 12:54 PM, Peter wrote:
> Hi,
>
> I've CC'd this to the BioSQL mailing list for cross project
> discussion.
>
> On Mon, Apr 12, 2010 at 7:57 AM, Richard Holland  wrote:
>    
>> Thanks Deepak.
>>
>> I've had a look at the code and I believe it's due to the
>> different ways in which BioJava and BioPerl load the
>> taxon table.
>>
>> BioJava sets the ncbi_taxon_id and parent_taxon_id
>> columns based on the values from the NCBI taxonomy
>> file. The taxon_id column in BioJava is a meaningless
>> auto-generated value that is never used.
>>
>> BioPerl however is generating taxon_id values and
>> linking them by setting parent_taxon_id to the
>> generated value. The parent value from the NCBI
>> taxonomy file is therefore replaced with the BioPerl
>> generated parent ID, meaning that instead of linking
>> from parent_taxon_id to ncbi_taxon_id as per BioJava,
>> the link is to taxon_id instead. (I'm basing this
>> comment on looking at load_ncbi_taxonomy.pl from
>> the BioSQL archives.)
>>      
> Note that old versions of load_ncbi_taxonomy.pl
> (which is part of BioSQL, not part of BioPerl) would
> set taxon_id equal to ncbi_taxon_id, see:
> http://bugzilla.open-bio.org/show_bug.cgi?id=2470
>
> This may help explain the confusion.
>
>    
>> I believe if you load the taxonomy table using BioJava,
>> you should see BioJava giving correct behaviour.
>> Likewise if you load it using BioPerl, BioPerl will
>> behave correctly. But if you load with one then query
>> with the other, you'll get incorrect results.
>>
>> This sounds like a case for discussion on both lists -
>> a matter of standardisation between the two projects.
>> Not quickly/easily solvable for now.
>>      
> It's not just two projects (BioPerl & BioJava) (grin).
> It's at least five projects (BioSQL itself plus BioRuby
> and Biopython).
>
> I'm not sure about BioRuby's implementation, but
> currently I think BioJava is the odd one out - BioPerl,
> Biopython, and the BioSQL's load_ncbi_taxonomy.pl
> all make entries in parent_taxon_id reference the
> automatically generated taxon_id (please correct
> me if I am wrong).
>
> My personal view is that bioperl-db is the reference
> implementation and should be followed in the event
> of any ambiguity within BioSQL. In this particular
> case, there is actually a BioSQL script to check
> against too (load_ncbi_taxonomy.pl).
>
> Hopefully Hilmar can give us an official verdict...
>
> Peter
>    


From jbdundas at gmail.com  Fri Apr 16 22:20:12 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Sat, 17 Apr 2010 07:50:12 +0530
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava
	alignment lead"]
In-Reply-To: <4BC89E41.4030009@cs.wisc.edu>
References: <4BBD820D.9070200@uni-tuebingen.de>
	<j2n326ea8621004160657ga9002ed9w52a2646d5befd22@mail.gmail.com>
	<4BC89E41.4030009@cs.wisc.edu>
Message-ID: <u2j326ea8621004161920wd63884fftaac222d022edcdbe@mail.gmail.com>

Hi Everyone,

I went through the URLs sent by Dr Chapman. Interesting work that you
are doing here. :)

I was wondering if anyone could consider the ideas below. I would also
like to be a part of the research work being carried out using BioJava
(especially in sequence alignment and miRNA signature analysis,
particularly for cancers).

1) A set of tools for converting flat data (e.g. sequence strings,
taxonomy strings) into BioJava-like objects (e.g. SymbolLists,
NCBITaxon). These BioJava-like objects could then be used for more
advanced applications.
 A set of tools for manipulating the BioJava-like objects.

2) Module?: biojava-ws-blast Module?: biojava-ws-biolit
Proposed Module: biojava-j2ee Lead: Mark Schreiber

- This would probably take the form of SessionBeans and WebServices
that can be deployed to Glassfish/ JBoss etc to provide biological
services for people who want to make client server or SOA apps.

3) I also liked what Mr. Gang Wu is working on (I read the
discussions). I was wondering if I could do something of that sort.
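
As a rough illustration of idea 1), here is a sketch that turns a flat
taxonomy string, in the style of a GenBank ORGANISM lineage, into an
ordered list of names. LineageParser and parseLineage are hypothetical
names, and the result is a plain list rather than a BioJava NCBITaxon; a
real tool would still have to map the names onto BioJava objects.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a flat lineage string into taxon names. */
final class LineageParser {
    /**
     * Parses text such as "Eukaryota; Metazoa; Chordata." into
     * [Eukaryota, Metazoa, Chordata], tolerating line breaks and extra spaces.
     */
    static List<String> parseLineage(String flat) {
        List<String> names = new ArrayList<>();
        String text = flat.trim();
        if (text.endsWith(".")) {
            text = text.substring(0, text.length() - 1); // drop the closing period
        }
        for (String part : text.split(";")) {
            String name = part.trim().replaceAll("\\s+", " ");
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String flat = "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;\n"
                + "            Euteleostomi; Mammalia.";
        System.out.println(parseLineage(flat));
    }
}
```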

May I request the leads to tell me how I could chip in...

Regards,
Jitesh Dundas


On 4/16/10, Mark Chapman <chapman at cs.wisc.edu> wrote:
> A great place to start finding ideas is the wiki.
> Both http://biojava.org/wiki/BioJava:Modules
> and http://biojava.org/wiki/BioJava3_Proposal
> list the next steps planned/desired for BioJava.
>
> What research area did you have in mind?
>
> Have fun,
> Mark
>
>
> On 4/16/2010 8:57 AM, jitesh dundas wrote:
>> Dear Sir,
>>
>> I am very interested in contributing to this project.
>>
>> I am looking for a good problem, more on the research side. I can also
>> help with coding (I also work as a software engineer:
>> j2ee/eclipse/jboss/tomcat).
>>
>> Anything that I could work on...
>>
>> Regards,
>> Jitesh Dundas
>>
>> On 4/8/10, Andreas Dräger <andreas.draeger at uni-tuebingen.de> wrote:
>>> Hi all,
>>>
>>> This e-mail is just for your information about somebody new, who'd like
>>> to contribute to our project.
>>>
>>> Cheers
>>> Andreas
>>>
>>>
>>> Subject:
>>> Re: Fwd: Proposing a project on "Biojava alignment lead"
>>> From:
>>> Andreas Dräger <andreas.draeger at uni-tuebingen.de>
>>> Date:
>>> Wed, 07 Apr 2010 09:27:13 +0200
>>> To:
>>> Cai Shaojiang<caishaojiang at gmail.com>
>>>
>>> Hi Cai Shaojiang,
>>>
>>> Thank you for your e-mail! I don't know what happened to the e-mail list.
>>> Sometimes it takes a while due to the spam filters, I guess.
>>>
>>>   >  I am a PhD student from National University of Singapore. My major
>>> research area is local alignment algorithms and data structures for SNP
>>> identification. And I have used Java and Eclipse for years for software
>>> development. I am very interested in your GSoC programme. I find that
>>> there is a module called "biojava-alignment lead" whose mentor is you. I
>>> want to propose a new project on this module. I have several questions
>>> about this module.
>>>
>>> Yes, that's me. So great to get your support.
>>>
>>>   >  1. It seems that pairwise alignment is to find similarity between
>>> two
>>> short sequences. Existing pairwise alignment is based on dynamic
>>> programming, is it Smith-Waterman algorithm?
>>>
>>> So, currently, BioJava contains three different alignment approaches.
>>> There are two deterministic algorithms, i.e., Smith-Waterman for local
>>> alignment and Needleman-Wunsch for global alignment. Third, there is the
>>> possibility to apply Hidden Markov Models for alignment. An example of
>>> the latter approach should be in the cookbook.
>>>
>>>   >  2. What is the exact task of "refactoring of underlying data
>>> structures"?
>>>
>>> Yes, this is something, I did last week already but it could still be
>>> improved. The problem was that the alignment algorithms actually
>>> produced a kind of string that looks similar to the output of BLAST.
>>> This string contained the score, the computation time, the length of the
>>> alignment etc. The problem was that people wanted to perform
>>> higher-level computation on the score value or evaluate some other
>>> information. Now, the alignment will produce a data structure that
>>> contains all the information and can, in addition to that, also produce
>>> such a BLAST-like output. There is, however, still the following
>>> problem: The data structure requires both sequences in the pair-wise
>>> alignment to have an identical length. In case of local alignment this
>>> is especially stupid (actually), because gaps are inserted to fill the
>>> sequences. And then the data structure tries to keep the old sequence
>>> coordinates, leading to the effect that the numbers "query start",
>>> "query end", "subject start", and "subject end" are required to shift
>>> the sequences against each other when displaying the output. So, you
>>> cannot easily print the sequences below of each other, you first have to
>>> shift them. Please check out the latest version of this package via
>>> anonymous svn and have a look ;-)
>>>
>>>   >  3. My existing research area is aiming to deal with aligning short
>>> reads (10s~100s bp) against extremely long sequences (e.g., human
>>> genome). As far as I know, there are no such alignment tools
>>> implemented in Java. Would you consider this direction?
>>>
>>> See, this would be very nice to include. But this requires that we no
>>> longer fill the short sequence with many, many gap symbols (just a waste
>>> of memory), but improve the data structure. There is already an
>>> UnequalLengthAlignment (just a data structure, no algorithm) and I think
>>> we could use this as a starting point. Then your algorithm should only
>>> produce such a data structure and this would be fine.
>>>
>>>   >  4. It seems that the existing tools just lack some
>>> refactoring and representation interfaces. Any more underlying tasks?
>>>
>>> Hm. Yes: With the release of BioJava 3 data structures have changed
>>> again. So maybe there's also some adaptation to the new structure
>>> required.
>>>
>>>   >  I am keeping an eye on GSoC from last month, but sorry to find out
>>> that I sent the initial email to the mailing list before I subscribe
>>> it...
>>>
>>> Ok. Sounds good. Thanks for your interest. So I suggest: Download the
>>> latest trunk, have a look, play around and if you can improve something
>>> we'll put it into the trunk and write your name into the authors' tag.
>>>
>>> Cheers
>>> Andreas
>>>
>>> --
>>> Dipl.-Bioinform. Andreas Dräger
>>> Eberhard Karls University Tübingen
>>> Center for Bioinformatics (ZBIT)
>>> Sand 1
>>> 72076 Tübingen
>>> Germany
>>>
>>> Phone: +49-7071-29-70436
>>> Fax:   +49-7071-29-5091
>>> _______________________________________________
>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
>>
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
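
For readers new to the algorithms named in the quoted reply, here is a
minimal sketch of Smith-Waterman local alignment scoring with a linear gap
penalty. It is a textbook version written for illustration, not the
BioJava implementation; the class LocalAlignmentSketch, the method
smithWatermanScore and the scoring values are assumptions.

```java
/** Hypothetical textbook sketch of Smith-Waterman scoring (linear gap penalty). */
final class LocalAlignmentSketch {
    /**
     * Returns the best local alignment score of a and b.
     * match is added for equal symbols; mismatch and gap are expected to be negative.
     */
    static int smithWatermanScore(String a, String b, int match, int mismatch, int gap) {
        // h[i][j] = best score of a local alignment ending at a[i-1] and b[j-1];
        // row 0 and column 0 stay 0, so an alignment may start anywhere.
        int[][] h = new int[a.length() + 1][b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int substitution = (a.charAt(i - 1) == b.charAt(j - 1)) ? match : mismatch;
                int diagonal = h[i - 1][j - 1] + substitution;
                int up = h[i - 1][j] + gap;   // gap in b
                int left = h[i][j - 1] + gap; // gap in a
                // The floor at 0 is what makes the alignment local.
                h[i][j] = Math.max(0, Math.max(diagonal, Math.max(up, left)));
                best = Math.max(best, h[i][j]);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(smithWatermanScore("TTACGTTT", "GGACGGG", 2, -1, -2));
    }
}
```

Needleman-Wunsch (global alignment) uses the same recurrence without the
floor at zero, initialises the first row and column with accumulated gap
penalties, and reads the score from the last cell instead of the maximum.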


From jbdundas at gmail.com  Fri Apr 16 22:31:46 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Sat, 17 Apr 2010 08:01:46 +0530
Subject: [Biojava-l] Analytical Tool- Prediction of Unknown Protein's
	location on an a Predicted pathway
Message-ID: <j2w326ea8621004161931m69912d64t2e5d7452ac22cd8e@mail.gmail.com>

Dear All,

I wanted to propose an analytical tool in BioJava.

For example, if we have a large dataset with complete pathway
information and related data (e.g. the p53 pathway with all the
genes, proteins, miRNAs, etc. involved), could we find the location
of a specific unknown (newly predicted) protein or gene on a
predicted pathway?

This is a suggestion for the kind of analysis we could support. Could
we think of doing something of this sort for BioJava, or at least
make it capable of handling such tasks?

Any ideas / comments are most welcome...

Regards,
Jitesh Dundas



From jbdundas at gmail.com  Sat Apr 17 09:34:20 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Sat, 17 Apr 2010 19:04:20 +0530
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava
	alignment lead"]
In-Reply-To: <4BBD820D.9070200@uni-tuebingen.de>
References: <4BBD820D.9070200@uni-tuebingen.de>
Message-ID: <r2w326ea8621004170634u6f7275f5xfc915d14a7c94489@mail.gmail.com>

Dear Sir,

Could anyone tell me where I could start? Is there any lead who might need
my help in software development and research-oriented aspects?

Any comments on my previous emails would be most welcomed...

Regards,
Jitesh Dundas




From caishaojiang at gmail.com  Sun Apr 18 23:16:39 2010
From: caishaojiang at gmail.com (Cai Shaojiang)
Date: Sun, 18 Apr 2010 20:16:39 -0700
Subject: [Biojava-l] [Fwd: Re:  GSoC project on MSA]
In-Reply-To: <4BC84CD5.7030703@uni-tuebingen.de>
References: <4BBC80A8.5000608@uni-tuebingen.de>
	<v2j927e071e1004072144t557b480au27666262c79094e2@mail.gmail.com>
	<4BBDCFD2.3000507@uni-tuebingen.de>
	<y2y927e071e1004080221u778ca151l4e4eab6762b93603@mail.gmail.com>
	<s2o927e071e1004150536j7fc81d8av161035609eeed116@mail.gmail.com> 
	<4BC84CD5.7030703@uni-tuebingen.de>
Message-ID: <s2j927e071e1004182016u1807400eod14fe9cdccee2b21@mail.gmail.com>

Sorry to disturb you again. When I tried to modify my proposal in GSoC,
I got the error "This page is inactive at this time." Does this mean we
cannot modify the proposal now? Could you help me? Thanks.

From andreas at sdsc.edu  Sun Apr 18 23:58:05 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Sun, 18 Apr 2010 20:58:05 -0700
Subject: [Biojava-l] Fwd: Biojava3-genetics
In-Reply-To: <33AFFE3255BCA043AF09514A6F6BFBAED04C3C@scomp0039.wurnet.nl>
References: <4BC806F4.3090302@wur.nl>
	<r2n59a41c431004161039hd93b268eu159de8a6659d969f@mail.gmail.com>
	<33AFFE3255BCA043AF09514A6F6BFBAED04C3C@scomp0039.wurnet.nl>
Message-ID: <i2h59a41c431004182058pf0ee4b80s960a579b9b8c7cbe@mail.gmail.com>

Hi Richard,

I am forwarding your message to the mailing list, since that is the best
place to meet other people interested in genetics applications.

The BioJava source code is available via anonymous svn or the download page
on the wiki.

Andreas

---------- Forwarded message ----------
From: Finkers, Richard <richard.finkers at wur.nl>
Date: Sat, Apr 17, 2010 at 12:46 AM
Subject: RE: Biojava3-genetics
To: Andreas Prlic <andreas at sdsc.edu>


Hi Andreas,

To start with, associations between e.g. sequence variation (454) and
phenotype data within larger sets of genetically different individuals.
This is code which I will have to write in the coming year for one of my
projects. I am planning to use it in combination with the sequence and
phylogeny based BioJava modules.

I also might consider migrating some of my current code to this module. This
includes graphical representations of genetic data but also some statistical
analysis for which we use the package R for the calculations but the rest of
the data handling / formatting is done in Java.

Some of the functionality, that I am thinking about, is available from other
packages but I did not find the (java) source code.

Richard


-----Original Message-----
From: andreas.prlic at gmail.com on behalf of Andreas Prlic
Sent: Fri 2010-04-16 19:39
To: Finkers, Richard
Cc: biojava-dev at lists.open-bio.org
Subject: Re: Biojava3-genetics

Hi Richard,

any contribution is welcome. What do you have in mind in particular? Perhaps
there is already something there along those lines...

Andreas

On Thu, Apr 15, 2010 at 11:43 PM, Richard Finkers <Richard.Finkers at wur.nl> wrote:

> Dear List,
>
> I would be interested in adding a module for genetic analysis to the
> biojava3 project. Are there others who are interested in this as well, and
> with whom should I discuss this further?
>
> Thanks,
> Richard
>
>
> --
> Dr. Richard Finkers
> Researcher Plant Breeding
> Wageningen UR Plant Breeding
> P.O. Box 16, 6700 AA, Wageningen, The Netherlands
> Wageningen Campus, Building 107, Droevendaalsesteeg 1, 6708 PB
> Wageningen, The Netherlands
> Tel. +31-317-484165 Fax +31-317-418094
> http://www.plantbreeding.wur.nl/ <http://www.plantbreeding.wur.nl>
> https://www.eu-sol.wur.nl/ <https://www.eu-sol.wur.nl>
> https://cbsgdbase.wur.nl/ <https://cbsgdbase.wur.nl>
> http://solgenomics.wur.nl/ <http://solgenomics.wur.nl>
> http://www.disclaimer-uk.wur.nl/
>
>


--
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------



From andreas at sdsc.edu  Mon Apr 19 00:14:24 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Sun, 18 Apr 2010 21:14:24 -0700
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava
	alignment lead"]
In-Reply-To: <r2w326ea8621004170634u6f7275f5xfc915d14a7c94489@mail.gmail.com>
References: <4BBD820D.9070200@uni-tuebingen.de>
	<r2w326ea8621004170634u6f7275f5xfc915d14a7c94489@mail.gmail.com>
Message-ID: <u2t59a41c431004182114j34529046h9dc22b4bec3cc51c@mail.gmail.com>

Hi Jitesh,

BioJava is an open source project with the goal of supporting bioinformatics
applications. While we are always happy about any contribution, be it
documentation, bug fixes or email support on the mailing list, for a
research-related project it is probably easier to team up with your local
university and do an internship there.

Andreas


On Sat, Apr 17, 2010 at 6:34 AM, jitesh dundas <jbdundas at gmail.com> wrote:

> Dear Sir,
>
> Could anyone tell me where I could start? Is there any lead who might need
> my help in software development and research-oriented aspects?
>
> Any comments on my previous emails would be most welcomed...
>
> Regards,
> Jitesh Dundas
>
>
> On 4/8/10, Andreas Dräger <andreas.draeger at uni-tuebingen.de> wrote:
> >
> > Hi all,
> >
> > This e-mail is just for your information about somebody new, who'd like
> to
> > contribute to our project.
> >
> > Cheers
> > Andreas
> >
> >
> > Subject:
> > Re: Fwd: Proposing a project on "Biojava alignment lead"
> > From:
> > Andreas Dräger <andreas.draeger at uni-tuebingen.de>
> > Date:
> > Wed, 07 Apr 2010 09:27:13 +0200
> > To:
> > Cai Shaojiang <caishaojiang at gmail.com>
> >
> > Hi Cai Shaojiang,
> >
> > Thank you for your e-mail! I don't know what happened to the e-mail list.
> > Sometimes it takes a while due to the spam filters, I guess.
> >
> > > I am a PhD student from National University of Singapore. My major
> > research area is local alignment algorithms and data structures for SNP
> > identification. And I have used Java and Eclipse for years for software
> > development. I am very interested in your GSoC programme. I find that
> there
> > is a module called "biojava-alignment lead" whose mentor is you. I want
> to
> > propose a new project on this module. I have several questions about this
> > module.
> >
> > Yes, that's me. So great to get your support.
> >
> > > 1. It seems that pairwise alignment is to find similarity between two
> > > short sequences. Existing pairwise alignment is based on dynamic
> > > programming, is it Smith-Waterman algorithm?
> >
> > So, currently, BioJava contains three different alignment approaches.
> > There are two deterministic algorithms, i.e., Smith-Waterman for local
> > alignment and Needleman-Wunsch for global alignment. Third, there is the
> > possibility to apply Hidden Markov Models for alignment. An example of the
> > latter approach should be in the cookbook.
> >
> > > 2. What is the exact task of "refactoring of underlying data structures"?
> >
> > Yes, this is something I did last week already, but it could still be
> > improved. The problem was that the alignment algorithms actually produced
> > a kind of string that looks similar to the output of BLAST. This string
> > contained the score, the computation time, the length of the alignment,
> > etc. The problem was that people wanted to perform higher-level
> > computation on the score value or evaluate some other information. Now,
> > the alignment will produce a data structure that contains all the
> > information and can, in addition to that, also produce such a BLAST-like
> > output. There is, however, still the following problem: The data structure
> > requires both sequences in the pair-wise alignment to have an identical
> > length. In case of local alignment this is especially stupid (actually),
> > because gaps are inserted to fill the sequences. And then the data
> > structure tries to keep the old sequence coordinates, leading to the
> > effect that the numbers "query start", "query end", "subject start", and
> > "subject end" are required to shift the sequences against each other when
> > displaying the output. So, you cannot easily print the sequences below
> > each other, you first have to shift them. Please check out the latest
> > version of this package via anonymous svn and have a look ;-)
> >
> > > 3. My existing research area is aiming to deal with aligning short reads
> > > (10s~100s bp) against extremely long sequences (e.g., human genome). As
> > > far as I know, there are no such alignment tools implemented in Java.
> > > Would you consider this direction?
> >
> > See, this would be very nice to include. But this requires that we no
> > longer fill the short sequence with many, many gap symbols (just a waste
> > of memory), but improve the data structure. There is already an
> > UnequalLengthAlignment (just a data structure, no algorithm) and I think
> > we could use this as a starting point. Then your algorithm should only
> > produce such a data structure and this would be fine.
> >
> > > 4. It seems that the existing tools are just lacking some refactoring
> > > and representation interfaces. Any more underlying tasks?
> >
> > Hm. Yes: With the release of BioJava 3 data structures have changed again.
> > So maybe there's also some adaptation to the new structure required.
> >
> > > I am keeping an eye on GSoC from last month, but sorry to find out that I
> > > sent the initial email to the mailing list before I subscribed to it...
> >
> > Ok. Sounds good. Thanks for your interest. So I suggest: Download the
> > latest trunk, have a look, play around and if you can improve something
> > we'll put it into the trunk and write your name into the authors' tag.
> >
> > Cheers
> > Andreas
> >
> > --
> > Dipl.-Bioinform. Andreas Dräger
> > Eberhard Karls University Tübingen
> > Center for Bioinformatics (ZBIT)
> > Sand 1
> > 72076 Tübingen
> > Germany
> >
> > Phone: +49-7071-29-70436
> > Fax:   +49-7071-29-5091
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
>
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From jbdundas at gmail.com  Mon Apr 19 04:33:57 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Mon, 19 Apr 2010 14:03:57 +0530
Subject: [Biojava-l] Fwd: Biojava3-genetics
In-Reply-To: <i2h59a41c431004182058pf0ee4b80s960a579b9b8c7cbe@mail.gmail.com>
References: <4BC806F4.3090302@wur.nl>
	<r2n59a41c431004161039hd93b268eu159de8a6659d969f@mail.gmail.com>
	<33AFFE3255BCA043AF09514A6F6BFBAED04C3C@scomp0039.wurnet.nl>
	<i2h59a41c431004182058pf0ee4b80s960a579b9b8c7cbe@mail.gmail.com>
Message-ID: <o2v326ea8621004190133oc0ae71b3l2f58c9967fd2fcb0@mail.gmail.com>

Dear Sir,

I would like to work on this module.

How can I help?

Regards,
Jitesh Dundas

On 4/19/10, Andreas Prlic <andreas at sdsc.edu> wrote:
> Hi Richard,
>
> I am forwarding your message to the mailing list, since that is the best
> place to meet other people interested in genetics application.
>
> The BioJava source code is available via anonymous svn or the download page
> on the wiki.
>
> Andreas
>
> ---------- Forwarded message ----------
> From: Finkers, Richard <richard.finkers at wur.nl>
> Date: Sat, Apr 17, 2010 at 12:46 AM
> Subject: RE: Biojava3-genetics
> To: Andreas Prlic <andreas at sdsc.edu>
>
>
> Hi Andreas,
>
> To start with, associations with e.g. sequence variation (454) and phenotype
> data within larger sets of genetically different individuals. This will be
> code which I will have to write the coming year for one of my projects. I am
> planning to use this in combination with the sequence and phylogeny based
> biojava modules.
>
> I also might consider migrating some of my current code to this module. This
> includes graphical representations of genetic data but also some statistical
> analysis for which we use the package R for the calculations but the rest of
> the data handling / formatting is done in Java.
>
> Some of the functionality, that I am thinking about, is available from other
> packages but I did not find the (java) source code.
>
> Richard
>
>
>
>
> -----Original Message-----
> From: andreas.prlic at gmail.com on behalf of Andreas Prlic
> Sent: Fri 2010-04-16 19:39
> To: Finkers, Richard
> Cc: biojava-dev at lists.open-bio.org
> Subject: Re: Biojava3-genetics
>
> Hi Richard,
>
> any contribution is welcome. What do you have in mind in particular? Perhaps
> there is already something there along those lines...
>
> Andreas
>
> On Thu, Apr 15, 2010 at 11:43 PM, Richard Finkers <Richard.Finkers at wur.nl> wrote:
>
>> Dear List,
>>
>> I would be interested in adding a module for genetic analysis to the
>> biojava3 project. Are there others who are interested in this as well and
>> with whom should I discuss this further?
>>
>> Thanks,
>> Richard
>>
>>
>> --
>> Dr. Richard Finkers
>> Researcher Plant Breeding
>> Wageningen UR Plant Breeding
>> P.O. Box 16, 6700 AA, Wageningen, The Netherlands
>> Wageningen Campus, Building 107, Droevendaalsesteeg 1, 6708 PB
>> Wageningen, The Netherlands
>> Tel. +31-317-484165 Fax +31-317-418094
>> http://www.plantbreeding.wur.nl/ <http://www.plantbreeding.wur.nl>
>> https://www.eu-sol.wur.nl/ <https://www.eu-sol.wur.nl>
>> https://cbsgdbase.wur.nl/ <https://cbsgdbase.wur.nl>
>> http://solgenomics.wur.nl/ <http://solgenomics.wur.nl>
>> http://www.disclaimer-uk.wur.nl/
>>
>>
>
>
> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------
>
>

From andreas.draeger at uni-tuebingen.de  Tue Apr 20 23:17:05 2010
From: andreas.draeger at uni-tuebingen.de (=?ISO-8859-1?Q?Andreas_Dr=E4ger?=)
Date: Wed, 21 Apr 2010 12:17:05 +0900
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on
 "Biojava	alignment lead"]
In-Reply-To: <r2w326ea8621004170634u6f7275f5xfc915d14a7c94489@mail.gmail.com>
References: <4BBD820D.9070200@uni-tuebingen.de>
	<r2w326ea8621004170634u6f7275f5xfc915d14a7c94489@mail.gmail.com>
Message-ID: <4BCE6E31.70504@uni-tuebingen.de>

Hi Jitesh,

Thanks for your interest in contributing to our BioJava project! In the
alignment package, lots of help is required. What would be very nice is
a versatile visual representation of the alignment data structures that
can be included into graphical user interfaces with little effort. To
this end, it should be very flexible and abstract. Would you be interested?

Cheers
Andreas

-- 
Dipl.-Bioinform. Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 Tübingen
Germany

Phone: +49-7071-29-70436
Fax:   +49-7071-29-5091

From mitlox at op.pl  Wed Apr 21 06:46:22 2010
From: mitlox at op.pl (xyz)
Date: Wed, 21 Apr 2010 20:46:22 +1000
Subject: [Biojava-l] Reading and writting Fastq files
In-Reply-To: <5EBA99CE-1DAC-442A-B7FD-2E738C7F586B@eaglegenomics.com>
References: <20100330215047.084f6b00@wp01>
	<Pine.GSO.4.44.1003312334350.18726-100000@shell3.shore.net>
	<20100408213013.63a99b8c@wp01>
	<5EBA99CE-1DAC-442A-B7FD-2E738C7F586B@eaglegenomics.com>
Message-ID: <20100421204622.68f9ac1b@wp01>

On Thu, 8 Apr 2010 12:36:36 +0100
Richard Holland <holland at eaglegenomics.com> wrote:

> You haven't included the two import static lines in your code. See
> first two lines of Michael's example code (expanding the ellipses to
> the full classpath).
> 

Thank you, it was enough to include
import static org.biojavax.bio.seq.RichSequence.Tools.createRichSequence;

Usually NetBeans solves this kind of problem for me, but this time the
IDE was no help.


From mitlox at op.pl  Wed Apr 21 07:18:24 2010
From: mitlox at op.pl (xyz)
Date: Wed, 21 Apr 2010 21:18:24 +1000
Subject: [Biojava-l] readFasta problem
In-Reply-To: <5B252A84-6822-4978-BBD1-83A8BA09EBC5@eaglegenomics.com>
References: <20100408213052.662beb8e@wp01>
	<5B252A84-6822-4978-BBD1-83A8BA09EBC5@eaglegenomics.com>
Message-ID: <20100421211824.75b7ada2@wp01>

On Thu, 8 Apr 2010 12:41:25 +0100
Richard Holland <holland at eaglegenomics.com> wrote:

> You have passed null into the tokenizer parameter of
> RichSequence.IOTools.readFasta() - this is not allowed. The parser
> cannot guess the type of sequence, it must be told what to expect by
> specifying the tokenizer to use. (Importantly this also means that
> you cannot mix different types of sequence within the same file to be
> parsed.)
> 

Thank you. 

Q1:
Does RichSequenceIterator read the complete file in memory and then I
retrieve each read from memory? Or does it read the file line by line
and I get each read?

Q2:
Why am I not able to retrieve the header from the following fasta file:
>1
atccccc
>2
atccccctttttt
>3
atccccccccccccccccctttt
>4
tttttttccccccccccccccccccccccc
>5
tttttttcccccccccccccccccccccca

with the following code:

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import org.biojava.bio.BioException;
import org.biojava.bio.seq.io.SymbolTokenization;
import org.biojava.bio.symbol.AlphabetManager;
import org.biojavax.bio.seq.RichSequence;
import org.biojavax.bio.seq.RichSequenceIterator;

public class SortFasta {

  public static void main(String[] args)
      throws FileNotFoundException, BioException {

    BufferedReader br = new BufferedReader(new FileReader("sortFasta.fasta"));
    String type = "DNA";
    SymbolTokenization toke = AlphabetManager.alphabetForName(type)
        .getTokenization("token");

    RichSequenceIterator rsi = RichSequence.IOTools.readFasta(br, toke, null);

    while (rsi.hasNext()) {
      RichSequence rs = rsi.nextRichSequence();
      System.out.println(rs.getDescription());
      System.out.println(rs.seqString());
    }
  }
}

What did I do wrong when trying to retrieve the header?

From holland at eaglegenomics.com  Wed Apr 21 07:29:57 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 21 Apr 2010 12:29:57 +0100
Subject: [Biojava-l] readFasta problem
In-Reply-To: <20100421211824.75b7ada2@wp01>
References: <20100408213052.662beb8e@wp01>
	<5B252A84-6822-4978-BBD1-83A8BA09EBC5@eaglegenomics.com>
	<20100421211824.75b7ada2@wp01>
Message-ID: <BBA023B5-4B9B-469A-BE46-A0171DB7B681@eaglegenomics.com>


On 21 Apr 2010, at 12:18, xyz wrote:

> On Thu, 8 Apr 2010 12:41:25 +0100
> Richard Holland <holland at eaglegenomics.com> wrote:
> 
>> You have passed null into the tokenizer parameter of
>> RichSequence.IOTools.readFasta() - this is not allowed. The parser
>> cannot guess the type of sequence, it must be told what to expect by
>> specifying the tokenizer to use. (Importantly this also means that
>> you cannot mix different types of sequence within the same file to be
>> parsed.)
>> 
> 
> Thank you. 
> 
> Q1:
> Does RichSequenceIterator read the complete file in memory and then I
> retrieve each read from memory? Or does it read the file line by line
> and I get each read?


Line by line.

> Q2:
> Why am I not able to retrieve the header from the following fasta file:
> >1
> atccccc
> >2
> atccccctttttt
> >3
> atccccccccccccccccctttt
> >4
> tttttttccccccccccccccccccccccc
> >5
> tttttttcccccccccccccccccccccca
> 
> with the following code:
> 
> import java.io.BufferedReader;
> import java.io.FileNotFoundException;
> import java.io.FileReader;
> import org.biojava.bio.BioException;
> import org.biojava.bio.seq.io.SymbolTokenization;
> import org.biojava.bio.symbol.AlphabetManager;
> import org.biojavax.bio.seq.RichSequence;
> import org.biojavax.bio.seq.RichSequenceIterator;
> 
> public class SortFasta {
> 
>  public static void main(String[] args) throws FileNotFoundException,
>  BioException {
> 
> 
>    BufferedReader br = new BufferedReader(new
>    FileReader("sortFasta.fasta")); String type = "DNA";
>    SymbolTokenization toke = AlphabetManager.alphabetForName(type)
> 					.getTokenization("token");
> 
> 
>    RichSequenceIterator rsi = RichSequence.IOTools.readFasta(br, toke,
>    null);
> 
>    while (rsi.hasNext()) {
>      RichSequence rs = rsi.nextRichSequence();
>      System.out.println(rs.getDescription());
>      System.out.println(rs.seqString());
>    }
>  }
> }
> 
> What did I wrong in order to retrieve the header?


Try the other methods on RichSequence - getName() for instance.
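
A minimal sketch of what the two answers above amount to, written in plain Java rather than against the BioJava API. Every name in it (LazyFastaReader, FastaRecord) is hypothetical, and the header-splitting rule is an assumption for illustration, not BioJava's actual parser. Records are pulled from the reader one at a time, and each header line is split into an identifier and an optional description. Under that assumption a header such as ">1" carries an identifier but no description, which would be consistent with getName() being the method to try.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.NoSuchElementException;

/** Hypothetical helper, not a BioJava class: reads FASTA one record at a time. */
class LazyFastaReader {

    /** One FASTA entry: identifier, optional description, and the residues. */
    static final class FastaRecord {
        final String name;
        final String description; // null when the header holds only an identifier
        final String sequence;

        FastaRecord(String name, String description, String sequence) {
            this.name = name;
            this.description = description;
            this.sequence = sequence;
        }
    }

    private final BufferedReader in;
    private String pendingHeader; // header line already consumed for the next record

    LazyFastaReader(BufferedReader in) throws IOException {
        this.in = in;
        // Skip anything that precedes the first header line.
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith(">")) { pendingHeader = line; break; }
        }
    }

    boolean hasNext() { return pendingHeader != null; }

    /** Reads only as many lines as the next record needs. */
    FastaRecord next() throws IOException {
        if (pendingHeader == null) throw new NoSuchElementException();
        String header = pendingHeader.substring(1).trim();
        pendingHeader = null;
        StringBuilder seq = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith(">")) { pendingHeader = line; break; }
            seq.append(line.trim());
        }
        // Assumed convention: "identifier description text", split at the first whitespace run.
        String[] parts = header.split("\\s+", 2);
        String description = parts.length > 1 ? parts[1] : null;
        return new FastaRecord(parts[0], description, seq.toString());
    }

    public static void main(String[] args) throws IOException {
        String fasta = ">1\natccccc\n>2 second read\natccccctttttt\n";
        LazyFastaReader reader =
                new LazyFastaReader(new BufferedReader(new StringReader(fasta)));
        while (reader.hasNext()) {
            FastaRecord r = reader.next();
            System.out.println(r.name + " | " + r.description + " | " + r.sequence);
        }
    }
}
```

Because only one record is held at a time, memory use depends on the longest single sequence rather than on the size of the file.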

cheers,
Richard

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From mitlox at op.pl  Wed Apr 21 08:40:48 2010
From: mitlox at op.pl (xyz)
Date: Wed, 21 Apr 2010 22:40:48 +1000
Subject: [Biojava-l] NCBI Accession Number prefixes
Message-ID: <20100421224048.1848c2f2@wp01>

Hello,
is it possible to download GenBank entries (AC) with BioJava?

Thank you in advance.

Best regards,

From holland at eaglegenomics.com  Wed Apr 21 08:44:16 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 21 Apr 2010 13:44:16 +0100
Subject: [Biojava-l] NCBI Accession Number prefixes
In-Reply-To: <20100421224048.1848c2f2@wp01>
References: <20100421224048.1848c2f2@wp01>
Message-ID: <577294DB-EABD-48DF-A55A-5DA9629AC352@eaglegenomics.com>

See http://www.biojava.org/docs/api/org/biojavax/bio/db/ncbi/GenbankRichSequenceDB.html

On 21 Apr 2010, at 13:40, xyz wrote:

> Hello,
> is it possible to download GenBank entries (AC) with BioJava?
> 
> Thank you in advance.
> 
> Best regards,

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From jbdundas at gmail.com  Wed Apr 21 09:45:00 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Wed, 21 Apr 2010 19:15:00 +0530
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava
	alignment lead"]
In-Reply-To: <4BCE6E31.70504@uni-tuebingen.de>
References: <4BBD820D.9070200@uni-tuebingen.de>
	<r2w326ea8621004170634u6f7275f5xfc915d14a7c94489@mail.gmail.com>
	<4BCE6E31.70504@uni-tuebingen.de>
Message-ID: <i2y326ea8621004210645o206b44afr22a0617651cc4e21@mail.gmail.com>

Yes Sir, I will be very interested. Please send me the details. I will be
working on weekends, though, as office work is taking my time right now.

Regards,
jd

On Wed, Apr 21, 2010 at 8:47 AM, Andreas Dräger <
andreas.draeger at uni-tuebingen.de> wrote:

> Hi Jitesh,
>
> Thanks for your interest in contributing to our BioJava project! In the
> alignment package, lots of help is required. What would be very nice is a
> versatile visual representation of the alignment data structures that can
> be included into graphical user interfaces with little effort. To this end,
> it should be very flexible and abstract. Would you be interested?
>
>
> Cheers
> Andreas
>
> --
> Dipl.-Bioinform. Andreas Dräger
> Eberhard Karls University Tübingen
> Center for Bioinformatics (ZBIT)
> Sand 1
> 72076 Tübingen
> Germany
>
> Phone: +49-7071-29-70436
> Fax:   +49-7071-29-5091
>


From er.indupandey at gmail.com  Fri Apr 23 04:11:05 2010
From: er.indupandey at gmail.com (indu pandey)
Date: Fri, 23 Apr 2010 01:11:05 -0700
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava
	alignment lead"]
In-Reply-To: <i2y326ea8621004210645o206b44afr22a0617651cc4e21@mail.gmail.com>
References: <4BBD820D.9070200@uni-tuebingen.de>
	<r2w326ea8621004170634u6f7275f5xfc915d14a7c94489@mail.gmail.com>
	<4BCE6E31.70504@uni-tuebingen.de>
	<i2y326ea8621004210645o206b44afr22a0617651cc4e21@mail.gmail.com>
Message-ID: <k2r8ea551a31004230111r1fd6fc0fq70c44b1e97976025@mail.gmail.com>

Hi all,
Can anybody help me write BioJava code for converting a DNA sequence
to the corresponding amino acid sequence?

regards
 indu

On 4/21/10, jitesh dundas <jbdundas at gmail.com> wrote:
>
> Yes Sir, I will be very interested. Please send me the details. I will be
> working on Weekends though as office work is taking my time right now.
>
> Regards,
> jd
>
> On Wed, Apr 21, 2010 at 8:47 AM, Andreas Dräger <
> andreas.draeger at uni-tuebingen.de> wrote:
>
> > Hi Jitesh,
> >
> > Thanks for your interest in contributing to our BioJava project! In the
> > alignment package, lots of help is required. What would be very nice is a
> > versatile visual representation of the alignment data structures that can
> > be included into graphical user interfaces with little effort. To this
> > end, it should be very flexible and abstract. Would you be interested?
> >
> >
> > Cheers
> > Andreas
> >
> > --
> > Dipl.-Bioinform. Andreas Dräger
> > Eberhard Karls University Tübingen
> > Center for Bioinformatics (ZBIT)
> > Sand 1
> > 72076 Tübingen
> > Germany
> >
> > Phone: +49-7071-29-70436
> > Fax:   +49-7071-29-5091
> >
>
>


From genjasp at gmail.com  Fri Apr 23 04:26:10 2010
From: genjasp at gmail.com (Alessandro Cipriani)
Date: Fri, 23 Apr 2010 10:26:10 +0200
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava
	alignment lead"]
In-Reply-To: <k2r8ea551a31004230111r1fd6fc0fq70c44b1e97976025@mail.gmail.com>
References: <4BBD820D.9070200@uni-tuebingen.de>
	<r2w326ea8621004170634u6f7275f5xfc915d14a7c94489@mail.gmail.com>
	<4BCE6E31.70504@uni-tuebingen.de>
	<i2y326ea8621004210645o206b44afr22a0617651cc4e21@mail.gmail.com>
	<k2r8ea551a31004230111r1fd6fc0fq70c44b1e97976025@mail.gmail.com>
Message-ID: <k2j46b9a2151004230126i287b838frcd874f8ce9bab47d@mail.gmail.com>

Hi
Follow this link: http://www.biojava.org/wiki/BioJava:CookBook#Translation
I think it could be useful.
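
For orientation, here is a minimal library-independent sketch of what DNA-to-protein translation does. It is not the cookbook's BioJava code: the class name CodonTranslator is hypothetical, and the sketch assumes the standard genetic code and reading frame 1 only.

```java
/** Hypothetical helper, not a BioJava class: translates DNA with the standard genetic code. */
class CodonTranslator {

    // Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
    // (each codon position cycles through T, C, A, G); '*' marks a stop codon.
    private static final String AMINO_ACIDS =
            "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";
    private static final String BASES = "TCAG";

    /** Translates reading frame 1; trailing bases that do not fill a codon are ignored. */
    static String translate(String dna) {
        String s = dna.toUpperCase();
        StringBuilder protein = new StringBuilder();
        for (int i = 0; i + 3 <= s.length(); i += 3) {
            int b1 = BASES.indexOf(s.charAt(i));
            int b2 = BASES.indexOf(s.charAt(i + 1));
            int b3 = BASES.indexOf(s.charAt(i + 2));
            if (b1 < 0 || b2 < 0 || b3 < 0) {
                protein.append('X'); // ambiguous or non-ACGT base in this codon
            } else {
                protein.append(AMINO_ACIDS.charAt(16 * b1 + 4 * b2 + b3));
            }
        }
        return protein.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("atggccattgtaatgggccgctga"));
    }
}
```

The codon table is packed into one 64-character string so that a codon's amino acid is found by simple index arithmetic. Other genetic codes or reading frames would need a different table or a start offset.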

regards
ale


2010/4/23 indu pandey <er.indupandey at gmail.com>

> hi all
>  can any body help me in creating code in biojava for converting dna
> sequence to corresponding amino acid sequence
>
> regards
>  indu
>
> On 4/21/10, jitesh dundas <jbdundas at gmail.com> wrote:
> >
> > Yes Sir, I will be very interested. Please send me the details. I will be
> > working on Weekends though as office work is taking my time right now.
> >
> > Regards,
> > jd
> >
> > On Wed, Apr 21, 2010 at 8:47 AM, Andreas Dräger <
> > andreas.draeger at uni-tuebingen.de> wrote:
> >
> > > Hi Jitesh,
> > >
> > > Thanks for your interest in contributing to our BioJava project! In the
> > > alignment package, lots of help is required. What would be very nice is
> > > a versatile visual representation of the alignment data structures that
> > > can be included into graphical user interfaces with little effort. To
> > > this end, it should be very flexible and abstract. Would you be
> > > interested?
> > >
> > >
> > > Cheers
> > > Andreas
> > >
> > > --
> > > Dipl.-Bioinform. Andreas Dräger
> > > Eberhard Karls University Tübingen
> > > Center for Bioinformatics (ZBIT)
> > > Sand 1
> > > 72076 Tübingen
> > > Germany
> > >
> > > Phone: +49-7071-29-70436
> > > Fax:   +49-7071-29-5091
> > >
> >
>


-- 
Alessandro Cipriani
(+39) 3206009509
http://www.cipriania.it
skype:genjasp at gmail.com <skype%3Agenjasp at gmail.com>
msn:jaspzz


From thomascramera at dnastar.com  Fri Apr 23 18:58:05 2010
From: thomascramera at dnastar.com (Andy Thomas-Cramer)
Date: Fri, 23 Apr 2010 17:58:05 -0500
Subject: [Biojava-l] PDBFileParser and Atom element symbol
Message-ID: <A4009967D1886D4286A9B7931FD58610021BEC1B@FS1.dnastar.com>


Is there an easy way to identify the type of atom referenced by an Atom
object? 

For example, if Atom.getName() is "CA", is the element calcium or the
atom carbon alpha?

If not, would it be feasible to add a method providing this in Atom,
AtomImpl, and parsing it in PDBFileParser, using the columns defined at
http://www.wwpdb.org/documentation/format32/sect9.html#ATOM?


From andreas at sdsc.edu  Fri Apr 23 19:52:15 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Fri, 23 Apr 2010 16:52:15 -0700
Subject: [Biojava-l] PDBFileParser and Atom element symbol
In-Reply-To: <A4009967D1886D4286A9B7931FD58610021BEC1B@FS1.dnastar.com>
References: <A4009967D1886D4286A9B7931FD58610021BEC1B@FS1.dnastar.com>
Message-ID: <n2j59a41c431004231652sa0729050z9df823f6dbf0dfd8@mail.gmail.com>

Hi Andy,

you could check with  Atom.getFullname(), which contains the space
characters from the PDB file:
e.g Calpha: " CA ", Calcium "CA  "

in addition the parent group of a Calpha atom is usually an AminoAcid and
for Calciums it is a Hetatom group...
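
The padding check described above can be sketched in plain Java as follows. The class and method names (PdbAtomNames, classify, fullNameFromRecord) are hypothetical and not part of BioJava. The sketch assumes the four-character, space-padded atom name as it appears in columns 13-16 of an ATOM/HETATM record, which is what Atom.getFullname() is described as returning.

```java
/** Hypothetical helper, not part of BioJava: tells C-alpha from calcium by name padding. */
class PdbAtomNames {

    enum Kind { ALPHA_CARBON, CALCIUM, OTHER }

    /** Classifies the padded 4-character atom name, e.g. " CA " or "CA  ". */
    static Kind classify(String fullName) {
        if (" CA ".equals(fullName)) return Kind.ALPHA_CARBON; // element C, remoteness A
        if ("CA  ".equals(fullName)) return Kind.CALCIUM;      // two-letter element Ca
        return Kind.OTHER;
    }

    /** Cuts the padded atom name out of a raw ATOM/HETATM line (columns 13-16, 1-based). */
    static String fullNameFromRecord(String pdbLine) {
        return pdbLine.substring(12, 16);
    }

    public static void main(String[] args) {
        // Record prefixes assembled field by field so the column layout stays visible:
        // record type (1-6), serial (7-11), blank (12), atom name (13-16), rest of line.
        String alphaCarbon = "ATOM  " + "    2" + " " + " CA " + " ALA A   1";
        String calcium     = "HETATM" + " 1234" + " " + "CA  " + "  CA A 201";
        System.out.println(classify(fullNameFromRecord(alphaCarbon)));
        System.out.println(classify(fullNameFromRecord(calcium)));
    }
}
```

A trimmed name such as "CA" falls into OTHER here on purpose, since the distinction is lost once the padding is removed. The parent-group check mentioned above (AminoAcid versus Hetatom group) is a second, independent signal that this sketch does not cover.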

Andreas

On Fri, Apr 23, 2010 at 3:58 PM, Andy Thomas-Cramer <
thomascramera at dnastar.com> wrote:

>
>
> Is there an easy way to identify the type of atom referenced by an Atom
> object?
>
> For example, if Atom.getName() is "CA", is the element calcium or the
> atom carbon alpha?
>
> If not, would it be feasible to add a method providing this in Atom,
> AtomImpl, and parsing it in PDBFileParser, using the columns defined at
> http://www.wwpdb.org/documentation/format32/sect9.html#ATOM?
>
>
>
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From mitlox at op.pl  Sun Apr 25 01:19:25 2010
From: mitlox at op.pl (xyz)
Date: Sun, 25 Apr 2010 15:19:25 +1000
Subject: [Biojava-l] readFasta problem
In-Reply-To: <BBA023B5-4B9B-469A-BE46-A0171DB7B681@eaglegenomics.com>
References: <20100408213052.662beb8e@wp01>
	<5B252A84-6822-4978-BBD1-83A8BA09EBC5@eaglegenomics.com>
	<20100421211824.75b7ada2@wp01>
	<BBA023B5-4B9B-469A-BE46-A0171DB7B681@eaglegenomics.com>
Message-ID: <20100425151925.1c5c9a03@wp01>

On Wed, 21 Apr 2010 12:29:57 +0100
Richard Holland wrote:

> > Q1:
> > Does RichSequenceIterator read the complete file in memory and then
> > I retrieve each read from memory? Or does it read the file line by
> > line and I get each read?
> 
> 
> Line by line.

That saves memory.

> > Q2:
> > Why am I not able to retrieve the header from the following fasta
> > file:
> > >1
> > atccccc
> > >2
> > atccccctttttt
> > >3
> > atccccccccccccccccctttt
> > >4
> > tttttttccccccccccccccccccccccc
> > >5
> > tttttttcccccccccccccccccccccca
> 
> Try the other methods on RichSequence - getName() for instance.

Thank you, getName() works.

I have tried to write a fasta file line by line with IOTools, but I
got the following error:
Exception in thread "main" java.lang.RuntimeException: Uncompilable
source code 1
        at SortFasta.main(SortFasta.java:31)
atccccc
Java Result: 1

Here is the complete code:

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileReader;
import org.biojava.bio.BioException;
import org.biojava.bio.seq.io.SymbolTokenization;
import org.biojava.bio.symbol.AlphabetManager;
import org.biojavax.bio.seq.RichSequence;
import org.biojavax.bio.seq.RichSequenceIterator;

public class SortFasta {

  public static void main(String[] args)
      throws FileNotFoundException, BioException {

    BufferedReader br = new BufferedReader(new FileReader("sortFasta.fasta"));
    String type = "DNA";
    SymbolTokenization toke = AlphabetManager.alphabetForName(type)
        .getTokenization("token");

    FileOutputStream outputFasta = new FileOutputStream("test.fasta");

    RichSequenceIterator rsi = RichSequence.IOTools.readFasta(br, toke, null);

    while (rsi.hasNext()) {
      RichSequence rs = rsi.nextRichSequence();
      System.out.println(rs.getName());
      System.out.println(rs.seqString());

      RichSequence.IOTools.writeFasta(outputFasta, rs.seqString(), null,
              rs.getName() + "1");
    }
  }
}

How is it possible to write fasta files line by line?

From holland at eaglegenomics.com  Sun Apr 25 04:21:22 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Sun, 25 Apr 2010 09:21:22 +0100
Subject: [Biojava-l] readFasta problem
In-Reply-To: <20100425151925.1c5c9a03@wp01>
References: <20100408213052.662beb8e@wp01>
	<5B252A84-6822-4978-BBD1-83A8BA09EBC5@eaglegenomics.com>
	<20100421211824.75b7ada2@wp01>
	<BBA023B5-4B9B-469A-BE46-A0171DB7B681@eaglegenomics.com>
	<20100425151925.1c5c9a03@wp01>
Message-ID: <316097DC-6011-4205-83BC-9A24398D034D@eaglegenomics.com>

Hi.

You are calling a non-existing version of writeFasta. I'm surprised your code even compiles!

Have a look at the JavaDocs to find out what you can actually do with writeFasta. For a start, it takes Sequence and FastaHeader objects as parameters, not Strings as you are trying to do.

http://www.biojava.org/docs/api17/org/biojavax/bio/seq/RichSequence.IOTools.html
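
As a library-independent illustration of writing FASTA one record at a time, here is a minimal plain-Java sketch. It is not BioJava's writeFasta, which, as noted above, takes Sequence and FastaHeader objects rather than Strings. The class name SimpleFastaWriter and the 60-column line width are assumptions made for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;

/** Hypothetical helper, not BioJava's writeFasta: appends one FASTA record per call. */
class SimpleFastaWriter {

    // Residues per output line; a common choice, not something the format requires.
    private static final int LINE_WIDTH = 60;

    private final PrintStream out;

    SimpleFastaWriter(OutputStream os) {
        this.out = new PrintStream(os, true);
    }

    /** Writes ">name" followed by the sequence wrapped at LINE_WIDTH characters. */
    void write(String name, String sequence) {
        out.print('>');
        out.print(name);
        out.print('\n');
        for (int i = 0; i < sequence.length(); i += LINE_WIDTH) {
            out.print(sequence.substring(i, Math.min(i + LINE_WIDTH, sequence.length())));
            out.print('\n');
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        SimpleFastaWriter writer = new SimpleFastaWriter(buffer);
        // Mirrors the loop in the question: one call per sequence as it is read.
        writer.write("11", "atccccc");
        writer.write("21", "atccccctttttt");
        System.out.print(buffer.toString());
    }
}
```

Since each call appends a complete record to the same stream, the writer can sit inside the while loop of the reading code without holding more than one sequence in memory.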

cheers,
Richard

On 25 Apr 2010, at 06:19, xyz wrote:

> On Wed, 21 Apr 2010 12:29:57 +0100
> Richard Holland wrote:
> 
>>> Q1:
>>> Does RichSequenceIterator read the complete file in memory and then
>>> I retrieve each read from memory? Or does it read the file line by
>>> line and I get each read?
>> 
>> 
>> Line by line.
> 
> That save memory.
> 
>>> Q2:
>>> Why am I not able to retrieve the header from the following fasta
>>> file:
>>> >1
>>> atccccc
>>> >2
>>> atccccctttttt
>>> >3
>>> atccccccccccccccccctttt
>>> >4
>>> tttttttccccccccccccccccccccccc
>>> >5
>>> tttttttcccccccccccccccccccccca
>> 
>> Try the other methods on RichSequence - getName() for instance.
> 
> Thank you getName() works.
> 
> I have tried to write fasta file line by line with IOTools, but I have
> got the following error:
> Exception in thread "main" java.lang.RuntimeException: Uncompilable
> source code 1
>        at SortFasta.main(SortFasta.java:31)
> atccccc
> Java Result: 1
> 
> Here is the complete code:
> 
> import java.io.BufferedReader;
> import java.io.FileNotFoundException;
> import java.io.FileOutputStream;
> import java.io.FileReader;
> import org.biojava.bio.BioException;
> import org.biojava.bio.seq.io.SymbolTokenization;
> import org.biojava.bio.symbol.AlphabetManager;
> import org.biojavax.bio.seq.RichSequence;
> import org.biojavax.bio.seq.RichSequenceIterator;
> 
> public class SortFasta {
> 
>  public static void main(String[] args) throws FileNotFoundException,
>  BioException {
> 
> 
>    BufferedReader br = new BufferedReader(new
>    FileReader("sortFasta.fasta")); String type = "DNA";
>    SymbolTokenization toke = AlphabetManager.alphabetForName(type)
> 					.getTokenization("token");
> 
>    FileOutputStream outputFasta = new FileOutputStream("test.fasta");
> 
>    RichSequenceIterator rsi = RichSequence.IOTools.readFasta(br, toke,
>    null);
> 
>    while (rsi.hasNext()) {
>      RichSequence rs = rsi.nextRichSequence();
>      System.out.println(rs.getName());
>      System.out.println(rs.seqString());
> 
>      RichSequence.IOTools.writeFasta(outputFasta, rs.seqString(), null,
>              rs.getName() + "1");
>    }
>  }
> }
> 
> How is it possible to write fasta files line by line?

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From andreas.draeger at uni-tuebingen.de  Sun Apr 25 21:04:44 2010
From: andreas.draeger at uni-tuebingen.de (=?ISO-8859-1?Q?Andreas_Dr=E4ger?=)
Date: Mon, 26 Apr 2010 10:04:44 +0900
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava
 alignment lead"]
In-Reply-To: <k2r8ea551a31004230111r1fd6fc0fq70c44b1e97976025@mail.gmail.com>
References: <4BBD820D.9070200@uni-tuebingen.de>	
	<r2w326ea8621004170634u6f7275f5xfc915d14a7c94489@mail.gmail.com>	
	<4BCE6E31.70504@uni-tuebingen.de>	
	<i2y326ea8621004210645o206b44afr22a0617651cc4e21@mail.gmail.com>
	<k2r8ea551a31004230111r1fd6fc0fq70c44b1e97976025@mail.gmail.com>
Message-ID: <4BD4E6AC.8030901@uni-tuebingen.de>

Dear Indu,

If you have a question regarding BioJava, please do not just reply to
some previous e-mail. In this case, your question appears in the e-mail
tree related to the BioJava alignment lead. However, you have a question
related to working with and manipulating symbols. Therefore, it would be
better to open a new thread. Sorry for telling you that, but this is
necessary to keep an overview of all the e-mails.

Best wishes
Andreas

-- 
Dipl.-Bioinform. Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 Tübingen
Germany

Phone: +49-7071-29-70436
Fax:   +49-7071-29-5091

From asidhu at biomap.org  Mon Apr 26 02:27:30 2010
From: asidhu at biomap.org (Amandeep Sidhu)
Date: Mon, 26 Apr 2010 14:27:30 +0800
Subject: [Biojava-l] CFP: 23rd IEEE International Symposium on
	Computer-Based Medical Systems 2010
Message-ID: <B41481C1-F539-4CDC-8373-A58A0DA14FA7@biomap.org>

IEEE CBMS 2010
23rd IEEE International Symposium on Computer-Based Medical Systems 2010
Perth, Australia, 12-15 October 2010

http://www.cbms2010.curtin.edu.au/

The 23rd IEEE International Symposium on Computer-Based Medical Systems (CBMS 2010) is intended to provide an international forum for discussing the latest results in the field of computational medicine. The scientific program of CBMS 2010 will consist of invited keynote talks given by leading scientists in the field, and regular and special track sessions that cover a broad array of issues which relate computing to medicine.

RELEVANT TOPICS

Network and Telemedicine Systems
Medical Databases & Information Systems
Computer-Aided Diagnosis
Medical Devices with Embedded Computers
Bioinformatics in Medicine
Software Systems in Medicine
Pervasive Health Systems and Services
Web-based Delivery of Medical Information
Medical Image Segmentation & Compression
Content Analysis of Biomedical Image Data
Knowledge-Based & Decision Support Systems
Hand-held Computing Applications in Medicine
Knowledge Discovery & Data Mining
Signal and Image Processing in Medicine
Multimedia Biomedical Databases

CBMS 2010 invites original previously unpublished contributions that are not submitted concurrently to a journal or another conference. Many of the above listed topics are represented by corresponding Special Tracks, while others are solely covered by the general CBMS track. Prospective authors are expected to submit their contributions to one of the corresponding Special Tracks or to the general track if none of the special tracks is relevant.

SPECIAL TRACKS

ST1: Computational Proteomics and Genomics
ST2: Knowledge Discovery and Decision Systems in Biomedicine
ST3: Ontologies for Biomedical Systems
ST4: HealthGrid & Cloud Computing
ST5: Technology Enhanced Learning in Medical Education
ST6: Intelligent Patient Management
ST7: Data Streams in Healthcare
ST8: Supporting Collaboration among Healthcare Workers
ST9: Telemedicine
ST10: Computer-Based Systems for Mental Health
ST11: Image Informatics in Biomedical Research and Clinical Medicine
ST12: e-Health

SUBMISSION GUIDELINES

Papers should be submitted electronically using the EasyChair online submission system. Papers must follow the IEEE two-column format and should not exceed 6 (six) Letter-sized pages. LaTeX or Microsoft Word templates can be used when preparing the papers. Please note that submissions are accepted in PDF format only.

Submission web site: http://www.easychair.org/conferences/?conf=cbms2010

All submissions will be peer-reviewed by at least three reviewers. The proceedings will be published by the IEEE Computer Society Press. At least one of the authors of accepted papers is required to register and present the work at the conference; otherwise their papers will be removed from the digital library after the conference.

IMPORTANT DATES

Submission deadline for regular papers:                 24 June 2010
Deadline for tutorial submission:                       24 June 2010
Notification of acceptance for papers and tutorials:    2 Aug 2010
Final camera ready due:                                 2 Sep 2010
Author registration:                                    2 Sep 2010

INTENDED AUDIENCE

Engineers, scientists, clinicians and managers involved in medical computing projects are encouraged to submit papers to the symposium and/or attend the symposium. The symposium provides its attendees with an opportunity to experience state-of-the-art research and development in a variety of topics directly and indirectly related to their own work. In addition to research papers, keynote speakers and tutorial sessions it provides participants with an opportunity to come up-to-date on important technological issues. The symposium encourages the participation of students engaged in research/development in computer-based medical systems.

Organizing Committee

GENERAL CHAIRS

Tharam Dillon, Curtin University of Technology, Australia
Daniel Rubin, National Center for Biomedical Ontologies, USA
William Gallagher, University College Dublin, Ireland

PROGRAM CHAIRS

Amandeep Sidhu, Curtin University of Technology, Australia
Alexey Tsymbal, Siemens, Germany

PUBLICATION CHAIRS

Mykola Pechenizkiy, Eindhoven University of Technology, Netherlands
Tony Hu, Drexel University, USA

SPECIAL TRACK CHAIRS

Maja Hadzic, Curtin University of Technology, Australia
Jake Chen, Indiana University, USA

TUTORIAL CHAIRS

Phoebe Chen, La Trobe University, Australia
Xiaofang Zhou, University of Queensland, Australia

PUBLICITY CHAIRS

Carolyn McGregor, University of Ontario Institute of Technology, Canada
Meifania Chen, Curtin University of Technology, Australia

From thomascramera at dnastar.com  Mon Apr 26 10:51:23 2010
From: thomascramera at dnastar.com (Andy Thomas-Cramer)
Date: Mon, 26 Apr 2010 09:51:23 -0500
Subject: [Biojava-l] PDBFileParser and Atom element symbol
In-Reply-To: <n2j59a41c431004231652sa0729050z9df823f6dbf0dfd8@mail.gmail.com>
References: <A4009967D1886D4286A9B7931FD58610021BEC1B@FS1.dnastar.com>
	<n2j59a41c431004231652sa0729050z9df823f6dbf0dfd8@mail.gmail.com>
Message-ID: <A4009967D1886D4286A9B7931FD58610021BEC82@FS1.dnastar.com>

 
Thank you. I had not noticed the pattern that columns 13-14 at least
sometimes contain the element symbol, whether one- or two-character.

 
Questions:

* Is this pattern documented in the PDB specification? 

* If this pattern can be relied on, why are columns 77-78 also dedicated
to the element symbol?

* Should reliance on the pattern be hidden behind a BioJava method?

 
________________________________

From: andreas.prlic at gmail.com [mailto:andreas.prlic at gmail.com] On Behalf
Of Andreas Prlic
Sent: Friday, April 23, 2010 6:52 PM
To: Andy Thomas-Cramer
Cc: biojava-l at lists.open-bio.org
Subject: Re: [Biojava-l] PDBFileParser and Atom element symbol

 
Hi Andy,

you could check with  Atom.getFullname(), which contains the space
characters from the PDB file:
e.g Calpha: " CA ", Calcium "CA  " 

in addition the parent group of a Calpha atom is usually an AminoAcid
and for Calciums it is a Hetatom group...

Andreas

On Fri, Apr 23, 2010 at 3:58 PM, Andy Thomas-Cramer
<thomascramera at dnastar.com> wrote:


Is there an easy way to identify the type of atom referenced by an Atom
object?

For example, if Atom.getName() is "CA", is the element calcium or the
atom carbon alpha?

If not, would it be feasible to add a method providing this in Atom,
AtomImpl, and parsing it in PDBFileParser, using the columns defined at
http://www.wwpdb.org/documentation/format32/sect9.html#ATOM?


_______________________________________________
Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From andreas at sdsc.edu  Mon Apr 26 21:07:53 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Mon, 26 Apr 2010 18:07:53 -0700
Subject: [Biojava-l] PDBFileParser and Atom element symbol
In-Reply-To: <A4009967D1886D4286A9B7931FD58610021BEC82@FS1.dnastar.com>
References: <A4009967D1886D4286A9B7931FD58610021BEC1B@FS1.dnastar.com>
	<n2j59a41c431004231652sa0729050z9df823f6dbf0dfd8@mail.gmail.com>
	<A4009967D1886D4286A9B7931FD58610021BEC82@FS1.dnastar.com>
Message-ID: <m2v59a41c431004261807h4440eac5t31871134c0a5d02f@mail.gmail.com>

Hi Andy

Questions:

> * Is this pattern documented in the PDB specification?
>

see here:
http://www.wwpdb.org/documentation/format23/sect9.html#ATOM


> * If this pattern can be relied on, why are columns 77-78 also dedicated to
> the element symbol?
>
Columns 77-78 hold the atom's element symbol (as given in the periodic
table), in contrast to the atom name earlier in the record, which also
contains numbering information.

> * Should reliance on the pattern be hidden behind a BioJava method?
>

If you think that is important we could probably provide an enum for all
atom types. There are two categories though: the periodic table symbol and
the one that is related to the position in an amino acid....
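
To make the two categories concrete, here is a minimal plain-Java sketch (not BioJava code) that reads both fields from a fixed-column ATOM/HETATM record: the atom name in columns 13-16 and the element symbol in columns 77-78. The class and method names are invented for this illustration, the sample records are generated with String.format so that the columns line up, and the fallback for a blank element field is only the padding heuristic mentioned earlier in this thread.

```java
import java.util.Locale;

class PdbAtomColumns {

    /** Atom name exactly as stored in columns 13-16, padding included. */
    static String fullName(String record) {
        return record.substring(12, 16);
    }

    /**
     * Element symbol from columns 77-78. If that field is missing or blank,
     * guess from the padding of the atom name. The guess is a heuristic only:
     * four-character hydrogen names such as "HD11" would be misread.
     */
    static String element(String record) {
        if (record.length() >= 78) {
            String fromColumns = record.substring(76, 78).trim();
            if (!fromColumns.isEmpty()) {
                return fromColumns.toUpperCase(Locale.ROOT);
            }
        }
        String name = fullName(record);
        if (name.charAt(0) == ' ') {
            return name.substring(1, 2);                              // " CA " -> "C"
        }
        return name.substring(0, 2).trim().toUpperCase(Locale.ROOT);  // "CA  " -> "CA"
    }

    /** Hypothetical helper that lays out one 80-column record with dummy coordinates. */
    static String record(String type, int serial, String name, String resName,
                         char chain, int resSeq, String element) {
        return String.format(Locale.ROOT,
                "%-6s%5d %-4s %3s %c%4d%4s%8.3f%8.3f%8.3f%6.2f%6.2f%10s%2s%2s",
                type, serial, name, resName, chain, resSeq, "",
                0.0, 0.0, 0.0, 1.0, 0.0, "", element, "");
    }

    public static void main(String[] args) {
        String calpha = record("ATOM", 2, " CA ", "ALA", 'A', 1, "C");
        String calcium = record("HETATM", 900, "CA  ", " CA", 'A', 201, "CA");
        System.out.println(element(calpha));   // C
        System.out.println(element(calcium));  // CA
    }
}
```

An enum of element symbols could replace the returned String; the sketch keeps a String so that it does not have to enumerate the periodic table.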

Andreas




-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From rmb32 at cornell.edu  Mon Apr 26 18:02:11 2010
From: rmb32 at cornell.edu (Robert Buels)
Date: Mon, 26 Apr 2010 15:02:11 -0700
Subject: [Biojava-l] Google Summer of Code - accepted students
Message-ID: <4BD60D63.1040400@cornell.edu>

Hi all,

I'm pleased to announce the acceptance of OBF's 2010 Google Summer of 
Code students, listed in alphabetical order with their project titles 
and primary mentors:

Mark Chapman (PM Andreas Prlic) - Improvements to BioJava including 
Implementation of Multiple Sequence Alignment Algorithms

Jianjiong Gao (PM Peter Rose) - BioJava Packages for Identification, 
Classification, and Visualization of Posttranslational Modification of 
Proteins

Kazuhiro Hayashi (PM Naohisa Goto) - Ruby 1.9.2 support of BioRuby

Sara Rayburn (PM Christian Zmasek) - Implementing Speciation & 
Duplication Inference Algorithm for Binary and Non-binary Species Tree

Joao Pedro Garcia Lopes Maia Rodrigues (PM Eric Talevich) - Extending 
Bio.PDB: broadening the usefulness of BioPython's Structural Biology module

Jun Yin (PM Chris Fields) - BioPerl Alignment Subsystem Refactoring

Congratulations to our accepted students!

All told, we had 52 applications submitted for the 6 slots (5 originally 
assigned, plus 1 extra) allotted to us by Google.  Proposals were 
extremely competitive: 6 out of 52 translates to an 11.5% acceptance 
rate.  We received a lot of really excellent proposals, the decisions 
were not easy.

Thanks very much to all the students who applied, we very much 
appreciate your hard work.

Here's to a great 2010 Summer of Code, I'm sure these students will do 
some wonderful work.

Rob Buels
OBF GSoC 2010 Administrator


From andreas at sdsc.edu  Tue Apr 27 01:33:51 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Mon, 26 Apr 2010 22:33:51 -0700
Subject: [Biojava-l] accepted GSoC projects
Message-ID: <u2w59a41c431004262233xe5553c17je23c2b42a3aae81d@mail.gmail.com>

Dear all,

Google has released the results for GSoC: Congratulations to Mark Chapman
and Jianjiong Gao on being accepted to work on the MSA and PTM
projects for BioJava! Let's start the "community bonding" process (
http://en.flossmanuals.net/GSoCMentoring/MindtheGap ); we are all
looking forward to working with you on this during the summer. The mentors and
co-mentors will be Peter Rose for the PTM project and Scooter Willis and Kyle
Ellrott for the MSA project (and me).

I want to thank all of you who submitted proposals or showed interest in
other ways in the Google Summer of Code. We hope you are not too
disappointed if your application was not accepted this time. We had a
large number (52) of applications, and the overall quality of the
submissions was very high. We would like to stay in touch with you, and we
hope that you remain interested in BioJava beyond the scope of GSoC. There
are a number of ways to contribute: we are always looking for
people who provide code and patches to further improve our library, help out
with the documentation on the Wiki page, or answer questions on the mailing
lists.

Let's all give Mark and Jianjiong a warm welcome to the BioJava community.
For those of you who are interested in following the progress of the
projects, the development-related discussions will, as usual, take place on
the biojava-dev list.

Happy coding!

Andreas

From jianjiong.gao at gmail.com  Tue Apr 27 15:13:12 2010
From: jianjiong.gao at gmail.com (Jianjiong Gao)
Date: Tue, 27 Apr 2010 14:13:12 -0500
Subject: [Biojava-l] [Biojava-dev] accepted GSoC projects
In-Reply-To: <u2w59a41c431004262233xe5553c17je23c2b42a3aae81d@mail.gmail.com>
References: <u2w59a41c431004262233xe5553c17je23c2b42a3aae81d@mail.gmail.com>
Message-ID: <h2kc82264f51004271213u1ea78e1bq29184a65b6315cbe@mail.gmail.com>

Dear Dr. Prlic and Everyone,

Thanks for the warm welcome. I am so glad that I have the chance to
work with the BioJava community this summer. I would like to briefly
introduce myself. My name is Jianjiong (JJ) Gao. I am a PhD student in
Computer Science at University of Missouri, Columbia. My study is
focusing on Bioinformatics, specifically computational proteomics and
PTMs.

I came across BioJava about two years ago when I was working on a
plugin for Cytoscape, and was attracted by the idea of providing
generic Java API for bioinformatics applications. I was thinking maybe
someday I could do some coding for BioJava. And now I got the chance
:)

Best Regards,
-JJ



From chapman at cs.wisc.edu  Wed Apr 28 00:18:25 2010
From: chapman at cs.wisc.edu (Mark Chapman)
Date: Tue, 27 Apr 2010 23:18:25 -0500
Subject: [Biojava-l] accepted GSoC projects
In-Reply-To: <u2w59a41c431004262233xe5553c17je23c2b42a3aae81d@mail.gmail.com>
References: <u2w59a41c431004262233xe5553c17je23c2b42a3aae81d@mail.gmail.com>
Message-ID: <4BD7B711.9090108@cs.wisc.edu>

Hi all,

Thank you to Google, Open Bioinformatics Foundation, BioJava, and my mentors for 
this opportunity.  As a short introduction, I am Mark Chapman, a graduate 
student in Computer Sciences at the University of Wisconsin - Madison.  My focus 
is in artificial intelligence and bioinformatics.  This summer, I will add a 
Multiple Sequence Alignment module to BioJava.

My first task will be to update the alignment module to BioJava3 and to design 
the interface for MSA.  My second goal is to implement a progressive MSA styled 
after clustalw.  After that, I will add alternative routines for each step.

Any ideas for the MSA project as well as more sources of programming wisdom are 
quite welcome.  For example, Andreas suggested a series about Java parallelism 
and lazy execution 
(http://apocalisp.wordpress.com/2008/06/18/parallel-strategies-and-the-callable-monad/). 
  I also noted a useful tip for iterative development 
(http://en.flossmanuals.net/GSoCMentoring/Workflow).

Thanks again,
Mark



From bernd.jagla at pasteur.fr  Wed Apr 28 03:25:05 2010
From: bernd.jagla at pasteur.fr (Bernd Jagla)
Date: Wed, 28 Apr 2010 09:25:05 +0200
Subject: [Biojava-l] DAS client: how to retrieve features for a sequence
	region
Message-ID: <7F6FFC6512D64E0E98DF74DB837C9610@zillumina>

Hi there,	

I am trying to retrieve information (features) from the UCSC genome browser
using the DAS interface. 
I am looking at the org.biojava.bio.program.das sources. I can retrieve all
top level entry points with 
DASSequenceDB(dbURL)
(Apparently the last entry of the returned XML triggers a
[Fatal Error] :1:1: Content is not allowed in prolog.
which I am ignoring...)

and also the DSN entries using:
DAS das = new DAS();
    das.addDasURL(new URL(dbURLString));
    for(Iterator i = das.getReferenceServers().iterator(); i.hasNext(); )
{....
     
When I try to access features for a top level entry point, i.e. a reference
sequence I have the impression that first all features for a given reference
sequence are being downloaded. 

My questions: 

How can I access only the features of a specific region? I guess in DAS
terms I want to specify the segment part of the URL
(http://genome.ucsc.edu/cgi-bin/das/hg17/features?segment=22:15000000,16000000).

I would also like to get the list of available features. How can I achieve
this? From Wireshark output I can see that this is being retrieved somehow
behind the scenes. How can I access this information?

I am looking at TestDAS*.java; are there any other examples around that I
can use to learn from?

Thanks a lot for your kind support,

Best,

Bernd


From er.indupandey at gmail.com  Wed Apr 28 12:22:10 2010
From: er.indupandey at gmail.com (indu pandey)
Date: Wed, 28 Apr 2010 09:22:10 -0700
Subject: [Biojava-l] regarding errors
Message-ID: <q2n8ea551a31004280922v903336ffia92507377387fb43@mail.gmail.com>

Hi,

When I try to run this code:

package javaapplication10;
import org.biojava.bio.symbol.*;
import org.biojava.bio.seq.*;

public class TranscribeDNAtoRNA {
   public static void main(String[] args) {
      try {
       //make a DNA SymbolList
       SymbolList symL = DNATools.createDNA("ATGTAAGGCCAGTGT");
       //transcribe it to RNA (after BioJava 1.4 this method is deprecated)
       symL = RNATools.transcribe(symL);
       //(after BioJava 1.4 use this method instead)
       symL = DNATools.toRNA(symL);
       //just to prove it worked
       System.out.println(symL.seqString());
      }
      catch (IllegalSymbolException ex) {
        //this will happen if you try and make the DNA seq using non IUB symbols
         ex.printStackTrace();
      }catch (IllegalAlphabetException ex) {
       //this will happen if you try and transcribe a non DNA SymbolList
         ex.printStackTrace();
      }
   }
}


I get the following error:

org.biojava.bio.symbol.IllegalAlphabetException: The source alphabet and translation table source alphabets don't match: RNA and DNA
        at org.biojava.bio.symbol.TranslatedSymbolList.<init>(TranslatedSymbolList.java:75)
        at org.biojava.bio.symbol.SymbolListViews.translate(SymbolListViews.java:125)
        at org.biojava.bio.seq.DNATools.toRNA(DNATools.java:490)
        at javaapplication10.TranscribeDNAtoRNA.main(TranscribeDNAtoRNA.java:23)
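
The trace points at DNATools.toRNA being handed a list that is already RNA: symL is reassigned by RNATools.transcribe(symL) two lines earlier, so the second call sees RNA where its translation table expects DNA. Keeping only one of the two calls should avoid the exception. The plain-Java sketch below is not BioJava code; it uses a made-up toRna helper with a similar alphabet check, only to show why converting twice fails.

```java
class TranscribeOnce {

    /** Toy stand-in for a DNA-to-RNA view: rejects anything that is not A, C, G or T. */
    static String toRna(String dna) {
        for (char c : dna.toCharArray()) {
            if ("ACGT".indexOf(c) < 0) {
                throw new IllegalArgumentException(
                        "source alphabet is not DNA: unexpected symbol '" + c + "'");
            }
        }
        return dna.replace('T', 'U');
    }

    public static void main(String[] args) {
        String dna = "ATGTAAGGCCAGTGT";
        String rna = toRna(dna);      // first conversion succeeds
        System.out.println(rna);      // AUGUAAGGCCAGUGU
        try {
            toRna(rna);               // second conversion: the input is already RNA
        } catch (IllegalArgumentException ex) {
            System.out.println("second call failed: " + ex.getMessage());
        }
    }
}
```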

From andreas at sdsc.edu  Wed Apr 28 13:31:58 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 28 Apr 2010 10:31:58 -0700
Subject: [Biojava-l] accepted GSoC projects
In-Reply-To: <4BD7B711.9090108@cs.wisc.edu>
References: <u2w59a41c431004262233xe5553c17je23c2b42a3aae81d@mail.gmail.com>
	<4BD7B711.9090108@cs.wisc.edu>
Message-ID: <w2g59a41c431004281031oe53560d6j2826a4cf4e5cb24d@mail.gmail.com>

> Any ideas for the MSA project as well as more sources of programming wisdom
> are quite welcome.  For example, Andreas suggested a series about Java
> parallelism and lazy execution (
> http://apocalisp.wordpress.com/2008/06/18/parallel-strategies-and-the-callable-monad/).
>


credits for the links go to Scooter, who recommended those ;-)  My general
recommendation is to read Joshua Bloch's "Effective Java".
http://java.sun.com/docs/books/effective/ It is a collection of  rules that
should help in avoiding some frequently made mistakes...

Andreas




-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From jw12 at sanger.ac.uk  Wed Apr 28 16:21:13 2010
From: jw12 at sanger.ac.uk (Jonathan Warren)
Date: Wed, 28 Apr 2010 21:21:13 +0100
Subject: [Biojava-l] DAS client: how to retrieve features for a sequence
	region
In-Reply-To: <7F6FFC6512D64E0E98DF74DB837C9610@zillumina>
References: <7F6FFC6512D64E0E98DF74DB837C9610@zillumina>
Message-ID: <58A5F4B6-C3E3-45FA-9C15-6676BB953C4F@sanger.ac.uk>

Hi Bernd

For UCSC you need to filter on types; see
http://genome.ucsc.edu/FAQ/FAQdownloads.html#downloads
where there is a section called "Downloading data from the UCSC DAS server".
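
For the region question, a DAS 1.x features request can carry both a segment and a type filter in its query string, and a types request lists what a source offers. A minimal sketch that only builds such URLs (it does not contact any server): the class and method names are invented here, refGene is just an example type name, and no URL encoding is applied, so it assumes plain identifiers.

```java
class DasUrls {

    private static String base(String dasPrefix, String dsn) {
        return (dasPrefix.endsWith("/") ? dasPrefix : dasPrefix + "/") + dsn;
    }

    /** PREFIX/DSN/features?segment=REF:START,STOP followed by ;type=TYPE for each type. */
    static String featuresUrl(String dasPrefix, String dsn, String ref,
                              int start, int stop, String... types) {
        StringBuilder url = new StringBuilder(base(dasPrefix, dsn));
        url.append("/features?segment=")
           .append(ref).append(':').append(start).append(',').append(stop);
        for (String type : types) {
            url.append(";type=").append(type);
        }
        return url.toString();
    }

    /** PREFIX/DSN/types, the request that lists the feature types of a source. */
    static String typesUrl(String dasPrefix, String dsn) {
        return base(dasPrefix, dsn) + "/types";
    }

    public static void main(String[] args) {
        String prefix = "http://genome.ucsc.edu/cgi-bin/das";
        System.out.println(typesUrl(prefix, "hg17"));
        System.out.println(featuresUrl(prefix, "hg17", "22", 15000000, 16000000, "refGene"));
    }
}
```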

For DAS libraries, there is a tutorial at http://www.biodas.org/wiki/DASWorkshop2010#Day_2

The one you would be most interested in is the Dasobert tutorial
(http://www.ebi.ac.uk/~rafael/dokuwiki/doku.php?id=das:courses:dasobert)
for DAS client creation, but there is also a good JavaScript
library called JSDas.

If you need any more information, don't hesitate to ask.

Jonathan.


Jonathan Warren
Senior Developer and DAS coordinator
jw12 at sanger.ac.uk
Ext: 2314
Telephone: 01223 492314


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 

From chapman at cs.wisc.edu  Wed Apr 28 21:09:07 2010
From: chapman at cs.wisc.edu (Mark Chapman)
Date: Wed, 28 Apr 2010 20:09:07 -0500
Subject: [Biojava-l] [Biojava-dev] accepted GSoC projects
In-Reply-To: <6C3A102F-AF2B-4E29-9C84-BB6B881BD083@scripps.edu>
References: <u2w59a41c431004262233xe5553c17je23c2b42a3aae81d@mail.gmail.com>
	<4BD7B711.9090108@cs.wisc.edu>
	<w2g59a41c431004281031oe53560d6j2826a4cf4e5cb24d@mail.gmail.com>
	<6C3A102F-AF2B-4E29-9C84-BB6B881BD083@scripps.edu>
Message-ID: <4BD8DC33.7010607@cs.wisc.edu>

Here is a summary of the concurrency lessons I learned that are useful with or 
without the functional programming paradigm --

1: implement Callable<T> to submit tasks for concurrent/parallel/lazy execution
  - call() methods just wrap a call to the computation intensive method
2: share a fixed size thread pool with task queue to avoid
  - overhead of thread creation/destruction,
  - too many simultaneous threads, and
  - most blocking issues
3: place thread blocking Future<T>.get() calls within tasks later in the queue
  - while(!Future<T>.isDone()) Thread.yield(); may also help keep the pool active
4: execution in a task queue also enables easier logging and progress listening
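
Points 1 to 3 could look roughly like the sketch below. It assumes nothing about the MSA module: sumOfSquares is a made-up stand-in for a computation-intensive method and the class name is hypothetical; only java.util.concurrent classes from the JDK are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SharedPoolSketch {

    /** Stand-in for a computation-intensive method. */
    static long sumOfSquares(int n) {
        long total = 0;
        for (int i = 1; i <= n; i++) {
            total += (long) i * i;
        }
        return total;
    }

    /** Runs one task per input on a fixed-size pool and returns results in input order. */
    static List<Long> runAll(int[] inputs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (final int n : inputs) {
                // The Callable only wraps the call to the expensive method.
                Callable<Long> task = () -> sumOfSquares(n);
                futures.add(pool.submit(task));
            }
            List<Long> results = new ArrayList<>();
            for (Future<Long> future : futures) {
                results.add(future.get());  // blocks until that task has finished
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAll(new int[] {3, 10}, 2));  // [14, 385]
    }
}
```

Here the blocking get() calls happen on the submitting thread after everything has been queued; placing them inside later tasks, as in point 3, needs care so that a small pool is not filled with tasks that are all waiting.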

There are two obvious places concurrent execution will fit in the MSA module --

1: building the distance matrix
  - queue pairwise alignment/scoring tasks in loop over all sequence pairs
2: progressive alignment
  - queue profile-profile alignment tasks in postfix traversal of guide tree (from leaves to root)
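
For the first of these, a sketch under stated assumptions: the distance function below is a toy placeholder (fraction of mismatched positions between equal-length strings), not a pairwise alignment score, and the class is hypothetical. It is meant to show only the task layout, one queued task per sequence pair, with results collected into a symmetric matrix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class DistanceMatrixSketch {

    /** Placeholder score: fraction of mismatched positions (equal-length inputs assumed). */
    static double distance(String a, String b) {
        int mismatches = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                mismatches++;
            }
        }
        return (double) mismatches / a.length();
    }

    static double[][] build(List<String> seqs, int threads) {
        int n = seqs.size();
        double[][] matrix = new double[n][n];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Queue one task per unordered pair (i < j).
            List<List<Future<Double>>> pending = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                List<Future<Double>> row = new ArrayList<>();
                for (int j = i + 1; j < n; j++) {
                    final String a = seqs.get(i);
                    final String b = seqs.get(j);
                    Callable<Double> task = () -> distance(a, b);
                    row.add(pool.submit(task));
                }
                pending.add(row);
            }
            // Collect the results and mirror them across the diagonal.
            for (int i = 0; i < n; i++) {
                for (int j = i + 1; j < n; j++) {
                    double d = pending.get(i).get(j - i - 1).get();
                    matrix[i][j] = d;
                    matrix[j][i] = d;
                }
            }
            return matrix;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("pairwise task failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The progressive step is harder to queue this way, because each profile-profile task depends on the results of its child nodes in the guide tree.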

All our library copies of "Effective Java" are checked out, so I ordered a copy 
for my personal library.  The sample chapter on generics sold me.

Mark


On 4/28/2010 12:57 PM, Scooter Willis wrote:
> Andreas
>
> Those links were sent to me by Mark Southern who sits a couple doors down and a past BioJava contributor for the sequence viewer. We should avoid bringing in any external parallel frameworks but at minimum give ourselves enough abstraction with a backend multi-threaded job-processing approach to take advantage of a multi-processor box and a cluster via Terracotta.  If the abstraction of the jobs and the mapping of resources is generic enough then that allows different implementations in various cluster environments for those who have found the next best thing in parallel computing!
>
> Scooter
>
> On Apr 28, 2010, at 1:31 PM, Andreas Prlic wrote:
>
>>> Any ideas for the MSA project as well as more sources of programming wisdom
>>> are quite welcome.  For example, Andreas suggested a series about Java
>>> parallelism and lazy execution (
>>> http://apocalisp.wordpress.com/2008/06/18/parallel-strategies-and-the-callable-monad/).
>>>
>>
>>
>> credits for the links go to Scooter, who recommended those ;-)  My general
>> recommendation is to read Joshua Bloch's "Effective Java".
>> http://java.sun.com/docs/books/effective/ It is a collection of  rules that
>> should help in avoiding some frequently made mistakes...
>>
>> Andreas
>>
>>
>>
>>
>>
>>
>>> I also noted a useful tip for iterative development (
>>> http://en.flossmanuals.net/GSoCMentoring/Workflow).
>>>
>>> Thanks again,
>>> Mark
>>>
>>>
>>>
>>> On 4/27/2010 12:33 AM, Andreas Prlic wrote:
>>>
>>>> Dear all,
>>>>
>>>> Google has released the results for GSoC: Congratulations to Mark
>>>> Chapman and Jianjiong Gao for having been accepted to work on the MSA
>>>> and PTM projects for BioJava! Let's start the "community bonding"
>>>> process ( http://en.flossmanuals.net/GSoCMentoring/MindtheGap )  and we
>>>> are all looking forward to working with you on this during the summer. The
>>>> Mentors and co-mentors will be Peter Rose for the PTM and Scooter Willis
>>>> and Kyle Ellrott for the MSA project (and me).
>>>>
>>>> I want to thank all of you who submitted proposals or showed interest
>>>> in other ways for the Google Summer of Code. We hope you are not too
>>>> disappointed if your application did not get accepted this time. We had
>>>> a large number (52) of applications, and the overall quality of the
>>>> submissions was very high. We would like to stay in touch with you and
>>>> we hope that you are interested in BioJava also beyond the scope of
>>>> GSoC. There are a number of different ways how to contribute:  We are
>>>> always looking for people who provide code and patches to further
>>>> improve our library, help out with the documentation on the Wiki page,
>>>> or answer questions on the mailing lists.
>>>>
>>>> Let's all give Mark and Jianjiong  a warm welcome to the BioJava
>>>> community.  For those of you who are interested in following the
>>>> progress of the projects, as usual, the development-related
>>>> discussions are going to be on the biojava-dev list.
>>>>
>>>> Happy coding!
>>>>
>>>> Andreas
>>>>
>>>>
>>>>
>>
>>
>> --
>> -----------------------------------------------------------------------
>> Dr. Andreas Prlic
>> Senior Scientist, RCSB PDB Protein Data Bank
>> University of California, San Diego
>> (+1) 858.246.0526
>> -----------------------------------------------------------------------
>> _______________________________________________
>> biojava-dev mailing list
>> biojava-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>

From bernd.jagla at pasteur.fr  Thu Apr 29 02:30:03 2010
From: bernd.jagla at pasteur.fr (Bernd Jagla)
Date: Thu, 29 Apr 2010 08:30:03 +0200
Subject: [Biojava-l] DAS client: how to retrieve features for a sequence
	region
In-Reply-To: <58A5F4B6-C3E3-45FA-9C15-6676BB953C4F@sanger.ac.uk>
References: <7F6FFC6512D64E0E98DF74DB837C9610@zillumina>
	<58A5F4B6-C3E3-45FA-9C15-6676BB953C4F@sanger.ac.uk>
Message-ID: <C9F821CA96FB4466A66E969AEFCF125E@zillumina>

Hi Jonathan,

 
Just to clarify, I need to write my own das client? I was hoping to be able
to use most of the functionality especially for the parsing of the XML and
creating the URLs by means of functions/methods that are already around. 

I am now going into debug mode for the DAS package in biojava to look for
the XML parsing. If you have any further pointers on specific methods I
should be looking at, it would mean a lot to me.

In short, I think I can create the URLs from scratch with not much effort. I
don't currently know how to put the XML into a data structure or what this
data structure should look like.
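
To make the question concrete, this is roughly the kind of structure I have in mind, sketched with the XML parser that ships with the JDK. The Feature class and the helper names are made up for illustration, and the element names (FEATURE, TYPE, START, END) are only my reading of a DAS 1 features response, so they should be checked against the specification and a real server reply.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class DasFeatureSketch {

    /** Hypothetical value object holding the fields of one FEATURE element. */
    static final class Feature {
        final String id;
        final String type;
        final int start;
        final int end;

        Feature(String id, String type, int start, int end) {
            this.id = id;
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** Parses a features response held in memory into a list of Feature objects. */
    static List<Feature> parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("FEATURE");
            List<Feature> features = new ArrayList<Feature>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element f = (Element) nodes.item(i);
                features.add(new Feature(
                        f.getAttribute("id"),
                        childText(f, "TYPE"),
                        Integer.parseInt(childText(f, "START")),
                        Integer.parseInt(childText(f, "END"))));
            }
            return features;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse DAS features XML", e);
        }
    }

    /** Text of the first descendant element with the given tag, or "" if there is none. */
    static String childText(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? "" : list.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        // Hand-written sample, not a captured server response.
        String xml = "<DASGFF><GFF><SEGMENT id=\"22\" start=\"15000000\" stop=\"16000000\">"
                + "<FEATURE id=\"f1\"><TYPE id=\"refGene\">refGene</TYPE>"
                + "<START>15000100</START><END>15000900</END></FEATURE>"
                + "</SEGMENT></GFF></DASGFF>";
        for (Feature f : parse(xml)) {
            System.out.println(f.id + " " + f.type + " " + f.start + ".." + f.end);
        }
    }
}
```

A real response may carry a DOCTYPE declaration and more child elements per feature, which this sketch ignores.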

 
Thanks for your kind help,

 
Bernd

 
  _____  

From: Jonathan Warren [mailto:jw12 at sanger.ac.uk] 
Sent: Wednesday, April 28, 2010 10:21 PM
To: Bernd Jagla
Cc: biojava-l at lists.open-bio.org
Subject: Re: [Biojava-l] DAS client: how to retrieve features for a sequence
region

 
Hi Bernd

 
For the UCSC you need to filter on types. see
http://genome.ucsc.edu/FAQ/FAQdownloads.html#downloads there is a section
called "Downloading data from the UCSC DAS server"

 
for DAS libraries you can see a tutorial here
http://www.biodas.org/wiki/DASWorkshop2010#Day_2

 
the one you would be most interested in is the Dasobert tutorial
(http://www.ebi.ac.uk/~rafael/dokuwiki/doku.php?id=das:courses:dasobert) for
DAS client creation, but there is also a good JavaScript library called
JSDas.

 
Any more info then don't hesitate to ask.

 
Jonathan.


On 28 Apr 2010, at 08:25, Bernd Jagla wrote:


Hi there,         

I am trying to retrieve information (features) from the UCSC genome browser
using the DAS interface. 
I am looking at the org.biojava.bio.program.das sources. I can retrieve all
top level entry points with 
DASSequenceDB(dbURL)
(Apparently the last entry from the returned XML object gives a 
[Fatal Error] :1:1: Content is not allowed in prolog.
Which I am ignoring...)

and also the DSN entries using:
DAS das = new DAS();
   das.addDasURL(new URL(dbURLString));
   for(Iterator i = das.getReferenceServers().iterator(); i.hasNext(); )
{....

When I try to access features for a top level entry point, i.e. a reference
sequence I have the impression that first all features for a given reference
sequence are being downloaded. 

My questions: 

How can I access only the features of a specific region? I guess in DAS
terms I want to specify the segment part of the URL
(http://genome.ucsc.edu/cgi-bin/das/hg17/features?segment=22:15000000,16000000).

I would also like to get the list of available features. How can I achieve
this? From a wireshark output I can see that this is being retrieved somehow
behind the scene. How can I access this information?

I am looking at TestDAS*.java; are there any other examples around that I
can use to learn from?

Thanks a lot for your kind support,

Best,

Bernd


_______________________________________________
Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l

 
Jonathan Warren

Senior Developer and DAS coordinator

jw12 at sanger.ac.uk

Ext: 2314

Telephone: 01223 492314

 
-- The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a company
registered in England with number 2742969, whose registered office is 215
Euston Road, London, NW1 2BE. 


From jw12 at sanger.ac.uk  Thu Apr 29 04:26:40 2010
From: jw12 at sanger.ac.uk (Jonathan Warren)
Date: Thu, 29 Apr 2010 09:26:40 +0100
Subject: [Biojava-l] DAS client: how to retrieve features for a sequence
	region
In-Reply-To: <C9F821CA96FB4466A66E969AEFCF125E@zillumina>
References: <7F6FFC6512D64E0E98DF74DB837C9610@zillumina>
	<58A5F4B6-C3E3-45FA-9C15-6676BB953C4F@sanger.ac.uk>
	<C9F821CA96FB4466A66E969AEFCF125E@zillumina>
Message-ID: <A8AF8818-8B33-4869-8EF9-07A782C496FC@sanger.ac.uk>

The link I gave you http://www.ebi.ac.uk/~rafael/dokuwiki/doku.php?id=das:courses:dasobert 
  shows examples of how to connect to 'European' style das sources.  
For the UCSC and GBrowse type DAS sources you may have to play around  
with the urls to get the info you want as they work slightly  
differently to other DAS data sources and use the types to filter  
data. I would suggest contacting the UCSC for more info.
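
For what it is worth, a request URL of that shape can be assembled with a few lines of plain Java. This is only a sketch: the helper names are invented here, the "refGene" type id is an illustrative guess, and the ";type=" filter and the "types" command follow the DAS 1 conventions as I understand them, so please check them against the UCSC FAQ section mentioned earlier.

```java
final class DasUrlSketch {

    /** Joins the server base and the data source name with exactly one slash. */
    static String sourceUrl(String serverBase, String dataSource) {
        return serverBase.endsWith("/") ? serverBase + dataSource : serverBase + "/" + dataSource;
    }

    /** URL for the features of one region, optionally restricted to the given type ids. */
    static String featuresUrl(String serverBase, String dataSource,
                              String segment, long start, long stop, String... types) {
        StringBuilder url = new StringBuilder(sourceUrl(serverBase, dataSource));
        url.append("/features?segment=").append(segment)
           .append(':').append(start).append(',').append(stop);
        for (String type : types) {
            url.append(";type=").append(type); // DAS 1 separates arguments with ';'
        }
        return url.toString();
    }

    /** URL that should list the feature types a data source offers. */
    static String typesUrl(String serverBase, String dataSource) {
        return sourceUrl(serverBase, dataSource) + "/types";
    }

    public static void main(String[] args) {
        String base = "http://genome.ucsc.edu/cgi-bin/das";
        System.out.println(featuresUrl(base, "hg17", "22", 15000000L, 16000000L, "refGene"));
        System.out.println(typesUrl(base, "hg17"));
    }
}
```

The types listing is probably the easiest way to find out which type ids a given server accepts before filtering on them.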

The dasobert library is what you should use - the DASSequenceDB.java
that you are currently looking at in biojava is old and not really
supported anymore.

> I was hoping to be able to use most of the functionality especially  
> for the parsing of the XML and creating the URLs by means of  
> functions/methods that are already around.
this is what the dasobert library is for ;)


On 29 Apr 2010, at 07:30, Bernd Jagla wrote:

> Hi Jonathan,
>
> Just to clarify, I need to write my own das client? I was hoping to
> be able to use most of the functionality especially for the parsing
> of the XML and creating the URLs by means of functions/methods that
> are already around.
> I am now going into debug mode for the DAS package in biojava to
> look for the XML parsing. If you have any further pointers on specific
> methods I should be looking at, it would mean a lot to me.
> In short, I think I can create the URLs from scratch with not much
> effort. I don't currently know how to put the XML into a data
> structure or what this data structure should look like.
>
> Thanks for your kind help,
>
> Bernd
>
> From: Jonathan Warren [mailto:jw12 at sanger.ac.uk]
> Sent: Wednesday, April 28, 2010 10:21 PM
> To: Bernd Jagla
> Cc: biojava-l at lists.open-bio.org
> Subject: Re: [Biojava-l] DAS client: how to retrieve features for a  
> sequence region
>
> Hi Bernd
>
> For the UCSC you need to filter on types. see http://genome.ucsc.edu/FAQ/FAQdownloads.html#downloads 
>  there is a section called "Downloading data from the UCSC DAS server"
>
> for DAS libraries you can see a tutorial here http://www.biodas.org/wiki/DASWorkshop2010#Day_2
>
> the one you would be most interested in is the Dasobert tutorial (http://www.ebi.ac.uk/~rafael/dokuwiki/doku.php?id=das:courses:dasobert 
> ) for DAS client creation, but there is also a good JavaScript
> library called JSDas.
>
> Any more info then don't hesitate to ask.
>
> Jonathan.
>
>
> On 28 Apr 2010, at 08:25, Bernd Jagla wrote:
>
>
> Hi there,
>
> I am trying to retrieve information (features) from the UCSC genome  
> browser
> using the DAS interface.
> I am looking at the org.biojava.bio.program.das sources. I can  
> retrieve all
> top level entry points with
> DASSequenceDB(dbURL)
> (Apparently the last entry from the returned XML object gives a
> [Fatal Error] :1:1: Content is not allowed in prolog.
> Which I am ignoring...)
>
> and also the DSN entries using:
> DAS das = new DAS();
>    das.addDasURL(new URL(dbURLString));
>    for(Iterator i = das.getReferenceServers().iterator();  
> i.hasNext(); )
> {....
>
> When I try to access features for a top level entry point, i.e. a  
> reference
> sequence I have the impression that first all features for a given  
> reference
> sequence are being downloaded.
>
> My questions:
>
> How can I access only the features of a specific region? I guess in  
> DAS
> terms I want to specify the segment part of the URL
> (http://genome.ucsc.edu/cgi-bin/das/hg17/features?segment=22:15000000,16000000).
>
> I would also like to get the list of available features. How can I  
> achieve
> this? From a wireshark output I can see that this is being retrieved  
> somehow
> behind the scene. How can I access this information?
>
> I am looking at TestDAS*.java; are there any other examples around  
> that I
> can use to learn from?
>
> Thanks a lot for your kind support,
>
> Best,
>
> Bernd
>
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
> Jonathan Warren
> Senior Developer and DAS coordinator
> jw12 at sanger.ac.uk
> Ext: 2314
> Telephone: 01223 492314
>
>
>
>
>
>
> -- The Wellcome Trust Sanger Institute is operated by Genome  
> Research Limited, a charity registered in England with number  
> 1021457 and a company registered in England with number 2742969,  
> whose registered office is 215 Euston Road, London, NW1 2BE.

Jonathan Warren
Senior Developer and DAS coordinator
jw12 at sanger.ac.uk
Ext: 2314
Telephone: 01223 492314


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 


From ayates at ebi.ac.uk  Thu Apr 29 04:51:23 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Thu, 29 Apr 2010 09:51:23 +0100
Subject: [Biojava-l] regarding errors
In-Reply-To: <q2n8ea551a31004280922v903336ffia92507377387fb43@mail.gmail.com>
References: <q2n8ea551a31004280922v903336ffia92507377387fb43@mail.gmail.com>
Message-ID: <C7EE3258-C65D-4899-A47F-B4F19B7DE2B9@ebi.ac.uk>

I believe your problem is that you are attempting to transcribe the DNA to RNA twice. If you comment out the line:

//symL = RNATools.transcribe(symL);

then you should find that the code works.
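
To show why the second call fails, here is a toy stand-in written with plain strings. It is not the BioJava implementation (which compares alphabet objects rather than characters); the class and method names are invented for this sketch. The point is only that a DNA-to-RNA conversion rejects input that has already been converted.

```java
final class TranscribeTwiceSketch {

    /** Toy DNA-to-RNA conversion that accepts only the symbols A, C, G and T. */
    static String toRna(String dna) {
        StringBuilder rna = new StringBuilder(dna.length());
        for (char c : dna.toCharArray()) {
            switch (c) {
                case 'A':
                case 'C':
                case 'G':
                    rna.append(c);
                    break;
                case 'T':
                    rna.append('U');
                    break;
                default:
                    // Plays the role of the alphabet mismatch reported in the stack trace.
                    throw new IllegalArgumentException("Not a DNA symbol: " + c);
            }
        }
        return rna.toString();
    }

    public static void main(String[] args) {
        String rna = toRna("ATGTAAGGCCAGTGT");
        System.out.println(rna);
        try {
            toRna(rna); // second conversion: the input is already RNA
        } catch (IllegalArgumentException ex) {
            System.out.println("Second call rejected: " + ex.getMessage());
        }
    }
}
```

In your program the equivalent fix is to keep only the DNATools.toRNA(symL) call.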

Regards,

Andy

On 28 Apr 2010, at 17:22, indu pandey wrote:

> hi
> 
> When I am trying to run this code
> 
> package javaapplication10;
> import org.biojava.bio.symbol.*;
> import org.biojava.bio.seq.*;
> 
> public class TranscribeDNAtoRNA {
>   public static void main(String[] args) {
>      try {
>       //make a DNA SymbolList
>       SymbolList symL = DNATools.createDNA("ATGTAAGGCCAGTGT");
>       //transcribe it to RNA (after BioJava 1.4 this method is deprecated)
>       symL = RNATools.transcribe(symL);
>       //(after BioJava 1.4 use this method instead)
>       symL = DNATools.toRNA(symL);
>       //just to prove it worked
>       System.out.println(symL.seqString());
>      }
>      catch (IllegalSymbolException ex) {
>        //this will happen if you try and make the DNA seq using non IUB symbols
>         ex.printStackTrace();
>      }catch (IllegalAlphabetException ex) {
>       //this will happen if you try and transcribe a non DNA SymbolList
>         ex.printStackTrace();
>      }
>   }
> }
> 
> 
> I get the following error:
> 
> *org.biojava.bio.symbol.IllegalAlphabetException: The source alphabet and
> translation table source alphabets don't match: RNA and DNA
>        at
> org.biojava.bio.symbol.TranslatedSymbolList.<init>(TranslatedSymbolList.java:75)
>        at
> org.biojava.bio.symbol.SymbolListViews.translate(SymbolListViews.java:125)
>        at org.biojava.bio.seq.DNATools.toRNA(DNATools.java:490)
>        at
> javaapplication10.TranscribeDNAtoRNA.main(TranscribeDNAtoRNA.java:23)
> *
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From bernd.jagla at pasteur.fr  Thu Apr 29 05:57:58 2010
From: bernd.jagla at pasteur.fr (Bernd Jagla)
Date: Thu, 29 Apr 2010 11:57:58 +0200
Subject: [Biojava-l] DAS client: how to retrieve features for a sequence
	region
In-Reply-To: <A8AF8818-8B33-4869-8EF9-07A782C496FC@sanger.ac.uk>
References: <7F6FFC6512D64E0E98DF74DB837C9610@zillumina>
	<58A5F4B6-C3E3-45FA-9C15-6676BB953C4F@sanger.ac.uk>
	<C9F821CA96FB4466A66E969AEFCF125E@zillumina>
	<A8AF8818-8B33-4869-8EF9-07A782C496FC@sanger.ac.uk>
Message-ID: <F1A08360F6094E559A20ED9CD0EA0168@zillumina>

Great, that is very helpful. One more question: should I be using the Das1 or
Das2 implementation? The demo I am looking at uses Das2 (I think), but I am
running into problems. By modifying things in the Das2SourceHandler I can
now get Ids (instead of using uri). Is this the right way of approaching
this, or should I be looking somewhere else?

 
When you say I have to play around with the URLs can you give me an example?
Is the problem described above part of this? (this is not the URL but rather
the XML..)

 
Sorry for these questions, but I find it extremely difficult to get my head
around all these different versions (DAS1/2; dasobert/programs.das;
European/Rest; ...).

 
Thanks a lot,

 
Bernd

 
PS. I guess I should have attended the recent meeting. ;(

 
  _____  

From: Jonathan Warren [mailto:jw12 at sanger.ac.uk] 
Sent: Thursday, April 29, 2010 10:27 AM
To: Bernd Jagla
Cc: biojava-l at lists.open-bio.org
Subject: Re: [Biojava-l] DAS client: how to retrieve features for a sequence
region

 
The link I gave you
http://www.ebi.ac.uk/~rafael/dokuwiki/doku.php?id=das:courses:dasobert shows
examples of how to connect to 'European' style das sources. For the UCSC and
GBrowse type DAS sources you may have to play around with the urls to get
the info you want as they work slightly differently to other DAS data
sources and use the types to filter data. I would suggest contacting the
UCSC for more info.

 
The dasobert library is what you should use - the DASSequenceDB.java that you
are currently looking at in biojava is old and not really supported
anymore.

 
I was hoping to be able to use most of the functionality especially for the
parsing of the XML and creating the URLs by means of functions/methods that
are already around.

this is what the dasobert library is for ;)

 
On 29 Apr 2010, at 07:30, Bernd Jagla wrote:


Hi Jonathan,

 
Just to clarify, I need to write my own das client? I was hoping to be able
to use most of the functionality especially for the parsing of the XML and
creating the URLs by means of functions/methods that are already around.

I am now going into debug mode for the DAS package in biojava to look for
the XML parsing. If you have any further pointers on specific methods I
should be looking at, it would mean a lot to me.

In short, I think I can create the URLs from scratch with not much effort. I
don't currently know how to put the XML into a data structure or what this
data structure should look like.

 
Thanks for your kind help,

 
Bernd

 
  _____  

From: Jonathan Warren [mailto:jw12 at sanger.ac.uk] 
Sent: Wednesday, April 28, 2010 10:21 PM
To: Bernd Jagla
Cc: biojava-l at lists.open-bio.org
Subject: Re: [Biojava-l] DAS client: how to retrieve features for a sequence
region

 
Hi Bernd

 
For the UCSC you need to filter on types. see
http://genome.ucsc.edu/FAQ/FAQdownloads.html#downloads there is a section
called "Downloading data from the UCSC DAS server"

 
for DAS libraries you can see a tutorial here
http://www.biodas.org/wiki/DASWorkshop2010#Day_2

 
the one you would be most interested in is the Dasobert tutorial
(http://www.ebi.ac.uk/~rafael/dokuwiki/doku.php?id=das:courses:dasobert) for
DAS client creation, but there is also a good JavaScript library called
JSDas.

 
Any more info then don't hesitate to ask.

 
Jonathan.


On 28 Apr 2010, at 08:25, Bernd Jagla wrote:


Hi there,         

I am trying to retrieve information (features) from the UCSC genome browser
using the DAS interface. 
I am looking at the org.biojava.bio.program.das sources. I can retrieve all
top level entry points with 
DASSequenceDB(dbURL)
(Apparently the last entry from the returned XML object gives a 
[Fatal Error] :1:1: Content is not allowed in prolog.
Which I am ignoring...)

and also the DSN entries using:
DAS das = new DAS();
   das.addDasURL(new URL(dbURLString));
   for(Iterator i = das.getReferenceServers().iterator(); i.hasNext(); )
{....

When I try to access features for a top level entry point, i.e. a reference
sequence I have the impression that first all features for a given reference
sequence are being downloaded. 

My questions: 

How can I access only the features of a specific region? I guess in DAS
terms I want to specify the segment part of the URL
(http://genome.ucsc.edu/cgi-bin/das/hg17/features?segment=22:15000000,16000000).

I would also like to get the list of available features. How can I achieve
this? From a wireshark output I can see that this is being retrieved somehow
behind the scene. How can I access this information?

I am looking at TestDAS*.java; are there any other examples around that I
can use to learn from?

Thanks a lot for your kind support,

Best,

Bernd


_______________________________________________
Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l

 
Jonathan Warren

Senior Developer and DAS coordinator

jw12 at sanger.ac.uk

Ext: 2314

Telephone: 01223 492314

 
-- The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a company
registered in England with number 2742969, whose registered office is 215
Euston Road, London, NW1 2BE.

 


From thomascramera at dnastar.com  Thu Apr 29 14:14:27 2010
From: thomascramera at dnastar.com (Andy Thomas-Cramer)
Date: Thu, 29 Apr 2010 13:14:27 -0500
Subject: [Biojava-l] PDBFileParser and Atom element symbol
In-Reply-To: <m2v59a41c431004261807h4440eac5t31871134c0a5d02f@mail.gmail.com>
References: <A4009967D1886D4286A9B7931FD58610021BEC1B@FS1.dnastar.com>
	<n2j59a41c431004231652sa0729050z9df823f6dbf0dfd8@mail.gmail.com>
	<A4009967D1886D4286A9B7931FD58610021BEC82@FS1.dnastar.com>
	<m2v59a41c431004261807h4440eac5t31871134c0a5d02f@mail.gmail.com>
Message-ID: <A4009967D1886D4286A9B7931FD58610021BF1D7@FS1.dnastar.com>

 
Yes, I would like to have direct access to the element symbol data
that's in the file. Otherwise, anyone that needs the element type has to
create rules for interpreting it from the "atom name" field. It feels
wrong to attempt to deduce data when it is provided explicitly.


These PDB remediation project notes suggest using the element symbol
specified in columns 77-78:

http://nar.oxfordjournals.org/cgi/content/full/36/suppl_1/D426#SEC3 

"Atom types are provided for every atom (i.e. ATOM record columns
77-78), so prior atom name justification conventions should no longer be
assumed in reading atom names."

 
JMOL uses the PDB element symbol if present, else interprets from the
"atom name" field. 

http://wiki.jmol.org/index.php/AtomSets 

"On PDB format, Jmol will identify the element from columns 77-78
(element symbol, right-justified). If this is absent, then it will
interpret the "atom name" field (columns 13-14) to deduce the element
identity."

JMOL is LGPL. If interpretation is desirable, BioJava could start with
JMOL's current approach. Personally, I would be happy just with access to
the data in the file.

 
________________________________

From: andreas.prlic at gmail.com [mailto:andreas.prlic at gmail.com] On Behalf
Of Andreas Prlic
Sent: Monday, April 26, 2010 8:08 PM
To: Andy Thomas-Cramer
Cc: biojava-l at lists.open-bio.org
Subject: Re: [Biojava-l] PDBFileParser and Atom element symbol

 
Hi Andy

Questions:

	* Is this pattern documented in the PDB specification? 


see here:
http://www.wwpdb.org/documentation/format23/sect9.html#ATOM
 

	* If this pattern can be relied on, why are columns 77-78 also
dedicated to the element symbol?

That is the atom's element symbol (as given in the periodic table), in
contrast to the first name, which contains numbering information.

	* Should reliance on the pattern be hidden behind a BioJava
method?


If you think that is important we could probably provide an enum for all
atom types. There are two categories though: the periodic table symbol
and the one that is related to the position in an amino acid....

Andreas 
 

________________________________


	From: andreas.prlic at gmail.com [mailto:andreas.prlic at gmail.com]
On Behalf Of Andreas Prlic
	Sent: Friday, April 23, 2010 6:52 PM
	To: Andy Thomas-Cramer
	Cc: biojava-l at lists.open-bio.org
	Subject: Re: [Biojava-l] PDBFileParser and Atom element symbol

	 
	Hi Andy,
	
	you could check with  Atom.getFullname(), which contains the
space characters from the PDB file:
	e.g Calpha: " CA ", Calcium "CA  " 
	
	in addition the parent group of a Calpha atom is usually an
AminoAcid and for Calciums it is a Hetatom group...
	
	Andreas

	On Fri, Apr 23, 2010 at 3:58 PM, Andy Thomas-Cramer
<thomascramera at dnastar.com> wrote:

	
	Is there an easy way to identify the type of atom referenced by
an Atom
	object?
	
	For example, if Atom.getName() is "CA", is the element calcium
or the
	atom carbon alpha?
	
	If not, would it be feasible to add a method providing this in
Atom,
	AtomImpl, and parsing it in PDBFileParser, using the columns
defined at
	http://www.wwpdb.org/documentation/format32/sect9.html#ATOM?
	
	
	_______________________________________________
	Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
	http://lists.open-bio.org/mailman/listinfo/biojava-l

	
	-- 
	
-----------------------------------------------------------------------
	Dr. Andreas Prlic
	Senior Scientist, RCSB PDB Protein Data Bank
	University of California, San Diego
	(+1) 858.246.0526
	
-----------------------------------------------------------------------


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From pwrose at ucsd.edu  Thu Apr 29 15:53:33 2010
From: pwrose at ucsd.edu (Peter Rose)
Date: Thu, 29 Apr 2010 12:53:33 -0700
Subject: [Biojava-l] PDBFileParser and Atom element symbol
In-Reply-To: <mailman.9.1272384003.32055.biojava-l@lists.open-bio.org>
References: <mailman.9.1272384003.32055.biojava-l@lists.open-bio.org>
Message-ID: <002f01cae7d5$a673fcf0$f35bf6d0$@edu>

Since there was a request to be able to access element information, I've
added an Element enum to the org.biojava.bio.structure package that I had
developed for another application.

Each element has a number of properties such as atomic number, mass, min and
max valence, electronegativity, etc. that should be useful.

The AtomImpl class now has a getter and setter for Element.

Also, the PDB parser now populates the Element in the Atom class. By default
the PDB parser tries to parse the element from columns 77-78. As a fallback
for mis-formatted PDB files that don't contain an element column, the
element is parsed from the atom name.
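
The column logic can be sketched as follows. This is not the code committed to BioJava; the class and method names are invented for illustration, the result is a plain string rather than the new Element enum, and the fallback is deliberately simplified (it reads columns 13-14 of the atom name and does not handle four-character hydrogen names).

```java
import java.util.Locale;

final class PdbElementSketch {

    /** Element symbol of an ATOM/HETATM record: columns 77-78 first, atom name as fallback. */
    static String elementSymbol(String line) {
        if (line.length() >= 78) {
            String fromColumns = line.substring(76, 78).trim(); // columns 77-78
            if (!fromColumns.isEmpty()) {
                return normalise(fromColumns);
            }
        }
        if (line.length() >= 14) {
            // Columns 13-14 hold the right-justified element part of the atom name.
            String fromName = line.substring(12, 14).replaceAll("[^A-Za-z]", "");
            if (!fromName.isEmpty()) {
                return normalise(fromName);
            }
        }
        return "";
    }

    /** "CA" becomes "Ca", "c" becomes "C". */
    static String normalise(String symbol) {
        return symbol.substring(0, 1).toUpperCase(Locale.ROOT)
                + symbol.substring(1).toLowerCase(Locale.ROOT);
    }

    /** Builds a 78-column test record; only the atom name and element fields are filled in. */
    static String record(String atomName4, String element2) {
        StringBuilder sb = new StringBuilder("ATOM      1 "); // columns 1-12
        sb.append(atomName4);                                  // columns 13-16
        for (int i = 0; i < 60; i++) {
            sb.append(' ');                                    // columns 17-76 left blank
        }
        sb.append(element2);                                   // columns 77-78
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(elementSymbol(record(" CA ", " C"))); // C-alpha, element column present
        System.out.println(elementSymbol(record("CA  ", "CA"))); // calcium, element column present
        System.out.println(elementSymbol(record(" CA ", "  "))); // C-alpha, fallback to atom name
        System.out.println(elementSymbol(record("CA  ", "  "))); // calcium, fallback to atom name
    }
}
```

The fallback relies on the same justification convention that separates " CA " (C-alpha) from "CA  " (calcium) in the atom name field.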

We'll also add element support for the cif parser soon.

-Peter

________________________________________________
Peter Rose, Ph.D.                         
Scientific Lead
RCSB Protein Data Bank (www.pdb.org)

San Diego Supercomputer Center (SDSC) and
Skaggs School of Pharmacy and Pharmaceutical Sciences

Pharmaceutical Sciences Building
University of California San Diego


-----Original Message-----
From: biojava-l-bounces at lists.open-bio.org
[mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of
biojava-l-request at lists.open-bio.org
Sent: Tuesday, April 27, 2010 9:00 AM
To: biojava-l at lists.open-bio.org
Subject: Biojava-l Digest, Vol 87, Issue 26


Today's Topics:

   1. Re: PDBFileParser and Atom element symbol (Andreas Prlic)
   2. Google Summer of Code - accepted students (Robert Buels)
   3. accepted GSoC projects (Andreas Prlic)
   4. Google Summer of Code - accepted students (Robert Buels)


----------------------------------------------------------------------

Message: 1
Date: Mon, 26 Apr 2010 18:07:53 -0700
From: Andreas Prlic <andreas at sdsc.edu>
Subject: Re: [Biojava-l] PDBFileParser and Atom element symbol
To: Andy Thomas-Cramer <thomascramera at dnastar.com>
Cc: biojava-l at lists.open-bio.org
Message-ID:
	<m2v59a41c431004261807h4440eac5t31871134c0a5d02f at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

Hi Andy

Questions:

> * Is this pattern documented in the PDB specification?
>

see here:
http://www.wwpdb.org/documentation/format23/sect9.html#ATOM


> * If this pattern can be relied on, why are columns 77-78 also dedicated
to
> the element symbol?
>
That is the atom's element symbol (as given in the periodic table), in
contrast to the first name, which contains numbering information.

* Should reliance on the pattern be hidden behind a BioJava method?
>

If you think that is important we could probably provide an enum for all
atom types. There are two categories though: the periodic table symbol and
the one that is related to the position in an amino acid....

Andreas


>
>
>
>  ------------------------------
>
> *From:* andreas.prlic at gmail.com [mailto:andreas.prlic at gmail.com] *On
> Behalf Of *Andreas Prlic
> *Sent:* Friday, April 23, 2010 6:52 PM
> *To:* Andy Thomas-Cramer
> *Cc:* biojava-l at lists.open-bio.org
> *Subject:* Re: [Biojava-l] PDBFileParser and Atom element symbol
>
>
>
> Hi Andy,
>
> you could check with  Atom.getFullname(), which contains the space
> characters from the PDB file:
> e.g Calpha: " CA ", Calcium "CA  "
>
> in addition the parent group of a Calpha atom is usually an AminoAcid and
> for Calciums it is a Hetatom group...
>
> Andreas
>
> On Fri, Apr 23, 2010 at 3:58 PM, Andy Thomas-Cramer <
> thomascramera at dnastar.com> wrote:
>
>
>
> Is there an easy way to identify the type of atom referenced by an Atom
> object?
>
> For example, if Atom.getName() is "CA", is the element calcium or the
> atom carbon alpha?
>
> If not, would it be feasible to add a method providing this in Atom,
> AtomImpl, and parsing it in PDBFileParser, using the columns defined at
> http://www.wwpdb.org/documentation/format32/sect9.html#ATOM?
>
>
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
>
>
>
> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


------------------------------

Message: 2
Date: Mon, 26 Apr 2010 15:02:11 -0700
From: Robert Buels <rmb32 at cornell.edu>
Subject: [Biojava-l] Google Summer of Code - accepted students
To: rmb32 at cornell.edu
Message-ID: <4BD60D63.1040400 at cornell.edu>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Hi all,

I'm pleased to announce the acceptance of OBF's 2010 Google Summer of 
Code students, listed in alphabetical order with their project titles 
and primary mentors:

Mark Chapman (PM Andreas Prlic) - Improvements to BioJava including 
Implementation of Multiple Sequence Alignment Algorithms

Jianjiong Gao (PM Peter Rose) - BioJava Packages for Identification, 
Classification, and Visualization of Posttranslational Modification of 
Proteins

Kazuhiro Hayashi (PM Naohisa Goto) - Ruby 1.9.2 support of BioRuby

Sara Rayburn (PM Christian Zmasek) - Implementing Speciation & 
Duplication Inference Algorithm for Binary and Non-binary Species Tree

Joao Pedro Garcia Lopes Maia Rodrigues (PM Eric Talevich) - Extending 
Bio.PDB: broadening the usefulness of BioPython's Structural Biology module

Jun Yin (PM Chris Fields) - BioPerl Alignment Subsystem Refactoring

Congratulations to our accepted students!

All told, we had 52 applications submitted for the 6 slots (5 originally 
assigned, plus 1 extra) allotted to us by Google.  Proposals were 
extremely competitive: 6 out of 52 translates to an 11.5% acceptance 
rate.  We received a lot of really excellent proposals, the decisions 
were not easy.

Thanks very much to all the students who applied, we very much 
appreciate your hard work.

Here's to a great 2010 Summer of Code, I'm sure these students will do 
some wonderful work.

Rob Buels
OBF GSoC 2010 Administrator


------------------------------

Message: 3
Date: Mon, 26 Apr 2010 22:33:51 -0700
From: Andreas Prlic <andreas at sdsc.edu>
Subject: [Biojava-l] accepted GSoC projects
To: Jianjiong Gao <jianjong.gao at gmail.com>, Mark Chapman
	<chapman at cs.wisc.edu>,	Biojava <biojava-l at lists.open-bio.org>,
	biojava-dev <biojava-dev at lists.open-bio.org>
Cc: "Rose, Peter" <pwrose at ucsd.edu>, Scooter Willis
	<HWillis at scripps.edu>,	Kyle Ellrott <kellrott at ucsd.edu>
Message-ID:
	<u2w59a41c431004262233xe5553c17je23c2b42a3aae81d at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

Dear all,

Google has released the results for GSoC: congratulations to Mark Chapman
and Jianjiong Gao on being accepted to work on the MSA and PTM projects
for BioJava! Let's start the "community bonding" process
( http://en.flossmanuals.net/GSoCMentoring/MindtheGap ). We are all looking
forward to working with you on this during the summer. The mentors and
co-mentors will be Peter Rose for the PTM project and Scooter Willis and
Kyle Ellrott for the MSA project (and me).

I want to thank all of you who submitted proposals or showed interest in
the Google Summer of Code in other ways. We hope you are not too
disappointed if your application was not accepted this time. We had a
large number (52) of applications and the overall quality of the
submissions was very high. We would like to stay in touch with you and we
hope that your interest in BioJava extends beyond the scope of GSoC. There
are a number of ways to contribute: we are always looking for people who
provide code and patches to further improve our library, help out with the
documentation on the Wiki page, or answer questions on the mailing lists.

Let's all give Mark and Jianjiong a warm welcome to the BioJava community.
For those of you who are interested in following the progress of the
projects: as usual, the development-related discussions are going to be on
the biojava-dev list.

Happy coding!

Andreas


------------------------------

_______________________________________________
Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l


End of Biojava-l Digest, Vol 87, Issue 26
*****************************************


From marcel.huntemann at gmail.com  Thu Apr 29 20:49:10 2010
From: marcel.huntemann at gmail.com (Marcel Huntemann)
Date: Thu, 29 Apr 2010 17:49:10 -0700
Subject: [Biojava-l] Error during genbank parsing
Message-ID: <4BDA2906.20801@Gmail.com>

Hi!

I get the following error during the parsing of a genbank file:

Exception in thread "main" org.biojava.bio.BioException: Could not read
sequence
	at
org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
	at gov.doe.jgi.img.pangenomes.Controller.createGeneMap(Controller.java:303)
	at gov.doe.jgi.img.pangenomes.Controller.start(Controller.java:197)
	at gov.doe.jgi.img.pangenomes.Main.createAndStartController(Main.java:105)
	at gov.doe.jgi.img.pangenomes.Main.main(Main.java:35)
Caused by: org.biojava.bio.seq.io.ParseException:

A Exception Has Occurred During Parsing.
Please submit the details that follow to biojava-l at biojava.org or post a
bug report to http://bugzilla.open-bio.org/

Format_object=org.biojavax.bio.seq.io.GenbankFormat
Accession=null
Id=null
Comments=Bad locus line
Parse_block=LOCUS   NC_008711      4597686 bp      DNA circular
17-DEC-2009
Stack trace follows ....


	at
org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:322)
	at
org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
	... 4 more

No matter which GenBank file I use, I always get this error (with a
different LOCUS line, of course). The strange thing is that this used to
work half a year to a year ago. Now I wanted to use my program again and I
always get this error, although I didn't really change anything in that code.
The only difference I can think of since the last time I used it (when it
worked) is that I switched from a 32-bit Linux to a 64-bit Linux machine.
But can that really cause it?

Here's my code and how I use it:

for ( String taxonId : givenTaxonIds )
{
    gbkFile = new File( dirPath + taxonId + gbkSuffix );
    if ( ! gbkFile.exists() )
    {
        logr.fatal( "Couldn't find genbank file for taxonOID " + taxonId +
                "!\nI tried " + gbkFile.getPath() + ", but it doesn't exist!" );
        System.exit( 0 );
    }

    BufferedReader br = new BufferedReader( new FileReader( gbkFile ) );
    Namespace ns = RichObjectFactory.getDefaultNamespace();

    RichSequenceIterator seqs = RichSequence.IOTools.readGenbankDNA( br, ns );
    numberInGenome = 0;
    while ( seqs.hasNext() )
    {
        RichSequence contig = seqs.nextRichSequence();
        // Get genes and their positions
        Set<Feature> features = contig.getFeatureSet();
        positions = new ArrayList<int[]>();
        geneIds = new ArrayList<String>();

        for ( Feature richFeature : features )
        {
            if ( richFeature.getType().equals( "CDS" ) )
            {
                RichLocation loc = (RichLocation) richFeature.getLocation();
                position = new int[3];
                position[0] = loc.getMin();
                position[1] = loc.getMax();
                position[2] = loc.getStrand().intValue();
                Annotation a = richFeature.getAnnotation();
                split = a.getProperty( "note" ).toString().split( "=" );
                geneIds.add( split[1].trim() );
                positions.add( position );
            }
            else if ( richFeature.getType().equals( "gene" ) )
            {
                Annotation a = richFeature.getAnnotation();
                if ( a.containsProperty( "pseudo" ) )
                {
                    RichLocation loc = (RichLocation) richFeature.getLocation();
                    position = new int[3];
                    position[0] = loc.getMin();
                    position[1] = loc.getMax();
                    position[2] = loc.getStrand().intValue();
                    split = a.getProperty( "note" ).toString().split( "=" );
                    geneIds.add( split[1].trim() );
                    positions.add( position );
                }
            }
        }

Thanks 4 the help,
Marcel

P.S.: Also, the info on some of the BioJava pages seems outdated. I got the
latest version from your svn trunk, and the GetStarted page says that
one just has to call ant to build it. But there's no build.xml in the
biojava folder. Instead there's a pom.xml, so I guess you switched to Maven.
I bet a lot of people don't know how to deal with that and have no clue what
to do when the ant command doesn't work...


From narciso at cnpaf.embrapa.br  Fri Apr 30 17:32:02 2010
From: narciso at cnpaf.embrapa.br (Marcelo Goncalves Narciso (Pesquisador))
Date: Fri, 30 Apr 2010 19:32:02 -0200
Subject: [Biojava-l] problems with intallation of biojava in windows 7
In-Reply-To: <20100430184758.M13673@cnpaf.embrapa.br>
References: <20100430184758.M13673@cnpaf.embrapa.br>
Message-ID: <20100430212950.M75279@cnpaf.embrapa.br>

Hi, people,

I need your help.

When I try to install BioJava on Windows 7, this happens:

> C:\Users\narciso\biojava>java -jar biojava-1.7.1-all.jar
> Failed to load Main-Class manifest attribute from
> biojava-1.7.1-all.jar
How can I fix it?

Thanks a lot

Marcelo

From heuermh at acm.org  Thu Apr  1 03:56:42 2010
From: heuermh at acm.org (Michael Heuer)
Date: Wed, 31 Mar 2010 23:56:42 -0400 (EDT)
Subject: [Biojava-l] Reading and writting Fastq files
In-Reply-To: <20100330215047.084f6b00@wp01>
Message-ID: <Pine.GSO.4.44.1003312334350.18726-100000@shell3.shore.net>

xyz wrote:

> Thank you it works, but after I extended the code with
> RichSequence.IOTools.writeFasta(outputFasta, trimSeq, ns,
> fastq.getDescription());
> in order to get also a trimmed fasta file I got the following error:
>
> Fastq2Fasta.java:51: cannot
> find symbol symbol  : method
> writeFasta(java.io.FileOutputStream,java.lang.String,org.biojavax.SimpleNamespace,java.lang.String)
> location: class org.biojavax.bio.seq.RichSequence.IOTools
> RichSequence.IOTools.writeFasta(outputFasta, trimSeq, ns,
> fastq.getDescription()); 1 error

The fastq package has not yet been integrated with biojava core or the
biojavax packages.  If you would like to use RichSequence.IOTools, you
would need to create a RichSequence from each Fastq object before writing.

Something like

import static ...RichSequence.Tools.*;
import static ...RichSequence.IOTools.*;

Fastq fastq = ...;
Namespace namespace = ...;
RichSequence richSequence = createRichSequence(
  namespace,
  fastq.getDescription(),
  fastq.getSequence(),
  DNATools.getDNA());

writeFasta(outputStream, richSequence, namespace);

may work.


> Suggestions:
> 1)
> After I trimmed the fastq files the header information for quality
> is empty
>
> @HWI-EAS406:5:1:0:1390#0/1
> GGGTGATGGCCGCTGCCGATGGCGTCAAAA
> +
> OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
>
> this reduced the size of the files but is it compatible with
> SOAP and TopHat?

Sorry, not sure what you are asking here.


> 2)
> I was using fastq files up to 6 GBytes and I have not run any benchmarks
> with different Buffer/stream combination on big text files and therefore
> I am not sure that is enough to use just FileInputStream or
> FileOutputStream. BioJavaX is using BufferedReader br = new
> BufferedReader(new FileReader()) are there any speed difference?

AbstractFastqReader.read(InputStream) uses a BufferedReader, and all the
other read methods pass through that one.
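
As an illustration of the buffering point above, here is a minimal Java
sketch that reads four-line FASTQ records from an InputStream through a
BufferedReader. It is not the AbstractFastqReader implementation: the class
name BufferedFastqCount and the method countRecords are hypothetical, and
the sketch assumes that every record occupies exactly four unwrapped lines.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of BioJava.
class BufferedFastqCount {

    /** Counts four-line FASTQ records, reading through a BufferedReader. */
    static int countRecords(InputStream in) {
        int records = 0;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.US_ASCII))) {
            String header;
            while ((header = reader.readLine()) != null) {
                if (header.isEmpty()) {
                    continue; // tolerate blank lines between records
                }
                String sequence = reader.readLine();
                String separator = reader.readLine();
                String quality = reader.readLine();
                boolean wellFormed = header.startsWith("@")
                        && sequence != null
                        && separator != null && separator.startsWith("+")
                        && quality != null;
                if (!wellFormed) {
                    throw new IllegalArgumentException(
                            "Malformed FASTQ record near: " + header);
                }
                records++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        // The record quoted earlier in this message.
        String fastq = "@HWI-EAS406:5:1:0:1390#0/1\n"
                + "GGGTGATGGCCGCTGCCGATGGCGTCAAAA\n"
                + "+\n"
                + "OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO\n";
        InputStream in = new ByteArrayInputStream(
                fastq.getBytes(StandardCharsets.US_ASCII));
        System.out.println(countRecords(in));
    }
}
```

Wrapping the stream once at the entry point means callers can pass a plain
FileInputStream without adding their own buffering.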

   michael


From huijieqiao at gmail.com  Fri Apr  2 03:02:37 2010
From: huijieqiao at gmail.com (Huijie Qiao)
Date: Fri, 2 Apr 2010 11:02:37 +0800
Subject: [Biojava-l] A bug in Class "org.biojavax.bio.seq.io.GenbankFormat"
Message-ID: <n2ld081ddc11004012002i5eec2478x81716b3f03a7997@mail.gmail.com>

version 1.7.1

line 361
else if (sectionKey.equals(SOURCE_TAG)) {
      // ignore - can get all this from the first feature

Actually, the content of the SOURCE_TAG and the first feature differ
in some GenBank files.

For example, the example file in
http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchseq_help.html
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=5&rettype=gb

The Source TAG is
SOURCE      Bos taurus (cattle)
  ORGANISM  Bos taurus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Laurasiatheria; Cetartiodactyla; Ruminantia;
            Pecora; Bovidae; Bovinae; Bos.

and the first feature tag is
FEATURES             Location/Qualifiers
     source          1..1136
                     /organism="Bos taurus"
                     /mol_type="mRNA"
                     /db_xref="taxon:9913"
                     /clone="pBB2I"
                     /tissue_type="liver"

I can't get the hierarchy info with the following code:
NCBITaxon taxon = seq.getTaxon();
System.out.println(taxon.getNameHierarchy()); // output is "."


From holland at eaglegenomics.com  Fri Apr  2 07:38:44 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Fri, 2 Apr 2010 08:38:44 +0100
Subject: [Biojava-l] A bug in Class
	"org.biojavax.bio.seq.io.GenbankFormat"
In-Reply-To: <n2ld081ddc11004012002i5eec2478x81716b3f03a7997@mail.gmail.com>
References: <n2ld081ddc11004012002i5eec2478x81716b3f03a7997@mail.gmail.com>
Message-ID: <8319D21D-9548-438A-BCB4-0CB9C5B7F568@eaglegenomics.com>

The parsers don't load the hierarchy from Genbank because it is redundant information separately available from NCBI taxonomy. Also it tends to be buggy and can differ between Genbank files for the same organism.

If you want the hierarchy, you need to use BioJava in conjunction with BioSQL and load the NCBI taxonomy into your BioSQL instance ( http://www.biojava.org/wiki/BioJava:BioJavaXDocs#NCBI_Taxonomy_data ), from where BioJava can then retrieve it using the sample code you show in your email.

thanks,
Richard

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From martin.jones at ed.ac.uk  Fri Apr  2 11:23:21 2010
From: martin.jones at ed.ac.uk (Martin Jones)
Date: Fri, 2 Apr 2010 12:23:21 +0100
Subject: [Biojava-l] A bug in Class
	"org.biojavax.bio.seq.io.GenbankFormat"
In-Reply-To: <8319D21D-9548-438A-BCB4-0CB9C5B7F568@eaglegenomics.com>
References: <n2ld081ddc11004012002i5eec2478x81716b3f03a7997@mail.gmail.com>
	<8319D21D-9548-438A-BCB4-0CB9C5B7F568@eaglegenomics.com>
Message-ID: <v2reb55ec041004020423m7353150enb4631654abb31463@mail.gmail.com>

You can also get the hierarchy directly from the NCBI taxonomy dump...
this is in Groovy but gives you the idea:

HashMap<Integer, TreeNode> taxid2node = [:]
HashMap<Integer, Integer> child2parent = [:]

def nodePattern = ~/^(\d+)\t\|\t(\d+)\t\|\t(.+?)\t\|/


def count=0
new File("/home/martin/nodes.dmp").eachLine{
   line ->
   count++
   def matcher = (line =~ nodePattern)
   if (matcher.matches()){
         Integer myId = matcher[0][1].toInteger()
         Integer parentId = matcher[0][2].toInteger()
         String myRank = matcher[0][3]

         def node = new TreeNode(taxid : myId, rank:myRank)
         taxid2node[(myId)] = node

         child2parent[(myId)] = parentId

    }
}
// do something with the hash
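
One thing that can be done with the child-to-parent map is to walk from a
taxon up to the root to recover its lineage. The following is a minimal
sketch in Java rather than Groovy. The class TaxonLineage and its methods
are hypothetical names, and the taxon IDs in main are toy values, not real
NCBI identifiers. It assumes the nodes.dmp layout matched by the pattern
above (fields separated by tab, pipe, tab) and that the root node lists
itself as its own parent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of BioJava.
class TaxonLineage {

    /** Builds a child -> parent map from nodes.dmp-style lines. */
    static Map<Integer, Integer> parseParents(List<String> lines) {
        Map<Integer, Integer> childToParent = new HashMap<>();
        for (String line : lines) {
            // Fields are separated by "<tab>|<tab>"; the line ends with "<tab>|".
            String[] fields = line.split("\\t\\|\\t?");
            if (fields.length < 2) {
                continue; // not a node line
            }
            try {
                int taxId = Integer.parseInt(fields[0].trim());
                int parentId = Integer.parseInt(fields[1].trim());
                childToParent.put(taxId, parentId);
            } catch (NumberFormatException e) {
                // skip lines whose first two fields are not numeric
            }
        }
        return childToParent;
    }

    /** Returns the IDs from the given taxon up to the root, starting with the taxon itself. */
    static List<Integer> lineage(int taxId, Map<Integer, Integer> childToParent) {
        List<Integer> path = new ArrayList<>();
        Integer current = taxId;
        // The contains() check guards against accidental cycles in the data.
        while (current != null && !path.contains(current)) {
            path.add(current);
            Integer parent = childToParent.get(current);
            if (parent == null || parent.equals(current)) {
                break; // unknown node, or the root pointing at itself
            }
            current = parent;
        }
        return path;
    }

    public static void main(String[] args) {
        // Toy data: 1 is the root, 10 a genus, 20 a species.
        List<String> lines = Arrays.asList(
                "1\t|\t1\t|\tno rank\t|",
                "10\t|\t1\t|\tgenus\t|",
                "20\t|\t10\t|\tspecies\t|");
        System.out.println(lineage(20, parseParents(lines)));
    }
}
```

Mapping each ID in the returned list to a scientific name would need the
names.dmp file from the same dump, which this sketch does not read.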


-Martin



From andreas.prlic at gmail.com  Sat Apr  3 15:08:57 2010
From: andreas.prlic at gmail.com (Andreas Prlic)
Date: Sat, 3 Apr 2010 08:08:57 -0700
Subject: [Biojava-l] Anonymous svn down
Message-ID: <2AF9846C-41A0-4CEB-9589-6DB698B2107A@gmail.com>

Hi,

the anonymous svn server seems to be down again. I have already
contacted support @ obf, but have not yet received a response about when
it will be back up. In the meantime, is anybody volunteering to set up
a fallback mirror at github?

Andreas


From rmb32 at cornell.edu  Sat Apr  3 20:09:27 2010
From: rmb32 at cornell.edu (Robert Buels)
Date: Sat, 03 Apr 2010 13:09:27 -0700
Subject: [Biojava-l] Google Summer of Code is *ON* for OBF projects!
Message-ID: <4BB7A077.4070802@cornell.edu>

Hi all,

Reminder:  GSoC student proposals must be submitted to Google by April 
9th, 19:00 UTC.  That's less than a week away.

Students: you should ALREADY be working with mentors on the project 
mailing lists, they can help you get your proposal into shape.

So far, we have 5 proposals submitted to our org in Google's web app. 
Keep them coming, and let's see some really good ones!

Rob Buels
OBF GSoC 2010 Administrator


From jianjiong.gao at gmail.com  Sun Apr  4 06:33:15 2010
From: jianjiong.gao at gmail.com (Jianjiong Gao)
Date: Sun, 4 Apr 2010 01:33:15 -0500
Subject: [Biojava-l] GSoC project question
Message-ID: <g2zc82264f51004032333hc75e197bwd085f55ce901ea3e@mail.gmail.com>

Hello,

My name is Jianjiong Gao, a graduate student in the Computer Science
Department at the University of Missouri-Columbia. I am very interested in
applying for your GSoC project "Identification and Classification of
Posttranslational Modification of Proteins". This project is closely
related to my dissertation topic, "Bioinformatic analysis and
prediction of phosphorylation and other PTMs." Although I have not
touched the structural part of PTMs until now, I am really interested in
learning and expanding my research in this field.

After reading the project description on the idea page
(http://biojava.org/wiki/Google_Summer_of_Code), I have several
questions regarding the *approach* section:

> 1. Establish a list of known PTMs and write code to locate these PTMs in a 3D protein structure.

Q1: There are many different types of PTMs. Do you have a list of PTMs
of interest? Do you have priorities among the different PTMs?
Q2: Is there any available algorithm to locate the PTMs in a 3D
protein structure? What is the difficulty of this task?
Q3: The PDB file contains annotations of residue modifications such as
HETATM and MODRES. Can we utilize this information for localizing the
PTMs?

> 2. Determine the protein residues that carry PTMs based on distance thresholds.
> 3. Traverse the sugar molecules and establish their link pattern based on connectivity.

Q4: Is this task to determine the types of glycosylation, i.e.,
N-linked glycosylation, O-N-acetylgalactosamine, O-glucose, etc?
Q5: Is there any available algorithm to do this? What is the
difficulty in this task? It looks complicated with so many different
types of glycosylation and structure isomers.

> 4. Present the PTMs as text in a linear notation and 2D graphical representations if time permits.

Q6: Can we use the SMILES format
(http://en.wikipedia.org/wiki/Simplified_molecular_input_line_entry_specification)
here? Or are there better options?

Thanks very much for your time. I am looking forward to hearing from you.

Best Regards,
-JJ


From rmb32 at cornell.edu  Sun Apr  4 04:37:38 2010
From: rmb32 at cornell.edu (Robert Buels)
Date: Sat, 03 Apr 2010 21:37:38 -0700
Subject: [Biojava-l] Reminder: GSoC student applications due April 9,
	19:00 UTC
Message-ID: <4BB81792.8060001@cornell.edu>

Hi all,

Sending this again with a different subject line, just in case.

GSoC student proposals must be submitted to Google through their web 
application by *April 9th, 19:00 UTC*.  That's less than a week away.

Students: you should ALREADY be working with mentors on the project
mailing lists, they can help you get your proposal into shape.

So far, we have 6 proposals submitted to our org in Google's web app.
Keep them coming, and keep them good!

Rob Buels
OBF GSoC 2010 Administrator


From nagendravns at gmail.com  Sun Apr  4 16:12:11 2010
From: nagendravns at gmail.com (nagendra kumar)
Date: Sun, 4 Apr 2010 21:42:11 +0530
Subject: [Biojava-l] how to add api
Message-ID: <m2hb48b10f61004040912s7da77636vce5e2426dae7c936@mail.gmail.com>

Sir, I want to develop a project with BioJava. Please give me details on
how to install the BioJava API on my system.


From chapman at cs.wisc.edu  Sun Apr  4 17:54:59 2010
From: chapman at cs.wisc.edu (Mark Chapman)
Date: Sun, 04 Apr 2010 12:54:59 -0500
Subject: [Biojava-l] how to add api
In-Reply-To: <m2hb48b10f61004040912s7da77636vce5e2426dae7c936@mail.gmail.com>
References: <m2hb48b10f61004040912s7da77636vce5e2426dae7c936@mail.gmail.com>
Message-ID: <4BB8D273.7080601@cs.wisc.edu>

Everything you need is at:
http://biojava.org/wiki/BioJava:Download

On 4/4/2010 11:12 AM, nagendra kumar wrote:
> sir i want bio java develop one project please give me detail how bio java
> api install in system


From anantpossible at gmail.com  Sun Apr  4 17:58:15 2010
From: anantpossible at gmail.com (Anant Jain)
Date: Sun, 4 Apr 2010 23:28:15 +0530
Subject: [Biojava-l] how to add api
In-Reply-To: <m2hb48b10f61004040912s7da77636vce5e2426dae7c936@mail.gmail.com>
References: <m2hb48b10f61004040912s7da77636vce5e2426dae7c936@mail.gmail.com>
Message-ID: <h2pbd096e3c1004041058h48b770c7gfbfe787d2972b141@mail.gmail.com>

On 4/4/10, nagendra kumar <nagendravns at gmail.com> wrote:
>
> sir i want bio java develop one project please give me detail how bio java
> api install in system
>


HI,

To use the BioJava API, download the BioJava jar and perform the
following steps:

1. Extract the jar; you will get some more jars and files.
2. Paste these jars into the following location: "C:\Program
Files\Java\jre6\lib\ext", if your Java install directory is on the C drive.


-- 
Anant Jain
B.Tech Bioinformatics, RHCE


From sacomoto at gmail.com  Tue Apr  6 05:29:23 2010
From: sacomoto at gmail.com (Gustavo Akio Tominaga Sacomoto)
Date: Tue, 6 Apr 2010 02:29:23 -0300
Subject: [Biojava-l] GSoC project on MSA
Message-ID: <w2r6a8f5b081004052229u7d9369f5q38f0f827a0c3e9a6@mail.gmail.com>

Hello,

I'm currently a graduate student at the University of São Paulo (Brazil)
and I'm quite interested in applying for the all-Java MSA project. I'm
already familiar with the multiple sequence alignment problem: I
developed a lossless filter for this problem as my undergraduate final
project. The work is described here
[http://www.almob.org/content/4/1/3] and there is an online version of
the algorithm here
[http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu].

Now, regarding the project, just to make it clear, when you say in the
"straightforward approach for building up the MSA progressively", you
mean the standard dynamic programming approach for pairwise alignment
following the guide tree built in the second step, right?

One last question: should I send my proposal directly to Google's
web app or here first?

Thanks,

Gustavo Sacomoto


From andreas at sdsc.edu  Tue Apr  6 17:46:16 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 6 Apr 2010 10:46:16 -0700
Subject: [Biojava-l] GSoC project on MSA
In-Reply-To: <w2r6a8f5b081004052229u7d9369f5q38f0f827a0c3e9a6@mail.gmail.com>
References: <w2r6a8f5b081004052229u7d9369f5q38f0f827a0c3e9a6@mail.gmail.com>
Message-ID: <l2m59a41c431004061046mdc4c4270r825341932e59dfcc@mail.gmail.com>

Hi Gustavo,

With straightforward I meant that we only have 3 months for this project and
we should not try to solve all problems at the same time. Probably a
realistic approach is to start with trying to keep things modular and simple
(think interfaces and implementations) and stick to standard solutions that
have been shown to work elsewhere. If there is more time in the project one
can then replace some of the implementations with technically more advanced
ones.

Since we are doing things in Java I am interested in having support for
parallelisation wherever possible. Another issue is how to verify that the
created alignments are meaningful. One could e.g. use the biojava structure
modules to calculate protein structure alignments to verify the quality of
the obtained multiple sequence alignments.

All applications have to be made via Google. We are providing comments  on
drafts of proposals and try to work together with applicants to improve the
submissions. Note: The application deadline is soon and speed is important
now.

Andreas



-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From sacomoto at gmail.com  Tue Apr  6 18:53:04 2010
From: sacomoto at gmail.com (Gustavo Akio Tominaga Sacomoto)
Date: Tue, 6 Apr 2010 15:53:04 -0300
Subject: [Biojava-l] GSoC project on MSA
In-Reply-To: <l2m59a41c431004061046mdc4c4270r825341932e59dfcc@mail.gmail.com>
References: <w2r6a8f5b081004052229u7d9369f5q38f0f827a0c3e9a6@mail.gmail.com> 
	<l2m59a41c431004061046mdc4c4270r825341932e59dfcc@mail.gmail.com>
Message-ID: <j2r6a8f5b081004061153la171784dx7fbbf933d7dc7326@mail.gmail.com>

Hello Andreas,

On Tue, Apr 6, 2010 at 2:46 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
> Hi Gustavo,
>
> With straightforward I meant that we only have 3 months for this project and
> we should not try to solve all problems at the same time. Probably a
> realistic approach is to start with trying to keep things modular and simple
> (think interfaces and implementations) and stick to standard solutions that
> have been shown to work elsewhere. If there is more time in the project one
> can then replace some of the implementations with technically more advanced
> ones.

I think my question wasn't very clear. My intention in this project is
to follow the approach (with the three steps) outlined on the project's
page, using the classical progressive alignment heuristic: build the
distance matrix, build the guide tree, and use this tree to
progressively align more sequences together.

What I propose for the third step is a first implementation using the
simpler dynamic programming described in the first CLUSTAL paper
(I think it's from 1988), then incrementally improving the algorithm to
get closer to the one described in the CLUSTALW paper (from 1994). Is this
more or less what you had in mind?

> Since we are doing things in Java I am interested in having support for
> parallelisation wherever possible. Another issue is how to verify that the
> created alignments are meaningful. One could e.g. use the biojava structure
> modules to calculate protein structure alignments to verify the quality of
> the obtained multiple sequence alignments.

About parallel strategies, I think a relatively easy way we could use them
is in the distance matrix construction: we could have several threads
calculating the pairwise alignments for different pairs of sequences in
the set.
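
A rough sketch of that idea in Java, using a fixed thread pool with one
task per pair of sequences. Everything here is an assumption for
illustration: the class ParallelDistanceMatrix is a hypothetical name, and
the distance function is a plain normalised edit distance standing in for
whatever pairwise alignment score the project would actually use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch, not BioJava code.
class ParallelDistanceMatrix {

    /** Placeholder distance: edit distance divided by the longer sequence length. */
    static double distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 0.0 : (double) prev[b.length()] / longest;
    }

    /** Fills a symmetric distance matrix, one pool task per sequence pair. */
    static double[][] compute(List<String> seqs, int threads) {
        int n = seqs.size();
        double[][] matrix = new double[n][n];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                for (int j = i + 1; j < n; j++) {
                    final int row = i;
                    final int col = j;
                    // Each task writes only its own two cells, so no locking is needed.
                    pending.add(pool.submit(() -> {
                        double d = distance(seqs.get(row), seqs.get(col));
                        matrix[row][col] = d;
                        matrix[col][row] = d;
                    }));
                }
            }
            for (Future<?> task : pending) {
                task.get(); // wait for completion and surface any failure
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while computing distances", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Distance task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return matrix;
    }

    public static void main(String[] args) {
        double[][] m = compute(Arrays.asList("ACGT", "ACGA", "TTTT"), 2);
        System.out.println(Arrays.deepToString(m));
    }
}
```

For n sequences this submits n(n-1)/2 tasks; with many short sequences it
would probably pay to batch several pairs into each task.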

Now, measuring alignment quality is a tougher issue. The CLUSTALW
paper doesn't give any way to measure the quality of the result; they
consider a good alignment to be one that is hard to improve by eye (but
they claim that for sufficiently similar sequences, with no pair less than
35% identical, the results are good). Can I do the same as in the CLUSTALW
paper and leave the quality measure to the user? How concerned should
I be with that in this project?

> All applications have to be made via Google. We are providing comments on
> drafts of proposals and try to work together with applicants to improve the
> submissions. Note: The application deadline is soon and speed is important
> now.

I will try to send a proposal draft to this mailing list by tomorrow
to get some feedback from you.

> Andreas
>
>
>
> On Mon, Apr 5, 2010 at 10:29 PM, Gustavo Akio Tominaga Sacomoto
> <sacomoto at gmail.com> wrote:
>>
>> Hello,
>>
>> I'm currently a graduate student at University of São Paulo (Brazil)
>> and I'm quite interested in applying for the all-Java MSA project. I'm
>> already familiar with the multiple sequence alignment problem, I
>> developed a lossless filter for this problem as my undergraduate final
>> project, the work is described here
>> [http://www.almob.org/content/4/1/3] and there is an online version of
>> the algorithm here
>> [http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu].
>>
>> Now, regarding the project, just to make it clear, when you say in the
>> "straightforward approach for building up the MSA progressively", you
>> mean the standard dynamic programming approach for pairwise alignment
>> following the guide tree built in the second step, right?
>>
>> One last question, should I send my proposal direct to the Google's
>> web app or here first?
>>
>> Thanks,
>>
>> Gustavo Sacomoto
>>
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
>
>
> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------
>

Thanks for your help.

gustavo


From andreas at sdsc.edu  Tue Apr  6 21:27:15 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 6 Apr 2010 14:27:15 -0700
Subject: [Biojava-l] GSoC project on MSA
In-Reply-To: <j2r6a8f5b081004061153la171784dx7fbbf933d7dc7326@mail.gmail.com>
References: <w2r6a8f5b081004052229u7d9369f5q38f0f827a0c3e9a6@mail.gmail.com>
	<l2m59a41c431004061046mdc4c4270r825341932e59dfcc@mail.gmail.com>
	<j2r6a8f5b081004061153la171784dx7fbbf933d7dc7326@mail.gmail.com>
Message-ID: <g2r59a41c431004061427kd3c3d8d4j45a253e7e50ad66e@mail.gmail.com>

Hi Gustavo,

In principle I agree with everything; see details below:


> I think my question wasn't very clear. My intention in this project is
> to follow the approach (with the three steps) outlined on the project's
> page, using the classical progressive alignment heuristic: build the
> distance matrix, build the guide tree, and use this tree to
> progressively align more sequences together.
>

yes


>
> What I propose for the third step is a first implementation using the
> (simpler) dynamic programming described in the first CLUSTAL paper
> (I think it's from 1988), then incrementally improving the algorithm to
> get closer to the one described in the CLUSTALW paper (from 1994). Is this
> more or less what you had in mind?
>

yes, sounds good.


>
> About parallel strategies, I think a relatively easy place to use them
> is the distance matrix construction: we could have several threads
> calculating the pairwise alignments for different pairs of sequences in
> the set.
>

Correct. Probably a first implementation would be for a single machine/
multi CPU. More advanced implementations could provide support e.g. for
Map/Reduce, JPPF, or something like that...

> Now, measuring alignment quality is a tougher issue. The CLUSTALW
> paper doesn't give any way to measure the quality of the result; they
> consider a good alignment to be one that is hard to improve by eye (but
> they claim that for sufficiently similar sequences, with no pair less than
> 35% identical, the results are good). Can I do the same as in the CLUSTALW
> paper and leave the quality measure to the user? How concerned should
> I be with that in this project?
>

Getting an overall core algorithm that works should be the priority. The
benchmarking part is not mandatory, but something to keep in mind... I have
plenty of material for that, once we get to that stage...

> I will try to send a proposal draft to this mailing list by tomorrow
> to get some feedback from you.
>

Excellent, looking forward to it.

Andreas

-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From sacomoto at gmail.com  Wed Apr  7 05:29:31 2010
From: sacomoto at gmail.com (Gustavo Akio Tominaga Sacomoto)
Date: Wed, 7 Apr 2010 02:29:31 -0300
Subject: [Biojava-l] GSoC project on MSA
In-Reply-To: <g2r59a41c431004061427kd3c3d8d4j45a253e7e50ad66e@mail.gmail.com>
References: <w2r6a8f5b081004052229u7d9369f5q38f0f827a0c3e9a6@mail.gmail.com> 
	<l2m59a41c431004061046mdc4c4270r825341932e59dfcc@mail.gmail.com> 
	<j2r6a8f5b081004061153la171784dx7fbbf933d7dc7326@mail.gmail.com> 
	<g2r59a41c431004061427kd3c3d8d4j45a253e7e50ad66e@mail.gmail.com>
Message-ID: <q2q6a8f5b081004062229ja299ca51j7d43daa93c155e0b@mail.gmail.com>

Hi Andreas,

My proposal is pasted at the end of this e-mail.

I'm waiting for your feedback.

Thanks,

gustavo


-------------------------------------------------------------

GSoC proposal

Abstract
--------

This project aims to develop an all-Java implementation of a multiple
sequence alignment (MSA) algorithm to be added to the Biojava toolkit,
using the progressive algorithm described in the CLUSTALW paper [1].

The Importance
--------------

Multiple sequence alignment is a frequently performed task in sequence
analysis, with the goal of identifying new members of protein families and
inferring phylogenetic relationships between proteins and genes. At
present there is no Java-only implementation of this algorithm. As
such, a number of existing Java-related bioinformatics
tools and web sites would benefit from this implementation, and
sequence analysis could be more easily performed by the end user.

About Me
--------

I am a graduate student at the University of São Paulo (Brazil). I got my
undergraduate degree from the same university with a major in Computer
Science and a minor in Biology. I have been involved with
Bioinformatics for 5 years, always in sequence analysis and with a
particular interest in the MSA problem. In my undergraduate
final project I developed a lossless filter (pruning algorithm) for
the MSA problem; the work is published in [3] and there is an online
implementation of the algorithm in [4]. Finally, I have experience
with the C, C++, Java, Python and Ruby programming languages, and with
the Git and SVN version control systems.

Project Plan
------------

The project is divided into four main steps. At the end of each step a
completely functional and tested new algorithm will be added to the
Biojava code base. Note that each step depends strongly on
the previous one, so careful testing will be done before moving
to the next step.

The four steps are described below, with an estimated time for
each. Some steps also list extra enhancements, which
will be implemented if there is time remaining
after all steps are completed.

** 1. Study the Biojava pairwise alignment code and update it to be
compliant with Biojava 3.

 The pairwise alignment will play an important role in the MSA
algorithm. This step is also important for me to get used to the
Biojava coding standards and get in touch with the Biojava dev
community.

 ETA: 2 weeks.

** 2. Implement the algorithm to build the distance matrix.

 This is done using the pairwise alignment for each pair of sequences
in the set to be aligned.

 ETA: 1 week.

 EXTRA: Enhance the basic algorithm with a parallel strategy, using
several threads to calculate the pairwise alignments for different
pairs in the sequence set.
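
 One way to turn a finished pairwise alignment into a matrix entry is
an identity-based distance: count identical columns, ignore columns
that contain a gap, and take one minus the identity fraction. The
sketch below assumes the two rows are already aligned and use '-' for
gaps; the class name IdentityDistance is hypothetical and not part of
Biojava:

```java
final class IdentityDistance {

    /**
     * Distance between two rows of a pairwise alignment:
     * 1 - (identical columns / columns compared), skipping gapped columns.
     */
    static double distance(String alignedA, String alignedB) {
        if (alignedA.length() != alignedB.length()) {
            throw new IllegalArgumentException("aligned rows must have equal length");
        }
        int compared = 0;
        int identical = 0;
        for (int i = 0; i < alignedA.length(); i++) {
            char a = alignedA.charAt(i);
            char b = alignedB.charAt(i);
            if (a == '-' || b == '-') {
                continue; // gap positions are excluded from the comparison
            }
            compared++;
            if (a == b) {
                identical++;
            }
        }
        // Assumption: with no comparable column, treat the pair as maximally distant.
        if (compared == 0) {
            return 1.0;
        }
        return 1.0 - (double) identical / compared;
    }
}
```

 Other distance definitions (for example, ones derived from the
alignment score) would fit the same method signature.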

** 3. Implement the algorithm to build the guide tree.

 The guide tree is based on the distance matrix built in the previous
step. The tree construction strategy adopted will be the Neighbor
Joining Algorithm.

 ETA: 2 weeks.
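
 For illustration, a compact sketch of neighbor joining that returns
only the tree topology as a Newick string (branch lengths omitted).
The class and method names are hypothetical and this is not the code
in Biojava's phylo module; ties between candidate pairs are broken by
iteration order:

```java
import java.util.ArrayList;
import java.util.List;

final class NeighborJoiningSketch {

    /** Builds a guide-tree topology from a symmetric distance matrix. */
    static String guideTree(String[] names, double[][] dist) {
        int n = names.length;
        if (n == 0) {
            throw new IllegalArgumentException("at least one sequence is required");
        }
        int total = 2 * n; // room for n leaves plus the n - 2 internal nodes created below
        double[][] d = new double[total][total];
        String[] label = new String[total];
        List<Integer> active = new ArrayList<Integer>();
        for (int i = 0; i < n; i++) {
            label[i] = names[i];
            active.add(i);
            for (int j = 0; j < n; j++) {
                d[i][j] = dist[i][j];
            }
        }
        int next = n;
        while (active.size() > 2) {
            int m = active.size();
            // r[a] = sum of distances from a to every other active node
            double[] r = new double[total];
            for (int a : active) {
                for (int b : active) {
                    r[a] += d[a][b];
                }
            }
            // Pick the pair minimising Q(a,b) = (m - 2) * d(a,b) - r(a) - r(b)
            int bestA = -1;
            int bestB = -1;
            double bestQ = Double.POSITIVE_INFINITY;
            for (int x = 0; x < m; x++) {
                for (int y = x + 1; y < m; y++) {
                    int a = active.get(x);
                    int b = active.get(y);
                    double q = (m - 2) * d[a][b] - r[a] - r[b];
                    if (q < bestQ) {
                        bestQ = q;
                        bestA = a;
                        bestB = b;
                    }
                }
            }
            // Replace the pair by a new internal node u with updated distances
            int u = next++;
            label[u] = "(" + label[bestA] + "," + label[bestB] + ")";
            for (int k : active) {
                if (k == bestA || k == bestB) {
                    continue;
                }
                double v = 0.5 * (d[bestA][k] + d[bestB][k] - d[bestA][bestB]);
                d[u][k] = v;
                d[k][u] = v;
            }
            active.remove(Integer.valueOf(bestA));
            active.remove(Integer.valueOf(bestB));
            active.add(u);
        }
        if (active.size() == 1) {
            return label[active.get(0)] + ";";
        }
        return "(" + label[active.get(0)] + "," + label[active.get(1)] + ");";
    }
}
```

 For the progressive step only the join order matters, which is why
branch lengths are left out here; a weighted scheme would need them.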

** 4. Implement the algorithm for progressive MSA using the guide tree.

 This is certainly the most difficult part of the project, so to make
sure we deliver a fully functional MSA algorithm, a safer
approach will be taken. First, the dynamic
programming algorithm described in [2] will be implemented. Once this
is done successfully and the code is fully integrated into the Biojava
code base, the features described in [1] will be added (and tested)
incrementally in order to implement the full dynamic programming
algorithm.

 ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks.

 EXTRA: Implement some benchmark technique to measure the final
alignment quality.
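
 The overall control flow of this step can be sketched as a post-order
walk over the guide tree, where each internal node merges the
alignments of its two children. Node, ProfileMerger and PAD_RIGHT are
hypothetical names; PAD_RIGHT only pads rows with trailing gaps and is
NOT a real alignment, it merely stands in for the dynamic programming
merge from [2] or [1]:

```java
import java.util.ArrayList;
import java.util.List;

final class ProgressiveSketch {

    /** Guide-tree node: a leaf holds one sequence, an internal node has two children. */
    static final class Node {
        final String sequence;
        final Node left;
        final Node right;

        Node(String sequence) {
            this.sequence = sequence;
            this.left = null;
            this.right = null;
        }

        Node(Node left, Node right) {
            this.sequence = null;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return sequence != null;
        }
    }

    /** Pluggable merge step; a real one would align two profiles by dynamic programming. */
    interface ProfileMerger {
        List<String> merge(List<String> a, List<String> b);
    }

    /** Post-order traversal: align both subtrees first, then merge their results. */
    static List<String> align(Node node, ProfileMerger merger) {
        if (node.isLeaf()) {
            List<String> single = new ArrayList<String>();
            single.add(node.sequence);
            return single;
        }
        List<String> left = align(node.left, merger);
        List<String> right = align(node.right, merger);
        return merger.merge(left, right);
    }

    /** Placeholder merger: concatenates the rows and right-pads them with gaps. */
    static final ProfileMerger PAD_RIGHT = new ProfileMerger() {
        public List<String> merge(List<String> a, List<String> b) {
            List<String> rows = new ArrayList<String>(a);
            rows.addAll(b);
            int width = 0;
            for (String r : rows) {
                width = Math.max(width, r.length());
            }
            List<String> out = new ArrayList<String>();
            for (String r : rows) {
                StringBuilder sb = new StringBuilder(r);
                while (sb.length() < width) {
                    sb.append('-');
                }
                out.add(sb.toString());
            }
            return out;
        }
    };
}
```

 Keeping the merge step behind an interface would let the basic
algorithm be swapped for the full one without touching the traversal.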

References
----------

[1] http://www.ncbi.nlm.nih.gov/pubmed/7984417
[2] http://www.ncbi.nlm.nih.gov/pubmed/3243435
[3] http://www.almob.org/content/4/1/3
[4] http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu




From sma.hmc at gmail.com  Wed Apr  7 07:52:34 2010
From: sma.hmc at gmail.com (Singer Ma)
Date: Wed, 7 Apr 2010 00:52:34 -0700
Subject: [Biojava-l] Questions about Summer of Code Project
Message-ID: <s2w62ed8c081004070052z19ac68c0i2ffa200d65d92d75@mail.gmail.com>

I had previously sent this, but was not part of the mailing list, so I
can only assume it got lost in a spam loop.

I was interested in applying for the All-Java Multiple Sequence
Alignment Google Summer of Code project. I wanted to create a project
plan but had some questions about the package as it stands now.

1. What exactly has changed with the transition to BioJava 3? From
what I've read on the BioJava 3 proposal page, it seems that the
changes are to the organization of the code. Additionally, there are
some new standards to follow. Java 6 usage is desired, but I am unsure
which of the new features could be used in modifying the pairwise
sequence alignments.

2. Is the Neighbor Joining Algorithm really the best for this? Are
other multiple alignment implementations desired? I have implemented
the neighbor joining algorithm (very inefficiently) in Python, and it was
not particularly difficult, so this step seems like it will not take very
long. Additionally, regarding parallelism: I have no experience with it
in Java and will only have some experience with it in C. Will that be
an issue?

3. Is there a specific paper with the exact algorithm that should be
implemented here?

General: Will use cases be provided? Will test data be provided? These
would both be useful in coding the test cases which seem to be coded
first.

Additionally, I have access to my current Windows machine as well as
a Linux machine for testing, but no Mac. In theory, with Java,
if it works on one platform then it works on another, and especially if
it works on Linux it should be fine on Mac, but should I be worried about
strange peculiarities?

Thanks,
Singer Ma
Harvey Mudd College 2011


From ayates at ebi.ac.uk  Wed Apr  7 11:27:27 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 7 Apr 2010 12:27:27 +0100
Subject: [Biojava-l] Anonymous svn down
In-Reply-To: <2AF9846C-41A0-4CEB-9589-6DB698B2107A@gmail.com>
References: <2AF9846C-41A0-4CEB-9589-6DB698B2107A@gmail.com>
Message-ID: <36985994-D7D3-4228-90C7-0C594DA0A706@ebi.ac.uk>

By the looks of things this is quite a simple process to do:

http://github.com/guides/import-from-subversion

http://blog.woobling.org/2009/06/git-svn-abandon.html

http://blog.johngoulah.com/2009/11/migrating-svn-to-git/

The difficult part seems to be providing an SVN -> GitHub user mapping. Apart from that, it's a question of how much space the import will take up.

Andy

On 3 Apr 2010, at 16:08, Andreas Prlic wrote:

> Hi,
> 
> The anonymous svn server seems to be down again. I have already contacted support @ obf, but have not received a response about when it should be back up. In the meantime, is anybody volunteering to set up a fallback mirror at GitHub?
> 
> Andreas
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From Stefan.Bleckmann at uni-duesseldorf.de  Wed Apr  7 12:08:45 2010
From: Stefan.Bleckmann at uni-duesseldorf.de (Stefan Bleckmann)
Date: Wed, 07 Apr 2010 14:08:45 +0200
Subject: [Biojava-l] SubstitutionMatrix
Message-ID: <0DAB7300-372D-43D1-BFC6-2D8DB1C7CCC1@uni-duesseldorf.de>

Hi all!

I have a problem reading the NUC4.2 and 4.4 matrix files with the SubstitutionMatrix class included in BioJava 1.7.1.
A small example:


		File d = new File("/Users/-----/Desktop/NUC");
		FiniteAlphabet alphabet = (FiniteAlphabet) AlphabetManager.alphabetForName("DNA");
		try {
			@SuppressWarnings("unused")
			final SubstitutionMatrix matrix = new SubstitutionMatrix(alphabet,d);
		} catch (NumberFormatException e) {
			e.printStackTrace();
		} catch (NoSuchElementException e) {
			e.printStackTrace();
		} catch (BioException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}


Thrown exception:

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 0
	at java.lang.String.charAt(String.java:686)
	at org.biojava.bio.alignment.SubstitutionMatrix.parseMatrix(SubstitutionMatrix.java:304)
	at org.biojava.bio.alignment.SubstitutionMatrix.<init>(SubstitutionMatrix.java:100)
	at MatrixTest.main(MatrixTest.java:30)


All the BLOSUM matrix files I have downloaded work, so I don't think the problem is a wrong encoding or something similar.
Does anybody have an idea?

Cheers Stefan


From andreas.draeger at uni-tuebingen.de  Wed Apr  7 13:32:23 2010
From: andreas.draeger at uni-tuebingen.de (Andreas Dräger)
Date: Wed, 07 Apr 2010 15:32:23 +0200
Subject: [Biojava-l] SubstitutionMatrix
In-Reply-To: <0DAB7300-372D-43D1-BFC6-2D8DB1C7CCC1@uni-duesseldorf.de>
References: <0DAB7300-372D-43D1-BFC6-2D8DB1C7CCC1@uni-duesseldorf.de>
Message-ID: <20100407153223.20121fzzwyubkr53@webmail.uni-tuebingen.de>

Hi Stefan,

Thank you for this hint. I don't know what the problem is. I tested it
recently and it worked. I'll have a look at it tomorrow and get back
to you with an answer pretty soon!

Cheers
Andreas

Dipl.-Bioinform. Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 Tübingen
Germany

Phone: +49-7071-29-70436
Fax:   +49-7071-29-5091


From holland at eaglegenomics.com  Wed Apr  7 13:48:21 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 7 Apr 2010 14:48:21 +0100
Subject: [Biojava-l] SubstitutionMatrix
In-Reply-To: <20100407153223.20121fzzwyubkr53@webmail.uni-tuebingen.de>
References: <0DAB7300-372D-43D1-BFC6-2D8DB1C7CCC1@uni-duesseldorf.de>
	<20100407153223.20121fzzwyubkr53@webmail.uni-tuebingen.de>
Message-ID: <20ACD602-7575-46DB-AFD7-348AEB37CF68@eaglegenomics.com>

I've found the problem already - the SubstitutionMatrix class has a few inconsistencies in the use of trimmed and untrimmed versions of lines. The guessAlphabet() method in this case is falling over because of an unchecked blank line in the matrix file.

I've submitted a patch to trunk which fixes all the inconsistencies and should also fix this problem with the NUC files.
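
The kind of guard involved can be sketched as follows. This is not the actual patch, and MatrixLineFilter and dataLines are hypothetical names; the sketch also assumes that comment lines in the matrix file start with '#'. The point is to trim each line once and skip empty ones before anything calls charAt(0):

```java
import java.util.ArrayList;
import java.util.List;

final class MatrixLineFilter {

    /** Returns the trimmed, non-blank, non-comment lines of a matrix file's text. */
    static List<String> dataLines(String text) {
        List<String> kept = new ArrayList<String>();
        for (String line : text.split("\r?\n")) {
            String trimmed = line.trim();
            if (trimmed.length() == 0) {
                continue; // blank line: charAt(0) here would throw StringIndexOutOfBoundsException
            }
            if (trimmed.charAt(0) == '#') {
                continue; // assumed comment marker
            }
            kept.add(trimmed);
        }
        return kept;
    }
}
```

Using the trimmed form everywhere after this point avoids mixing trimmed and untrimmed versions of the same line.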



--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From Stefan.Bleckmann at uni-duesseldorf.de  Wed Apr  7 14:01:04 2010
From: Stefan.Bleckmann at uni-duesseldorf.de (Stefan Bleckmann)
Date: Wed, 07 Apr 2010 16:01:04 +0200
Subject: [Biojava-l] SubstitutionMatrix
Message-ID: <512EA47A-6F40-4A38-B69D-5990D273C9DD@uni-duesseldorf.de>

Hi Richard,

Thanks for your fast reply. I found the same solution. Two additional line breaks in the file were the problem, which I didn't see in the editor I used to check the file.


Cheers Stefan


From andreas.prlic at gmail.com  Wed Apr  7 15:13:04 2010
From: andreas.prlic at gmail.com (Andreas Prlic)
Date: Wed, 7 Apr 2010 08:13:04 -0700
Subject: [Biojava-l] Anonymous svn down
In-Reply-To: <36985994-D7D3-4228-90C7-0C594DA0A706@ebi.ac.uk>
References: <2AF9846C-41A0-4CEB-9589-6DB698B2107A@gmail.com>
	<36985994-D7D3-4228-90C7-0C594DA0A706@ebi.ac.uk>
Message-ID: <k2v59a41c431004070813w746f184ch4fa5a2b2b45c0bca@mail.gmail.com>

Hi Andy,

In the meantime, Kyle Ellrott has already set up a first GitHub clone...

http://github.com/biojava/biojava

We are just monitoring it a bit to make sure it works properly...

Is the user mapping important? We have some 50+ users, so that might be
painful...

Andreas



-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From ayates at ebi.ac.uk  Wed Apr  7 15:17:27 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 7 Apr 2010 16:17:27 +0100
Subject: [Biojava-l] Anonymous svn down
In-Reply-To: <k2v59a41c431004070813w746f184ch4fa5a2b2b45c0bca@mail.gmail.com>
References: <2AF9846C-41A0-4CEB-9589-6DB698B2107A@gmail.com>
	<36985994-D7D3-4228-90C7-0C594DA0A706@ebi.ac.uk>
	<k2v59a41c431004070813w746f184ch4fa5a2b2b45c0bca@mail.gmail.com>
Message-ID: <647FD3F8-5222-487C-872F-DF00B693C809@ebi.ac.uk>

Hey Andreas,

The user mapping file only matters if we want a coherent link between our SVN users and those who have a GitHub account. For example, any commit of mine appears as ayates; however, it would probably be more useful to link to my GitHub user, since that would have more information about what I'm doing with the repo, e.g. writing some snazzy new BJ3 code :).

Andy


-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From andreas at sdsc.edu  Wed Apr  7 19:12:27 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 7 Apr 2010 12:12:27 -0700
Subject: [Biojava-l] GSoC project on MSA
In-Reply-To: <q2q6a8f5b081004062229ja299ca51j7d43daa93c155e0b@mail.gmail.com>
References: <w2r6a8f5b081004052229u7d9369f5q38f0f827a0c3e9a6@mail.gmail.com>
	<l2m59a41c431004061046mdc4c4270r825341932e59dfcc@mail.gmail.com>
	<j2r6a8f5b081004061153la171784dx7fbbf933d7dc7326@mail.gmail.com>
	<g2r59a41c431004061427kd3c3d8d4j45a253e7e50ad66e@mail.gmail.com>
	<q2q6a8f5b081004062229ja299ca51j7d43daa93c155e0b@mail.gmail.com>
Message-ID: <q2r59a41c431004071212j3697b470g448df70d9b239993@mail.gmail.com>

Hi Gustavo,

Here are my $0.02:

* For some of your steps there is already code available in BioJava.
It might be good to take a look at what is already there (look at
the alignment and phylo modules for dynamic programming and
Neighbour-Joining).

* What about risks? Where do you expect difficulties, and how would you work
around them?

* Step 4: Can you add more details? How do you plan to approach this?
E.g. CLUSTALW has a number of rules implemented at this stage. Do you
plan to support multiple rules as well, and how would you do this technically?
Something nice would be the possibility to use structure alignments to
guide the sequence alignments (structure module).

Andreas


> -------------------------------------------------------------
>
> GSoC proposal
>
> Abstract
> --------
>
> This project aims to develop an all-Java implementation of a multiple
> sequence alignment (MSA) algorithm to be added to the Biojava toolkit,
> using the progressive algorithm described in the CLUSTALW paper [1].
>
> The Importance
> --------------
>
> Multiple sequence alignment is a frequently performed task in sequence
> analysis with the goal to identify new members of protein families and
> infer phylogenetic relationships between proteins and genes. At the
> present there is no Java-only implementation for this algorithm. As
> such the number of already existing and Java related BioInformatics
> tools and web sites would benefit from this implementation and
> sequence analysis could be more easily performed by the end-user.
>
> About Me
> --------
>
> I am a graduate student at University of S?o Paulo (Brazil), I got my
> undergraduate degree from the same university with a major in Computer
> Science and a minor in Biology. I have been involved with
> Bioinformatics for 5 years, always with sequence analysis with
> particular interest in the MSA problem. Also, in my undergraduate
> final project I developed a lossless filter (pruning algorithm) for
> the MSA problem, the work is published in [3] and there is an online
> implementation of the algorithm in [4]. Finally, I have experience
> with the C, C++, Java, Python and Ruby programming languages; Git and
> SVN version control systems.
>
> Project Plan
> ------------
>
> The project is divided in four main steps, at the end of each step a
> completely functional and bug-free new algorithm will be added to the
> Biojava code base. It should be noticed that each step has a strong
> dependence on the previous one, so before move to the next step a
> careful testing will be done.
>
> The four steps are described below, estimated times for accomplishment
> of each step are also given and in some steps extra enhancements are
> described, they will be implemented if there is some time remaining
> after all steps are completed.
>
> ** 1. Study the Biojava pairwise alignment code and update it to be
> compliant with Biojava 3.
>
> ?The pairwise alignment will play an important role in the MSA
> algorithm. This step is also important for me to get used to the
> Biojava coding standards and get in touch with the Biojava dev
> community.
>
> ?ETA: 2 weeks.
>
> ** 2. Implement the algorithm to build the distance matrix.
>
> ?This is done using the pairwise alignment for each pair of sequence
> in the set to be aligned.
>
> ?ETA: 1 week.
>
> ?EXTRA: Enhance the basic algorithm to use parallel strategies, use
> several threads to calculate the pairwise alignment for different
> pairs in the sequence set.
>
> ** 3. Implement the algorithm to build the guide tree.
>
> ?The guide tree is based on the distance matrix built in the last
> step, the tree construction strategy adopted will be the Neighbor
> Joining Algorithm.
>
> ?ETA: 2 weeks.
>
> ** 4. Implement the algorithm for progressive MSA using the guide tree.
>
> ?This is certainly the most difficult part of the project, so to make
> sure we are going to deliver a fully functional MSA algorithm, a safer
> approach is going to be taken. In the first place, a dynamic
> programming algorithm described in [2] will be implemented. Once this
> get successfully done and the code fully integrated to the Biojava
> code base, the features described in [1] are going to be incrementally
> added (and tested) in order to implement the full dynamic programming
> algorithm.
>
> ?ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks.
>
> ?EXTRA: Implement some benchmark technique to measure the final
> alignment quality.
>
> References
> ----------
>
> [1] http://www.ncbi.nlm.nih.gov/pubmed/7984417
> [2] http://www.ncbi.nlm.nih.gov/pubmed/3243435
> [3] http://www.almob.org/content/4/1/3
> [4] http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu
>
>
>
> On Tue, Apr 6, 2010 at 6:27 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
>> Hi Gustavo,
>>
>> In principle I agree to all, see details below:
>>
>>
>> I think my question wasn't very clear, my intention in this project is
>>>
>>> to follow the approach (with the tree steps) outlined in the project's
>>> page. Using the classical progressive alignment heuristic: build the
>>> distance matrix, build the guide tree and using this tree
>>> progressively align more sequences together.
>>
>> yes
>>
>>>
>>> What I propose for the third step is a first implementation using the
>>> (more simple) dynamic programming described in the first CLUSTAL paper
>>> (I thinks it's from 1988) and incrementally improving the algorithm to
>>> get closer to the one described in CLUSTALW paper (from 1994). Is this
>>> more or less what you had in mind?
>>
>> yes, sounds good.
>>
>>>
>>> About parallel strategies, I think a relative easy way we could use it
>>> is in the distance matrix construction, we could have several threads
>>> calculating the pairwise alignment for different pairs of sequence in
>>> the set.
>>
>> Correct. Probably a first implementation would be for a single machine/
>> multi CPU. More advanced implementations could provide support e.g. for
>> Map/Reduce, JPPF, or something like that...
>>
>>> Now, the alignment quality measure is a tougher issue. The CLUSTALW
>>> paper doesn't give any way to measure the quality of the result; they
>>> consider a good alignment to be one that is hard to improve by eye (but
>>> they claim that for sufficiently similar sequences, with no pair less than
>>> 35% identical, the results are good). Can I do the same as in the CLUSTALW
>>> paper and leave the quality measure to the user? How concerned should
>>> I be with that in this project?
>>
>> Getting an overall core algorithm that works should be the priority. The
>> benchmarking part is not mandatory, but something to keep in mind... I have
>> plenty of material for that, once we get to that stage...
>>
>>> I will try to send a proposal draft to this mailing list by tomorrow
>>> to get some feedback from you.
>>
>> Excellent, looking forward to it.
>>
>> Andreas
>>
>> --
>> -----------------------------------------------------------------------
>> Dr. Andreas Prlic
>> Senior Scientist, RCSB PDB Protein Data Bank
>> University of California, San Diego
>> (+1) 858.246.0526
>> -----------------------------------------------------------------------
>>
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From andreas at sdsc.edu  Wed Apr  7 19:30:19 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 7 Apr 2010 12:30:19 -0700
Subject: [Biojava-l] Questions about Summer of Code Project
In-Reply-To: <s2w62ed8c081004070052z19ac68c0i2ffa200d65d92d75@mail.gmail.com>
References: <s2w62ed8c081004070052z19ac68c0i2ffa200d65d92d75@mail.gmail.com>
Message-ID: <n2y59a41c431004071230yfd1ef60eq900e5cbc6cd13401@mail.gmail.com>

Hi Singer,

> I had previously sent this, but was not part of the mailing list, so I
> can only assume it got lost in a spam loop.

You need to be subscribed in order to be able to post...

> I was interested in applying for the All-Java Multiple Sequence
> Alignment Google Summer of Code project.

Several students have expressed interest in this project.
Depending on the funding situation, at most one will be
able to work on it... There is also a second BioJava-related project, or
you could propose your own ideas...
http://biojava.org/wiki/Google_Summer_of_Code


> I wanted to create a project
> plan but had some questions about the package as it stands now.
>
> 1. What exactly has changed with the transition to BioJava 3? From
> what I've read on the BioJava 3 proposal page, it seems like that the
> changes are to the organization of the code. Additionally there are
> some new standards to follow. Java 6 usage is desired, but I am unsure
> of what of the new features could be used in modifying pairwise
> sequence alignments.

BioJava is more modular in version 3. There is a new module for
working with sequences. The current alignment module is still based on
the old version of BioJava though.

>
> 2. Is the Neighbor Joining Algorithm really the best for this? Are
> other multiple alignments implementations desired? I have implemented
> the neighbor joining algorithm very inefficiently in python, it was
> not particularly difficult.

NJ is a clustering technique, but there are also others.
http://en.wikipedia.org/wiki/Neighbor-joining
Another online lecture that might be useful is:
http://www.mbio.ncsu.edu/MB451/lecture/trees/lecture.html
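
For orientation, the selection step at the heart of neighbor joining can be sketched in a few lines of plain Java. This is only an illustration, not BioJava code: the class name NeighborJoiningStep, the method pickPair and the toy distance values are assumptions made up for this example. It computes Q(i, j) = (n - 2) * d(i, j) - sum_k d(i, k) - sum_k d(j, k) for every pair and returns the pair with the smallest Q, which is the pair that would be joined first.

```java
/** One neighbor-joining selection step: find the pair (i, j) that minimises Q(i, j). */
final class NeighborJoiningStep {

    /**
     * Returns {i, j} with i < j minimising
     * Q(i, j) = (n - 2) * d[i][j] - sum_k d[i][k] - sum_k d[j][k].
     * The matrix d is assumed to be square and symmetric with a zero diagonal, n >= 3.
     */
    static int[] pickPair(double[][] d) {
        int n = d.length;
        // Row sums are reused for every pair, so compute them once.
        double[] rowSum = new double[n];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                rowSum[i] += d[i][k];
            }
        }
        int bestI = -1;
        int bestJ = -1;
        double bestQ = Double.POSITIVE_INFINITY;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double q = (n - 2) * d[i][j] - rowSum[i] - rowSum[j];
                if (q < bestQ) {
                    bestQ = q;
                    bestI = i;
                    bestJ = j;
                }
            }
        }
        return new int[] {bestI, bestJ};
    }

    public static void main(String[] args) {
        // Toy distances between five taxa (illustrative values only).
        double[][] d = {
            {0, 5, 9, 9, 8},
            {5, 0, 10, 10, 9},
            {9, 10, 0, 8, 7},
            {9, 10, 8, 0, 3},
            {8, 9, 7, 3, 0}
        };
        int[] pair = pickPair(d);
        System.out.println("join taxa " + pair[0] + " and " + pair[1]);
    }
}
```

A full guide-tree builder would repeat this step, replacing the joined pair by a new node and updating the distances, until only two nodes remain.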

> This step seems like it will not take very
> long. Additionally, parallelism, I have no experience with parallelism
> in Java and will only have some experience with it in C, will that be
> an issue?

I have never written multithreaded code in C, but I would guess it is
much, much easier in Java...

> 3. Is there a specific paper with the exact algorithm that should be
> implemented here?

We have only 3 months for this project, so having a modular core
algorithm that can be extended would be a priority. I recommend
reading the ClustalW, T-Coffee and MUSCLE papers.

> General: Will use cases be provided? Will test data be provided? These
> would both be useful in coding the test cases which seem to be coded
> first.

I can provide plenty of data for that.


> Additionally, I have access to my current windows machine as well as
> as Linux machine for testing, but no Mac. While in theory with java,
> if it works on one, then it works on another, and especially with if
> it works on Linux, it should be fine on Mac, should I be worried about
> strange peculiarities?

In my experience, Java works fine on any platform. There might
be issues with user interfaces that require testing, but we are not
going to do user interfaces here...

Andreas


>
> Thanks,
> Singer Ma
> Harvey Mudd College 2011
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From andreas.draeger at uni-tuebingen.de  Thu Apr  8 07:13:17 2010
From: andreas.draeger at uni-tuebingen.de (Andreas Dräger)
Date: Thu, 08 Apr 2010 09:13:17 +0200
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava
	alignment lead"]
Message-ID: <4BBD820D.9070200@uni-tuebingen.de>

Hi all,

This e-mail is just for your information about somebody new, who'd like 
to contribute to our project.

Cheers
Andreas


Subject: Re: Fwd: Proposing a project on "Biojava alignment lead"
From: Andreas Dräger <andreas.draeger at uni-tuebingen.de>
Date: Wed, 07 Apr 2010 09:27:13 +0200
To: Cai Shaojiang <caishaojiang at gmail.com>

Hi Cai Shaojiang,

Thank you for your e-mail! I don't know what happened to the e-mail list. 
Sometimes it takes a while due to the spam filters, I guess.

 > I am a PhD student from National University of Singapore. My major
 > research area is local alignment algorithms and data structures for SNP
 > identification. And I have used Java and Eclipse for years for software
 > development. I am very interested in your GSoC programme. I find that
 > there is a module called "biojava-alignment lead" whose mentor is you. I
 > want to propose a new project on this module. I have several questions
 > about this module.

Yes, that's me. So great to get your support.

 > 1. It seems that pairwise alignment is to find similarity between two
 > short sequences. Existing pairwise alignment is based on dynamic
 > programming, is it the Smith-Waterman algorithm?

So, currently, BioJava contains three different alignment approaches. 
There are two deterministic algorithms, i.e., Smith-Waterman for local 
alignment and Needleman-Wunsch for global alignment. Third, there is the 
possibility to apply Hidden Markov Models for alignment. An example of 
the latter approach should be in the cookbook.
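
To make the difference between the two deterministic approaches concrete, here is a minimal sketch of the Smith-Waterman recurrence in plain Java. It is not the BioJava implementation: the class name LocalAlignmentScore, the linear gap penalty and the scoring values are assumptions chosen for illustration, and only the score is computed (no traceback). The one thing that makes it local rather than global is that a cell is never allowed to drop below zero; removing that floor and initialising the first row and column with gap penalties would give the Needleman-Wunsch score instead.

```java
/** Smith-Waterman local alignment score with a linear gap penalty (score only, no traceback). */
final class LocalAlignmentScore {

    /**
     * @param match    score for identical symbols (positive)
     * @param mismatch score for differing symbols (negative)
     * @param gap      score for one gap position (negative)
     */
    static int score(String a, String b, int match, int mismatch, int gap) {
        // h[i][j] = best score of a local alignment ending at a[i-1], b[j-1];
        // the first row and column stay 0.
        int[][] h = new int[a.length() + 1][b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch;
                int diag = h[i - 1][j - 1] + sub;
                int up = h[i - 1][j] + gap;
                int left = h[i][j - 1] + gap;
                // Local alignment: a cell never drops below zero.
                h[i][j] = Math.max(0, Math.max(diag, Math.max(up, left)));
                best = Math.max(best, h[i][j]);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // The shared region "ACG" gives three matches at +2 each.
        System.out.println(score("TTACGTTT", "GGACGGG", 2, -1, -2));
    }
}
```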

 > 2. What is the exact task of "refactoring of underlying data structures"?

Yes, this is something I already did last week, but it could still be 
improved. The problem was that the alignment algorithms actually 
produced a kind of string that looks similar to the output of BLAST. 
This string contained the score, the computation time, the length of the 
alignment etc. People, however, wanted to perform higher-level 
computation on the score value or evaluate some of the other 
information. Now, the alignment produces a data structure that 
contains all the information and can, in addition, also produce 
such a BLAST-like output. There is, however, still the following 
problem: the data structure requires both sequences in the pairwise 
alignment to have an identical length. In the case of local alignment this 
is especially stupid, because gaps are inserted to pad the 
sequences. The data structure then tries to keep the old sequence 
coordinates, with the effect that the numbers "query start", 
"query end", "subject start", and "subject end" are needed to shift 
the sequences against each other when displaying the output. So you 
cannot simply print the sequences one below the other; you first have to 
shift them. Please check out the latest version of this package via 
anonymous svn and have a look ;-)
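
As a rough illustration of the idea (a result object that carries the values and only renders text on demand), consider the following sketch. It is not the class in the BioJava trunk: the name AlignmentResult, its fields and the output layout are all hypothetical. It stores only the aligned region plus the 1-based start coordinates, so neither sequence has to be padded with gaps to a common length, and the end coordinates are derived from the number of non-gap symbols.

```java
/** Hypothetical immutable result of a pairwise alignment; not a BioJava class. */
final class AlignmentResult {
    private final int score;
    private final int queryStart;        // 1-based, inclusive
    private final int subjectStart;      // 1-based, inclusive
    private final String alignedQuery;   // aligned region only, '-' marks a gap
    private final String alignedSubject; // same length as alignedQuery

    AlignmentResult(int score, int queryStart, int subjectStart,
                    String alignedQuery, String alignedSubject) {
        if (alignedQuery.length() != alignedSubject.length()) {
            throw new IllegalArgumentException("aligned rows must have equal length");
        }
        this.score = score;
        this.queryStart = queryStart;
        this.subjectStart = subjectStart;
        this.alignedQuery = alignedQuery;
        this.alignedSubject = alignedSubject;
    }

    int getScore() { return score; }
    int getQueryStart() { return queryStart; }
    int getSubjectStart() { return subjectStart; }
    int getQueryEnd() { return queryStart + residues(alignedQuery) - 1; }
    int getSubjectEnd() { return subjectStart + residues(alignedSubject) - 1; }

    /** Counts the symbols in a row that are not gap characters. */
    private static int residues(String row) {
        int count = 0;
        for (int i = 0; i < row.length(); i++) {
            if (row.charAt(i) != '-') {
                count++;
            }
        }
        return count;
    }

    /** BLAST-like text, derived from the fields rather than stored. */
    String toBlastLikeString() {
        return "Score = " + score + "\n"
            + "Query  " + queryStart + " " + alignedQuery + " " + getQueryEnd() + "\n"
            + "Sbjct  " + subjectStart + " " + alignedSubject + " " + getSubjectEnd() + "\n";
    }

    public static void main(String[] args) {
        AlignmentResult r = new AlignmentResult(5, 10, 3, "AC-GT", "ACTGT");
        System.out.print(r.toBlastLikeString());
    }
}
```

Code that needs the score calls getScore() directly, while anything that wants the familiar text view calls toBlastLikeString(); the two rows can be printed one below the other without any shifting.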

 > 3. My existing research area is aiming to deal with aligning short
 > reads (10s~100s bp) against extremely long sequences (e.g., human
 > genome). As far as I know, there are no such alignment tools
 > implemented in Java. Would you consider this direction?

See, this would be very nice to include. But this requires that we no 
longer fill the short sequence with many, many gap symbols (just a waste 
of memory), but improve the data structure. There is already an 
UnequalLengthAlignment (just a data structure, no algorithm) and I think 
we could use this as a starting point. Then your algorithm would only need to 
produce such a data structure and this would be fine.

 > 4. It seems that the existing tools just lack some
 > refactoring and representation interfaces. Any more underlying tasks?

Hm. Yes: With the release of BioJava 3 data structures have changed 
again. So maybe there's also some adaptation to the new structure required.

 > I have been keeping an eye on GSoC since last month, but am sorry to find out
 > that I sent the initial email to the mailing list before I subscribed to it...

Ok. Sounds good. Thanks for your interest. So I suggest: Download the 
latest trunk, have a look, play around and if you can improve something 
we'll put it into the trunk and write your name into the authors' tag.

Cheers
Andreas

-- 
Dipl.-Bioinform. Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 Tübingen
Germany

Phone: +49-7071-29-70436
Fax:   +49-7071-29-5091


From ayates at ebi.ac.uk  Thu Apr  8 10:23:06 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Thu, 8 Apr 2010 11:23:06 +0100
Subject: [Biojava-l] Questions about Summer of Code Project
In-Reply-To: <n2y59a41c431004071230yfd1ef60eq900e5cbc6cd13401@mail.gmail.com>
References: <s2w62ed8c081004070052z19ac68c0i2ffa200d65d92d75@mail.gmail.com>
	<n2y59a41c431004071230yfd1ef60eq900e5cbc6cd13401@mail.gmail.com>
Message-ID: <7EE9A2D1-0DC6-4D52-BC49-0D219AB0D88E@ebi.ac.uk>

Hi Singer,

To add a bit more information to Andreas' comments: Java has a very mature concurrency library (java.util.concurrent), which was introduced in version 1.5. BioJava is a Java 1.6 project, so I would expect any concurrent code to use it. Extensions are available, most notably the Google Guava project, the Actor model found in Scala (pure Java implementations are also available) and the Map/Reduce paradigm first described in a white paper by Google. The big rules about concurrency are:

* Mutable objects are the work of the devil & should be avoided
* Tasks & Futures are quite lightweight things to produce; threads are not
* Multiple tasks can be given to a queue to be processed by a number of threads in a pool
* Assume a non-linear execution pipeline and attempt to pass messages/jobs into queues when data is processed
* Assume that things will fail
* Write your program with a view to be concurrent; do not force concurrency on an already written program

Concurrent programs are very hard to write and normally fail because what they attempt to do is either too complex or too simple. Getting the balance right is hard but doable. I can also recommend Brian Goetz's Java Concurrency in Practice (http://www.javaconcurrencyinpractice.com/). 
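
As a rough sketch of how these rules look in code, the example below fills a pairwise distance matrix using a fixed thread pool from java.util.concurrent. The class name ParallelDistanceMatrix and the toy distance function (mismatches over the shared prefix plus the length difference) are assumptions for illustration only; a real implementation would call a pairwise aligner there. Each pair becomes a small Callable over immutable Strings, the tasks are queued on the pool, and every task returns its value through a Future instead of writing into shared state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelDistanceMatrix {

    /** Toy stand-in for a pairwise alignment: mismatches plus length difference. */
    static int distance(String a, String b) {
        int shared = Math.min(a.length(), b.length());
        int d = Math.abs(a.length() - b.length());
        for (int i = 0; i < shared; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                d++;
            }
        }
        return d;
    }

    static int[][] compute(List<String> seqs, int threads)
            throws InterruptedException, ExecutionException {
        int n = seqs.size();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // One lightweight task per pair; tasks only read immutable Strings.
            List<Future<Integer>> futures = new ArrayList<Future<Integer>>();
            for (int i = 0; i < n; i++) {
                for (int j = i + 1; j < n; j++) {
                    final String a = seqs.get(i);
                    final String b = seqs.get(j);
                    futures.add(pool.submit(new Callable<Integer>() {
                        public Integer call() {
                            return distance(a, b);
                        }
                    }));
                }
            }
            // Collect on the calling thread, in submission order.
            int[][] matrix = new int[n][n];
            int next = 0;
            for (int i = 0; i < n; i++) {
                for (int j = i + 1; j < n; j++) {
                    int d = futures.get(next++).get(); // rethrows task failures
                    matrix[i][j] = d;
                    matrix[j][i] = d;
                }
            }
            return matrix;
        } finally {
            pool.shutdown(); // always release the worker threads
        }
    }

    public static void main(String[] args) throws Exception {
        int[][] m = compute(Arrays.asList("ACGT", "ACGA", "TCGA"), 2);
        System.out.println(Arrays.deepToString(m));
    }
}
```

Because a failed task surfaces as an ExecutionException from Future.get(), the "assume that things will fail" rule is handled in one place, on the thread that collects the results.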

Andy


-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From sma.hmc at gmail.com  Thu Apr  8 10:38:41 2010
From: sma.hmc at gmail.com (Singer Ma)
Date: Thu, 8 Apr 2010 03:38:41 -0700
Subject: [Biojava-l] Questions about Summer of Code Project
In-Reply-To: <7EE9A2D1-0DC6-4D52-BC49-0D219AB0D88E@ebi.ac.uk>
References: <s2w62ed8c081004070052z19ac68c0i2ffa200d65d92d75@mail.gmail.com>
	<n2y59a41c431004071230yfd1ef60eq900e5cbc6cd13401@mail.gmail.com>
	<7EE9A2D1-0DC6-4D52-BC49-0D219AB0D88E@ebi.ac.uk>
Message-ID: <h2k62ed8c081004080338xb1ee5a27k8253c38bb2b13fec@mail.gmail.com>

So, my questions were generated from looking past just the Summer of
Code proposal and into what BioJava 3 is supposed to do. BioJava 3, as
part of its proposal, lists:

Make methods parallel-aware and take advantage of this when possible,
and provide a global variable to specify how much parallelisation can
take place.

on http://www.biojava.org/wiki/BioJava3_Proposal

How important is this to incorporate into the Summer of Code project?
Obviously anything that is already concurrent can remain that way, but
for the new code in multiple sequence alignment, does this need to be
parallel-aware? Clearly, in a multiple sequence alignment, certain
things can be made parallel such as the initial distance matrix
calculation, parts of the neighbor joining algorithm, etc. If I were
to contribute, I would want to uphold the agreed upon standards as
much as possible. I am just unsure of my capability to make multiple
sequence alignment parallel-aware.

Singer



From ayates at ebi.ac.uk  Thu Apr  8 10:46:15 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Thu, 8 Apr 2010 11:46:15 +0100
Subject: [Biojava-l] Questions about Summer of Code Project
In-Reply-To: <h2k62ed8c081004080338xb1ee5a27k8253c38bb2b13fec@mail.gmail.com>
References: <s2w62ed8c081004070052z19ac68c0i2ffa200d65d92d75@mail.gmail.com>
	<n2y59a41c431004071230yfd1ef60eq900e5cbc6cd13401@mail.gmail.com>
	<7EE9A2D1-0DC6-4D52-BC49-0D219AB0D88E@ebi.ac.uk>
	<h2k62ed8c081004080338xb1ee5a27k8253c38bb2b13fec@mail.gmail.com>
Message-ID: <91C9DF16-E6EF-4B7A-ADC4-E781275514EB@ebi.ac.uk>

Ahhh okay. So when we wrote this section it was with a view towards being able to do things in a concurrent manner as & when that framework appears. BioJava3 is still in an incubation phase; a lot of code is in place but we are all having to do this along with work commitments (which in my case is working on a Perl project so my work/BJ contributions are very limited). 

Anyway, to go back to the question about being "framework" standard: the MSA algorithm would be the first case we would have to make concurrent (as far as I am aware, but Scooter is a better person to confirm this), so the framework for building a concurrent application would come from this project. If the code is written using the standard concurrent library interfaces, then it should be possible to transplant it into any concurrent Java framework, and that's really the important thing here.
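
One way to read "written using the standard concurrent library interfaces" is that the algorithm never creates threads itself and only depends on the ExecutorService interface handed to it. The sketch below illustrates that under stated assumptions: PairwiseScorer, scoreAll and the toy sharedPositions scoring are hypothetical names invented for this example, not part of BioJava. The caller decides how much parallelism is available by choosing the executor, which is also one possible way to honour a global parallelisation setting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical scorer that depends only on the ExecutorService interface. */
final class PairwiseScorer {
    private final ExecutorService executor; // supplied by the caller, never created here

    PairwiseScorer(ExecutorService executor) {
        this.executor = executor;
    }

    /** Toy score: number of positions where both strings carry the same symbol. */
    static int sharedPositions(String a, String b) {
        int count = 0;
        for (int i = 0; i < Math.min(a.length(), b.length()); i++) {
            if (a.charAt(i) == b.charAt(i)) {
                count++;
            }
        }
        return count;
    }

    /** Scores every query against the target; results keep the order of the queries. */
    List<Integer> scoreAll(List<String> queries, final String target)
            throws InterruptedException, ExecutionException {
        List<Callable<Integer>> tasks = new ArrayList<Callable<Integer>>();
        for (final String query : queries) {
            tasks.add(new Callable<Integer>() {
                public Integer call() {
                    return sharedPositions(query, target);
                }
            });
        }
        List<Integer> scores = new ArrayList<Integer>();
        // invokeAll returns the futures in the same order as the task list.
        for (Future<Integer> future : executor.invokeAll(tasks)) {
            scores.add(future.get());
        }
        return scores;
    }

    public static void main(String[] args) throws Exception {
        List<String> queries = Arrays.asList("ACGT", "ACGA", "TTTT");
        // The same algorithm runs unchanged on one thread or on a pool.
        ExecutorService single = Executors.newSingleThreadExecutor();
        ExecutorService pooled = Executors.newFixedThreadPool(4);
        try {
            System.out.println(new PairwiseScorer(single).scoreAll(queries, "ACGT"));
            System.out.println(new PairwiseScorer(pooled).scoreAll(queries, "ACGT"));
        } finally {
            single.shutdown();
            pooled.shutdown();
        }
    }
}
```

Any framework that can provide an ExecutorService could be plugged in without touching the scoring code.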

Andy


-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From mitlox at op.pl  Thu Apr  8 11:30:13 2010
From: mitlox at op.pl (xyz)
Date: Thu, 8 Apr 2010 21:30:13 +1000
Subject: [Biojava-l] Reading and writting Fastq files
In-Reply-To: <Pine.GSO.4.44.1003312334350.18726-100000@shell3.shore.net>
References: <20100330215047.084f6b00@wp01>
	<Pine.GSO.4.44.1003312334350.18726-100000@shell3.shore.net>
Message-ID: <20100408213013.63a99b8c@wp01>

On Wed, 31 Mar 2010 23:56:42 -0400 (EDT)
Michael Heuer wrote:

> import static ...RichSequence.Tools.*;
> import static ...RichSequence.IOTools.*;
> 
> Fastq fastq = ...;
> Namespace namespace = ...;
> RichSequence richSequence = createRichSequence(
>   namespace,
>   fastq.getDescription(),
>   fastq.getSequence(),
>   DNATools.getDNA());
> 
> writeFasta(outputStream, richSequence, namespace);

I have tried this but I got this error:
Fastq2Fasta.java:52: cannot find symbol
symbol  : method createRichSequence(org.biojavax.SimpleNamespace,java.lang.String,java.lang.String,org.biojava.bio.symbol.FiniteAlphabet)
location: class Fastq2Fasta
      RichSequence richSequence = createRichSequence(ns,
1 error

The complete code now looks like this:

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import org.biojava.bio.program.fastq.Fastq;
import org.biojava.bio.program.fastq.FastqBuilder;
import org.biojava.bio.program.fastq.FastqReader;
import org.biojava.bio.program.fastq.FastqVariant;
import org.biojava.bio.program.fastq.FastqWriter;
import org.biojava.bio.program.fastq.IlluminaFastqReader;
import org.biojava.bio.program.fastq.IlluminaFastqWriter;
import org.biojava.bio.seq.DNATools;
import org.biojavax.SimpleNamespace;
import org.biojavax.bio.seq.RichSequence;


public class Fastq2Fasta {

  public static void main(String[] args) throws FileNotFoundException,
  IOException {

    FileInputStream inputFastq = new FileInputStream("fastq2fasta.fastq"); 
    FastqReader qReader = new IlluminaFastqReader();

    FileOutputStream outputFastq = new FileOutputStream("fastq2fastaTrim.fastq"); 
    FastqWriter qWriter = new IlluminaFastqWriter();

    //SimpleNamespace ns = new SimpleNamespace("biojava");

    FileOutputStream outputFasta = new FileOutputStream("fastq2fastaTrim.fasta");


    for (Fastq fastq : qReader.read(inputFastq)) {
      System.out.println(fastq.getDescription());
      System.out.println(fastq.getSequence());
      String trimSeq = fastq.getSequence().substring(0,
      		fastq.getSequence().length() - 6); 
      System.out.println(trimSeq);
      System.out.println(fastq.getQuality());
      String trimQual = fastq.getQuality().substring(0,
    		fastq.getQuality().length() - 6);
      System.out.println(trimQual);

      FastqBuilder trimFastq = new FastqBuilder();
      trimFastq.withVariant(FastqVariant.FASTQ_ILLUMINA);
      trimFastq.withDescription(fastq.getDescription());
      trimFastq.appendSequence(trimSeq);
      trimFastq.appendQuality(trimQual);

      qWriter.write(outputFastq, trimFastq.build());


      SimpleNamespace ns = new SimpleNamespace("biojava");
      RichSequence richSequence = createRichSequence(ns,
              fastq.getDescription(), trimSeq, DNATools.getDNA());
      RichSequence.IOTools.writeFasta(outputFasta, richSequence, ns);
    }
  }
}

What did I do wrong?


> 
> > Suggestions:
> > 1)
> > After I trimmed the fastq files the header information for quality
> > is empty
> >
> > @HWI-EAS406:5:1:0:1390#0/1
> > GGGTGATGGCCGCTGCCGATGGCGTCAAAA
> > +
> > OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
> >
> > this reduced the size of the files but is it compatible with
> > SOAP and TopHat?
> 
> Sorry, not sure what you are asking here.
> 
Usually the @-header and the +-header are identical, e.g.
@HWI-EAS406:5:1:0:1390#0/1
+HWI-EAS406:5:1:0:1390#0/1
but after trimming and writing to a fastq file I got this
@HWI-EAS406:5:1:0:1390#0/1
+
The +-header is empty. Is this OK, and is it compliant with the standard?
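
For reference, the FASTQ format treats the text after the '+' as an
optional repeat of the title line, so both forms describe the same
record. Below is a minimal sketch of a check that accepts either form.
The class and method names are hypothetical and not part of BioJava,
and whether a particular tool such as SOAP or TopHat accepts the short
form still has to be verified against that tool.

```java
import java.util.List;

// Hypothetical helper, not part of BioJava: checks one four-line FASTQ record.
final class FastqRecordCheck {

    /**
     * Returns true if the four lines form a well-formed record. The '+' line
     * may be bare ("+") or may repeat the title from the '@' line.
     */
    static boolean isWellFormed(List<String> lines) {
        if (lines.size() != 4) {
            return false;
        }
        String title = lines.get(0);
        String sequence = lines.get(1);
        String plus = lines.get(2);
        String quality = lines.get(3);

        if (!title.startsWith("@") || !plus.startsWith("+")) {
            return false;
        }
        // Text after '+' is optional; if present it must repeat the title.
        String repeated = plus.substring(1);
        boolean plusOk = repeated.isEmpty() || repeated.equals(title.substring(1));

        // Sequence and quality strings must have the same length.
        return plusOk && sequence.length() == quality.length();
    }
}
```

With this check, a record whose third line is just "+" and a record
whose third line repeats the title are both accepted, while a '+' line
carrying a different title is rejected.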

Best regards,


From mitlox at op.pl  Thu Apr  8 11:30:52 2010
From: mitlox at op.pl (xyz)
Date: Thu, 8 Apr 2010 21:30:52 +1000
Subject: [Biojava-l] readFasta problem
Message-ID: <20100408213052.662beb8e@wp01>

Hello,
I would like to read a fasta file without specifying in the code
whether it is DNA, RNA or protein, so I wrote this code:

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import org.biojava.bio.BioException;
import org.biojavax.SimpleNamespace;
import org.biojavax.bio.seq.RichSequence;
import org.biojavax.bio.seq.RichSequenceIterator;

public class SortFasta {

  public static void main(String[] args) throws FileNotFoundException,
  BioException {


    BufferedReader br = new BufferedReader(new
    FileReader("sortFasta.fasta")); 
    SimpleNamespace ns = new SimpleNamespace("biojava");

    // You can use any of the convenience methods found in the BioJava 1.6 API 
    //RichSequenceIterator rsi = RichSequence.IOTools.readFastaDNA(br,  ns); 
    RichSequenceIterator rsi = RichSequence.IOTools.readFasta(br, null, ns);

    // Since a single file can contain more than a sequence, you need
    // to iterate over rsi to get the information.
    while (rsi.hasNext()) {
      RichSequence rs = rsi.nextRichSequence();
      System.out.println(rs.getComments());
      System.out.println(rs.seqString());
    }
  }
}
but unfortunately I got the following error:
Please submit the details that follow to biojava-l at biojava.org or post a bug
    report to http://bugzilla.open-bio.org/ 

Format_object=org.biojavax.bio.seq.io.FastaFormat
Accession=
Id=
Comments=problem parsing symbols
Parse_block=atccccc
Stack trace follows ....


        at org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:222)
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
        ... 1 more
Caused by: java.lang.NullPointerException
        at org.biojava.bio.symbol.SimpleSymbolList.<init>(SimpleSymbolList.java:165)
        at org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:213)
        ... 2 more
Java Result: 1

What did I do wrong?

Thank you in advance.

Best regards,


From holland at eaglegenomics.com  Thu Apr  8 11:41:25 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Thu, 8 Apr 2010 12:41:25 +0100
Subject: [Biojava-l] readFasta problem
In-Reply-To: <20100408213052.662beb8e@wp01>
References: <20100408213052.662beb8e@wp01>
Message-ID: <5B252A84-6822-4978-BBD1-83A8BA09EBC5@eaglegenomics.com>

You have passed null as the tokenizer parameter of RichSequence.IOTools.readFasta(); this is not allowed. The parser cannot guess the type of sequence; it must be told what to expect by specifying the tokenizer to use. (Importantly, this also means that you cannot mix different types of sequence within the same file to be parsed.)
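
If the sequence type really is not known in advance, one workaround is to inspect the residues yourself before choosing a reader. The sketch below is only a heuristic under stated assumptions and is not part of BioJava: the class, enum and method names are hypothetical, it looks at the first record only, and a protein made up solely of A, C, G, T and N would be misclassified. The result would then be used to pick the matching convenience method, for example the readFastaDNA call that is commented out in your code.

```java
// Hypothetical helper, not part of BioJava: guesses the type of the first
// FASTA record from the characters it contains.
final class FastaTypeGuess {

    enum SeqType { DNA, RNA, PROTEIN, UNKNOWN }

    static SeqType guess(String fastaText) {
        boolean sawResidue = false;
        boolean sawT = false;
        boolean sawU = false;
        int headers = 0;
        for (String line : fastaText.split("\r?\n")) {
            if (line.startsWith(">")) {
                headers++;
                if (headers > 1) {
                    break; // only the first record is inspected
                }
                continue;
            }
            for (int i = 0; i < line.length(); i++) {
                char c = Character.toUpperCase(line.charAt(i));
                if (Character.isWhitespace(c) || c == '-' || c == '*') {
                    continue; // skip spacing, gap and stop characters
                }
                sawResidue = true;
                if ("ACGTUN".indexOf(c) < 0) {
                    return SeqType.PROTEIN; // a letter no nucleotide sequence uses here
                }
                if (c == 'T') sawT = true;
                if (c == 'U') sawU = true;
            }
        }
        if (!sawResidue || (sawT && sawU)) {
            return SeqType.UNKNOWN; // empty input, or T and U mixed
        }
        return sawU ? SeqType.RNA : SeqType.DNA;
    }
}
```

Because the check consumes the text it reads, the file has to be opened again (or buffered) before it is handed to the actual parser.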



--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From holland at eaglegenomics.com  Thu Apr  8 11:36:36 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Thu, 8 Apr 2010 12:36:36 +0100
Subject: [Biojava-l] Reading and writting Fastq files
In-Reply-To: <20100408213013.63a99b8c@wp01>
References: <20100330215047.084f6b00@wp01>
	<Pine.GSO.4.44.1003312334350.18726-100000@shell3.shore.net>
	<20100408213013.63a99b8c@wp01>
Message-ID: <5EBA99CE-1DAC-442A-B7FD-2E738C7F586B@eaglegenomics.com>

You haven't included the two import static lines in your code. See the first two lines of Michael's example code (expanding the ellipses to the fully qualified class names).
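
Spelled out, and assuming the nested classes sit in the same org.biojavax.bio.seq package that your code already imports RichSequence from, the two lines would read:

    import static org.biojavax.bio.seq.RichSequence.Tools.*;
    import static org.biojavax.bio.seq.RichSequence.IOTools.*;

The alternative is to qualify the call as RichSequence.Tools.createRichSequence(...), the same way your code already qualifies RichSequence.IOTools.writeFasta(...). The mechanism itself can be seen with a JDK class. This self-contained sketch uses java.lang.Math purely as a stand-in for RichSequence.Tools, and the class name is hypothetical:

```java
// Without this static import, the unqualified max(...) call below fails to
// compile with "cannot find symbol", just like createRichSequence(...) did.
import static java.lang.Math.max;

final class StaticImportDemo {

    static int longest(String a, String b) {
        // Resolved through the static import; equivalent to Math.max(...).
        return max(a.length(), b.length());
    }
}
```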


--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From chapman at cs.wisc.edu  Thu Apr  8 12:47:12 2010
From: chapman at cs.wisc.edu (Mark Chapman)
Date: Thu, 08 Apr 2010 07:47:12 -0500
Subject: [Biojava-l] GSoC Application
Message-ID: <4BBDD050.6090208@cs.wisc.edu>

I would appreciate any feedback on my proposal from mentors or other developers. 
  Check it out at: 
http://socghop.appspot.com/gsoc/student_proposal/show/google/gsoc2010/mark_chapman/t127055148817

Thanks in advance,
Mark


From caishaojiang at gmail.com  Thu Apr  8 13:28:11 2010
From: caishaojiang at gmail.com (Cai Shaojiang)
Date: Thu, 8 Apr 2010 06:28:11 -0700
Subject: [Biojava-l] [Fwd: Re:  GSoC project on MSA]
In-Reply-To: <4BBDCFD2.3000507@uni-tuebingen.de>
References: <4BBC80A8.5000608@uni-tuebingen.de>
	<v2j927e071e1004072144t557b480au27666262c79094e2@mail.gmail.com>
	<4BBDCFD2.3000507@uni-tuebingen.de>
Message-ID: <r2p927e071e1004080628hfdce95c2y1081153aeeaaecef@mail.gmail.com>

Dear Sir:

I have submitted the proposal through Google.

Cheers.

On Thu, Apr 8, 2010 at 5:45 AM, Andreas Dräger <
andreas.draeger at uni-tuebingen.de> wrote:

> Hi Cai,
>
> Oh yes, it is in the alignment package. But it is only an interface. It
> already has two sub-types: AbstractULAlignment and this has the
> implementation SubULAlignment. We should check first if we can already use
> these data structures to easily produce a paired alignment. Can you see how
> the AlignmentPair is produced by the alignment algorithms in the alignment
> package? We should do something similar but with this different data
> structure, I suggest.
>
>
> Cheers
> Andreas
>
> --
> Dipl.-Bioinform. Andreas Dräger
> Eberhard Karls University Tübingen
> Center for Bioinformatics (ZBIT)
> Sand 1
> 72076 Tübingen
> Germany
>
> Phone: +49-7071-29-70436
> Fax:   +49-7071-29-5091
>


-- 
Cai Shaojiang
Department of Information Systems,
School of Computing,
National University of Singapore
Telephone: +65 93-4870-93
Email: caishaojiang at gmail.com; shaoj at comp.nus.edu.sg


From sacomoto at gmail.com  Thu Apr  8 16:26:55 2010
From: sacomoto at gmail.com (Gustavo Akio Tominaga Sacomoto)
Date: Thu, 8 Apr 2010 13:26:55 -0300
Subject: [Biojava-l] GSoC project on MSA
In-Reply-To: <q2r59a41c431004071212j3697b470g448df70d9b239993@mail.gmail.com>
References: <w2r6a8f5b081004052229u7d9369f5q38f0f827a0c3e9a6@mail.gmail.com> 
	<l2m59a41c431004061046mdc4c4270r825341932e59dfcc@mail.gmail.com> 
	<j2r6a8f5b081004061153la171784dx7fbbf933d7dc7326@mail.gmail.com> 
	<g2r59a41c431004061427kd3c3d8d4j45a253e7e50ad66e@mail.gmail.com> 
	<q2q6a8f5b081004062229ja299ca51j7d43daa93c155e0b@mail.gmail.com> 
	<q2r59a41c431004071212j3697b470g448df70d9b239993@mail.gmail.com>
Message-ID: <x2w6a8f5b081004080926k21ce1ff5o21a7999761fd99ec@mail.gmail.com>

Hi Andreas,

On Wed, Apr 7, 2010 at 4:12 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
> Hi Gustavo,
>
> here my 0.02$:
>
> * For some of your steps there is already code available in BioJava.
> Might be good to take a look at what is already there... ? (look at
> the alignment and phylo modules for dynamic programming and
> Neighbour-Joining)
>
> * What about risks? Where do you expect difficulties and how to work
> around them?
>
> * Step 4: Can you add more details? How do you plan to approach this?
> E.g. Clustalw has a number of rules implemented at this stage. Do you
> plan to support multiple rules as well and how to do this technically.
> Something nice would be the possibility to use structure alignments to
> guide the sequence alignments. (structure module)

Based on your feedback, I rewrote step 4 and added a "Main Risks"
section.

I have pasted just the new version of step 4 and the new section at
the end of this e-mail.

Thank you very much for your feedback.

gustavo


-------------------------------------------------------------------------------------------

** 4. Implement the algorithm for progressive MSA and the MSA wrapper.

 A progressive MSA is a heuristic approach to the MSA problem: at each
step a pairwise alignment is performed between two sequences, between
a sequence and an alignment, or between two alignments. The multiple
alignment is thus built incrementally, with more sequences aligned
together at each iteration. The guide tree gives the order of this
incremental alignment: working bottom-up through the tree, sequences
(or groups of sequences) with greater similarity are aligned first.
Therefore, to keep the code flexible and reusable, the design will
allow any binary tree over the sequences to be used as a guide tree,
not only the one built in the previous step. This will allow a priori
phylogenetic or tertiary (structural) similarity knowledge to be used
to guide the order of the multiple alignment.
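
 The bottom-up order can be sketched as a post-order walk over the
guide tree. This is only an illustration under my own assumptions: the
Node class and method names are hypothetical, not existing Biojava
types, and the merge is recorded as a string where the real code would
call the generalized pairwise alignment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: lists the pairwise merges implied by a binary guide tree.
final class GuideTreeOrder {

    /** A leaf holds one sequence name; an inner node has two children. */
    static final class Node {
        final String name;   // non-null only for leaves
        final Node left;
        final Node right;

        Node(String name) {
            this.name = name;
            this.left = null;
            this.right = null;
        }

        Node(Node left, Node right) {
            this.name = null;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return name != null;
        }
    }

    /**
     * Post-order walk: returns the sequences below the node and appends to
     * steps one entry per merge, children always before their parent.
     */
    static List<String> collect(Node node, List<String> steps) {
        List<String> group = new ArrayList<>();
        if (node.isLeaf()) {
            group.add(node.name);
            return group;
        }
        List<String> leftGroup = collect(node.left, steps);
        List<String> rightGroup = collect(node.right, steps);
        // A real implementation would align the two groups here.
        steps.add(leftGroup + " + " + rightGroup);
        group.addAll(leftGroup);
        group.addAll(rightGroup);
        return group;
    }

    static List<String> mergeOrder(Node root) {
        List<String> steps = new ArrayList<>();
        collect(root, steps);
        return steps;
    }
}
```

 Since the walk only depends on the tree shape, the same code would
work for a tree supplied by the user, for example one derived from
structural similarity.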

 This is certainly the most difficult part of the project, so to make
sure we deliver a fully functional MSA algorithm, a safer approach is
going to be taken. First, the basic algorithm described in [2] will be
implemented. Once this is successfully done and the code is fully
integrated into the Biojava code base, the features described in [1]
are going to be added (and tested) incrementally in order to implement
the full algorithm. This step is further divided into substeps.

*** 4.1 Implement a first, simpler dynamic programming (DP) algorithm.

  This is the generalized pairwise alignment used in each iteration of
the progressive MSA. Gaps already present in one of the alignments
(profiles) remain fixed, and gap opening penalties remain unchanged;
this means that opening new gaps inside existing gaps will be fully
penalized. The code for this algorithm is similar to the regular
pairwise alignment code already present in Biojava.
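
  As a rough sketch of what such a generalized DP could look like
(score only, no traceback): each profile is an array of equal-length
rows, a column pair is scored as the sum of all residue pairs, and
every newly inserted gap column costs the same fixed amount. All names
are hypothetical, the single linear gap cost is a simplification of the
affine model, and scoring a residue against an existing '-' as 0 is my
assumption.

```java
// Hypothetical sketch: global alignment score between two profiles.
final class ProfileAlign {

    /** Score of one residue pair; a pair involving an existing gap scores 0. */
    static int pairScore(char a, char b, int match, int mismatch) {
        if (a == '-' || b == '-') {
            return 0;
        }
        return a == b ? match : mismatch;
    }

    /** Sum of pair scores between column i of profile x and column j of profile y. */
    static int columnScore(String[] x, int i, String[] y, int j, int match, int mismatch) {
        int sum = 0;
        for (String rowX : x) {
            for (String rowY : y) {
                sum += pairScore(rowX.charAt(i), rowY.charAt(j), match, mismatch);
            }
        }
        return sum;
    }

    /**
     * Needleman-Wunsch style recurrence over profile columns. The gap
     * argument is the (negative) score of one new gap column; columns of the
     * input profiles are never changed, so existing gaps stay fixed.
     */
    static int score(String[] x, String[] y, int match, int mismatch, int gap) {
        int n = x[0].length();
        int m = y[0].length();
        int[][] dp = new int[n + 1][m + 1];
        for (int i = 1; i <= n; i++) {
            dp[i][0] = dp[i - 1][0] + gap;
        }
        for (int j = 1; j <= m; j++) {
            dp[0][j] = dp[0][j - 1] + gap;
        }
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int diag = dp[i - 1][j - 1] + columnScore(x, i - 1, y, j - 1, match, mismatch);
                int up = dp[i - 1][j] + gap;     // gap column inserted into y
                int left = dp[i][j - 1] + gap;   // gap column inserted into x
                dp[i][j] = Math.max(diag, Math.max(up, left));
            }
        }
        return dp[n][m];
    }
}
```

  With single-row profiles this reduces to ordinary pairwise global
alignment, which is why the existing Biojava code is a natural starting
point.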

*** 4.2 Implement the basic progressive MSA algorithm.

  In this substep, the incremental algorithm that builds the MSA is
going to be implemented. It traverses a guide tree (a parameter, which
could be the one built in step 3 or any other one) in a bottom-up
fashion and uses the algorithm from substep 4.1 at each iteration.

*** 4.3 Implement the MSA wrapper.

  The MSA wrapper is going to be a method that wraps steps 2, 3 and
4.2, giving the final user a simple way to calculate the MSA. It
receives as parameters the set of sequences to be aligned, the gap
opening penalty, the gap extension penalty and the residue matrix, and
returns the MSA for the sequence set.
  At the end of this substep, we get a basic, fully functional MSA
algorithm using the progressive heuristic.

*** 4.4 Implement gap penalty rescaling and parameter default values.

  The penalties for opening a new gap and for extending an existing
one (the affine gap weight model) are user-defined parameters. This
substep will define default values for these parameters, based on the
residue matrix, and implement global rescaling rules for them (based
on sequence lengths).

*** 4.5 Enhance the DP algorithm to use different sequence weights.

  Based on the guide tree, a different weight is calculated for each
sequence (divergent sequences receive high values) and used in the
scoring scheme of the generalized DP algorithm.
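
  One way to derive such weights, as I understand the scheme in [1], is
to walk from the root to each leaf and give every sequence an equal
share of each branch it sits under, so that sequences on long private
branches end up with high weights. The sketch below assumes a rooted
binary tree with branch lengths; the class and method names are
hypothetical, and normalization of the weights is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: branch-length based sequence weights from a guide tree.
final class TreeWeights {

    static final class Node {
        final String name;      // non-null only for leaves
        final double length;    // length of the branch above this node
        final Node left;
        final Node right;

        Node(String name, double length) {
            this(name, length, null, null);
        }

        Node(double length, Node left, Node right) {
            this(null, length, left, right);
        }

        private Node(String name, double length, Node left, Node right) {
            this.name = name;
            this.length = length;
            this.left = left;
            this.right = right;
        }
    }

    static int countLeaves(Node node) {
        return node.name != null ? 1 : countLeaves(node.left) + countLeaves(node.right);
    }

    /** Adds this branch's length, split evenly among the leaves below it. */
    static void assign(Node node, double inherited, Map<String, Double> out) {
        double share = inherited + node.length / countLeaves(node);
        if (node.name != null) {
            out.put(node.name, share);
            return;
        }
        assign(node.left, share, out);
        assign(node.right, share, out);
    }

    static Map<String, Double> weights(Node root) {
        Map<String, Double> out = new LinkedHashMap<>();
        assign(root, 0.0, out);
        return out;
    }
}
```

  For a tree where A and B share a branch of length 0.2 and each has a
private branch of 0.1, while C hangs directly off the root with length
0.5, the divergent sequence C receives the largest weight.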

*** 4.6 Enhance the DP algorithm to use position based gap penalties.

  The DP algorithm from substep 4.1 uses a globally defined gap opening
penalty. In this substep, the algorithm is going to be modified to use
position-based penalties. This is simple once an array of opening
penalties, one per sequence position, is known. The array is
initialized with the original gap opening penalty and then adjusted by
several hierarchical rescaling rules (only the first one that fits, if
any, is applied).

Given the hierarchical nature of the rules, they can be implemented in
an incremental way, from the highest priority rule to the lowest, the
algorithm of each step being a refinement of the previous one. I am
omitting the detailed description of each rule. However, to verify
whether a given rule applies to a given position, all that is necessary
is to check at most 16 adjacent positions and the same position in the
other already aligned sequences.
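
The first-fit structure could be sketched as follows. The Rule
interface, the two example rules and their rescaling factors are
placeholders of mine for illustration, not the actual rules or values
from [1]; only the initialization with the original penalty and the
"first rule that fits" logic are taken from the description above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: position-specific gap opening penalties, first-fit rules.
final class PositionGapPenalties {

    /** A rescaling rule: if it applies at a column, the penalty is multiplied by factor(). */
    interface Rule {
        boolean applies(String[] alignment, int column);
        double factor();
    }

    static boolean hasGap(String[] alignment, int column) {
        for (String row : alignment) {
            if (row.charAt(column) == '-') {
                return true;
            }
        }
        return false;
    }

    /** Placeholder rule: the column itself already contains a gap. */
    static Rule atExistingGap(final double factor) {
        return new Rule() {
            public boolean applies(String[] a, int c) {
                return hasGap(a, c);
            }
            public double factor() {
                return factor;
            }
        };
    }

    /** Placeholder rule: another column within the window contains a gap. */
    static Rule nearExistingGap(final int window, final double factor) {
        return new Rule() {
            public boolean applies(String[] a, int c) {
                int last = a[0].length() - 1;
                for (int k = Math.max(0, c - window); k <= Math.min(last, c + window); k++) {
                    if (k != c && hasGap(a, k)) {
                        return true;
                    }
                }
                return false;
            }
            public double factor() {
                return factor;
            }
        };
    }

    static double[] penalties(String[] alignment, double gapOpen, List<Rule> rulesByPriority) {
        double[] out = new double[alignment[0].length()];
        Arrays.fill(out, gapOpen);            // start from the original penalty
        for (int c = 0; c < out.length; c++) {
            for (Rule rule : rulesByPriority) {
                if (rule.applies(alignment, c)) {
                    out[c] *= rule.factor();
                    break;                    // only the first rule that fits is applied
                }
            }
        }
        return out;
    }
}
```

Adding a lower priority rule then only means appending it to the list,
which matches the incremental plan of substeps 4.6.1 to 4.6.4.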

At the end of each of the following steps we have a functional
algorithm, and after 4.6.4 the full CLUSTALW algorithm is complete.

**** 4.6.1 Lowered gap opening penalties at existing gaps.
**** 4.6.2 Increased gap opening penalties near existing gaps.
**** 4.6.3 Reduced gap opening penalties in hydrophilic stretches.
**** 4.6.4 Residue specific gap penalties.

 ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks.

 EXTRA: Implement some benchmark technique to measure the final
alignment quality.

Main Risks
----------

The main risk to this project is the intrinsic complexity of the
progressive MSA algorithm. To deal with that, we decided to break the
implementation into a large number of small and manageable steps,
designed so that at the end of each of them we will have a complete and
testable new function (or a modification of an existing one). Besides
that, to be extra careful, the project aims to produce a simple, fully
functional MSA algorithm as early as possible (the estimated time is 8
weeks); this way we guarantee to deliver at least a simpler, but
working and bug-free, version.


> Andreas
>
>
>> -------------------------------------------------------------
>>
>> GSoC proposal
>>
>> Abstract
>> --------
>>
>> This project aims to develop an all-Java implementation of a multiple
>> sequence alignment (MSA) algorithm to be added to the Biojava toolkit,
>> using the progressive algorithm described in the CLUSTALW paper [1].
>>
>> The Importance
>> --------------
>>
>> Multiple sequence alignment is a frequently performed task in sequence
>> analysis with the goal to identify new members of protein families and
>> infer phylogenetic relationships between proteins and genes. At the
>> present there is no Java-only implementation for this algorithm. As
>> such the number of already existing and Java related BioInformatics
>> tools and web sites would benefit from this implementation and
>> sequence analysis could be more easily performed by the end-user.
>>
>> About Me
>> --------
>>
>> I am a graduate student at University of São Paulo (Brazil), I got my
>> undergraduate degree from the same university with a major in Computer
>> Science and a minor in Biology. I have been involved with
>> Bioinformatics for 5 years, always with sequence analysis with
>> particular interest in the MSA problem. Also, in my undergraduate
>> final project I developed a lossless filter (pruning algorithm) for
>> the MSA problem, the work is published in [3] and there is an online
>> implementation of the algorithm in [4]. Finally, I have experience
>> with the C, C++, Java, Python and Ruby programming languages; Git and
>> SVN version control systems.
>>
>> Project Plan
>> ------------
>>
>> The project is divided in four main steps, at the end of each step a
>> completely functional and bug-free new algorithm will be added to the
>> Biojava code base. It should be noticed that each step has a strong
>> dependence on the previous one, so before move to the next step a
>> careful testing will be done.
>>
>> The four steps are described below, estimated times for accomplishment
>> of each step are also given and in some steps extra enhancements are
>> described, they will be implemented if there is some time remaining
>> after all steps are completed.
>>
>> ** 1. Study the Biojava pairwise alignment code and update it to be
>> compliant with Biojava 3.
>>
>>  The pairwise alignment will play an important role in the MSA
>> algorithm. This step is also important for me to get used to the
>> Biojava coding standards and get in touch with the Biojava dev
>> community.
>>
>>  ETA: 2 weeks.
>>
>> ** 2. Implement the algorithm to build the distance matrix.
>>
>>  This is done using the pairwise alignment for each pair of sequence
>> in the set to be aligned.
>>
>>  ETA: 1 week.
>>
>>  EXTRA: Enhance the basic algorithm to use parallel strategies, use
>> several threads to calculate the pairwise alignment for different
>> pairs in the sequence set.
>>
>> ** 3. Implement the algorithm to build the guide tree.
>>
>>  The guide tree is based on the distance matrix built in the last
>> step, the tree construction strategy adopted will be the Neighbor
>> Joining Algorithm.
>>
>>  ETA: 2 weeks.
>>
>> ** 4. Implement the algorithm for progressive MSA using the guide tree.
>>
>>  This is certainly the most difficult part of the project, so to make
>> sure we are going to deliver a fully functional MSA algorithm, a safer
>> approach is going to be taken. In the first place, a dynamic
>> programming algorithm described in [2] will be implemented. Once this
>> get successfully done and the code fully integrated to the Biojava
>> code base, the features described in [1] are going to be incrementally
>> added (and tested) in order to implement the full dynamic programming
>> algorithm.
>>
>>  ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks.
>>
>>  EXTRA: Implement some benchmark technique to measure the final
>> alignment quality.
>>
>> References
>> ----------
>>
>> [1] http://www.ncbi.nlm.nih.gov/pubmed/7984417
>> [2] http://www.ncbi.nlm.nih.gov/pubmed/3243435
>> [3] http://www.almob.org/content/4/1/3
>> [4] http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu
>>
>>
>>
>> On Tue, Apr 6, 2010 at 6:27 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
>>> Hi Gustavo,
>>>
>>> In principle I agree to all, see details below:
>>>
>>>
>>> I think my question wasn't very clear, my intention in this project is
>>>>
>>>> to follow the approach (with the tree steps) outlined in the project's
>>>> page. Using the classical progressive alignment heuristic: build the
>>>> distance matrix, build the guide tree and using this tree
>>>> progressively align more sequences together.
>>>
>>> yes
>>>
>>>>
>>>> What I propose for the third step is a first implementation using the
>>>> (more simple) dynamic programming described in the first CLUSTAL paper
>>>> (I thinks it's from 1988) and incrementally improving the algorithm to
>>>> get closer to the one described in CLUSTALW paper (from 1994). Is this
>>>> more or less what you had in mind?
>>>
>>> yes, sounds good.
>>>
>>>>
>>>> About parallel strategies, I think a relative easy way we could use it
>>>> is in the distance matrix construction, we could have several threads
>>>> calculating the pairwise alignment for different pairs of sequence in
>>>> the set.
>>>
>>> Correct. Probably a first implementation would be for a single machine/
>>> multi CPU. More advanced implementations could provide support e.g. for
>>> Map/Reduce, JPPF, or something like that...
>>>
>>>> Now, the alignment quality measures is a tougher issue. The CLUSTALW
>>>> paper doesn't give any way to measure the quality of the result, they
>>>> consider a good alignment the one that is hard to improve by eye (But
>>>> they claim that for sequences sufficient similar, no pair less than
>>>> 35% identical, the results are good). Can I do the same as in CLUSTALW
>>>> paper and leave the quality measure to the user? How concerned should
>>>> I be with that in this project?
>>>
>>> Getting an overall core-algorithm that works should be priority. The
>>> benchmarking part is not mandatory, but something to keep in mind... I have
>>> plenty of material for that, once we get to that stage...
>>>
>>>> I will try send to this mailing list a proposal draft until tomorrow
>>>> to have some feedback from you.
>>>
>>> Excellent, looking forward to it.
>>>
>>> Andreas
>>>
>>> --
>>> -----------------------------------------------------------------------
>>> Dr. Andreas Prlic
>>> Senior Scientist, RCSB PDB Protein Data Bank
>>> University of California, San Diego
>>> (+1) 858.246.0526
>>> -----------------------------------------------------------------------
>>>
>>
>
>
>
> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------
>


From andreas at sdsc.edu  Thu Apr  8 17:26:03 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Thu, 8 Apr 2010 10:26:03 -0700
Subject: [Biojava-l] GSoC Application
In-Reply-To: <4BBDD050.6090208@cs.wisc.edu>
References: <4BBDD050.6090208@cs.wisc.edu>
Message-ID: <x2t59a41c431004081026s4eb39908gf7fb8cc30e99d483@mail.gmail.com>

Hi Mark,

looks pretty good,

* The time schedule feels tight. Where do you see possible
difficulties and risks. What might take longer than expected?

* I would like to be able to use 3D structure alignment information to
guide the final alignment. This should increase reliability of the
final alignment for remote sequence similarities. Any thoughts on how
to accomplish this?

Andreas


On Thu, Apr 8, 2010 at 5:47 AM, Mark Chapman <chapman at cs.wisc.edu> wrote:
> I would appreciate any feedback on my proposal from mentors or other
> developers.  Check it out at:
> http://socghop.appspot.com/gsoc/student_proposal/show/google/gsoc2010/mark_chapman/t127055148817
>
> Thanks in advance,
> Mark
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From andreas at sdsc.edu  Thu Apr  8 17:36:56 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Thu, 8 Apr 2010 10:36:56 -0700
Subject: [Biojava-l] GSoC project on MSA
In-Reply-To: <x2w6a8f5b081004080926k21ce1ff5o21a7999761fd99ec@mail.gmail.com>
References: <w2r6a8f5b081004052229u7d9369f5q38f0f827a0c3e9a6@mail.gmail.com>
	<l2m59a41c431004061046mdc4c4270r825341932e59dfcc@mail.gmail.com>
	<j2r6a8f5b081004061153la171784dx7fbbf933d7dc7326@mail.gmail.com>
	<g2r59a41c431004061427kd3c3d8d4j45a253e7e50ad66e@mail.gmail.com>
	<q2q6a8f5b081004062229ja299ca51j7d43daa93c155e0b@mail.gmail.com>
	<q2r59a41c431004071212j3697b470g448df70d9b239993@mail.gmail.com>
	<x2w6a8f5b081004080926k21ce1ff5o21a7999761fd99ec@mail.gmail.com>
Message-ID: <w2n59a41c431004081036o535f1696qbbe13f59f6f6f56b@mail.gmail.com>

Looks pretty good.

One issue during the progressive alignment build up: 3D structure
alignments can increase the reliability of the sequence alignments,
particularly if the sequences are only distantly related. Having a way
to incorporate the 3D structure info would be nice...

Andreas

On Thu, Apr 8, 2010 at 9:26 AM, Gustavo Akio Tominaga Sacomoto
<sacomoto at gmail.com> wrote:
> Hi Andreas,
>
> On Wed, Apr 7, 2010 at 4:12 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
>> Hi Gustavo,
>>
>> here my 0.02$:
>>
>> * For some of your steps there is already code available in BioJava.
>> Might be good to take a look at what is already there... ? (look at
>> the alignment and phylo modules for dynamic programming and
>> Neighbour-Joining)
>>
>> * What about risks? Where do you expect difficulties and how to work
>> around them?
>>
>> * Step 4: Can you add more details? How do you plan to approach this?
>> E.g. Clustalw has a number of rules implemented at this stage. Do you
>> plan to support multiple rules as well, and how to do this technically?
>> Something nice would be the possibility to use structure alignments to
>> guide the sequence alignments. (structure module)
>
> Based on it, I rewrote step 4 and added a "Main Risks" section.
>
> I pasted just the new version of step 4 and the new section at the end
> of this e-mail.
>
> Thank you very much for your feedback.
>
> gustavo
>
>
>
> -------------------------------------------------------------------------------------------
>
> ** 4. Implement the algorithm for progressive MSA and the MSA wrapper.
>
> A progressive MSA is a heuristic approach to the MSA problem: at
> each step a pairwise alignment is done between two sequences, between
> a sequence and an alignment, or between two alignments. The multiple
> alignment is thus built incrementally, with more sequences aligned
> together at each iteration. The guide tree gives the order for this
> incremental alignment: it is traversed bottom-up, so sequences (or
> groups of sequences) with greater similarity are aligned first.
> Therefore, to make the code more flexible and reusable, the design
> will allow any binary tree of the sequences to be used as a guide
> tree, not only the one built in the last step. This will allow a
> priori knowledge of phylogenetic or tertiary (structural) similarity
> to be used to guide the multiple alignment order.
>
> This is certainly the most difficult part of the project, so to make
> sure we deliver a fully functional MSA algorithm, a safer approach is
> going to be taken. First, the basic algorithm described in [2] will
> be implemented. Once this is successfully done and the code is fully
> integrated into the Biojava code base, the features described in [1]
> are going to be incrementally added (and tested) in order to
> implement the full algorithm. This step is further divided into
> substeps.
>
> *** 4.1 Implement a first simpler dynamic programming (DP) algorithm.
>
> This is the generalized pairwise alignment used in each iteration of
> the progressive MSA. Gaps already present in one of the alignments
> (profiles) remain fixed and gap opening penalties remain unchanged,
> which means that opening new gaps inside existing gaps will be fully
> penalized. The code for this algorithm is similar to the code for
> regular pairwise alignment already present in Biojava.
>
> *** 4.2 Implement the basic progressive MSA algorithm.
>
> In this substep the incremental algorithm to build the MSA is going
> to be implemented. It traverses a guide tree (a parameter, which
> could be the one built in step 3 or any other one) in a bottom-up
> fashion and uses the algorithm from substep 4.1 at each iteration.
>
> *** 4.3 Implement the MSA wrapper.
>
> The MSA wrapper is going to be a method that wraps steps 2, 3 and
> 4.2, giving the final user a simple method to calculate the MSA. It
> receives as parameters the set of sequences to be aligned, the gap
> opening penalty, the gap extension penalty and the residue matrix,
> and returns the MSA for the sequence set.
> At the end of this substep, we get a basic, fully functional MSA
> algorithm using the progressive heuristic.
>
> *** 4.4 Implement gap penalty rescaling and parameter default values.
>
> Gap penalties to open a new gap and extend an existing one (the affine
> gap weight model) are user-defined parameters. This substep will
> define default values for these parameters, based on the residue
> matrix, and implement global rescaling rules for them (based on
> sequence sizes).
>
> *** 4.5 Enhance the DP algorithm to use different sequence weights.
>
> Based on the guide tree, a different weight is calculated for each
> sequence (divergent sequences receive high values) and used in the
> scoring scheme of the generalized DP algorithm.
>
> *** 4.6 Enhance the DP algorithm to use position-based gap penalties.
>
> The DP algorithm from substep 4.1 uses a globally defined gap opening
> penalty. In this substep, the algorithm is going to be modified to use
> position-based penalties. This is simple once an array of opening
> penalties, one for each sequence position, is known. The array is
> initialized with the original gap opening penalty and then rescaled
> by several hierarchical rules (only the first rule that fits, if any,
> is applied).
>
> Given the hierarchical nature of the rules, they can be implemented in
> an incremental way, from the highest priority rule to the lowest, with
> the algorithm of each step being a refinement of the previous one. I am
> omitting the detailed description of each rule. However, to verify
> whether a given rule applies to a given position, all that is necessary
> is to check at most 16 adjacent positions and the same position in the
> other already aligned sequences.
>
> At the end of each of the following steps we have a functional
> algorithm, and after 4.6.4 the full CLUSTALW algorithm is complete.
>
> **** 4.6.1 Lowered gap opening penalties at existing gaps.
> **** 4.6.2 Increased gap opening penalties near existing gaps.
> **** 4.6.3 Reduced gap opening penalties in hydrophilic stretches.
> **** 4.6.4 Residue specific gap penalties.
>
> ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks.
>
> EXTRA: Implement some benchmark technique to measure the final
> alignment quality.
>
> Main Risks
> ----------
>
> The main risk to this project is the intrinsic complexity of the
> progressive MSA algorithm. To deal with that, we decided to break the
> implementation into a large number of small and manageable steps. The
> steps are designed so that, at the end of each of them, we will have
> a complete and testable new function (or a modification of an
> existing one). Besides that, to be extra careful, the project aims to
> produce a simple, fully functional MSA algorithm as early as possible
> (the estimated time is 8 weeks). This way we guarantee to deliver at
> least a simpler, but working and bug-free, version.
>
>
>
>
>> Andreas
>>
>>
>>> -------------------------------------------------------------
>>>
>>> GSoC proposal
>>>
>>> Abstract
>>> --------
>>>
>>> This project aims to develop an all-Java implementation of a multiple
>>> sequence alignment (MSA) algorithm to be added to the Biojava toolkit,
>>> using the progressive algorithm described in the CLUSTALW paper [1].
>>>
>>> The Importance
>>> --------------
>>>
>>> Multiple sequence alignment is a frequently performed task in sequence
>>> analysis, with the goal of identifying new members of protein families
>>> and inferring phylogenetic relationships between proteins and genes. At
>>> present there is no Java-only implementation of this algorithm. As
>>> such, a number of existing Java-related bioinformatics tools and web
>>> sites would benefit from this implementation, and sequence analysis
>>> could be more easily performed by the end user.
>>>
>>> About Me
>>> --------
>>>
>>> I am a graduate student at the University of São Paulo (Brazil). I got
>>> my undergraduate degree from the same university with a major in
>>> Computer Science and a minor in Biology. I have been involved with
>>> bioinformatics for 5 years, always in sequence analysis and with a
>>> particular interest in the MSA problem. In my undergraduate final
>>> project I developed a lossless filter (pruning algorithm) for the MSA
>>> problem; the work is published in [3] and there is an online
>>> implementation of the algorithm in [4]. Finally, I have experience
>>> with the C, C++, Java, Python and Ruby programming languages and with
>>> the Git and SVN version control systems.
>>>
>>> Project Plan
>>> ------------
>>>
>>> The project is divided into four main steps. At the end of each step a
>>> completely functional and bug-free new algorithm will be added to the
>>> Biojava code base. Note that each step depends strongly on the
>>> previous one, so careful testing will be done before moving to the
>>> next step.
>>>
>>> The four steps are described below, with an estimated time for each.
>>> Some steps also describe extra enhancements, which will be implemented
>>> if there is time remaining after all steps are completed.
>>>
>>> ** 1. Study the Biojava pairwise alignment code and update it to be
>>> compliant with Biojava 3.
>>>
>>> The pairwise alignment will play an important role in the MSA
>>> algorithm. This step is also important for me to get used to the
>>> Biojava coding standards and get in touch with the Biojava dev
>>> community.
>>>
>>> ETA: 2 weeks.
>>>
>>> ** 2. Implement the algorithm to build the distance matrix.
>>>
>>> This is done using the pairwise alignment for each pair of sequences
>>> in the set to be aligned.
>>>
>>> ETA: 1 week.
>>>
>>> EXTRA: Enhance the basic algorithm to use parallel strategies, using
>>> several threads to calculate the pairwise alignments for different
>>> pairs in the sequence set.
>>>
>>> ** 3. Implement the algorithm to build the guide tree.
>>>
>>> The guide tree is based on the distance matrix built in the last
>>> step. The tree construction strategy adopted will be the Neighbor
>>> Joining algorithm.
>>>
>>> ETA: 2 weeks.
>>>
>>> ** 4. Implement the algorithm for progressive MSA using the guide tree.
>>>
>>> This is certainly the most difficult part of the project, so to make
>>> sure we deliver a fully functional MSA algorithm, a safer approach is
>>> going to be taken. First, the dynamic programming algorithm described
>>> in [2] will be implemented. Once this is successfully done and the
>>> code is fully integrated into the Biojava code base, the features
>>> described in [1] are going to be incrementally added (and tested) in
>>> order to implement the full dynamic programming algorithm.
>>>
>>> ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks.
>>>
>>> EXTRA: Implement some benchmark technique to measure the final
>>> alignment quality.
>>>
>>> References
>>> ----------
>>>
>>> [1] http://www.ncbi.nlm.nih.gov/pubmed/7984417
>>> [2] http://www.ncbi.nlm.nih.gov/pubmed/3243435
>>> [3] http://www.almob.org/content/4/1/3
>>> [4] http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu
>>>
>>>
>>>
>>> On Tue, Apr 6, 2010 at 6:27 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>> Hi Gustavo,
>>>>
>>>> In principle I agree to all, see details below:
>>>>
>>>>
>>>>> I think my question wasn't very clear. My intention in this project is
>>>>> to follow the approach (with the three steps) outlined in the project's
>>>>> page, using the classical progressive alignment heuristic: build the
>>>>> distance matrix, build the guide tree and, using this tree,
>>>>> progressively align more sequences together.
>>>>
>>>> yes
>>>>
>>>>>
>>>>> What I propose for the third step is a first implementation using the
>>>>> (simpler) dynamic programming described in the first CLUSTAL paper
>>>>> (I think it's from 1988) and incrementally improving the algorithm to
>>>>> get closer to the one described in the CLUSTALW paper (from 1994). Is
>>>>> this more or less what you had in mind?
>>>>
>>>> yes, sounds good.
>>>>
>>>>>
>>>>> About parallel strategies, I think a relatively easy way we could use
>>>>> them is in the distance matrix construction: we could have several
>>>>> threads calculating the pairwise alignments for different pairs of
>>>>> sequences in the set.
>>>>
>>>> Correct. Probably a first implementation would be for a single machine/
>>>> multi CPU. More advanced implementations could provide support e.g. for
>>>> Map/Reduce, JPPF, or something like that...
>>>>
>>>>> Now, measuring alignment quality is a tougher issue. The CLUSTALW
>>>>> paper doesn't give any way to measure the quality of the result; they
>>>>> consider a good alignment to be one that is hard to improve by eye (but
>>>>> they claim that for sufficiently similar sequences, with no pair less
>>>>> than 35% identical, the results are good). Can I do the same as in the
>>>>> CLUSTALW paper and leave the quality measure to the user? How concerned
>>>>> should I be with that in this project?
>>>>
>>>> Getting an overall core-algorithm that works should be priority. The
>>>> benchmarking part is not mandatory, but something to keep in mind... I have
>>>> plenty of material for that, once we get to that stage...
>>>>
>>>>> I will try to send a proposal draft to this mailing list by tomorrow
>>>>> to get some feedback from you.
>>>>
>>>> Excellent, looking forward to it.
>>>>
>>>> Andreas
>>>>
>>>
>>
>>
>>
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From chapman at cs.wisc.edu  Thu Apr  8 20:45:21 2010
From: chapman at cs.wisc.edu (Mark Chapman)
Date: Thu, 08 Apr 2010 15:45:21 -0500
Subject: [Biojava-l] GSoC Application
In-Reply-To: <x2t59a41c431004081026s4eb39908gf7fb8cc30e99d483@mail.gmail.com>
References: <4BBDD050.6090208@cs.wisc.edu>
	<x2t59a41c431004081026s4eb39908gf7fb8cc30e99d483@mail.gmail.com>
Message-ID: <4BBE4061.3000204@cs.wisc.edu>

Hi Andreas,

Thanks for the feedback.

Difficulties and risks:
By viewing progressive multiple sequence alignment as four separate stages, I 
believe the pieces become easier to manage.  However, I also expect a few of my 
ideas to prove quite challenging to implement.  One of these challenges will be 
efficient parallelization.  Instead of spending all summer finding the optimal 
approach, I plan to make routines which are called in sequence in a simple 
implementation and in parallel in a separate one.  Later work could then extend 
the parallelism to a distributed computing framework such as Hadoop or Condor. 
Another difficult aspect is to make a general interface for choosing anchors in 
profile-profile alignment.  The Myers-Miller algorithm chooses optimal midpoints 
as anchors in an internal decision process.  I hope to generalize this to allow 
external identification of candidate anchors, as well.
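
As a rough illustration of the "same routine, called in sequence or in
parallel" idea, here is a minimal Java sketch. Everything in it is an
assumption made for the example: AllPairsScorer, its methods and the toy
identityCount scorer are hypothetical names, not BioJava classes, and the
thread pool is only one of several ways this could be done.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToIntBiFunction;

/** Hypothetical helper, not a BioJava class: scores every pair of sequences. */
final class AllPairsScorer {

    /** Simple implementation: calls the scoring routine pair by pair, in sequence. */
    static int[][] scoreSequential(List<String> seqs, ToIntBiFunction<String, String> scorer) {
        int n = seqs.size();
        int[][] scores = new int[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                int s = scorer.applyAsInt(seqs.get(i), seqs.get(j));
                scores[i][j] = s;
                scores[j][i] = s;
            }
        }
        return scores;
    }

    /** Separate implementation: same routine, one task per pair on a fixed thread pool. */
    static int[][] scoreParallel(List<String> seqs, ToIntBiFunction<String, String> scorer, int threads) {
        int n = seqs.size();
        int[][] scores = new int[n][n];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<int[]>> pending = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                for (int j = i + 1; j < n; j++) {
                    final int a = i, b = j;
                    // Each task returns {row, column, score} for one pair.
                    pending.add(pool.submit(
                            () -> new int[] {a, b, scorer.applyAsInt(seqs.get(a), seqs.get(b))}));
                }
            }
            for (Future<int[]> f : pending) {
                int[] r = f.get();
                scores[r[0]][r[1]] = r[2];
                scores[r[1]][r[0]] = r[2];
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("pairwise scoring failed", e);
        } finally {
            pool.shutdown();
        }
        return scores;
    }

    /** Toy scorer for illustration only: counts identical characters at the same index. */
    static int identityCount(String x, String y) {
        int m = Math.min(x.length(), y.length());
        int same = 0;
        for (int k = 0; k < m; k++) {
            if (x.charAt(k) == y.charAt(k)) {
                same++;
            }
        }
        return same;
    }
}
```

Because both methods take the scoring routine as a parameter, a real
pairwise aligner could replace the toy scorer without touching the
sequential or the parallel driver.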

Structural alignment integration:
At least three options exist for inserting structural information into the 
multiple sequence alignment task: pairwise scoring, anchoring, and profile 
scoring.  First, scores from pairwise structural alignments could be used to 
construct the similarity matrix.  This would create a guide tree that aligns 
sequences with similar structures earlier in the progressive alignment.  Second, 
structural alignment could identify possible anchors.  The profile-profile 
alignments would then conserve known structures when two profiles share some 
anchor candidates.  Both of these options are in my plan.  The third option 
would follow the consistency method of profile-profile alignment which replaces 
scoring from a substitution matrix with a consistency score.  This technique is 
used in T-Coffee and ProbCons.  The consistency score comes from how often 
residues in each profile aligned when combining information from pairwise 
alignments.  If these were structural pairwise alignments, then the multiple 
sequence alignment would preserve structural information.  Later work could 
implement this method as an alternative profile-profile alignment.
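
The first option (structural scores feeding the similarity matrix) could
look roughly like the Java sketch below. It is only a sketch under stated
assumptions: SimilarityMatrixSketch and both scoring functions are
hypothetical, scores are assumed to be normalized to [0, 1], and falling
back to the sequence score when no structure is available is my own choice
for the example, not something the proposal specifies.

```java
import java.util.List;
import java.util.OptionalDouble;
import java.util.function.BiFunction;
import java.util.function.ToDoubleBiFunction;

/** Hypothetical sketch, not BioJava API: mixes structural and sequence similarity. */
final class SimilarityMatrixSketch {

    /**
     * Builds a symmetric similarity matrix over the given sequence identifiers.
     * The structural score is used when present; otherwise the sequence score is used.
     */
    static double[][] build(List<String> ids,
                            BiFunction<String, String, OptionalDouble> structuralScore,
                            ToDoubleBiFunction<String, String> sequenceScore) {
        int n = ids.size();
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            sim[i][i] = 1.0; // assumed convention: a sequence is fully similar to itself
            for (int j = i + 1; j < n; j++) {
                String a = ids.get(i);
                String b = ids.get(j);
                OptionalDouble s = structuralScore.apply(a, b);
                double v = s.isPresent() ? s.getAsDouble() : sequenceScore.applyAsDouble(a, b);
                sim[i][j] = v;
                sim[j][i] = v;
            }
        }
        return sim;
    }
}
```

A guide tree built from such a matrix would then join structurally similar
sequences early, which is the effect described above.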

I'll try to incorporate these ideas when I revise my application later tonight. 
  And thanks again for your input.

Mark


On 4/8/2010 12:26 PM, Andreas Prlic wrote:
> Hi Mark,
>
> looks pretty good,
>
> * The time schedule feels tight. Where do you see possible
> difficulties and risks? What might take longer than expected?
>
> * I would like to be able to use 3D structure alignment information to
> guide the final alignment. This should increase reliability of the
> final alignment for remote sequence similarities. Any thoughts on how
> to accomplish this?
>
> Andreas
>
>
>
>
> On Thu, Apr 8, 2010 at 5:47 AM, Mark Chapman<chapman at cs.wisc.edu>  wrote:
>> I would appreciate any feedback on my proposal from mentors or other
>> developers.  Check it out at:
>> http://socghop.appspot.com/gsoc/student_proposal/show/google/gsoc2010/mark_chapman/t127055148817
>>
>> Thanks in advance,
>> Mark
>>
>
>
>


From sacomoto at gmail.com  Fri Apr  9 00:36:27 2010
From: sacomoto at gmail.com (Gustavo Akio Tominaga Sacomoto)
Date: Thu, 8 Apr 2010 21:36:27 -0300
Subject: [Biojava-l] GSoC project on MSA
In-Reply-To: <w2n59a41c431004081036o535f1696qbbe13f59f6f6f56b@mail.gmail.com>
References: <w2r6a8f5b081004052229u7d9369f5q38f0f827a0c3e9a6@mail.gmail.com> 
	<l2m59a41c431004061046mdc4c4270r825341932e59dfcc@mail.gmail.com> 
	<j2r6a8f5b081004061153la171784dx7fbbf933d7dc7326@mail.gmail.com> 
	<g2r59a41c431004061427kd3c3d8d4j45a253e7e50ad66e@mail.gmail.com> 
	<q2q6a8f5b081004062229ja299ca51j7d43daa93c155e0b@mail.gmail.com> 
	<q2r59a41c431004071212j3697b470g448df70d9b239993@mail.gmail.com> 
	<x2w6a8f5b081004080926k21ce1ff5o21a7999761fd99ec@mail.gmail.com> 
	<w2n59a41c431004081036o535f1696qbbe13f59f6f6f56b@mail.gmail.com>
Message-ID: <n2t6a8f5b081004081736jb6894b71ub3cfd6649b5a7b8d@mail.gmail.com>

Hi Andreas,

On Thu, Apr 8, 2010 at 2:36 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
> Looks pretty good.
>
> One issue during the progressive alignment build up: 3D structure
> alignments can increase the reliability of the sequence alignments,
> particularly if the sequences are only distantly related. Having a way
> to incorporate the 3D structure info would be nice...

A first idea for incorporating information from a 3D structure
alignment is to extract matching substrings from it, i.e. the sequence
substrings that correspond to the superimposed points in the 3D
alignment, and then force the final MSA to contain those same aligned
substrings. To do that, the DP algorithm of step 4.1 should be
modified in the way described here [
http://www.ncbi.nlm.nih.gov/pubmed/9018604 ].
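
To make the "forced aligned positions" idea concrete, here is a small Java
sketch. It is not the method of the cited paper: it swaps in a simpler
technique, splitting both sequences at the anchored residue pairs and
aligning the pieces in between independently. That decomposition is exact
only for the toy linear gap penalty assumed here, and AnchoredAlignmentSketch
with its constants is a hypothetical name made up for the example.

```java
import java.util.List;

/** Hypothetical sketch: global alignment score with forced (anchored) residue pairs. */
final class AnchoredAlignmentSketch {
    // Toy scoring scheme with a linear gap penalty, assumed for illustration.
    static final int MATCH = 1;
    static final int MISMATCH = -1;
    static final int GAP = -2;

    /** Plain Needleman-Wunsch score of x against y. */
    static int globalScore(String x, String y) {
        int[][] d = new int[x.length() + 1][y.length() + 1];
        for (int i = 1; i <= x.length(); i++) {
            d[i][0] = i * GAP;
        }
        for (int j = 1; j <= y.length(); j++) {
            d[0][j] = j * GAP;
        }
        for (int i = 1; i <= x.length(); i++) {
            for (int j = 1; j <= y.length(); j++) {
                int sub = x.charAt(i - 1) == y.charAt(j - 1) ? MATCH : MISMATCH;
                d[i][j] = Math.max(d[i - 1][j - 1] + sub,
                        Math.max(d[i - 1][j] + GAP, d[i][j - 1] + GAP));
            }
        }
        return d[x.length()][y.length()];
    }

    /**
     * Best score among alignments that pair x[a[0]] with y[a[1]] for every anchor a.
     * Anchors are 0-based and must be strictly increasing in both coordinates.
     */
    static int anchoredScore(String x, String y, List<int[]> anchors) {
        int total = 0;
        int px = 0; // start of the current unanchored segment in x
        int py = 0; // start of the current unanchored segment in y
        for (int[] a : anchors) {
            if (a[0] < px || a[1] < py) {
                throw new IllegalArgumentException("anchors must be strictly increasing");
            }
            // Align the free segments before the anchor, then add the forced column.
            total += globalScore(x.substring(px, a[0]), y.substring(py, a[1]));
            total += x.charAt(a[0]) == y.charAt(a[1]) ? MATCH : MISMATCH;
            px = a[0] + 1;
            py = a[1] + 1;
        }
        return total + globalScore(x.substring(px), y.substring(py));
    }
}
```

With affine gap penalties the segments are no longer independent at their
boundaries, so a real implementation would more likely constrain the DP
matrix itself, as suggested above.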

Thanks again.

gustavo

> Andreas
>
> On Thu, Apr 8, 2010 at 9:26 AM, Gustavo Akio Tominaga Sacomoto
> <sacomoto at gmail.com> wrote:
>> Hi Andreas,
>>
>> On Wed, Apr 7, 2010 at 4:12 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
>>> Hi Gustavo,
>>>
>>> here my 0.02$:
>>>
>>> * For some of your steps there is already code available in BioJava.
>>> MIght be good to take a look at what is already there... ? (look at
>>> the alignment and phylo modules for dynamic programming and
>>> Neighbour-Joining)
>>>
>>> * What about risks? Where do you expect difficulties and how to work
>>> around them?
>>>
>>> * Step 4: Can you add more details? How do you plan to approach this?
>>> E.g. Clustalw has a number of rules implemented at this stage. Do you
>>> plan to support multiple rules as well and how to do this technically.
>>> Something nice would be the possibility to use structure alignments to
>>> guide the sequence alignments. (structure module)
>>
>> Based on it I rewrote the step 4 and add a "Main Risks" section.
>>
>> I pasted just the new version of step 4 and the new section at the end
>> of this e-mal.
>>
>> Thank you very much for your feedback.
>>
>> gustavo
>>
>>
>>
>> -------------------------------------------------------------------------------------------
>>
>> ** 4. Implement the algorithm for progressive MSA and the MSA wrapper.
>>
>> ?A progressive MSA is a heuristic approach for the MSA problem, at
>> each step a pairwise alignment between two sequences, a sequence and
>> an alignment or between two alignments is done. So, the multiple
>> alignment is built incrementally, at each iteration more sequences are
>> aligned together. The guide tree gives an order for this incremental
>> alignment, in a bottom-up (in the tree) fashion sequences (or groups
>> of sequences) with greater similarity are aligned first. Therefore, in
>> order to have a more flexible and reusable code, the code design will
>> allow any binary tree of the sequences to be used as a guide tree, not
>> only the one built in the last step. This will allow a priori
>> phylogenetic or tertiary similarity (structural similarity) knowledge
>> be used to guide the multiple alignment order.
>>
>> ?This is certainly the most difficult part of the project, so to make
>> sure we are going to deliver a fully functional MSA algorithm, a safer
>> approach is going to be taken. In the first place, a a basic algorithm
>> described in [2] will be implemented. Once this get successfully done
>> and the code fully integrated to the Biojava code base, the features
>> described in [1] are going to be incrementally added (and tested) in
>> order to implement the full algorithm. This step is further divided in
>> substeps.
>>
>> *** 4.1 Implement a first simpler dynamic programming (DP) algorithm.
>>
>> ?This is the generalized pairwise alignment used in each iteration of
>> the progressive MSA. Gaps ?already presents in one of the alignments
>> (profiles) remain fixed, gap opening penalties remain unchanged, this
>> means that opening new gaps inside existent gaps will be fully
>> penalized. The code for this algorithm is similar to, the already
>> present in Biojava, code for regular pairwise alignment.
>>
>> *** 4.2 Implement the basic progressive MSA algorithm.
>>
>> ?In this substep is going to be implemented the incremental algorithm
>> to built the MSA, transversing a guide tree (parameter, could be the
>> one built in step 3 or any other one) in a bottom-up fashion and using
>> the algorithm from substep 4.1 at each iteration.
>>
>> *** 4.3 Implement the MSA wrapper.
>>
>> ?The MSA wrapper is going to be a method that wraps steps 2, 3 and
>> 4.2, giving a simple method (for the final user) to calculate the MSA.
>> Receiving as parameters the set of sequences to be aligned, the gap
>> opening penalty, gap extend penalty and residue matrix. Returning the
>> MSA for the sequence set.
>> ?At the end of this substep, we get a basic fully functional MSA
>> algorithm, using the progressive heuristic.
>>
>> *** 4.4 Implement gaps penalties rescaling and parameter default values.
>>
>> ?Gap penalties to open a new gap an extend a existing one (the affine
>> gap weight model) are user defined parameters. This substep will
>> define default values, based on the residue matrix, for this
>> parameters and implement global rescaling rules (based on sequences
>> sizes) for this parameters.
>>
>> *** 4.5 Enhance the DP algorithm to use different sequences weight.
>>
>> ?Based on the guide tree, for each sequence a different weight
>> (divergent sequences receive high values) is calculated and used in
>> the scoring scheme of the generalized DP algorithm.
>>
>> *** 4.6 Enhance the DP algorithm to use position based gap penalties.
>>
>> ?The DP algorithm from substep 4.1 uses globally defined gap opening
>> penalty. In this substep, the algorithm is going to be modified do use
>> position based penalty, this is simple, once is known an array of
>> opening penalties for each sequence position. This array is calculated
>> based on several hierarchical (only apply the first one that fits, if
>> any) rules, those are rescaling rules and the array is initialized
>> with the original gap opening penalty.
>>
>> Given the hierarchical nature of the rules, they can be implemented in
>> a incremental way, from the highest priority rule to the lowest, the
>> algorithm of each step being a refinement of the previous one. I am
>> omitting the detailed description of each rule. However, to verify if
>> a given rule apply to a given position, all that is necessary is to
>> check at most 16 adjacent positions and the same position in the other
>> already aligned sequences.
>>
>> At the end of each of the following steps we a have functional
>> algorithm, and after 4.6.4 the full CLUSTALW algorithm is complete.
>>
>> **** 4.6.1 Lowered gap opening penalties at existing gaps.
>> **** 4.6.2 Increased gap opening penalties near existing gaps.
>> **** 4.6.3 Reduced gap opening penalties in hydrophilic stretches.
>> **** 4.6.4 Residue specific gap penalties.
>>
>> ?ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks.
>>
>> ?EXTRA: Implement some benchmark technique to measure the final
>> alignment quality.
>>
>> Main Risks
>> ----------
>>
>> The main risk to this project is the intrinsic complexity of the MSA
>> progressive algorithm. To deal with that we decided to break the
>> implementation in a large number of small and manageable steps, and
>> the steps are designed in a way that, at the end of each of them, we
>> will have a complete and testable new function (or a modification of
>> an existing one). Besides that, to be extra careful the project aims
>> to produce a simple full functional MSA algorithm as early as
>> possible, the estimated time is 8 weeks, this way we guarantee to
>> deliver at a simpler, but working and bug-free, version.
>>
>>
>>
>>
>>> Andreas
>>>
>>>
>>>> -------------------------------------------------------------
>>>>
>>>> GSoC proposal
>>>>
>>>> Abstract
>>>> --------
>>>>
>>>> This project aims to develop an all-Java implementation of a multiple
>>>> sequence alignment (MSA) algorithm to be added to the Biojava toolkit,
>>>> using the progressive algorithm described in the CLUSTALW paper [1].
>>>>
>>>> The Importance
>>>> --------------
>>>>
>>>> Multiple sequence alignment is a frequently performed task in sequence
>>>> analysis with the goal to identify new members of protein families and
>>>> infer phylogenetic relationships between proteins and genes. At the
>>>> present there is no Java-only implementation for this algorithm. As
>>>> such the number of already existing and Java related BioInformatics
>>>> tools and web sites would benefit from this implementation and
>>>> sequence analysis could be more easily performed by the end-user.
>>>>
>>>> About Me
>>>> --------
>>>>
>>>> I am a graduate student at University of S?o Paulo (Brazil), I got my
>>>> undergraduate degree from the same university with a major in Computer
>>>> Science and a minor in Biology. I have been involved with
>>>> Bioinformatics for 5 years, always with sequence analysis with
>>>> particular interest in the MSA problem. Also, in my undergraduate
>>>> final project I developed a lossless filter (pruning algorithm) for
>>>> the MSA problem, the work is published in [3] and there is an online
>>>> implementation of the algorithm in [4]. Finally, I have experience
>>>> with the C, C++, Java, Python and Ruby programming languages; Git and
>>>> SVN version control systems.
>>>>
>>>> Project Plan
>>>> ------------
>>>>
>>>> The project is divided in four main steps, at the end of each step a
>>>> completely functional and bug-free new algorithm will be added to the
>>>> Biojava code base. It should be noticed that each step has a strong
>>>> dependence on the previous one, so before move to the next step a
>>>> careful testing will be done.
>>>>
>>>> The four steps are described below, estimated times for accomplishment
>>>> of each step are also given and in some steps extra enhancements are
>>>> described, they will be implemented if there is some time remaining
>>>> after all steps are completed.
>>>>
>>>> ** 1. Study the Biojava pairwise alignment code and update it to be
>>>> compliant with Biojava 3.
>>>>
>>>> ?The pairwise alignment will play an important role in the MSA
>>>> algorithm. This step is also important for me to get used to the
>>>> Biojava coding standards and get in touch with the Biojava dev
>>>> community.
>>>>
>>>> ?ETA: 2 weeks.
>>>>
>>>> ** 2. Implement the algorithm to build the distance matrix.
>>>>
>>>> ?This is done using the pairwise alignment for each pair of sequence
>>>> in the set to be aligned.
>>>>
>>>> ?ETA: 1 week.
>>>>
>>>> ?EXTRA: Enhance the basic algorithm to use parallel strategies, use
>>>> several threads to calculate the pairwise alignment for different
>>>> pairs in the sequence set.
>>>>
>>>> ** 3. Implement the algorithm to build the guide tree.
>>>>
>>>> ?The guide tree is based on the distance matrix built in the last
>>>> step, the tree construction strategy adopted will be the Neighbor
>>>> Joining Algorithm.
>>>>
>>>> ?ETA: 2 weeks.
>>>>
>>>> ** 4. Implement the algorithm for progressive MSA using the guide tree.
>>>>
>>>> ?This is certainly the most difficult part of the project, so to make
>>>> sure we are going to deliver a fully functional MSA algorithm, a safer
>>>> approach is going to be taken. In the first place, a dynamic
>>>> programming algorithm described in [2] will be implemented. Once this
>>>> get successfully done and the code fully integrated to the Biojava
>>>> code base, the features described in [1] are going to be incrementally
>>>> added (and tested) in order to implement the full dynamic programming
>>>> algorithm.
>>>>
>>>> ?ETA: (basic algorithm) 3 weeks. (full algorithm) 7 weeks.
>>>>
>>>> ?EXTRA: Implement some benchmark technique to measure the final
>>>> alignment quality.
>>>>
>>>> References
>>>> ----------
>>>>
>>>> [1] http://www.ncbi.nlm.nih.gov/pubmed/7984417
>>>> [2] http://www.ncbi.nlm.nih.gov/pubmed/3243435
>>>> [3] http://www.almob.org/content/4/1/3
>>>> [4] http://mobyle.genouest.org/cgi-bin/Mobyle/portal.py?form=tuiuiu
>>>>
>>>>
>>>>
>>>> On Tue, Apr 6, 2010 at 6:27 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>> Hi Gustavo,
>>>>>
>>>>> In principle I agree to all, see details below:
>>>>>
>>>>>
>>>>> I think my question wasn't very clear, my intention in this project is
>>>>>>
>>>>>> to follow the approach (with the tree steps) outlined in the project's
>>>>>> page. Using the classical progressive alignment heuristic: build the
>>>>>> distance matrix, build the guide tree and using this tree
>>>>>> progressively align more sequences together.
>>>>>
>>>>> yes
>>>>>
>>>>>>
>>>>>> What I propose for the third step is a first implementation using the
>>>>>> (more simple) dynamic programming described in the first CLUSTAL paper
>>>>>> (I thinks it's from 1988) and incrementally improving the algorithm to
>>>>>> get closer to the one described in CLUSTALW paper (from 1994). Is this
>>>>>> more or less what you had in mind?
>>>>>
>>>>> yes, sounds good.
>>>>>
>>>>>>
>>>>>> About parallel strategies, I think a relative easy way we could use it
>>>>>> is in the distance matrix construction, we could have several threads
>>>>>> calculating the pairwise alignment for different pairs of sequence in
>>>>>> the set.
>>>>>
>>>>> Correct. Probably a first implementation would be for a single machine/
>>>>> multi CPU. More advanced implementations could provide support e.g. for
>>>>> Map/Reduce, JPPF, or something like that...
>>>>>
>>>>>> Now, measuring alignment quality is a tougher issue. The CLUSTALW
>>>>>> paper doesn't give any way to measure the quality of the result; they
>>>>>> consider a good alignment to be one that is hard to improve by eye (but
>>>>>> they claim that for sufficiently similar sequences, with no pair less than
>>>>>> 35% identical, the results are good). Can I do the same as in the CLUSTALW
>>>>>> paper and leave the quality measure to the user? How concerned should
>>>>>> I be with that in this project?
>>>>>
>>>>> Getting an overall core-algorithm that works should be priority. The
>>>>> benchmarking part is not mandatory, but something to keep in mind... I have
>>>>> plenty of material for that, once we get to that stage...
>>>>>
>>>>>> I will try to send a proposal draft to this mailing list by tomorrow
>>>>>> to get some feedback from you.
>>>>>
>>>>> Excellent, looking forward to it.
>>>>>
>>>>> Andreas
>>>>>
>>>>> --
>>>>> -----------------------------------------------------------------------
>>>>> Dr. Andreas Prlic
>>>>> Senior Scientist, RCSB PDB Protein Data Bank
>>>>> University of California, San Diego
>>>>> (+1) 858.246.0526
>>>>> -----------------------------------------------------------------------
>>>>>
>>>>
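
The multi-threaded distance-matrix idea from the quoted discussion above (single machine, several CPUs) could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name, the method names and the toy `pairDistance` function (a simple mismatch fraction standing in for a real pairwise alignment score) are hypothetical and are not BioJava API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelDistanceMatrix {

    // Hypothetical stand-in for a pairwise alignment score: the fraction of
    // mismatching positions, counting any length difference as mismatches.
    static double pairDistance(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        int longer = Math.max(a.length(), b.length());
        if (longer == 0) {
            return 0.0;
        }
        int mismatches = longer - shorter;
        for (int i = 0; i < shorter; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                mismatches++;
            }
        }
        return (double) mismatches / longer;
    }

    // Fills a symmetric distance matrix, submitting one task per unordered pair.
    static double[][] compute(List<String> seqs, int threads) {
        int n = seqs.size();
        double[][] matrix = new double[n][n];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                for (int j = i + 1; j < n; j++) {
                    final int row = i;
                    final int col = j;
                    pending.add(pool.submit(() -> {
                        double d = pairDistance(seqs.get(row), seqs.get(col));
                        matrix[row][col] = d; // each task writes only its own two cells
                        matrix[col][row] = d;
                    }));
                }
            }
            for (Future<?> f : pending) {
                f.get(); // wait for the task and surface any failure
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("pairwise task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return matrix;
    }

    public static void main(String[] args) {
        double[][] m = compute(Arrays.asList("ACGT", "ACGA", "TCGA"), 2);
        for (double[] row : m) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Because every pair is independent, no locking is needed here; waiting on each `Future` before reading the matrix is what makes the results visible to the calling thread.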


From sheoran143 at gmail.com  Sun Apr 11 19:16:29 2010
From: sheoran143 at gmail.com (Deepak Sheoran)
Date: Sun, 11 Apr 2010 14:16:29 -0500
Subject: [Biojava-l] Issue with SimpleNCBITaxon class
Message-ID: <4BC2200D.8000109@gmail.com>

Hi,

There is a very fundamental issue in the SimpleNCBITaxon class, because of 
which it is producing a wrong taxonomy hierarchy. I am explaining what I 
have found; let me know what you guys think of it, and suggest how to fix it.

1) Columns in the taxon table are (taxon_id, ncbi_taxon_id, parent_taxon_id, 
node_rank, genetic_code, mito_genetic_code, left_value, right_value)
2) In the class SimpleNCBITaxon we assume that "parent_taxon_id" holds the 
parent's ncbi_taxon_id for the current ncbi_taxon_id value, but that is not 
true. The value that "parent_taxon_id" holds is the "taxon_id" of the row whose 
ncbi_taxon_id is the parent of the current ncbi_taxon_id.

<property name="NCBITaxID" column="ncbi_taxon_id" node="@NCBITaxId"/>
<property name="nodeRank" column="node_rank"/>
<property name="geneticCode" column="genetic_code"/>
<property name="mitoGeneticCode" column="mito_genetic_code"/>
<property name="leftValue" column="left_value"/>
<property name="rightValue" column="right_value"/>
<property name="parentNCBITaxID" column="parent_taxon_id"/>      ----- 
this is not correct: the parent_taxon_id column stores the taxon_id of the 
row that holds the parent's ncbi_taxon_id for the current entry

Thanks
Deepak Sheoran


From holland at eaglegenomics.com  Sun Apr 11 19:53:06 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Sun, 11 Apr 2010 20:53:06 +0100
Subject: [Biojava-l] Issue with SimpleNCBITaxon class
In-Reply-To: <4BC2200D.8000109@gmail.com>
References: <4BC2200D.8000109@gmail.com>
Message-ID: <B2DECC5B-650E-434E-8955-5F02DB4297AC@eaglegenomics.com>

I'm sorry but I don't understand your example. Could you provide a real example of correct values for each column from a sample taxon entry in NCBI, plus an example of what BioJava is doing wrong? (i.e. give a sample record to use as reference, then point out the correct value of parent_taxon_id, and point out what value BioJava is using instead).

thanks,
Richard

On 11 Apr 2010, at 20:16, Deepak Sheoran wrote:

> Hi,
> 
> Their is very fundamental issue in SimpleNCBITaxon class becuase of which it is producing wrong taxonomy hierarchy. I am explaing what I have found let me what you guys think of it, and me suggest how to fix it.
> 
> 1) Columns in taxon table are (taxon_id, ncbi_taxon_id, parent_taxon_id, nodeRank, geneticCode, mitoGeneticCode, leftValue, rightValue)
> 2) In the class SimpleNCBITaxon we are thinking "parent_taxon_id" to have parent ncbi_taxon_id for current ncbi_taxon_id value, but its not true. The value which "parent_taxon_id" have is "taxon_id" which have parent_ncbi_taxon_id of current ncbi_taxon_id.
> 
> <property name="NCBITaxID" column="ncbi_taxon_id" node="@NCBITaxId"/>
> <property name="nodeRank" column="node_rank"/>
> <property name="geneticCode" column="genetic_code"/>
> <property name="mitoGeneticCode" column="mito_genetic_code"/>
> <property name="leftValue" column="left_value"/>
> <property name="rightValue" column="right_value"/>
> <property name="parentNCBITaxID" column="parent_taxon_id"/>      ----- its not correct column parent_taxon_id stores the taxon_id which have parent_ncbi_taxon_id for current entry
> 
> Thanks
> Deepak Sheoran
> 
> 

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From sheoran143 at gmail.com  Sun Apr 11 21:08:22 2010
From: sheoran143 at gmail.com (Deepak Sheoran)
Date: Sun, 11 Apr 2010 16:08:22 -0500
Subject: [Biojava-l] Issue with SimpleNCBITaxon class
In-Reply-To: <B2DECC5B-650E-434E-8955-5F02DB4297AC@eaglegenomics.com>
References: <4BC2200D.8000109@gmail.com>
	<B2DECC5B-650E-434E-8955-5F02DB4297AC@eaglegenomics.com>
Message-ID: <4BC23A46.7090304@gmail.com>

I am using the same table with the BioJava and BioPerl taxon programs, and 
the output I get is below:

*Biojava:*
For example, for ncbi_taxon_id = 11876 (Avian sarcoma virus), the lineage 
I get is
             Rhamnus; Platanus occidentalis; Suillus placidus; Diadasia 
australis; Arnicastrum guerrerense; Labiduridae; Oreostemma alpigenum 
var. haydenii.

BioJava's process of finding names: 
11876==>301952==>50447==>176516==>143975==>48579==>4403==>3609==>276240   
(wrong way of doing things)

*Bioperl:*
For example, for ncbi_taxon_id = 11876 (Avian sarcoma virus), the lineage 
I get is
           Retroviridae; Orthoretrovirinae; Alpharetrovirus; 
unclassified Alpharetrovirus.

BioPerl's process of finding names: 
11876==>353825==>153057==>327045==>11632   (right way of doing things)

Hint: BioJava searches the ncbi_taxon_id column with a value from 
parent_taxon_id, whereas BioPerl searches the taxon_id column with a value 
from parent_taxon_id.

*Taxon and Taxon_name table content relevant to this discussion:*

taxon_id  ncbi_taxon_id  parent_taxon_id  node_rank  name                                name_class
2901      3609           276240           genus      Rhamnus                             scientific name
3610      4403           3609             species    Platanus occidentalis               scientific name
29052     48579          4403             species    Suillus placidus                    scientific name
114412    143975         48579            species    Diadasia australis                  scientific name
143976    176516         143975           species    Arnicastrum guerrerense             scientific name
30680     50447          176516           family     Labiduridae                         scientific name
254757    301952         50447            varietas   Oreostemma alpigenum var. haydenii  scientific name
9394      11632          17394            family     Retroviridae                        scientific name
277861    327045         9394             subfamily  Orthoretrovirinae                   scientific name
122448    153057         277861           genus      Alpharetrovirus                     scientific name
301952    353825         122448           no rank    unclassified Alpharetrovirus        scientific name
9584      11876          301952           species    Avian sarcoma virus                 scientific name
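
The two lookups described in the hint can be reproduced on the sample rows with a small sketch. This is plain Java written only for illustration, not code from BioJava or BioPerl; the class name `TaxonWalkDemo` and the method `lineage` are made up, and the data is just the twelve rows listed in this message.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TaxonWalkDemo {

    // Sample rows from this message: {taxon_id, ncbi_taxon_id, parent_taxon_id}.
    static final int[][] IDS = {
        {2901, 3609, 276240}, {3610, 4403, 3609}, {29052, 48579, 4403},
        {114412, 143975, 48579}, {143976, 176516, 143975}, {30680, 50447, 176516},
        {254757, 301952, 50447}, {9394, 11632, 17394}, {277861, 327045, 9394},
        {122448, 153057, 277861}, {301952, 353825, 122448}, {9584, 11876, 301952},
    };

    // Scientific names, in the same order as IDS.
    static final String[] NAMES = {
        "Rhamnus", "Platanus occidentalis", "Suillus placidus",
        "Diadasia australis", "Arnicastrum guerrerense", "Labiduridae",
        "Oreostemma alpigenum var. haydenii", "Retroviridae", "Orthoretrovirinae",
        "Alpharetrovirus", "unclassified Alpharetrovirus", "Avian sarcoma virus",
    };

    // Returns the ancestor names, root first, for the given ncbi_taxon_id.
    // matchOnTaxonId = true  : parent_taxon_id is looked up in taxon_id
    // matchOnTaxonId = false : parent_taxon_id is looked up in ncbi_taxon_id
    static List<String> lineage(int ncbiTaxonId, boolean matchOnTaxonId) {
        Map<Integer, Integer> byTaxonId = new HashMap<>();
        Map<Integer, Integer> byNcbiId = new HashMap<>();
        for (int i = 0; i < IDS.length; i++) {
            byTaxonId.put(IDS[i][0], i);
            byNcbiId.put(IDS[i][1], i);
        }
        Map<Integer, Integer> parentIndex = matchOnTaxonId ? byTaxonId : byNcbiId;

        List<String> names = new ArrayList<>();
        Integer current = byNcbiId.get(ncbiTaxonId);
        if (current == null) {
            return names;
        }
        Integer parent = parentIndex.get(IDS[current][2]);
        // The size check only guards against cycles in malformed data.
        while (parent != null && names.size() < IDS.length) {
            names.add(NAMES[parent]);
            parent = parentIndex.get(IDS[parent][2]);
        }
        Collections.reverse(names);
        return names;
    }

    public static void main(String[] args) {
        System.out.println("match on ncbi_taxon_id: " + lineage(11876, false));
        System.out.println("match on taxon_id:      " + lineage(11876, true));
    }
}
```

On these rows, matching on ncbi_taxon_id walks into the plant and insect entries, while matching on taxon_id stays within the retrovirus entries.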


Thanks
Deepak



From sheoran143 at gmail.com  Sun Apr 11 22:48:00 2010
From: sheoran143 at gmail.com (Deepak Sheoran)
Date: Sun, 11 Apr 2010 17:48:00 -0500
Subject: [Biojava-l] Issue with SimpleNCBITaxon class
In-Reply-To: <4BC23A46.7090304@gmail.com>
References: <4BC2200D.8000109@gmail.com>
	<B2DECC5B-650E-434E-8955-5F02DB4297AC@eaglegenomics.com>
	<4BC23A46.7090304@gmail.com>
Message-ID: <4BC251A0.4090602@gmail.com>

If we don't want to change the current code in BioJava and still want to 
fix this bug, I have found a way:
1) We can change one of the Hibernate mapping files, "Taxon.hbm.xml", and 
replace the line
<property name="parentNCBITaxID" column="parent_taxon_id"/>
     with
<property name="parentNCBITaxID" formula="(select tax.ncbi_taxon_id from taxon tax where tax.taxon_id = parent_taxon_id)"/>

With this change to the Hibernate mapping I am able to get the 
correct lineage for ncbi_taxon_id = 11876 (Avian sarcoma virus), which is
              Viruses; Retro-transcribing viruses; Retroviridae; 
Orthoretrovirinae; Alpharetrovirus; unclassified Alpharetrovirus.

2) A possible issue is the taxonomy loader class, which wants to insert a 
value for the parent taxon into the taxon table; I think that won't be 
possible if we make this change to the Hibernate config file.
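
To make clear what the subselect in the formula computes, here is a rough plain-Java emulation on a few of the sample rows from my previous message. It is only a sketch: the class name `ParentFormulaDemo` and the method `parentNcbiTaxId` are hypothetical, and no Hibernate or database code is involved.

```java
import java.util.HashMap;
import java.util.Map;

class ParentFormulaDemo {

    // A few sample rows: {taxon_id, ncbi_taxon_id, parent_taxon_id}.
    static final int[][] ROWS = {
        {9394, 11632, 17394},
        {277861, 327045, 9394},
        {122448, 153057, 277861},
        {301952, 353825, 122448},
        {9584, 11876, 301952},
    };

    // Emulates, for the row with the given ncbi_taxon_id:
    //   select tax.ncbi_taxon_id from taxon tax
    //   where tax.taxon_id = parent_taxon_id
    // Returns null when no row has that taxon_id (the subselect yields no row).
    static Integer parentNcbiTaxId(int ncbiTaxonId) {
        Map<Integer, Integer> ncbiByTaxonId = new HashMap<>();
        for (int[] row : ROWS) {
            ncbiByTaxonId.put(row[0], row[1]);
        }
        for (int[] row : ROWS) {
            if (row[1] == ncbiTaxonId) {
                return ncbiByTaxonId.get(row[2]);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println("parent of 11876:  " + parentNcbiTaxId(11876));
        System.out.println("parent of 353825: " + parentNcbiTaxId(353825));
        System.out.println("parent of 11632:  " + parentNcbiTaxId(11632));
    }
}
```

As far as I know, a property mapped with `formula` in Hibernate is a derived, read-only value that is left out of INSERT and UPDATE statements, which is why the loader concern in point 2 arises.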

Deepak Sheoran




From holland at eaglegenomics.com  Mon Apr 12 06:57:57 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Mon, 12 Apr 2010 07:57:57 +0100
Subject: [Biojava-l] Issue with SimpleNCBITaxon class
In-Reply-To: <4BC23A46.7090304@gmail.com>
References: <4BC2200D.8000109@gmail.com>
	<B2DECC5B-650E-434E-8955-5F02DB4297AC@eaglegenomics.com>
	<4BC23A46.7090304@gmail.com>
Message-ID: <D75549FE-6866-4397-ACEE-A897C719C441@eaglegenomics.com>

Thanks Deepak. 

I've had a look at the code and I believe it's due to the different ways in which BioJava and BioPerl load the taxon table. 

BioJava sets the ncbi_taxon_id and parent_taxon_id columns based on the values from the NCBI taxonomy file. The taxon_id column in BioJava is a meaningless auto-generated value that is never used.

BioPerl however is generating taxon_id values and linking them by setting parent_taxon_id to the generated value. The parent value from the NCBI taxonomy file is therefore replaced with the BioPerl generated parent ID, meaning that instead of linking from parent_taxon_id to ncbi_taxon_id as per BioJava, the link is to taxon_id instead. (I'm basing this comment on looking at load_ncbi_taxonomy.pl from the BioSQL archives.)

I believe if you load the taxonomy table using BioJava, you should see BioJava giving correct behaviour. Likewise if you load it using BioPerl, BioPerl will behave correctly. But if you load with one then query with the other, you'll get incorrect results.
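
To illustrate the two loading conventions just described, here is a minimal sketch. It is not the actual BioJava loader or load_ncbi_taxonomy.pl: the class and method names are hypothetical, the generated taxon_id is assumed to be a simple 1-based counter, and the input is the retrovirus parent chain taken from Deepak's example (the parent of 11632 is left out of this toy input).

```java
import java.util.HashMap;
import java.util.Map;

class TaxonLoaderConventions {

    static final int NO_PARENT = -1; // marker: parent not present in this toy input

    // Toy input in the style of the NCBI taxonomy file: {tax id, parent tax id}.
    static final int[][] NCBI_NODES = {
        {11632, NO_PARENT},
        {327045, 11632},
        {153057, 327045},
        {353825, 153057},
        {11876, 353825},
    };

    // Convention A: parent_taxon_id holds the parent's NCBI id, and the
    // generated taxon_id is never referenced.
    // Each output row is {taxon_id, ncbi_taxon_id, parent_taxon_id}.
    static int[][] loadWithNcbiParent(int[][] nodes) {
        int[][] rows = new int[nodes.length][];
        for (int i = 0; i < nodes.length; i++) {
            rows[i] = new int[] {i + 1, nodes[i][0], nodes[i][1]};
        }
        return rows;
    }

    // Convention B: parent_taxon_id holds the generated taxon_id of the parent row.
    static int[][] loadWithGeneratedParent(int[][] nodes) {
        Map<Integer, Integer> generatedId = new HashMap<>();
        for (int i = 0; i < nodes.length; i++) {
            generatedId.put(nodes[i][0], i + 1);
        }
        int[][] rows = new int[nodes.length][];
        for (int i = 0; i < nodes.length; i++) {
            Integer parent = generatedId.get(nodes[i][1]);
            rows[i] = new int[] {i + 1, nodes[i][0], parent == null ? NO_PARENT : parent};
        }
        return rows;
    }

    public static void main(String[] args) {
        int[][] a = loadWithNcbiParent(NCBI_NODES);
        int[][] b = loadWithGeneratedParent(NCBI_NODES);
        for (int i = 0; i < a.length; i++) {
            System.out.println(java.util.Arrays.toString(a[i])
                    + "  vs  " + java.util.Arrays.toString(b[i]));
        }
    }
}
```

The same NCBI record ends up with a different parent_taxon_id under each convention, so a reader written for one convention follows the wrong rows in a table loaded under the other.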

This sounds like a case for discussion on both lists - a matter of standardisation between the two projects. Not quickly/easily solvable for now.

cheers,
Richard


--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From holland at eaglegenomics.com  Mon Apr 12 07:07:55 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Mon, 12 Apr 2010 08:07:55 +0100
Subject: [Biojava-l] Issue with SimpleNCBITaxon class
In-Reply-To: <D75549FE-6866-4397-ACEE-A897C719C441@eaglegenomics.com>
References: <4BC2200D.8000109@gmail.com>
	<B2DECC5B-650E-434E-8955-5F02DB4297AC@eaglegenomics.com>
	<4BC23A46.7090304@gmail.com>
	<D75549FE-6866-4397-ACEE-A897C719C441@eaglegenomics.com>
Message-ID: <E7FB88D1-52D9-496C-86FA-738419FFF579@eaglegenomics.com>

Incidentally, BioJava's approach matches the description in the BioSQL docs at:

 http://biosql.org/wiki/Schema_Overview#TAXON.2C_TAXON_NAME

(first example SQL statement - find the taxon id of the parent taxon for 'Homo sapiens' using a self-join)

The BioPerl/BioSQL load_ncbi_taxonomy.pl script however does not match this description.

cheers,
Richard


--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From mara.axiom at gmail.com  Tue Apr 13 14:55:50 2010
From: mara.axiom at gmail.com (Mara Axiom)
Date: Tue, 13 Apr 2010 10:55:50 -0400
Subject: [Biojava-l] BioJava implementation of a phylogenetic tree
	reconstruction algorithm
Message-ID: <z2z6375ed361004130755oc61dc936j140bc4515b0270fc@mail.gmail.com>

Hello all,

Does anyone have a BioJava implementation of a phylogenetic tree
reconstruction algorithm other than neighbor-joining or UPGMA? I need this for
research. We already have neighbor-joining and UPGMA implementations, and we
want to look at algorithms other than these. I am new to BioJava, so any
information will help.

Here is what we want.

1 - Compare sequences in a FASTA file, and find sequences that are similar
to each other.
2 - Construct the tree.
3 - Output the tree in Newick (XML will work too) format.

In particular we are interested in implementation of BNNP (
http://www.cs.cmu.edu/~guyb/papers/SDBHRS06.pdf) and Align Free (
http://www.math.ucla.edu/~roch/research_files/align-free.pdf) algorithms,
but we are open to other algorithms too.

Please do not recommend a P-tree reconstruction tool. We are only interested
in source code that meets our specific purpose.

Thanks in advance,
Mara


From biopython at maubp.freeserve.co.uk  Thu Apr 15 17:54:56 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Apr 2010 18:54:56 +0100
Subject: [Biojava-l] Issue with SimpleNCBITaxon class
In-Reply-To: <D75549FE-6866-4397-ACEE-A897C719C441@eaglegenomics.com>
References: <4BC2200D.8000109@gmail.com>
	<B2DECC5B-650E-434E-8955-5F02DB4297AC@eaglegenomics.com>
	<4BC23A46.7090304@gmail.com>
	<D75549FE-6866-4397-ACEE-A897C719C441@eaglegenomics.com>
Message-ID: <m2o320fb6e01004151054rcb57a28fvad135dffbe35d5fa@mail.gmail.com>

Hi,

I've CC'd this to the BioSQL mailing list for cross project
discussion.

On Mon, Apr 12, 2010 at 7:57 AM, Richard Holland  wrote:
> Thanks Deepak.
>
> I've had a look at the code and I believe it's due to the
> different ways in which BioJava and BioPerl load the
> taxon table.
>
> BioJava sets the ncbi_taxon_id and parent_taxon_id
> columns based on the values from the NCBI taxonomy
> file. The taxon_id column in BioJava is a meaningless
> auto-generated value that is never used.
>
> BioPerl however is generating taxon_id values and
> linking them by setting parent_taxon_id to the
> generated value. The parent value from the NCBI
> taxonomy file is therefore replaced with the BioPerl
> generated parent ID, meaning that instead of linking
> from parent_taxon_id to ncbi_taxon_id as per BioJava,
> the link is to taxon_id instead. (I'm basing this
> comment on looking at load_ncbi_taxonomy.pl from
> the BioSQL archives.)

Note that old versions of load_ncbi_taxonomy.pl
(which is part of BioSQL, not part of BioPerl) would
set taxon_id equal to ncbi_taxon_id, see:
http://bugzilla.open-bio.org/show_bug.cgi?id=2470

This may help explain the confusion.

> I believe if you load the taxonomy table using BioJava,
> you should see BioJava giving correct behaviour.
> Likewise if you load it using BioPerl, BioPerl will
> behave correctly. But if you load with one then query
> with the other, you'll get incorrect results.
>
> This sounds like a case for discussion on both lists -
> a matter of standardisation between the two projects.
> Not quickly/easily solvable for now.

It's not just two projects (BioPerl & BioJava) (grin).
It's at least five projects (BioSQL itself plus BioRuby
and Biopython).

I'm not sure about BioRuby's implementation, but
currently I think BioJava is the odd one out - BioPerl,
Biopython, and the BioSQL's load_ncbi_taxonomy.pl
all make entries in parent_taxon_id reference the
automatically generated taxon_id (please correct
me if I am wrong).

My personal view is that bioperl-db is the reference
implementation and should be followed in the event
of any ambiguity within BioSQL. In this particular
case, there is actually a BioSQL script to check
against too (load_ncbi_taxonomy.pl).

Hopefully Hilmar can give us an official verdict...

Peter


From andreas.draeger at uni-tuebingen.de  Wed Apr  7 13:22:26 2010
From: andreas.draeger at uni-tuebingen.de (=?ISO-8859-1?Q?Andreas_Dr=E4ger?=)
Date: Wed, 07 Apr 2010 15:22:26 +0200
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava
	alignment lead"]
Message-ID: <4BBC8712.90907@uni-tuebingen.de>

Hi all,

This e-mail is just for your information about somebody new, who'd like 
to contribute to our project.

Cheers
Andreas

-- 
Dipl.-Bioinform. Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 Tübingen
Germany

Phone: +49-7071-29-70436
Fax:   +49-7071-29-5091
-------------- next part --------------
An embedded message was scrubbed...
From: =?ISO-8859-1?Q?Andreas_Dr=E4ger?=
 <andreas.draeger at uni-tuebingen.de>
Subject: Re: Fwd: Proposing a project on "Biojava alignment lead"
Date: Wed, 07 Apr 2010 09:27:13 +0200
Size: 4779
URL: <http://lists.open-bio.org/pipermail/biojava-l/attachments/20100407/6a3f0bf8/attachment-0002.eml>

From jbdundas at gmail.com  Fri Apr 16 13:57:41 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Fri, 16 Apr 2010 19:27:41 +0530
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava
	alignment lead"]
In-Reply-To: <4BBD820D.9070200@uni-tuebingen.de>
References: <4BBD820D.9070200@uni-tuebingen.de>
Message-ID: <j2n326ea8621004160657ga9002ed9w52a2646d5befd22@mail.gmail.com>

Dear Sir,

I am very interested in contributing to this project.

I am looking for a good problem, more on the research side. I can also
help with coding (I also work as a software
engineer: J2EE/Eclipse/JBoss/Tomcat).

Anything that I could work on...

Regards,
Jitesh Dundas

On 4/8/10, Andreas Dräger <andreas.draeger at uni-tuebingen.de> wrote:
> Hi all,
>
> This e-mail is just for your information about somebody new, who'd like
> to contribute to our project.
>
> Cheers
> Andreas
>
>
> Subject:
> Re: Fwd: Proposing a project on "Biojava alignment lead"
> From:
> Andreas Dräger <andreas.draeger at uni-tuebingen.de>
> Date:
> Wed, 07 Apr 2010 09:27:13 +0200
> To:
> Cai Shaojiang <caishaojiang at gmail.com>
>
> Hi Cai Shaojiang,
>
> Thank you for your e-mail! I don't know what happened to the e-mail list.
> Sometimes it takes a while due to the spam filters, I guess.
>
>  > I am a PhD student from National University of Singapore. My major
> research area is local alignment algorithms and data structures for SNP
> identification. And I have used Java and Eclipse for years for software
> development. I am very interested in your GSoC programme. I find that
> there is a module called "biojava-alignment lead" whose mentor is you. I
> want to propose a new project on this module. I have several questions
> about this module.
>
> Yes, that's me. So great to get your support.
>
>  > 1. It seems that pairwise alignment is to find similarity between two
> short sequences. Existing pairwise alignment is based on dynamic
> programming; is it the Smith-Waterman algorithm?
>
> So, currently, BioJava contains three different alignment approaches.
> There are two deterministic algorithms, i.e., Smith-Waterman for local
> alignment and Needleman-Wunsch for global alignment. Third, there is the
> possibility to apply Hidden Markov Models for alignment. An example of
> the latter approach should be in the cookbook.
>
>  > 2. What is the exact task of "refactoring of underlying data structures"?
>
> Yes, this is something I already did last week, but it could still be
> improved. The problem was that the alignment algorithms actually
> produced a kind of string that looks similar to the output of BLAST.
> This string contained the score, the computation time, the length of the
> alignment etc. The problem was that people wanted to perform
> higher-level computation on the score value or evaluate some other
> information. Now, the alignment will produce a data structure that
> contains all the information and can, in addition to that, also produce
> such a BLAST-like output. There is, however, still the following
> problem: The data structure requires both sequences in the pairwise
> alignment to have an identical length. In the case of local alignment
> this is especially stupid (actually), because gaps are inserted to fill
> the sequences. And then the data structure tries to keep the old sequence
> coordinates, with the effect that the numbers "query start",
> "query end", "subject start", and "subject end" are required to shift
> the sequences against each other when displaying the output. So, you
> cannot easily print the sequences one below the other; you first have to
> shift them. Please check out the latest version of this package via
> anonymous svn and have a look ;-)
>
>  > 3. My existing research area is aiming to deal with aligning short
> reads (10s~100s bp) against extremely long sequences (e.g., human
> genome). As far as I know, no such alignment tool is
> implemented in Java. Would you consider this direction?
>
> See, this would be very nice to include. But this requires that we no
> longer fill the short sequence with many, many gap symbols (just a waste
> of memory), but improve the data structure. There is already an
> UnequalLengthAlignment (just a data structure, no algorithm) and I think
> we could use this as a starting point. Then your algorithm should only
> produce such a data structure and this would be fine.
>
>  > 4. It seems that the existing tools just lack some
> refactoring and representation interfaces. Any more underlying tasks?
>
> Hm. Yes: With the release of BioJava 3 data structures have changed
> again. So maybe there's also some adaptation to the new structure required.
>
>  > I have been keeping an eye on GSoC since last month, but sorry to find out
> that I sent the initial email to the mailing list before I subscribed to it...
>
> Ok. Sounds good. Thanks for your interest. So I suggest: Download the
> latest trunk, have a look, play around and if you can improve something
> we'll put it into the trunk and write your name into the authors' tag.
>
> Cheers
> Andreas
>
> --
> Dipl.-Bioinform. Andreas Dräger
> Eberhard Karls University Tübingen
> Center for Bioinformatics (ZBIT)
> Sand 1
> 72076 Tübingen
> Germany
>
> Phone: +49-7071-29-70436
> Fax:   +49-7071-29-5091
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
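
The data-structure point in the quoted reply (keep the score and the four coordinates rather than padding the shorter sequence with gaps) can be sketched as follows. This is plain Java and not the BioJava alignment package: the class LocalAlignmentSketch, its field names and the scoring values are assumptions chosen for illustration. It runs a basic Smith-Waterman with a linear gap penalty and returns only the score plus 1-based inclusive start/end positions in each sequence.

```java
/** Hypothetical result holder: score plus 1-based inclusive coordinates, no gap padding. */
final class LocalAlignmentSketch {
    final int score;
    final int queryStart;
    final int queryEnd;
    final int subjectStart;
    final int subjectEnd;

    private LocalAlignmentSketch(int score, int qs, int qe, int ss, int se) {
        this.score = score;
        this.queryStart = qs;
        this.queryEnd = qe;
        this.subjectStart = ss;
        this.subjectEnd = se;
    }

    /**
     * Smith-Waterman with a linear gap penalty. mismatch and gap are expected
     * to be negative. If nothing aligns, score is 0 and each end is start - 1.
     */
    static LocalAlignmentSketch align(String query, String subject,
                                      int match, int mismatch, int gap) {
        int[][] h = new int[query.length() + 1][subject.length() + 1];
        int best = 0, bestI = 0, bestJ = 0;
        for (int i = 1; i <= query.length(); i++) {
            for (int j = 1; j <= subject.length(); j++) {
                int sub = query.charAt(i - 1) == subject.charAt(j - 1) ? match : mismatch;
                int diag = h[i - 1][j - 1] + sub;
                int up = h[i - 1][j] + gap;
                int left = h[i][j - 1] + gap;
                h[i][j] = Math.max(0, Math.max(diag, Math.max(up, left)));
                if (h[i][j] > best) {
                    best = h[i][j];
                    bestI = i;
                    bestJ = j;
                }
            }
        }
        // Trace back from the best cell to the first zero cell to find the start.
        int i = bestI, j = bestJ;
        while (i > 0 && j > 0 && h[i][j] > 0) {
            int sub = query.charAt(i - 1) == subject.charAt(j - 1) ? match : mismatch;
            if (h[i][j] == h[i - 1][j - 1] + sub) {
                i--;
                j--;
            } else if (h[i][j] == h[i - 1][j] + gap) {
                i--;
            } else {
                j--;
            }
        }
        return new LocalAlignmentSketch(best, i + 1, bestI, j + 1, bestJ);
    }
}
```

For example, `LocalAlignmentSketch.align("ACGT", "TTACGTTT", 2, -1, -2)` yields score 8 with query 1..4 and subject 3..6, and the short query is never padded to the length of the subject. The full score matrix kept here is quadratic in memory, so this form would not scale to short reads against a genome; it only illustrates the shape of the result.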


From chapman at cs.wisc.edu  Fri Apr 16 17:28:33 2010
From: chapman at cs.wisc.edu (Mark Chapman)
Date: Fri, 16 Apr 2010 12:28:33 -0500
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on
 "Biojava	alignment lead"]
In-Reply-To: <j2n326ea8621004160657ga9002ed9w52a2646d5befd22@mail.gmail.com>
References: <4BBD820D.9070200@uni-tuebingen.de>
	<j2n326ea8621004160657ga9002ed9w52a2646d5befd22@mail.gmail.com>
Message-ID: <4BC89E41.4030009@cs.wisc.edu>

A great place to start finding ideas is the wiki.
Both http://biojava.org/wiki/BioJava:Modules
and http://biojava.org/wiki/BioJava3_Proposal
list the next steps planned/desired for BioJava.

What research area did you have in mind?

Have fun,
Mark


On 4/16/2010 8:57 AM, jitesh dundas wrote:
> Dear Sir,
>
> I am very interested in contributing to this project.
>
> I am looking for a good problem, more on the research side. I can also
> help with coding (I also work as a software
> engineer: J2EE/Eclipse/JBoss/Tomcat).
>
> Anything that I could work on...
>
> Regards,
> Jitesh Dundas


From sheoran143 at gmail.com  Fri Apr 16 18:43:59 2010
From: sheoran143 at gmail.com (Deepak Sheoran)
Date: Fri, 16 Apr 2010 13:43:59 -0500
Subject: [Biojava-l] Issue with SimpleNCBITaxon class
In-Reply-To: <m2o320fb6e01004151054rcb57a28fvad135dffbe35d5fa@mail.gmail.com>
References: <4BC2200D.8000109@gmail.com>	
	<B2DECC5B-650E-434E-8955-5F02DB4297AC@eaglegenomics.com>	
	<4BC23A46.7090304@gmail.com>	
	<D75549FE-6866-4397-ACEE-A897C719C441@eaglegenomics.com>
	<m2o320fb6e01004151054rcb57a28fvad135dffbe35d5fa@mail.gmail.com>
Message-ID: <4BC8AFEF.70107@gmail.com>

In my experience, we should use taxon_id on this issue,
because it is a unique key in a local instance of BioSQL.
ncbi_taxon_id should be used for mapping purposes only, so that a
person can map his local taxon_id to an ncbi_taxon_id; otherwise it defeats
the whole purpose of having taxon_id as the primary key in the taxon table. I
think the main goal when BioSQL was designed was to make it
independent of any other organization such as GenBank or NCBI, while keeping
the ability to map a number (ncbi_taxon_id) given by a known
authority to a local number (taxon_id).
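
A small sketch of that mapping idea, in plain Java with no BioSQL access: the class LocalTaxonRegistry and its methods are hypothetical names, and the sequential numbering merely stands in for whatever key generation a real database would use. The NCBI id is kept only as a lookup key, and parent links are stored in terms of the local id.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical registry: hands out local taxon_id values, keeps ncbi_taxon_id only for lookup. */
final class LocalTaxonRegistry {
    private final Map<Integer, Integer> localByNcbi = new HashMap<>();
    private final Map<Integer, Integer> parentByLocal = new HashMap<>();
    private int nextLocalId = 1;

    /** Returns the local taxon_id for an NCBI id, creating one the first time it is seen. */
    int localIdFor(int ncbiTaxonId) {
        Integer existing = localByNcbi.get(ncbiTaxonId);
        if (existing != null) {
            return existing;
        }
        int id = nextLocalId++;
        localByNcbi.put(ncbiTaxonId, id);
        return id;
    }

    /** Accepts a parent link expressed in NCBI ids and stores it in local ids. */
    void link(int childNcbiId, int parentNcbiId) {
        int child = localIdFor(childNcbiId);
        int parent = localIdFor(parentNcbiId);
        parentByLocal.put(child, parent);
    }

    /** Local parent_taxon_id for a local taxon_id, or null if no parent is recorded. */
    Integer parentOf(int localTaxonId) {
        return parentByLocal.get(localTaxonId);
    }
}
```

After `link(9913, 9903)` on a fresh registry, the child gets local id 1 and the parent local id 2, so `parentOf(1)` returns 2. Nothing downstream has to know the NCBI numbers, which is the independence argued for above.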

Deepak Sheoran

On 4/15/2010 12:54 PM, Peter wrote:
> Hi,
>
> I've CC'd this to the BioSQL mailing list for cross project
> discussion.
>
> On Mon, Apr 12, 2010 at 7:57 AM, Richard Holland  wrote:
>    
>> Thanks Deepak.
>>
>> I've had a look at the code and I believe it's due to the
>> different ways in which BioJava and BioPerl load the
>> taxon table.
>>
>> BioJava sets the ncbi_taxon_id and parent_taxon_id
>> columns based on the values from the NCBI taxonomy
>> file. The taxon_id column in BioJava is a meaningless
>> auto-generated value that is never used.
>>
>> BioPerl however is generating taxon_id values and
>> linking them by setting parent_taxon_id to the
>> generated value. The parent value from the NCBI
>> taxonomy file is therefore replaced with the BioPerl
>> generated parent ID, meaning that instead of linking
>> from parent_taxon_id to ncbi_taxon_id as per BioJava,
>> the link is to taxon_id instead. (I'm basing this
>> comment on looking at load_ncbi_taxonomy.pl from
>> the BioSQL archives.)
>>      
> Note that old versions of load_ncbi_taxonomy.pl
> (which is part of BioSQL, not part of BioPerl) would
> set taxon_id equal to ncbi_taxon_id, see:
> http://bugzilla.open-bio.org/show_bug.cgi?id=2470
>
> This may help explain the confusion.
>
>    
>> I believe if you load the taxonomy table using BioJava,
>> you should see BioJava giving correct behaviour.
>> Likewise if you load it using BioPerl, BioPerl will
>> behave correctly. But if you load with one then query
>> with the other, you'll get incorrect results.
>>
>> This sounds like a case for discussion on both lists -
>> a matter of standardisation between the two projects.
>> Not quickly/easily solvable for now.
>>      
> It's not just two projects (BioPerl & BioJava) (grin).
> It's at least five projects (BioSQL itself plus BioRuby
> and Biopython).
>
> I'm not sure about BioRuby's implementation, but
> currently I think BioJava is the odd one out - BioPerl,
> Biopython, and the BioSQL's load_ncbi_taxonomy.pl
> all make entries in parent_taxon_id reference the
> automatically generated taxon_id (please correct
> me if I am wrong).
>
> My personal view is that bioperl-db is the reference
> implementation and should be followed in the event
> of any ambiguity within BioSQL. In this particular
> case, there is actually a BioSQL script to check
> against too (load_ncbi_taxonomy.pl).
>
> Hopefully Hilmar can give us an official verdict...
>
> Peter
>    


From jbdundas at gmail.com  Sat Apr 17 02:20:12 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Sat, 17 Apr 2010 07:50:12 +0530
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava
	alignment lead"]
In-Reply-To: <4BC89E41.4030009@cs.wisc.edu>
References: <4BBD820D.9070200@uni-tuebingen.de>
	<j2n326ea8621004160657ga9002ed9w52a2646d5befd22@mail.gmail.com>
	<4BC89E41.4030009@cs.wisc.edu>
Message-ID: <u2j326ea8621004161920wd63884fftaac222d022edcdbe@mail.gmail.com>

Hi Everyone,

I went through the URLs sent by Dr Chapman. Interesting work that you
are doing here. :)

I was wondering if anyone could consider these. I
would also like to be a part of the research work being carried out
using BioJava (especially in sequence alignment and miRNA signature
analysis, especially for cancers).

1) A set of tools for converting flat data (e.g. sequence strings,
taxonomy strings) into BioJava-like objects (e.g. SymbolLists,
NCBITaxon). These BioJava-like objects could then be used for more
advanced applications.
 A set of tools for manipulating the BioJava-like objects.

2) Module?: biojava-ws-blast Module?: biojava-ws-biolit
Proposed Module: biojava-j2ee Lead: Mark Schreiber

- This would probably take the form of SessionBeans and WebServices
that can be deployed to Glassfish/ JBoss etc to provide biological
services for people who want to make client server or SOA apps.

3) I also liked what Mr. Gang Wu is working on (I read the
discussions). I was wondering if I could
do something of that sort...
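
As a rough illustration of item 1 (flat data into objects), the sketch below turns a semicolon-separated taxonomy string, such as the lineage printed under ORGANISM in a GenBank record, into a list of names. It is plain Java; LineageParserSketch and parseLineage are hypothetical names and not part of BioJava, and a real tool would go on to build NCBITaxon-like objects from the list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for turning a flat lineage string into a list of names. */
final class LineageParserSketch {

    /** Splits "A; B; C." into [A, B, C], ignoring line breaks and the trailing period. */
    static List<String> parseLineage(String flat) {
        List<String> names = new ArrayList<>();
        // Collapse line wraps and indentation into single spaces first.
        String cleaned = flat.replaceAll("\\s+", " ").trim();
        if (cleaned.endsWith(".")) {
            cleaned = cleaned.substring(0, cleaned.length() - 1);
        }
        for (String part : cleaned.split(";")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }
}
```

Wrapped input is handled because whitespace is collapsed before splitting, so a lineage broken across several lines still yields one name per rank.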

May I request the leads to tell me how I could chip in...

Regards,
Jitesh Dundas


On 4/16/10, Mark Chapman <chapman at cs.wisc.edu> wrote:
> A great place to start finding ideas is the wiki.
> Both http://biojava.org/wiki/BioJava:Modules
> and http://biojava.org/wiki/BioJava3_Proposal
> list the next steps planned/desired for BioJava.
>
> What research area did you have in mind?
>
> Have fun,
> Mark


From jbdundas at gmail.com  Sat Apr 17 02:31:46 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Sat, 17 Apr 2010 08:01:46 +0530
Subject: [Biojava-l] Analytical Tool- Prediction of Unknown Protein's
	location on an a Predicted pathway
Message-ID: <j2w326ea8621004161931m69912d64t2e5d7452ac22cd8e@mail.gmail.com>

Dear All,

I wanted to propose an analytical tool in BioJava.

For example, if we have a large dataset with complete pathway
information and the related details (e.g. the p53 pathway with
all the genes, proteins, miRNAs involved, etc.), could we find
the location of a specific unknown (only predicted)
protein/gene on a predicted pathway?

This is a suggestion about the possible things on the analytical side
that we could do. Could we think of doing something of this sort for
BioJava (or at least make it capable of handling such aspects)?

Any ideas / comments are most welcome...

Regards,
Jitesh Dundas

On 4/17/10, jitesh dundas <jbdundas at gmail.com> wrote:
> Hi Everyone,
>
> I went through the URLs sent by Dr Chapman. Interesting work that you
> are doing here. :)
>
> I was wondering if anyone could consider these. I
> would also like to be a part of the research work being carried out
> using BioJava (especially in sequence alignment and miRNA signature
> analysis, especially for cancers).
>
> 1) A set of tools for converting flat data (e.g. sequence strings,
> taxonomy strings) into BioJava-like objects (e.g. SymbolLists,
> NCBITaxon). These BioJava-like objects could then be used for more
> advanced applications.
>  A set of tools for manipulating the BioJava-like objects.
>
> 2) Module?: biojava-ws-blast Module?: biojava-ws-biolit
> Proposed Module: biojava-j2ee Lead: Mark Schreiber
>
> - This would probably take the form of SessionBeans and WebServices
> that can be deployed to Glassfish/ JBoss etc to provide biological
> services for people who want to make client server or SOA apps.
>
> 3) I also liked what Mr. Gang Wu is working on (I read the
> discussions). I was wondering if I could
> do something of that sort...
>
> May I request the leads to tell me how I could chip in...
>
> Regards,
> Jitesh Dundas


From jbdundas at gmail.com  Sat Apr 17 13:34:20 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Sat, 17 Apr 2010 19:04:20 +0530
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava
	alignment lead"]
In-Reply-To: <4BBD820D.9070200@uni-tuebingen.de>
References: <4BBD820D.9070200@uni-tuebingen.de>
Message-ID: <r2w326ea8621004170634u6f7275f5xfc915d14a7c94489@mail.gmail.com>

Dear Sir,

Could anyone tell me where I could start? Is there any lead who might need
my help in software development and research-oriented aspects?

Any comments on my previous emails would be most welcome...

Regards,
Jitesh Dundas


On 4/8/10, Andreas Dräger <andreas.draeger at uni-tuebingen.de> wrote:
>
> Hi all,
>
> This e-mail is just for your information about somebody new, who'd like to
> contribute to our project.
>
> Cheers
> Andreas
>
>
> Subject:
> Re: Fwd: Proposing a project on "Biojava alignment lead"
> From:
> Andreas Dräger <andreas.draeger at uni-tuebingen.de>
> Date:
> Wed, 07 Apr 2010 09:27:13 +0200
> To:
> Cai Shaojiang <caishaojiang at gmail.com>
>
> Hi Cai Shaojiang,
>
> Thank you for you e-mail! I don't know what happened to the e-mail list.
> Sometimes it takes a while due to the spam filters, I guess.
>
> > I am a PhD student from National University of Singapore. My major
> research area is local alignment algorithms and data structures for SNP
> identification. And I have used Java and Eclipse for years for software
> development. I am very interested in your GSoC programme. I find that there
> is a module called "biojava-alignment lead" whose mentor is you. I want to
> propose a new project on this module. I have several questions about this
> module.
>
> Yes, that's me. So great to get your support.
>
> > 1. It seems that pairwise alignment is to find similarity between two
> short sequences. Existing pairwise alignment is based on dynamic
> programming, is it Smith-Waterman algorithm?
>
> So, currently, BioJava contains three different alignment approaches.
> There are two deterministic algorithms, i.e., Smith-Waterman for local
> alignment and Needleman-Wunsch for global alignment. Third, there is the
> possibility to apply Hidden Markov Models for alignment. An example of the
> latter approach should be in the cookbook.
>
> > 2. What is the exact task of "refactoring of underlying data structures"?
>
> Yes, this is something, I did last week already but it could still be
> improved. The problem was that the alignment algorithms actually produced a
> kind of string that looks similar to the output of BLAST. This string
> contained the score, the computation time, the length of the alignment etc.
> The problem was that people wanted to perform higher-level computation on
> the score value or evaluate some other information. Now, the alignment will
> produce a data structure that contains all the information and can, in
> addition to that, also produce such a BLAST-like output. There is, however,
> still the following problem: The data structure requires both sequences in
> the pair-wise alignment to have an identical length. In case of local
> alignment this is especially stupid (actually), because gaps are inserted to
> fill the sequences. And then the data structure tries to keep the old
> sequence coordinates, leading to the effect that the numbers "query start",
> "query end", "subject start", and "subject end" are required to shift the
> sequences against each other when displaying the output. So, you cannot
> easily print the sequences below of each other, you first have to shift
> them. Please check out the latest version of this package via anonymeous svn
> and have a look ;-)
>
> > 3. My existing research area is aiming to deal with aligning short read
> (10s~100s bp) against extremely long sequences (e.g., human genome). Af far
> as I know, there is not existing such alignment tools implemented in Java.
> Would you consider this direction?
>
> See, this would be very nice to include. But this requires that we no
> longer fill the short sequence with many, many gap symbols (just a waist of
> memory), but improve the data structure. There is already an
> UnequalLenghtAlignment (just a data structure, no algorithm) and I think we
> could use this as a starting point. Then your algorithm should only produce
> such a data structure and this would be fine.
>
> > 4. It seems that the existing tools is just lacking of some refactoring
> and representation interfaces. Any more underlying tasks?
>
> Hm. Yes: With the release of BioJava 3 data structures have changed again.
> So maybe there's also some adaptation to the new structure required.
>
> > I am keeping an eye on GSoC from last month, but sorry to find out that I
> sent the initial email to the mailing list before I subscribe it...
>
> Ok. Sounds good. Thanks for your interest. So I suggest: Download the
> latest trunk, have a look, play around and if you can improve something
> we'll put it into the trunk and write your name into the authors' tag.
>
> Cheers
> Andreas
>
> --
> Dipl.-Bioinform. Andreas Dr?ger
> Eberhard Karls University T?bingen
> Center for Bioinformatics (ZBIT)
> Sand 1
> 72076 T?bingen
> Germany
>
> Phone: +49-7071-29-70436
> Fax:   +49-7071-29-5091
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
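
For readers following the thread: the local alignment that the quoted reply
attributes to Smith-Waterman can be sketched in a few lines. This is a
minimal, score-only illustration and not BioJava's implementation. The class
name SmithWatermanSketch, the method localScore, and the simple
match/mismatch/linear-gap scoring are assumptions made for this example; real
use would involve a substitution matrix and a traceback.

```java
final class SmithWatermanSketch {

    /**
     * Returns the best local alignment score between a and b.
     * match is added for identical characters; mismatch and gap are
     * expected to be negative penalties (all three are illustrative).
     */
    static int localScore(String a, String b, int match, int mismatch, int gap) {
        int[][] h = new int[a.length() + 1][b.length() + 1]; // row/column 0 stay 0
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int diagonal = h[i - 1][j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch);
                int up = h[i - 1][j] + gap;    // gap in b
                int left = h[i][j - 1] + gap;  // gap in a
                // The floor at 0 is what makes the alignment local:
                // a bad prefix is dropped instead of dragging the score down.
                int cell = Math.max(0, Math.max(diagonal, Math.max(up, left)));
                h[i][j] = cell;
                best = Math.max(best, cell);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(localScore("TTACGTT", "GGACGGG", 2, -1, -2));
    }
}
```

Needleman-Wunsch differs mainly in that the first row and column are
initialised with cumulative gap penalties, cells are not floored at 0, and the
score is read from the last cell instead of the maximum.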


From caishaojiang at gmail.com  Mon Apr 19 03:16:39 2010
From: caishaojiang at gmail.com (Cai Shaojiang)
Date: Sun, 18 Apr 2010 20:16:39 -0700
Subject: [Biojava-l] [Fwd: Re:  GSoC project on MSA]
In-Reply-To: <4BC84CD5.7030703@uni-tuebingen.de>
References: <4BBC80A8.5000608@uni-tuebingen.de>
	<v2j927e071e1004072144t557b480au27666262c79094e2@mail.gmail.com>
	<4BBDCFD2.3000507@uni-tuebingen.de>
	<y2y927e071e1004080221u778ca151l4e4eab6762b93603@mail.gmail.com>
	<s2o927e071e1004150536j7fc81d8av161035609eeed116@mail.gmail.com> 
	<4BC84CD5.7030703@uni-tuebingen.de>
Message-ID: <s2j927e071e1004182016u1807400eod14fe9cdccee2b21@mail.gmail.com>

Sorry to disturb you again. When I tried to modify my proposal in GSoC,
I got the error "This page is inactive at this time." Does this mean we can
no longer modify the proposal? Could you help me? Thanks.


From andreas at sdsc.edu  Mon Apr 19 03:58:05 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Sun, 18 Apr 2010 20:58:05 -0700
Subject: [Biojava-l] Fwd: Biojava3-genetics
In-Reply-To: <33AFFE3255BCA043AF09514A6F6BFBAED04C3C@scomp0039.wurnet.nl>
References: <4BC806F4.3090302@wur.nl>
	<r2n59a41c431004161039hd93b268eu159de8a6659d969f@mail.gmail.com>
	<33AFFE3255BCA043AF09514A6F6BFBAED04C3C@scomp0039.wurnet.nl>
Message-ID: <i2h59a41c431004182058pf0ee4b80s960a579b9b8c7cbe@mail.gmail.com>

Hi Richard,

I am forwarding your message to the mailing list, since that is the best
place to meet other people interested in genetics applications.

The BioJava source code is available via anonymous svn or the download page
on the wiki.

Andreas

---------- Forwarded message ----------
From: Finkers, Richard <richard.finkers at wur.nl>
Date: Sat, Apr 17, 2010 at 12:46 AM
Subject: RE: Biojava3-genetics
To: Andreas Prlic <andreas at sdsc.edu>


Hi Andreas,

To start with, associations with e.g. sequence variation (454) and phenotype
data within larger sets of genetically different individuals. This will be
code which I will have to write in the coming year for one of my projects. I am
planning to use this in combination with the sequence- and phylogeny-based
BioJava modules.

I also might consider migrating some of my current code to this module. This
includes graphical representations of genetic data but also some statistical
analysis for which we use the package R for the calculations, while the rest of
the data handling / formatting is done in Java.

Some of the functionality that I am thinking about is available from other
packages, but I did not find the (Java) source code.

Richard


-----Original Message-----
From: andreas.prlic at gmail.com on behalf of Andreas Prlic
Sent: Fri 2010-04-16 19:39
To: Finkers, Richard
Cc: biojava-dev at lists.open-bio.org
Subject: Re: Biojava3-genetics

Hi Richard,

any contribution is welcome. What do you have in mind in particular? Perhaps
there is already something there along those lines...

Andreas

On Thu, Apr 15, 2010 at 11:43 PM, Richard Finkers <Richard.Finkers at wur.nl
>wrote:

> Dear List,
>
> I would be interested in adding a module for genetic analysis to the
> biojava3 project. Are there others who are interested in this as well and
> with who should I discuss this further?
>
> Thanks,
> Richard
>
>
> --
> Dr. Richard Finkers
> Researcher Plant Breeding
> Wageningen UR Plant Breeding
> P.O. Box 16, 6700 AA, Wageningen, The Netherlands
> Wageningen Campus, Building 107, Droevendaalsesteeg 1, 6708 PB
> Wageningen, The Netherlands
> Tel. +31-317-484165 Fax +31-317-418094
> http://www.plantbreeding.wur.nl/ <http://www.plantbreeding.wur.nl>
> https://www.eu-sol.wur.nl/ <https://www.eu-sol.wur.nl>
> https://cbsgdbase.wur.nl/ <https://cbsgdbase.wur.nl>
> http://solgenomics.wur.nl/ <http://solgenomics.wur.nl>
> http://www.disclaimer-uk.wur.nl/
>
>


--
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------



From andreas at sdsc.edu  Mon Apr 19 04:14:24 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Sun, 18 Apr 2010 21:14:24 -0700
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava
	alignment lead"]
In-Reply-To: <r2w326ea8621004170634u6f7275f5xfc915d14a7c94489@mail.gmail.com>
References: <4BBD820D.9070200@uni-tuebingen.de>
	<r2w326ea8621004170634u6f7275f5xfc915d14a7c94489@mail.gmail.com>
Message-ID: <u2t59a41c431004182114j34529046h9dc22b4bec3cc51c@mail.gmail.com>

Hi Jitesh,

BioJava is an open source project with the goal of supporting bioinformatics
applications. While we are always happy about any contribution, be it
documentation, bug fixes or email support on the mailing list, for a
research-related project it is probably easier to team up with your local
university and do an internship there.

Andreas



-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From jbdundas at gmail.com  Mon Apr 19 08:33:57 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Mon, 19 Apr 2010 14:03:57 +0530
Subject: [Biojava-l] Fwd: Biojava3-genetics
In-Reply-To: <i2h59a41c431004182058pf0ee4b80s960a579b9b8c7cbe@mail.gmail.com>
References: <4BC806F4.3090302@wur.nl>
	<r2n59a41c431004161039hd93b268eu159de8a6659d969f@mail.gmail.com>
	<33AFFE3255BCA043AF09514A6F6BFBAED04C3C@scomp0039.wurnet.nl>
	<i2h59a41c431004182058pf0ee4b80s960a579b9b8c7cbe@mail.gmail.com>
Message-ID: <o2v326ea8621004190133oc0ae71b3l2f58c9967fd2fcb0@mail.gmail.com>

Dear Sir,

I would like to work on this module.

How can I help?

Regards,
Jitesh Dundas


From andreas.draeger at uni-tuebingen.de  Wed Apr 21 03:17:05 2010
From: andreas.draeger at uni-tuebingen.de (Andreas Dräger)
Date: Wed, 21 Apr 2010 12:17:05 +0900
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on
 "Biojava	alignment lead"]
In-Reply-To: <r2w326ea8621004170634u6f7275f5xfc915d14a7c94489@mail.gmail.com>
References: <4BBD820D.9070200@uni-tuebingen.de>
	<r2w326ea8621004170634u6f7275f5xfc915d14a7c94489@mail.gmail.com>
Message-ID: <4BCE6E31.70504@uni-tuebingen.de>

Hi Jitesh,

Thanks for your interest in contributing to our BioJava project! In the
alignment package, lots of help is required. What would be very nice is
a versatile visual representation of the alignment data structures that
can be included into graphical user interfaces with little effort. To
this end, it should be very flexible and abstract. Would you be interested?

Cheers
Andreas

-- 
Dipl.-Bioinform. Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 Tübingen
Germany

Phone: +49-7071-29-70436
Fax:   +49-7071-29-5091
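
To make the request above a little more concrete, here is a rough sketch of
an alignment result that keeps the score and coordinates as data and renders
a BLAST-like text view only on demand. Everything in it is hypothetical: the
class AlignmentViewSketch, its fields, and the output layout are assumptions
for illustration and do not describe the existing BioJava alignment classes.

```java
final class AlignmentViewSketch {
    final int score;
    final int queryStart;        // 1-based start of the aligned region in the query
    final int subjectStart;      // 1-based start of the aligned region in the subject
    final String alignedQuery;   // aligned region only, '-' marks a gap
    final String alignedSubject; // same length as alignedQuery

    AlignmentViewSketch(int score, int queryStart, int subjectStart,
                        String alignedQuery, String alignedSubject) {
        if (alignedQuery.length() != alignedSubject.length()) {
            throw new IllegalArgumentException("aligned rows must have equal length");
        }
        this.score = score;
        this.queryStart = queryStart;
        this.subjectStart = subjectStart;
        this.alignedQuery = alignedQuery;
        this.alignedSubject = alignedSubject;
    }

    // Counts the non-gap symbols in one aligned row.
    private static int residues(String row) {
        int n = 0;
        for (int i = 0; i < row.length(); i++) {
            if (row.charAt(i) != '-') {
                n++;
            }
        }
        return n;
    }

    int queryEnd() { return queryStart + residues(alignedQuery) - 1; }

    int subjectEnd() { return subjectStart + residues(alignedSubject) - 1; }

    // '|' under identical non-gap columns, a blank everywhere else.
    String matchLine() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < alignedQuery.length(); i++) {
            char q = alignedQuery.charAt(i);
            sb.append(q != '-' && q == alignedSubject.charAt(i) ? '|' : ' ');
        }
        return sb.toString();
    }

    // Text view; a GUI component could draw from the same accessors instead.
    String render() {
        return String.format("Score = %d%n", score)
             + String.format("Query  %5d %s %d%n", queryStart, alignedQuery, queryEnd())
             + String.format("       %5s %s%n", "", matchLine())
             + String.format("Sbjct  %5d %s %d%n", subjectStart, alignedSubject, subjectEnd());
    }

    public static void main(String[] args) {
        System.out.print(new AlignmentViewSketch(7, 3, 10, "AC-GT", "ACTGA").render());
    }
}
```

Storing only the aligned region plus its start coordinates means the two rows
line up without shifting, and a short sequence does not have to be padded with
gap symbols to the length of a long one.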


From mitlox at op.pl  Wed Apr 21 10:46:22 2010
From: mitlox at op.pl (xyz)
Date: Wed, 21 Apr 2010 20:46:22 +1000
Subject: [Biojava-l] Reading and writting Fastq files
In-Reply-To: <5EBA99CE-1DAC-442A-B7FD-2E738C7F586B@eaglegenomics.com>
References: <20100330215047.084f6b00@wp01>
	<Pine.GSO.4.44.1003312334350.18726-100000@shell3.shore.net>
	<20100408213013.63a99b8c@wp01>
	<5EBA99CE-1DAC-442A-B7FD-2E738C7F586B@eaglegenomics.com>
Message-ID: <20100421204622.68f9ac1b@wp01>

On Thu, 8 Apr 2010 12:36:36 +0100
Richard Holland <holland at eaglegenomics.com> wrote:

> You haven't included the two import static lines in your code. See
> first two lines of Michael's example code (expanding the ellipses to
> the full classpath).
> 

Thank you, it was enough to include
import static org.biojavax.bio.seq.RichSequence.Tools.createRichSequence;

Usually NetBeans solves this kind of problem for me, but this time the IDE
was no help.


From mitlox at op.pl  Wed Apr 21 11:18:24 2010
From: mitlox at op.pl (xyz)
Date: Wed, 21 Apr 2010 21:18:24 +1000
Subject: [Biojava-l] readFasta problem
In-Reply-To: <5B252A84-6822-4978-BBD1-83A8BA09EBC5@eaglegenomics.com>
References: <20100408213052.662beb8e@wp01>
	<5B252A84-6822-4978-BBD1-83A8BA09EBC5@eaglegenomics.com>
Message-ID: <20100421211824.75b7ada2@wp01>

On Thu, 8 Apr 2010 12:41:25 +0100
Richard Holland <holland at eaglegenomics.com> wrote:

> You have passed null into the tokenizer parameter of
> RichSequence.IOTools.readFasta() - this is not allowed. The parser
> cannot guess the type of sequence, it must be told what to expect by
> specifying the tokenizer to use. (Importantly this also means that
> you cannot mix different types of sequence within the same file to be
> parsed.)
> 

Thank you. 

Q1:
Does RichSequenceIterator read the complete file into memory, so that I
then retrieve each read from memory? Or does it read the file line by line
and hand me each read as it goes?

Q2:
Why am I not able to retrieve the header from the following fasta file:
>1
atccccc
>2
atccccctttttt
>3
atccccccccccccccccctttt
>4
tttttttccccccccccccccccccccccc
>5
tttttttcccccccccccccccccccccca

with the following code:

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import org.biojava.bio.BioException;
import org.biojava.bio.seq.io.SymbolTokenization;
import org.biojava.bio.symbol.AlphabetManager;
import org.biojavax.bio.seq.RichSequence;
import org.biojavax.bio.seq.RichSequenceIterator;

public class SortFasta {

  public static void main(String[] args) throws FileNotFoundException,
      BioException {

    BufferedReader br = new BufferedReader(new FileReader("sortFasta.fasta"));
    String type = "DNA";
    SymbolTokenization toke = AlphabetManager.alphabetForName(type)
        .getTokenization("token");

    RichSequenceIterator rsi = RichSequence.IOTools.readFasta(br, toke, null);

    while (rsi.hasNext()) {
      RichSequence rs = rsi.nextRichSequence();
      System.out.println(rs.getDescription());
      System.out.println(rs.seqString());
    }
  }
}

What did I do wrong when trying to retrieve the header?


From holland at eaglegenomics.com  Wed Apr 21 11:29:57 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 21 Apr 2010 12:29:57 +0100
Subject: [Biojava-l] readFasta problem
In-Reply-To: <20100421211824.75b7ada2@wp01>
References: <20100408213052.662beb8e@wp01>
	<5B252A84-6822-4978-BBD1-83A8BA09EBC5@eaglegenomics.com>
	<20100421211824.75b7ada2@wp01>
Message-ID: <BBA023B5-4B9B-469A-BE46-A0171DB7B681@eaglegenomics.com>


On 21 Apr 2010, at 12:18, xyz wrote:

> On Thu, 8 Apr 2010 12:41:25 +0100
> Richard Holland <holland at eaglegenomics.com> wrote:
> 
>> You have passed null into the tokenizer parameter of
>> RichSequence.IOTools.readFasta() - this is not allowed. The parser
>> cannot guess the type of sequence, it must be told what to expect by
>> specifying the tokenizer to use. (Importantly this also means that
>> you cannot mix different types of sequence within the same file to be
>> parsed.)
>> 
> 
> Thank you. 
> 
> Q1:
> Does RichSequenceIterator read the complete file in memory and then I
> retrieve each read from memory? Or does it read the file line by line
> and I get each read?


Line by line.

> Q2:
> Why am I not able to retrieve the header from the following fasta file:
> >1
> atccccc
> >2
> atccccctttttt
> >3
> atccccccccccccccccctttt
> >4
> tttttttccccccccccccccccccccccc
> >5
> tttttttcccccccccccccccccccccca
> 
> with the following code:
> 
> import java.io.BufferedReader;
> import java.io.FileNotFoundException;
> import java.io.FileReader;
> import org.biojava.bio.BioException;
> import org.biojava.bio.seq.io.SymbolTokenization;
> import org.biojava.bio.symbol.AlphabetManager;
> import org.biojavax.bio.seq.RichSequence;
> import org.biojavax.bio.seq.RichSequenceIterator;
> 
> public class SortFasta {
> 
>  public static void main(String[] args) throws FileNotFoundException,
>  BioException {
> 
> 
>    BufferedReader br = new BufferedReader(new
>    FileReader("sortFasta.fasta")); String type = "DNA";
>    SymbolTokenization toke = AlphabetManager.alphabetForName(type)
> 					.getTokenization("token");
> 
> 
>    RichSequenceIterator rsi = RichSequence.IOTools.readFasta(br, toke,
>    null);
> 
>    while (rsi.hasNext()) {
>      RichSequence rs = rsi.nextRichSequence();
>      System.out.println(rs.getDescription());
>      System.out.println(rs.seqString());
>    }
>  }
> }
> 
> What did I wrong in order to retrieve the header?


Try the other methods on RichSequence - getName() for instance.

cheers,
Richard

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/
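
Both answers above can be illustrated with a small plain-Java sketch. It is
not BioJava's parser; the class FastaStreamSketch and the convention of
splitting the header at the first space into a name and a description are
assumptions made for this example. It reads one record at a time from the
reader, and it shows why a header such as ">1" yields a name but an empty
description.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Iterates over FASTA records as {name, description, sequence} triples. */
final class FastaStreamSketch implements Iterator<String[]> {
    private final BufferedReader in;
    private String pendingHeader; // header of the next record, without '>'

    FastaStreamSketch(BufferedReader in) {
        this.in = in;
        this.pendingHeader = readUntilHeader(null);
    }

    // Reads lines until the next '>' header or end of input. Sequence lines
    // seen on the way are appended to 'sequence' when it is non-null.
    private String readUntilHeader(StringBuilder sequence) {
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(">")) {
                    return line.substring(1).trim();
                }
                if (sequence != null) {
                    sequence.append(line.trim());
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return pendingHeader != null;
    }

    @Override
    public String[] next() {
        if (pendingHeader == null) {
            throw new NoSuchElementException();
        }
        String header = pendingHeader;
        StringBuilder sequence = new StringBuilder();
        // Only the lines of this one record are consumed here.
        pendingHeader = readUntilHeader(sequence);
        int space = header.indexOf(' ');
        String name = space < 0 ? header : header.substring(0, space);
        String description = space < 0 ? "" : header.substring(space + 1).trim();
        return new String[] { name, description, sequence.toString() };
    }

    public static void main(String[] args) {
        String fasta = ">1\natccccc\n>2 second read\natccccctttttt\n";
        FastaStreamSketch records =
                new FastaStreamSketch(new BufferedReader(new StringReader(fasta)));
        while (records.hasNext()) {
            String[] r = records.next();
            System.out.println(r[0] + " | " + r[1] + " | " + r[2]);
        }
    }
}
```

With the file from the question, every header consists of a single token, so
under this convention there is simply no description text to return.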


From mitlox at op.pl  Wed Apr 21 12:40:48 2010
From: mitlox at op.pl (xyz)
Date: Wed, 21 Apr 2010 22:40:48 +1000
Subject: [Biojava-l] NCBI Accession Number prefixes
Message-ID: <20100421224048.1848c2f2@wp01>

Hello,
is it possible to download GenBank entries (AC) with BioJava?

Thank you in advance.

Best regards,


From holland at eaglegenomics.com  Wed Apr 21 12:44:16 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 21 Apr 2010 13:44:16 +0100
Subject: [Biojava-l] NCBI Accession Number prefixes
In-Reply-To: <20100421224048.1848c2f2@wp01>
References: <20100421224048.1848c2f2@wp01>
Message-ID: <577294DB-EABD-48DF-A55A-5DA9629AC352@eaglegenomics.com>

See http://www.biojava.org/docs/api/org/biojavax/bio/db/ncbi/GenbankRichSequenceDB.html

On 21 Apr 2010, at 13:40, xyz wrote:

> Hello,
> is it possible to download GenBank entries (AC) with BioJava?
> 
> Thank you in advance.
> 
> Best regards,
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From jbdundas at gmail.com  Wed Apr 21 13:45:00 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Wed, 21 Apr 2010 19:15:00 +0530
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava
	alignment lead"]
In-Reply-To: <4BCE6E31.70504@uni-tuebingen.de>
References: <4BBD820D.9070200@uni-tuebingen.de>
	<r2w326ea8621004170634u6f7275f5xfc915d14a7c94489@mail.gmail.com>
	<4BCE6E31.70504@uni-tuebingen.de>
Message-ID: <i2y326ea8621004210645o206b44afr22a0617651cc4e21@mail.gmail.com>

Yes Sir, I will be very interested. Please send me the details. I will be
working on Weekends though as office work is taking my time right now.

Regards,
jd

On Wed, Apr 21, 2010 at 8:47 AM, Andreas Dräger <
andreas.draeger at uni-tuebingen.de> wrote:

> Hi Jitesh,
>
> Thanks for your interest in contributing to our BioJava project! In the
> alignment package, lots of help is required. What would be very nice is a
> versatile visual representation of the alignment data structures that can
> be included into graphical user interfaces with little effort. To this end,
> it should be very flexible and abstract. Would you be interested?
>
>
> Cheers
> Andreas
>
> --
> Dipl.-Bioinform. Andreas Dräger
> Eberhard Karls University Tübingen
> Center for Bioinformatics (ZBIT)
> Sand 1
> 72076 Tübingen
> Germany
>
> Phone: +49-7071-29-70436
> Fax:   +49-7071-29-5091
>


From er.indupandey at gmail.com  Fri Apr 23 08:11:05 2010
From: er.indupandey at gmail.com (indu pandey)
Date: Fri, 23 Apr 2010 01:11:05 -0700
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava
	alignment lead"]
In-Reply-To: <i2y326ea8621004210645o206b44afr22a0617651cc4e21@mail.gmail.com>
References: <4BBD820D.9070200@uni-tuebingen.de>
	<r2w326ea8621004170634u6f7275f5xfc915d14a7c94489@mail.gmail.com>
	<4BCE6E31.70504@uni-tuebingen.de>
	<i2y326ea8621004210645o206b44afr22a0617651cc4e21@mail.gmail.com>
Message-ID: <k2r8ea551a31004230111r1fd6fc0fq70c44b1e97976025@mail.gmail.com>

Hi all,
can anybody help me write BioJava code for converting a DNA
sequence to the corresponding amino acid sequence?

Regards,
Indu


From genjasp at gmail.com  Fri Apr 23 08:26:10 2010
From: genjasp at gmail.com (Alessandro Cipriani)
Date: Fri, 23 Apr 2010 10:26:10 +0200
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava
	alignment lead"]
In-Reply-To: <k2r8ea551a31004230111r1fd6fc0fq70c44b1e97976025@mail.gmail.com>
References: <4BBD820D.9070200@uni-tuebingen.de>
	<r2w326ea8621004170634u6f7275f5xfc915d14a7c94489@mail.gmail.com>
	<4BCE6E31.70504@uni-tuebingen.de>
	<i2y326ea8621004210645o206b44afr22a0617651cc4e21@mail.gmail.com>
	<k2r8ea551a31004230111r1fd6fc0fq70c44b1e97976025@mail.gmail.com>
Message-ID: <k2j46b9a2151004230126i287b838frcd874f8ce9bab47d@mail.gmail.com>

Hi,
Follow this link: http://www.biojava.org/wiki/BioJava:CookBook#Translation
I think it could be useful.

regards
ale


2010/4/23 indu pandey <er.indupandey at gmail.com>

> hi all
>  can any body help me in creating code in biojava for converting dna
> sequence to corresponding amino acid sequence
>
> regards
>  indu
>
> On 4/21/10, jitesh dundas <jbdundas at gmail.com> wrote:
> >
> > Yes Sir, I will be very interested. Please send me the details. I will be
> > working on Weekends though as office work is taking my time right now.
> >
> > Regards,
> > jd
> >
> > On Wed, Apr 21, 2010 at 8:47 AM, Andreas Dr?ger <
> > andreas.draeger at uni-tuebingen.de> wrote:
> >
> > > Hi Jitesh,
> > >
> > > Thanks for your interest in contributing to our BioJava project! In the
> > > alignment package, lots of help is required. What would be very nice is
> > > a versatile visual representation of the alignment data structures that
> > > can be included into graphical user interfaces with little effort. To
> > > this end, it should be very flexible and abstract. Would you be
> > > interested?
> > >
> > >
> > > Cheers
> > > Andreas
> > >
> > > --
> > > Dipl.-Bioinform. Andreas Dräger
> > > Eberhard Karls University Tübingen
> > > Center for Bioinformatics (ZBIT)
> > > Sand 1
> > > 72076 Tübingen
> > > Germany
> > >
> > > Phone: +49-7071-29-70436
> > > Fax:   +49-7071-29-5091
> > >
> >
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


-- 
Alessandro Cipriani
(+39) 3206009509
http://www.cipriania.it
skype:genjasp at gmail.com
msn:jaspzz


From thomascramera at dnastar.com  Fri Apr 23 22:58:05 2010
From: thomascramera at dnastar.com (Andy Thomas-Cramer)
Date: Fri, 23 Apr 2010 17:58:05 -0500
Subject: [Biojava-l] PDBFileParser and Atom element symbol
Message-ID: <A4009967D1886D4286A9B7931FD58610021BEC1B@FS1.dnastar.com>


Is there an easy way to identify the type of atom referenced by an Atom
object? 

For example, if Atom.getName() is "CA", is the element calcium or the
atom carbon alpha?

If not, would it be feasible to add a method providing this in Atom,
AtomImpl, and parsing it in PDBFileParser, using the columns defined at
http://www.wwpdb.org/documentation/format32/sect9.html#ATOM?


From andreas at sdsc.edu  Fri Apr 23 23:52:15 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Fri, 23 Apr 2010 16:52:15 -0700
Subject: [Biojava-l] PDBFileParser and Atom element symbol
In-Reply-To: <A4009967D1886D4286A9B7931FD58610021BEC1B@FS1.dnastar.com>
References: <A4009967D1886D4286A9B7931FD58610021BEC1B@FS1.dnastar.com>
Message-ID: <n2j59a41c431004231652sa0729050z9df823f6dbf0dfd8@mail.gmail.com>

Hi Andy,

You could check Atom.getFullname(), which preserves the space
characters from the PDB file,
e.g. C-alpha: " CA ", calcium: "CA  "

In addition, the parent group of a C-alpha atom is usually an AminoAcid,
and for calcium it is a Hetatom group.

Andreas
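
As a rough sketch of the padding rule above (plain Java, no BioJava calls): in the PDB format the element symbol is right-justified in the first two characters of the four-character atom name. The class and method names below are made up for this illustration, and the rule is only a heuristic: hydrogen names that fill all four characters (e.g. "HD11") are not handled.

```java
/** Hypothetical helper (not part of BioJava): guesses the element from a padded PDB atom name. */
final class AtomNameGuess {
    /**
     * Expects the four-character atom name with its padding intact,
     * e.g. " CA " (C-alpha) or "CA  " (calcium).
     * Heuristic only: four-character hydrogen names such as "HD11" would be misread.
     */
    static String guessElement(String fullName) {
        if (fullName == null || fullName.length() != 4) {
            throw new IllegalArgumentException("expected a 4-character padded atom name");
        }
        // The element symbol is right-justified in the first two characters.
        return fullName.substring(0, 2).trim();
    }

    public static void main(String[] args) {
        System.out.println("[" + guessElement(" CA ") + "]");
        System.out.println("[" + guessElement("CA  ") + "]");
    }
}
```

Checking the parent group type as well, as described above, makes the guess more robust.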

On Fri, Apr 23, 2010 at 3:58 PM, Andy Thomas-Cramer <
thomascramera at dnastar.com> wrote:

>
>
> Is there an easy way to identify the type of atom referenced by an Atom
> object?
>
> For example, if Atom.getName() is "CA", is the element calcium or the
> atom carbon alpha?
>
> If not, would it be feasible to add a method providing this in Atom,
> AtomImpl, and parsing it in PDBFileParser, using the columns defined at
> http://www.wwpdb.org/documentation/format32/sect9.html#ATOM?
>
>
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From mitlox at op.pl  Sun Apr 25 05:19:25 2010
From: mitlox at op.pl (xyz)
Date: Sun, 25 Apr 2010 15:19:25 +1000
Subject: [Biojava-l] readFasta problem
In-Reply-To: <BBA023B5-4B9B-469A-BE46-A0171DB7B681@eaglegenomics.com>
References: <20100408213052.662beb8e@wp01>
	<5B252A84-6822-4978-BBD1-83A8BA09EBC5@eaglegenomics.com>
	<20100421211824.75b7ada2@wp01>
	<BBA023B5-4B9B-469A-BE46-A0171DB7B681@eaglegenomics.com>
Message-ID: <20100425151925.1c5c9a03@wp01>

On Wed, 21 Apr 2010 12:29:57 +0100
Richard Holland wrote:

> > Q1:
> > Does RichSequenceIterator read the complete file in memory and then
> > I retrieve each read from memory? Or does it read the file line by
> > line and I get each read?
> 
> 
> Line by line.

That saves memory.

> > Q2:
> > Why am I not able to retrieve the header from the following fasta
> > file:
> >> 1
> > atccccc
> >> 2
> > atccccctttttt
> >> 3
> > atccccccccccccccccctttt
> >> 4
> > tttttttccccccccccccccccccccccc
> >> 5
> > tttttttcccccccccccccccccccccca
> 
> Try the other methods on RichSequence - getName() for instance.

Thank you, getName() works.

I have tried to write a FASTA file line by line with IOTools, but I
got the following error:
Exception in thread "main" java.lang.RuntimeException: Uncompilable
source code 1
        at SortFasta.main(SortFasta.java:31)
atccccc
Java Result: 1

Here is the complete code:

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileReader;
import org.biojava.bio.BioException;
import org.biojava.bio.seq.io.SymbolTokenization;
import org.biojava.bio.symbol.AlphabetManager;
import org.biojavax.bio.seq.RichSequence;
import org.biojavax.bio.seq.RichSequenceIterator;

public class SortFasta {

  public static void main(String[] args) throws FileNotFoundException,
      BioException {

    BufferedReader br = new BufferedReader(
        new FileReader("sortFasta.fasta"));
    String type = "DNA";
    SymbolTokenization toke = AlphabetManager.alphabetForName(type)
        .getTokenization("token");

    FileOutputStream outputFasta = new FileOutputStream("test.fasta");

    RichSequenceIterator rsi = RichSequence.IOTools.readFasta(br, toke,
        null);

    while (rsi.hasNext()) {
      RichSequence rs = rsi.nextRichSequence();
      System.out.println(rs.getName());
      System.out.println(rs.seqString());

      RichSequence.IOTools.writeFasta(outputFasta, rs.seqString(), null,
              rs.getName() + "1");
    }
  }
}

How is it possible to write fasta files line by line?


From holland at eaglegenomics.com  Sun Apr 25 08:21:22 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Sun, 25 Apr 2010 09:21:22 +0100
Subject: [Biojava-l] readFasta problem
In-Reply-To: <20100425151925.1c5c9a03@wp01>
References: <20100408213052.662beb8e@wp01>
	<5B252A84-6822-4978-BBD1-83A8BA09EBC5@eaglegenomics.com>
	<20100421211824.75b7ada2@wp01>
	<BBA023B5-4B9B-469A-BE46-A0171DB7B681@eaglegenomics.com>
	<20100425151925.1c5c9a03@wp01>
Message-ID: <316097DC-6011-4205-83BC-9A24398D034D@eaglegenomics.com>

Hi.

You are calling a version of writeFasta that does not exist. I'm surprised your code even compiles!

Have a look at the JavaDocs to find out what you can actually do with writeFasta. For a start, it takes Sequence and FastaHeader objects as parameters, not the Strings you are passing.

http://www.biojava.org/docs/api17/org/biojavax/bio/seq/RichSequence.IOTools.html
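
To illustrate the general record-at-a-time idea without relying on any particular writeFasta signature, here is a plain-Java sketch that uses only java.io. The class FastaStreamCopy and its methods are made up for this example and are not BioJava API. It reads one FASTA record, writes it out, and only then moves on to the next, so a single record is held in memory at a time.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

/** Hypothetical helper (not BioJava): copies FASTA records one at a time. */
final class FastaStreamCopy {
    /** Writes each record as soon as it is complete; returns the number of records written. */
    static int copy(BufferedReader in, Writer out, String nameSuffix) throws IOException {
        int count = 0;
        String name = null;
        StringBuilder seq = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith(">")) {
                if (name != null) {          // previous record is complete, flush it
                    writeRecord(out, name + nameSuffix, seq);
                    count++;
                }
                name = line.substring(1).trim();
                seq.setLength(0);
            } else {
                seq.append(line);            // sequence may span several lines
            }
        }
        if (name != null) {                  // last record in the file
            writeRecord(out, name + nameSuffix, seq);
            count++;
        }
        out.flush();
        return count;
    }

    private static void writeRecord(Writer out, String name, CharSequence seq) throws IOException {
        out.write(">" + name + "\n");
        out.write(seq + "\n");
    }

    /** Convenience wrapper for in-memory text, mainly for trying the method out. */
    static String copyText(String fasta, String nameSuffix) {
        StringWriter out = new StringWriter();
        try {
            copy(new BufferedReader(new StringReader(fasta)), out, nameSuffix);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

With BioJava itself, the equivalent would be to pass each RichSequence from the iterator to one of the writeFasta overloads listed in the JavaDocs above.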

cheers,
Richard

On 25 Apr 2010, at 06:19, xyz wrote:

> On Wed, 21 Apr 2010 12:29:57 +0100
> Richard Holland wrote:
> 
>>> Q1:
>>> Does RichSequenceIterator read the complete file in memory and then
>>> I retrieve each read from memory? Or does it read the file line by
>>> line and I get each read?
>> 
>> 
>> Line by line.
> 
> That save memory.
> 
>>> Q2:
>>> Why am I not able to retrieve the header from the following fasta
>>> file:
>>>> 1
>>> atccccc
>>>> 2
>>> atccccctttttt
>>>> 3
>>> atccccccccccccccccctttt
>>>> 4
>>> tttttttccccccccccccccccccccccc
>>>> 5
>>> tttttttcccccccccccccccccccccca
>> 
>> Try the other methods on RichSequence - getName() for instance.
> 
> Thank you getName() works.
> 
> I have tried to write fasta file line by line with IOTools, but I have
> got the following error:
> Exception in thread "main" java.lang.RuntimeException: Uncompilable
> source code 1
>        at SortFasta.main(SortFasta.java:31)
> atccccc
> Java Result: 1
> 
> Here is the complete code:
> 
> import java.io.BufferedReader;
> import java.io.FileNotFoundException;
> import java.io.FileOutputStream;
> import java.io.FileReader;
> import org.biojava.bio.BioException;
> import org.biojava.bio.seq.io.SymbolTokenization;
> import org.biojava.bio.symbol.AlphabetManager;
> import org.biojavax.bio.seq.RichSequence;
> import org.biojavax.bio.seq.RichSequenceIterator;
> 
> public class SortFasta {
> 
>  public static void main(String[] args) throws FileNotFoundException,
>  BioException {
> 
> 
>    BufferedReader br = new BufferedReader(new
>    FileReader("sortFasta.fasta")); String type = "DNA";
>    SymbolTokenization toke = AlphabetManager.alphabetForName(type)
> 					.getTokenization("token");
> 
>    FileOutputStream outputFasta = new FileOutputStream("test.fasta");
> 
>    RichSequenceIterator rsi = RichSequence.IOTools.readFasta(br, toke,
>    null);
> 
>    while (rsi.hasNext()) {
>      RichSequence rs = rsi.nextRichSequence();
>      System.out.println(rs.getName());
>      System.out.println(rs.seqString());
> 
>      RichSequence.IOTools.writeFasta(outputFasta, rs.seqString(), null,
>              rs.getName() + "1");
>    }
>  }
> }
> 
> How is it possible to write fasta files line by line?

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From andreas.draeger at uni-tuebingen.de  Mon Apr 26 01:04:44 2010
From: andreas.draeger at uni-tuebingen.de (=?ISO-8859-1?Q?Andreas_Dr=E4ger?=)
Date: Mon, 26 Apr 2010 10:04:44 +0900
Subject: [Biojava-l] [Fwd: Re: Fwd: Proposing a project on "Biojava
 alignment lead"]
In-Reply-To: <k2r8ea551a31004230111r1fd6fc0fq70c44b1e97976025@mail.gmail.com>
References: <4BBD820D.9070200@uni-tuebingen.de>	
	<r2w326ea8621004170634u6f7275f5xfc915d14a7c94489@mail.gmail.com>	
	<4BCE6E31.70504@uni-tuebingen.de>	
	<i2y326ea8621004210645o206b44afr22a0617651cc4e21@mail.gmail.com>
	<k2r8ea551a31004230111r1fd6fc0fq70c44b1e97976025@mail.gmail.com>
Message-ID: <4BD4E6AC.8030901@uni-tuebingen.de>

Dear Indu,

If you have a question regarding BioJava, please do not simply reply to
an earlier e-mail. Your question now appears in the e-mail thread about
the BioJava alignment lead, although it concerns working with and
manipulating symbols. It would therefore be better to open a new thread.
Sorry for pointing this out, but it is necessary to keep an overview of
all the e-mails.

Best wishes
Andreas

-- 
Dipl.-Bioinform. Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 Tübingen
Germany

Phone: +49-7071-29-70436
Fax:   +49-7071-29-5091


From asidhu at biomap.org  Mon Apr 26 06:27:30 2010
From: asidhu at biomap.org (Amandeep Sidhu)
Date: Mon, 26 Apr 2010 14:27:30 +0800
Subject: [Biojava-l] CFP: 23rd IEEE International Symposium on
	Computer-Based Medical Systems 2010
Message-ID: <B41481C1-F539-4CDC-8373-A58A0DA14FA7@biomap.org>

IEEE CBMS 2010
23rd IEEE International Symposium on Computer-Based Medical Systems 2010
Perth, Australia, 12-15 October 2010

http://www.cbms2010.curtin.edu.au/

The 23rd IEEE International Symposium on Computer-Based Medical Systems (CBMS 2010) is intended to provide an international forum for discussing the latest results in the field of computational medicine. The scientific program of CBMS 2010 will consist of invited keynote talks given by leading scientists in the field, and regular and special track sessions that cover a broad array of issues which relate computing to medicine.

RELEVANT TOPICS

Network and Telemedicine Systems
Medical Databases & Information Systems
Computer-Aided Diagnosis
Medical Devices with Embedded Computers
Bioinformatics in Medicine
Software Systems in Medicine
Pervasive Health Systems and Services
Web-based Delivery of Medical Information
Medical Image Segmentation & Compression
Content Analysis of Biomedical Image Data
Knowledge-Based & Decision Support Systems
Hand-held Computing Applications in Medicine
Knowledge Discovery & Data Mining
Signal and Image Processing in Medicine
Multimedia Biomedical Databases

CBMS 2010 invites original previously unpublished contributions that are not submitted concurrently to a journal or another conference. Many of the above listed topics are represented by corresponding Special Tracks, while others are solely covered by the general CBMS track. Prospective authors are expected to submit their contributions to one of the corresponding Special Tracks or to the general track if none of the special tracks is relevant.

SPECIAL TRACKS

ST1: Computational Proteomics and Genomics
ST2: Knowledge Discovery and Decision Systems in Biomedicine
ST3: Ontologies for Biomedical Systems
ST4: HealthGrid & Cloud Computing
ST5: Technology Enhanced Learning in Medical Education
ST6: Intelligent Patient Management
ST7: Data Streams in Healthcare
ST8: Supporting Collaboration among Healthcare Workers
ST9: Telemedicine
ST10: Computer-Based Systems for Mental Health
ST11: Image Informatics in Biomedical Research and Clinical Medicine
ST12: e-Health

SUBMISSION GUIDELINES

Papers should be submitted electronically using the EasyChair online submission system. Papers must follow the IEEE two-column format and must not exceed 6 (six) Letter-sized pages. LaTeX or Microsoft Word templates can be used when preparing the papers. Please note that submissions are accepted in PDF format only.

Submission web site: http://www.easychair.org/conferences/?conf=cbms2010

All submissions will be peer-reviewed by at least three reviewers. The proceedings will be published by the IEEE Computer Society Press. At least one author of each accepted paper is required to register and present the work at the conference; otherwise the paper will be removed from the digital library after the conference.

IMPORTANT DATES

Submission deadline for regular papers:                 24 June 2010
Deadline for tutorial submission:                       24 June 2010
Notification of acceptance for papers and tutorials:     2 Aug 2010
Final camera ready due:                                  2 Sep 2010
Author registration:                                     2 Sep 2010

INTENDED AUDIENCE

Engineers, scientists, clinicians and managers involved in medical computing projects are encouraged to submit papers to the symposium and/or attend the symposium. The symposium provides its attendees with an opportunity to experience state-of-the-art research and development in a variety of topics directly and indirectly related to their own work. In addition to research papers, keynote speakers and tutorial sessions it provides participants with an opportunity to come up-to-date on important technological issues. The symposium encourages the participation of students engaged in research/development in computer-based medical systems.

Organizing Committee

GENERAL CHAIRS

Tharam Dillon, Curtin University of Technology, Australia
Daniel Rubin, National Center for Biomedical Ontologies, USA
William Gallagher, University College Dublin, Ireland

PROGRAM CHAIRS

Amandeep Sidhu, Curtin University of Technology, Australia
Alexey Tsymbal, Siemens, Germany

PUBLICATION CHAIRS

Mykola Pechenizkiy, Eindhoven University of Technology, Netherlands
Tony Hu, Drexel University, USA

SPECIAL TRACK CHAIRS

Maja Hadzic, Curtin University of Technology, Australia
Jake Chen, Indiana University, USA

TUTORIAL CHAIRS

Phoebe Chen, La Trobe University, Australia
Xiaofang Zhou, University of Queensland, Australia

PUBLICITY CHAIRS

Carolyn McGregor, University of Ontario Institute of Technology, Canada
Meifania Chen, Curtin University of Technology, Australia


From thomascramera at dnastar.com  Mon Apr 26 14:51:23 2010
From: thomascramera at dnastar.com (Andy Thomas-Cramer)
Date: Mon, 26 Apr 2010 09:51:23 -0500
Subject: [Biojava-l] PDBFileParser and Atom element symbol
In-Reply-To: <n2j59a41c431004231652sa0729050z9df823f6dbf0dfd8@mail.gmail.com>
References: <A4009967D1886D4286A9B7931FD58610021BEC1B@FS1.dnastar.com>
	<n2j59a41c431004231652sa0729050z9df823f6dbf0dfd8@mail.gmail.com>
Message-ID: <A4009967D1886D4286A9B7931FD58610021BEC82@FS1.dnastar.com>

 
Thank you. I had not noticed the pattern that columns 13-14 at least
sometimes contain the element symbol, whether one- or two-character.

 
Questions:

* Is this pattern documented in the PDB specification? 

* If this pattern can be relied on, why are columns 77-78 also dedicated
to the element symbol?

* Should reliance on the pattern be hidden behind a BioJava method?

 
________________________________

From: andreas.prlic at gmail.com [mailto:andreas.prlic at gmail.com] On Behalf
Of Andreas Prlic
Sent: Friday, April 23, 2010 6:52 PM
To: Andy Thomas-Cramer
Cc: biojava-l at lists.open-bio.org
Subject: Re: [Biojava-l] PDBFileParser and Atom element symbol

 
Hi Andy,

you could check with  Atom.getFullname(), which contains the space
characters from the PDB file:
e.g Calpha: " CA ", Calcium "CA  " 

in addition the parent group of a Calpha atom is usually an AminoAcid
and for Calciums it is a Hetatom group...

Andreas

On Fri, Apr 23, 2010 at 3:58 PM, Andy Thomas-Cramer
<thomascramera at dnastar.com> wrote:


Is there an easy way to identify the type of atom referenced by an Atom
object?

For example, if Atom.getName() is "CA", is the element calcium or the
atom carbon alpha?

If not, would it be feasible to add a method providing this in Atom,
AtomImpl, and parsing it in PDBFileParser, using the columns defined at
http://www.wwpdb.org/documentation/format32/sect9.html#ATOM?


_______________________________________________
Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From andreas at sdsc.edu  Tue Apr 27 01:07:53 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Mon, 26 Apr 2010 18:07:53 -0700
Subject: [Biojava-l] PDBFileParser and Atom element symbol
In-Reply-To: <A4009967D1886D4286A9B7931FD58610021BEC82@FS1.dnastar.com>
References: <A4009967D1886D4286A9B7931FD58610021BEC1B@FS1.dnastar.com>
	<n2j59a41c431004231652sa0729050z9df823f6dbf0dfd8@mail.gmail.com>
	<A4009967D1886D4286A9B7931FD58610021BEC82@FS1.dnastar.com>
Message-ID: <m2v59a41c431004261807h4440eac5t31871134c0a5d02f@mail.gmail.com>

Hi Andy

> Questions:
>
> * Is this pattern documented in the PDB specification?

See here:
http://www.wwpdb.org/documentation/format23/sect9.html#ATOM

> * If this pattern can be relied on, why are columns 77-78 also dedicated to
> the element symbol?

That is the atom's element symbol (as given in the periodic table), in
contrast to the atom name field, which also contains numbering information.

> * Should reliance on the pattern be hidden behind a BioJava method?

If you think that is important, we could probably provide an enum for all
atom types. There are two categories, though: the periodic table symbol and
the name that relates to the position in an amino acid.

Andreas
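
For illustration, both fields can be read straight from a fixed-width ATOM or HETATM record: the atom name sits in columns 13-16 and the element symbol is right-justified in columns 77-78. The following is only a sketch in plain Java; the class PdbAtomColumns is made up for this example and does not show how PDBFileParser is implemented.

```java
/** Hypothetical helper (not part of BioJava): reads fixed columns of a PDB ATOM/HETATM record. */
final class PdbAtomColumns {
    /** Atom name, columns 13-16 (1-based), padding preserved. */
    static String atomName(String record) {
        return field(record, 13, 16);
    }

    /** Element symbol, columns 77-78, trimmed; empty if the record is too short. */
    static String element(String record) {
        return field(record, 77, 78).trim();
    }

    // Extracts 1-based, inclusive columns, tolerating records shorter than 80 characters.
    private static String field(String record, int from, int to) {
        if (record.length() < from) {
            return "";
        }
        return record.substring(from - 1, Math.min(to, record.length()));
    }

    /** Builds an 80-column example record for a C-alpha atom, field by field. */
    static String sampleCalphaRecord() {
        return "ATOM  " + "    2" + " " + " CA " + " " + "ALA" + " " + "A" + "   1" + " " + "   "
                + "  11.104" + "   6.134" + "  -6.504" + "  1.00" + "  0.00"
                + String.format("%10s", "") + " C" + "  ";
    }

    public static void main(String[] args) {
        String record = sampleCalphaRecord();
        System.out.println("name=[" + atomName(record) + "] element=[" + element(record) + "]");
    }
}
```

Older files may leave columns 77-78 blank, in which case element() returns an empty string and the atom name remains the only hint.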


>
>
>
>  ------------------------------
>
> *From:* andreas.prlic at gmail.com [mailto:andreas.prlic at gmail.com] *On
> Behalf Of *Andreas Prlic
> *Sent:* Friday, April 23, 2010 6:52 PM
> *To:* Andy Thomas-Cramer
> *Cc:* biojava-l at lists.open-bio.org
> *Subject:* Re: [Biojava-l] PDBFileParser and Atom element symbol
>
>
>
> Hi Andy,
>
> you could check with  Atom.getFullname(), which contains the space
> characters from the PDB file:
> e.g Calpha: " CA ", Calcium "CA  "
>
> in addition the parent group of a Calpha atom is usually an AminoAcid and
> for Calciums it is a Hetatom group...
>
> Andreas
>
> On Fri, Apr 23, 2010 at 3:58 PM, Andy Thomas-Cramer <
> thomascramera at dnastar.com> wrote:
>
>
>
> Is there an easy way to identify the type of atom referenced by an Atom
> object?
>
> For example, if Atom.getName() is "CA", is the element calcium or the
> atom carbon alpha?
>
> If not, would it be feasible to add a method providing this in Atom,
> AtomImpl, and parsing it in PDBFileParser, using the columns defined at
> http://www.wwpdb.org/documentation/format32/sect9.html#ATOM?
>
>
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
>
>
>
> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From rmb32 at cornell.edu  Mon Apr 26 22:02:11 2010
From: rmb32 at cornell.edu (Robert Buels)
Date: Mon, 26 Apr 2010 15:02:11 -0700
Subject: [Biojava-l] Google Summer of Code - accepted students
Message-ID: <4BD60D63.1040400@cornell.edu>

Hi all,

I'm pleased to announce the acceptance of OBF's 2010 Google Summer of 
Code students, listed in alphabetical order with their project titles 
and primary mentors:

Mark Chapman (PM Andreas Prlic) - Improvements to BioJava including 
Implementation of Multiple Sequence Alignment Algorithms

Jianjiong Gao (PM Peter Rose) - BioJava Packages for Identification, 
Classification, and Visualization of Posttranslational Modification of 
Proteins

Kazuhiro Hayashi (PM Naohisa Goto) - Ruby 1.9.2 support of BioRuby

Sara Rayburn (PM Christian Zmasek) - Implementing Speciation & 
Duplication Inference Algorithm for Binary and Non-binary Species Tree

Joao Pedro Garcia Lopes Maia Rodrigues (PM Eric Talevich) - Extending 
Bio.PDB: broadening the usefulness of BioPython's Structural Biology module

Jun Yin (PM Chris Fields) - BioPerl Alignment Subsystem Refactoring

Congratulations to our accepted students!

All told, we had 52 applications submitted for the 6 slots (5 originally 
assigned, plus 1 extra) allotted to us by Google.  Proposals were 
extremely competitive: 6 out of 52 translates to an 11.5% acceptance 
rate.  We received a lot of really excellent proposals, and the decisions 
were not easy.

Thanks very much to all the students who applied; we very much 
appreciate your hard work.

Here's to a great 2010 Summer of Code. I'm sure these students will do 
some wonderful work.

Rob Buels
OBF GSoC 2010 Administrator


From andreas at sdsc.edu  Tue Apr 27 05:33:51 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Mon, 26 Apr 2010 22:33:51 -0700
Subject: [Biojava-l] accepted GSoC projects
Message-ID: <u2w59a41c431004262233xe5553c17je23c2b42a3aae81d@mail.gmail.com>

Dear all,

Google has released the results for GSoC: Congratulations to Mark Chapman
and Jianjiong Gao for having been accepted to work on the MSA and PTM
projects for BioJava! Let's start the "community bonding" process (
http://en.flossmanuals.net/GSoCMentoring/MindtheGap ) and we are all
looking forward to working with you on this during the summer. The mentors
and co-mentors will be Peter Rose for the PTM project and Scooter Willis and
Kyle Ellrott for the MSA project (and me).

I want to thank all of you who submitted proposals or showed interest in
the Google Summer of Code in other ways. We hope you are not too
disappointed if your application was not accepted this time. We had a
large number (52) of applications and the overall quality of the
submissions was very high. We would like to stay in touch with you and we
hope that you remain interested in BioJava beyond the scope of GSoC. There
are a number of ways to contribute: we are always looking for
people who provide code and patches to further improve our library, help out
with the documentation on the Wiki page, or answer questions on the mailing
lists.

Let's all give Mark and Jianjiong a warm welcome to the BioJava community.
For those of you who are interested in following the progress of the
projects: as usual, the development-related discussions are going to be on
the biojava-dev list.

Happy coding!

Andreas


From rmb32 at cornell.edu  Tue Apr 27 05:52:57 2010
From: rmb32 at cornell.edu (Robert Buels)
Date: Mon, 26 Apr 2010 22:52:57 -0700
Subject: [Biojava-l] Google Summer of Code - accepted students
Message-ID: <4BD67BB9.3000804@cornell.edu>

Hi all,

I'm pleased to announce the acceptance of OBF's 2010 Google Summer of
Code students, listed in alphabetical order with their project titles
and primary mentors:

Mark Chapman (PM Andreas Prlic) - Improvements to BioJava including
Implementation of Multiple Sequence Alignment Algorithms

Jianjiong Gao (PM Peter Rose) - BioJava Packages for Identification,
Classification, and Visualization of Posttranslational Modification of
Proteins

Kazuhiro Hayashi (PM Naohisa Goto) - Ruby 1.9.2 support of BioRuby

Sara Rayburn (PM Christian Zmasek) - Implementing Speciation &
Duplication Inference Algorithm for Binary and Non-binary Species Tree

Joao Pedro Garcia Lopes Maia Rodrigues (PM Eric Talevich) - Extending
Bio.PDB: broadening the usefulness of BioPython's Structural Biology module

Jun Yin (PM Chris Fields) - BioPerl Alignment Subsystem Refactoring

Congratulations to our accepted students!

All told, we had 52 applications submitted for the 6 slots (5 originally
assigned, plus 1 extra) allotted to us by Google.  Proposals were
extremely competitive: 6 out of 52 translates to an 11.5% acceptance
rate.  We received a lot of really excellent proposals, and the decisions
were not easy.

Thanks very much to all the students who applied; we very much
appreciate your hard work.

Here's to a great 2010 Summer of Code. I'm sure these students will do
some wonderful work.

Rob Buels
OBF GSoC 2010 Administrator


From jianjiong.gao at gmail.com  Tue Apr 27 19:13:12 2010
From: jianjiong.gao at gmail.com (Jianjiong Gao)
Date: Tue, 27 Apr 2010 14:13:12 -0500
Subject: [Biojava-l] [Biojava-dev] accepted GSoC projects
In-Reply-To: <u2w59a41c431004262233xe5553c17je23c2b42a3aae81d@mail.gmail.com>
References: <u2w59a41c431004262233xe5553c17je23c2b42a3aae81d@mail.gmail.com>
Message-ID: <h2kc82264f51004271213u1ea78e1bq29184a65b6315cbe@mail.gmail.com>

Dear Dr. Prlic and Everyone,

Thanks for the warm welcome. I am so glad that I have the chance to
work with the BioJava community this summer. I would like to briefly
introduce myself. My name is Jianjiong (JJ) Gao. I am a PhD student in
Computer Science at University of Missouri, Columbia. My study is
focusing on Bioinformatics, specifically computational proteomics and
PTMs.

I came across BioJava about two years ago when I was working on a
plugin for Cytoscape, and was attracted by the idea of providing
generic Java API for bioinformatics applications. I was thinking maybe
someday I could do some coding for BioJava. And now I got the chance
:)

Best Regards,
-JJ

On Tue, Apr 27, 2010 at 12:33 AM, Andreas Prlic <andreas at sdsc.edu> wrote:
> Dear all,
>
> Google has released the results for GSoC: Congratulations to Mark Chapman
> and Jianjiong Gao for having been accepted to work on the MSA and PTM
> projects for BioJava! Let's start the "community bonding" process (
> http://en.flossmanuals.net/GSoCMentoring/MindtheGap ) and we all are
> looking forward to work with you on this during the summer. The Mentors and
> co-mentors will be Peter Rose for the PTM and Scooter Willis and Kyle
> Ellrott for the MSA project (and me).
>
> I want to thank all of of you who submitted proposals or showed interest in
> other ways for the Google Summer of Code. We hope you are not too
> disappointed if your application did not get accepted this time. We had a
> large number (52) applications and the the overall quality of the
> submissions was very high. We would like to stay in touch with you and we
> hope that you are interested in BioJava also beyond the scope of GSoC. There
> are a number of different ways how to contribute: We are always looking for
> people who provide code and patches to further improve our library, help out
> with the documentation on the Wiki page, or answer questions on the mailing
> lists.
>
> Let's all give Mark and Jianjiong a warm welcome to the BioJava community.
> For those of you who are interested in following the progress of the
> projects, as usually, the development related discussions are going to be on
> the biojava-dev list.
>
> Happy coding!
>
> Andreas
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>


From chapman at cs.wisc.edu  Wed Apr 28 04:18:25 2010
From: chapman at cs.wisc.edu (Mark Chapman)
Date: Tue, 27 Apr 2010 23:18:25 -0500
Subject: [Biojava-l] accepted GSoC projects
In-Reply-To: <u2w59a41c431004262233xe5553c17je23c2b42a3aae81d@mail.gmail.com>
References: <u2w59a41c431004262233xe5553c17je23c2b42a3aae81d@mail.gmail.com>
Message-ID: <4BD7B711.9090108@cs.wisc.edu>

Hi all,

Thank you to Google, Open Bioinformatics Foundation, BioJava, and my mentors for 
this opportunity.  As a short introduction, I am Mark Chapman, a graduate 
student in Computer Sciences at the University of Wisconsin - Madison.  My focus 
is in artificial intelligence and bioinformatics.  This summer, I will add a 
Multiple Sequence Alignment module to BioJava.

My first task will be to update the alignment module to BioJava3 and to design 
the interface for MSA.  My second goal is to implement a progressive MSA styled 
after clustalw.  After that, I will add alternative routines for each step.

Any ideas for the MSA project as well as more sources of programming wisdom are 
quite welcome.  For example, Andreas suggested a series about Java parallelism 
and lazy execution 
(http://apocalisp.wordpress.com/2008/06/18/parallel-strategies-and-the-callable-monad/). 
  I also noted a useful tip for iterative development 
(http://en.flossmanuals.net/GSoCMentoring/Workflow).

Thanks again,
Mark


On 4/27/2010 12:33 AM, Andreas Prlic wrote:
> Dear all,
>
> Google has released the results for GSoC: Congratulations to Mark
> Chapman and Jianjiong Gao for having been accepted to work on the MSA
> and PTM projects for BioJava! Let's start the "community bonding"
> process ( http://en.flossmanuals.net/GSoCMentoring/MindtheGap )  and we
> all are looking forward to work with you on this during the summer. The
> Mentors and co-mentors will be Peter Rose for the PTM and Scooter Willis
> and Kyle Ellrott for the MSA project (and me).
>
> I want to thank all of of you who submitted proposals or showed interest
> in other ways for the Google Summer of Code. We hope you are not too
> disappointed if your application did not get accepted this time. We had
> a  large number (52) applications and the the overall quality of the
> submissions was very high. We would like to stay in touch with you and
> we hope that you are interested in BioJava also beyond the scope of
> GSoC. There are a number of different ways how to contribute:  We are
> always looking for people who provide code and patches to further
> improve our library, help out with the documentation on the Wiki page,
> or answer questions on the mailing lists.
>
> Let's all give Mark and Jianjiong  a warm welcome to the BioJava
> community.  For those of you who are interested in following the
> progress of the projects, as usually, the development related
> discussions are going to be on the biojava-dev list.
>
> Happy coding!
>
> Andreas
>
>


From bernd.jagla at pasteur.fr  Wed Apr 28 07:25:05 2010
From: bernd.jagla at pasteur.fr (Bernd Jagla)
Date: Wed, 28 Apr 2010 09:25:05 +0200
Subject: [Biojava-l] DAS client: how to retrieve features for a sequence
	region
Message-ID: <7F6FFC6512D64E0E98DF74DB837C9610@zillumina>

Hi there,	

I am trying to retrieve information (features) from the UCSC genome browser
using the DAS interface. 
I am looking at the org.biojava.bio.program.das sources. I can retrieve all
top level entry points with 
DASSequenceDB(dbURL)
(Apparently the last entry of the returned XML object gives a
[Fatal Error] :1:1: Content is not allowed in prolog.
which I am ignoring...)

and also the DSN entries using:
DAS das = new DAS();
    das.addDasURL(new URL(dbURLString));
    for(Iterator i = das.getReferenceServers().iterator(); i.hasNext(); )
{....
     
When I try to access features for a top level entry point, i.e. a reference
sequence I have the impression that first all features for a given reference
sequence are being downloaded. 

My questions: 

How can I access only the features of a specific region? I guess in DAS
terms I want to specify the segment part of the URL
(http://genome.ucsc.edu/cgi-bin/das/hg17/features?segment=22:15000000,16000000).

I would also like to get the list of available features. How can I achieve
this? From the Wireshark output I can see that this is being retrieved somehow
behind the scenes. How can I access this information?

I am looking at TestDAS*.java; are there any other examples around that I
can use to learn from?

Thanks a lot for your kind support,

Best,

Bernd


From er.indupandey at gmail.com  Wed Apr 28 16:22:10 2010
From: er.indupandey at gmail.com (indu pandey)
Date: Wed, 28 Apr 2010 09:22:10 -0700
Subject: [Biojava-l] regarding errors
Message-ID: <q2n8ea551a31004280922v903336ffia92507377387fb43@mail.gmail.com>

Hi,

When I try to run this code

package javaapplication10;
import org.biojava.bio.symbol.*;
import org.biojava.bio.seq.*;

public class TranscribeDNAtoRNA {
   public static void main(String[] args) {
      try {
       //make a DNA SymbolList
       SymbolList symL = DNATools.createDNA("ATGTAAGGCCAGTGT");
       //transcribe it to RNA (after BioJava 1.4 this method is deprecated)
       symL = RNATools.transcribe(symL);
       //(after BioJava 1.4 use this method instead)
       symL = DNATools.toRNA(symL);
       //just to prove it worked
       System.out.println(symL.seqString());
      }
      catch (IllegalSymbolException ex) {
        //this will happen if you try and make the DNA seq using non IUB symbols
         ex.printStackTrace();
      }catch (IllegalAlphabetException ex) {
       //this will happen if you try and transcribe a non DNA SymbolList
         ex.printStackTrace();
      }
   }
}


I get the following error:

org.biojava.bio.symbol.IllegalAlphabetException: The source alphabet and translation table source alphabets don't match: RNA and DNA
        at org.biojava.bio.symbol.TranslatedSymbolList.<init>(TranslatedSymbolList.java:75)
        at org.biojava.bio.symbol.SymbolListViews.translate(SymbolListViews.java:125)
        at org.biojava.bio.seq.DNATools.toRNA(DNATools.java:490)
        at javaapplication10.TranscribeDNAtoRNA.main(TranscribeDNAtoRNA.java:23)


From andreas at sdsc.edu  Wed Apr 28 17:31:58 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 28 Apr 2010 10:31:58 -0700
Subject: [Biojava-l] accepted GSoC projects
In-Reply-To: <4BD7B711.9090108@cs.wisc.edu>
References: <u2w59a41c431004262233xe5553c17je23c2b42a3aae81d@mail.gmail.com>
	<4BD7B711.9090108@cs.wisc.edu>
Message-ID: <w2g59a41c431004281031oe53560d6j2826a4cf4e5cb24d@mail.gmail.com>

> Any ideas for the MSA project as well as more sources of programming wisdom
> are quite welcome.  For example, Andreas suggested a series about Java
> parallelism and lazy execution (
> http://apocalisp.wordpress.com/2008/06/18/parallel-strategies-and-the-callable-monad/).
>


Credit for the links goes to Scooter, who recommended those ;-)  My general
recommendation is to read Joshua Bloch's "Effective Java"
(http://java.sun.com/docs/books/effective/). It is a collection of rules that
should help in avoiding some frequently made mistakes.

Andreas




-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From jw12 at sanger.ac.uk  Wed Apr 28 20:21:13 2010
From: jw12 at sanger.ac.uk (Jonathan Warren)
Date: Wed, 28 Apr 2010 21:21:13 +0100
Subject: [Biojava-l] DAS client: how to retrieve features for a sequence
	region
In-Reply-To: <7F6FFC6512D64E0E98DF74DB837C9610@zillumina>
References: <7F6FFC6512D64E0E98DF74DB837C9610@zillumina>
Message-ID: <58A5F4B6-C3E3-45FA-9C15-6676BB953C4F@sanger.ac.uk>

Hi Bernd

For the UCSC server you need to filter on types. See
http://genome.ucsc.edu/FAQ/FAQdownloads.html#downloads
where there is a section called "Downloading data from the UCSC DAS server".

For DAS libraries there is a tutorial here: http://www.biodas.org/wiki/DASWorkshop2010#Day_2

The one you would be most interested in is the Dasobert tutorial
(http://www.ebi.ac.uk/~rafael/dokuwiki/doku.php?id=das:courses:dasobert)
for DAS client creation, but there is also a good JavaScript
library called JSDas.

If you need any more info, don't hesitate to ask.

Jonathan.


Jonathan Warren
Senior Developer and DAS coordinator
jw12 at sanger.ac.uk
Ext: 2314
Telephone: 01223 492314


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 


From chapman at cs.wisc.edu  Thu Apr 29 01:09:07 2010
From: chapman at cs.wisc.edu (Mark Chapman)
Date: Wed, 28 Apr 2010 20:09:07 -0500
Subject: [Biojava-l] [Biojava-dev] accepted GSoC projects
In-Reply-To: <6C3A102F-AF2B-4E29-9C84-BB6B881BD083@scripps.edu>
References: <u2w59a41c431004262233xe5553c17je23c2b42a3aae81d@mail.gmail.com>
	<4BD7B711.9090108@cs.wisc.edu>
	<w2g59a41c431004281031oe53560d6j2826a4cf4e5cb24d@mail.gmail.com>
	<6C3A102F-AF2B-4E29-9C84-BB6B881BD083@scripps.edu>
Message-ID: <4BD8DC33.7010607@cs.wisc.edu>

Here is a summary of the concurrency lessons I learned that are useful with or 
without the functional programming paradigm --

1: implement Callable<T> to submit tasks for concurrent/parallel/lazy execution
  - call() methods just wrap a call to the computation intensive method
2: share a fixed size thread pool with task queue to avoid
  - overhead of thread creation/destruction,
  - too many simultaneous threads, and
  - most blocking issues
3: place thread blocking Future<T>.get() calls within tasks later in the queue
  - while(!Future<T>.isDone()) Thread.yield(); may also help keep the pool active
4: execution in a task queue also enables easier logging and progress listening

There are two obvious places concurrent execution will fit in the MSA module --

1: building the distance matrix
  - queue pairwise alignment/scoring tasks in loop over all sequence pairs
2: progressive alignment
  - queue profile-profile alignment tasks in a post-order traversal of the
    guide tree (from leaves to root)
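
As a rough illustration of lessons 1-3 applied to the distance-matrix case, here is a minimal sketch using only java.util.concurrent. The class name, the scoreMatrix() method and the toy score() function are made up for this example and are not part of BioJava; a real scorer would run a pairwise alignment instead of counting matching positions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PairwiseScoreSketch {

    // Hypothetical stand-in for a pairwise alignment score:
    // counts the positions at which both sequences carry the same character.
    static int score(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int matches = 0;
        for (int i = 0; i < n; i++) {
            if (a.charAt(i) == b.charAt(i)) {
                matches++;
            }
        }
        return matches;
    }

    // Fills a symmetric score matrix by queueing one Callable per sequence
    // pair on a fixed-size thread pool. The diagonal is left at 0.
    static int[][] scoreMatrix(List<String> seqs, int threads) {
        int n = seqs.size();
        int[][] matrix = new int[n][n];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<Future<Integer>>();
            for (int i = 0; i < n; i++) {
                for (int j = i + 1; j < n; j++) {
                    final String a = seqs.get(i);
                    final String b = seqs.get(j);
                    // call() only wraps the computation-intensive method
                    futures.add(pool.submit(new Callable<Integer>() {
                        @Override
                        public Integer call() {
                            return score(a, b);
                        }
                    }));
                }
            }
            // Collect in submission order; get() blocks until that task is done.
            int k = 0;
            for (int i = 0; i < n; i++) {
                for (int j = i + 1; j < n; j++) {
                    int s = futures.get(k++).get();
                    matrix[i][j] = s;
                    matrix[j][i] = s;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while scoring", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("scoring task failed", e);
        } finally {
            pool.shutdown();
        }
        return matrix;
    }
}
```

All tasks are queued before the first blocking get(), so the pool stays busy while results are collected. The progressive alignment step could probably follow the same pattern, with each profile-profile task calling get() on the futures of its two child nodes.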

All our library copies of "Effective Java" are checked out, so I ordered a copy 
for my personal library.  The sample chapter on generics sold me.

Mark


On 4/28/2010 12:57 PM, Scooter Willis wrote:
> Andreas
>
> Those links were sent to me by Mark Southern, who sits a couple of doors down and is a past BioJava contributor for the sequence viewer. We should avoid bringing in any external parallel frameworks, but at a minimum give ourselves enough abstraction, with a backend multi-threaded job-processing approach, to take advantage of a multi-processor box and a cluster via Terracotta. If the abstraction of the jobs and the mapping of resources is generic enough, then that allows different implementations in various cluster environments for those who have found the next best thing in parallel computing!
>
> Scooter


From bernd.jagla at pasteur.fr  Thu Apr 29 06:30:03 2010
From: bernd.jagla at pasteur.fr (Bernd Jagla)
Date: Thu, 29 Apr 2010 08:30:03 +0200
Subject: [Biojava-l] DAS client: how to retrieve features for a sequence
	region
In-Reply-To: <58A5F4B6-C3E3-45FA-9C15-6676BB953C4F@sanger.ac.uk>
References: <7F6FFC6512D64E0E98DF74DB837C9610@zillumina>
	<58A5F4B6-C3E3-45FA-9C15-6676BB953C4F@sanger.ac.uk>
Message-ID: <C9F821CA96FB4466A66E969AEFCF125E@zillumina>

Hi Jonathan,

Just to clarify, do I need to write my own DAS client? I was hoping to be able
to use most of the functionality, especially for parsing the XML and
creating the URLs, by means of functions/methods that are already around.

I am now going into debug mode for the DAS package in BioJava to look for
the XML parsing; if you have any further pointers on specific methods I should
be looking at, it would mean a lot to me.

In short, I think I can create the URLs from scratch with not much effort. I
don't currently know how to put the XML into a data structure or what this
data structure should look like.

Thanks for your kind help,

Bernd

 


From jw12 at sanger.ac.uk  Thu Apr 29 08:26:40 2010
From: jw12 at sanger.ac.uk (Jonathan Warren)
Date: Thu, 29 Apr 2010 09:26:40 +0100
Subject: [Biojava-l] DAS client: how to retrieve features for a sequence
	region
In-Reply-To: <C9F821CA96FB4466A66E969AEFCF125E@zillumina>
References: <7F6FFC6512D64E0E98DF74DB837C9610@zillumina>
	<58A5F4B6-C3E3-45FA-9C15-6676BB953C4F@sanger.ac.uk>
	<C9F821CA96FB4466A66E969AEFCF125E@zillumina>
Message-ID: <A8AF8818-8B33-4869-8EF9-07A782C496FC@sanger.ac.uk>

The link I gave you
(http://www.ebi.ac.uk/~rafael/dokuwiki/doku.php?id=das:courses:dasobert)
shows examples of how to connect to 'European' style DAS sources.
For the UCSC and GBrowse type DAS sources you may have to play around
with the URLs to get the info you want, as they work slightly
differently from other DAS data sources and use the types to filter
data. I would suggest contacting UCSC for more info.
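
To make "playing around with the URLs" a little more concrete, here is a minimal sketch that only assembles DAS 1 style request strings and does not contact any server. It assumes the usual DAS 1 query layout (features?segment=REF:start,stop with an optional ;type= filter, and a types command on the same data source). The class and method names are invented for this example, and "refGene" is only an assumed type name; check the UCSC FAQ for the types a given assembly really offers.

```java
final class DasUrlSketch {

    // Appends a command name to a DAS data source URL such as
    // http://genome.ucsc.edu/cgi-bin/das/hg17
    private static StringBuilder command(String sourceUrl, String name) {
        StringBuilder url = new StringBuilder(sourceUrl);
        if (!sourceUrl.endsWith("/")) {
            url.append('/');
        }
        return url.append(name);
    }

    // Builds a "features" request for one region; pass null as type
    // to leave the type filter out.
    static String featuresUrl(String sourceUrl, String ref,
                              int start, int stop, String type) {
        StringBuilder url = command(sourceUrl, "features");
        url.append("?segment=").append(ref)
           .append(':').append(start)
           .append(',').append(stop);
        if (type != null && !type.isEmpty()) {
            url.append(";type=").append(type);
        }
        return url.toString();
    }

    // Builds the request that lists the feature types a source offers.
    static String typesUrl(String sourceUrl) {
        return command(sourceUrl, "types").toString();
    }
}
```

For example, featuresUrl("http://genome.ucsc.edu/cgi-bin/das/hg17", "22", 15000000, 16000000, "refGene") yields the segment URL from the original question with ";type=refGene" appended.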

The Dasobert library is what you should use; the DAS classes in BioJava
that you are currently looking at (such as DASSequenceDB.java) are old
and not really supported anymore.

> I was hoping to be able to use most of the functionality especially  
> for the parsing of the XML and creating the URLs by means of  
> functions/methods that are already around?
This is what the Dasobert library is for ;)



Jonathan Warren
Senior Developer and DAS coordinator
jw12 at sanger.ac.uk
Ext: 2314
Telephone: 01223 492314




From ayates at ebi.ac.uk  Thu Apr 29 08:51:23 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Thu, 29 Apr 2010 09:51:23 +0100
Subject: [Biojava-l] regarding errors
In-Reply-To: <q2n8ea551a31004280922v903336ffia92507377387fb43@mail.gmail.com>
References: <q2n8ea551a31004280922v903336ffia92507377387fb43@mail.gmail.com>
Message-ID: <C7EE3258-C65D-4899-A47F-B4F19B7DE2B9@ebi.ac.uk>

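I believe your problem is that you are attempting to transcribe the DNA to RNA twice. After the first call symL is already an RNA SymbolList, so DNATools.toRNA(symL) rejects it; that is what the "RNA and DNA" alphabet mismatch in the exception is telling you. If you comment out the line:

//symL = RNATools.transcribe(symL);

then you should find the code will work.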

Regards,

Andy


-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From bernd.jagla at pasteur.fr  Thu Apr 29 09:57:58 2010
From: bernd.jagla at pasteur.fr (Bernd Jagla)
Date: Thu, 29 Apr 2010 11:57:58 +0200
Subject: [Biojava-l] DAS client: how to retrieve features for a sequence
	region
In-Reply-To: <A8AF8818-8B33-4869-8EF9-07A782C496FC@sanger.ac.uk>
References: <7F6FFC6512D64E0E98DF74DB837C9610@zillumina>
	<58A5F4B6-C3E3-45FA-9C15-6676BB953C4F@sanger.ac.uk>
	<C9F821CA96FB4466A66E969AEFCF125E@zillumina>
	<A8AF8818-8B33-4869-8EF9-07A782C496FC@sanger.ac.uk>
Message-ID: <F1A08360F6094E559A20ED9CD0EA0168@zillumina>

Great, that is very helpful. One more question: should I be using the DAS1 or
DAS2 implementation? The demo I am looking at uses DAS2 (I think), but I am
running into problems. By modifying things in the Das2SourceHandler I can
now get IDs (instead of using the uri). Is this the right way of approaching
this, or should I be looking somewhere else?

When you say I have to play around with the URLs, can you give me an example?
Is the problem described above part of this? (It concerns the XML rather than
the URL.)

Sorry for these questions, but I find it extremely difficult to get my head
around all these different versions (DAS1/2; dasobert/programs.das;
European/REST).

Thanks a lot,

Bernd

PS. I guess I should have attended the recent meeting. ;(

 


From thomascramera at dnastar.com  Thu Apr 29 18:14:27 2010
From: thomascramera at dnastar.com (Andy Thomas-Cramer)
Date: Thu, 29 Apr 2010 13:14:27 -0500
Subject: [Biojava-l] PDBFileParser and Atom element symbol
In-Reply-To: <m2v59a41c431004261807h4440eac5t31871134c0a5d02f@mail.gmail.com>
References: <A4009967D1886D4286A9B7931FD58610021BEC1B@FS1.dnastar.com>
	<n2j59a41c431004231652sa0729050z9df823f6dbf0dfd8@mail.gmail.com>
	<A4009967D1886D4286A9B7931FD58610021BEC82@FS1.dnastar.com>
	<m2v59a41c431004261807h4440eac5t31871134c0a5d02f@mail.gmail.com>
Message-ID: <A4009967D1886D4286A9B7931FD58610021BF1D7@FS1.dnastar.com>

 
Yes, I would like to have direct access to the element symbol data
that's in the file. Otherwise, anyone that needs the element type has to
create rules for interpreting it from the "atom name" field. It feels
wrong to attempt to deduce data when it is provided explicitly.


These PDB remediation project notes suggest using the element symbol
specified in columns 77-78:

http://nar.oxfordjournals.org/cgi/content/full/36/suppl_1/D426#SEC3 

"Atom types are provided for every atom (i.e. ATOM record columns
77-78), so prior atom name justification conventions should no longer be
assumed in reading atom names."

 
JMOL uses the PDB element symbol if present, else interprets from the
"atom name" field. 

http://wiki.jmol.org/index.php/AtomSets 

"On PDB format, Jmol will identify the element from columns 77-78
(element symbol, right-justified). If this is absent, then it will
interpret the "atom name" field (columns 13-14) to deduce the element
identity."

JMOL is LGPL. If interpretation is desirable, BioJava could start with
JMOL's current approach. Personally, I would be happy just with access to
the data in the file.
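
As a rough sketch of the approach described in the Jmol quote (not Jmol's
or BioJava's actual code), the element can be read from columns 77-78 and,
only if those are blank or missing, guessed from columns 13-14 of the atom
name. The class, its methods, and the sampleLine helper are hypothetical
names used for illustration.

```java
class ElementColumnSketch {

    // Returns the 1-based, inclusive column range of a record, or "" if the
    // record is too short to contain the first column.
    static String columns(String record, int first, int last) {
        if (record.length() < first) {
            return "";
        }
        return record.substring(first - 1, Math.min(last, record.length()));
    }

    // Prefers the explicit element symbol (columns 77-78); falls back to the
    // first two columns of the atom name (13-14) with non-letters removed.
    static String elementSymbol(String record) {
        String explicit = columns(record, 77, 78).trim();
        if (!explicit.isEmpty()) {
            return explicit;
        }
        return columns(record, 13, 14).replaceAll("[^A-Za-z]", "");
    }

    // Builds an 80-column ATOM record containing only the fields used here.
    // atomName must be 4 characters, element must be 2 characters.
    static String sampleLine(String atomName, String element) {
        StringBuilder sb = new StringBuilder("ATOM      1");
        while (sb.length() < 80) {
            sb.append(' ');
        }
        sb.replace(12, 16, atomName);
        sb.replace(76, 78, element);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(elementSymbol(sampleLine(" CA ", " C"))); // carbon alpha
        System.out.println(elementSymbol(sampleLine("CA  ", "CA"))); // calcium
        System.out.println(elementSymbol(sampleLine(" CA ", "  "))); // fallback path
    }
}
```

The fallback is deliberately naive; it is exactly the kind of deduction
that becomes unnecessary when the element columns are present.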

 
________________________________

From: andreas.prlic at gmail.com [mailto:andreas.prlic at gmail.com] On Behalf
Of Andreas Prlic
Sent: Monday, April 26, 2010 8:08 PM
To: Andy Thomas-Cramer
Cc: biojava-l at lists.open-bio.org
Subject: Re: [Biojava-l] PDBFileParser and Atom element symbol

 
Hi Andy

Questions:

	* Is this pattern documented in the PDB specification? 


see here:
http://www.wwpdb.org/documentation/format23/sect9.html#ATOM
 

	* If this pattern can be relied on, why are columns 77-78 also
dedicated to the element symbol?

That is the atom's element symbol (as given in the periodic table), in
contrast to the first name, which contains numbering information.

	* Should reliance on the pattern be hidden behind a BioJava
method?


If you think that is important we could probably provide an enum for all
atom types. There are two categories though: the periodic table symbol
and the one that is related to the position in an amino acid....

Andreas 
 

________________________________


	From: andreas.prlic at gmail.com [mailto:andreas.prlic at gmail.com]
On Behalf Of Andreas Prlic
	Sent: Friday, April 23, 2010 6:52 PM
	To: Andy Thomas-Cramer
	Cc: biojava-l at lists.open-bio.org
	Subject: Re: [Biojava-l] PDBFileParser and Atom element symbol

	 
	Hi Andy,
	
	you could check with Atom.getFullname(), which contains the space
	characters from the PDB file:
	e.g. Calpha: " CA ", Calcium: "CA  "
	
	in addition the parent group of a Calpha atom is usually an AminoAcid
	and for Calciums it is a Hetatom group...
	
	Andreas

	On Fri, Apr 23, 2010 at 3:58 PM, Andy Thomas-Cramer
<thomascramera at dnastar.com> wrote:

	
	Is there an easy way to identify the type of atom referenced by an Atom
	object?
	
	For example, if Atom.getName() is "CA", is the element calcium or the
	atom carbon alpha?
	
	If not, would it be feasible to add a method providing this in Atom,
	AtomImpl, and parsing it in PDBFileParser, using the columns defined at
	http://www.wwpdb.org/documentation/format32/sect9.html#ATOM?
	
	
	_______________________________________________
	Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
	http://lists.open-bio.org/mailman/listinfo/biojava-l

	
	-- 
	
-----------------------------------------------------------------------
	Dr. Andreas Prlic
	Senior Scientist, RCSB PDB Protein Data Bank
	University of California, San Diego
	(+1) 858.246.0526
	
-----------------------------------------------------------------------


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


From pwrose at ucsd.edu  Thu Apr 29 19:53:33 2010
From: pwrose at ucsd.edu (Peter Rose)
Date: Thu, 29 Apr 2010 12:53:33 -0700
Subject: [Biojava-l] PDBFileParser and Atom element symbol
In-Reply-To: <mailman.9.1272384003.32055.biojava-l@lists.open-bio.org>
References: <mailman.9.1272384003.32055.biojava-l@lists.open-bio.org>
Message-ID: <002f01cae7d5$a673fcf0$f35bf6d0$@edu>

Since there was a request to be able to access element information, I've
added an Element enum to the org.biojava.bio.structure package that I had
developed for another application.

Each element has a number of properties such as atomic number, mass, min and
max valence, electronegativity, etc. that should be useful.

The AtomImpl class now has a getter and setter for Element.

Also, the PDB parser now populates the Element in the Atom class. By default
the PDB parser tries to parse the element from columns 77-78. As a fallback
for mis-formatted PDB files that don't contain an element column, the
element is parsed from the atom name.
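
To give an idea of the shape of such an enum, here is a heavily cut-down
sketch. It is not the Element enum added to org.biojava.bio.structure: the
name ElementSketch, the selection of elements, the two properties, and the
fromSymbol lookup are all assumptions made for illustration.

```java
enum ElementSketch {
    // Symbol(atomic number, standard atomic mass)
    H(1, 1.008),
    C(6, 12.011),
    N(7, 14.007),
    O(8, 15.999),
    S(16, 32.06),
    Ca(20, 40.078);

    private final int atomicNumber;
    private final double atomicMass;

    ElementSketch(int atomicNumber, double atomicMass) {
        this.atomicNumber = atomicNumber;
        this.atomicMass = atomicMass;
    }

    int getAtomicNumber() {
        return atomicNumber;
    }

    double getAtomicMass() {
        return atomicMass;
    }

    // Case-insensitive lookup, since PDB files write symbols in upper case
    // ("CA") and pad them with spaces (" C"). Returns null if unknown.
    static ElementSketch fromSymbol(String symbol) {
        String wanted = symbol.trim();
        for (ElementSketch e : values()) {
            if (e.name().equalsIgnoreCase(wanted)) {
                return e;
            }
        }
        return null;
    }
}
```

With a lookup like this, the string read from columns 77-78 maps directly
to a typed value, e.g. ElementSketch.fromSymbol("CA").getAtomicNumber().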

We'll also add element support for the cif parser soon.

-Peter

________________________________________________
Peter Rose, Ph.D.                         
Scientific Lead
RCSB Protein Data Bank (www.pdb.org)

San Diego Supercomputer Center (SDSC) and
Skaggs School of Pharmacy and Pharmaceutical Sciences

Pharmaceutical Sciences Building
University of California San Diego


-----Original Message-----
From: biojava-l-bounces at lists.open-bio.org
[mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of
biojava-l-request at lists.open-bio.org
Sent: Tuesday, April 27, 2010 9:00 AM
To: biojava-l at lists.open-bio.org
Subject: Biojava-l Digest, Vol 87, Issue 26

Today's Topics:

   1. Re: PDBFileParser and Atom element symbol (Andreas Prlic)
   2. Google Summer of Code - accepted students (Robert Buels)
   3. accepted GSoC projects (Andreas Prlic)
   4. Google Summer of Code - accepted students (Robert Buels)


----------------------------------------------------------------------

Message: 1
Date: Mon, 26 Apr 2010 18:07:53 -0700
From: Andreas Prlic <andreas at sdsc.edu>
Subject: Re: [Biojava-l] PDBFileParser and Atom element symbol
To: Andy Thomas-Cramer <thomascramera at dnastar.com>
Cc: biojava-l at lists.open-bio.org
Message-ID:
	<m2v59a41c431004261807h4440eac5t31871134c0a5d02f at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

Hi Andy

Questions:

> * Is this pattern documented in the PDB specification?
>

see here:
http://www.wwpdb.org/documentation/format23/sect9.html#ATOM


> * If this pattern can be relied on, why are columns 77-78 also dedicated to
> the element symbol?
>
That is the atom's element symbol (as given in the periodic table), in
contrast to the first name, which contains numbering information.

> * Should reliance on the pattern be hidden behind a BioJava method?
>

If you think that is important we could probably provide an enum for all
atom types. There are two categories though: the periodic table symbol and
the one that is related to the position in an amino acid....

Andreas


>
>
>
>  ------------------------------
>
> *From:* andreas.prlic at gmail.com [mailto:andreas.prlic at gmail.com] *On
> Behalf Of *Andreas Prlic
> *Sent:* Friday, April 23, 2010 6:52 PM
> *To:* Andy Thomas-Cramer
> *Cc:* biojava-l at lists.open-bio.org
> *Subject:* Re: [Biojava-l] PDBFileParser and Atom element symbol
>
>
>
> Hi Andy,
>
> you could check with  Atom.getFullname(), which contains the space
> characters from the PDB file:
> e.g Calpha: " CA ", Calcium "CA  "
>
> in addition the parent group of a Calpha atom is usually an AminoAcid and
> for Calciums it is a Hetatom group...
>
> Andreas
>
> On Fri, Apr 23, 2010 at 3:58 PM, Andy Thomas-Cramer <
> thomascramera at dnastar.com> wrote:
>
>
>
> Is there an easy way to identify the type of atom referenced by an Atom
> object?
>
> For example, if Atom.getName() is "CA", is the element calcium or the
> atom carbon alpha?
>
> If not, would it be feasible to add a method providing this in Atom,
> AtomImpl, and parsing it in PDBFileParser, using the columns defined at
> http://www.wwpdb.org/documentation/format32/sect9.html#ATOM?
>
>
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
>
>
>
> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------
>


-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------


------------------------------

Message: 2
Date: Mon, 26 Apr 2010 15:02:11 -0700
From: Robert Buels <rmb32 at cornell.edu>
Subject: [Biojava-l] Google Summer of Code - accepted students
To: rmb32 at cornell.edu
Message-ID: <4BD60D63.1040400 at cornell.edu>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Hi all,

I'm pleased to announce the acceptance of OBF's 2010 Google Summer of 
Code students, listed in alphabetical order with their project titles 
and primary mentors:

Mark Chapman (PM Andreas Prlic) - Improvements to BioJava including 
Implementation of Multiple Sequence Alignment Algorithms

Jianjiong Gao (PM Peter Rose) - BioJava Packages for Identification, 
Classification, and Visualization of Posttranslational Modification of 
Proteins

Kazuhiro Hayashi (PM Naohisa Goto) - Ruby 1.9.2 support of BioRuby

Sara Rayburn (PM Christian Zmasek) - Implementing Speciation & 
Duplication Inference Algorithm for Binary and Non-binary Species Tree

Joao Pedro Garcia Lopes Maia Rodrigues (PM Eric Talevich) - Extending 
Bio.PDB: broadening the usefulness of BioPython's Structural Biology module

Jun Yin (PM Chris Fields) - BioPerl Alignment Subsystem Refactoring

Congratulations to our accepted students!

All told, we had 52 applications submitted for the 6 slots (5 originally 
assigned, plus 1 extra) allotted to us by Google.  Proposals were 
extremely competitive: 6 out of 52 translates to an 11.5% acceptance 
rate.  We received a lot of really excellent proposals, and the decisions 
were not easy.

Thanks very much to all the students who applied; we very much 
appreciate your hard work.

Here's to a great 2010 Summer of Code, I'm sure these students will do 
some wonderful work.

Rob Buels
OBF GSoC 2010 Administrator


------------------------------

Message: 3
Date: Mon, 26 Apr 2010 22:33:51 -0700
From: Andreas Prlic <andreas at sdsc.edu>
Subject: [Biojava-l] accepted GSoC projects
To: Jianjiong Gao <jianjong.gao at gmail.com>, Mark Chapman
	<chapman at cs.wisc.edu>,	Biojava <biojava-l at lists.open-bio.org>,
	biojava-dev <biojava-dev at lists.open-bio.org>
Cc: "Rose, Peter" <pwrose at ucsd.edu>, Scooter Willis
	<HWillis at scripps.edu>,	Kyle Ellrott <kellrott at ucsd.edu>
Message-ID:
	<u2w59a41c431004262233xe5553c17je23c2b42a3aae81d at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

Dear all,

Google has released the results for GSoC: Congratulations to Mark Chapman
and Jianjiong Gao for having been accepted to work on the MSA and PTM
projects for BioJava! Let's start the "community bonding" process (
http://en.flossmanuals.net/GSoCMentoring/MindtheGap ); we are all
looking forward to working with you on this during the summer. The mentors
and co-mentors will be Peter Rose for the PTM project and Scooter Willis and
Kyle Ellrott for the MSA project (and me).

I want to thank all of you who submitted proposals or showed interest in
the Google Summer of Code in other ways. We hope you are not too
disappointed if your application was not accepted this time. We had a
large number (52) of applications and the overall quality of the
submissions was very high. We would like to stay in touch with you and we
hope that you are interested in BioJava beyond the scope of GSoC as well.
There are a number of different ways to contribute: we are always looking for
people who provide code and patches to further improve our library, help out
with the documentation on the Wiki page, or answer questions on the mailing
lists.

Let's all give Mark and Jianjiong a warm welcome to the BioJava community.
For those of you who are interested in following the progress of the
projects: as usual, the development-related discussions are going to be on
the biojava-dev list.

Happy coding!

Andreas


------------------------------


_______________________________________________
Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l


End of Biojava-l Digest, Vol 87, Issue 26
*****************************************


From marcel.huntemann at gmail.com  Fri Apr 30 00:49:10 2010
From: marcel.huntemann at gmail.com (Marcel Huntemann)
Date: Thu, 29 Apr 2010 17:49:10 -0700
Subject: [Biojava-l] Error during genbank parsing
Message-ID: <4BDA2906.20801@Gmail.com>

Hi!

I get the following error during the parsing of a genbank file:

Exception in thread "main" org.biojava.bio.BioException: Could not read sequence
	at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
	at gov.doe.jgi.img.pangenomes.Controller.createGeneMap(Controller.java:303)
	at gov.doe.jgi.img.pangenomes.Controller.start(Controller.java:197)
	at gov.doe.jgi.img.pangenomes.Main.createAndStartController(Main.java:105)
	at gov.doe.jgi.img.pangenomes.Main.main(Main.java:35)
Caused by: org.biojava.bio.seq.io.ParseException:

A Exception Has Occurred During Parsing.
Please submit the details that follow to biojava-l at biojava.org or post a
bug report to http://bugzilla.open-bio.org/

Format_object=org.biojavax.bio.seq.io.GenbankFormat
Accession=null
Id=null
Comments=Bad locus line
Parse_block=LOCUS   NC_008711      4597686 bp      DNA circular
17-DEC-2009
Stack trace follows ....


	at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:322)
	at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
	... 4 more

No matter which genbank file I use, I always get this error (of course with
a different LOCUS line). The strange thing is that this used to work about
half a year to a year ago. Now I wanted to use my program again and always
get this error, although I didn't really change anything in that code.
The only thing I can think of that's different since the last time I used
it (when it worked) is that I switched from a 32-bit Linux to a 64-bit
Linux machine. But can that really cause it?
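
One way to narrow this down might be to compare the LOCUS line of a failing
file with one from a file that still parses. The sketch below is only a
diagnostic, not BioJava's parser: it splits a LOCUS line on whitespace and
prints the fields. The class name is hypothetical. As far as I know, the
NCBI flat-file layout usually carries a division code (for example BCT)
between the topology and the date; whether its absence is what GenbankFormat
rejects here is an assumption and is not confirmed by the error output.

```java
class LocusLineSketch {

    // Splits a LOCUS line into its whitespace-separated fields.
    static String[] fields(String locusLine) {
        return locusLine.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // LOCUS line from the error report, re-joined onto one line.
        String line = "LOCUS   NC_008711      4597686 bp      DNA circular 17-DEC-2009";
        String[] f = fields(line);
        System.out.println(f.length + " fields: " + String.join(" | ", f));
    }
}
```

Running it on the first line of each genbank file shows quickly whether the
files that fail differ in the number or order of LOCUS fields.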

Here's my code and how I use it:

// gbkFile, numberInGenome, positions, geneIds, position and split are
// declared elsewhere in the class.
for ( String taxonId : givenTaxonIds )
{
    gbkFile = new File( dirPath + taxonId + gbkSuffix );
    if ( ! gbkFile.exists() )
    {
        logr.fatal( "Couldn't find genbank file for taxonOID " + taxonId +
                "!\nI tried " + gbkFile.getPath() + ", but it doesn't exist!" );
        System.exit( 0 );
    }

    BufferedReader br = new BufferedReader( new FileReader( gbkFile ) );
    Namespace ns = RichObjectFactory.getDefaultNamespace();

    RichSequenceIterator seqs = RichSequence.IOTools.readGenbankDNA( br, ns );
    numberInGenome = 0;
    while ( seqs.hasNext() )
    {
        // The exception above is thrown by this call
        RichSequence contig = seqs.nextRichSequence();
        // Get genes and their positions
        Set<Feature> features = contig.getFeatureSet();
        positions = new ArrayList<int[]>();
        geneIds = new ArrayList<String>();

        for ( Feature richFeature : features )
        {
            if ( richFeature.getType().equals( "CDS" ) )
            {
                // Store min, max and strand of each CDS
                RichLocation loc = (RichLocation) richFeature.getLocation();
                position = new int[3];
                position[0] = loc.getMin();
                position[1] = loc.getMax();
                position[2] = loc.getStrand().intValue();
                // The gene id is the part of the "note" value after "="
                Annotation a = richFeature.getAnnotation();
                split = a.getProperty( "note" ).toString().split( "=" );
                geneIds.add( split[1].trim() );
                positions.add( position );
            }
            else if ( richFeature.getType().equals( "gene" ) )
            {
                // Genes are only recorded if they are marked as pseudo
                Annotation a = richFeature.getAnnotation();
                if ( a.containsProperty( "pseudo" ) )
                {
                    RichLocation loc = (RichLocation) richFeature.getLocation();
                    position = new int[3];
                    position[0] = loc.getMin();
                    position[1] = loc.getMax();
                    position[2] = loc.getStrand().intValue();
                    split = a.getProperty( "note" ).toString().split( "=" );
                    geneIds.add( split[1].trim() );
                    positions.add( position );
                }
            }
        }
    }
}

Thanks 4 the help,
Marcel

P.S.: Also, the info on some of the biojava pages seems outdated. I got the
latest version from your svn trunk, and the GetStarted page says that
one just has to call ant to build it. But there's no build.xml in the
biojava folder. Instead there's a pom.xml, so I guess you switched to maven.
I bet a lot of people don't know how to deal with that and have no clue what
to do when the ant command doesn't work...


From narciso at cnpaf.embrapa.br  Fri Apr 30 21:32:02 2010
From: narciso at cnpaf.embrapa.br (Marcelo Goncalves Narciso (Pesquisador))
Date: Fri, 30 Apr 2010 19:32:02 -0200
Subject: [Biojava-l] problems with intallation of biojava in windows 7
In-Reply-To: <20100430184758.M13673@cnpaf.embrapa.br>
References: <20100430184758.M13673@cnpaf.embrapa.br>
Message-ID: <20100430212950.M75279@cnpaf.embrapa.br>

Hi, people,

I need your help.

When I try to install BioJava on Windows 7, this happens:

> C:\Users\narciso\biojava>java -jar biojava-1.7.1-all.jar
> Failed to load Main-Class manifest attribute from
> biojava-1.7.1-all.jar
How can I fix it?
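
For reference, that message means the jar's manifest has no Main-Class
entry, so "java -jar" has nothing to start. That is normal for a library
jar, which is put on the classpath with -cp rather than run directly. The
sketch below uses only the standard java.util.jar API to check a jar for
that entry; the class and method names are hypothetical, and that
biojava-1.7.1-all.jar is meant to be used purely as a library is my
assumption.

```java
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

class MainClassCheck {

    // Returns the Main-Class attribute of the jar, or null if the jar has
    // no manifest or the manifest has no Main-Class entry.
    static String mainClassOf(String jarPath) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            Manifest manifest = jar.getManifest();
            if (manifest == null) {
                return null;
            }
            return manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        }
    }

    public static void main(String[] args) throws IOException {
        String mainClass = mainClassOf(args[0]);
        if (mainClass == null) {
            System.out.println("No Main-Class: put the jar on the classpath with -cp");
        } else {
            System.out.println("Main-Class: " + mainClass);
        }
    }
}
```

If no Main-Class is reported, a program that uses the library would be
compiled and run with the jar on the classpath, for example
"javac -cp biojava-1.7.1-all.jar;. MyProgram.java" followed by
"java -cp biojava-1.7.1-all.jar;. MyProgram" on Windows, where MyProgram is
a placeholder for your own class.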

Thanks a lot

Marcelo