[BioRuby] Parsing ClustalW files

Mon Dec 28 10:26:52 EST 2009

Hi,

On Sun, 27 Dec 2009 17:07:47 +0100
Pjotr Prins <pjotr.public14 at thebird.nl> wrote:

> On my ALN branch (http://github.com/pjotrp/bioruby/tree/ALN) I have
> added a unit test for ClustalW ALN format, as well as an update to the
> tutorial. 
> 
> I have three comments. First I think the alignment parser belongs in
> ./lib/bio/db/clustalw.rb, rather than in ./lib/app/clustalw/report.rb.
> I can see how that originated, but it is an independent database
> format. This should also change the constructor call to, for example,
> Bio::ClustalWFormat.new, analogous to FastaFormat. As ClustalW files
> are ubiquitous we may want to rename this to an ALN format.

I think it would be good to follow EMBOSS's naming convention.
In EMBOSS, the format names are "clustal" or "aln".
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

By the way, it is interesting that Clustal format isn't
described in the EMBOSS alignment formats
(http://emboss.sourceforge.net/docs/themes/AlignFormats.html).

> Second, I added an index method [], to Bio::ClustalW::Report, so I can
> refetch a Bio::Sequence object *with* the ID/definition (see below).
> However it may be more appropriate to have this shared at the
> Bio::Alignment level. If you have a better way, I am all ears.

There are no methods returning a Bio::Sequence object because the
ClustalW parser and Bio::Alignment were first written before
Bio::Sequence was improved. It would be good to add methods
returning Bio::Sequence object(s) to the ClustalW parser.

Bio::Alignment is a container class, and I'm still looking for
better ways to store sequences and other information.
Any suggestions are welcome.

>    bioruby> aln = Bio::ClustalW::Report.new(File.new('../test/data/clustalw/example1.aln').readlines.join)

I think File.read("...") is better than
File.new("...").readlines.join.
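
To illustrate, here is a minimal sketch that does not need BioRuby at all.
It assumes a temporary file with made-up contents as a stand-in for
example1.aln. Both calls give the same String, but File.read also closes
the file handle for you, while File.new leaves it open until it is
garbage-collected.

```ruby
require 'tempfile'

# Hypothetical stand-in for an ALN file; the contents are invented for illustration.
tmp = Tempfile.new(['example', '.aln'])
tmp.write("CLUSTAL 2.0.9 multiple sequence alignment\n\nseq1  ACGT\nseq2  AC-T\n")
tmp.close

# Original style: opens a File object that is never explicitly closed.
joined = File.new(tmp.path).readlines.join

# Suggested style: reads the whole file into one String and closes the handle.
whole = File.read(tmp.path)

puts joined == whole
```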

>    bioruby> aln.header
>    ==> "CLUSTAL 2.0.9 multiple sequence alignment"
> 
> Fetch a sequence
> 
>    bioruby> seq = aln[1]
>    bioruby> seq.definition
>    ==> "gi|115023|sp|P10425|"
> 
> Get the partial sequences
> 
>    bioruby> seq.to_s[60..120]
>    ==> "LGYFNG-EAVPSNGLVLNTSKGLVLVDSSWDNKLTKELIEMVEKKFQKRVTDVIITHAHAD"
> 
> Show the full alignment residue match information for the sequences in the set
> 
>    bioruby> aln.match_line[60..120]
>    ==> "     .     **. .   ..   ::*:       . * : : .        .: .* * *"
> 
> Return a Bio::Alignment object
> 
>    bioruby> aln.alignment.consensus[60..120]
>    ==> "???????????SN?????????????D??????????L??????????????????H?H?D"
> 
> I also kinda disagree with the implementation of the current parser
> (Report). It has virtually no checking for bad input data,

Because there is no strict format definition and no detailed documentation,
it is hard to distinguish what is really "bad". In addition, when I
implemented the parser, I thought it was better to be able to salvage
data from a broken or incomplete file than to report an error
and stop parsing.
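
As a rough sketch of what "salvaging" means here (this is not the actual
Bio::ClustalW::Report code; salvage_clustal and the sample data are
hypothetical), a lenient reader can keep every line that looks like
"name  residues" and silently skip anything else, including a truncated
last line:

```ruby
# Hypothetical lenient reader, for illustration only.
# It collects "name  residues" lines and ignores everything it cannot use.
def salvage_clustal(text)
  seqs = Hash.new { |hash, key| hash[key] = +'' }
  text.each_line do |line|
    line = line.chomp
    next if line.strip.empty?          # blank lines between blocks
    next if line.start_with?('CLUSTAL') # header line
    next if line =~ /\A\s/             # match line starts with whitespace
    name, residues = line.split(/\s+/, 3)
    # Skip lines without a plausible residue column instead of raising.
    next unless residues && residues.match?(/\A[A-Za-z.*-]+\z/)
    seqs[name] << residues
  end
  seqs
end

# Made-up, deliberately incomplete input: the last line lost its residues.
broken = <<~ALN
  CLUSTAL 2.0.9 multiple sequence alignment

  seqA      ACGT-ACGT
  seqB      ACGTTACGT
            **** ****

  seqA      GGCC
  seqB
ALN

result = salvage_clustal(broken)
result.each { |name, residues| puts "#{name}: #{residues}" }
```

A strict parser would have to decide whether the final "seqB" line is an
error; the lenient one simply returns what it could read.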

> and it
> should accept an array of lines in addition to a String. 

I don't think so, because accepting two different data types would
make things complicated and make parsers harder to implement.
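
A caller who already holds an array of lines can join it before passing
it in, so the parser only needs one code path. A minimal sketch, where
parse_header is a hypothetical String-only entry point standing in for a
parser constructor:

```ruby
# Hypothetical String-only entry point, for illustration only.
def parse_header(text)
  raise ArgumentError, 'expected a String' unless text.is_a?(String)
  # The first line of a ClustalW file is the header.
  text.each_line.first.to_s.chomp
end

# Made-up input held as an array of lines on the caller's side.
lines = ["CLUSTAL 2.0.9 multiple sequence alignment\n", "\n", "seq1  ACGT\n"]

# The conversion is a single call at the call site.
header = parse_header(lines.join)
puts header
```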

> Was that three comments already? ;)
> 
> Happy new year to everyone, and let 2010 be a strong year for BioRuby
> and friends!
> 
> Pj.
> 

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org