[BioRuby] Parsing ClustalW files

Naohisa GOTO ngoto at gen-info.osaka-u.ac.jp
Mon Dec 28 10:26:52 EST 2009


On Sun, 27 Dec 2009 17:07:47 +0100
Pjotr Prins <pjotr.public14 at thebird.nl> wrote:

> On my ALN branch (http://github.com/pjotrp/bioruby/tree/ALN) I have
> added a unit test for ClustalW ALN format, as well as an update to the
> tutorial. 
> I have three comments. First I think the alignment parser belongs in
> ./lib/bio/db/clustalw.rb, rather than in ./lib/app/clustalw/report.rb.
> I can see how that originated, but it is an independent database
> format. This should also change the constructor call to, for example,
> Bio::ClustalWFormat.new, analogous to FastaFormat. As ClustalW files
> are ubiquitous we may want to rename this to an ALN format.

I think it is good to follow EMBOSS's naming rule.
In EMBOSS, the format names are "clustal" or "aln".

By the way, it is interesting that Clustal format isn't
described in the EMBOSS alignment formats

> Second, I added an index method [], to Bio::ClustalW::Report, so I can
> refetch a Bio::Sequence object *with* the ID/definition (see below).
> However it may be more appropriate to have this shared at the
> Bio::Alignment level. If you have a better way, I am all ears.

The reason there are no methods that return a Bio::Sequence object is
that the ClustalW parser and Bio::Alignment were first written before
Bio::Sequence was improved. It would be good to add methods
returning Bio::Sequence object(s) to the ClustalW parser.

Bio::Alignment is a container class, and I'm still looking for
better ways to store sequences and other information.
Any suggestions are welcome.

>    bioruby> aln = Bio::ClustalW::Report.new(File.new('../test/data/clustalw/example1.aln').readlines.join)

I think using File.read("...") is better than
File.new("...").readlines.join.
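
As a small illustration of the File.read suggestion, here is a plain-Ruby
sketch. It writes a temporary file (the content is made up for the example,
not taken from example1.aln) and checks that both ways of reading give the
same String. File.read does it in one call and closes the file afterwards.

```ruby
require 'tempfile'

# Hypothetical sample content; any text file would do.
sample = "CLUSTAL 2.0.9 multiple sequence alignment\n\nseq1  ACGT\nseq2  ACGA\n"

tmp = Tempfile.new('example_aln')
tmp.write(sample)
tmp.close

# Original form: opens a File, reads an Array of lines, joins them again.
# The File object stays open until it is garbage collected.
joined = File.new(tmp.path).readlines.join

# Suggested form: one call that returns a String and closes the file.
whole = File.read(tmp.path)

same = (joined == whole)
puts same
tmp.unlink
```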

>    bioruby> aln.header
>    ==> "CLUSTAL 2.0.9 multiple sequence alignment"
> Fetch a sequence
>    bioruby> seq = aln[1]
>    bioruby> seq.definition
>    ==> "gi|115023|sp|P10425|"
> Get the partial sequences
>    bioruby> seq.to_s[60..120]
> Show the full alignment residue match information for the sequences in the set
>    bioruby> aln.match_line[60..120]
>    ==> "     .     **. .   ..   ::*:       . * : : .        .: .* * *"
> Return a Bio::Alignment object
>    bioruby> aln.alignment.consensus[60..120]
>    ==> "???????????SN?????????????D??????????L??????????????????H?H?D"
> I also kinda disagree with the implementation of the current parser
> (Report). It has virtually no checking for bad input data,

Because there is no strict format definition and no detailed
documentation, it is hard to distinguish what is really "bad". In
addition, when I implemented the parser, I thought it was better to be
able to salvage data from a broken or incomplete file than to report an
error and stop parsing.
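
To make the "salvage" idea concrete, here is a rough sketch in plain Ruby.
It is not the actual Bio::ClustalW::Report code; the method name
salvage_clustal and the ROW pattern are assumptions made only for this
example. Rows that look like "name residues" are collected, and anything
else (header, match lines, broken rows) is skipped without raising an error.

```ruby
# Hypothetical pattern for a data row: name, residues, optional residue count.
ROW = /\A(\S+)\s+([A-Za-z.*-]+)(?:\s+\d+)?\s*\z/

# Hypothetical helper: returns a Hash of name => concatenated residues,
# keeping whatever rows can be recognised and ignoring the rest.
def salvage_clustal(text)
  seqs = {}
  text.each_line do |line|
    line = line.chomp
    next if line.strip.empty?        # blank separator lines
    next if line.start_with?('CLUSTAL')  # header line
    m = ROW.match(line)
    next unless m                    # match lines and broken rows are skipped
    seqs[m[1]] = seqs.fetch(m[1], '') + m[2]
  end
  seqs
end

# Made-up input: one broken row, and the second block lacks seq2.
text = <<~ALN
  CLUSTAL 2.0.9 multiple sequence alignment

  seq1      ACGT-ACGT
  seq2      ACGTTACGA
            **** ***
  this line is broken !!!

  seq1      GGCC
ALN

result = salvage_clustal(text)
p result
```

A strict parser would instead raise at the broken row or at the missing
seq2 line, and the data that could be read would be lost to the caller.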

> and it
> should accept an array of lines in addition to a String. 

I don't think so, because accepting two different data types would
make things complicated and make parsers harder to implement.
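
A minimal sketch of what I mean, with a hypothetical method
count_data_lines standing in for a parser entry point: the method accepts
a String only, and a caller who has an Array of lines joins it once before
the call.

```ruby
# Hypothetical parser entry point that accepts a String only.
def count_data_lines(text)
  raise ArgumentError, 'expected a String' unless text.is_a?(String)
  text.each_line.count { |l| !l.strip.empty? }
end

# Made-up input held as an Array of lines.
lines = ["CLUSTAL 2.0.9 multiple sequence alignment\n", "\n", "seq1  ACGT\n"]

# The caller converts Array -> String once; the parser needs no type checks
# beyond rejecting anything that is not a String.
n = count_data_lines(lines.join)
puts n
```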

> Was that three comments already? ;)
> Happy new year to everyone, and let 2010 be a strong year for BioRuby
> and friends!
> Pj.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
