[BioRuby] Parsing ClustalW files
Naohisa GOTO
ngoto at gen-info.osaka-u.ac.jp
Mon Dec 28 15:26:52 UTC 2009
Hi,
On Sun, 27 Dec 2009 17:07:47 +0100
Pjotr Prins <pjotr.public14 at thebird.nl> wrote:
> On my ALN branch (http://github.com/pjotrp/bioruby/tree/ALN) I have
> added a unit test for ClustalW ALN format, as well as an update to the
> tutorial.
>
> I have three comments. First, I think the alignment parser belongs in
> ./lib/bio/db/clustalw.rb, rather than in ./lib/app/clustalw/report.rb.
> I can see how that originated, but it is an independent database
> format. This should also change the constructor call to, for example,
> Bio::ClustalWFormat.new, analogous to FastaFormat. As ClustalW files
> are ubiquitous we may want to rename this to an ALN format.
I think it is good to follow EMBOSS's naming rule.
In EMBOSS, the format names are "clustal" or "aln".
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html
By the way, it is interesting that Clustal format isn't
described in the EMBOSS alignment formats
(http://emboss.sourceforge.net/docs/themes/AlignFormats.html).
> Second, I added an index method [], to Bio::ClustalW::Report, so I can
> refetch a Bio::Sequence object *with* the ID/definition (see below).
> However it may be more appropriate to have this shared at the
> Bio::Alignment level. If you have a better way, I am all ears.
There are no methods returning a Bio::Sequence object because the
ClustalW parser and Bio::Alignment were first written before
Bio::Sequence was improved. It would be good to add methods
returning Bio::Sequence object(s) to the ClustalW parser.
Bio::Alignment is a container class, and I'm still looking for
better ways to store sequences and other information.
Any suggestions are welcome.
> bioruby> aln = Bio::ClustalW::Report.new(File.new('../test/data/clustalw/example1.aln').readlines.join)
I think File.read("...") is better than
File.new("...").readlines.join.
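
As a small illustration in plain Ruby (the temporary file and its
contents are made up for this example), both expressions should yield
the same string, but File.read opens, reads, and closes the file in one
call, while File.new leaves the handle open unless it is closed
explicitly:

```ruby
require 'tempfile'

# Hypothetical stand-in for an .aln file; the contents are illustrative only.
tmp = Tempfile.new(['example', '.aln'])
tmp.write("CLUSTAL 2.0.9 multiple sequence alignment\n\nseq1  ACGT\n")
tmp.close

# Original style: the File object is never closed explicitly.
via_readlines = File.new(tmp.path).readlines.join

# Suggested style: a single call that also closes the file.
via_read = File.read(tmp.path)

same = (via_readlines == via_read)
puts same

tmp.unlink
```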
> bioruby> aln.header
> ==> "CLUSTAL 2.0.9 multiple sequence alignment"
>
> Fetch a sequence
>
> bioruby> seq = aln[1]
> bioruby> seq.definition
> ==> "gi|115023|sp|P10425|"
>
> Get the partial sequences
>
> bioruby> seq.to_s[60..120]
> ==> "LGYFNG-EAVPSNGLVLNTSKGLVLVDSSWDNKLTKELIEMVEKKFQKRVTDVIITHAHAD"
>
> Show the full alignment residue match information for the sequences in the set
>
> bioruby> aln.match_line[60..120]
> ==> " . **. . .. ::*: . * : : . .: .* * *"
>
> Return a Bio::Alignment object
>
> bioruby> aln.alignment.consensus[60..120]
> ==> "???????????SN?????????????D??????????L??????????????????H?H?D"
>
> I also kinda disagree with the implementation of the current parser
> (Report). It has virtually no checking for bad input data,
Because there is no strict format definition and no detailed
documentation, it is hard to distinguish what is really "bad". In
addition, when I implemented the parser, I thought it was better to be
able to salvage data from a broken or incomplete file than to report
an error and stop parsing.
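
To make the "salvage" idea concrete, here is a minimal sketch of a
lenient reader. It is NOT the code in Bio::ClustalW::Report; the method
name salvage_clustal, the line pattern, and the sample data are all
assumptions made for illustration. Lines that do not look like
"name residues" are set aside instead of raising an error:

```ruby
# Hypothetical helper, not part of BioRuby: keep whatever sequence data
# can be recognized and set aside everything else.
def salvage_clustal(text)
  sequences = {}
  skipped = []
  text.each_line do |raw|
    line = raw.chomp
    next if line.strip.empty?             # blank separators between blocks
    next if line.start_with?('CLUSTAL')   # header line
    next if line.match?(/\A\s/)           # match line (starts with whitespace)

    # Assumed layout: name, whitespace, residues, optional residue count.
    m = line.match(/\A(\S+)\s+([A-Za-z-]+)(?:\s+\d+)?\s*\z/)
    if m
      sequences[m[1]] = sequences.fetch(m[1], '') + m[2]
    else
      skipped << line                     # keep going instead of raising
    end
  end
  [sequences, skipped]
end

# Made-up input with one garbage line and a truncated final block.
sample = <<~ALN
  CLUSTAL 2.0.9 multiple sequence alignment

  seq1      ACGT-A
  seq2      ACGTTA
            ****.*
  ??? broken line ???

  seq1      GGC
  seq2      GG
ALN

sequences, skipped = salvage_clustal(sample)
p sequences
p skipped
```

A stricter parser could raise whenever skipped is non-empty; returning
it lets the caller decide.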
> and it
> should accept an array of lines in addition to a String.
I don't think so, because accepting two different data types would
make things complicated and make parsers harder to implement.
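
For example, a caller that already has an array of lines can join it
once before handing it to a String-only parser. The method
count_nonblank_lines below is a hypothetical stand-in for such a
parser, not a BioRuby API:

```ruby
# Hypothetical String-only entry point standing in for a real parser.
def count_nonblank_lines(text)
  raise ArgumentError, 'expected a String' unless text.is_a?(String)
  text.each_line.count { |line| !line.strip.empty? }
end

# Made-up input held as an array of lines.
lines = ["CLUSTAL 2.0.9 multiple sequence alignment\n", "\n", "seq1  ACGT\n"]

# The caller converts once, at the boundary, so the parser handles one type.
as_string = lines.join
count = count_nonblank_lines(as_string)
puts count
```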
> Was that three comments already? ;)
>
> Happy new year to everyone, and let 2010 be a strong year for BioRuby
> and friends!
>
> Pj.
>
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org