[BioRuby] Parsing ClustalW files

Tue Dec 29 14:00:49 UTC 2009

Hi Naohisa,

Thanks for the reply. I can improve my proposal if you help decide on
below points.

On Tue, Dec 29, 2009 at 12:26:52AM +0900, Naohisa GOTO wrote:
> I think it is good to follow EMBOSS's naming rule.
> In EMBOSS, the format names are "clustal" or "aln".
> http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

So, should be move the existing app/clustalw/report.rb to
db/clustal.rb? Or do we start a new db/clustal.rb - retaining the old
one for compatibility? Or do we just merge in my small changes?

> By the way, it is interesting that Clustal format isn't
> described in the EMBOSS alignment formats
> (http://emboss.sourceforge.net/docs/themes/AlignFormats.html).

It is a bit of a non-standard. I guess. Clustal and Muscle use it. It
is mostly nice for checking alignments in a text file. I need it to
send to other people. It is standard enough.

> > Second, I added an index method [], to Bio::ClustalW::Report, so I can
> > refetch a Bio::Sequence object *with* the ID/definition (see below).
> > However it may be more appropriate to have this shared at the
> > Bio::Alignment level. If you have a better way, I am all ears.
> 
> Why no methods that return a Bio::Sequence object is because the
> ClustalW parser and Bio::Alginment were first written before
> Bio::Sequence have been improved. It is good to write methods
> returning Bio::Sequence object(s) for ClustalW parser.
> 
> Bio::Alginment is a container class, and I'm still seeking
> what are better ways to store sequences and other information.
> Any suggestions are welcomed.

I think using [] is a good way to fetch Bio::Sequence objects from the
alignment. Also we could introduce an 'each_sequence' iterator,
returning Bio::Sequence, though 'each' itself would be more
consistent. The latter will break things, perhaps, for users.

The current alignment class stores keys and sequences separately. I
guess a list of Bio::Sequence would be more consistent. Maybe we can
discuss deeper design issues soon. I have some opinions which are
better vented in a round table. 

> >    bioruby> aln = Bio::ClustalW::Report.new(File.new('../test/data/clustalw/example1.aln').readlines.join)
> 
> I think using File.read("...") is better, instead of
> File.new("...").readlines.join.

OK

> Because no strict format definition and no detailed documents, and
> it is hard to distinguish what is really "bad". In addition, when I
> implemented the parser, I thoght it was good to be able to salvage
> data from broken or incomplete format rather than to report error
> and to stop parsing.

Hmmm. I think Bioruby should throw an exception when hitting bad
data. In this case it is easy - see the example I sent two days ago.
It just tests the indentation. Any data failure would be signalled.

> > and it
> > should accept an array of lines in addition to a String. 
> 
> I don't think so, because to accept two differenct data types would
> make things complicated, and make harder to implement parsers.

Well, we may want to change this. The current edition takes a string,
and its first action is to split it into an Array. I have to do an
Array.join to pass it in. That is two actions too many. This can get
bad with big data.

I also think the current parsers should *not* be string based. For
this example it is not a problem (ALN files are generally small), but
the only way to get rid of memory use issues is splitting the data. I
think it is not a good idea to have 1 Gb strings. 

Anyway, based on above I think my current ALN patch is acceptable -
apart from the documentation.

Pj.