[Bioperl-l] Bio::TreeIO

David Ardell dave.ardell@ebc.uu.se
Mon, 11 Feb 2002 18:53:27 +0100


Hello Jason, Hello Bioperl,

Thanks for the response to my email a week ago regarding Tree and TreeIO. Since I've come
home to my fast connection I've had time to skim bioperl 0.9.2 and the Tree and TreeIO
modules. 

You asked about what modules I have written, and what suggestions I could make about
bioperl. Over the last year, I have made notes about this. I am a bit afraid that these are
already out of date, although a quick check of the 0.9.2 Changes file suggests maybe not. So
here goes:

First I'll illustrate what I have used bioperl to do for my work.

Scripts:
gapfree  -- Remove gap-containing sites from alignments
            (UnivAln). Important for some analyses.
subfasta -- Extract sequences from multifasta files
            corresponding to reg-ex match on IDs or
            sequences.
fasuniq  -- Uniquify fasta files
xl       -- front end to translate, but supports gapped
            input (via my GapSeq class, see below),
            alignment of aa seq to coding seq, etc
monocomp -- monomer composition with various levels of
            strictness on type, flexible about alphabets
poscomp  -- monomer composition in frames
codaln   -- an updated replacement to protal2dna --
            alignment of DNA sequences by their protein
            translations, using the bioperl CLUSTALW driver

I realize this must mostly be standard fare.
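
To give a flavor of what gapfree does, here is a minimal sketch in plain Perl. It is not
the actual script (which works on UnivAln objects); the function name, the choice of gap
characters ('-', '~', '.') and the representation of the alignment as a list of
equal-length strings are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every alignment column in which at least one
# row has a gap character. Rows are plain strings of equal length.
sub gapfree {
    my @rows = @_;
    return () unless @rows;

    my $len = length($rows[0]);
    for my $row (@rows) {
        die "alignment rows differ in length\n" unless length($row) == $len;
    }

    # Collect the indices of the columns that contain no gap in any row.
    my @keep;
    COLUMN: for my $col (0 .. $len - 1) {
        for my $row (@rows) {
            next COLUMN if substr($row, $col, 1) =~ /[-~.]/;
        }
        push @keep, $col;
    }

    # Rebuild each row from the kept columns only.
    return map {
        my $row = $_;
        join '', map { substr($row, $_, 1) } @keep;
    } @rows;
}
```

Called as gapfree('AC-GT', 'A-CGT', 'ACCGT'), this keeps only the first, fourth and fifth
columns of each row.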

A more specialized research application that I wrote in bioperl takes 1) an annotated
sequence and 2) an alignment that includes that sequence, to then manipulate the alignment
according to feature annotations of the sequence, to produce a reordered (subsequenced,
complemented, etc.) alignment.

-----------------------

Two packages that I made and the reasons why:

GapSeqI -- an extension to PrimarySeqI to allow translation of sequences containing gaps
(preserving frame).

MySeqStats -- removed type-checking

------------------------

I guess the one major design choice in bioperl that I find myself working around most is the
built-in sequence strictness in parsers, constructors, and object methods.

Some examples off the top of my head, where this has presented problems: MSF files can have
tildes ('~') as gaps, and if I bring an MSF to a bioperl-readable format, the tildes are
retained. The '-' character is okay, but tildes choke somewhere between Seq and SeqIO.
Another example: I use the sprinzl tRNA database that annotates nonstandard nucleotides
in-sequence with 70 non-alphanumeric characters. Of course I can't expect to translate
this kind of sequence meaningfully, but why can't I convert it from fasta to selex within
bioperl? I can't get this data past the bioperl-y gates.

Suggestion: my wish would be for bioperl to have the same type-philosophy as perl itself --
type-permissive, and user beware. Functions would always try to return something as
reasonable as possible given the data. Strictness could be optionally enforced at the
method- or programmer-level through predefined regexp tests a la
if ($seq->seq =~ [:DNA-IUPAC:]) {}, etc, so that functions could know what they are dealing
with by examining the sequence and act appropriately.
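
A minimal sketch of what such opt-in tests might look like, in plain Perl. Nothing here is
existing bioperl API: the %ALPHABET hash, the alphabet names and the is_alphabet function
are hypothetical, and the gapped variant simply assumes '-', '~' and '.' as gap characters.

```perl
use strict;
use warnings;

# Hypothetical predefined alphabet tests, keyed by name (not bioperl API).
my %ALPHABET = (
    # The 15 IUPAC nucleotide codes for DNA, case-insensitive.
    'DNA-IUPAC'        => qr/^[ACGTRYSWKMBDHVN]+$/i,
    # Same codes plus assumed gap characters '-', '~' and '.'.
    'DNA-IUPAC-gapped' => qr/^[ACGTRYSWKMBDHVN.~-]+$/i,
);

# Return 1 if the string consists only of characters from the named
# alphabet, 0 otherwise. The caller decides what to do with the answer.
sub is_alphabet {
    my ($string, $name) = @_;
    my $re = $ALPHABET{$name} or die "unknown alphabet '$name'\n";
    return $string =~ $re ? 1 : 0;
}
```

The point of the design is that the check is something a method or a programmer asks for,
e.g. is_alphabet($seq->seq, 'DNA-IUPAC') before translating, rather than something the
parser or constructor enforces on every sequence.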

As a result of wanting to be able to translate gap-containing sequences (logically
well-defined) I wrote GapSeq, which is a PrimarySeqI.
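
The idea can be sketched in plain Perl as follows. This is not the GapSeq code itself:
the function name is hypothetical, only the standard genetic code is handled, and two
choices are assumptions of this sketch: a codon made entirely of gaps becomes '-', and any
other codon that cannot be looked up (partly gapped or ambiguous) becomes 'X'.

```perl
use strict;
use warnings;

# Standard genetic code, built from the usual 64-letter amino acid string
# with bases ordered T, C, A, G at each codon position.
my @bases = qw(T C A G);
my $aas   = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %code;
my $i = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $code{"$b1$b2$b3"} = substr($aas, $i++, 1);
        }
    }
}

# Hypothetical helper: translate in frame 1, keeping gaps in frame.
sub translate_gapped {
    my ($dna) = @_;
    my $seq = uc $dna;
    $seq =~ tr/U/T/;     # accept RNA input
    $seq =~ tr/~./-/;    # treat '~' and '.' as gaps too (assumption)

    my $protein = '';
    for (my $pos = 0; $pos + 3 <= length($seq); $pos += 3) {
        my $codon = substr($seq, $pos, 3);
        if ($codon eq '---') {
            $protein .= '-';                    # whole-codon gap stays a gap
        } else {
            $protein .= $code{$codon} // 'X';   # unknown or partly gapped codon
        }
    }
    return $protein;    # a trailing partial codon is ignored
}
```

With this, translate_gapped('ATG---GCT') keeps the gap in place between the two residues,
so the protein stays aligned with its coding sequence.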

This led me to the following question (actually, this note is old and I forget what I was
doing with GapSeq when it came up): should there be a copy constructor in order to be able
to initialize derived objects with data from a file? How are you supposed to use the IO
object with a derived sequence object?

The following comments are just jotted notes:

SeqIO:
  how do you get no-clobber behavior?

ClustalW module:
  maybe a public list of supported parameter names?

Bio::Tools::CodonTable:
  1. Genetic code design should have a hash from names to code numbers, which would protect
     against reordering by NCBI -- i.e. programmer access should be 'ciliate' rather than '6'
  2. Again, an exported hash of the tables supported by the module would be useful.
  3. The name: what about Bio::Tools::TranslationTable? A codon table means something
     pretty different to me.
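
A rough sketch of point 1, in plain Perl. The hash and function names are hypothetical and
the key spellings are just one possible choice; the numbers are the current NCBI genetic
code ids for a handful of tables, not the full list.

```perl
use strict;
use warnings;

# Hypothetical name-to-id lookup for a few NCBI genetic codes.
my %TABLE_ID = (
    standard                   => 1,
    vertebrate_mitochondrial   => 2,
    yeast_mitochondrial        => 3,
    invertebrate_mitochondrial => 5,
    ciliate                    => 6,
    bacterial                  => 11,
);

# Resolve a table name (case-insensitive) to its numeric id, or die
# with the list of names this sketch knows about.
sub table_id {
    my ($name) = @_;
    my $key = lc $name;
    exists $TABLE_ID{$key}
        or die "unknown genetic code '$name'; known: "
             . join(', ', sort keys %TABLE_ID) . "\n";
    return $TABLE_ID{$key};
}
```

The resolved number could then be handed to the existing constructor, presumably along the
lines of Bio::Tools::CodonTable->new(-id => table_id('ciliate')), so that only this one hash
would need updating if the numbering ever changed.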

That's about it. I did have a look at the Tree and TreeIO classes. They look like an
excellent interface. I like that the abstraction and functions encompass gene genealogies
right from the start. I haven't had a chance to play with them yet though.

My modules, the ones I wrote about before, are built on top of Graph, Graph::Reader, and
Graph::Writer. Their functionality seems to complement the functionality already existing in
Bio::Tree and Bio::TreeIO. My modules are really focused on some of the more tedious
everyday work of manipulating and publishing with phylogenetic trees.

I would certainly like your suggestions of where to publish them (from a namespace
perspective). Maybe the two efforts could be integrated.

Thanks again for all of your tremendous effort for open source bioinformatics. I'll be
passing along some scripts shortly.

all the best
dave

-- 
Dr. David Ardell                
NSF Fellow in Bioinformatics    
Dept. of Molecular Evolution   
Uppsala University
Norbyvägen 18C 
SE-75236 Uppsala, SWEDEN