[BioSQL-l] Recording "nucleotide" in the sequence table?

Peter biopython at maubp.freeserve.co.uk
Sat May 16 23:06:41 UTC 2009


On 5/16/09, Hilmar Lapp <hlapp at gmx.net> wrote:
>
>  I think we'll have to define carefully what we mean by "generic nucleotide
> alphabet". (Normally I hear nucleotide used as the type of a sequence, but
> not its alphabet.)

In Biopython the type of a sequence (e.g. DNA, RNA or Protein) is
recorded by an alphabet object (which may also record the expected
range of letters).

>  A nucleotide alphabet in the way you describe it also can't really be the
> "base class" for either a DNA or RNA alphabet, can it? Typically in OOP,
> derived classes expand on a base class, not restrict it. So isn't there
> potential for confusion?

Well, that's how it was done for the Biopython alphabet classes.
I'm simplifying slightly, but at the top level we have a generic
alphabet, which has as children generic protein and generic
nucleotide (which has as its children generic dna and generic
rna).  Each of these then has IUPAC subclasses, which are further
restrictions where the valid letters are prescribed.
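
As a rough illustration of that hierarchy, here is a minimal Python
sketch. The class names, the `letters` attribute and the `is_valid`
helper are stand-ins written for this example; they are assumptions
and not the actual Biopython alphabet classes. The point is only that
each subclass narrows what its parent allows.

```python
# Illustrative stand-ins only; these are NOT the real Biopython classes.

class Alphabet:
    """Generic alphabet: sequence type unknown, any letters allowed."""
    letters = None  # None means no restriction on letters is recorded


class ProteinAlphabet(Alphabet):
    """Generic protein: type known, letters unrestricted."""


class NucleotideAlphabet(Alphabet):
    """Generic nucleotide: could be DNA or RNA."""


class DNAAlphabet(NucleotideAlphabet):
    """Generic DNA."""


class RNAAlphabet(NucleotideAlphabet):
    """Generic RNA."""


class UnambiguousDNA(DNAAlphabet):
    """IUPAC-style restriction: only these letters are valid."""
    letters = "GATC"


class UnambiguousRNA(RNAAlphabet):
    """IUPAC-style restriction: only these letters are valid."""
    letters = "GAUC"


def is_valid(sequence, alphabet):
    """Return True if every letter in sequence is allowed by alphabet."""
    if alphabet.letters is None:
        return True
    return all(letter in alphabet.letters for letter in sequence.upper())


# GCGCGCGA fits both restricted alphabets, so the letters alone
# cannot tell us whether it is DNA or RNA.
print(is_valid("GCGCGCGA", UnambiguousDNA))  # True
print(is_valid("GCGCGCGA", UnambiguousRNA))  # True
print(is_valid("GCGU", UnambiguousDNA))      # False
```

So a derived class here restricts its base rather than expanding it,
which is the design choice being questioned above.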

> What you are essentially talking about is the case when a sequence
> contains only A, C, and G. In that case, we don't know either that
> it's not protein, do we?
>
> > [...] In python "guessing" is discouraged.  If we have a nucleotide
> > sequence like GCGCGCGA, this could be DNA or RNA - you can't
> > tell.
>
> And how do you tell it's nucleotide to begin with?

That is the whole point.  When deciding what to record in the
biosequence.alphabet field in BioSQL we (Biopython) can only
go by the alphabet associated with the sequence object.
Whoever created the sequence specified the alphabet based
on metadata, external knowledge, or a guess. If this was
done by a parser, then the file format itself may have
specified the sequence type.

If none of BioPerl, BioJava and BioRuby have an analogous
sequence representation for a nucleotide sequence which
might be DNA or RNA, then perhaps the current situation
with only "protein", "dna", "rna" and "unknown" in the
biosequence.alphabet field in BioSQL is sufficient.
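
To make the consequence concrete, here is a hedged sketch of how an
alphabet object might be mapped onto those four values. The classes
and the `biosequence_alphabet` function are hypothetical names made
up for this example, not the real Biopython or BioSQL loader code.
Under this scheme a generic nucleotide alphabet has nowhere to go
except "unknown".

```python
# Hypothetical stand-ins for illustration; not the real loader code.

class Alphabet:
    """Generic alphabet: sequence type not known."""


class ProteinAlphabet(Alphabet):
    pass


class NucleotideAlphabet(Alphabet):
    """Nucleotide, but not known to be DNA or RNA."""


class DNAAlphabet(NucleotideAlphabet):
    pass


class RNAAlphabet(NucleotideAlphabet):
    pass


def biosequence_alphabet(alphabet):
    """Pick the string to store in biosequence.alphabet.

    Only the alphabet object is consulted; the sequence letters are
    never inspected, so nothing is guessed.
    """
    if isinstance(alphabet, ProteinAlphabet):
        return "protein"
    if isinstance(alphabet, DNAAlphabet):
        return "dna"
    if isinstance(alphabet, RNAAlphabet):
        return "rna"
    # Generic and generic-nucleotide alphabets both end up here.
    return "unknown"


print(biosequence_alphabet(DNAAlphabet()))         # dna
print(biosequence_alphabet(NucleotideAlphabet()))  # unknown
```

The information that the sequence is nucleotide is lost on the way
in, which is what a "nucleotide" value in the field would preserve.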

Peter


