Bio::Seq (was:Robert's chapter :))

Georg Fuellen fuellen@dali.mathematik.uni-bielefeld.de
Tue, 10 Dec 1996 18:35:18 +0000 (GMT)


Chris:

> >Do you think you could prepare Bio::Seq for a first beta release until, say,
> >Dec 15 (and Steve Brenner and/or I would do some final checking) ?
> 
> >That would include:
> >+ completing the POD (search for ``to be completed'' in Seq.pm)
> >+ add code to testSeq.pm testing/demonstrating your new code in particular
> >+ alphabet checking (before storing sequences in the object hash)
> >+ integrate Gilbert's ReadSeq so that it's called as a command for
> >  converting exotic formats, _if_ it's installed (before calling
> >  parse_bad )
> 
> I'd be happy to do what I can with Bio::Seq in preparation for a beta release.

Thanks, that's very much appreciated !

> I will need some "guidance" from people on some questions that I have
> listed below.

I hope the response below is of help; if Steve can comment as well (he's 
travelling in the US), that would be nice, but let's not wait for him.

> Regards,
> Chris Dagdigian
> cdagdigian@genetics.com
> 
> 
> #########################################################################
> [apology in advance if these are uninformed questions]
> 
> [This refers to the Seq.pm code as found at
> http://www.techfak.uni-bielefeld.de/bcd/Tec/Bioperl/Code/Bio/Seq.pm]
> 
> alphabet checking:
> 
>  o Should the 'base' DNA alphabet be "ACTG" or should we use the IUPAC
> extended genetic alphabet as the default base? It would be nice to have
> recognition for nucleotide ambiguity right from the start. Is there a
> downside to having the larger alphabet as the base for DNA sequences? [The
> IUPAC alphabet is described at
> http://www.techfak.uni-bielefeld.de/bcd/Curric/PrwAli/node7.html]

Can you try out using IUPAC, and if you encounter problems
(maybe w/ the complement subroutine ??), then you mail again ?
It seems that currently the alphabet matters only at very few 
locations in the code, so that you can easily go back to ACGT 
as soon as you discover a problem.

> alphabets in general:
> 
>  o With the %alphabet hash, there are some bits of code to add a gap "-" or
> an missing "?" character to the alphabets. I can see how ''1Mg'' =
> $SeqType{Dna}Mg and would contain an alphabet of ["A","C","T","G","?"].
> 
> What I'm not clear about is how a user would invoke that "alternate"
> alphabet over the base alphabet of $SeqType{Dna} = ["A","C","T","G"].
> Passing in '1Mg' as the "Type" field would result in a "Unknown" response
> when the %SeqType hash is checked wouldnt it? Am I missing something really
> simple here or should there be an optional constructor for specifying the
> alphabet?
> 
> That way you could construct:
>  $myseq = new Bio::Seq(-seq=>'ACTG???',-type=>'Dna',-alphabet=>'1Mg');
> 
>   -or even-
>  $myseq = new Bio::Seq(-seq=>'A-C-TG?-?-?',-type=>'Dna',-alphabet=>'1MgGp');

I wouldn't do this. In particular, numbers and acronyms like ``1Mg''
should not be visible to the user. 
I'd suggest that we simply always check against the maximum
possible alphabet for a given type, i.e. '1MgGp' for DNA, etc, and
provide a function that allows the user or other parts of the software
to check a sequence against a given alphabet, on demand. E.g., he/she
can check whether there are ``?'' (missing) symbols, or whether the
pure ACGT alphabet is used. That should be all for the moment. 
We may have got to spend some time on the issue as soon as Steve 
can comment, or maybe he just cleans up the issue when finalizing stuff ?!
*We shouldn't spend much time here at this moment !*

best wishes,
georg

> I guess without having the alphabet defined seperatly, I'm not quite sure
> how I would go about checking it whenever a sequence is assigned.
> 
> 
>  o ReadSeq
> 
>    No questions now but I'm sure I'll have some later :)
> 
> 
>