[Biopython] StockholmIO replaces "." with "-", why?

Chris Fields cjfields at illinois.edu
Fri Apr 9 13:28:42 UTC 2010


On Apr 9, 2010, at 8:21 AM, Peter wrote:

> On Fri, Apr 9, 2010 at 1:51 PM, Chris Fields <cjfields at illinois.edu> wrote:
>> 
>> 
>> Just curious, b/c this is a point of contention in BioPerl.  How does BioPython
>> internally set what symbols correspond to residues/gaps/frameshifts/other?
>> BioPerl retains the original sequence but uses regexes for validation and
>> methods that return symbol-related information (e.g. gap counts).
>> 
>> (BTW, the contention here isn't that we use regexes, but that we set them globally).
>> 
>> chris
> 
> Hi Chris,
> 
> The short answer is gaps are by default "-", and stop codons are "*", but
> beyond that it would be down to user code to interpret odd symbols.
> 
> Our sequences have an alphabet object which can specify the letters (as
> a set of expected characters), with explicit support for a single gap
> character (usually "-"), and for proteins a single stop codon symbol (usually
> "*"). This could in theory be extended to define other symbols too. The gap
> char does get treated specially in some of the alignment code (e.g. for
> calling a consensus), but I don't think we have anything built in regarding
> frameshifts.
> 
> Peter

Within LocatableSeq we define the following:

$GAP_SYMBOLS = '\-\.=~';
$FRAMESHIFT_SYMBOLS = '\\\/';
$OTHER_SYMBOLS = '\?';
$RESIDUE_SYMBOLS = '0-9A-Za-z\*';

Combined, these can be used in a regex to validate a sequence, or used separately for other purposes (counting gaps, frameshifts, etc.).  OTHER_SYMBOLS is really a catch-all for anything residue-like (counted in the sequence).  All of these can be redefined, but currently that's global, so it can have consequences in rare cases when mixing sequences from different formats.  We may localize them to work around that (part of a GSoC project for the alignment reimplementation).
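
For the Python folks, here is a rough sketch of the same idea. It is not BioPerl's or Biopython's actual code. The four character-class strings mirror the definitions above, while the function names (is_valid_seq, count_gaps) and the VALID_SEQ pattern are made up for illustration.

```python
import re

# Character-class fragments mirroring the LocatableSeq definitions above.
# Each string is meant to be placed inside [...] in a regex.
GAP_SYMBOLS = r'\-\.=~'
FRAMESHIFT_SYMBOLS = r'\\/'      # a literal backslash and a forward slash
OTHER_SYMBOLS = r'\?'
RESIDUE_SYMBOLS = r'0-9A-Za-z\*'

# Hypothetical combined pattern: every character must belong to one class.
VALID_SEQ = re.compile(
    f'[{RESIDUE_SYMBOLS}{GAP_SYMBOLS}{FRAMESHIFT_SYMBOLS}{OTHER_SYMBOLS}]+'
)
GAP_RE = re.compile(f'[{GAP_SYMBOLS}]')


def is_valid_seq(seq):
    """Return True if seq is non-empty and uses only the known symbols."""
    return VALID_SEQ.fullmatch(seq) is not None


def count_gaps(seq):
    """Count characters in seq that belong to the gap class."""
    return len(GAP_RE.findall(seq))
```

Because the original string is kept as-is, "." and "-" both count as gaps without either being rewritten. The constants here are module-level, so redefining them has the same global effect described above; passing them in as arguments would be one way to localize them.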

We had a Symbol class at one point but I believe it was considered too 'heavy,' though this may be more a consequence of Perl's hammered-on OO.  

chris




