[Biopython] StockholmIO replaces "." with "-", why?

Chris Fields cjfields at illinois.edu
Fri Apr 9 12:51:35 UTC 2010


On Apr 9, 2010, at 7:08 AM, Peter wrote:

> On Thu, Apr 8, 2010 at 9:04 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> On Thu, Apr 8, 2010 at 1:57 AM, Bryan Lunt <lunt at ctbp.ucsd.edu> wrote:
>>> Greetings All!
>>> 
>>> It looks like line 364 of Bio.AlignIO.StockholmIO reads:
>>> 
>>> seqs[id] += seq.replace(".","-")
>>> 
>>> So when you load alignments that use "." to mark gaps introduced
>>> to accommodate inserts (such as Pfam alignments or the output of
>>> HMMER), that information is lost.
>>> 
>>> I know there must be a good reason for this, but I am finding it a
>>> problem on my end.
>>> 
>>> -Bryan Lunt
>> 
>> Hi Bryan,
>> 
>> Yes, it is done deliberately. The dot is a problem - it has a quite
>> specific meaning of "same as above" in other alignment file
>> formats, while "-" is an almost universal shorthand for gap/insertion.
>> Consider the use case of Stockholm to PHYLIP/FASTA/Clustal
>> conversion.
>> 
>> Have you got a sample output file we can use as a unit test or
>> at least discuss? As I recall, on the PFAM alignments I looked
>> at there was no data loss by doing the dot to dash mapping.
> 
> According to http://sonnhammer.sbc.su.se/Stockholm.html
>>> Sequence letters may include any characters except
>>> whitespace. Gaps may be indicated by "." or "-".
> 
> So a Stockholm file using a mixture of "." and "-" would be
> valid but a bit odd. Why would anyone do that?
> 
> Peter
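
To make the point under discussion concrete, here is a minimal plain-Python sketch of what a dot-to-dash mapping does to a row that mixes both gap characters. It is not Biopython's actual parser: the function name and the sample row are made up for illustration, and only the str.replace call mirrors the line Bryan quoted.

```python
def normalize_gaps(row):
    """Map '.' gap characters to '-', mirroring the quoted replace call."""
    return row.replace(".", "-")


# Hypothetical aligned row: '.' pads an insert column, '-' marks a deletion.
original = "ACD..ef-GH"
normalized = normalize_gaps(original)

# The row keeps its length, but the two kinds of gap are no longer
# distinguishable after the mapping.
print(original)
print(normalized)
```

The column count is unchanged, so conversion to formats that only know "-" works, but the distinction between the two gap symbols cannot be recovered from the normalized row.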

Just curious, because this is a point of contention in BioPerl. How does Biopython internally set which symbols correspond to residues/gaps/frameshifts/other? BioPerl retains the original sequence but uses regexes for validation and for methods that return symbol-related information (e.g. gap counts).

(BTW, the contention here isn't that we use regexes, but that we set them globally).
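
As a rough illustration of that approach (keep the original symbols, derive gap information through a regex), here is a small Python sketch. It is neither BioPerl's nor Biopython's API: the class name, method names, and default pattern are all assumptions made up for this example. The pattern is passed per object rather than set globally, which is the alternative to the global setting mentioned above.

```python
import re

# Assumed default for this sketch: treat both '.' and '-' as gap symbols.
DEFAULT_GAP_PATTERN = r"[.\-]"


class AlignedSeq:
    """Hypothetical wrapper that stores the row exactly as it was read."""

    def __init__(self, text, gap_pattern=DEFAULT_GAP_PATTERN):
        self.text = text  # original symbols, never rewritten
        self._gap_re = re.compile(gap_pattern)  # per-object, not global

    def gap_count(self):
        """Count symbols that match this object's gap pattern."""
        return len(self._gap_re.findall(self.text))

    def ungapped(self):
        """Return the row with gap symbols removed."""
        return self._gap_re.sub("", self.text)


seq = AlignedSeq("ACD..ef-GH")
dash_only = AlignedSeq("ACD..ef-GH", gap_pattern=r"-")

print(seq.text, seq.gap_count(), seq.ungapped())
print(dash_only.gap_count())
```

With the pattern held on each object, two rows in the same program can disagree about whether "." counts as a gap without either one losing the characters it was read with.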


chris






