[Biopython] StockholmIO replaces "." with "-", why?
Chris Fields
cjfields at illinois.edu
Fri Apr 9 12:51:35 UTC 2010
On Apr 9, 2010, at 7:08 AM, Peter wrote:
> On Thu, Apr 8, 2010 at 9:04 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> On Thu, Apr 8, 2010 at 1:57 AM, Bryan Lunt <lunt at ctbp.ucsd.edu> wrote:
>>> Greetings All!
>>>
>>> It looks like line 364 of Bio.AlignIO.StockholmIO reads:
>>>
>>> seqs[id] += seq.replace(".","-")
>>>
>>> So when you load into memory alignments that use "." to mark gaps
>>> created to allow alignment to inserts (such as PFam alignments or the
>>> output of hmmer), that information is lost.
>>>
>>> I know there must be a good reason for this, but I am finding it a
>>> problem on my end.
>>>
>>> -Bryan Lunt
>>
>> Hi Bryan,
>>
>> Yes, it is done deliberately. The dot is a problem - it has a quite
>> specific meaning of "same as above" in other alignment file
>> formats, while "-" is an almost universal shorthand for gap/insertion.
>> Consider the use case of Stockholm to PHYLIP/FASTA/Clustal
>> conversion.
>>
>> Have you got a sample output file we can use as a unit test or
>> at least discuss? As I recall, on the PFAM alignments I looked
>> at there was no data loss by doing the dot to dash mapping.
>
> According to http://sonnhammer.sbc.su.se/Stockholm.html
>>> Sequence letters may include any characters except
>>> whitespace. Gaps may be indicated by "." or "-".
>
> So a Stockholm file using a mixture of "." and "-" would be
> valid but a bit odd. Why would anyone do that?
>
> Peter
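
A minimal sketch of the mapping under discussion, in plain Python. It only mimics the quoted replace() call and is not the actual StockholmIO parser; the helper name normalize_gaps and the sample rows are made up for illustration.

```python
# Illustrative only: mirrors the replace() call quoted from StockholmIO,
# not the real Biopython parsing code.
def normalize_gaps(seq):
    """Map Stockholm '.' gap characters to '-'."""
    return seq.replace(".", "-")


# Hypothetical aligned row that mixes both gap symbols.
row = "ACG..T-A.C"
print(normalize_gaps(row))

# Two rows that differ only in which gap symbol they use.
a = "AC.GT"
b = "AC-GT"

# After the mapping they can no longer be told apart.
print(normalize_gaps(a) == normalize_gaps(b))
```

After the mapping, rows that differed only in their gap symbol compare equal, which is the information Bryan describes losing. The row length is unchanged, so column positions are preserved.
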
Just curious, because this is a point of contention in BioPerl: how does Biopython internally define which symbols correspond to residues, gaps, frameshifts, and so on? BioPerl retains the original sequence but uses regexes for validation and in methods that return symbol-related information (e.g. gap counts).
(BTW, the contention here isn't that we use regexes, but that we set them globally.)
chris
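
As a rough illustration of the approach described for BioPerl (keep the original string, answer symbol questions with regexes), here is a Python sketch. It is not BioPerl or Biopython code; the class AlignedRow, its methods, and the default gap pattern are assumptions made for the example.

```python
import re

# Assumption: "." and "-" are the only gap symbols, following the
# Stockholm description quoted earlier in the thread.
DEFAULT_GAP_PATTERN = re.compile(r"[.\-]")


class AlignedRow:
    """Hypothetical container that keeps the aligned text exactly as read."""

    def __init__(self, text, gap_pattern=DEFAULT_GAP_PATTERN):
        self.text = text                  # original symbols, never rewritten
        self._gap_pattern = gap_pattern   # per-object, not a global setting

    def gap_count(self):
        """Count characters matching the gap pattern."""
        return len(self._gap_pattern.findall(self.text))

    def ungapped(self):
        """Return the text with gap characters removed."""
        return self._gap_pattern.sub("", self.text)


row = AlignedRow("AC..G-T")
print(row.text, row.gap_count(), row.ungapped())

# A caller who treats only "-" as a gap can say so for this object alone.
strict = AlignedRow("AC..G-T", gap_pattern=re.compile(r"-"))
print(strict.gap_count())
```

Passing the pattern per object, rather than relying only on a module-level default, is one way to sidestep the global-setting concern mentioned above.
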