[Biopython-dev] SeqRecord and Alignment inconsistencies

Fri Feb 18 13:10:00 EST 2005

In the storm of fixes leading up to the next BioPython release, I have a
basic design question. There seems to be some inconsistency when it
comes to the use of the name, id, and description fields of the FASTA
format in the Alignment class. 

The FASTA format itself is ambiguous, but in general I expect the title
of the sequence to be a variant of one of these:

>id|description
>id description
>id

The important thing is that the description is optional, and the
SeqRecord() object can be instantiated with id and/or description (and
also a 'name', which makes things even more confusing!).

Unfortunately the Alignment class is not consistent with this
formatting. Its __init__() function asks for a descriptor, and then when
a sequence is being added to the Alignment in Alignment.add_sequence()
it uses this:

new_record = SeqRecord(new_seq, description = descriptor)

The id is never requested. The result of this inconsistency can be seen
in the following code. Consider the following two sequences in
foo.fasta:

>human
ATGGCAGCGGGGTTCGGG

>chimp
GTCCTGAGAAGTATCGGC

My instinct is to do this:
# -----
from Bio.SeqIO.FASTA import FastaReader
from Bio.Fasta.FastaAlign import FastaAlignment
from Bio.Alphabet import IUPAC

reader = FastaReader(open('foo.fasta',"r"),alphabet=IUPAC.ambiguous_dna)
alignment = FastaAlignment(alphabet=IUPAC.ambiguous_dna)
for seqRec in reader:
    alignment.add_sequence(sequence=seqRec.seq.data,\
            descriptor=seqRec.description)
print alignment
# -----

but this yields:

>
ATGGCAGCGGGGTTCGGG

>
GTCCTGAGAAGTATCGGC

This is because 'human' and 'chimp' are the id, not the description. Of
course one can get around this problem by simply passing the
SeqRecord.id into add_sequence(), but the SeqRecords stored in the
Alignment object are still messed up.

Is this behaviour intentional? If not, I'd recommend the following
changes:

1. Change the Alignment.add_sequence() signature to accommodate name,
id, and description.

2. Within add_sequence(), pass these to the internal SeqRecord object.

Unfortunately, derived classes like FastaAlignment have accommodated
this by simply using each SeqRecord's description as the id (see e.g.
FastaAlign.py, line 71). So any derived classes which are used for
output will also need to change so that they correctly use the id as the
primary name. This is therefore a radical change and is likely to break
some user code since it occurs in such a widely-applicable class.

I also would like to see a few other additions to the generic Alignment
class:

3. an add_sequence_record() method
4. a dictionary which keys each SeqRecord by id
5. a get_seq_by_id() method

I have made these changes in my local copy of the code, and if the
BioPython developers think these changes would be warranted I would be
happy to provide the updated Alignment class (Generic.py). As for the
derived classes, only FastaAlignment and ClustalAlignment are directly
derived, but there may be other methods which assume the description is
the id and would have to be changed. For example, searching for 'import
Alignment' shows that Saf.Record and example_ais2 both seem to depend on
this behavior. Hopefully regression testing would catch these.

Am I just being pedantic here, or has this confusion caused as many
subtle bugs for other people as it has for me?

Thanks,
Matt Dimmic