[Biopython-dev] Reading sequences: FormatIO, SeqIO, etc

Wed Aug 2 12:56:23 UTC 2006

Leighton Pritchard wrote:
> Fair point, but GFF3 (see below) can optionally carry sequence data,
> and I use them for exactly what you say here:
> 
>> maybe those files could be turned into SeqRecords or SeqFeatures 
>> (with empty sequences).
> 
> I was thinking that GFF3 would be more useful than GFF:
> 
> http://song.sourceforge.net/gff3.shtml
> 

Thanks for the links... interesting that GFF3 allows embedding Fasta
sequences.

>> Reading your other comments, it looks like you wouldn't miss 
>> FastaRecord or GenBank records if they were phased out.
> 
> Not personally, but others may have strong opinions and breakable 
> code, yet.

There is no need to remove the current modules, just mark them as
deprecated.  Of course, if there is some strong support for these
objects then we might not want to be so harsh...

> It may be a side-issue, but should a Clustal parser return an 
> Alignment object or iterate over SeqRecord objects?  And for that 
> matter, what about other MSA files in FASTA format?  I think we ought
> to allow parsers to return an Alignment where the user requests it, 
> which is a functionality I'm not currently aware of in the FASTA 
> sequence parsers.

In my opinion we should offer both.  I would go for loading
clustal/fasta alignments as sequence iterators (as part of the new SeqIO
code) and make it very easy to turn ANY sequence iterator returning
SeqRecords into an alignment.

The current alignment object stores its sequences as SeqRecords
internally but doesn't (yet) allow simple addition of SeqRecords - that
would have to be fixed but it looks easy enough.  Accepting a
SequenceIterator for __init__ would also be nice.
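
As a rough sketch of the idea (not the existing Biopython code): the
names Record, read_fasta and SimpleAlignment below are all hypothetical
stand-ins, and the only assumption about a "sequence iterator" is that
it yields record-like objects one at a time.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Record:
    """Hypothetical stand-in for a SeqRecord: an id plus a sequence string."""
    id: str
    seq: str


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Lazily yield one Record per FASTA entry (hypothetical reader)."""
    rec_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if rec_id is not None:
                yield Record(rec_id, "".join(chunks))
            # Use the first word of the title line as the id.
            parts = line[1:].split(None, 1)
            rec_id, chunks = (parts[0] if parts else ""), []
        elif line and rec_id is not None:
            chunks.append(line)
    if rec_id is not None:
        yield Record(rec_id, "".join(chunks))


class SimpleAlignment:
    """Hypothetical alignment that accepts any iterable of records."""

    def __init__(self, records: Iterable[Record] = ()):
        self._records: List[Record] = []
        for record in records:
            self.add_record(record)

    def add_record(self, record: Record) -> None:
        # All rows of an alignment must have the same length.
        if self._records and len(record.seq) != len(self._records[0].seq):
            raise ValueError("record %r has a different length" % record.id)
        self._records.append(record)

    def __len__(self) -> int:
        return len(self._records)

    def __iter__(self) -> Iterator[Record]:
        return iter(self._records)

    def column(self, index: int) -> str:
        """Return the letters found at one alignment column."""
        return "".join(record.seq[index] for record in self._records)


fasta_text = """>seq1 first
ACG-T
>seq2 second
AC-GT
"""
alignment = SimpleAlignment(read_fasta(fasta_text.splitlines()))
```

The same constructor would work for a Clustal reader or any other
source, as long as it hands back records of equal length.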

>> The individual readers should offer some level of control, for 
>> example the title2ids function for Fasta files lets the user decide
>> how the title line should be broken up into id/name/description. 
>> Also for some file formats the user should be able to specify the 
>> alphabet.
> 
> Could the alphabet be optionally specified by the user on parsing, 
> and maybe return a warning or error if there are non-compliant 
> symbols in the file, as a quick validator for bad sequences, or 
> reminder to the occasionally forgetful that, for example, they're not
> working with nucleotide sequences, today <cough, embarrassed glance 
> at floor> ;)

For some file formats the parser should be able to deduce the alphabet,
but for others, like Fasta, it must be specified.  I like the idea of
optionally checking the alphabet - but it would impose a speed penalty.

Do you think this should be done by the SeqRecord object (on request)?
Each parser could simply ask the SeqRecord object to verify itself
before returning it.
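
Something along these lines, as a minimal sketch only: Record, verify,
DNA_LETTERS and the alphabet/check arguments are hypothetical names made
up for illustration, and an "alphabet" is assumed here to be just a set
of allowed letters.  Checking is off by default, so nobody pays the
speed penalty unless they ask for it.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Iterator, Optional, Tuple

# Illustrative alphabet only: a plain set of allowed letters.
DNA_LETTERS = frozenset("ACGTN-")


@dataclass
class Record:
    """Hypothetical stand-in for a SeqRecord that can verify itself."""
    id: str
    seq: str
    alphabet: Optional[FrozenSet[str]] = None

    def verify(self) -> None:
        """Raise ValueError if the sequence uses letters outside the alphabet."""
        if self.alphabet is None:
            return  # Nothing to check against.
        bad = sorted(set(self.seq.upper()) - self.alphabet)
        if bad:
            raise ValueError(
                "%s: symbols not in alphabet: %s" % (self.id, "".join(bad))
            )


def _entries(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (id, sequence) pairs from FASTA-formatted lines."""
    rec_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if rec_id is not None:
                yield rec_id, "".join(chunks)
            parts = line[1:].split(None, 1)
            rec_id, chunks = (parts[0] if parts else ""), []
        elif line and rec_id is not None:
            chunks.append(line)
    if rec_id is not None:
        yield rec_id, "".join(chunks)


def read_fasta(
    lines: Iterable[str],
    alphabet: Optional[FrozenSet[str]] = None,
    check: bool = False,
) -> Iterator[Record]:
    """Yield Records; if check is true, ask each one to verify itself first."""
    for rec_id, seq in _entries(lines):
        record = Record(rec_id, seq, alphabet)
        if check:
            record.verify()
        yield record
```

With check=True a protein file read with a nucleotide alphabet would
fail on the first offending record rather than silently passing through.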

Peter