[Biopython-dev] Reading sequences: FormatIO, SeqIO, etc

Mon Jul 31 16:08:50 UTC 2006

On Jul 31, 2006, at 11:14 AM, Peter (BioPython Dev) wrote:

>
>>> The SeqUtils/quick_FASTA_reader is interesting in that it loads the
>>> entire file into memory in one go, and then parses it.  On the other
>>> hand it's not perfect: I would use "\n>" as the split marker rather
>>> than ">", which could appear in the description of a sequence.
>>
>> I agree (not that it's bitten me, yet), but I'd be inclined to go with
>> "%s>" % os.linesep as the split marker, just in case.
>
> Good point.  I wonder how many people even know this function exists?
>

The only problem with this is when someone sends you a file that was
not created on your system. I remember huge problems 5 or so years ago
in BioPerl dealing with the Mac, Unix, and Windows line-ending issues.
This has mostly simplified down to two (Unix and Windows), unless the
person uses a Mac GUI app, some of which use \r (CR) instead of \n
(LF), where Windows uses \r\n (CRLF). I think the standard Python
distribution comes with crlf.py and lfcr.py, which can convert the line
endings.
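
One way around this is to normalise the line endings before splitting,
instead of relying on os.linesep. A rough sketch only: the function
name split_fasta_text is made up for illustration and is not an
existing Biopython function, and it assumes the whole file has already
been read into a single string.

```python
def split_fasta_text(text):
    """Split FASTA text into (title, sequence) tuples.

    Hypothetical helper, not part of Biopython.
    """
    # Normalise CRLF (Windows) and bare CR (old Mac) line endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    records = []
    # Prepend a newline so that the first record is also preceded by "\n>".
    # Splitting on "\n>" leaves any ">" inside a description untouched.
    for chunk in ("\n" + text).split("\n>")[1:]:
        lines = chunk.split("\n")
        title = lines[0]
        sequence = "".join(line.strip() for line in lines[1:])
        records.append((title, sequence))
    return records
```

With this, a file using any of the three line-ending conventions (or a
mixture of them) splits the same way, and a ">" in a title line does
not start a new record.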

> Maybe we should avoid loading entire files into memory while parsing -
> except for those formats like Clustal alignments where there is no
> real choice.
>
> Have you got a feeling for the difference in memory required for a
> large Fasta file in memory as:
> * Title string, sequence string
> * Title string, sequence as Seq object
> * SeqRecords (which include the sequence as a Seq object)
>
> While it's overkill for simple file formats like FASTA, I think we do
> need a fairly high level object like the SeqRecord when dealing with
> things like Genbank/EMBL to hold the basic annotation and identifiers
> (id/name/description).
>
> I am thinking that we should have a set of sequence parsers that all
> return SeqRecord objects (with format specific options in some cases
> to control the exact mapping of the data, e.g. title2ids for Fasta
> files).
>
> And a matching set of sequence writers that take SeqRecord object(s)
> and write them to a file.
>
> Such a mapping won't be perfect, so maybe there is still a place for
> "format specific representations" like the Record object in
> Bio.GenBank.Record
>
> In the short term maybe we should just replace the internals of the
> current Bio.Fasta module with a pure Python implementation like that
> in Bio.SeqIO.FASTA - good idea?  Bad idea?

I would keep them separate, but change the documentation on the how-to
site to point to Bio.SeqIO.FASTA, since that is where I think we want
people to start going. The code change to Bio.Fasta should be to add a
deprecation warning.
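
As an illustration of the kind of warning meant here, something along
these lines would do, using the standard warnings module. The helper
name warn_fasta_deprecated and the message wording are placeholders,
not existing Biopython code.

```python
import warnings


def warn_fasta_deprecated():
    """Emit a deprecation warning pointing users at Bio.SeqIO.FASTA.

    Hypothetical helper; the message text is only a suggestion.
    """
    warnings.warn(
        "Bio.Fasta is deprecated; please use Bio.SeqIO.FASTA instead.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
```

Calling this from the old module's entry points would leave the
existing behaviour alone while steering people to the new code.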

Marc