[Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
Peter (BioPython Dev)
biopython-dev at maubp.freeserve.co.uk
Mon Jul 31 15:14:04 UTC 2006
>
> The sequences are wrapped over multiple lines with a maximum length of
> 50, and the whole file is 33MB. It's not the largest FASTA sequence
> file I'm working with, that's 353MB (530801 sequences, it's most of a
> eukaryotic genome with sequences split into multiple lines), so I ran
> your test script on it, just to see what happened:
>
> 419.42s FormatIO/SeqRecord (for record in iterator)
> 389.05s FormatIO/SeqRecord (iterator.next)
> 35.46s SeqIO.FASTA.FastaReader (for record in iterator)
> 33.73s SeqIO.FASTA.FastaReader (iterator.next)
> 36.19s SeqIO.FASTA.FastaReader (iterator[i])
> 490.19s Fasta.RecordParser (for record in iterator)
> 555.43s Fasta.SequenceParser (for record in iterator)
> 546.87s Fasta.SequenceParser (iterator.next)
> 37.94s SeqUtils/quick_FASTA_reader
> 12.84s pyfastaseqlexer/next_record
> 6.06s pyfastaseqlexer/quick_FASTA_reader
> 24.08s SeqUtils/quick_FASTA_reader (conversion to Seq)
> 12.27s pyfastaseqlexer/next_record (conversion to Seq)
> 8.71s pyfastaseqlexer/quick_FASTA_reader (conversion to Seq)
> 24.20s SeqUtils/quick_FASTA_reader (conversion to SeqRecord)
> 18.10s pyfastaseqlexer/next_record (conversion to SeqRecord)
> 13.45s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord)
>
> This is only one run - my patience has limits <grin>. Again, scaling
> is a big problem for some methods.
Interesting - but no big surprises, except maybe just how slow Martel
is. Did you notice whether it ran out of memory and had to page to the
hard disk?
>>The SeqUtils/quick_FASTA_reader is interesting in that it loads the
>>entire file into memory in one go, and then parses it. On the other
>>hand it's not perfect: I would use "\n>" as the split marker rather
>>than ">", which could appear in the description of a sequence.
>
> I agree (not that it's bitten me, yet), but I'd be inclined to go with
> "%s>" % os.linesep as the split marker, just in case.
Good point. I wonder how many people even know this function exists?
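For illustration, the whole-file approach might look something like this
(an untested sketch, not the actual SeqUtils code - quick_fasta_split is
just a made-up name):

def quick_fasta_split(filename):
    # Read the whole file into memory in one go, then split on
    # newline + ">" so a ">" inside a description line is not
    # mistaken for the start of a new record.
    handle = open(filename)
    text = handle.read()
    handle.close()
    answer = []
    for block in text.split("\n>"):
        block = block.lstrip(">")  # drop the leading ">" of the first record
        if not block.strip():
            continue
        lines = block.split("\n")
        title = lines[0].strip()
        # Re-join the wrapped sequence lines into a single string
        answer.append((title, "".join([l.strip() for l in lines[1:]])))
    return answer

(Depending on the platform and the file's line endings, the
"%s>" % os.linesep marker suggested above may be the safer choice.)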
>>Do we need to worry about the size of the raw file in memory - allowing
>>the parsers to load it into memory could make things much faster...
>
> I use very few FASTA files where that would be a problem, so long as
> the sequences remain as strings - it's when they're converted to
> SeqRecords/SeqFeatures that I start to get nervous about memory use.
Maybe we should avoid loading entire files into memory while parsing -
except for those formats like Clustal alignments where there is no real
choice.
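The line-by-line alternative is simple enough - e.g. a generator along
these lines (an untested sketch, not the actual Bio.SeqIO.FASTA code;
iter_fasta is a made-up name):

def iter_fasta(handle):
    # Yield (title, sequence) tuples one record at a time, so memory
    # use is bounded by the largest single record, not the whole file.
    title = None
    parts = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(parts)
            title = line[1:]
            parts = []
        elif title is not None:
            parts.append(line)
    if title is not None:
        yield title, "".join(parts)

You would then loop over iter_fasta(open("example.fasta")) without ever
holding more than one record's sequence in memory.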
Have you got a feeling for the difference in memory required to hold a
large FASTA file as each of the following (a rough sketch follows the list):
* Title string, sequence string
* Title string, sequence as Seq object
* SeqRecords (which include the sequence as a Seq object)
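To make that concrete, the three forms could be built side by side from
the same data and the interpreter's memory footprint watched in a process
monitor (Seq and SeqRecord are the real Biopython classes; the rest is
just a suggestion for how to probe this, not a proper benchmark):

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

title = "example"
sequence = "ACGT" * 250000  # one million characters, ~1MB as a plain string

as_strings = (title, sequence)                   # title string, sequence string
as_seq = (title, Seq(sequence))                  # title string, Seq object
as_record = SeqRecord(Seq(sequence), id=title)   # full SeqRecord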
While it's overkill for simple file formats like FASTA, I think we do
need a fairly high-level object like the SeqRecord when dealing with
things like GenBank/EMBL, to hold the basic annotation and identifiers
(id/name/description).
I am thinking that we should have a set of sequence parsers that all
return SeqRecord objects (with format specific options in some cases to
control the exact mapping of the data, e.g. title2ids for Fasta files).
And a matching set of sequence writers that take SeqRecord object(s) and
write them to a file.
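Roughly what I have in mind for the FASTA case, building on the
iter_fasta generator sketched above (untested; parse_fasta and
default_title2ids are placeholder names, not a settled API):

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def default_title2ids(title):
    # Use the first word of the title as both id and name, and the
    # whole title line as the description.
    first_word = title.split(None, 1)[0]
    return first_word, first_word, title

def parse_fasta(handle, title2ids=default_title2ids):
    # Map each raw (title, sequence) pair onto a SeqRecord, with the
    # title2ids hook controlling how the title line is interpreted.
    for title, sequence in iter_fasta(handle):
        rec_id, name, description = title2ids(title)
        yield SeqRecord(Seq(sequence), id=rec_id, name=name,
                        description=description)

A GenBank or EMBL parser would not need the title2ids hook, but could
still present the same "iterate over SeqRecords" interface.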
Such a mapping won't be perfect, so maybe there is still a place for
format-specific representations like the Record object in
Bio.GenBank.Record.
In the short term maybe we should just replace the internals of the
current Bio.Fasta module with a pure Python implementation like that in
Bio.SeqIO.FASTA - good idea? Bad idea?
Peter