[Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
Peter (BioPython Dev)
biopython-dev at maubp.freeserve.co.uk
Mon Jul 31 15:14:04 UTC 2006
>
> The sequences are wrapped over multiple lines with a maximum length of
> 50, and the whole file is 33MB. It's not the largest FASTA sequence
> file I'm working with, that's 353MB (530801 sequences, it's most of a
> eukaryotic genome with sequences split into multiple lines), so I ran
> your test script on it, just to see what happened:
>
> 419.42s FormatIO/SeqRecord (for record in iterator)
> 389.05s FormatIO/SeqRecord (iterator.next)
> 35.46s SeqIO.FASTA.FastaReader (for record in iterator)
> 33.73s SeqIO.FASTA.FastaReader (iterator.next)
> 36.19s SeqIO.FASTA.FastaReader (iterator[i])
> 490.19s Fasta.RecordParser (for record in iterator)
> 555.43s Fasta.SequenceParser (for record in iterator)
> 546.87s Fasta.SequenceParser (iterator.next)
> 37.94s SeqUtils/quick_FASTA_reader
> 12.84s pyfastaseqlexer/next_record
> 6.06s pyfastaseqlexer/quick_FASTA_reader
> 24.08s SeqUtils/quick_FASTA_reader (conversion to Seq)
> 12.27s pyfastaseqlexer/next_record (conversion to Seq)
> 8.71s pyfastaseqlexer/quick_FASTA_reader (conversion to Seq)
> 24.20s SeqUtils/quick_FASTA_reader (conversion to SeqRecord)
> 18.10s pyfastaseqlexer/next_record (conversion to SeqRecord)
> 13.45s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord)
>
> This is only one run - my patience has limits <grin>. Again, scaling
> is a big problem for some methods.
Interesting - but no big surprises, except maybe just how slow Martel
is. Did you notice whether it ran out of memory and had to page to the
hard disk?
>>The SeqUtils/quick_FASTA_reader is interesting in that it loads the
>>entire file into memory in one go, and then parses it. On the other
>>hand it's not perfect: I would use "\n>" as the split marker rather
>>than ">", which could appear in the description of a sequence.
>
> I agree (not that it's bitten me, yet), but I'd be inclined to go with
> "%s>" % os.linesep as the split marker, just in case.
Good point. I wonder how many people even know this function exists?
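For illustration, the whole-file approach might look something like this
(an untested sketch, not the actual SeqUtils code - quick_fasta_split is
just a made-up name):

def quick_fasta_split(filename):
    # Read the whole file into memory in one go, then split on
    # newline + ">" so a ">" inside a description line is not
    # mistaken for the start of a new record.
    handle = open(filename)
    text = handle.read()
    handle.close()
    answer = []
    for block in text.split("\n>"):
        block = block.lstrip(">")  # drop the leading ">" of the first record
        if not block.strip():
            continue
        lines = block.split("\n")
        title = lines[0].strip()
        # Re-join the wrapped sequence lines into a single string
        answer.append((title, "".join([l.strip() for l in lines[1:]])))
    return answer

(Depending on the platform and the file's line endings, the
"%s>" % os.linesep marker suggested above may be the safer choice.)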
>>Do we need to worry about the size of the raw file in memory - allowing
>>the parsers to load it into memory could make things much faster...
>
> I use very few FASTA files where that would be a problem, so long as
> the sequences remain as strings - it's when they're converted to
> SeqRecords/SeqFeatures that I start to get nervous about memory use.
Maybe we should avoid loading entire files into memory while parsing -
except for those formats like Clustal alignments where there is no real
choice.
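The line-by-line alternative is simple enough - e.g. a generator along
these lines (an untested sketch, not the actual Bio.SeqIO.FASTA code;
iter_fasta is a made-up name):

def iter_fasta(handle):
    # Yield (title, sequence) tuples one record at a time, so memory
    # use is bounded by the largest single record, not the whole file.
    title = None
    parts = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(parts)
            title = line[1:]
            parts = []
        elif title is not None:
            parts.append(line)
    if title is not None:
        yield title, "".join(parts)

You would then loop over iter_fasta(open("example.fasta")) without ever
holding more than one record's sequence in memory.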
Have you got a feeling for the difference in memory required to hold a
large FASTA file as each of the following (a rough sketch follows the list):
* Title string, sequence string
* Title string, sequence as Seq object
* SeqRecords (which include the sequence as a Seq object)
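To make that concrete, the three forms could be built side by side from
the same data and the interpreter's memory footprint watched in a process
monitor (Seq and SeqRecord are the real Biopython classes; the rest is
just a suggestion for how to probe this, not a proper benchmark):

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

title = "example"
sequence = "ACGT" * 250000  # one million characters, ~1MB as a plain string

as_strings = (title, sequence)                   # title string, sequence string
as_seq = (title, Seq(sequence))                  # title string, Seq object
as_record = SeqRecord(Seq(sequence), id=title)   # full SeqRecord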
While it's overkill for simple file formats like FASTA, I think we do
need a fairly high-level object like the SeqRecord when dealing with
things like GenBank/EMBL, to hold the basic annotation and identifiers
(id/name/description).
I am thinking that we should have a set of sequence parsers that all
return SeqRecord objects (with format specific options in some cases to
control the exact mapping of the data, e.g. title2ids for Fasta files).
And a matching set of sequence writers that take SeqRecord object(s) and
write them to a file.
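Roughly what I have in mind for the FASTA case, building on the
iter_fasta generator sketched above (untested; parse_fasta and
default_title2ids are placeholder names, not a settled API):

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def default_title2ids(title):
    # Use the first word of the title as both id and name, and the
    # whole title line as the description.
    first_word = title.split(None, 1)[0]
    return first_word, first_word, title

def parse_fasta(handle, title2ids=default_title2ids):
    # Map each raw (title, sequence) pair onto a SeqRecord, with the
    # title2ids hook controlling how the title line is interpreted.
    for title, sequence in iter_fasta(handle):
        rec_id, name, description = title2ids(title)
        yield SeqRecord(Seq(sequence), id=rec_id, name=name,
                        description=description)

A GenBank or EMBL parser would not need the title2ids hook, but could
still present the same "iterate over SeqRecords" interface.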
Such a mapping won't be perfect, so maybe there is still a place for
format-specific representations like the Record object in
Bio.GenBank.Record.
In the short term maybe we should just replace the internals of the
current Bio.Fasta module with a pure Python implementation like that in
Bio.SeqIO.FASTA - good idea? Bad idea?
Peter