[Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
Marc Colosimo
mcolosimo at mitre.org
Mon Jul 31 16:08:50 UTC 2006
On Jul 31, 2006, at 11:14 AM, Peter (BioPython Dev) wrote:
>
>>> The SeqUtils/quick_FASTA_reader is interesting in that it loads the
>>> entire file into memory in one go, and then parses it. On the other
>>> hand it's not perfect: I would use "\n>" as the split marker rather
>>> than ">" which could appear in the description of a sequence.
>>
>> I agree (not that it's bitten me, yet), but I'd be inclined to go with
>> "%s>" % os.linesep as the split marker, just in case.
>
> Good point. I wonder how many people even know this function exists?
>
The only problem with this arises when someone sends you a file that was
not created on your system. I remember huge problems 5 or so years ago in
BioPerl dealing with the Mac, Unix, and Windows line-ending issues.
This has mostly simplified down to two - Unix and Windows - unless the
person uses a Mac GUI app, some of which use \r (CR) instead of \n
(LF), where Windows uses \r\n (CRLF). I think the standard Python
distribution comes with crlf.py and lfcr.py, which can convert the line
endings.
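
To make the line-ending point concrete, here is a minimal sketch of a
whole-file FASTA splitter that normalises CRLF and CR to LF before
splitting on "\n>". The function name split_fasta is made up for
illustration; it is not the existing quick_FASTA_reader code, just one
way the idea could look.

```python
def split_fasta(text):
    """Split FASTA text into a list of (title, sequence) tuples.

    Hypothetical helper for illustration only; not part of Biopython.
    """
    # Normalise Windows (CRLF) and old Mac (CR) line endings to Unix (LF),
    # so the split marker does not depend on where the file was created.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    records = []
    # Prepend a newline so the first record also begins with "\n>".
    # Splitting on "\n>" rather than ">" leaves any ">" inside a
    # description line untouched.
    for chunk in ("\n" + text).split("\n>")[1:]:
        lines = chunk.split("\n")
        title = lines[0]
        sequence = "".join(line.strip() for line in lines[1:])
        records.append((title, sequence))
    return records
```

Normalising first means the same marker works regardless of the value of
os.linesep on the machine doing the parsing.
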
> Maybe we should avoid loading entire files into memory while parsing -
> except for those formats like Clustal alignments where there is no
> real choice.
>
> Have you got a feeling for the difference in memory required for a
> large Fasta file in memory as:
> * Title string, sequence string
> * Title string, sequence as Seq object
> * SeqRecords (which include the sequence as a Seq object)
>
> While it's overkill for simple file formats like FASTA, I think we do
> need a fairly high level object like the SeqRecord when dealing with
> things like Genbank/EMBL to hold the basic annotation and identifiers
> (id/name/description).
>
> I am thinking that we should have a set of sequence parsers that all
> return SeqRecord objects (with format specific options in some cases
> to control the exact mapping of the data, e.g. title2ids for Fasta
> files).
>
> And a matching set of sequence writers that take SeqRecord object(s)
> and write them to a file.
>
> Such a mapping won't be perfect, so maybe there is still a place for
> "format specific representations" like the Record object in
> Bio.GenBank.Record
>
> In the short term maybe we should just replace the internals of the
> current Bio.Fasta module with a pure python implementation like that
> in Bio.SeqIO.FASTA - good idea? Bad idea?
I would keep them separate but change the documentation on the how-to
site to point to Bio.SeqIO.FASTA, since that is where I
think we want people to start going. The code change to Bio.Fasta
should be to add a deprecation warning.
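
As a rough sketch of what I mean, using only the standard warnings
module: the helper name warn_fasta_deprecated and the message wording
are assumptions for illustration, and where exactly the call would live
in Bio.Fasta is left open.

```python
import warnings


def warn_fasta_deprecated():
    """Emit a deprecation warning pointing users at Bio.SeqIO.FASTA.

    Hypothetical helper; the name and message text are illustrative.
    """
    warnings.warn(
        "Bio.Fasta is deprecated; please use Bio.SeqIO.FASTA instead.",
        DeprecationWarning,
        # stacklevel=2 attributes the warning to the caller's code
        # rather than to this helper.
        stacklevel=2,
    )
```

The old code keeps working, and users only see a pointer to the new
module.
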
Marc