[Biopython-dev] parsing summary
Andrew Dalke
adalke at mindspring.com
Fri Dec 21 06:04:00 EST 2001
To summarize:
I'm working on a way to minimize the amount of work needed to handle
the standard case of
    for record in data_file:
        do_something(record)
        write record to output_file
I think I have an API which is easy to use
    from Bio import SeqRecord
    writer = SeqRecord.io.make_writer("genbank")
    for record in SeqRecord.io.readFile(open("unknown.dat")):
        do_something(record)
        writer.write(record)
and can handle different intermediate data types
    from Bio import SimpleSeq
    writer = SimpleSeq.io.make_writer("fasta")
    for record in SimpleSeq.io.readFile(open("unknown.dat")):
        do_something(record)
        writer.write(record)
And it's all built on powerful lower-level forms which are still
relatively easy to use.
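As a rough illustration of the read/process/write loop above, here is a minimal, self-contained sketch in modern Python. It is not the proposed Bio API: the names FastaWriter, _WRITERS and read_fasta are hypothetical, records are plain (title, sequence) tuples, and make_writer here takes an explicit output file, which the examples above do not show.

```python
import io


class FastaWriter:
    """Hypothetical writer: one '>' title line, then the sequence on one line."""

    def __init__(self, outfile):
        self.outfile = outfile

    def write(self, record):
        title, seq = record
        self.outfile.write(">%s\n%s\n" % (title, seq))


# Hypothetical table mapping format names to writer classes.
_WRITERS = {"fasta": FastaWriter}


def make_writer(format_name, outfile):
    """Look up a writer class by format name and bind it to an output file."""
    return _WRITERS[format_name](outfile)


def read_fasta(infile):
    """Yield (title, sequence) tuples from FASTA-formatted text."""
    title, chunks = None, []
    for line in infile:
        line = line.strip()
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


# The standard case: read each record, (optionally) process it, write it out.
infile = io.StringIO(">seq1\nACGT\nTTGA\n>seq2\nGGCC\n")
outfile = io.StringIO()
writer = make_writer("fasta", outfile)
for record in read_fasta(infile):
    writer.write(record)
result = outfile.getvalue()
```

The point of the sketch is only that the caller names a format and never touches the parser or formatter classes directly.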
The biggest problem I have is in the registration of all the different
format and conversion types. Ideally, adding a new format shouldn't
affect performance until its presence is needed. That speaks for some
sort of file-based discovery mechanism. The simplest solution is to
load all files at once, but I expect that to yield poor performance.
So there needs to be some sort of deferred loading mechanism. Or at
least such a mechanism should not be precluded.
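One way such deferred loading could look, sketched in modern Python under stated assumptions: LazyRegistry is a hypothetical class, formats are registered by module and attribute name rather than discovered from files, and the standard-library json and csv modules stand in for real format modules. Registering a name costs nothing; the module is imported only on first lookup.

```python
import importlib


class LazyRegistry:
    """Hypothetical registry: remembers where a handler lives, imports it on demand."""

    def __init__(self):
        self._locations = {}  # format name -> (module name, attribute name)
        self._loaded = {}     # format name -> resolved handler object

    def register(self, format_name, module_name, attr_name):
        # Only strings are stored here, so registration triggers no import.
        self._locations[format_name] = (module_name, attr_name)

    def is_loaded(self, format_name):
        return format_name in self._loaded

    def get(self, format_name):
        # Import the module the first time this format is requested.
        if format_name not in self._loaded:
            module_name, attr_name = self._locations[format_name]
            module = importlib.import_module(module_name)
            self._loaded[format_name] = getattr(module, attr_name)
        return self._loaded[format_name]


registry = LazyRegistry()
# Stand-ins for format modules; nothing is resolved by the registry yet.
registry.register("json", "json", "dumps")
registry.register("csv", "csv", "writer")

dumps = registry.get("json")  # resolved here, on first use
```

A file-based discovery step could fill in the same name-to-location table by scanning a directory, without changing how lookups work.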
What I want to do requires coming up with standardized names and data
types. These include file formats, field types, and data structures.
Thank you for letting me write all this. It's helped clear
up what my bottlenecks are in this work. Hopefully you all have
some ideas - or you can say I'm trying to be too clever for my
own good!
Andrew
dalke at dalkescientific.com