[Biopython-dev] draft blog post for 1.52 stuff
David Winter
winda002 at student.otago.ac.nz
Mon Sep 21 01:30:44 EDT 2009
As I mentioned in the draft release announcement, it might be useful to
have a blog post up explaining how the new functions for SeqIO and AlignIO
work (thanks to Peter for this idea).
I've written a draft post that looks at the convert function. It could do
with a little more detail, and it ignores the indexed_dict() function
entirely because I just don't have a good enough idea of how it works.
Again, any comments are welcome. Is it a good idea to have a post like
this or should we just extend the release announcement to include a
little bit more detail?
++
It's only been a month since we released Biopython 1.51, but in that time the
CVS server has stacked up enough cool new features that we are going to put
together a new release soon. As ever, the new functions will be documented in
the official tutorial and cookbook, but we thought we'd show off a few of
these tools here.
Simple, optimized format conversion with SeqIO and AlignIO
No one has ever complained that bioinformatics just doesn't have enough file
formats, so you probably find yourself frequently using SeqIO to convert
sequence files to suit particular applications. At the moment this is usually
a two-step process, something like this:
>>> from Bio import SeqIO
>>> records = SeqIO.parse(in_handle, "genbank")
>>> SeqIO.write(records, out_handle, "fasta")
As of Biopython 1.52 you'll be able to achieve the same result in a
single step:
>>> SeqIO.convert(in_handle, "genbank", out_handle, "fasta")
Adding the convert function to SeqIO will make your scripts more readable and
might even save you a couple of lines of code, but more importantly it allows
the conversion process to be optimized for the two formats being used. In the
above example we are moving from a GenBank file, which might include multiple
features for each sequence, to a FASTA file, which doesn't include features.
If we used the two-step process above we'd be spending time reading each
sequence's features into memory just to skip them when they get passed
to the write function. SeqIO.convert() knows that the sequences in the input
file are destined to be written to a FASTA file, so it can skip over the
features and save a bit of time in doing the conversion.
Obviously, the optimization in SeqIO.convert() is most powerful when it's
used on very large files like those produced in next-generation sequencing
projects.
When converting between the FASTQ file format's variants with the "SeqIO
two step", a significant amount of time is taken creating SeqRecord objects
for each record in the input file, but none of the attributes or methods of
the SeqRecord object are required to do the conversion. For this reason
SeqIO.convert() deals with each record as a few simple strings: one for the
record's ID, one for its sequence and one for its quality scores.
[some information on just how much time that saves on a big file should
probably go here!]
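
To make the idea concrete, here is a rough sketch in plain Python of what
"treating each record as simple strings" can look like when converting
Illumina 1.3+ FASTQ (ASCII offset 64) to Sanger FASTQ (offset 33). This is
not Biopython's actual code: the names iter_fastq_strings() and
illumina_to_sanger() are made up for this illustration, and the sketch
assumes the simple four-line FASTQ layout with no line-wrapped records.

```python
from io import StringIO

# Map each Illumina 1.3+ quality character (PHRED + 64) to the matching
# Sanger character (PHRED + 33). Built once and reused for every record.
ILLUMINA_TO_SANGER = {64 + q: chr(33 + q) for q in range(63)}


def iter_fastq_strings(handle):
    """Yield (title, sequence, quality) as plain strings (hypothetical helper).

    Assumes four lines per record; no SeqRecord objects are created.
    """
    while True:
        title_line = handle.readline()
        if not title_line:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        plus_line = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not title_line.startswith("@") or not plus_line.startswith("+"):
            raise ValueError("expected a four-line FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield title_line[1:].rstrip("\n"), seq, qual


def illumina_to_sanger(in_handle, out_handle):
    """Rewrite quality strings only; return the number of records written."""
    count = 0
    for title, seq, qual in iter_fastq_strings(in_handle):
        new_qual = qual.translate(ILLUMINA_TO_SANGER)
        out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, new_qual))
        count += 1
    return count


if __name__ == "__main__":
    example = StringIO("@read1\nACGT\n+\nhhhh\n")
    output = StringIO()
    illumina_to_sanger(example, output)
    print(output.getvalue(), end="")
```

The only per-record work here is a string translation of the quality line,
which is why skipping object creation can pay off on files with millions of
reads.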
+++