[Biopython-dev] draft blog post for 1.52 stuff

Mon Sep 21 01:30:44 EDT 2009

As I mentioned in the draft release announcement, it might be useful to
have a blog post up explaining how the new functions for SeqIO and AlignIO
work (thanks to Peter for this idea).

I've written a draft for a post that looks at the convert function. It
could do with a little more detail, and it ignores the indexed_dict()
function entirely because I just don't have a good enough idea of how it
works.

Again, any comments are welcome. Is it a good idea to have a post like 
this or should we just extend the release announcement to include a 
little bit more detail?

++
It's only been a month since we released Biopython 1.51, but in that time the
CVS server has stacked up enough cool new features that we are going to put
together a new release soon. As ever, the new functions will be documented in
the official tutorial and cookbook, but we thought we'd show off a few of
these tools here.

Simple, optimized format conversion with SeqIO and AlignIO

No one has ever complained that bioinformatics just doesn't have enough file
formats - you probably find yourself frequently converting sequence files to
suit particular applications with SeqIO. At the moment this is usually a
two-step process, something like this:

 >>> from Bio import SeqIO
 >>> records = SeqIO.parse(in_handle, "genbank")
 >>> SeqIO.write(records, out_handle, "fasta")

As of Biopython 1.52 you'll be able to achieve the same result in a single
step:

 >>> SeqIO.convert(in_handle, "genbank", out_handle, "fasta")

Adding the convert function to SeqIO will make your scripts more readable and
might even save you a couple of lines of code, but more importantly it allows
the conversion process to be optimized for the two formats being used. In the
above example we are moving from a GenBank file, which might include multiple
features for each sequence, to a FASTA file, which doesn't include features.
If we used the two-step process above we'd be spending time reading each
sequence's features into memory just to skip them when they get passed to the
write function. SeqIO.convert() knows that the sequences in the input file
are destined to be written to a FASTA file, so it can skip over the features
and save a bit of time in doing the conversion.
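
To see that the two approaches are interchangeable from the caller's point of
view, here is a minimal, self-contained sketch. It uses FASTQ rather than
GenBank input purely to keep the example short, and the two records are made
up for illustration. Whether a particular pair of formats gets a faster code
path is decided inside the library; the sketch only checks that the output
is the same either way.

```python
from io import StringIO

from Bio import SeqIO

# Two tiny made-up FASTQ records, held in memory so nothing is read from disk.
FASTQ_TEXT = "@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n"

# Two-step approach: parse into SeqRecord objects, then write them out.
two_step_out = StringIO()
records = SeqIO.parse(StringIO(FASTQ_TEXT), "fastq")
SeqIO.write(records, two_step_out, "fasta")

# One-step approach: SeqIO.convert() returns the number of records converted.
one_step_out = StringIO()
count = SeqIO.convert(StringIO(FASTQ_TEXT), "fastq", one_step_out, "fasta")

print(count)
print(one_step_out.getvalue() == two_step_out.getvalue())
```

Because the results match, swapping the two-step code for a call to
SeqIO.convert() should not need any other changes to a script.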

Obviously, the optimization in SeqIO.convert() is most powerful when it's used
on very large files like those produced in next-generation sequencing projects.
When converting between the variants of the FASTQ file format with the "SeqIO
two-step", a significant amount of time is taken creating SeqRecord objects for
each record in the input file, but none of the attributes or methods of the
SeqRecord object are required to do the conversion. For this reason
SeqIO.convert() deals with each record as a few simple strings: one for the
record's ID, one for its sequence and one for its quality scores.
[some information on just how much time that saves on a big file should 
probably go here!]
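
The string-only idea can be sketched in plain Python. This is an illustration
of the technique, not Biopython's actual code: the function name
illumina_to_sanger is hypothetical, the input is assumed to use strict
four-line records (no wrapped sequence or quality lines), and the quality
string is handled alongside the ID and sequence because a FASTQ record needs
one. It relies on the Illumina 1.3+ variant storing PHRED scores with an ASCII
offset of 64 and the Sanger variant using an offset of 33, so each quality
character simply shifts down by 31.

```python
from io import StringIO

# Map every Illumina 1.3+ quality character (ASCII 64-126) to its Sanger
# equivalent by shifting the character code down by 64 - 33 = 31.
OFFSET_SHIFT = {code: code - 31 for code in range(64, 127)}


def illumina_to_sanger(handle):
    """Yield Sanger-style FASTQ records (as strings) from an Illumina handle.

    Hypothetical helper for illustration; assumes four lines per record.
    """
    while True:
        title = handle.readline().rstrip("\n")
        if not title:
            break  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the "+" separator line carries nothing we need
        qual = handle.readline().rstrip("\n")
        # No SeqRecord is built: the record stays as three plain strings.
        yield f"{title}\n{seq}\n+\n{qual.translate(OFFSET_SHIFT)}\n"


# Made-up record: "h" is PHRED 40 at offset 64, which is "I" at offset 33.
converted = "".join(illumina_to_sanger(StringIO("@r1\nACGT\n+\nhhhh\n")))
print(converted)
```

Nothing here needs the attributes or methods of a SeqRecord, which is why
skipping object creation is a reasonable shortcut for this kind of conversion.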
+++