[Biopython] Cookbook suggestion

Peter Cock p.j.a.cock at googlemail.com
Tue Apr 16 09:02:58 UTC 2013


On Mon, Apr 15, 2013 at 8:40 PM, Justin Gibbons <jgibbons1 at mail.usf.edu> wrote:
> It looks like there is already an example of this in the tutorial under
> 18.1.5, but I was planning on making it a self-contained cookbook example
> so that it is easier to find.
>
> If this is the fastest way to do it though:
>
> with open(new_file_path, "w") as handle:
>     for seq_id in seq_ids:
>         handle.write(indexed_fasta.get_raw(seq_id))
>
> Is there any advantage to using SeqIO.write() other than it being shorter?

There are two linked choices here:

(a) Fully parse the file into SeqRecord objects using SeqIO.parse, or
use SeqIO.index or SeqIO.index_db to just extract the record identifiers.
Unless you need some of the annotation or the sequence, parsing it
into a SeqRecord is a waste of CPU time.
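As a rough sketch of choice (a), the snippet below collects the record
identifiers both ways. The two-record FASTA content and the temporary file
are made up purely for illustration.

```python
import os
import tempfile

from Bio import SeqIO

# Hypothetical two-record FASTA file, written to a temporary location.
fasta_text = ">alpha first record\nACGTACGT\n>beta second record\nGGCCGGCC\n"
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(fasta_text)
    fasta_path = tmp.name

# Full parsing: every record is turned into a SeqRecord, sequence included.
parsed_ids = [record.id for record in SeqIO.parse(fasta_path, "fasta")]

# Indexing: the file is scanned for identifiers and offsets only;
# a record is parsed into a SeqRecord only when it is looked up by key.
indexed_fasta = SeqIO.index(fasta_path, "fasta")
indexed_ids = list(indexed_fasta)
indexed_fasta.close()

os.remove(fasta_path)
```

Both lists hold the same identifiers; the difference is how much work was
done to get them.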

(b) Convert the SeqRecord back into a file on disk, or reuse the
original representation from the input file. For a format like FASTA,
this is almost a moot point - the only change is the white space
(using SeqIO.write will produce consistent line wrapping). For
some of the richer formats like GenBank the parse/write round
trip is not expected to produce an identical output, so it can be
prudent to reuse the original. For some formats we don't
have writing support, so you have to reuse the original.
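To illustrate the FASTA case in (b), here is a small sketch comparing the
raw record with a parse/write round trip. The record name and the 35-column
wrapping of the input are invented for the example; get_raw returns bytes in
current Biopython releases.

```python
import io
import os
import tempfile

from Bio import SeqIO

# Hypothetical record: a 70-base sequence wrapped at 35 columns.
original = b">seq1 example\n" + b"ACGTA" * 7 + b"\n" + b"ACGTA" * 7 + b"\n"
with tempfile.NamedTemporaryFile("wb", suffix=".fasta", delete=False) as tmp:
    tmp.write(original)
    path = tmp.name

index = SeqIO.index(path, "fasta")
raw = index.get_raw("seq1")  # bytes, exactly as stored in the input file
record = index["seq1"]       # the same record parsed into a SeqRecord
index.close()
os.remove(path)

# Round trip through SeqIO.write, which applies its own line wrapping.
buffer = io.StringIO()
SeqIO.write(record, buffer, "fasta")
rewritten = buffer.getvalue()

# The sequence survives the round trip even though the layout changes.
rewritten_sequence = "".join(rewritten.splitlines()[1:])
```

Here raw matches the input byte for byte, while rewritten carries the same
sequence with different line breaks.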

My point is that whether to use SeqIO.write() or indexing and
get_raw() depends on the file format and what you are trying to do.
My recommendation would be to use get_raw to write simple file
formats without headers/footers if:

(*) You need to preserve original records exactly
(*) You need this to be as fast as possible
(*) SeqIO.write doesn't support the file format

Otherwise using SeqIO.write should be fine - it is also simpler
in terms of the code to call it.
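The two approaches could be sketched side by side as below. The helper names
copy_raw and copy_parsed, the file names, and the sample records are all
hypothetical. Note the output file is opened in binary mode, on the
assumption of a current Biopython where get_raw returns bytes.

```python
import os
import tempfile

from Bio import SeqIO


def copy_raw(index, seq_ids, out_path):
    """Copy the selected records byte for byte, without parsing them."""
    with open(out_path, "wb") as handle:  # get_raw returns bytes
        for seq_id in seq_ids:
            handle.write(index.get_raw(seq_id))


def copy_parsed(index, seq_ids, out_path, fmt="fasta"):
    """Parse the selected records and let SeqIO.write re-format them."""
    return SeqIO.write((index[seq_id] for seq_id in seq_ids), out_path, fmt)


# Hypothetical input file with three short records.
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "input.fasta")
with open(in_path, "wb") as handle:
    handle.write(b">a\nAC\nGT\n>b\nGGCC\n>c\nTT\nAA\n")

indexed_fasta = SeqIO.index(in_path, "fasta")
raw_path = os.path.join(workdir, "raw.fasta")
parsed_path = os.path.join(workdir, "parsed.fasta")
copy_raw(indexed_fasta, ["c", "a"], raw_path)
written = copy_parsed(indexed_fasta, ["c", "a"], parsed_path)
indexed_fasta.close()

with open(raw_path, "rb") as handle:
    raw_output = handle.read()
with open(parsed_path) as handle:
    parsed_output = handle.read()
```

The raw copy keeps the original line breaks of each record, while the
SeqIO.write copy joins the short lines according to its own wrapping.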

Of course, if you are editing the records in any way, then you
must use SeqIO.write anyway.
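For instance, a minimal sketch of editing records before writing them out.
The with_prefix helper, the "sample1_" prefix, and the in-memory FASTA text
are invented for illustration.

```python
import io

from Bio import SeqIO

# Hypothetical input, held in memory to keep the example short.
fasta_text = ">alpha\nACGT\n>beta\nGGCC\n"


def with_prefix(records, prefix):
    """Yield each record with the prefix added to its identifier."""
    for record in records:
        record.id = prefix + record.id
        record.description = ""  # drop the old title so it is not repeated
        yield record


records = SeqIO.parse(io.StringIO(fasta_text), "fasta")
output = io.StringIO()
count = SeqIO.write(with_prefix(records, "sample1_"), output, "fasta")
edited_text = output.getvalue()
```

Because the identifiers have changed, the raw text from the input no longer
applies, so the records have to go through SeqIO.write.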

> I do not have a GitHub account so I cannot comment on whether
> it would be easier to use Github.

Thanks. My thinking is that right now you would need to register
separately for (1) the mailing lists, (2) editing the wiki, (3) reporting
bugs on RedMine, and (4) submitting pull requests on GitHub. If we used
GitHub for the wiki and/or issue tracker, this would mean fewer user
accounts, so a little easier for contributors, but also less SysAdmin
work behind the scenes.

Peter


