[BioPython] Downloading CDS sequences

Peter biopython at maubp.freeserve.co.uk
Fri Jan 16 12:35:25 UTC 2009


On Fri, Jan 16, 2009 at 4:46 AM, Animesh Agrawal
<animesh.agrawal at anu.edu.au> wrote:
>
> Peter,
> Wow! The code(for positional frequency of codons) works 4 me. Thanks a ton.

Good.

> While we are at it please allow me to ask you another question related to
> downloading CDS sequences.

Sure - but I would have changed the email subject line if I were you.

> I have copied one script from mailing list for
> downloading CDS given from Genbank record of protein sequence written by
> Andrew Dalke. I modified it a little bit to include few more exceptions and
> it work in most of the cases but it's still not bug free.

Do you have a link to the original in the mail archive?
http://lists.open-bio.org/pipermail/biopython/

One minor point is I would have used Bio.SeqIO rather than
Bio.GenBank.FeatureParser and Bio.GenBank.Iterator (the same parsing
code gets used internally - I just think the code is simpler).

From a style point of view, breaking this up into some subfunctions
would make it a lot clearer what is going on.

I see you are looking at the "coded_by" qualifier, which will be a
location string like "join(NC_008114.1:51934..52632,
NC_008114.1:54315..55043)" including other sequence identifiers.  For
this example you download "NC_008114.1" and extract the two
subsequences and join them up.
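
As a rough illustration of that step, here is a minimal sketch using only
the standard library. The function names parse_coded_by and extract_cds
are made up for this example and are not part of Biopython. It assumes
the string contains only plain "ACCESSION:start..end" segments (no
complement(...) and no fuzzy positions), and that you have already
downloaded the referenced sequences into a dict keyed by accession:

```python
import re

# One "ACCESSION:start..end" segment, e.g. "NC_008114.1:51934..52632".
# Assumption: plain integer positions only (no "<", ">" or complement).
_SEGMENT = re.compile(r"([A-Za-z0-9_.]+):(\d+)\.\.(\d+)")


def parse_coded_by(coded_by):
    """Return a list of (accession, start, end), one-based and inclusive."""
    return [(acc, int(start), int(end))
            for acc, start, end in _SEGMENT.findall(coded_by)]


def extract_cds(coded_by, sequences):
    """Join the referenced subsequences in the order they are listed.

    `sequences` is a hypothetical dict mapping accession -> sequence string.
    """
    parts = []
    for acc, start, end in parse_coded_by(coded_by):
        # GenBank positions are one-based and inclusive; Python slices are
        # zero-based and end-exclusive, so only the start needs shifting.
        parts.append(sequences[acc][start - 1:end])
    return "".join(parts)


example = "join(NC_008114.1:51934..52632, NC_008114.1:54315..55043)"
segments = parse_coded_by(example)
```

For anything beyond simple cases like this one, a proper location parser
is the safer choice.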

The Bio.GenBank.LocationParser should be able to cope with parsing
these strings - but it's a complicated thing to do.  As you have seen,
there can be joins etc to deal with - but there are also fuzzy
locations, which are more tricky.  Your specific error is simple enough:

> Traceback (most recent call last):
>  File "C:\Documents and
> Settings\Animesh\Desktop\sequences\Features_extraction_final.py", line 76,
> in <module>
>    loc5=int(loc3[0])-1
> ValueError: invalid literal for int() with base 10: '<1'

You've got a location like "<1..456", meaning it starts before base one
and continues to base 456 (one-based counting).  In this particular
case, you'll just have to take the sequence from the start (base 1).
The problem is that your code does int("<1"), which fails.
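
One possible workaround, as a sketch only: strip the "<" or ">" marker
before calling int(), and remember that the position was fuzzy in case
you care later. The helper names parse_position and parse_range are
invented for this example, and it assumes a simple "start..end" location
with no join or complement:

```python
def parse_position(text):
    """Convert a position such as '1', '<1' or '>456' to (int, is_fuzzy)."""
    text = text.strip()
    is_fuzzy = text.startswith(("<", ">"))
    # Drop the fuzzy marker so int() sees only the digits.
    return int(text.lstrip("<>")), is_fuzzy


def parse_range(location):
    """Turn 'start..end' into zero-based, end-exclusive slice bounds."""
    start_text, end_text = location.split("..")
    start, _ = parse_position(start_text)
    end, _ = parse_position(end_text)
    # One-based inclusive -> Python slice: shift the start only.
    return start - 1, end


start, end = parse_range("<1..456")
```

With these bounds, seq[start:end] takes the sequence from base 1 through
base 456, treating "<1" as plain 1.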

Peter


