[BioPython] Downloading CDS sequences

Peter biopython at maubp.freeserve.co.uk
Sat Jan 17 14:30:17 EST 2009


Peter wrote:
> You've got a location like "<1..456" meaning it starts before base one
> and continues to base 456 (one based counting).  In this particular
> case, you'll just have to take the sequence from the start (base 1).
> The problem is your code does int("<1") which fails.

>From my testing, in this case and similar examples like
"AF376133.1:<1..>553" it is safe to treat this as just from position 1
to 553.  The less than and greater than signs are to indicate that the
full protein CDS may well extend beyound this region, but it was not
sequenced.

Animesh wrote:
>
> http://lists.open-bio.org/pipermail/biopython/2003-April/001255.html
> This is the link to original script in the mailing list.
> Animesh
>

Thanks!  I see Andrew's original code just dealt with the "easy"
cases, where the coded_by string was a non-fuzzy location, and without
a join.

Andrew's code (and yours) uses Bio.EUtils to access the NCBI's "Entrez
Utitlities" online API.  I should point out this module has been
deprecated since Release 1.48 (its still there for now but will give a
warning message when used), and we recommend you use Bio.Entrez
instead.

I hope you don't mind me giving you a few comments about your code?

You seem to be struggling with handles.   Andrew defined this function:

def lookup(name, seq_start, seq_stop):
    h = DBIdsClient.from_dbids(DBIds(db = "nucleotide", ids = [name]))
    return h.efetch(retmode = "text", rettype = "fasta",
                    seq_start = seq_start, seq_stop = seq_stop).read()

The efetch call returns a handle, but you use its read method to get
all the data as a string.  This means your lookup function returns a
string containing the record in FASTA format.  However, for your code,
it would have made more sense to just stick with the handle - as you
had to convert back from a string of data to a handle using StringIO:

Temp4=lookup(parsed_loc_complement.path.accession, seq_start, seq_stop)
fasta_handle = StringIO.StringIO(Temp4)
record = SeqIO.read(fasta_handle, "fasta")

Using Bio.Entrez.efetch (the equivalent to the old EUtils efetch
method you were using) which returns a handle this would be just:

from Bio import Entrez
fasta_handle = Entrez.efetch("nucleotides", id=name,
                    retmode="text", rettype="fasta",
                    seq_start=seq_start, seq_stop=seq_stop)
record = SeqIO.read(fasta_handle, "fasta")

Peter


More information about the BioPython mailing list