[BioPython] Downloading CDS sequences
Peter
biopython at maubp.freeserve.co.uk
Sat Jan 17 19:30:17 UTC 2009
Peter wrote:
> You've got a location like "<1..456" meaning it starts before base one
> and continues to base 456 (one based counting). In this particular
> case, you'll just have to take the sequence from the start (base 1).
> The problem is your code does int("<1") which fails.
>From my testing, in this case and similar examples like
"AF376133.1:<1..>553" it is safe to treat this as just from position 1
to 553. The less than and greater than signs are to indicate that the
full protein CDS may well extend beyound this region, but it was not
sequenced.
Animesh wrote:
>
> http://lists.open-bio.org/pipermail/biopython/2003-April/001255.html
> This is the link to original script in the mailing list.
> Animesh
>
Thanks! I see Andrew's original code just dealt with the "easy"
cases, where the coded_by string was a non-fuzzy location, and without
a join.
Andrew's code (and yours) uses Bio.EUtils to access the NCBI's "Entrez
Utitlities" online API. I should point out this module has been
deprecated since Release 1.48 (its still there for now but will give a
warning message when used), and we recommend you use Bio.Entrez
instead.
I hope you don't mind me giving you a few comments about your code?
You seem to be struggling with handles. Andrew defined this function:
def lookup(name, seq_start, seq_stop):
h = DBIdsClient.from_dbids(DBIds(db = "nucleotide", ids = [name]))
return h.efetch(retmode = "text", rettype = "fasta",
seq_start = seq_start, seq_stop = seq_stop).read()
The efetch call returns a handle, but you use its read method to get
all the data as a string. This means your lookup function returns a
string containing the record in FASTA format. However, for your code,
it would have made more sense to just stick with the handle - as you
had to convert back from a string of data to a handle using StringIO:
Temp4=lookup(parsed_loc_complement.path.accession, seq_start, seq_stop)
fasta_handle = StringIO.StringIO(Temp4)
record = SeqIO.read(fasta_handle, "fasta")
Using Bio.Entrez.efetch (the equivalent to the old EUtils efetch
method you were using) which returns a handle this would be just:
from Bio import Entrez
fasta_handle = Entrez.efetch("nucleotides", id=name,
retmode="text", rettype="fasta",
seq_start=seq_start, seq_stop=seq_stop)
record = SeqIO.read(fasta_handle, "fasta")
Peter
More information about the Biopython
mailing list