[Biopython] Summer of Code 2014 - Call for project ideas Re: going from protein to gene to oligos for cloning

Mon Feb 10 14:23:58 UTC 2014

Hi all -

Just another suggestion for a Summer of Code project:
going from protein sequences to gene coding regions.

With the reduction of costs associated with DNA synthesis and the advent of
"buying genes", along with more robust robotics, we are now at a time where
many are making large lists of proteins to express for biochemistry,
biophysics and structural biology. However, parsing the data available to
make choices to refine those lists and then obtaining just the coding
regions for the proteins of interest is a little daunting.

As discussed previously, finding a protein at NCBI does not lend itself to
getting the gene (coding region) for cloning in a readily automated
fashion. I still haven't tested the code suggested by Peter below, but this
could be a cleanup project if it is broken, or a similar project could be
started from scratch. If this is something you are interested in, I will
test the code sooner, if that's a starting point someone would like to
pursue. We may need to speak to the author first, though; I'm not sure.

Thanks,
Dave

> Hi Dave,
>
> The catch here is the protein IDs are not directly usable in the
> nucleotide database - which is where ELink (Entrez Link) comes
> in, available as the Entrez.elink(...) function in Biopython.
>
> I've not tried it myself, but a colleague posted a long example
> on his blog which sounds close to what you are aiming for:
>
>
> http://armchairbiology.blogspot.co.uk/2013/02/surely-this-has-been-done-already.html
>
> https://github.com/widdowquinn/scripts/blob/master/bioinformatics/get_NCBI_cds_from_protein.py
>
> Peter
>
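
For anyone picking this up, here is a minimal sketch of the ELink step Peter
describes, not a tested pipeline. It assumes Biopython is installed, that
"nuccore" is the right target database, and that NCBI still accepts the
protein identifier you pass in. The function names and the email argument
are made up for illustration. The parsing helper is kept separate so it can
be checked without network access.

```python
def linked_nucleotide_ids(elink_records):
    """Collect linked IDs from a parsed ELink result.

    `elink_records` is the list-of-dicts structure that Entrez.read()
    returns for an elink query.
    """
    ids = []
    for linkset in elink_records:
        for linkset_db in linkset.get("LinkSetDb", []):
            for link in linkset_db.get("Link", []):
                ids.append(str(link["Id"]))
    return ids


def protein_to_nucleotide_ids(protein_id, email):
    """Ask NCBI which nucleotide records are linked to a protein record.

    Needs network access. `email` is passed on to NCBI as Entrez.email.
    """
    # Imported here so the parsing helper above works without Biopython.
    from Bio import Entrez

    Entrez.email = email
    handle = Entrez.elink(dbfrom="protein", db="nuccore", id=protein_id)
    try:
        records = Entrez.read(handle)
    finally:
        handle.close()
    return linked_nucleotide_ids(records)
```

Calling protein_to_nucleotide_ids("145323746", "you@example.org") would be
the starting point; a protein can link to more than one nucleotide record,
so the result is a list rather than a single ID.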

On Fri, Dec 6, 2013 at 2:24 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Fri, Dec 6, 2013 at 7:27 AM, David Shin <davidsshin at lbl.gov> wrote:
> > Hi again,
> >
> > I'm trying to use biopython to help me grab a lot of protein sequences
> > that will eventually be used as the basis for cloning. I'm almost done
> > screening my protein sequences, and pretty much ok on that part...
> >
> > I was just curious if anyone has already developed, or has any decent
> > advice on going from protein codes to getting the actual coding sequences
> > of the genes.
> >
> > At this point, my plan is to take protein codes (i.e. the numbers in
> > gi|145323746|) and use these to search Entrez nucleotide databases
> > directly to get hits (I have tested it once and it seems to work to get
> > GenBank records), then try to use the information inside to get the
> > nucleotide sequences... or I guess the other way is to use the top hit
> > from tblastn somehow?
> >
> > Thanks,
> >
> > Dave
>
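
The plan in the original message (pull the number out of gi|145323746|, then
use the information inside a GenBank record to get the coding sequence)
could be sketched roughly as below. This is only an illustration under
stated assumptions: the function names are hypothetical, the record is
assumed to behave like a Biopython SeqRecord parsed from GenBank format, and
the CDS is assumed to carry a protein_id qualifier matching the protein of
interest.

```python
import re

# Matches the numeric part of identifiers such as "gi|145323746|".
GI_PATTERN = re.compile(r"gi\|(\d+)\|")


def extract_gi(identifier):
    """Return the GI number from an identifier like 'gi|145323746|', or None."""
    match = GI_PATTERN.search(identifier)
    return match.group(1) if match else None


def find_cds(record, protein_id):
    """Return the sequence of the CDS feature annotated with protein_id.

    `record` is expected to behave like a Bio.SeqRecord read from a GenBank
    file. Returns None if no CDS carries that protein_id qualifier.
    """
    for feature in record.features:
        if feature.type != "CDS":
            continue
        if protein_id in feature.qualifiers.get("protein_id", []):
            # SeqFeature.extract() takes care of strand and joined locations.
            return feature.extract(record.seq)
    return None
```

With Biopython, the record would typically come from Entrez.efetch(db="nuccore",
id=..., rettype="gb", retmode="text") read through SeqIO.read(handle, "genbank").
Large genomic records may need different fetch settings to include the
sequence, which is worth checking before relying on this.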