From dejmail at gmail.com Wed Dec 1 10:13:02 2010
From: dejmail at gmail.com (Liam Thompson)
Date: Wed, 1 Dec 2010 17:13:02 +0200
Subject: [Biopython] fasta fail
Message-ID:

Hi everyone,

I have a list of sequences that I want to write to a file in FASTA format. This should be easy enough, but I keep getting an error which I can't fix.

SeqIO.write(final_seq, out_handle, "fasta")
  File "/usr/lib/pymodules/python2.6/Bio/SeqIO/__init__.py", line 398, in write
    count = writer_class(handle).write_file(sequences)
  File "/usr/lib/pymodules/python2.6/Bio/SeqIO/Interfaces.py", line 271, in write_file
    count = self.write_records(records)
  File "/usr/lib/pymodules/python2.6/Bio/SeqIO/Interfaces.py", line 256, in write_records
    self.write_record(record)
  File "/usr/lib/pymodules/python2.6/Bio/SeqIO/FastaIO.py", line 136, in write_record
    data = self._get_seq_string(record) #Catches sequence being None
  File "/usr/lib/pymodules/python2.6/Bio/SeqIO/Interfaces.py", line 164, in _get_seq_string
    % record.id)
TypeError: SeqRecord (id=FN545840.1) has an invalid sequence.

There is nothing wrong with the record as far as I can see, as I have written to and extracted from it numerous times as a FASTA entry. There is definitely sequence and other basic information on the record, which I can see from a simple print(final_seq[x].seq).

The list of sequences is a "list" as opposed to a "SeqRecord", so I thought this could be the problem. Is there a way to convert a list to a SeqRecord, or is this not necessary?

thanks Liam ----------------------------------------------------------- Antiviral Gene Therapy Research Unit University of the Witwatersrand Faculty of Health Sciences, Room 7Q07 7 York Road, Parktown 2193 http://web.wits.ac.za/Academic/Health/Pathology/AGTRU/ Tel: 2711 717 2465/7 Fax: 2711 717 2395 Skype: liam_thompson Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com From dejmail at gmail.com Wed Dec 1 10:25:37 2010 From: dejmail at gmail.com (Liam Thompson) Date: Wed, 1 Dec 2010 17:25:37 +0200 Subject: [Biopython] rephrase Message-ID: Apologies, it seems each of the records in the list is a SeqRecord. >>> type(final_seq[0]) >>> type(final_seq) Thanks Liam From biopython at maubp.freeserve.co.uk Wed Dec 1 10:44:42 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 1 Dec 2010 15:44:42 +0000 Subject: [Biopython] fasta fail In-Reply-To: References: Message-ID: On Wed, Dec 1, 2010 at 3:13 PM, Liam Thompson wrote: > hi everyone > > I have a list of sequences that I want to write to file in fasta format. > This is easy enough, however I keep getting an error which I can't fix. > > SeqIO.write(final_seq, out_handle, "fasta") > ... > TypeError: SeqRecord (id=FN545840.1) has an invalid sequence. > > There is nothing wrong with the record, as far as I can see as I have > written to and extraced from it numerous times as a fasta entry There is > definitely sequence and other basic information on the record, which I can > see from a simple print(final_seq[x].seq) The list of sequences is a "list" > as opposed to a "SeqRecord", so I thought this could be problem ? Is there a > way to convert a list to a SeqRecord or is this not necessary ? What is your final_seq object? You should have a list of SeqRecord objects (or in recent versions of Biopython you can also give SeqIO.write a single SeqRecord). Each SeqRecord's seq property should be a Seq object (or similar). 
Peter From dejmail at gmail.com Wed Dec 1 10:57:39 2010 From: dejmail at gmail.com (Liam Thompson) Date: Wed, 1 Dec 2010 17:57:39 +0200 Subject: [Biopython] fasta fail In-Reply-To: References:

Message-ID: Hi Peter Apologies, it seems each of the records in the list is a SeqRecord. >>> type(final_seq[0]) >>> type(final_seq) SeqIO seems to process the sequence fine, it just can't seem to write it. thanks Liam On 1 December 2010 17:44, Peter wrote: > On Wed, Dec 1, 2010 at 3:13 PM, Liam Thompson wrote: > > hi everyone > > > > I have a list of sequences that I want to write to file in fasta format. > > This is easy enough, however I keep getting an error which I can't fix. > > > > SeqIO.write(final_seq, out_handle, "fasta") > > ... > > TypeError: SeqRecord (id=FN545840.1) has an invalid sequence. > > > > There is nothing wrong with the record, as far as I can see as I have > > written to and extraced from it numerous times as a fasta entry There is > > definitely sequence and other basic information on the record, which I > can > > see from a simple print(final_seq[x].seq) The list of sequences is a > "list" > > as opposed to a "SeqRecord", so I thought this could be problem ? Is > there a > > way to convert a list to a SeqRecord or is this not necessary ? > > What is your final_seq object? > > You should have a list of SeqRecord objects (or in recent versions > of Biopython you can also give SeqIO.write a single SeqRecord). > Each SeqRecord's seq property should be a Seq object (or similar). > > Peter > From biopython at maubp.freeserve.co.uk Wed Dec 1 11:06:28 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 1 Dec 2010 16:06:28 +0000 Subject: [Biopython] fasta fail In-Reply-To: References:

Message-ID:

On Wed, Dec 1, 2010 at 3:57 PM, Liam Thompson wrote:
> Hi Peter
>
> Apologies, it seems each of the records in the list is a SeqRecord.
>
>>>> type(final_seq[0])
>
>
>>>> type(final_seq)
>
>
> SeqIO seems to process the sequence fine, it just can't seem
> to write it.
>
> thanks
> Liam

If final_seq is a list of SeqRecord objects, then Bio.SeqIO.write should be able to save it - assuming they all have sequences.

What does this do?

for record in final_seq:
    print record.id, type(record.seq)

Peter

From dejmail at gmail.com Thu Dec 2 05:33:23 2010
From: dejmail at gmail.com (Liam Thompson)
Date: Thu, 2 Dec 2010 12:33:23 +0200
Subject: [Biopython] fasta fail
In-Reply-To:
References:

Message-ID: Hi Peter Not sure what the problem was, but I just did it using regular expressions as opposed to SeqIO. Thanks Liam From biopython at maubp.freeserve.co.uk Thu Dec 2 05:39:33 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 2 Dec 2010 10:39:33 +0000 Subject: [Biopython] fasta fail In-Reply-To: References:

Message-ID: On Thu, Dec 2, 2010 at 10:33 AM, Liam Thompson wrote: > Hi Peter > > Not sure what the problem was, but I just did it using regular expressions > as opposed to SeqIO. > > Thanks > Liam If you did have a self contained example showing the failure I would still like to see it (email me directly with any attachments rather than the mailing list) to find out what went wrong, but I'm glad you've solved your immediate task. Peter From developer at allthingsprogress.com Thu Dec 2 15:42:09 2010 From: developer at allthingsprogress.com (David Jacobs) Date: Thu, 2 Dec 2010 15:42:09 -0500 Subject: [Biopython] Access Entrez gene DB using rettype 'gb' Message-ID: I want to do something obvious but can't find a good way to do it. Maybe I'm looking in the wrong places. Anyway, I figured I'd ask here. (Bear with me, I'm new to Python and Biopython.) My question is: What's the easiest way to find and parse DNA sequences from the gene database? I'd like to use something like: handle = Entrez.efetch(db='gene', id='2', rettype='gb') handle.read() But this doesn't work. After poking around, I've learned you can do this query on, the nucleotide database. But not on the gene database. Instead, I have to do this: handle = Entrez.efetch(db='gene', id='2', retmode='gb') I get back something like this: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=2&retmode=gb That isn't easily parseable, at least as far as I can tell. So what's the best way for me to find my sequence? And is there a parser for the string I get from retmode='gb'? Thanks, David From sdavis2 at mail.nih.gov Thu Dec 2 16:07:14 2010 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 2 Dec 2010 16:07:14 -0500 Subject: [Biopython] Access Entrez gene DB using rettype 'gb' In-Reply-To: References: Message-ID: On Thu, Dec 2, 2010 at 3:42 PM, David Jacobs < developer at allthingsprogress.com> wrote: > I want to do something obvious but can't find a good way to do it. 
Maybe > I'm > looking in the wrong places. Anyway, I figured I'd ask here. (Bear with me, > I'm new to Python and Biopython.) > > My question is: What's the easiest way to find and parse DNA sequences from > the gene database? > > I'd like to use something like: > > handle = Entrez.efetch(db='gene', id='2', rettype='gb') > handle.read() > > But this doesn't work. After poking around, I've learned you can do this > query on, the nucleotide database. But not on the gene database. Instead, I > have to do this: > > handle = Entrez.efetch(db='gene', id='2', retmode='gb') > > I get back something like this: > > > http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=2&retmode=gb > > That isn't easily parseable, at least as far as I can tell. So what's the > best way for me to find my sequence? And is there a parser for the string I > get from retmode='gb'? > > Hi, David. Genes (in the sense used in Entrez Gene) do not have sequences. Their respective transcripts do, however, and there can be, in general, multiple transcripts per gene. Therefore, I think you would have to do a query for the gene of interest and then link to nucleotide to get the sequences for the associated transcripts. If you want to do this for many genes, it may be easier to download the entire refseq collection for your species of interest and simply load stuff into memory or index the fasta file. Sean From developer at allthingsprogress.com Thu Dec 2 16:42:19 2010 From: developer at allthingsprogress.com (David Jacobs) Date: Thu, 2 Dec 2010 16:42:19 -0500 Subject: [Biopython] Access Entrez gene DB using rettype 'gb' In-Reply-To: References:

Message-ID: Hi Sean, Thanks for the info. I didn't realize the gene database wasn't concerned with sequences. (The distinction isn't so clear when you're using the web interface.) So now I'm trying to query nucleotide. My scripting approach has been: 1. Get list of gene names from a file 2. Query nucleotide for gene ID 3. Use that gene ID to download the proper nucleotide entry However, every time I get an ID from nucleotide, it's for an entire genome. How can I specify either a) a specific gene (as identified in the gene database) or b) a specific region of the genome? David On Thu, Dec 2, 2010 at 4:07 PM, Sean Davis wrote: > > > Hi, David. > > Genes (in the sense used in Entrez Gene) do not have sequences. Their > respective transcripts do, however, and there can be, in general, multiple > transcripts per gene. Therefore, I think you would have to do a query for > the gene of interest and then link to nucleotide to get the sequences for > the associated transcripts. If you want to do this for many genes, it may > be easier to download the entire refseq collection for your species of > interest and simply load stuff into memory or index the fasta file. > > Sean > > From kellrott at gmail.com Thu Dec 2 17:53:36 2010 From: kellrott at gmail.com (Kyle) Date: Thu, 2 Dec 2010 14:53:36 -0800 Subject: [Biopython] HMMER / Pfam support Message-ID: I would like to submit my hmmer branch for merge into the main BioPython tree, targeting inclusion in 1.57. This branch adds support for HMMER3 file parsing and some Pfam related file work. It's adapted from the PfamScan perl code found at ftp://ftp.sanger.ac.uk/pub/rdf/PfamScanBeta/ The code can be found at https://github.com/kellrott/biopython/tree/hmmer Kyle From sdavis2 at mail.nih.gov Thu Dec 2 20:15:28 2010 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 2 Dec 2010 20:15:28 -0500 Subject: [Biopython] Access Entrez gene DB using rettype 'gb' In-Reply-To: References:

Message-ID: On Thu, Dec 2, 2010 at 4:42 PM, David Jacobs < developer at allthingsprogress.com> wrote: > Hi Sean, > > Thanks for the info. I didn't realize the gene database wasn't concerned > with sequences. (The distinction isn't so clear when you're using the web > interface.) So now I'm trying to query nucleotide. My scripting approach has > been: > > 1. Get list of gene names from a file > 2. Query nucleotide for gene ID > 3. Use that gene ID to download the proper nucleotide entry > > Hi, David. Perhaps you can give a concrete example. What is the starting value (gene name, HUGO gene symbol, Entrez Gene ID)? What is the expected output--you mention "proper nucleotide entry", but there will likely be more than one for a given gene? You also mention that you are interested in a specific region of the genome--do you want the gene locus or the transcripts or the CDS, or something else? Finally, how many genes are we talking about here? 5-10 or thousands? Sean > However, every time I get an ID from nucleotide, it's for an entire genome. > How can I specify either a) a specific gene (as identified in the gene > database) or b) a specific region of the genome? > > David > > On Thu, Dec 2, 2010 at 4:07 PM, Sean Davis wrote: >> >> >> Hi, David. >> >> Genes (in the sense used in Entrez Gene) do not have sequences. Their >> respective transcripts do, however, and there can be, in general, multiple >> transcripts per gene. Therefore, I think you would have to do a query for >> the gene of interest and then link to nucleotide to get the sequences for >> the associated transcripts. If you want to do this for many genes, it may >> be easier to download the entire refseq collection for your species of >> interest and simply load stuff into memory or index the fasta file. 
>> >> Sean >> >> > From developer at allthingsprogress.com Fri Dec 3 00:59:36 2010 From: developer at allthingsprogress.com (David Jacobs) Date: Fri, 3 Dec 2010 00:59:36 -0500 Subject: [Biopython] Access Entrez gene DB using rettype 'gb' In-Reply-To: References:

Message-ID: Before I give an example, one note. I just realized where the "gene vs. sequence" confusion is coming from. I'm a bacteriologist, so I don't have to deal with introns or alternative splicing. In general, there is one definitive sequence for a gene that I want to look at. (Am I still missing something?) As an example of what I'm doing: say I want to find the sequence for fliC in Salmonella Typhi CT18. Since nucleotide is only giving me entire genomes, I've queried the gene database instead. The query is "fliC ct18" and it gives me one entry: http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term=fliC+ct18 Now I want the raw sequence for that gene. The sequence that shows up when I click "FASTA" on the above page: http://www.ncbi.nlm.nih.gov/nuccore/NC_003198?report=fasta&from=2011173&to=2012693&strand=true How can I get that? As far as quantity, the plan is to implement a script that will help me analyze about 10 genes every two weeks or so. Regards, David On Thu, Dec 2, 2010 at 8:15 PM, Sean Davis wrote: Hi, David. > > Perhaps you can give a concrete example. What is the starting value (gene > name, HUGO gene symbol, Entrez Gene ID)? What is the expected output--you > mention "proper nucleotide entry", but there will likely be more than one > for a given gene? You also mention that you are interested in a specific > region of the genome--do you want the gene locus or the transcripts or the > CDS, or something else? Finally, how many genes are we talking about here? > 5-10 or thousands? > > Sean > From bjorn_johansson at bio.uminho.pt Fri Dec 3 02:45:26 2010 From: bjorn_johansson at bio.uminho.pt (=?ISO-8859-1?Q?Bj=F6rn_Johansson?=) Date: Fri, 3 Dec 2010 07:45:26 +0000 Subject: [Biopython] Sequence assembly in (bio)python Message-ID: Hi, I wonder if there is a sequence assembler (like phred, cap3) implemented in python? 
I am working on a small utility to assemble a handful of sequences, and in this case I think that a standalone assembler might be overkill, and I would like to tweak the parameters easily. Alternatively, are there (bio)python bindings for any assembler program? I could not find any in Biopython.

thanks,
bjorn

--
______O_________oO________oO______o_______oO__
Björn Johansson
Assistant Professor
Departament of Biology
University of Minho
Campus de Gualtar
4710-057 Braga
PORTUGAL
http://www.bio.uminho.pt
http://sites.google.com/site/bjornhome
Work (direct) +351-253 601517
Private mob. +351-967 147 704
Dept of Biology (secretariate) +351-253 60 4310
Dept of Biology (fax) +351-253 678980

From biopython at maubp.freeserve.co.uk Fri Dec 3 08:22:40 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 3 Dec 2010 13:22:40 +0000
Subject: [Biopython] HMMER / Pfam support
In-Reply-To:
References:
Message-ID:

On Thu, Dec 2, 2010 at 10:53 PM, Kyle wrote:
> I would like to submit my hmmer branch for merge into the main BioPython
> tree, targeting inclusion in 1.57.
> This branch adds support for HMMER3 file parsing and some Pfam related file
> work. It's adapted from the PfamScan perl code found at
> ftp://ftp.sanger.ac.uk/pub/rdf/PfamScanBeta/
> The code can be found at https://github.com/kellrott/biopython/tree/hmmer
>
> Kyle

Hi Kyle,

I've had a quick look at the HMMER bits (but not the PFAM stuff) and have some initial comments.

I'm concerned about the apparent use of Jaina Mistry & John Tate's Perl code, which is under the GPL v2+ and thus cannot be included in Biopython. If you basically copied their code and translated it into Python, I think to be safe you'll have to ask the original authors' permission to re-license it for Biopython (MIT/BSD style). If your code is a fresh implementation using their approach you may be OK, but the module text should be clarified.

Does hmmscan work on Windows?
Would there be much point writing a Bio.Application style wrapper class for it, rather than or to be used within your Bio.hmmer.HMMScan function etc? A unit test for calling the tool would be good, e.g. test_hmmer_tool.py, which can be made conditional on the tool being found.

In Bio.hmmer you have two functions, parseHMMER3 and parseMultiHMMER3, taking file handles, used for a single record and multiple records (right?). It would match Biopython usage to call these read (single) and parse (iterator).

Is there anything here for HMMER2? I saw some apparent stub entries in the code so I guess not.

A minor thing: In your unit test file test_hmmer.py do you really need to use the obsolete string module? Can't you use a string method?

Also, since this is quite a long lived branch, looking over your changes isn't so simple what with all the merges. I'd find it easier to review the changes if you could rebase it off the current master, e.g. assuming you cloned from *your* repository on github, and added the official one as remote name upstream, something like this should do it:

git checkout hmmer
git branch hmmer_dec2010
git checkout hmmer_dec2010
git fetch upstream
git rebase upstream/master
git push origin hmmer_dec2010

(Untested, but something like that).

Is the PFAM stuff on this branch required for your HMMER code, or could we deal with the HMMER stuff separately first? If so, a "clean" new branch with just the HMMER stuff would be preferable (if it makes life easier, you could do it as a single commit - assuming you don't care about the history to date).

Peter

From chapmanb at 50mail.com Fri Dec 3 08:44:15 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 3 Dec 2010 08:44:15 -0500
Subject: [Biopython] Access Entrez gene DB using rettype 'gb'
In-Reply-To:
References:

Message-ID: <20101203134415.GG23468@sobchak.mgh.harvard.edu> David; > As an example of what I'm doing: say I want to find the sequence for fliC in > Salmonella Typhi CT18. Since nucleotide is only giving me entire genomes, > I've queried the gene database instead. The query is "fliC ct18" and it > gives me one entry: > > http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term=fliC+ct18 > > Now I want the raw sequence for that gene. The sequence that shows up when I > click "FASTA" on the above page: > > http://www.ncbi.nlm.nih.gov/nuccore/NC_003198?report=fasta&from=2011173&to=2012693&strand=true The best approach here might be to download the FASTA files for your bacteria of interest, and then extract the sequences you need that way. For your example, this file has the genes pre-sliced: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Salmonella_enterica_serovar_Typhi_CT18_uid57793/NC_003198.ffn Using EUtils is hard here because there isn't an official identifier for the sequence you are interested in. In this case you'll have to pull down the genome and then subset it yourself based on the coordinates. Brad From biopython at maubp.freeserve.co.uk Fri Dec 3 09:44:58 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 3 Dec 2010 14:44:58 +0000 Subject: [Biopython] Access Entrez gene DB using rettype 'gb' In-Reply-To: <20101203134415.GG23468@sobchak.mgh.harvard.edu> References:

<20101203134415.GG23468@sobchak.mgh.harvard.edu> Message-ID: On Fri, Dec 3, 2010 at 1:44 PM, Brad Chapman wrote: > David; > >> As an example of what I'm doing: say I want to find the sequence for fliC in >> Salmonella Typhi CT18. Since nucleotide is only giving me entire genomes, >> I've queried the gene database instead. The query is "fliC ct18" and it >> gives me one entry: >> >> http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term=fliC+ct18 >> >> Now I want the raw sequence for that gene. The sequence that shows up when I >> click "FASTA" on the above page: >> >> http://www.ncbi.nlm.nih.gov/nuccore/NC_003198?report=fasta&from=2011173&to=2012693&strand=true > > The best approach here might be to download the FASTA files for your > bacteria of interest, and then extract the sequences you need that > way. For your example, this file has the genes pre-sliced: > > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Salmonella_enterica_serovar_Typhi_CT18_uid57793/NC_003198.ffn > For bacteria (well, prokaryotes) I agree the files on the NCBI FTP site (FASTA files and others like the GenBank flat files) are very handy. This is certainly worth looking at. > Using EUtils is hard here because there isn't an official identifier > for the sequence you are interested in. In this case you'll have to > pull down the genome and then subset it yourself based on the > coordinates. Actually you should be able to get just the subsequence of interest via EFetch, see the seq_start and seq_stop parameters: http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchseq_help.html Peter From developer at allthingsprogress.com Fri Dec 3 14:38:59 2010 From: developer at allthingsprogress.com (David Jacobs) Date: Fri, 3 Dec 2010 14:38:59 -0500 Subject: [Biopython] Access Entrez gene DB using rettype 'gb' In-Reply-To: References:

<20101203134415.GG23468@sobchak.mgh.harvard.edu> Message-ID: Thanks for the info. There seems to be a huge disconnect between what I want to do and what this library is letting me do. It seems like there should be a really simple way to look up bacterial gene sequences by their names, and it's disappointing that that's not the case. Every workaround I've tried has also failed. For example, I've downloaded the full CT18 genome from the FTP server and parsed it using SeqIO. The problem is that SeqRecord doesn't give me an accessor to the "name" attribute of the sequence, as it would appear in the gene database. What's more, if I search the gene database for a name, I do, in fact, get an ID back. But that ID has no information about the start and stop indices for my sequence, so I can't use that information in conjunction with my downloaded genome. Further still, if I try to query the gene database for my gene's full information (using the ID that I grabbed from esearch(db=gene ...)), I get back data formatted in a way that BioPython can't parse. This is a touch aggravating. What am I missing? On Fri, Dec 3, 2010 at 9:44 AM, Peter wrote: > On Fri, Dec 3, 2010 at 1:44 PM, Brad Chapman wrote: > > David; > > > >> As an example of what I'm doing: say I want to find the sequence for > fliC in > >> Salmonella Typhi CT18. Since nucleotide is only giving me entire > genomes, > >> I've queried the gene database instead. The query is "fliC ct18" and it > >> gives me one entry: > >> > >> > http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term=fliC+ct18 > >> > >> Now I want the raw sequence for that gene. The sequence that shows up > when I > >> click "FASTA" on the above page: > >> > >> > http://www.ncbi.nlm.nih.gov/nuccore/NC_003198?report=fasta&from=2011173&to=2012693&strand=true > > > > The best approach here might be to download the FASTA files for your > > bacteria of interest, and then extract the sequences you need that > > way. 
For your example, this file has the genes pre-sliced: > > > > > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Salmonella_enterica_serovar_Typhi_CT18_uid57793/NC_003198.ffn > > > > For bacteria (well, prokaryotes) I agree the files on the NCBI FTP > site (FASTA files and others like the GenBank flat files) are very > handy. This is certainly worth looking at. > > > Using EUtils is hard here because there isn't an official identifier > > for the sequence you are interested in. In this case you'll have to > > pull down the genome and then subset it yourself based on the > > coordinates. > > Actually you should be able to get just the subsequence of interest > via EFetch, see the seq_start and seq_stop parameters: > http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchseq_help.html > > Peter > From biopython at maubp.freeserve.co.uk Fri Dec 3 15:32:21 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 3 Dec 2010 20:32:21 +0000 Subject: [Biopython] Access Entrez gene DB using rettype 'gb' In-Reply-To: References:

<20101203134415.GG23468@sobchak.mgh.harvard.edu>

Message-ID:

On Fri, Dec 3, 2010 at 7:38 PM, David Jacobs wrote:
> Thanks for the info. There seems to be a huge disconnect between what I want
> to do and what this library is letting me do. It seems like there should be
> a really simple way to look up bacterial gene sequences by their names, and
> it's disappointing that that's not the case.
>
> Every workaround I've tried has also failed.
>
> For example, I've downloaded the full CT18 genome from the FTP server and
> parsed it using SeqIO. The problem is that SeqRecord doesn't give me an
> accessor to the "name" attribute of the sequence, as it would appear in the
> gene database.

You'll have to give me more to go on - what did you download by FTP, a FASTA file, GenBank? How about giving the URL and an example of the "name" you want to use.

> What's more, if I search the gene database for a name, I do,
> in fact, get an ID back. But that ID has no information about the start and
> stop indices for my sequence, so I can't use that information in conjunction
> with my downloaded genome.

Have you looked at EInfo? It is for cross referencing between the different Entrez databases.

> Further still, if I try to query the gene
> database for my gene's full information (using the ID that I grabbed from
> esearch(db=gene ...)), I get back data formatted in a way that BioPython
> can't parse.

Are you talking about using EFetch here? Which database? The valid combinations of retmode and rettype change according to this. See e.g.:
http://www.ncbi.nlm.nih.gov/corehtml/query/static/efetchseq_help.html

>
> This is a touch aggravating.
>
> What am I missing?
>

The NCBI Entrez documentation is definitely sparse :(

If all you want to do is get the nucleotide sequence for bacterial genes, then I do suspect working with the FASTA or GenBank files would be easier than using Entrez (as Brad suggested earlier).

Can you give a specific example - a couple of gene names you want, and the desired answer (the sequence you want to find for them)?
Sean did ask earlier - this really would help, and we'd be better able to help you.

Peter

From developer at allthingsprogress.com Fri Dec 3 16:19:13 2010
From: developer at allthingsprogress.com (David Jacobs)
Date: Fri, 3 Dec 2010 16:19:13 -0500
Subject: [Biopython] Access Entrez gene DB using rettype 'gb'
In-Reply-To:
References:

<20101203134415.GG23468@sobchak.mgh.harvard.edu>

Message-ID: > > > Thanks for the info. There seems to be a huge disconnect between what I > want > > to do and what this library is letting me do. It seems like there should > be > > a really simple way to look up bacterial gene sequences by their names, > and > > it's disappointing that that's not the case. > > > > Every workaround I've tried has also failed. > > > > For example, I've downloaded the full CT18 genome from the FTP server and > > parsed it using SeqIO. The problem is that SeqRecord doesn't give me an > > accessor to the "name" attribute of the sequence, as it would appear in > the > > gene database. > > You'll have to give me more to go on - what did you download by FTP, > a FASTA file, GenBank? How about giving the URL and an example > of the "name" you want to use. > In this case, I downloaded the file Brad listed earlier: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Salmonella_enterica_serovar_Typhi_CT18_uid57793/NC_003198.ffn The "name" I'd like back is the name listed as "symbol" at http://www.ncbi.nlm.nih.gov/gene when I query "ct18 fliC"--in other words, I want to search the genome file I have for "fliC", and not just the human-readable description. It seems to me like the "name" attribute for SeqRecord would be a useful place to put this, especially since right now, "name" is just a duplicate of the information in "id". This already works for protein entries. See: https://github.com/biopython/biopython/blob/master/Bio/SeqRecord.py#L322 The thing is, the human-readable description of each gene is already annotated in the genome FASTA file I downloaded. I just need the symbol, as it's easily searchable and more canonical. > What's more, if I search the gene database for a name, I do, > > in fact, get an ID back. But that ID has no information about the start > and > > stop indices for my sequence, so I can't use that information in > conjunction > > with my downloaded genome. > > Have you looked at EInfo? 
It is for cross referencing between the different > Entrez databases. > I'll have a look. > > Further still, if I try to query the gene > > database for my gene's full information (using the ID that I grabbed from > > esearch(db=gene ...)), I get back data formatted in a way that BioPython > > can't parse. > > Are you talking about using EFetch here? Which database? The valid > combinations of retmode and rettype change according to this. See e.g.: > http://www.ncbi.nlm.nih.gov/corehtml/query/static/efetchseq_help.html I'm using ESearch. As I say, I'm trying to query the "gene" database using a name ("ct18 fliC"). I do get back just one entry, and it gives me the correct GID. When I try to query "gene" using this ID--in order to get the start and stop indices--the best I get is: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=1248507&retmode=txt Which I don't know how to parse. (Again, I just want the start and stop positions.) > This is a touch aggravating. > > > > What am I missing? > > > > The NCBI Entrez documentation is definitely sparse :( > > If all you want to do is get the nucleotide sequence for bacterial > genes then I do suspect working with the FASTA or GenBank files > would be easier than using Entrez (as Brad suggested earlier). > I'd rather not do this manually, though. It seems like BioPython should make tedious tasks like this easy and sustainable. > Can you give a specific example - couple of gene names you want, > and desired answer (the sequence want to find for them)? Sean did > ask earlier - this really would and we'd be better able to help you. For the example I've given over the last couple of e-mails, this is the sequence I want: http://www.ncbi.nlm.nih.gov/nuccore/NC_003198?report=fasta&from=2011173&to=2012693&strand=true It's directly linked to as from the fliC page in the "gene" database. (From the link labeled "FASTA".) Does that make things clearer? 
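
The region link above maps fairly directly onto an EFetch request. As a minimal sketch, assuming the seq_start, seq_stop and strand parameters described on the EFetch help page linked earlier in this thread, the same slice of NC_003198 could be requested with a URL built as below. The helper name build_region_url is hypothetical, the https base address is an assumption (the thread uses http), and treating strand=true in the web link as the minus strand (strand=2) is also an assumption that the thread does not confirm. The code only builds and prints the URL; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumed EFetch endpoint (the thread shows the http form of this address).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_region_url(accession, start, stop, minus_strand=False):
    """Build an EFetch URL for a FASTA slice of a nucleotide record.

    Hypothetical helper: start and stop are 1-based coordinates,
    strand is 1 for plus and 2 for minus.
    """
    params = [
        ("db", "nucleotide"),
        ("id", accession),
        ("rettype", "fasta"),
        ("seq_start", start),
        ("seq_stop", stop),
        ("strand", 2 if minus_strand else 1),
    ]
    return EFETCH_URL + "?" + urlencode(params)


# Coordinates taken from the fliC "FASTA" link quoted in this thread.
url = build_region_url("NC_003198", 2011173, 2012693, minus_strand=True)
print(url)
```

Fetching that URL (or passing the same keyword arguments to Bio.Entrez.efetch) should return only the requested region rather than the whole genome, provided the coordinates and strand are right for the gene in question.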
David From chapmanb at 50mail.com Fri Dec 3 16:57:36 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 3 Dec 2010 16:57:36 -0500 Subject: [Biopython] Access Entrez gene DB using rettype 'gb' In-Reply-To: References:

<20101203134415.GG23468@sobchak.mgh.harvard.edu>

Message-ID: <20101203215736.GO23468@sobchak.mgh.harvard.edu> David; [Bacterial download example] > http://www.ncbi.nlm.nih.gov/nuccore/NC_003198?report=fasta&from=2011173&to=2012693&strand=true Thanks for the additional explanation. The problem is that the fasta file download doesn't have the fliC identifier in the header line. It's not that Biopython doesn't present it to you correctly; rather, NCBI didn't include it in the fasta output. Thanks to Peter's tip about efetch supporting slicing, here is a script that replicates what you're getting by hand. The tricky part is that the gene record is returned as XML instead of their native record format, so it needs to be parsed for the items of interest. https://gist.github.com/727625

import xml.etree.ElementTree as ET
from Bio import Entrez

def fetch_gene_coordinates(search_term):
    handle = Entrez.esearch(db="gene", term=search_term)
    rec = Entrez.read(handle)
    gene_id = rec["IdList"][0] # assuming best match works
    handle = Entrez.efetch(db="gene", id=gene_id, retmode="xml")
    gene_locus = ET.parse(handle).getroot().find("Entrezgene/Entrezgene_locus")
    region = gene_locus.find("Gene-commentary/Gene-commentary_seqs/Seq-loc/Seq-loc_int/Seq-interval")
    # add 1 to coordinates from XML to match what Entrez expected (0 to 1 based)
    start = int(region.find("Seq-interval_from").text) + 1
    end = int(region.find("Seq-interval_to").text) + 1
    gi_id = region.find("Seq-interval_id/Seq-id/Seq-id_gi").text
    strand = region.find("Seq-interval_strand/Na-strand").get("value")
    return gi_id, start, end, strand

def get_fasta_seq(gi_id, start, end, strand):
    strand = 2 if strand.lower() == "minus" else 1
    handle = Entrez.efetch(db="nucleotide", rettype="fasta", id=gi_id,
                           seq_start=start, seq_stop=end, strand=strand)
    return handle.read()

Entrez.email = "yours at mail.com"
search_term = "fliC ct18"
gi_id, start, end, strand = fetch_gene_coordinates(search_term)
print get_fasta_seq(gi_id, start, end, strand)

Hope this works for you, Brad From biopython at maubp.freeserve.co.uk Fri Dec 3 17:42:05 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 3 Dec 2010 22:42:05 +0000 Subject: [Biopython] Access Entrez gene DB using rettype 'gb' In-Reply-To: References:

<20101203134415.GG23468@sobchak.mgh.harvard.edu>

Message-ID: On Fri, Dec 3, 2010 at 9:19 PM, David Jacobs wrote: > > In this case, I downloaded the file Brad listed earlier: > > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Salmonella_enterica_serovar_Typhi_CT18_uid57793/NC_003198.ffn > > The "name" I'd like back is the name listed as "symbol" at > http://www.ncbi.nlm.nih.gov/gene when I query "ct18 fliC"--in other words, I > want to search the genome file I have for "fliC", and not just the > human-readable description. It seems to me like the "name" attribute for > SeqRecord would be a useful place to put this, especially since right now, > "name" is just a duplicate of the information in "id". > > This already works for protein entries. See: > > https://github.com/biopython/biopython/blob/master/Bio/SeqRecord.py#L322 That's not an example from parsing a FASTA file, it's a record constructed "by hand" - not really a fair comparison. The trouble with FASTA files is there is no standard way to structure the information in the ">" line, other than that the first word is the identifier. > The thing is, the human-readable description of each gene is already > annotated in the genome FASTA file I downloaded. I just need the > symbol, as it's easily searchable and more canonical. As Brad pointed out, sadly the gene name "fliC" is not in that FASTA file anywhere. However, it is in the GenBank file: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Salmonella_enterica_serovar_Typhi_CT18_uid57793/NC_003198.gbk You can loop over all the features, filter on type (e.g. gene or CDS) and look at the annotation (qualifiers is a dictionary, entries are lists of strings) for features with the gene name (or locus tag, or database cross reference) of interest:

from Bio import SeqIO
genome = SeqIO.read("NC_003198.gbk", "gb")
for feature in genome.features:
    if feature.type=="CDS" \
            and "fliC" in feature.qualifiers.get('gene',[]):
        print feature
        print feature.extract(genome.seq)

Also have a look at this example for another way to pick out the feature of interest: http://www.warwick.ac.uk/go/peter_cock/python/genbank/#indexing_features For the online approach with Entrez, Brad has replied already. Regards, Peter From kellrott at gmail.com Fri Dec 3 20:50:52 2010 From: kellrott at gmail.com (Kyle) Date: Fri, 3 Dec 2010 17:50:52 -0800 Subject: [Biopython] HMMER / Pfam support In-Reply-To: References: Message-ID: > I'm concerned about the apparent use of Jaina Mistry & John Tate's > Perl code which is under the GPL v2+ and thus cannot be included > in Biopython. If you basically copied their code and translated it into > Python, I think to be safe you'll have to ask the original authors' > permission to re-license it for Biopython (MIT/BSD style). If your > code is a fresh implementation using their approach you may be > OK, but the module text should be clarified. > I'll contact them (and CC you) off list to get permission to re-license the code. > Does hmmscan work on Windows? Would there be much point > writing a Bio.Application style wrapper class for it, rather than or > to be used within your Bio.hmmer.HMMScan function etc? A unit > test for calling the tool would be good, e.g. test_hmmer_tool.py > which can be made conditional on the tool being found. > I don't have a Windows machine to test HMMER3 on, so somebody else will have to check that. I've started work on a wrapper module for it, as well as the unit testing code for it. 
> In Bio.hmmer you have two functions, parseHMMER3 and > parseMultiHMMER3 taking file handles, used for a single > record and multiple records (right?). It would match Biopython > usage to call these read (single) and parse (iterator) > This is patterned after the original implementation. I've only utilized parseMultiHMMER3 on the front end, but I've left the other methods intact in case I need them in the future. > Is there anything here for HMMER2? I saw some apparent > stub entries in the code so I guess not. > Given that HMMER2's last official release was in 2003, that HMMER3 is much faster, and that Pfam24 onward requires HMMER3, I haven't put any effort into it. But I've left those references in case someone demands HMMER2 support. It wouldn't be too difficult, probably only a few tweaks, to get it to work. The stub entries are also patterned after the original code, and I just left them in case I ended up needing them in the future. > Also since this is quite a long lived branch looking over your > changes isn't so simple what with all the merges. I'd find it > easier to review the changes if you could rebase it off the > current master, e.g. assuming you cloned from *your* > repository on github, and added the official one as remote > name upstream, something like this should do it: > I've reworked it into the hmmer_dec2010 branch, which comes straight out of master, with only two revisions. > Is the PFAM stuff on this branch required for your HMMER > code, or could we deal with the HMMER stuff separately > first? If so a "clean" new branch with just the HMMER > stuff would be preferable (if it makes life easier, you > could do it as a single commit - assuming you don't care > about the history to date) > The HMMER stuff should be independent of the Pfam module, so it can be integrated by itself. I've already removed the Pfam stuff from the new hmmer_dec2010 branch. 
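For reference, the read/parse naming Peter suggests could look roughly like the sketch below. It assumes a made-up text format in which each record ends with a "//" line, not real HMMER3 output, and the function bodies are hypothetical illustrations rather than the code in my branch:

```python
import io


def parse(handle):
    """Yield one record (a list of lines) per '//'-terminated block.

    The '//' terminator is an assumption for this toy format only.
    """
    lines = []
    for line in handle:
        line = line.rstrip("\n")
        if line == "//":
            yield lines
            lines = []
        elif line:
            lines.append(line)
    if lines:
        # A trailing block without a terminator still counts as a record.
        yield lines


def read(handle):
    """Return the single record in handle, or raise ValueError."""
    records = parse(handle)
    first = next(records, None)
    if first is None:
        raise ValueError("No records found in handle")
    if next(records, None) is not None:
        raise ValueError("More than one record found in handle")
    return first


if __name__ == "__main__":
    # One record: read() is the convenient entry point.
    print(read(io.StringIO("query1\nhit A\n//\n")))
    # Several records: iterate with parse().
    for record in parse(io.StringIO("query1\nhit A\n//\nquery2\nhit B\n//\n")):
        print(record[0])
```

Here read() simply wraps parse() and complains unless there is exactly one record, which is how I understand the existing Biopython read functions (e.g. SeqIO.read) to behave.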
Kyle From ruchira.datta at gmail.com Fri Dec 3 23:50:07 2010 From: ruchira.datta at gmail.com (Ruchira Datta) Date: Fri, 3 Dec 2010 20:50:07 -0800 Subject: [Biopython] HMMER / Pfam support In-Reply-To: References: Message-ID: I had written a HMMER3 parser independently (i.e., not based on any previous work), if that helps. However, I wrote it thinking of our group's needs, so it may not be BioPythonic. --Ruchira On Fri, Dec 3, 2010 at 5:22 AM, Peter wrote: > On Thu, Dec 2, 2010 at 10:53 PM, Kyle wrote: > > I would like to submit my hmmer branch for merge into the main BioPython > > tree, targeting inclusion in 1.57. > > This branch adds support for HMMER3 file parsing and some Pfam related > file > > work. It's adapted from the PfamScan perl code found at > > ftp://ftp.sanger.ac.uk/pub/rdf/PfamScanBeta/ > > The code can be found at > https://github.com/kellrott/biopython/tree/hmmer > > > > Kyle > > Hi Kyle, > > I've had a quick look at the HMMER bits (but not the PFAM stuff) > and have some initial comments. > > I'm concerned about the apparent use of Jaina Mistry & John Tate's > Perl code which is under the GPL v2+ and thus cannot be included > in Biopython. If you basically copied their code and translated it into > Python. I think to be safe you'll have to ask the origin authors' > permission to re-license it for Biopython (MIT/BSD style). If your > code is a fresh implementation using their approach you may be > OK, but the module text should be clarified. > > Does hmmscan work on Windows? Would there be much point > writing a Bio.Application style wrapper class for it, rather than or > to be used within your Bio.hmmer.HMMScan function etc? A unit > test for calling the tool would be good, e.g. test_hmmer_tool.py > which can be made conditional on the tool being found. > > In Bio.hmmer you have two functions, parseHMMER3 and > parseMultiHMMER3 taking file handles, used for a single > record and multiple records (right?). 
It would match Biopython > usage to call these read (single) and parse (iterator) > > Is there anything here for HMMER2? I saw some apparent > stub entries in the code so I guess not. > > A minor thing: In your unit test file test_hmmer.py do you really > need to use the obsolete string module? Can't you use a string > method? > > Also since this is quite a long lived branch looking over your > changes isn't so simple what with all the merges. I'd find it > easier to review the changes if you could rebase it off the > current master, e.g. assuming you cloned from *your* > repository on github, and added the official one as remote > name upstream, something like this should do it: >
> git checkout hmmer
> git branch hmmer_dec2010
> git checkout hmmer_dec2010
> git fetch upstream
> git rebase upstream/master
> git push origin hmmer_dec2010
> > (Untested, but something like that). > > Is the PFAM stuff on this branch required for your HMMER > code, or could we deal with the HMMER stuff separately > first? If so a "clean" new branch with just the HMMER > stuff would be preferable (if it makes life easier, you > could do it as a single commit - assuming you don't care > about the history to date) > > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From biopython at maubp.freeserve.co.uk Tue Dec 7 07:45:53 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 7 Dec 2010 12:45:53 +0000 Subject: [Biopython] HMMER / Pfam support In-Reply-To: References:

Message-ID: On Sat, Dec 4, 2010 at 4:50 AM, Ruchira Datta wrote: > I had written a HMMER3 parser independently (i.e., not based on any previous > work), if that helps. However, I wrote it thinking of our group's needs, so > it may not be BioPythonic. Hi Ruchira, Is your code online somewhere? It would certainly be worth looking at - especially if we can't sort out the GPL license issue with Kyle's code. [Some of Kyle's HMMER3 code is fine from a license point of view, for example the Biopython style application wrapper class] Peter From devaniranjan at gmail.com Tue Dec 7 15:40:26 2010 From: devaniranjan at gmail.com (George Devaniranjan) Date: Tue, 7 Dec 2010 15:40:26 -0500 Subject: [Biopython] superimposing problem Message-ID: Hello everyone, Suppose I want to see the conformational variation in protein loops, and have extracted, say, 2 loops of the same length and want to superimpose the 1st and last residues (i.e. clamp them together like pivots). How would I go about doing that? I can use the Superimposer and superimpose based on either the 1st or the last residue, calculate the rot/tran and then apply it to the entire molecule, but I don't know how to do it for both the 1st and last residues while the intermediate loop residues are not superimposed but are free-moving. Please look at the 2 PDB files of the loops where I want to superimpose the 1st and last residues, and also at the image I have attached of what I want to do, for a better understanding. Thanks for your help, and sorry if it's written in a slightly confusing manner. -------------- next part -------------- A non-text attachment was scrubbed... Name: superimpose.jpg Type: image/jpeg Size: 8723 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: gH.pdb Type: chemical/x-pdb Size: 4262 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: kH.pdb Type: chemical/x-pdb Size: 4262 bytes Desc: not available URL: From developer at allthingsprogress.com Tue Dec 7 15:52:29 2010 From: developer at allthingsprogress.com (David) Date: Tue, 07 Dec 2010 15:52:29 -0500 Subject: [Biopython] Access Entrez gene DB using rettype 'gb' In-Reply-To: References:

<20101203134415.GG23468@sobchak.mgh.harvard.edu>

Message-ID: <1291755149.22122.7.camel@W01209978> > As Brad pointed out, sadly the gene name "fliC" is not in that > FASTA file anywhere. I believe I mentioned that, too. Only the human-readable descriptions are available, for some reason. Since the gene symbols are just as important (actually more important), it's confusing to me that NCBI doesn't include them. I guess that's neither here nor there. > You can loop over all the features, filter on type (e.g. gene or CDS) > and look at the annotation (qualifiers is a dictionary, entries are > lists of strings) for features with the gene name (or locus tag, or > database cross reference) of interest: > > from Bio import SeqIO > genome = SeqIO.read("NC_003198.gbk", "gb") > for feature in genome.features: > if feature.type=="CDS" \ > and "fliC" in feature.qualifiers.get('gene',[]): > print feature > print feature.extract(genome.seq) > > Also have a look at this example for another way to pick out the > feature of interest: > > http://www.warwick.ac.uk/go/peter_cock/python/genbank/#indexing_features > > For the online approach with Entrez, Brad has replied already. > > Regards, > Peter Thanks, Brad and Peter, for your solutions. I'm going to use a modified version of Brad's Internet-based script to work with my genes of interest. I'll keep your solution in mind, Peter, in case I need to use the script offline in the future. It turns out one of the problems I was having was simply parsing XML with Python (I am used to Ruby). Brad's example script has worked well to solve that problem. One question that this has prompted (for me) is: Why not extend BioPython to support queries that NCBI does not, itself, support? I am a newcomer, but it seems like the Entrez model is simply a wrapper for the NCBI interfaces and protocols. But there is potential for more. Is there any interest in boosting the Entrez model to support common tasks like bacterial gene lookups? 
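P.S. In case it helps anyone else who is new to Python's XML handling, the part of Brad's script that tripped me up reduces to something like the sketch below. The XML is a tiny hand-written stand-in that only uses the element names from Brad's script, not real efetch output; the GI number is invented, the strand value is an assumption, and the coordinates are picked so that adding 1 gives the from/to values in the NCBI link earlier in this thread:

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the gene XML; only the elements used below.
# The GI "12345" is invented and the "minus" strand is an assumption.
TOY_XML = """
<Entrezgene-Set>
  <Entrezgene>
    <Entrezgene_locus>
      <Gene-commentary>
        <Gene-commentary_seqs>
          <Seq-loc>
            <Seq-loc_int>
              <Seq-interval>
                <Seq-interval_from>2011172</Seq-interval_from>
                <Seq-interval_to>2012692</Seq-interval_to>
                <Seq-interval_strand>
                  <Na-strand value="minus"/>
                </Seq-interval_strand>
                <Seq-interval_id>
                  <Seq-id>
                    <Seq-id_gi>12345</Seq-id_gi>
                  </Seq-id>
                </Seq-interval_id>
              </Seq-interval>
            </Seq-loc_int>
          </Seq-loc>
        </Gene-commentary_seqs>
      </Gene-commentary>
    </Entrezgene_locus>
  </Entrezgene>
</Entrezgene-Set>
"""


def coordinates_from_xml(xml_text):
    """Pull (gi, start, end, strand) out of the toy gene XML."""
    root = ET.fromstring(xml_text.strip())
    # find() takes a slash-separated path of child element names.
    region = root.find(
        "Entrezgene/Entrezgene_locus/Gene-commentary/"
        "Gene-commentary_seqs/Seq-loc/Seq-loc_int/Seq-interval"
    )
    # The XML coordinates are 0-based; add 1 for 1-based positions.
    start = int(region.find("Seq-interval_from").text) + 1
    end = int(region.find("Seq-interval_to").text) + 1
    gi_id = region.find("Seq-interval_id/Seq-id/Seq-id_gi").text
    strand = region.find("Seq-interval_strand/Na-strand").get("value")
    return gi_id, start, end, strand


if __name__ == "__main__":
    print(coordinates_from_xml(TOY_XML))
```

The two things I had to learn were that find() walks a path of nested element names, and that element text comes back as a string, so the coordinates need int() before any arithmetic.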
David From anaryin at gmail.com Tue Dec 7 16:14:54 2010 From: anaryin at gmail.com (João Rodrigues) Date: Tue, 7 Dec 2010 22:14:54 +0100 Subject: [Biopython] superimposing problem In-Reply-To: References: Message-ID: Hello George, From what I understood, what you want is to compare the diversity of two loops based on anchor points. Peter already pointed you to a great resource here. To make it clearer, you should first define the residues you want to align (or the particular atoms) and then perform the alignment. For example, I wanted to align a loop in my protein so I defined the *resi_range* variable to restrict the alignment to those residues. Furthermore, I then made sure that I only used CA atoms with two for loops. Check the code here (http://pastebin.com/V2UsGYGL), it might help you understand how it works. Best! From devaniranjan at gmail.com Tue Dec 7 16:30:02 2010 From: devaniranjan at gmail.com (George Devaniranjan) Date: Tue, 7 Dec 2010 16:30:02 -0500 Subject: [Biopython] superimposing problem In-Reply-To: References: Message-ID: Hello, I already do that for connecting fragments, based on the Warwick code by Peter, but I want to have a superimposition based on both the 1st and last residues, where the residues between them are only rot/tran and not superimposed. I get the basic idea already, as I use a similar principle to connect the last residue of one fragment to the 1st of another to make a longer one, but here I want the 1st superimposed with the 1st and the last with the last, and not the middle ones. Looking at the code you sent, mine is similar: it is the one I use to make longer fragments by joining 1st with last, but I am not sure how to do this one. Thank you, George On Tue, Dec 7, 2010 at 4:14 PM, João Rodrigues wrote: > Hello George, > > From what I understood, what you want is to compare the diversity of two > loops based on anchor points. > > Peter already pointed you to a great resource here > . 
> > To make it clearer, you should first define the residues you want to align > (or the particular atoms) and then perform the alignment. > > For example, I wanted to align a loop in my protein so I defined the > *resi_range* variable to restrict the alignment to those residues. > Furthermore, I then made sure that I only used CA atoms with two for loops. > > Check the code here (http://pastebin.com/V2UsGYGL), might help you > understand how it works. > > Best! > From devaniranjan at gmail.com Tue Dec 7 16:37:35 2010 From: devaniranjan at gmail.com (George Devaniranjan) Date: Tue, 7 Dec 2010 16:37:35 -0500 Subject: [Biopython] superimposing problem In-Reply-To: References: Message-ID: Hello João, I can do this already based on Peter's code----see IMAGE called connected.jpg But I want to do See IMAGE On Tue, Dec 7, 2010 at 4:14 PM, João Rodrigues wrote: > Hello George, > > From what I understood, what you want to be compare the diversity of two > loops based on anchor points. > > Peter already pointed you to a great resource here > . > > To make it clearer, you should first define the residues you want to align > (or the particular atoms) and then perform the alignment. > > For example, I wanted to align a loop in my protein so I defined the > *resi_range* variable to restrict the alignment to those residues. > Furthermore, I then made sure that I only used CA atoms with two for loops. > > Check the code here (http://pastebin.com/V2UsGYGL), might help you > understand how it works. > > Best! > From devaniranjan at gmail.com Tue Dec 7 16:39:29 2010 From: devaniranjan at gmail.com (George Devaniranjan) Date: Tue, 7 Dec 2010 16:39:29 -0500 Subject: [Biopython] superimposing problem In-Reply-To: References: Message-ID: sorry message got sent before I finished. 
I can do this (connected.jpg) I want to do this (superimpose.jpg) Thanks George On Tue, Dec 7, 2010 at 4:14 PM, João Rodrigues wrote: > Hello George, > > From what I understood, what you want to be compare the diversity of two > loops based on anchor points. > > Peter already pointed you to a great resource here > . > > To make it clearer, you should first define the residues you want to align > (or the particular atoms) and then perform the alignment. > > For example, I wanted to align a loop in my protein so I defined the > *resi_range* variable to restrict the alignment to those residues. > Furthermore, I then made sure that I only used CA atoms with two for loops. > > Check the code here (http://pastebin.com/V2UsGYGL), might help you > understand how it works. > > Best! > -------------- next part -------------- A non-text attachment was scrubbed... Name: connected.jpg Type: image/jpeg Size: 12377 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: superimpose.jpg Type: image/jpeg Size: 8723 bytes Desc: not available URL: From biopython at maubp.freeserve.co.uk Tue Dec 7 17:25:43 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 7 Dec 2010 22:25:43 +0000 Subject: [Biopython] HMMER / Pfam support In-Reply-To: References:

Message-ID: On Tue, Dec 7, 2010 at 12:45 PM, Peter wrote: > On Sat, Dec 4, 2010 at 4:50 AM, Ruchira Datta wrote: >> I had written a HMMER3 parser independently (i.e., not based on any previous >> work), if that helps. ?However, I wrote it thinking of our group's needs, so >> it may not be BioPythonic. > > Hi Ruchira, > > Is your code online somewhere? It would certainly be worth looking at > - especially if we can't sort out the GPL license issue with Kyle's code. > > [Some of Kyle's HMMER3 code is fine from a license point of view, for example > the Biopython style application wrapper class] > > Peter Hi Ruchira, I got the HMMER3 parsing code you sent off the mailing list. That could be very helpful, thank you. Peter From kellrott at gmail.com Tue Dec 7 17:35:46 2010 From: kellrott at gmail.com (Kyle) Date: Tue, 7 Dec 2010 14:35:46 -0800 Subject: [Biopython] HMMER / Pfam support In-Reply-To: References:

Message-ID: > > [Some of Kyle's HMMER3 code is fine from a license point of view, for > example > the Biopython style application wrapper class] > I've been reworking the code to be an extension of the AlignIO module. Send me a code, and I'll work it into that module. Kyle From devaniranjan at gmail.com Tue Dec 7 17:49:22 2010 From: devaniranjan at gmail.com (George Devaniranjan) Date: Tue, 7 Dec 2010 17:49:22 -0500 Subject: [Biopython] superimposing problem In-Reply-To: References:

Message-ID: Thanks João for trying. I was also thinking that you would have to deform the 2nd coil to get exactly what I want. What I tried, not completely successfully (but I will try it again), is as follows.
1) Superimpose the 1st residue of coil1 with the 1st residue of coil2, rot/tran coil2 and get the coordinates; let's call the rot/tran coil2 mod_coil2.
2) Superimpose the last residue of coil1 with the last residue of mod_coil2 and then rot/tran the whole molecule. The result is not exactly what I want, but it is the only solution I can think of for now.
Thanks once again João, George On Tue, Dec 7, 2010 at 5:40 PM, João Rodrigues wrote: > Hello George, > > I've tried restricting the alignment to a few atoms only. It works > perfectly for one residue only, but when you add a second residue it > probably tries to fit both (which is pretty much impossible unless both > starting residues have exactly the same rotamer) and it doesn't produce > exactly what you want. > > I'm attaching a pymol session picture of the result of this script that superimposes CA, CB, and HA (more or less a straight line) atoms > of the first residue of both loops. Both loops have a completely different > structure, starting already with a twist on the first ALA residue, so I > think it's pretty hard to do what you want (exactly) without deforming your > loops. > > Usually, to produce those figures, you should have anchor residues whose > coordinates remain the same and then generate the loop residues. > > I hope someone else can prove me wrong :) > > Best, > > João [...] Rodrigues > http://doeidoei.wordpress.com > > > > > On Tue, Dec 7, 2010 at 10:39 PM, George Devaniranjan < > devaniranjan at gmail.com> wrote: > >> sorry message got sent before I finished. 
>> >> I can do this (connected.jpg) >> >> I want to do this (superimpose.jpg) >> >> >> Thanks >> George >> >> On Tue, Dec 7, 2010 at 4:14 PM, Jo?o Rodrigues wrote: >> >>> Hello George, >>> >>> From what I understood, what you want to be compare the diversity of two >>> loops based on anchor points. >>> >>> Peter already pointed you to a great resource here >>> . >>> >>> To make it clearer, you should first define the residues you want to >>> align (or the particular atoms) and then perform the alignment. >>> >>> For example, I wanted to align a loop in my protein so I defined the * >>> resi_range* variable to restrict the alignment to those residues. >>> Furthermore, I then made sure that I only used CA atoms with two for loops. >>> >>> Check the code here (http://pastebin.com/V2UsGYGL), might help you >>> understand how it works. >>> >>> Best! >>> >> >> > From biopython at maubp.freeserve.co.uk Tue Dec 7 18:00:10 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 7 Dec 2010 23:00:10 +0000 Subject: [Biopython] superimposing problem In-Reply-To: References:

Message-ID: On Tue, Dec 7, 2010 at 10:49 PM, George Devaniranjan wrote: > Thanks João for trying. > I was also thinking that you would have to deform the 2nd coil to get > exactly what I want. > > What I tried, not completely successfully (but I will try it again), is as > follows. > > 1) Superimpose 1st residue of coil1 with 1st residue of coil2, rot/tran > coil2 and get the coordinates---let's call the rot/tran coil2 mod_coil2 > > 2) Superimpose last residue of coil1 with last residue of mod_coil2 and then > rot/tran whole molecule---the result is not exactly what I want but it is > the only solution I can think of for now. > Thanks once again João, > > George Hi George, Hopefully I have understood your aim, and the following makes sense... Have you thought about (in your head - no code needed) trying to superimpose the first residues AND the last residues simultaneously with a rigid body motion (rotation and translation)? Assume you are using just the C-alpha atoms for this. Let's call these atoms S1, E1 and S2, E2 (start and end). Also suppose that the distance S1 to E1 is bigger than S2 to E2. The superposition will give you S1, S2, E2, E1 on a line in space. However, the relative rotation of the two loops about that line is free. So, if using C-alpha atoms, I think you are going to have to include at least one pair of atoms from the loop, e.g. M1 and M2 for the C-alpha of the mid point residue. However, if you are using more than just the C-alpha atoms, that should give enough constraints for the superposition to be unique. It may still not be what you want, in which case try adding more constraints as above. Peter From anaryin at gmail.com Tue Dec 7 18:33:38 2010 From: anaryin at gmail.com (João Rodrigues) Date: Wed, 8 Dec 2010 00:33:38 +0100 Subject: [Biopython] superimposing problem In-Reply-To: References:

Message-ID: The problem here is that for that picture to be possible you have to have identical flanking residues, and in those structures of yours it seems you don't. Peter was trying to find a way of making sure that the superimposition was unambiguous. In other words, that the loops superimposed in the only way possible (rotation- and translation-wise). By having more than one point per residue, as you are doing with those CA, C, N and O atoms, you're getting the best superimposition possible. However, since the ends of the loops do not match, you can never fit them as in the picture. I'm assuming you're generating these loops with a loop modelling or sampling program. Try adding dummy residues before and after the loops, say two ALAs, and fix them during the procedure. If, on the other hand, you want a good way of displaying loop flexibility/variability, try superimposing only the first residue. That should be unique enough to display the different conformations. Sorry I can't be of much help... João [...] Rodrigues http://doeidoei.wordpress.com On Wed, Dec 8, 2010 at 12:18 AM, George Devaniranjan wrote: > Hello Peter, > Thank you for your suggestion. > I am using CA/C/N/O for superimposition of S1/S2 and E1/E2. > > Not sure I got exactly what you meant but I am going to think about it > tonight and see if that makes sense. > > The idea that I am trying to see is ---if 2 coils from diff protein have > approximately the same length (S1-E1 = S2-E2 distance) and have the same > number of residues what is the "floppiness" or (no of conformations) that > you can see. > > So as in my figure I sent (reattached) it does not matter where the > intermediate residues are they can flop about. > Thanks I will think about this more in the next few days. > Regards, > George > > > > > On Tue, Dec 7, 2010 at 6:00 PM, Peter wrote: > >> On Tue, Dec 7, 2010 at 10:49 PM, George Devaniranjan >> wrote: >> > Thanks João for trying. 
>> > I was also thinking that you would have to deform the 2nd coil to get >> > exactly what I want. >> > >> > What I tried and not completly sussefully but I will try that agian is >> as >> > follows. >> > >> > 1) Superimpose 1st residue of coil1 with 1st residue of coil2, rot/tran >> > coil2 and get the coordinates---lets call the rot/tran coil2 as >> mod_coil2 >> > >> > 2) Superimpose last residue of coil1 with last residue of mod_coil2 and >> then >> > rot/tran whole molecule---the result is not exactly what I want but it >> is >> > the only solution I can think of for now. >> > Thanks once agin Jo?o, >> > >> > George >> >> Hi George, >> >> Hopefully I have understood your aim, and the following makes >> sense... >> >> Have you thought about (in your head - no code needed) trying to >> simultaneously try to superimpose the first residues AND the last >> residues with a rigid body motion (rotation and translation)? >> >> Assuming you are using just the C-alpha atoms for this. Lets >> call these atoms S1, E1 and S2, E2 (start and end). Also >> suppose that the distance S1 to E1 is bigger than S2 to E2. >> The superposition will give you S1, S2, E2, E1 on a line in space. >> However, the relative rotation of the two loops is free. >> >> So, if using C-alpha atoms, I think you are going to have to include >> at least one pair of atoms from the loop, e.g. M1 and M2 for the >> C-alpha of the mid point residue. >> >> However, if you are using more than just the C-alpha atoms, that >> should give enough constraints for the superposition to be unique. >> It may still not be what you want, in which case try adding more >> constraints as above. >> >> Peter >> > > From anaryin at gmail.com Tue Dec 7 17:40:51 2010 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 7 Dec 2010 23:40:51 +0100 Subject: [Biopython] superimposing problem In-Reply-To: References:

Message-ID: Hello George, I've tried restricting the alignment to a few atoms only. It works perfectly for one residue only, but when you add a second residue it probably tries to fit both (which is pretty much impossible unless both starting residues have exactly the same rotamer) and it doesn't produce exactly what you want. I'm attaching a pymol session picture of the result of this script that superimposes CA, CB, and HA (more or less a straight line) atoms of the first residue of both loops. Both loops have a completely different structure, starting already with a twist on the first ALA residue, so I think it's pretty hard to do what you want (exactly) without deforming your loops. Usually, to produce those figures, you should have anchor residues whose coordinates remain the same and then generate the loop residues. I hope someone else can prove me wrong :) Best, João [...] Rodrigues http://doeidoei.wordpress.com On Tue, Dec 7, 2010 at 10:39 PM, George Devaniranjan wrote: > sorry message got sent before I finished. > > I can do this (connected.jpg) > > I want to do this (superimpose.jpg) > > > Thanks > George > > On Tue, Dec 7, 2010 at 4:14 PM, João Rodrigues wrote: > >> Hello George, >> >> From what I understood, what you want to be compare the diversity of two >> loops based on anchor points. >> >> Peter already pointed you to a great resource here >> . >> >> To make it clearer, you should first define the residues you want to align >> (or the particular atoms) and then perform the alignment. >> >> For example, I wanted to align a loop in my protein so I defined the >> *resi_range* variable to restrict the alignment to those residues. >> Furthermore, I then made sure that I only used CA atoms with two for loops. >> >> Check the code here (http://pastebin.com/V2UsGYGL), might help you >> understand how it works. >> >> Best! >> > > -------------- next part -------------- A non-text attachment was scrubbed... 
Name: loop.png Type: image/png Size: 11694 bytes Desc: not available URL: From devaniranjan at gmail.com Tue Dec 7 18:18:28 2010 From: devaniranjan at gmail.com (George Devaniranjan) Date: Tue, 7 Dec 2010 18:18:28 -0500 Subject: [Biopython] superimposing problem In-Reply-To: References:

Message-ID: Hello Peter, Thank you for your suggestion. I am using CA/C/N/O for superimposition of S1/S2 and E1/E2. Not sure I got exactly what you meant but I am going to think about it tonight and see if that makes sense. The idea that I am trying to see is ---if 2 coils from diff protein have approximately the same length (S1-E1 = S2-E2 distance) and have the same number of residues what is the "floppiness" or (no of conformations) that you can see. So as in my figure I sent (reattached) it does not matter where the intermediate residues are they can flop about. Thanks I will think about this more in the next few days. Regards, George On Tue, Dec 7, 2010 at 6:00 PM, Peter wrote: > On Tue, Dec 7, 2010 at 10:49 PM, George Devaniranjan > wrote: > > Thanks Jo?o for trying. > > I was also thinking that you would have to deform the 2nd coil to get > > exactly what I want. > > > > What I tried and not completly sussefully but I will try that agian is > as > > follows. > > > > 1) Superimpose 1st residue of coil1 with 1st residue of coil2, rot/tran > > coil2 and get the coordinates---lets call the rot/tran coil2 as > mod_coil2 > > > > 2) Superimpose last residue of coil1 with last residue of mod_coil2 and > then > > rot/tran whole molecule---the result is not exactly what I want but it is > > the only solution I can think of for now. > > Thanks once agin Jo?o, > > > > George > > Hi George, > > Hopefully I have understood your aim, and the following makes > sense... > > Have you thought about (in your head - no code needed) trying to > simultaneously try to superimpose the first residues AND the last > residues with a rigid body motion (rotation and translation)? > > Assuming you are using just the C-alpha atoms for this. Lets > call these atoms S1, E1 and S2, E2 (start and end). Also > suppose that the distance S1 to E1 is bigger than S2 to E2. > The superposition will give you S1, S2, E2, E1 on a line in space. > However, the relative rotation of the two loops is free. 
> > So, if using C-alpha atoms, I think you are going to have to include > at least one pair of atoms from the loop, e.g. M1 and M2 for the > C-alpha of the mid point residue. > > However, if you are using more than just the C-alpha atoms, that > should give enough constraints for the superposition to be unique. > It may still not be what you want, in which case try adding more > constraints as above. > > Peter > -------------- next part -------------- A non-text attachment was scrubbed... Name: superimpose.jpg Type: image/jpeg Size: 8723 bytes Desc: not available URL: From devaniranjan at gmail.com Tue Dec 7 19:11:43 2010 From: devaniranjan at gmail.com (George Devaniranjan) Date: Wed, 8 Dec 2010 00:11:43 +0000 Subject: [Biopython] superimposing problem In-Reply-To: References:

Message-ID: Oh now I see what you mean--ok what I did was actually convert all residues EXCEPT (if present that is) GLY and PRO to ALA So all residues in both structures I think end up being the same in the given example so then I tried what I suggested. I will try again tomorrow what Peter suggested. Thank you once again to you both for the help. George On Tue, Dec 7, 2010 at 11:33 PM, Jo?o Rodrigues wrote: > The problem here is that for that picture to be possible you have to have > exactly similar flanking residues. And in those structure of yours, it seems > you don't. > > Peter was trying to find a way of making sure that the superimposition was > unambiguous. In other words, that the loops superimposed in the only way > possible (rotation- and translation-wise). By having more than one point per > residue, as you are doing with those CA, C, N and O atoms, you're getting > the best superimposition possible. However, since the ends of the loop do > not match, you can never fit them as in the picture. > > I'm assuming you're generating these loops out of loop modelling program of > sampling program. Try adding dummy residues before and after the loops, say > two ALAs, and fix them during the procedure. If, on the other hand, you want > a good measure of displaying loop flexibility/variability, try superimposing > only the first residue. That should be unique enough to display the > different conformations. > > Sorry I can't be of much help... > > > Jo?o [...] Rodrigues > http://doeidoei.wordpress.com > > > > On Wed, Dec 8, 2010 at 12:18 AM, George Devaniranjan < > devaniranjan at gmail.com> wrote: > >> Hello Peter, >> Thank you for your suggestion. >> I am using CA/C/N/O for superimposition of S1/S2 and E1/E2. >> >> Not sure I got exactly what you meant but I am going to think about it >> tonight and see if that makes sense. 
>> >> The idea that I am trying to see is ---if 2 coils from diff protein have >> approximately the same length (S1-E1 = S2-E2 distance) and have the same >> number of residues what is the "floppiness" or (no of conformations) that >> you can see. >> >> So as in my figure I sent (reattached) it does not matter where the >> intermediate residues are they can flop about. >> Thanks I will think about this more in the next few days. >> Regards, >> George >> >> >> >> >> On Tue, Dec 7, 2010 at 6:00 PM, Peter wrote: >> >>> On Tue, Dec 7, 2010 at 10:49 PM, George Devaniranjan >>> wrote: >>> > Thanks Jo?o for trying. >>> > I was also thinking that you would have to deform the 2nd coil to get >>> > exactly what I want. >>> > >>> > What I tried and not completly sussefully but I will try that agian is >>> as >>> > follows. >>> > >>> > 1) Superimpose 1st residue of coil1 with 1st residue of coil2, rot/tran >>> > coil2 and get the coordinates---lets call the rot/tran coil2 as >>> mod_coil2 >>> > >>> > 2) Superimpose last residue of coil1 with last residue of mod_coil2 and >>> then >>> > rot/tran whole molecule---the result is not exactly what I want but it >>> is >>> > the only solution I can think of for now. >>> > Thanks once agin Jo?o, >>> > >>> > George >>> >>> Hi George, >>> >>> Hopefully I have understood your aim, and the following makes >>> sense... >>> >>> Have you thought about (in your head - no code needed) trying to >>> simultaneously try to superimpose the first residues AND the last >>> residues with a rigid body motion (rotation and translation)? >>> >>> Assuming you are using just the C-alpha atoms for this. Lets >>> call these atoms S1, E1 and S2, E2 (start and end). Also >>> suppose that the distance S1 to E1 is bigger than S2 to E2. >>> The superposition will give you S1, S2, E2, E1 on a line in space. >>> However, the relative rotation of the two loops is free. 
>>> >>> So, if using C-alpha atoms, I think you are going to have to include >>> at least one pair of atoms from the loop, e.g. M1 and M2 for the >>> C-alpha of the mid point residue. >>> >>> However, if you are using more than just the C-alpha atoms, that >>> should give enough constraints for the superposition to be unique. >>> It may still not be what you want, in which case try adding more >>> constraints as above. >>> >>> Peter >>> >> >> > From chapmanb at 50mail.com Wed Dec 8 08:12:12 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 8 Dec 2010 08:12:12 -0500 Subject: [Biopython] Access Entrez gene DB using rettype 'gb' In-Reply-To: <1291755149.22122.7.camel@W01209978> References:

<20101203134415.GG23468@sobchak.mgh.harvard.edu>

<1291755149.22122.7.camel@W01209978> Message-ID: <20101208131212.GO4621@sobchak.mgh.harvard.edu> David; > Thanks, Brad and Peter, for your solutions. I'm going to use a modified > version of Brad's Internet-based script to work with my genes of > interest. I'll keep your solution in mind, Peter, in case I need to use > the script offline in the future. Glad that those worked out for you. > One question that this has prompted (for me) is: Why not extend > BioPython to support queries that NCBI does not, itself, support? I am a > newcomer, but it seems like the Entrez model is simply a wrapper for the > NCBI interfaces and protocols. But there is potential for more. > > Is there any interest in boosting the Entrez model to support common > tasks like bacterial gene lookups? That sounds great. Biopython is completely open source and we welcome contributions of general purpose, tested code. Support for different tasks exists because generous folks had a need for that functionality and coded it in a reusable way. Entrez gene XML support would be a fairly big task, since there is quite a bit in there that might be of interest to different users, but definitely welcome. Another useful contribution would be a Cookbook entry describing the problem and solution. Since these are examples, they can be more specific than a library but also provide useful direction and help for others with similar needs: http://biopython.org/wiki/Category:Cookbook Thanks for the interest, Brad From biopython at maubp.freeserve.co.uk Wed Dec 8 08:57:44 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 8 Dec 2010 13:57:44 +0000 Subject: [Biopython] Access Entrez gene DB using rettype 'gb' In-Reply-To: <20101208131212.GO4621@sobchak.mgh.harvard.edu> References:

<20101203134415.GG23468@sobchak.mgh.harvard.edu>

<1291755149.22122.7.camel@W01209978> <20101208131212.GO4621@sobchak.mgh.harvard.edu> Message-ID: On Wed, Dec 8, 2010 at 1:12 PM, Brad Chapman wrote: > David; > >> Thanks, Brad and Peter, for your solutions. I'm going to use a modified >> version of Brad's Internet-based script to work with my genes of >> interest. I'll keep your solution in mind, Peter, in case I need to use >> the script offline in the future. > > Glad that those worked out for you. > >> One question that this has prompted (for me) is: Why not extend >> BioPython to support queries that NCBI does not, itself, support? I am a >> newcomer, but it seems like the Entrez model is simply a wrapper for the >> NCBI interfaces and protocols. But there is potential for more. >> >> Is there any interest in boosting the Entrez model to support common >> tasks like bacterial gene lookups? > > That sounds great. Biopython is completely open source and we > welcome contributions of general purpose, tested code. Support for > different tasks exists because generous folks had a need for that > functionality and coded it in a reusable way. > > Entrez gene XML support would be a fairly big task, since there is > quite a bit in there that might be of interest to different users, > but definitely welcome. What's wrong with the current Entrez XML parser for the gene database? Is this one of the corner cases where the NCBI are not currently returning "proper" XML with a DTD? > Another useful contribution would be a Cookbook entry describing the > problem and solution. Since these are examples, they can be more > specific than a library but also provide useful direction and help > for others with similar needs: > > http://biopython.org/wiki/Category:Cookbook Absolutely - if you can work out some standard useful combinations of the Entrez tools as cookbook form, they may form the basis of a potential higher level interface on top of the core Entrez API (I think Michael has commented on this before). 
Peter From chapmanb at 50mail.com Thu Dec 9 07:59:59 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 9 Dec 2010 07:59:59 -0500 Subject: [Biopython] Access Entrez gene DB using rettype 'gb' In-Reply-To: References: <20101203134415.GG23468@sobchak.mgh.harvard.edu>

<1291755149.22122.7.camel@W01209978> <20101208131212.GO4621@sobchak.mgh.harvard.edu> Message-ID: <20101209125959.GV4621@sobchak.mgh.harvard.edu> Peter; > What's wrong with the current Entrez XML parser for the gene > database? Is this one of the corner cases where the NCBI are > not currently returning "proper" XML with a DTD? That's right. The gene database doesn't give back the "native" XML format which Entrez.read deals with. Instead it's got custom XML output: http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html#gene Brad From mjldehoon at yahoo.com Thu Dec 9 08:38:34 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 9 Dec 2010 05:38:34 -0800 (PST) Subject: [Biopython] Access Entrez gene DB using rettype 'gb' In-Reply-To: <20101209125959.GV4621@sobchak.mgh.harvard.edu> Message-ID: <509327.39860.qm@web62404.mail.re1.yahoo.com> --- On Thu, 12/9/10, Brad Chapman wrote: > Peter wrote: > > What's wrong with the current Entrez XML parser for > the gene > > database? Is this one of the corner cases where the > NCBI are > > not currently returning "proper" XML with a DTD? > > That's right. The gene database doesn't give back the > "native" XML > format which Entrez.read deals with. Instead it's got > custom XML output: > > http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html#gene Really? What's the difference between "native" and "custom" XML? The Entrez Gene XML example from this linked gets parsed perfectly fine by Entrez.read. --Michiel. From cjfields at illinois.edu Thu Dec 9 08:49:04 2010 From: cjfields at illinois.edu (Chris Fields) Date: Thu, 9 Dec 2010 07:49:04 -0600 Subject: [Biopython] Access Entrez gene DB using rettype 'gb' In-Reply-To: <20101209125959.GV4621@sobchak.mgh.harvard.edu> References: <20101203134415.GG23468@sobchak.mgh.harvard.edu>

<1291755149.22122.7.camel@W01209978> <20101208131212.GO4621@sobchak.mgh.harvard.edu> <20101209125959.GV4621@sobchak.mgh.harvard.edu>
Message-ID: <5AB7B39D-1B1E-4CD1-9FDA-26F0CCBC66C9@illinois.edu>

On Dec 9, 2010, at 6:59 AM, Brad Chapman wrote:

> Peter;
>
>> What's wrong with the current Entrez XML parser for the gene
>> database? Is this one of the corner cases where the NCBI are
>> not currently returning "proper" XML with a DTD?
>
> That's right. The gene database doesn't give back the "native" XML
> format which Entrez.read deals with. Instead it's got custom XML
> output:
>
> http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html#gene
>
> Brad

Looks like it's just a straight conversion of their ASN.1 data. Ick.

chris

From matsen at fhcrc.org Thu Dec 9 20:00:19 2010
From: matsen at fhcrc.org (Erick Matsen)
Date: Thu, 9 Dec 2010 17:00:19 -0800
Subject: [Biopython] slashes in Stockholm format names are not properly parsed
Message-ID:

Hello there---

In StockholmIO.py, class StockholmIterator(AlignmentIterator), there is
a subroutine like so:

    def _identifier_split(self, identifier):
        """Returns (name,start,end) string tuple from an identier."""
        if identifier.find("/")!=-1:
            start_end = identifier.split("/",1)[1]
            if start_end.count("-")==1:
                start, end = map(int, start_end.split("-"))
                name = identifier.split("/",1)[0]
                return (name, start, end)
        return (identifier, None, None)

which splits off the start and end tag which gets attached onto the end
of the Stockholm sequence identifier. These identifiers look like:

myseq/4-9

By using split like the above, the above code has a problem when the seq
has a slash in the name. Given

my/seq/4-9

it will get split into "my" and "seq/4-9", which is not right.

An easy start to fixing the issue is to simply replace the above calls
to split with rsplit. A more complete solution may require regex?

The definition at
http://sonnhammer.sbc.su.se/Stockholm.html
doesn't state that slashes are illegal in names.
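
A minimal sketch of the rsplit idea, written as a standalone function rather than as the actual StockholmIO method. The name `identifier_split` and the try/except guard for non-numeric ranges are assumptions added for illustration only; they are not taken from Biopython.

```python
def identifier_split(identifier):
    """Return a (name, start, end) tuple from a Stockholm-style identifier.

    Hypothetical standalone helper: it splits on the LAST slash, so any
    slashes inside the sequence name itself are preserved.
    """
    if "/" in identifier:
        name, start_end = identifier.rsplit("/", 1)
        if start_end.count("-") == 1:
            try:
                start, end = map(int, start_end.split("-"))
            except ValueError:
                # Assumption: a non-numeric suffix such as "a-b" is treated
                # as part of the name, not as a coordinate range.
                return (identifier, None, None)
            return (name, start, end)
    return (identifier, None, None)


print(identifier_split("myseq/4-9"))
print(identifier_split("my/seq/4-9"))
```

With a split on the last slash, "my/seq/4-9" keeps "my/seq" as the name, and an identifier with no trailing range is returned unchanged.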
I'm using 1.55, and I didn't see it mentioned in bugzilla. Thank you for the great project. Erick From developer at allthingsprogress.com Thu Dec 9 23:36:29 2010 From: developer at allthingsprogress.com (David Jacobs) Date: Thu, 9 Dec 2010 23:36:29 -0500 Subject: [Biopython] Access Entrez gene DB using rettype 'gb' In-Reply-To: <20101208131212.GO4621@sobchak.mgh.harvard.edu> References:

<20101203134415.GG23468@sobchak.mgh.harvard.edu>

<1291755149.22122.7.camel@W01209978> <20101208131212.GO4621@sobchak.mgh.harvard.edu>
Message-ID:

> That sounds great. Biopython is completely open source and we
> welcome contributions of general purpose, tested code. Support for
> different tasks exists because generous folks had a need for that
> functionality and coded it in a reusable way.

Excellent, I'm generalizing my protocol and extracting functionality that I think could fit comfortably into a library. I'm also going to look at the cookbook option. I know that your example script was invaluable for me going forward with my current code, so I would be happy to add to the cookbook. Thanks again, everyone, for your help.

> Entrez gene XML support would be a fairly big task, since there is
> quite a bit in there that might be of interest to different users,
> but definitely welcome.

Hmm... I have a couple of ideas in mind. I'm going to see if I can't come up with a reasonable OO solution here. (I'm more into FP, but OO seems appropriate here.)

David

From biopython at maubp.freeserve.co.uk Fri Dec 10 05:50:31 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 10 Dec 2010 10:50:31 +0000
Subject: [Biopython] slashes in Stockholm format names are not properly parsed
In-Reply-To:
References:
Message-ID:

On Fri, Dec 10, 2010 at 1:00 AM, Erick Matsen wrote:
> Hello there---
>
>
> In StockholmIO.py, class StockholmIterator(AlignmentIterator), there is
> a subroutine like so:
>
>     def _identifier_split(self, identifier):
>         """Returns (name,start,end) string tuple from an identier."""
>         if identifier.find("/")!=-1:
>             start_end = identifier.split("/",1)[1]
>             if start_end.count("-")==1:
>                 start, end = map(int, start_end.split("-"))
>                 name = identifier.split("/",1)[0]
>                 return (name, start, end)
>         return (identifier, None, None)
>
> which splits off the start and end tag which gets attached onto the end
> of the Stockholm sequence identifier. These identifiers look like:
>
> myseq/4-9
>
> By using split like the above, the above code has a problem when the seq
> has a slash in the name. Given
>
> my/seq/4-9
>
> it will get split into "my" and "seq/4-9", which is not right.
>
> An easy start to fixing the issue is to simply replace the above calls
> to split with rsplit. A more complete solution may require regex?
>
> The definition at
> http://sonnhammer.sbc.su.se/Stockholm.html
> doesn't state that slashes are illegal in names.
>
>
> I'm using 1.55, and I didn't see it mentioned in bugzilla.
>
> Thank you for the great project.
>
> Erick

Hi Erick,

Your suggested change to use rsplit makes sense - I'm happy to commit that. Do you mind being thanked in the release notes and list of contributors?

Also, do you have a small real example of a Stockholm file with sequence identifiers with embedded slashes (for our test suite) - or is this a hypothetical problem you've identified?

Thank you,

Peter

From chapmanb at 50mail.com Fri Dec 10 08:10:26 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 10 Dec 2010 08:10:26 -0500
Subject: [Biopython] Access Entrez gene DB using rettype 'gb'
In-Reply-To: <509327.39860.qm@web62404.mail.re1.yahoo.com>
References: <20101209125959.GV4621@sobchak.mgh.harvard.edu> <509327.39860.qm@web62404.mail.re1.yahoo.com>
Message-ID: <20101210131026.GA4621@sobchak.mgh.harvard.edu>

Michiel, David and Peter;

> > That's right. The gene database doesn't give back the "native" XML
> > format which Entrez.read deals with. Instead it's got custom XML output:
> >
> > http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html#gene
>
> Really? What's the difference between "native" and "custom" XML? The
> Entrez Gene XML example from this linked gets parsed perfectly fine by
> Entrez.read.

Nice one.
I seriously underestimated the power of your code to deal with that nastiness. Here's an updated version that uses Entrez.read instead of the custom XML parsing with ElementTree:

https://gist.github.com/727625

    from Bio import Entrez

    def fetch_gene_coordinates(search_term):
        handle = Entrez.esearch(db="gene", term=search_term)
        rec = Entrez.read(handle)
        gene_id = rec["IdList"][0] # assuming best match works
        handle = Entrez.efetch(db="gene", id=gene_id, retmode="xml")
        rec = Entrez.read(handle)[0]
        gene_locus = rec["Entrezgene_locus"][0]
        region = gene_locus["Gene-commentary_seqs"][0]["Seq-loc_int"]["Seq-interval"]
        start = int(region["Seq-interval_from"]) + 1
        end = int(region["Seq-interval_to"]) + 1
        gi_id = region["Seq-interval_id"]["Seq-id"]["Seq-id_gi"]
        strand = region["Seq-interval_strand"]["Na-strand"].attributes["value"]
        return gi_id, start, end, strand

    def get_fasta_seq(gi_id, start, end, strand):
        strand = 2 if strand.lower() == "minus" else 1
        handle = Entrez.efetch(db="nucleotide", rettype="fasta", id=gi_id,
                               seq_start=start, seq_stop=end, strand=strand)
        return handle.read()

    Entrez.email = "yours at mail.com"
    search_term = "fliC ct18"
    gi_id, start, end, strand = fetch_gene_coordinates(search_term)
    print get_fasta_seq(gi_id, start, end, strand)

Brad

From jgrant at smith.edu Fri Dec 10 09:59:38 2010
From: jgrant at smith.edu (Jessica Grant)
Date: Fri, 10 Dec 2010 09:59:38 -0500
Subject: [Biopython] translating 454 data with frameshifts
Message-ID:

We have some transcriptome 454 data and quite simply we are trying to build a protein database from the nucleotide sequences. The problem comes in that there are quite a lot of frameshifts in our contig assemblies--and in the original sequences as well.

We have a list of the best blastx hit for each sequence, and I have tried

1 - blasting each sequence against its best hit
2 - taking the hsp_qseqs from the blast output
3 - sticking them together, in order, if there is more than one hsp.
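
Step 3 above is only safe when the HSPs do not overlap on the query. A rough sketch of checking for that before joining anything is given below. It assumes each HSP has already been reduced to a plain (query_start, query_end, frame) tuple; that layout and the function name `find_overlapping_hsps` are made up for illustration and are not a Biopython API.

```python
def find_overlapping_hsps(hsps):
    """Return pairs of HSPs whose query ranges overlap.

    hsps: iterable of (query_start, query_end, frame) tuples with
    1-based inclusive coordinates (hypothetical layout for this sketch).
    """
    # Normalise so that start <= end, then sort by start position.
    spans = sorted((min(s, e), max(s, e), f) for s, e, f in hsps)
    overlaps = []
    for i, first in enumerate(spans):
        for second in spans[i + 1:]:
            if second[0] > first[1]:
                # Sorted by start, so nothing later can overlap `first`.
                break
            overlaps.append((first, second))
    return overlaps


example = [(700, 900, 2), (1, 300, 1), (250, 600, 2)]
for a, b in find_overlapping_hsps(example):
    print("overlap:", a, b)
```

Any pair reported here would need trimming or a closer look rather than direct concatenation; HSPs that are not reported can be joined in sorted order.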
This has worked for many of the sequences but sometimes there are overlapping "best hsp_qseqs" and when I stick them together I get a long made-up protein. Also, for some sequences, the qseq goes past the point where the alignment should stop and then when I stick them together I get a few extra amino acids in my protein that ought not to be there.

Frank Kauff told me that bioperl has a "tile_hsp" function, but before I try understanding how that works in a language I am not familiar with, I thought I would ask here to see if anyone knows of a way to do this in python.

Is there a smart way to concatenate hsps in biopython? Does anyone have a better idea about how to build a protein database from 454 data?

Thank you!

Jessica

From Tony.Heitkam at tu-dresden.de Fri Dec 10 14:30:44 2010
From: Tony.Heitkam at tu-dresden.de (Tony Heitkam)
Date: Fri, 10 Dec 2010 20:30:44 +0100
Subject: [Biopython] translating 454 data with frameshifts
In-Reply-To:
References:
Message-ID: <20101210203044.hi9bxhfm044gsw8k@mail.zih.tu-dresden.de>

Hello Jessica,

I am not a programmer and can't help you with a python equivalent to "tile_hsp", but as far as I can tell, the GeneWise tool might be helpful to you.

http://www.ebi.ac.uk/Tools/Wise2/index.html

There is a standalone version available for download which can be used for a whole batch of sequences. It aligns DNA sequences to protein queries (which you would get by blastp) and also accounts for frameshifts!

Best of luck,
Tony

> Message: 5
> Date: Fri, 10 Dec 2010 09:59:38 -0500
> From: Jessica Grant
> Subject: [Biopython] translating 454 data with frameshifts
> To: biopython at biopython.org
> Message-ID:
> Content-Type: text/plain; charset="us-ascii" ; format="flowed"
>
> We have some transcriptome 454 data and quite simply we are trying to
> build a protein database from the nucleotide sequences. The problem
> comes in that there are quite a lot of frameshifts in our contig
> assemblies--and in the original sequences as well.
> > We have a list of the best blastx hit for each sequence, and I have tried > > 1 - blasting each sequence against its best hit > 2 - taking the hsp_qseqs from the blast output > 3 - sticking them together, in order, if there is more than one hsp. > > > This has worked for many of the sequences but sometimes there are > overlapping "best hsp_qseqs" and when I stick them together I get a > long made-up protein. Also, for some sequences, the qseq goes past > the point where the alignment should stop and then when I stick them > together I get a few extra amino acids in my protein that ought not > to be there. > > Frank Kauff told me that bioperl has a "tile_hsp" function, but > before I try understanding how that works in a language I am not > familiar with, I thought I would ask here to see if anyone knows of a > way to do this in python. > > Is there a smart way to concatenate hsps in biopython? Does anyone > have a better idea about how to build a protein database from 454 > data? > > Thank you! > > Jessica > From biopython at maubp.freeserve.co.uk Sun Dec 12 14:03:05 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 12 Dec 2010 19:03:05 +0000 Subject: [Biopython] slashes in Stockholm format names are not properly parsed In-Reply-To: References: <5626_1291978234_4D0205FA_5626_1569715_1_729280825.8198.1291978233697.JavaMail.root@zimbra4.fhcrc.org> Message-ID: On Sun, Dec 12, 2010 at 6:12 PM, Erick Matsen wrote: > Hello Peter-- > > >> Your suggested change to use rsplit makes sense - I'm >> happy to commit that. Do you mind being thanked in the >> release notes and list of contributors? > > Sure! I'll do that, the code is already done: https://github.com/biopython/biopython/commit/e332571bc77e40bc99599648bca60e6c76e4b455 >> Also, do you have a small real example of a Stockholm file >> with sequence identifiers with embedded slashes (for our >> test suite) - or is this a hypothetical problem you've identified? > > Attached. Thanks, I'll take a look. 
> This is a very minor note, but the README links don't all work on GH, > because of colons. I tried to sort it out, but you'll have to change > something. See: > http://support.github.com/discussions/site/2281-leading-colons-included-in-links Huh - I'd never noticed that. The README file is using some kind of markup (not sure what off hand), and as far as I know it won't break anything if we change it. Peter From alan.mcculloch at agresearch.co.nz Sun Dec 12 16:57:15 2010 From: alan.mcculloch at agresearch.co.nz (McCulloch, Alan) Date: Mon, 13 Dec 2010 10:57:15 +1300 Subject: [Biopython] translating 454 data with frameshifts In-Reply-To: References: Message-ID: <18DF7D20DFEC044098A1062202F5FFF3313C6729C3@exchsth.agresearch.co.nz> Hi Jessica * There are some packages out there which combine blast and "de-novo" HMM (e.g. ESTScan) evidence to do translations of transcript contigs - prot4EST is a python based one (blast + ESTScan) I think there have been others published. * I have also combined blastx and ESTScan evidence, as follows : 1. blastx contigs against NR protein, recording top (say) 10 hits (*not* using -w option - see below) 2. For those sequences where all HSPs in the same frame, conclude that there are no frameshift errors, and translate by picking the longest ORF in the same frame as and overlapping the hsps, and translate. 3. For those seqs with hsps in > 1 frame, conclude that there are frameshift errors and use ESTScan, which includes these in its model 4. Confirm translations via annotation using blastp against NR (I have some python code for bits of this happy to share if useful) ( no use for unknowns obviously - only option for these is something like ESTScan) * Have you tried using the -w option of blastx ? (Frame shift penalty (OOF algorithm for blastx)) - blastx may be able to figure out the frameshift errors for you and generate a single merged alignment, using this option. We have had fairly good luck with -w 20. 
In order to reduce the chances of alignments with spurious frameshifts, you could try using blastx -w in step 3 above, as an alternative to ESTScan - i.e. where you then already know there are frameshifts, or you could use to check ESTScan predictions Cheers AMcC -----Original Message----- From: biopython-bounces at lists.open-bio.org [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Jessica Grant Sent: Saturday, 11 December 2010 4:00 a.m. To: biopython at biopython.org Subject: [Biopython] translating 454 data with frameshifts We have some transcriptome 454 data and quite simply we are trying to build a protein database from the nucleotide sequences. The problem comes in that there are quite a lot of frameshifts in our contig assemblies--and in the original sequences as well. We have a list of the best blastx hit for each sequence, and I have tried 1 - blasting each sequence against its best hit 2 - taking the hsp_qseqs from the blast output 3 - sticking them together, in order, if there is more than one hsp. This has worked for many of the sequences but sometimes there are overlapping "best hsp_qseqs" and when I stick them together I get a long made-up protein. Also, for some sequences, the qseq goes past the point where the alignment should stop and then when I stick them together I get a few extra amino acids in my protein that ought not to be there. Frank Kauff told me that bioperl has a "tile_hsp" function, but before I try understanding how that works in a language I am not familiar with, I thought I would ask here to see if anyone knows of a way to do this in python. Is there a smart way to concatenate hsps in biopython? Does anyone have a better idea about how to build a protein database from 454 data? Thank you! 
Jessica _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython ======================================================================= Attention: The information contained in this message and/or attachments from AgResearch Limited is intended only for the persons or entities to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipients is prohibited by AgResearch Limited. If you have received this message in error, please notify the sender immediately. ======================================================================= From bjorn_johansson at bio.uminho.pt Wed Dec 15 05:22:37 2010 From: bjorn_johansson at bio.uminho.pt (=?ISO-8859-1?Q?Bj=F6rn_Johansson?=) Date: Wed, 15 Dec 2010 10:22:37 +0000 Subject: [Biopython] Sequence assembly in (bio)python In-Reply-To: References:

Message-ID: Hi, My purpose is really to try to adapt the software for a different purpose. I would like to see if an assembler program can be used to predict the outcome of homologous recombination between DNA fragments that share small stretches of homology. This is a method that I ofetn use to clone DNA in the lab, but there is afaik no software where the assembled molecule can be predicted. Thanks for the reply! /bjorn On Fri, Dec 3, 2010 at 09:21, Peter Cock wrote: > 2010/12/3 Bj?rn Johansson : > > Hi, > > I wonder if there is a sequence assembler (like phred, cap3) > > implemented in python? > > Possibly, people have used Perl for this (!). > > > I am working on a small utility to assemble a handful of sequences > > and in this case I think that a standalone assembler might be > > overkill, and I would like to tweak the parameters easily. > > > > > Alternatively, is there (bio)python bindings for any assembler > > program? I could not find any in biopython. > > Do you mean binding in the sense of a programming API? There > is pysam which is a Python wrapper for the samtools C-API. > > If you mean bindings in the sense of a command line tool wrapper, > there is one for NovoAlign in Bio.Sequencing.Applications, and > others could be added. I did wonder about writing one for MIRA > but concluded it would be a never ending task since it is under > such active development. Other assemblers should be > easier is they have a manageable number of options. > > Regards, > > Peter > -- ______O_________oO________oO______o_______oO__ Bj?rn Johansson Assistant Professor Departament of Biology University of Minho Campus de Gualtar 4710-057 Braga PORTUGAL http://www.bio.uminho.pt http://sites.google.com/site/bjornhome Work (direct) +351-253 601517 Private mob. 
+351-967 147 704 Dept of Biology (secretariate) +351-253 60 4310 Dept of Biology (fax) +351-253 678980 From david.koppstein at gmail.com Fri Dec 17 16:06:35 2010 From: david.koppstein at gmail.com (David Koppstein) Date: Fri, 17 Dec 2010 16:06:35 -0500 Subject: [Biopython] problem building with pip Message-ID: <4D0BD0DB.9040003@gmail.com> Hio, First time biopythoner here. I just wanted to mention that installing with pip fails with the following error. I was able to fix it by commenting out line 546 of req.py, which lives in $YOUR_LIBRARY_DIRECTORY/python/2.7.1/lib/python2.7/site-packages/pip-0.8.2-py2.7.egg/pip/ I think it would be nice to include the "--single-version-externally-managed" option in the install script so that biopython can be installed using pip. Looking forward to using biopython (and for python3 support whenever numpy and scipy get their act together...)! Best, David ----------------- # req.py ...... def install(self, install_options, global_options=()): if self.editable: self.install_editable(install_options, global_options) return temp_location = tempfile.mkdtemp('-record', 'pip-') record_filename = os.path.join(temp_location, 'install-record.txt') try: install_args = [ sys.executable, '-c', "import setuptools;__file__=%r;"\ "execfile(__file__)" % self.setup_py] +\ list(global_options) + [ 'install', #'--single-version-externally-managed', # commented line '--record', record_filename] ---------------- # error message Running setup.py install for biopython usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...] or: -c --help [cmd1 cmd2 ...] 
or: -c --help-commands or: -c cmd --help error: option --single-version-externally-managed not recognized Complete output from command /usr/local/Cellar/python/2.7.1/bin/python -c "import setuptools;__file__='/usr/local/var/pip/build/biopython/setup.py';execfile(__file__)" install --single-version-externally-managed --record /var/folders/Oa/OajmBH+JHBaJJ5JEOfVFSE+++TI/-Tmp-/pip-jjEn53-record/install-record.txt: usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...] or: -c --help [cmd1 cmd2 ...] or: -c --help-commands or: -c cmd --help error: option --single-version-externally-managed not recognized ---------------------------------------- Command /usr/local/Cellar/python/2.7.1/bin/python -c "import setuptools;__file__='/usr/local/var/pip/build/biopython/setup.py';execfile(__file__)" install --single-version-externally-managed --record /var/folders/Oa/OajmBH+JHBaJJ5JEOfVFSE+++TI/-Tmp-/pip-jjEn53-record/install-record.txt failed with error code 1 Storing complete log in /usr/local/var/pip/pip.log -- David Koppstein MIT Biology Graduate Student, Bartel Lab Whitehead Institute, Room 629a 9 Cambridge Center Cambridge, MA 02142 Cell: (609) 933-3952 From biopython at maubp.freeserve.co.uk Fri Dec 17 17:53:28 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 17 Dec 2010 22:53:28 +0000 Subject: [Biopython] problem building with pip In-Reply-To: <4D0BD0DB.9040003@gmail.com> References: <4D0BD0DB.9040003@gmail.com> Message-ID: On Fri, Dec 17, 2010 at 9:06 PM, David Koppstein wrote: > Hio, > > First time biopythoner here. I just wanted to mention that installing with > pip fails with the following error. ... Hello and welcome, I'm not familiar with pip, and it isn't the recommended way to install Biopython, but hopefully one of our developers can take a look. Brad? > Looking forward to using biopython (and for python3 support whenever > numpy and scipy get their act together...)! 
Numpy is already officially supporting Python 3.1, and I understand work is underway for SciPy. For Biopython, we don't need SciPy, so we can't use that as an excuse. You can already try Biopython on Python 3.1 if you want to help, see our README file for details. Regards, Peter From mjldehoon at yahoo.com Fri Dec 17 21:35:44 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 17 Dec 2010 18:35:44 -0800 (PST) Subject: [Biopython] code to be removed in the next release Message-ID: <410220.36661.qm@web62406.mail.re1.yahoo.com> Hi everybody, The following parts of Biopython were deprecated in release 1.53, and are scheduled to be removed in the next release of Biopython. Please let us know if you have any objections against removal. Bio.ExPASy the cgi argument in get_sprot_raw Bio.GFF (old Bio.GFF for access to a MySQL GFF database) Bio.Motif.Parsers.AlignAce CompareAceParser CompareAceScanner CompareAceConsumer Bio.SubsMat: mat_type keyword in SeqMat SeqMat.letter_sum SeqMat.all_letters_sum Best, --Michiel. From chapmanb at 50mail.com Sat Dec 18 14:40:04 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Sat, 18 Dec 2010 14:40:04 -0500 Subject: [Biopython] problem building with pip In-Reply-To: References: <4D0BD0DB.9040003@gmail.com> Message-ID: <20101218194004.GA4572@kunkel> David and Peter; > On Fri, Dec 17, 2010 at 9:06 PM, David Koppstein > wrote: > > Hio, > > > > First time biopythoner here. I just wanted to mention that installing with > > pip fails with the following error. ... > > error: option --single-version-externally-managed not recognized David, thanks for the report. That flag is added in the 'setuptools' and 'distribute' code bases, but not in 'distutils,' which Biopython uses. The world of python packaging and best practices is a bit murky to me, but distutils is included with Python itself, so it seems no harm in continuing to use this in Biopython. 
Looking at the distribute source code, this flag forces installation the distutils way which is what we do by default: https://bitbucket.org/tarek/distribute/src/4ab9b96dc540/setuptools/command/install.py So I added in the fix to Biopython's setup.py which handles the flag without failing and doesn't need to do any additional work: https://github.com/biopython/biopython/commit/4e4c3aa1df3d6848afed4783c4e54186a751accd Peter, I didn't re-roll 1.56 tarballs with the fix but we could do if lots of people are running into a wall with pip. Thanks again for reporting the issue, Brad From mjldehoon at yahoo.com Tue Dec 28 22:24:23 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 28 Dec 2010 19:24:23 -0800 (PST) Subject: [Biopython] Bio.trie Message-ID: <524070.34823.qm@web62404.mail.re1.yahoo.com> Hi everybody, Over the past couple of days we have updated all C extensions in Biopython, except for Bio.trie, to be ready for Python 3. Bio.trie would need some significant work to make it ready for Python 3 due to changes in the buffer protocol. We would like to know though how many users Bio.trie has, so we can decide whether it is worthwhile to update this module. If you are using Bio.trie, please let us know (preferably via the mailing list). If there are no current users, I suggest that we deprecate and later remove this module from Biopython. Best wishes for 2011, --Michiel. From ruchira.datta at gmail.com Tue Dec 28 23:40:08 2010 From: ruchira.datta at gmail.com (Ruchira Datta) Date: Tue, 28 Dec 2010 20:40:08 -0800 Subject: [Biopython] Bio.trie In-Reply-To: <524070.34823.qm@web62404.mail.re1.yahoo.com> References: <524070.34823.qm@web62404.mail.re1.yahoo.com> Message-ID: Hi Michiel, I would like to be able to continue using Bio.trie. Thanks, --Ruchira On Tue, Dec 28, 2010 at 7:24 PM, Michiel de Hoon wrote: > Hi everybody, > > Over the past couple of days we have updated all C extensions in Biopython, > except for Bio.trie, to be ready for Python 3. 
> > Bio.trie would need some significant work to make it ready for Python 3 due > to changes in the buffer protocol. We would like to know though how many > users Bio.trie has, so we can decide whether it is worthwhile to update this > module. If you are using Bio.trie, please let us know (preferably via the > mailing list). If there are no current users, I suggest that we deprecate > and later remove this module from Biopython. > > Best wishes for 2011, > > --Michiel. > > > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From mjldehoon at yahoo.com Wed Dec 29 22:34:39 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 29 Dec 2010 19:34:39 -0800 (PST) Subject: [Biopython] Bio.trie In-Reply-To: Message-ID: <334926.66463.qm@web62406.mail.re1.yahoo.com> Hi Ruchira, Thanks for your mail. Could you give us some example of how you're using Bio.trie? I am wondering if there are some alternatives in Python, NumPy or SciPy. If not, would you be willing to make the necessary changes to Bio.trie to make it ready for Python 3? Best, --Michiel. --- On Tue, 12/28/10, Ruchira Datta wrote: From: Ruchira Datta Subject: Re: [Biopython] Bio.trie To: "Michiel de Hoon" Cc: biopython at biopython.org Date: Tuesday, December 28, 2010, 11:40 PM Hi Michiel, I would like to be able to continue using Bio.trie. Thanks, --Ruchira On Tue, Dec 28, 2010 at 7:24 PM, Michiel de Hoon wrote: Hi everybody, Over the past couple of days we have updated all C extensions in Biopython, except for Bio.trie, to be ready for Python 3. Bio.trie would need some significant work to make it ready for Python 3 due to changes in the buffer protocol. We would like to know though how many users Bio.trie has, so we can decide whether it is worthwhile to update this module. If you are using Bio.trie, please let us know (preferably via the mailing list). 
If there are no current users, I suggest that we deprecate and later remove this module from Biopython. Best wishes for 2011, --Michiel. From kpguy1975 at gmail.com Thu Dec 30 12:22:31 2010 From: kpguy1975 at gmail.com (Vikram K) Date: Thu, 30 Dec 2010 12:22:31 -0500 Subject: [Biopython] ncbi entrez eutils Message-ID: have been investigating the ncbi entrez eutils feature. Consider the esearch eutils. i typed this on my python shell: >>> from Bio import Entrez >>> handle = Entrez.esearch(db="nucleotide", retmax=10, term="Opuntia") >>> record = Entrez.read(handle) >>> print record["Count"] 427 >>> print record["IdList"] ['186461630', '71390237', '71390236', '297381040', '297381039', '297381038', '297381037', '297381036', '297381035', '284178681'] >>> print record {u'Count': '427', u'RetMax': '10', u'IdList': ['186461630', '71390237', '71390236', '297381040', '297381039', '297381038', '297381037', '297381036', '297381035', '284178681'], u'TranslationStack': [{u'Count': '204', u'Field': 'Organism', u'Term': '"Opuntia"[Organism]', u'Explode': 'Y'}, {u'Count': '427', u'Field': 'All Fields', u'Term': 'Opuntia[All Fields]', u'Explode': 'Y'}, 'OR', 'GROUP'], u'TranslationSet': [{u'To': '"Opuntia"[Organism] OR Opuntia[All Fields]', u'From': 'Opuntia'}], u'RetStart': '0', u'QueryTranslation': '"Opuntia"[Organism] OR Opuntia[All Fields]'} >>> print record["TranslationStack"] [{u'Count': '204', u'Field': 'Organism', u'Term': '"Opuntia"[Organism]', u'Explode': 'Y'}, {u'Count': '427', u'Field': 'All Fields', u'Term': 'Opuntia[All Fields]', u'Explode': 'Y'}, 'OR', 'GROUP'] >>> When i go to NCBI Entrez Nucleotide and type Opuntia in the search space i get a list of 427 entries. The first 10 of these have gi numbers corresponding to the numbers given by record["IdList"].
What information is record["Count"]--the value 427-- giving? Also, what is the count value 204 which shows up when you output record["TranslationStack"]? Finally, is it correct to say that NCBI eutils are all APIs? Further, should Biopython also be considered as an API? Thanks. Vikram From biopython at maubp.freeserve.co.uk Thu Dec 30 13:29:33 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 30 Dec 2010 18:29:33 +0000 Subject: [Biopython] ncbi entrez eutils In-Reply-To: References: Message-ID: On Thu, Dec 30, 2010 at 5:22 PM, Vikram K wrote: > have been investigating the ncbi entrez eutils feature. Consider the > esearch eutils. i typed this on my python shell: > >>>> from Bio import Entrez >>>> handle = Entrez.esearch(db="nucleotide", retmax=10, term="Opuntia") > >>>> record = Entrez.read(handle) > >>>> print record["Count"] > 427 >>>> print record["IdList"] > ['186461630', '71390237', '71390236', '297381040', '297381039', '297381038', > '297381037', '297381036', '297381035', '284178681'] >>>> print record > {u'Count': '427', u'RetMax': '10', u'IdList': ['186461630',... } >>>> print record ["TranslationStack"] > [{u'Count': '204', u'Field': 'Organism', u'Term': '"Opuntia"[Organism]', > u'Explode': 'Y'}, {u'Count': '427', u'Field': 'All Fields', u'Term': > 'Opuntia[All Fields]', u'Explode': 'Y'}, 'OR', 'GROUP'] >>>> > > When i go to NCBI Entrez Nucleotide and type Opuntia in the search space i > get a list of 427 entries. The first 10 of these have gi numbers > corresponding to the numbers given by record["IdList"]. That sounds as expected. Using the webpage and using the API via Biopython should give the same results. > > What information is record["Count"]--the value 427-- giving? > The number of records matching your search was 427 (however you used retmax to only get the first 10). > > Also, what is the count value 204 which shows up when you output > record["TranslationStack"]? > I don't know - I've never looked at it.
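The difference between record["Count"] and the length of record["IdList"] can be sketched offline. This is only a minimal illustration, written for Python 3 rather than the Python 2 shell used in the thread: the dictionary is copied from the esearch output quoted above instead of being fetched from NCBI, the helper name summarize_esearch is hypothetical, and reading each "Count" inside "TranslationStack" as the hit count for that single search term is an assumption based on the structure of the output, not something confirmed in this thread.

```python
# Offline sketch (Python 3): the dictionary is copied from the esearch
# output quoted in this thread, so no request is sent to NCBI.
record = {
    "Count": "427",
    "RetMax": "10",
    "RetStart": "0",
    "IdList": [
        "186461630", "71390237", "71390236", "297381040", "297381039",
        "297381038", "297381037", "297381036", "297381035", "284178681",
    ],
    "TranslationStack": [
        {"Count": "204", "Field": "Organism",
         "Term": '"Opuntia"[Organism]', "Explode": "Y"},
        {"Count": "427", "Field": "All Fields",
         "Term": "Opuntia[All Fields]", "Explode": "Y"},
        "OR",
        "GROUP",
    ],
}


def summarize_esearch(result):
    """Return (total hits, IDs actually returned, per-term counts).

    The name and return shape are hypothetical, chosen for this sketch.
    """
    total_hits = int(result["Count"])      # all records matching the query
    ids_returned = len(result["IdList"])   # capped by retmax
    per_term = {
        item["Term"]: int(item["Count"])
        for item in result["TranslationStack"]
        if isinstance(item, dict)          # skip the "OR" / "GROUP" strings
    }
    return total_hits, ids_returned, per_term


total_hits, ids_returned, per_term = summarize_esearch(record)
print(total_hits, ids_returned)
for term, hits in per_term.items():
    print(term, hits)
```

Under that assumption, 204 would be the number of records matching "Opuntia"[Organism] on its own, while 427 is the number matching the full translated query.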
> > Finally, is it correct to say that NCBI eutils are all APIs? > Further, should Biopython also be considered as an API? > Yes, i think both could be regarded as APIs. Peter From david.koppstein at gmail.com Fri Dec 31 18:33:33 2010 From: david.koppstein at gmail.com (David Koppstein) Date: Fri, 31 Dec 2010 18:33:33 -0500 Subject: [Biopython] eprimer3 and primer3 incompatibility Message-ID: <4D1E684D.9070007@gmail.com> Hi, I noticed today, while trying to work with Bio.Emboss.Applications.Primer3CommandLine, that there is an incompatibility in the TAG format between the current version of Emboss's eprimer3 (6.3.1) and the current version of primer3_core (2.2.3). The EMBOSS developers apparently know about this: http://web.archiveorange.com/archive/v/yWYDQkVd25Rxx2EAVunh but haven't yet fixed it, unless you want to download a c program and recompile manually. If and when they do fix it, at some point the tag lists will probably have to be updated for the Biopython module. In the meantime, I am using the old primer3_core (1.1.4) which should interface with eprimer3 just fine. Would it make sense, however, to have a Biopython module that interfaces directly with primer3_core, rather than going through eprimer3? Happy New Year! 
David -- David Koppstein MIT Biology Graduate Student, Bartel Lab Whitehead Institute, Room 621 9 Cambridge Center Cambridge, MA 02142 Cell: (609) 933-3952 From biopython at maubp.freeserve.co.uk Fri Dec 31 19:05:52 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 1 Jan 2011 00:05:52 +0000 Subject: [Biopython] eprimer3 and primer3 incompatibility In-Reply-To: <4D1E684D.9070007@gmail.com> References: <4D1E684D.9070007@gmail.com> Message-ID: On Fri, Dec 31, 2010 at 11:33 PM, David Koppstein wrote: > Hi, > > I noticed today, while trying to work with > Bio.Emboss.Applications.Primer3CommandLine, that there is an incompatibility > in the TAG format between the current version of Emboss's eprimer3 (6.3.1) > and the current version of primer3_core (2.2.3). > > The EMBOSS developers apparently know about this: > > http://web.archiveorange.com/archive/v/yWYDQkVd25Rxx2EAVunh > > but haven't yet fixed it, unless you want to download a c program and > recompile manually. Oh yeah, it looks like it was me that reported this to them back in April 2010 (linked to in the thread you found): http://www.mail-archive.com/emboss at lists.open-bio.org/msg01405.html > If and when they do fix it, at some point the tag lists will probably have > to be updated for the Biopython module. Possibly - if EMBOSS change their command line arguments. > In the meantime, I am using the old > primer3_core (1.1.4) which should interface with eprimer3 just fine. Would > it make sense, however, to have a Biopython module that interfaces directly > with primer3_core, rather than going through eprimer3? Maybe. I've not looked at the native command line API of primer3_core to form an opinion. > Happy New Year! > David You too, Peter From dejmail at gmail.com Wed Dec 1 15:57:39 2010 From: dejmail at gmail.com (Liam Thompson) Date: Wed, 1 Dec 2010 17:57:39 +0200 Subject: [Biopython] fasta fail In-Reply-To: References:

Message-ID: Hi Peter Apologies, it seems each of the records in the list is a SeqRecord. >>> type(final_seq[0]) >>> type(final_seq) SeqIO seems to process the sequence fine, it just can't seem to write it. thanks Liam On 1 December 2010 17:44, Peter wrote: > On Wed, Dec 1, 2010 at 3:13 PM, Liam Thompson wrote: > > hi everyone > > > > I have a list of sequences that I want to write to file in fasta format. > > This is easy enough, however I keep getting an error which I can't fix. > > > > SeqIO.write(final_seq, out_handle, "fasta") > > ... > > TypeError: SeqRecord (id=FN545840.1) has an invalid sequence. > > > > There is nothing wrong with the record, as far as I can see as I have > > written to and extraced from it numerous times as a fasta entry There is > > definitely sequence and other basic information on the record, which I > can > > see from a simple print(final_seq[x].seq) The list of sequences is a > "list" > > as opposed to a "SeqRecord", so I thought this could be problem ? Is > there a > > way to convert a list to a SeqRecord or is this not necessary ? > > What is your final_seq object? > > You should have a list of SeqRecord objects (or in recent versions > of Biopython you can also give SeqIO.write a single SeqRecord). > Each SeqRecord's seq property should be a Seq object (or similar). > > Peter > From biopython at maubp.freeserve.co.uk Wed Dec 1 16:06:28 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 1 Dec 2010 16:06:28 +0000 Subject: [Biopython] fasta fail In-Reply-To: References:

Message-ID: On Wed, Dec 1, 2010 at 3:57 PM, Liam Thompson wrote: > Hi Peter > > Apologies, it seems each of the records in the list is a SeqRecord. > >>>> type(final_seq[0]) > > >>>> type(final_seq) > > > SeqIO seems to process the sequence fine, it just can't seem > to write it. > > thanks > Liam If final_seq is a list of SeqRecord objects, then Bio.SeqIO.write should be able to save it - assuming they all have sequences. What does this do?

for record in final_seq:
    print record.id, type(record.seq)

Peter From dejmail at gmail.com Thu Dec 2 10:33:23 2010 From: dejmail at gmail.com (Liam Thompson) Date: Thu, 2 Dec 2010 12:33:23 +0200 Subject: [Biopython] fasta fail In-Reply-To: References:

Message-ID: Hi Peter Not sure what the problem was, but I just did it using regular expressions as opposed to SeqIO. Thanks Liam From biopython at maubp.freeserve.co.uk Thu Dec 2 10:39:33 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 2 Dec 2010 10:39:33 +0000 Subject: [Biopython] fasta fail In-Reply-To: References:

Message-ID: On Thu, Dec 2, 2010 at 10:33 AM, Liam Thompson wrote: > Hi Peter > > Not sure what the problem was, but I just did it using regular expressions > as opposed to SeqIO. > > Thanks > Liam If you did have a self-contained example showing the failure I would still like to see it (email me directly with any attachments rather than the mailing list) to find out what went wrong, but I'm glad you've solved your immediate task. Peter From developer at allthingsprogress.com Thu Dec 2 20:42:09 2010 From: developer at allthingsprogress.com (David Jacobs) Date: Thu, 2 Dec 2010 15:42:09 -0500 Subject: [Biopython] Access Entrez gene DB using rettype 'gb' Message-ID: I want to do something obvious but can't find a good way to do it. Maybe I'm looking in the wrong places. Anyway, I figured I'd ask here. (Bear with me, I'm new to Python and Biopython.) My question is: What's the easiest way to find and parse DNA sequences from the gene database? I'd like to use something like: handle = Entrez.efetch(db='gene', id='2', rettype='gb') handle.read() But this doesn't work. After poking around, I've learned you can do this query on the nucleotide database, but not on the gene database. Instead, I have to do this: handle = Entrez.efetch(db='gene', id='2', retmode='gb') I get back something like this: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=2&retmode=gb That isn't easily parseable, at least as far as I can tell. So what's the best way for me to find my sequence? And is there a parser for the string I get from retmode='gb'? Thanks, David From sdavis2 at mail.nih.gov Thu Dec 2 21:07:14 2010 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 2 Dec 2010 16:07:14 -0500 Subject: [Biopython] Access Entrez gene DB using rettype 'gb' In-Reply-To: References: Message-ID: On Thu, Dec 2, 2010 at 3:42 PM, David Jacobs < developer at allthingsprogress.com> wrote: > I want to do something obvious but can't find a good way to do it.
Maybe > I'm > looking in the wrong places. Anyway, I figured I'd ask here. (Bear with me, > I'm new to Python and Biopython.) > > My question is: What's the easiest way to find and parse DNA sequences from > the gene database? > > I'd like to use something like: > > handle = Entrez.efetch(db='gene', id='2', rettype='gb') > handle.read() > > But this doesn't work. After poking around, I've learned you can do this > query on, the nucleotide database. But not on the gene database. Instead, I > have to do this: > > handle = Entrez.efetch(db='gene', id='2', retmode='gb') > > I get back something like this: > > > http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=2&retmode=gb > > That isn't easily parseable, at least as far as I can tell. So what's the > best way for me to find my sequence? And is there a parser for the string I > get from retmode='gb'? > > Hi, David. Genes (in the sense used in Entrez Gene) do not have sequences. Their respective transcripts do, however, and there can be, in general, multiple transcripts per gene. Therefore, I think you would have to do a query for the gene of interest and then link to nucleotide to get the sequences for the associated transcripts. If you want to do this for many genes, it may be easier to download the entire refseq collection for your species of interest and simply load stuff into memory or index the fasta file. Sean From developer at allthingsprogress.com Thu Dec 2 21:42:19 2010 From: developer at allthingsprogress.com (David Jacobs) Date: Thu, 2 Dec 2010 16:42:19 -0500 Subject: [Biopython] Access Entrez gene DB using rettype 'gb' In-Reply-To: References:

Message-ID: Hi Sean, Thanks for the info. I didn't realize the gene database wasn't concerned with sequences. (The distinction isn't so clear when you're using the web interface.) So now I'm trying to query nucleotide. My scripting approach has been: 1. Get list of gene names from a file 2. Query nucleotide for gene ID 3. Use that gene ID to download the proper nucleotide entry However, every time I get an ID from nucleotide, it's for an entire genome. How can I specify either a) a specific gene (as identified in the gene database) or b) a specific region of the genome? David On Thu, Dec 2, 2010 at 4:07 PM, Sean Davis wrote: > > > Hi, David. > > Genes (in the sense used in Entrez Gene) do not have sequences. Their > respective transcripts do, however, and there can be, in general, multiple > transcripts per gene. Therefore, I think you would have to do a query for > the gene of interest and then link to nucleotide to get the sequences for > the associated transcripts. If you want to do this for many genes, it may > be easier to download the entire refseq collection for your species of > interest and simply load stuff into memory or index the fasta file. > > Sean > > From kellrott at gmail.com Thu Dec 2 22:53:36 2010 From: kellrott at gmail.com (Kyle) Date: Thu, 2 Dec 2010 14:53:36 -0800 Subject: [Biopython] HMMER / Pfam support Message-ID: I would like to submit my hmmer branch for merge into the main BioPython tree, targeting inclusion in 1.57. This branch adds support for HMMER3 file parsing and some Pfam related file work. It's adapted from the PfamScan perl code found at ftp://ftp.sanger.ac.uk/pub/rdf/PfamScanBeta/ The code can be found at https://github.com/kellrott/biopython/tree/hmmer Kyle From sdavis2 at mail.nih.gov Fri Dec 3 01:15:28 2010 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 2 Dec 2010 20:15:28 -0500 Subject: [Biopython] Access Entrez gene DB using rettype 'gb' In-Reply-To: References:

Message-ID: On Thu, Dec 2, 2010 at 4:42 PM, David Jacobs < developer at allthingsprogress.com> wrote: > Hi Sean, > > Thanks for the info. I didn't realize the gene database wasn't concerned > with sequences. (The distinction isn't so clear when you're using the web > interface.) So now I'm trying to query nucleotide. My scripting approach has > been: > > 1. Get list of gene names from a file > 2. Query nucleotide for gene ID > 3. Use that gene ID to download the proper nucleotide entry > > Hi, David. Perhaps you can give a concrete example. What is the starting value (gene name, HUGO gene symbol, Entrez Gene ID)? What is the expected output--you mention "proper nucleotide entry", but there will likely be more than one for a given gene? You also mention that you are interested in a specific region of the genome--do you want the gene locus or the transcripts or the CDS, or something else? Finally, how many genes are we talking about here? 5-10 or thousands? Sean > However, every time I get an ID from nucleotide, it's for an entire genome. > How can I specify either a) a specific gene (as identified in the gene > database) or b) a specific region of the genome? > > David > > On Thu, Dec 2, 2010 at 4:07 PM, Sean Davis wrote: >> >> >> Hi, David. >> >> Genes (in the sense used in Entrez Gene) do not have sequences. Their >> respective transcripts do, however, and there can be, in general, multiple >> transcripts per gene. Therefore, I think you would have to do a query for >> the gene of interest and then link to nucleotide to get the sequences for >> the associated transcripts. If you want to do this for many genes, it may >> be easier to download the entire refseq collection for your species of >> interest and simply load stuff into memory or index the fasta file. 
>> >> Sean >> >> > From developer at allthingsprogress.com Fri Dec 3 05:59:36 2010 From: developer at allthingsprogress.com (David Jacobs) Date: Fri, 3 Dec 2010 00:59:36 -0500 Subject: [Biopython] Access Entrez gene DB using rettype 'gb' In-Reply-To: