From winda002 at student.otago.ac.nz Tue Mar 3 17:03:36 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Wed, 04 Mar 2009 11:03:36 +1300
Subject: [BioPython] ACE contig to alignment
Message-ID: <49ADA938.80408@student.otago.ac.nz>

Hi all,

I'd like to start by thanking everyone who has contributed to biopython and especially the cookbook/tutorial - it's been a great help to this empiricist getting into some (decidedly amateur) bioinformatics. However, for the first time I've run into a problem the available docs can't help me with.

I want to be able to represent all of the reads that contribute to a 454 sequencing contig as a generic biopython alignment. I've written some code that I thought would pad/cut the reads to size and add them to an alignment, but when I run it a significant minority of the contigs in the files I'm working with have misalignments. I was wondering if someone more familiar with the ace parser or generic alignment class could tell me if I'm making some elementary mistake (it is possible that the original alignment was bad, it just seems more likely I did something dumb). I can send along an ACE file if you want to run the script (didn't want to spam the list with attachments).

Thanks in advance for any pointers and I'm sorry to force people to read what I'm sure is inelegant code:

from Bio.Sequencing import Ace
from Bio.Align.Generic import Alignment
from Bio.Alphabet import IUPAC, Gapped

ace_handle = open('eldoni.ace', 'r')
contigs = Ace.parse(ace_handle)
alignments = []  # start the list to which we'll add the contig data

for contig in contigs:
    conname = contig.name + " numreads=" + str(contig.nreads)
    conlength = len(contig.sequence)
    align = Alignment(Gapped(IUPAC.ambiguous_dna, "*"))
    for readn in range(len(contig.reads)):
        start = contig.af[readn].padded_start  # position rel to consensus
        if start < 1:
            # If 'start' is negative or zero we need to ignore bases
            readseq = contig.reads[readn].rd.sequence[-1 * start+1:]
        else:
            # If it's larger then the start needs to be padded with gaps
            readseq = (start-1) * '*' + contig.reads[readn].rd.sequence
        # Finally, pad the end then cut to size
        readseq = readseq + (conlength-len(readseq)) * '*'
        readseq = readseq[:conlength]
        align.add_sequence(readn+1, readseq)
    condata = conname, align
    alignments.append(condata)

--
PhD Student
Allan Wilson Centre
Department of Zoology
University of Otago, PO Box 56, Dunedin 9054
p: +64-3-4798459
m: +64-27-3326815
e: winda002 at student.otago.ac.nz

From winda002 at student.otago.ac.nz Tue Mar 3 21:40:08 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Wed, 04 Mar 2009 15:40:08 +1300
Subject: [BioPython] ACE contig to alignment (found my error)
In-Reply-To: <49ADA938.80408@student.otago.ac.nz>
References: <49ADA938.80408@student.otago.ac.nz>
Message-ID: <49ADEA08.8070400@student.otago.ac.nz>

Hi again all,

After digging around a little more I realised the dumb mistake I made. In case anyone is interested, and to prevent future suffering by getting the answer on to google:

The code as written adds the entirety of each read to the alignment, but when the assembly was made some reads were clipped on either side for quality.
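
For anyone arriving here from a search, here is a minimal sketch of what clipping a read before placing it could look like. It is plain Python and not the poster's final code: place_read and its parameter names are hypothetical, and it assumes the clip points are 1-based, inclusive positions in the padded read, with padded_start being the consensus position of the read's first base (as in the af records used in the script). With Biopython's Ace parser the clip points would presumably come from each read's qa record (I believe the attributes are qual_clipping_start/qual_clipping_end and align_clipping_start/align_clipping_end, but check your Biopython version).

```python
def place_read(read_seq, padded_start, clip_start, clip_end, contig_length, gap="*"):
    """Return one alignment row of exactly contig_length characters.

    read_seq      -- padded read sequence as stored in the ACE file
    padded_start  -- consensus position (1-based) of the read's first base
    clip_start/clip_end -- 1-based, inclusive clip window within the read
    """
    # Keep only the bases inside the clip window (1-based inclusive -> slice).
    kept = read_seq[clip_start - 1:clip_end]
    # 0-based consensus column where the first kept base belongs.
    offset = padded_start + clip_start - 2
    if offset < 0:
        # The kept region still starts before consensus position 1:
        # drop the overhanging bases.
        kept = kept[-offset:]
        offset = 0
    row = gap * offset + kept
    # Pad the right-hand side with gaps, then cut anything past the consensus end.
    return row.ljust(contig_length, gap)[:contig_length]


# Toy data: lower-case letters stand in for low-quality bases that get clipped.
print(place_read("nnACGTnn", padded_start=-1, clip_start=3, clip_end=6,
                 contig_length=6))  # ACGT**
```

The sketch does not handle reads whose clip points are recorded as -1 (which, as far as I know, marks a read with no usable region); the caller would need to skip those.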

Including the low quality bases from each read makes some of the alignments nasty. In my case "contig.reads[readn].qa" contains the start and end clipping points needed to get just the 'good' bases of each read into the alignment.

Cheers,
David

From rodrigo_faccioli at uol.com.br Wed Mar 4 23:04:07 2009
From: rodrigo_faccioli at uol.com.br (Rodrigo faccioli)
Date: Thu, 5 Mar 2009 01:04:07 -0300
Subject: [BioPython] Bio.Entez - Help
Message-ID: <3715adb70903042004h27ac6f03oeb384d3c89777226@mail.gmail.com>

I want to know where I can find examples of using Bio.Entrez. Specifically, I'm developing a program which takes a protein primary sequence, and I need to search for its conserved domains and read the results to show them to the user.

I'm reading this link http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc64 . However, I don't understand it very well. I know that I will work with the CDD database.

I made a simple example, which is below.

from Bio import Entrez
Entrez.email = "rodrigo.faccioli at gmail.com" # Always tell NCBI who you are
handle = Entrez.esearch(db="cdd", term="TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN")
record = Entrez.read(handle)
print record["IdList"]

Thanks for any help.

--
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structure Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-9366 Ext 229
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218

From biopython at maubp.freeserve.co.uk Thu Mar 5 05:42:02 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 5 Mar 2009 10:42:02 +0000
Subject: [BioPython] Bio.Entez - Help
In-Reply-To: <3715adb70903042004h27ac6f03oeb384d3c89777226@mail.gmail.com>
References: <3715adb70903042004h27ac6f03oeb384d3c89777226@mail.gmail.com>
Message-ID: <320fb6e00903050242v63a2f38cgc6eddfa3819814e4@mail.gmail.com>

On Thu, Mar 5, 2009 at 4:04 AM, Rodrigo faccioli wrote:
> I want to know where I can find examples about Bio.Entez. Specifically, I'm developing a program which has a protein primary sequence and I need to search its conserved domain and read it to show for user.
>
> I'm reading this link http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc64 . However, I'm not understanding very well. I know that I will work with CDD database.

The CDD database is one of several protein motif databases the NCBI make available for use with their tool RPS-BLAST. CDD is a composite database which includes domains from PFAM, SMART, KOG etc. Have a look at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml with your example and you'll get a hit to pfam00321.

It sounds like what you want is a script which runs RPS-BLAST using your query protein against the CDD motif database.

You can run BLASTN, BLASTP etc online at the NCBI using a script, but as far as I know, the NCBI do not make RPS-BLAST (or PSI-BLAST) available in this way. I haven't checked this in recent months.

However, I have done this task myself using standalone BLAST installed on my computer, i.e. the tool rpsblast from the NCBI. You'll also need to install the databases (which are big - you'll need plenty of disk space and RAM). Once this is installed and working, you can call rpsblast from Biopython using the Bio.Blast.NCBIStandalone.rpsblast(...) function.

> I made a simple example which is below.
>
> from Bio import Entrez
> Entrez.email = "rodrigo.faccioli at gmail.com" # Always tell NCBI who you are
> handle = Entrez.esearch(db="cdd", term="TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN")
> record = Entrez.read(handle)
> print record["IdList"]
>
> Thanks for any helps.

I think if you use Entrez to access the CDD database, you can only access the domains themselves (using their names - not searching by sequence), e.g.

>>> from Bio import Entrez
>>> Entrez.email = "Your.Name.Here at example.com"
>>> handle = Entrez.esearch(db="cdd", term="pfam00321", retmode="XML")
>>> record = Entrez.read(handle)
>>> print record["IdList"]
['109381']

You can check this ID works via their website:
http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=109381

I've tried a few variations but efetch doesn't seem to support the CDD database (yet).

Peter

From biopython at maubp.freeserve.co.uk Thu Mar 5 07:26:13 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 5 Mar 2009 12:26:13 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
Message-ID: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com>

Hi All,

As the following examples show, and the python string method's docstring clearly states, the python string's count method uses a non-overlapping search:

>>> "AAA".count("A")
3
>>> "AAA".count("AA") # you might expect 2
1
>>> "BBBB".count("BB") # you might expect 3
2

Up until Biopython 1.44, the Seq object's count method only worked for single characters. From Biopython 1.45 onwards it accepted longer strings and followed the built in python string count behaviour. However, as Noel pointed out on Bug 2779, our docstring does not make it clear that this does a non-overlapping search. In fact, as Leighton suggests, one might expect the Seq object's count method to use an overlapping search.

http://bugzilla.open-bio.org/show_bug.cgi?id=2779

We should either:

(a) stick with the python string compatible behaviour (which has been a general principle for the Seq class), but document this issue more clearly as a non-overlapping search does run counter to some potential biological uses.

or,

(b) change the behaviour as Leighton suggests to do an overlapping search. This could break any code relying on the old python string-like behaviour.

What do people here think? Any preferences?

[I don't want to get into details about the implementation here on the main list]

Peter

From baoilleach at gmail.com Thu Mar 5 08:11:31 2009
From: baoilleach at gmail.com (Noel O'Boyle)
Date: Thu, 5 Mar 2009 13:11:31 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
Message-ID:

+1 for (b)

Seq.count() should behave like a biological sequence.

Here's an example in the wild of this type of analysis:
http://www.computational-genomics.net/case_studies/haemophilus_demo.html#14

It's from a bioinformatics textbook with example code in Matlab. I was helping a colleague who was trying to reproduce the analysis with BioPython. Everything was fine until the dimer frequencies were found to disagree. After implementing the count ourselves, we were able to reproduce the results. It was then we realised that BioPython was behaving in an unexpected and non-useful way.

- Noel

From biopython at maubp.freeserve.co.uk Thu Mar 5 08:26:10 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 5 Mar 2009 13:26:10 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To:
References:
Message-ID: <320fb6e00903050526r688eadcfv440602c32d294ee8@mail.gmail.com>

On Thu, Mar 5, 2009 at 1:11 PM, Noel O'Boyle wrote:
> +1 for (b)
>
> Seq.count() should behave like a biological sequence.
>
> Here's an example in the wild of this type of analysis:
> http://www.computational-genomics.net/case_studies/haemophilus_demo.html#14
>
> It's from a bioinformatics textbook with example code in Matlab. I was helping a colleague who was trying to reproduce the analysis with BioPython. Everything was fine until the dimer frequencies were found to disagree. After implementing the count ourselves, we were able to reproduce the results. It was then we realised that BioPython was behaving in an unexpected and non-useful way.

I agree that in this context it is not useful to have the Seq object count do a non-overlapping search. However, calling it "unexpected" is debatable, and probably depends on the user's background. If you already know Python before using Biopython, I would argue that the non-overlapping search is expected because that is what python strings do.
On the other hand, I'm sure many Biopython users learn Python and Biopython together - and one might still argue having strings and Seq objects do different things is unexpected.

Overall between options (a) and (b), I'd pick consistency with the python string (a), even if it isn't ideal.

There is another idea, let's call this option (c). Give the Seq object's count method an optional boolean argument to enable an overlapping search (which I would want to default to matching the python string behaviour). This makes switching between string and Seq objects easier, and makes the more useful (but probably slower) overlap aware count option quite accessible and discoverable.

Peter

From bartek at rezolwenta.eu.org Thu Mar 5 08:28:14 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Thu, 5 Mar 2009 14:28:14 +0100
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com>
References: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com>
Message-ID: <8b34ec180903050528m7a3815c8l3048046e42f0ce00@mail.gmail.com>

On Thu, Mar 5, 2009 at 1:26 PM, Peter wrote:
> (a) stick with the python string compatible behaviour (which has been a general principle for the Seq class), but document this issue more clearly as a non-overlapping search does run counter to some potential biological uses.
>
> or,
>
> (b) Or change the behaviour as Leighton suggests to do an overlapping search. This could break any code relying on the old python string-like behaviour.
>
> What do people here think? Any preferences?
>
> [I don't want to get into details about the implementation here on the main list]

I don't use the count method much, so I don't have a strong opinion on that.

As Leighton pointed out, searching for sequences looks like a good job for Bio.Motif. It's currently doable, but (since Bio.Motif mostly deals with more complex motifs than a single sequence) the interface is not polished and it's not optimized for performance. Currently the code to do this would look like this:

import Bio.Motif
from Bio.Seq import Seq

m=Bio.Motif.Motif()
m.add_instance(Seq("GG",m.alphabet))
for i in m.search_instances(your_long_sequence):
    print "found GG at position",i

If there is a need to keep backwards compatibility for .count(), I can make changes to Bio.Motif to make it easier for people to use it.

--
Bartek

From lpritc at scri.ac.uk Thu Mar 5 08:34:03 2009
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Thu, 05 Mar 2009 13:34:03 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com>
Message-ID:

Hi,

On 05/03/2009 12:26, "Peter" wrote:
> We should either:
>
> (a) stick with the python string compatible behaviour (which has been a general principle for the Seq class), but document this issue more clearly as a non-overlapping search does run counter to some potential biological uses.
>
> or,
>
> (b) Or change the behaviour as Leighton suggests to do an overlapping search. This could break any code relying on the old python string-like behaviour.
>
> What do people here think? Any preferences?

Not surprisingly, I favour (b). The intended domain of use for Seq is as a proxy for a biological entity and I think that, just as we extend methods to reflect useful biologically-themed operations, we should also override methods as appropriate to reflect those same themes.

I can think of a number of run-of-the-mill use cases where we would want to know about the count of (potentially) overlapping matches of a subsequence in a biological sequence, for short sequence repeats (SSRs), restriction sites, protein sequence motifs, and so on.
Also, if we want simply to test the expected number of occurrences of the dimer 'AA' in a larger sequence with a given base composition, a non-overlapping count() method will give a misleading answer, as it will underreport occurrences of 'AA' in odd-length runs of consecutive 'A's.

I think that the overlapping approach (b) should at least be a default setting, even if we choose to make overlap/non-overlap an argument to the method. For some searches that potentially could have overlaps we might want to know what biological question is being asked before choosing which approach to take.

We may, for example, desire different behaviour from query sequences like 'AGCCAG' depending on circumstances. This query on 'AGCCAGCCAG' will return 1 if no overlap is allowed, and 2 if an overlap is allowed. The same query on 'AGCCAGAGCCAG' will return 2 in both cases. If we care about 'AGCCAG' as a restriction site, then we would want an overlapping search. If we care about 'AGCCAG' as a simple repeat unit, then we might want a non-overlapping search instead (assuming that the circumstances of the search are such that this is a sensible answer). Having the option might be useful.

A non-overlapping search might also be useful in those cases where existing code already corrects for nonintuitive behaviour of count(). This is only going to apply to code that has been produced since release 1.45, so may only have limited impact, if any. I would argue that, since a correction was needed, by parsimony the original behaviour was probably what required the change.

On the whole, I think that an overlapping count() is the most intuitive and most likely use case. I see that there's an argument for consistency with string.count(), in that dyed-in-the-wool programmers might find it hard to shift mental gears from one to the other, but I'm not sure that it's a good argument, for the following reason.

The following statements are true:

A String is a Python sequence type. Its count() method returns a non-overlapping count of the query substring.

A List is a Python sequence type. Its count() method returns the number of elements that match the query.

A Tuple is a Python sequence type. It doesn't have a count() method, although you might imagine that it could stand to have one.

There isn't any cross-sequence object consistency regarding count(). Should we choose String-like or List-like behaviour when dealing with a MutableSeq? I don't think that we should seek consistency with String at the expense of utility or biological intuition, when:

A Seq/MutableSeq is a (Bio)Python sequence type. Its count() method returns the overlapping count of the query substring.

Fits nicely with the other three statements, in that none of them are consistent with any other ;)

L.

--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405

From mjldehoon at yahoo.com Thu Mar 5 09:49:10 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Thu, 5 Mar 2009 06:49:10 -0800 (PST)
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com>
Message-ID: <418103.38901.qm@web62405.mail.re1.yahoo.com>

I vote (b).
Another option is to continue to use count() for a Python-style count, and to add a new method that does an overlapping-type count. For this new method we'd need a clear but short name, and I can't think of anything now.

--Michiel.

From biopython at maubp.freeserve.co.uk Thu Mar 5 10:05:39 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 5 Mar 2009 15:05:39 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <418103.38901.qm@web62405.mail.re1.yahoo.com>
References: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com> <418103.38901.qm@web62405.mail.re1.yahoo.com>
Message-ID: <320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>

On Thu, Mar 5, 2009 at 2:49 PM, Michiel de Hoon wrote:
>
> I vote (b).
> Another option is to continue to use count() for a Python-style count, and to add a new method that does a overlapping-type count. For this new method we'd need a clear but short name, and I can't think of anything now.
>
> --Michiel.

Did you like plan (c), which preserves the Python string style count as the default but offers the overlapping count via an optional argument?

i.e.

>>> from Bio.Seq import Seq
>>> nuc = Seq("AAAA")
>>> nuc.count("AA") #default is non-overlapping
2
>>> nuc.count("AA", overlap=True)
3
>>> nuc.count("AA", overlap=False)
2

Peter

From dalloliogm at gmail.com Thu Mar 5 10:10:59 2009
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Thu, 5 Mar 2009 16:10:59 +0100
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
References: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com> <418103.38901.qm@web62405.mail.re1.yahoo.com> <320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
Message-ID: <5aa3b3570903050710hb407258k6fca86cf1bf9520f@mail.gmail.com>

On Thu, Mar 5, 2009 at 4:05 PM, Peter wrote:
> On Thu, Mar 5, 2009 at 2:49 PM, Michiel de Hoon wrote:
>>
>> I vote (b).
>> Another option is to continue to use count() for a Python-style count, and to add a new method that does a overlapping-type count. For this new method we'd need a clear but short name, and I can't think of anything now.
>>
>> --Michiel.
>
> Did you like plan (c), which preserves the Python string style count as the default but offers the overlapping count via an optional argument?
>
> i.e.
> >>> from Bio.Seq import Seq
> >>> nuc = Seq("AAAA")
> >>> nuc.count("AA") #default is non-overlapping
> 2
> >>> nuc.count("AA", overlap=True)
> 3
> >>> nuc.count("AA", overlap=False)
> 2

Imho this is the best solution.
If I may say so, I expect a .count() method to act like the method of the same name on python strings.
A good doctest example (similar to the existing one) would be nice, too.
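
As an illustration of the kind of doctest being asked for, here is a small standalone sketch of plan (c). It is only a plain-Python stand-in, not Biopython code: count_subseq is a hypothetical helper operating on strings, and the overlap keyword simply mirrors the argument proposed in this thread. The second pair of doctests uses the 'AGCCAG' cases mentioned earlier in the discussion.

```python
def count_subseq(seq, sub, overlap=False):
    """Count occurrences of sub in seq (hypothetical helper, not Biopython API).

    By default this matches str.count (non-overlapping search):

    >>> count_subseq("AAAA", "AA")
    2
    >>> count_subseq("AGCCAGCCAG", "AGCCAG")
    1

    With overlap=True every start position is counted:

    >>> count_subseq("AAAA", "AA", overlap=True)
    3
    >>> count_subseq("AGCCAGCCAG", "AGCCAG", overlap=True)
    2
    """
    seq, sub = str(seq), str(sub)
    if not sub:
        raise ValueError("sub must be a non-empty string")
    if not overlap:
        return seq.count(sub)
    total = 0
    pos = seq.find(sub)
    while pos != -1:
        total += 1
        # Restart one character after the last hit so overlaps are found.
        pos = seq.find(sub, pos + 1)
    return total


if __name__ == "__main__":
    import doctest
    doctest.testmod()
```

Defaulting to overlap=False keeps the string-compatible behaviour, so existing callers would see no change; flipping the default is the part the thread is still arguing about.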

--
My blog on bioinformatics (now in English): http://bioinfoblog.it

From baoilleach at gmail.com Thu Mar 5 10:23:42 2009
From: baoilleach at gmail.com (Noel O'Boyle)
Date: Thu, 5 Mar 2009 15:23:42 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
References: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com> <418103.38901.qm@web62405.mail.re1.yahoo.com> <320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
Message-ID:

2009/3/5 Peter :
> On Thu, Mar 5, 2009 at 2:49 PM, Michiel de Hoon wrote:
>>
>> I vote (b).
>> Another option is to continue to use count() for a Python-style count, and to add a new method that does a overlapping-type count. For this new method we'd need a clear but short name, and I can't think of anything now.
>>
>> --Michiel.
>
> Did you like plan (c), which preserves the Python string style count as the default but offers the overlapping count via an optional argument?
>
> i.e.
> >>> from Bio.Seq import Seq
> >>> nuc = Seq("AAAA")
> >>> nuc.count("AA") #default is non-overlapping
> 2
> >>> nuc.count("AA", overlap=True)
> 3
> >>> nuc.count("AA", overlap=False)
> 2
>
> Peter

I think we are arguing here over which should be the default value.

Several people here believe that behaviour analogous to Python's string.count will reduce bug reports and user confusion. However, no-one except Leighton has been able to come up with a single use case where the current behaviour is useful (and even that example, with respect, was flimsy). So we end up with a method which adheres magnificently to the principle of least surprise, but which is of no use to users. Aren't you trying to provide methods which are useful for biological analysis? Isn't that the purpose of wrapping the string in the first place?

Noel (getting far too excited over painting this bikeshed)

From bsouthey at gmail.com Thu Mar 5 11:28:11 2009
From: bsouthey at gmail.com (Bruce Southey)
Date: Thu, 5 Mar 2009 10:28:11 -0600
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To:
References: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com> <418103.38901.qm@web62405.mail.re1.yahoo.com> <320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
Message-ID:

Hi,
This is a little deja vu as I feel this type of thing has come up before. While I can not speak for anyone else, if I sound different to that, then I was obviously convinced by those arguments, as that sounds better than I forgot :-)

More seriously, ignoring the reading frame or the genetic code when counting is rather bad form!

I can not think of a relevant case involving a protein sequence - although counting pairs of cysteines in insulin-like sequences could be a situation of importance (related to disulphide bonds).

An example for nucleic sequences: counting 'TTT' in the made-up sequence 'TTTTTTTGG' can give two in frames 1 and 2 but only one in frame 3.

Also, a weaker concern is that having the sum of counts greater than or equal to the length of the sequence is not a desirable property unless the user is informed that duplicates were found. In the above case, five sounds rather wrong when one says that a DNA sequence of nine DNA bases can produce five phenylalanines!

Yes, context is everything because 3 different results is not nice. Don't get me wrong, I know that finding duplicates is important, just that it should not be here - there must be different functions.

Thus, I vote for (a) and I also prefer that the default syntax is consistent with the Python language. If this change is done, then all of Biopython must be revised to be consistent - like reading frames and similar discussion...

Bruce

From biopython at maubp.freeserve.co.uk Thu Mar 5 11:34:37 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 5 Mar 2009 16:34:37 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To:
References: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com> <418103.38901.qm@web62405.mail.re1.yahoo.com> <320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
Message-ID: <320fb6e00903050834i32bd8d64w672e53b6ef1dbf56@mail.gmail.com>

On Thu, Mar 5, 2009 at 4:28 PM, Bruce Southey wrote:
> Hi,
> This is a little deja vu as I feel this type of thing has come up before. While I can not speak for anyone else, if I sound different to that, then I was obviously convinced by those arguments as that sounds better than I forgot :-)
>
> More seriously, ignoring the reading frame or the genetic code when counting is rather bad form!

Why? In many situations they are irrelevant. Consider counting restriction enzyme digest sites for example, plus counting in any protein sequence.

> I can not think of a relevant case involving a protein sequence - although counting pairs of cysteines in insulin-like sequences could be a situation of importance (related to disulphide bonds).
>
> An example for nucleic sequences, counting 'TTT' in the madeup sequence 'TTTTTTTGG' can be two in frames 1 and 2 but only one in frame 3.

Giving an answer of 2 (using a non-overlapping search like the python string method) or 5 (using an overlapping search) are both valid expected outcomes for "TTT" in "TTTTTTTGG". Here you seem to want to count codons - which is by its nature a frame dependent task.
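
The three numbers in this exchange can be reproduced with a few lines of plain Python. This is only an illustrative sketch: count_overlapping and count_codon_in_frame are hypothetical helper names, not Biopython functions, and only the forward strand is considered.

```python
def count_overlapping(seq, sub):
    """Count every start position at which sub occurs, overlaps included."""
    return sum(1 for i in range(len(seq) - len(sub) + 1)
               if seq.startswith(sub, i))


def count_codon_in_frame(seq, codon, frame):
    """Count codon among the complete codons of forward frame 1, 2 or 3."""
    codons = [seq[i:i + 3] for i in range(frame - 1, len(seq) - 2, 3)]
    return codons.count(codon)


seq = "TTTTTTTGG"
non_overlapping = seq.count("TTT")            # 2, the python string behaviour
overlapping = count_overlapping(seq, "TTT")   # 5, every start position
per_frame = [count_codon_in_frame(seq, "TTT", frame)
             for frame in (1, 2, 3)]          # [2, 2, 1]
print(non_overlapping, overlapping, per_frame)
```

Since every start position falls in exactly one of the three frames, the per-frame counts for a trinucleotide add up to the overlapping count (2 + 2 + 1 = 5 here), which is one way to see that the frame-aware count is a separate question from the overlap setting.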

Peter

From biopython at maubp.freeserve.co.uk Thu Mar 5 11:35:10 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 5 Mar 2009 16:35:10 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To:
References: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com>
  <418103.38901.qm@web62405.mail.re1.yahoo.com>
  <320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
Message-ID: <320fb6e00903050835h2c548083jda67b5f50fcfc842@mail.gmail.com>

On Thu, Mar 5, 2009 at 3:23 PM, Noel O'Boyle wrote:
> I think we are arguing here over which should be the default value.
>
> Several people here believe that behaviour analogous to Python's
> string.count will reduce bug reports and user confusion. However,
> no-one except Leighton has been able to come up with a single use case
> where the current behaviour is useful (and even that example, with
> respect, was flimsy). So we end up with a method which adheres
> magnificently to the principle of least surprise, but which is of no
> use to users. Aren't you trying to provide methods which are useful
> for biological analysis? Isn't that the purpose of wrapping the string
> in the first place?
>
> Noel (getting far too excited over painting this bikeshed)

If we hadn't been shipping Biopython with the old non-overlapping
python-string-like count method for the last year, I would probably
have been more willing to agree that the Seq count method could differ
from the python string and use an overlapping search. However,
changing it now also breaks backwards compatibility, which shouldn't
be done lightly. We could still do this (implementation discussion on
the dev list or on Bug 2779), but we will have to make this change
very clear in the release notes.

Peter

From mjldehoon at yahoo.com Fri Mar 6 06:52:58 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Fri, 6 Mar 2009 03:52:58 -0800 (PST)
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
Message-ID: <791065.98994.qm@web62403.mail.re1.yahoo.com>

> > Another option is to continue to use count() for a Python-style count,
> > and to add a new method that does an overlapping-type count. For this
> > new method we'd need a clear but short name, and I can't think of
> > anything now.
> >
> Did you like plan (c), which preserves the Python string style count
> as the default but offers the overlapping count via an optional
> argument?
>
It's also OK, but if we use a different method name we can leave
count() untouched altogether.

--Michiel.

From biopython at maubp.freeserve.co.uk Fri Mar 6 07:07:57 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 6 Mar 2009 12:07:57 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <791065.98994.qm@web62403.mail.re1.yahoo.com>
References: <320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
  <791065.98994.qm@web62403.mail.re1.yahoo.com>
Message-ID: <320fb6e00903060407u7383545fp80fc8b81899a33a7@mail.gmail.com>

On Fri, Mar 6, 2009 at 11:52 AM, Michiel de Hoon wrote:
>> > Another option is to continue to use count() for a Python-style count,
>> > and to add a new method that does an overlapping-type count. For this
>> > new method we'd need a clear but short name, and I can't think of
>> > anything now.
>>
>> Did you like plan (c), which preserves the Python string style count
>> as the default but offers the overlapping count via an optional
>> argument?
>
> It's also OK, but if we use a different method name we can leave count() untouched altogether.

Looking back, Sebastian Bassi raised this issue back in 2003 on this
mailing list, and his overlap-aware count implementation is used
internally by Bio.SeqUtils.MeltingTemp, see:

http://lists.open-bio.org/pipermail/biopython/2003-November/001741.html
http://lists.open-bio.org/pipermail/biopython/2003-November/001742.html
etc

Sebastian also posted an enhancement request for adding an overlap
aware counting method to the python base string, with "overcount" as a
possible name. I don't know what happened to his bug report, it seems
to have been marked private:
http://mail.python.org/pipermail/python-bugs-list/2003-November/021239.html

I don't really like the name "overcount", but as another suggestion
how about "count_ol", which is short for count-with-overlaps?

Peter

From lpritc at scri.ac.uk Fri Mar 6 07:15:59 2009
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Fri, 06 Mar 2009 12:15:59 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <791065.98994.qm@web62403.mail.re1.yahoo.com>
Message-ID:

In the spirit of being blindingly obvious, how about:

Seq.overlapping_count()

;)

L.

On 06/03/2009 11:52, "Michiel de Hoon" wrote:
>
>>> Another option is to continue to use count() for a Python-style count,
>>> and to add a new method that does an overlapping-type count. For this
>>> new method we'd need a clear but short name, and I can't think of
>>> anything now.
>>>
>> Did you like plan (c), which preserves the Python string style count
>> as the default but offers the overlapping count via an optional
>> argument?
>>
> It's also OK, but if we use a different method name we can leave count()
> untouched altogether.
-- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). ______________________________________________________________________ From chapmanb at 50mail.com Fri Mar 6 08:14:04 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 6 Mar 2009 08:14:04 -0500 Subject: [BioPython] The count method of a Seq (or MutableSeq) object In-Reply-To: References: <791065.98994.qm@web62403.mail.re1.yahoo.com> Message-ID: <20090306131404.GJ69627@sobchak.mgh.harvard.edu> Hey all; Great discussion on this. My preference is for a new function, and I like Leighton's naming suggestion. 
Also, unless someone has a use case for the current count() function,
we should deprecate and eventually remove it. Overriding the string
API where it makes sense is good, but here it seems to be creating
confusion and not solving a problem. If someone needs the real string
count, they can always do str(your_seq).count("GG").

Brad

> In the spirit of being blindingly obvious, how about:
>
> Seq.overlapping_count()
>
> ;)
>
> L.
>
>
> On 06/03/2009 11:52, "Michiel de Hoon" wrote:
>
> >
> >>> Another option is to continue to use count() for a Python-style count,
> >>> and to add a new method that does an overlapping-type count. For this
> >>> new method we'd need a clear but short name, and I can't think of
> >>> anything now.
> >>>
> >> Did you like plan (c), which preserves the Python string style count
> >> as the default but offers the overlapping count via an optional
> >> argument?
> >>
> > It's also OK, but if we use a different method name we can leave count()
> > untouched altogether.

From biopython at maubp.freeserve.co.uk Fri Mar 6 09:13:42 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 6 Mar 2009 14:13:42 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <20090306131404.GJ69627@sobchak.mgh.harvard.edu>
References: <791065.98994.qm@web62403.mail.re1.yahoo.com>
  <20090306131404.GJ69627@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com>

On Fri, Mar 6, 2009 at 1:14 PM, Brad Chapman wrote:
> Hey all;
> Great discussion on this. My preference is for a new function,
> and I like Leighton's naming suggestion.

Yes, "overlapping_count" is a reasonable choice. It's a bit long, but
it is clear.

> Also, unless someone has a use case for the current count()
> function, we should deprecate and eventually remove it. Overriding
> the string API where it makes sense is good, but here it seems to be
> creating confusion and not solving a problem. If someone needs the
> real string count, they can always do str(your_seq).count("GG").

There is the very common use case of my_seq.count("A"), or similar,
with single character search strings, and lots of code does this (both
in Biopython and, I'm sure, in users' scripts). For single letters of
course, a non-overlapping count and an overlapping count do the same
thing - deprecating the count method would cause a lot of unnecessary
upheaval.

Ignoring that, given we want the Seq to generally behave like a python
string, I think removing the count method would still be a bad idea.

[As a compromise, assuming we add an overlapping_count method and do a
Biopython 1.50 beta release, the beta release could include a warning
in the count method when used with a multi-character search string,
suggesting the user might in fact need an overlapping count. Or is
this a bit too crazy?]

Peter

From bsouthey at gmail.com Fri Mar 6 10:06:07 2009
From: bsouthey at gmail.com (Bruce Southey)
Date: Fri, 06 Mar 2009 09:06:07 -0600
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com>
References: <791065.98994.qm@web62403.mail.re1.yahoo.com>
  <20090306131404.GJ69627@sobchak.mgh.harvard.edu>
  <320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com>
Message-ID: <49B13BDF.9030908@gmail.com>

Peter wrote:
> On Fri, Mar 6, 2009 at 1:14 PM, Brad Chapman wrote:
>
>> Hey all;
>> Great discussion on this. My preference is for a new function,
>> and I like Leighton's naming suggestion.
>>
>
> Yes, "overlapping_count" is a reasonable choice. It's a bit long, but
> it is clear.
>
>
>> Also, unless someone has a use case for the current count()
>> function, we should deprecate and eventually remove it. Overriding
>> the string API where it makes sense is good, but here it seems to be
>> creating confusion and not solving a problem. If someone needs the
>> real string count, they can always do str(your_seq).count("GG").
>>
I have already given one use case where overlapping counts are totally
inappropriate! Unique codon counting is extremely important in many
areas including gene prediction (possible splicing sites) and
molecular evolution (like codon usage).

Another valid case given was DNA restriction sites, where you may want
both overlapping and unique counts. For example, if DNA is digested by
one enzyme that has unique sites in the sequence, then followed by a
second enzyme that has unique sites in the digested product but
possibly duplicates in the original sequence.

I just do not understand your logic of requiring a conversion when the
Seq object is designed to 'behave like a python string'.

>
> There is the very common use case of my_seq.count("A"), or similar,
> with single character search strings, and lots of code does this (both
> in Biopython and I'm sure user's scripts). For single letters of
> course, a non-overlapping count and an overlapping count do the same
> thing - deprecating the count method would cause a lot of unnecessary
> upheaval.
>
> Ignoring that, given we want the Seq to generally behave like a python
> string, I think removing the count method would still be a bad idea.
>
I agree.

> [As a compromise, assuming we add an overlapping_count method and do a
> Biopython 1.50 beta release, the beta release could include a warning
> in the count method when used with a multi-character search string,
> suggesting the user might in fact need an overlapping count. Or is
> this a bit too crazy?]
>
Yes it is too crazy and does not fit into the current established
behavior of Biopython.

Bruce

From biopython at maubp.freeserve.co.uk Fri Mar 6 10:15:24 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 6 Mar 2009 15:15:24 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <49B13BDF.9030908@gmail.com>
References: <791065.98994.qm@web62403.mail.re1.yahoo.com>
  <20090306131404.GJ69627@sobchak.mgh.harvard.edu>
  <320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com>
  <49B13BDF.9030908@gmail.com>
Message-ID: <320fb6e00903060715w73d6efccqa5ffc9813ce3ec6a@mail.gmail.com>

On Fri, Mar 6, 2009 at 3:06 PM, Bruce Southey wrote:
> I have already given one use case where overlapping counts are totally
> inappropriate! Unique codon counting is extremely important in many areas
> including gene prediction (possible splicing sites) and molecular evolution
> (like codon usage).

For codon counting NEITHER the current non-overlapping count nor the
suggested overlapping count would be suitable. So this doesn't really
affect the overlapping versus non-overlapping debate.

Peter

From bsouthey at gmail.com Fri Mar 6 10:34:42 2009
From: bsouthey at gmail.com (Bruce Southey)
Date: Fri, 06 Mar 2009 09:34:42 -0600
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903060715w73d6efccqa5ffc9813ce3ec6a@mail.gmail.com>
References: <791065.98994.qm@web62403.mail.re1.yahoo.com>
  <20090306131404.GJ69627@sobchak.mgh.harvard.edu>
  <320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com>
  <49B13BDF.9030908@gmail.com>
  <320fb6e00903060715w73d6efccqa5ffc9813ce3ec6a@mail.gmail.com>
Message-ID: <49B14292.6080806@gmail.com>

Peter wrote:
> On Fri, Mar 6, 2009 at 3:06 PM, Bruce Southey wrote:
>
>> I have already given one use case where overlapping counts are totally
>> inappropriate! Unique codon counting is extremely important in many areas
>> including gene prediction (possible splicing sites) and molecular evolution
>> (like codon usage).
>>
>
> For codon counting NEITHER the current non-overlapping count nor the
> suggested overlapping count would be suitable. So this doesn't really
> affect the overlapping versus non-overlapping debate.
>
> Peter
>
With due respect, this does not make any sense.

If it is a cDNA then I can count say the different Lysine codons to
find any usage bias using
seq.count('AAA') / (seq.count('AAA') + seq.count('AAG')).
(Actually I am more interested in the occurrence of specific multiple
codons than single codons.)

If you want the forward frames then just seq[0:].count('AAA'),
seq[1:].count('AAA') and seq[2:].count('AAA') for frames 1, 2, and 3,
respectively.

As you pointed out single characters are not relevant, so what is
relevant?

Bruce

From biopython at maubp.freeserve.co.uk Fri Mar 6 10:46:19 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 6 Mar 2009 15:46:19 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <49B14292.6080806@gmail.com>
References: <791065.98994.qm@web62403.mail.re1.yahoo.com>
  <20090306131404.GJ69627@sobchak.mgh.harvard.edu>
  <320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com>
  <49B13BDF.9030908@gmail.com>
  <320fb6e00903060715w73d6efccqa5ffc9813ce3ec6a@mail.gmail.com>
  <49B14292.6080806@gmail.com>
Message-ID: <320fb6e00903060746r309216e7t36d00434993a8cfb@mail.gmail.com>

On Fri, Mar 6, 2009 at 3:34 PM, Bruce Southey wrote:
>>
>> For codon counting NEITHER the current non-overlapping count nor the
>> suggested overlapping count would be suitable. So this doesn't really
>> affect the overlapping versus non-overlapping debate.
>>
>> Peter
>
> With due respect, this does not make any sense.
>
> If it is a cDNA then I can count say the different Lysine codons to find any
> usage bias using seq.count('AAA') / (seq.count('AAA') + seq.count('AAG')).
> (Actually I am more interested in the occurrence of specific multiple codons
> than single codons.)

If you have the (short) CDS "TAAAAAAAAAAG", which splits into the
codons TAA, AAA, AAA, AAG (a stop codon followed by "KKK"), then the
codon count for "AAA" is 2 and the codon count for "AAG" is 1.

Using the (standard python) non-overlapping count method,
"TAAAAAAAAAAG".count("AAA") = 3 and "TAAAAAAAAAAG".count("AAG") = 1,
which does not do what you want.

Using a hypothetical overlapping count method,
"TAAAAAAAAAAG".overlapping_count("AAA") = 8 and
"TAAAAAAAAAAG".overlapping_count("AAG") = 1, which does not do what
you want.

i.e. As I said, for codon counting NEITHER the current non-overlapping
count nor the suggested overlapping count would be suitable.

You seem to be asking for something different - a codon counting
method, which is a special case of a non-overlapping count.

Peter

From lpritc at scri.ac.uk Fri Mar 6 10:47:37 2009
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Fri, 06 Mar 2009 15:47:37 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <49B13BDF.9030908@gmail.com>
Message-ID:

On 06/03/2009 15:06, "Bruce Southey" wrote:
> Peter wrote:
>> On Fri, Mar 6, 2009 at 1:14 PM, Brad Chapman wrote:
>>> Also, unless someone has a use case for the current count()
>>> function, we should deprecate and eventually remove it. Overriding
>>> the string API where it makes sense is good, but here it seems to be
>>> creating confusion and not solving a problem. If someone needs the
>>> real string count, they can always do str(your_seq).count("GG").
>>>
> I have already given one use case where overlapping counts are totally
> inappropriate! Unique codon counting is extremely important in many
> areas including gene prediction (possible splicing sites) and molecular
> evolution (like codon usage).

We're not discussing codon counting though, we're discussing counting
occurrences of an arbitrary substring in a sequence. They're not the
same operation, even though they both involve counting.

L.
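
The three different answers for "TAAAAAAAAAAG" discussed in this thread can be reproduced with a short sketch in plain Python. The codon_count function and its frame argument are hypothetical names used only for illustration, not Biopython methods; the overlapping count here is done with a regular-expression lookahead.

```python
import re


def codon_count(seq, codon, frame=0):
    """Count one codon in a single reading frame (frame is 0, 1 or 2).

    Hypothetical helper for illustration only; not part of Biopython.
    """
    seq = str(seq)
    # Step through the sequence three letters at a time from the frame
    # offset, ignoring any incomplete codon at the end.
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    return codons.count(codon)


cds = "TAAAAAAAAAAG"
print(cds.count("AAA"))                # Python string count, non-overlapping
print(len(re.findall("(?=AAA)", cds))) # overlapping count via lookahead
print(codon_count(cds, "AAA"))         # in-frame codon count
print(codon_count(cds, "AAG"))
```

This prints 3, 8, 2 and 1. Only the in-frame version answers the codon usage question, which is why it is a separate operation from either kind of substring count.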
______________________________________________________________________

From chapmanb at 50mail.com Fri Mar 6 17:46:39 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 6 Mar 2009 17:46:39 -0500
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com>
References: <791065.98994.qm@web62403.mail.re1.yahoo.com>
  <20090306131404.GJ69627@sobchak.mgh.harvard.edu>
  <320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com>
Message-ID: <20090306224639.GM69627@sobchak.mgh.harvard.edu>

Me:
> > Also, unless someone has a use case for the current count()
> > function, we should deprecate and eventually remove it. Overriding
> > the string API where it makes sense is good, but here it seems to be
> > creating confusion and not solving a problem. If someone needs the
> > real string count, they can always do str(your_seq).count("GG").

Bruce:
> I have already given one use case where overlapping counts are totally
> inappropriate! Unique codon counting

Sorry, I was a bit terse in my previous e-mail. My thought on
deprecation was actually based on your and Noel's emails; both of you
presented cases where you had biological expectations for count which
are not met by the standard string count behaviour. For Noel, this is
handled by the proposed overlapping_count function. For your example,
I think it would be better handled by functionality that returned a
list of codons, like:

    Seq("ATGGAACAT").codon_list(phase=0)
    ["ATG", "GAA", "CAT"]

Bruce:
> I just do not understand your logic of requiring a conversion when the
> Seq object is designed to 'behave like a python string'.

This is representing a biological sequence, so I think where a
biologist user's intuition opposes what a standard python string does
we should evaluate for an option that is more in line with
expectations.
My point about the string was just that if you are thinking as a
python programmer and really want python string behavior, it is pretty
easy to get.

Peter:
> There is the very common use case of my_seq.count("A"), or similar,
> with single character search strings, and lots of code does this (both
> in Biopython and I'm sure user's scripts). For single letters of
> course, a non-overlapping count and an overlapping count do the same
> thing - deprecating the count method would cause a lot of unnecessary
> upheaval.

Good point; I totally overlooked that. Retract my suggestion.

I do like your warning idea, but maybe we can get by here with
documentation and by highlighting the alternative functions. It looked
like you're already all over the documentation, so hopefully the new
functionality will fix up any confusion.

Thanks all,
Brad

From chapmanb at 50mail.com Sun Mar 8 12:29:41 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Sun, 8 Mar 2009 12:29:41 -0400
Subject: [BioPython] Initial work on a GFF parser
Message-ID: <20090308162941.GA99653@kunkel>

Hi all;
Generic Feature Format (GFF) is a nice tab delimited file format
that we don't have full support for in Biopython. Michael Hoffman
contributed code to work with GFF MySQL databases (in Bio.GFF), but
we don't have a GFF parser for the flatfiles. Looking back over the
list archives, this has come up a couple of times without a finished
solution being implemented. GFF suffers from the curse of being too
easy to hack together a solution for parsing a very specific problem,
while generating a good standard parser takes more work.

Recently, Peter brought up GFF on the BioSQL mailing list, which
made me interested in digging into GFF as an input and output flat
file format for BioSQL databases. Towards this end I put together an
initial implementation of a GFF (version 3) parser for Biopython.
A write up and the code are here: http://bcbio.wordpress.com/2009/03/08/initial-gff-parser-for-biopython/ As described in the post, the GFF interface will be a bit different from the standard SeqIO interface, since GFF stores features separately from the sequences and also doesn't require features for a record to be grouped together. As a result, the interface is up for discussion and the best path is to start with an implementation and see where it takes us. I'd be grateful for any feedback and code from those who are interested. We can discuss on the development mailing list or on the blog, and move towards getting stable full featured GFF parsing in Biopython. Brad From biopython at maubp.freeserve.co.uk Mon Mar 9 06:14:55 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 9 Mar 2009 10:14:55 +0000 Subject: [BioPython] Initial work on a GFF parser In-Reply-To: <20090308162941.GA99653@kunkel> References: <20090308162941.GA99653@kunkel> Message-ID: <320fb6e00903090314q19d64af2m4e37918fc3f5f164@mail.gmail.com> On Sun, Mar 8, 2009 at 4:29 PM, Brad Chapman wrote: > Hi all; > Generic Feature Format (GFF) is a nice tab delimited file format > that we don't have full support for in Biopython. Michael Hoffman > contributed code to work with GFF MySQL databases (in Bio.GFF), but > we don't have a GFF parser for the flatfiles. Looking back over the > list archives, this has come up a couple of times without a finished > solution being implemented. GFF suffers from the curse of being too easy > to hack together a solution for parsing a very specific problem, while > generating a good standard parser takes more work. You're right about creating a good general parser taking more work ;) See also enhancement Bug 2762, GFF capability in SeqIO, which has some discussion. Also, it wasn't clear from your blog if you are thinking about just GFF version 3, or something more general, coping with the assorted comparatively ill defined GFF2 variants. 

> Recently, Peter brought up GFF on the BioSQL mailing list, which
> made me interested in digging into GFF as an input and output flat
> file format for BioSQL databases. Towards this end I put together an
> initial implementation of a GFF (version 3) parser for Biopython. A
> write up and the code are here:
>
> http://bcbio.wordpress.com/2009/03/08/initial-gff-parser-for-biopython/
>
> As described in the post, the GFF interface will be a bit different
> from the standard SeqIO interface, since GFF stores features
> separately from the sequences and also doesn't require features for
> a record to be grouped together.

Regarding where to put this code, if it isn't going to support the
Bio.SeqIO interface then it shouldn't really go in Bio.SeqIO, but
maybe Bio.GFF or Bio.GFF3 instead.

However, you could still fit gff(3) files into Bio.SeqIO, it's just
that the sequence may not be present. This would be similar to GenBank
files, which usually have a long list of features plus the full
sequence, but where the sequence itself may be missing - for example
if there is just a CONTIG line. Or QUAL files from sequencing, where
there is never a sequence.

As with GenBank files for a large genome/chromosome, for a typical GFF
file Bio.SeqIO would just return a single SeqRecord containing all
the features - within the SeqIO API there is no way to offer memory
efficient iteration over the features themselves.

Maybe we need to invent Bio.FeatureIO for this? You could consider
GenBank/EMBL feature tables, GFF files, NCBI protein tables, and
probably a few other formats too.

> As a result, the interface is up for discussion and the best path is to
> start with an implementation and see where it takes us. I'd be grateful
> for any feedback and code from those who are interested. We can discuss
> on the development mailing list or on the blog, and move towards getting
> stable full featured GFF parsing in Biopython.

From the blog post it sounds like you are using sub-features to store
the parent/child relationship between say mRNAs and genes. This is
elegant, but as I wrote on Bug 2762 comment 1, this isn't enough to
cope with the general parent (part-of) relationships allowed in GFF
files - for example an exon may have multiple parents.

There is also the complication that when parsing GenBank files, a gene
or CDS feature with a join-location ends up represented using
sub-features (which probably would be represented with an explicit
intron/exon structure in GFF files) [This is something I don't really
like with the current object structure]. We'd want things to be
fairly uniform between the parsers - for one thing our BioSQL code
currently records a feature with subfeatures as a single feature in
the database.

Peter

From chapmanb at 50mail.com Mon Mar 9 18:42:24 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 9 Mar 2009 18:42:24 -0400
Subject: [BioPython] Initial work on a GFF parser
In-Reply-To: <320fb6e00903090314q19d64af2m4e37918fc3f5f164@mail.gmail.com>
References: <20090308162941.GA99653@kunkel>
  <320fb6e00903090314q19d64af2m4e37918fc3f5f164@mail.gmail.com>
Message-ID: <20090309224224.GA4481@sobchak.mgh.harvard.edu>

Peter;
Thanks much for the feedback.

> See also enhancement Bug 2762, GFF capability in SeqIO, which has some
> discussion.
>
> Also, it wasn't clear from your blog if you are thinking about just
> GFF version 3, or something more general, coping with the assorted
> comparatively ill defined GFF2 variants.

Bug 2762 had a lot of good background and ideas which helped in
getting started. I did take the sub_feature route instead of the
flattened method Leighton suggested there.

Right now this tackles GFF3. The hard part is going to be getting a
framework in place, and then GFF2 or GTF (or GFF2.5 or whatever they
call it) support could be added.
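
For readers unfamiliar with the layout being discussed, here is a minimal sketch of reading a single GFF3 feature line in plain Python. It is not the parser from the blog post: parse_gff3_line is a hypothetical name and the example line is made up for illustration. It assumes the nine tab-separated GFF3 columns, with the last column holding ";"-separated tag=value pairs whose values may be comma-separated lists, which is how one exon can name several parents.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Split one GFF3 feature line into a dictionary (illustrative sketch)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns")
    seqid, source, ftype, start, end, score, strand, phase, attr_text = cols
    attributes = {}
    for pair in attr_text.split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        # Split on commas before unescaping, so escaped commas survive.
        attributes[unquote(tag)] = [unquote(v) for v in value.split(",")]
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),  # GFF3 coordinates are 1-based, inclusive
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


# Made-up example line: one exon shared by two mRNAs.
line = ("ctg123\t.\texon\t1300\t1500\t.\t+\t.\t"
        "ID=exon00001;Parent=mRNA00001,mRNA00002")
feature = parse_gff3_line(line)
print(feature["attributes"]["Parent"])
```

The Parent list here has two entries, so a parser that builds sub-features has to attach the same exon under both parents (or otherwise record the many-to-one link), which is the design question raised above.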

> Regarding where to put this code, if it isn't going to support the
> Bio.SeqIO interface then it shouldn't really go in Bio.SeqIO, but
> maybe Bio.GFF or Bio.GFF3 instead.
>
> However, you could still fit gff(3) files into Bio.SeqIO, it's just
> that the sequence may not be present. This would be similar to GenBank
> files, which usually have a long list of features plus the full
> sequence, but where the sequence itself may be missing - for example
> if there is just a CONTIG line. Or QUAL files from sequencing, where
> there is never a sequence.

Yes, where it lives is a good topic for debate. For GFF files, you'd
at least like the option to add new features to an existing sequence
record, which is what I do here. It would be easy enough to create new
blank records if one is not present initially.

The difficult thing with adding this to the existing syntax is that
the GFF files are not ordered for efficient iteration. You essentially
have to parse the whole file, so something like this would handle the
syntax:

    seq_dict = SeqIO.to_dict(SeqIO.parse(seq_handle, "fasta"))
    final_seq_dict = SeqIO.add_features(gff_handle, "gff3",
                                        initial_dict=seq_dict)

Along these lines, I liked the way you did a sequence/quality dual
iterator for quality output and think that works well when ordering of
the records in multiple files is stable.

> As with GenBank files for large genome/chromosome, for a typical GFF
> file for Bio.SeqIO we'd just return a single SeqRecord containing all
> the features - within the SeqIO API there is no way to offer memory
> efficient iteration over the features themselves.
>
> Maybe we need to invent Bio.FeatureIO for this? You could consider
> GenBank/EMBL feature tables, GFF files, NCBI protein tables, and
> probably a few other formats too.
FeatureIO is something BioPerl has; this page describes the status of GFF in BioPerl but is over a year old so things may have changed: http://www.bioperl.org/wiki/GFF_code_audit The iteration model still falls apart because of the undefined ordering of the file. That is why I settled on the filter approach to limit what you get to a reasonable memory size but still guarantee you've pulled all relevant features before building the parent/child relationships and features. This could also apply to data that comes off cluster runs where the output order will not necessarily correlate with the inputs. The filtering approach could also be useful for large GenBank files, as you could skip adding features and parsing locations for elements you are not interested in. If others find this approach intuitive, it would be worth looking at there as well. > From the blog post it sounds like you are using sub-features to store > the parent/child relationship between say mRNAs and genes. This is > elegant, but as I wrote on Bug 2762 comment 1, this isn't enough to > cope with the general parent (part-of) relationships allowed in GFF > files - for example an exon may have multiple parents. For these the exon is added as a sub_feature to all of its parents. The shared feature is the same one in memory. t_nested_multiparent_features in the test code demonstrates this. How we output it to BioSQL is up for debate but we should also be able to do some sharing there; duplication is also not too bad of an option if it makes it cleaner since these are not likely to be deeply nested. > There is also the complication that when parsing GenBank files, a gene > or CDS feature with a join-location ends up represented using > sub-features (which probably would be represented with an explicit > intron/exon structure in GFF files) [This is something I don't really > like with the current object structure]. 
> We'd want things to be fairly uniform between the parsers - for one
> thing our BioSQL code currently records a feature with subfeatures as
> a single feature in the database.

BioSQL definitely needs work to handle sub_features more generally.
The seqfeature_relationship table in BioSQL can handle these but it
needs to be coded. I agree with you that the way we do it now is a
little too GenBank specific. This is a bit of a larger project since we
should coordinate with the other projects, but as long as we continue
to support the same location mechanism they use currently it will be
back-compatible with older code.

Thanks again for the thoughts,
Brad

From hlapp at gmx.net Mon Mar 9 23:36:30 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 9 Mar 2009 23:36:30 -0400
Subject: [BioPython] Google Summer of Code: Call for Bio* Volunteers
In-Reply-To:
References:
Message-ID:

You may recall my message to the developer lists of several O|B|F
projects in February about the idea of O|B|F applying to Google Summer
of Code as a mentoring organization [1]. I felt that the response to
this was very positive and encouraging.

Although late (sorry, been swamped too much), I've now put up the
skeleton of an ideas page at

http://open-bio.org/wiki/Google_Summer_Code_2009

I basically modeled (in fact, largely copied) this page after the
NESCent Phyloinformatics Summer of Code ideas pages, which I think
worked pretty well. We can completely rework this, though - any
feedback and suggestions are very much welcome.

In the meantime, I need all developers to double check the information
under 'Contact'. Would the open-bio-l mailing list indeed reach the
prospective mentors and other devs? Will you be fine with students
asking for feedback on their applications on the developers (i.e.,
this) list? Is there a blessed IRC where at least some of the
prospective mentors hang out for students to ask questions during the
time they apply?
I also need space for the reference information for all projects that
will participate with at least one project idea (I would hope that
that's all projects) to be added in the 'Open-Bio projects involved'
section.

***** Most important of all, if you can volunteer to mentor a project,
please post a project idea to the page in the respective section,
using the idea template that's there already (copy, paste, and edit).
*****

The deadline for organization applications is Friday this week, Mar
13, which is very soon. The ideas page is a major factor and component
in how Google scores new mentoring organizations - the more we can
show the resourcefulness and diversity of our member projects the more
competitive I think we'll be. So all those who responded earlier with
ideas or willingness to help out as primary or secondary mentors, I
need you to think about and put up your idea(s) now.

Cheers,

-hilmar

[1] http://tinyurl.com/ck7tqe

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From dalloliogm at gmail.com Tue Mar 10 13:06:27 2009
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Tue, 10 Mar 2009 18:06:27 +0100
Subject: [BioPython] can biopython query KEGG directly?
Message-ID: <5aa3b3570903101006q7db19fdq4845f77e45ebc1f7@mail.gmail.com>

Hi,
is it possible to query the KEGG database with biopython?

Currently I do it with KEGG's WSDL API and the Python suds library, and
it works very well, but I was wondering whether there is something
more integrated with biopython: for example, something similar to
Entrez that can automatically retrieve a sequence from NCBI and
transform it to a SeqRecord object.
-- My blog on bioinformatics (now in English): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Tue Mar 10 14:08:01 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 10 Mar 2009 18:08:01 +0000 Subject: [BioPython] can biopython query KEGG directly? In-Reply-To: <5aa3b3570903101006q7db19fdq4845f77e45ebc1f7@mail.gmail.com> References: <5aa3b3570903101006q7db19fdq4845f77e45ebc1f7@mail.gmail.com> Message-ID: <320fb6e00903101108w3bb9b4bfl7ee6bc14bb2fd336@mail.gmail.com> On Tue, Mar 10, 2009 at 5:06 PM, Giovanni Marco Dall'Olio wrote: > Hi, > is it possible to query the KEGG database with biopython? I don't think there is any wrapper for the KEGG online API (yet). See: http://www.genome.jp/kegg/soap/doc/keggapi_manual.html This does sound like a worthwhile addition (especially if the SOAP stuff can be done using only core python libraries included in Python 2.4+) > .. and transform it to a SeqRecord object. We still need a Bio.KEGG gene parser, see also: http://bioperl.org/wiki/KEGG_sequence_format http://lists.open-bio.org/pipermail/biopython/2008-January/004000.html Once that is done, a KEGG wrapper in Bio.SeqIO would make sense. Peter From matzke at berkeley.edu Tue Mar 10 21:18:12 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Tue, 10 Mar 2009 18:18:12 -0700 Subject: [BioPython] GSoC project: Biogeographical and community phylogenetics for BioPython Message-ID: <49B71154.5060109@berkeley.edu> On the advice of Mauricio & Hilmar, I have posted a draft proposal for a Google Summer of Code project: Biogeographical and community phylogenetics for BioPython. http://open-bio.org/wiki/Google_Summer_Code_2009#Biogeographical_and_community_phylogenetics_for_BioPython Comments welcome on- or off-list. Cheers! PS: Also, additional suggestions for pertinent members would be appreciated. Nick -- ==================================================== Nicholas J. Matzke Ph.D. 
student, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From dalloliogm at gmail.com Thu Mar 12 08:33:04 2009 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 12 Mar 2009 13:33:04 +0100 Subject: [BioPython] can biopython query KEGG directly? In-Reply-To: <320fb6e00903101108w3bb9b4bfl7ee6bc14bb2fd336@mail.gmail.com> References: <5aa3b3570903101006q7db19fdq4845f77e45ebc1f7@mail.gmail.com> <320fb6e00903101108w3bb9b4bfl7ee6bc14bb2fd336@mail.gmail.com> Message-ID: <5aa3b3570903120533w1d6bad6fy12b70ebf769deef2@mail.gmail.com> On Tue, Mar 10, 2009 at 7:08 PM, Peter wrote: > On Tue, Mar 10, 2009 at 5:06 PM, Giovanni Marco Dall'Olio > wrote: > > Hi, > > is it possible to query the KEGG database with biopython? > > I don't think there is any wrapper for the KEGG online API (yet). 
> See:
> http://www.genome.jp/kegg/soap/doc/keggapi_manual.html

Well, if someone is in a hurry to query KEGG with SOAP, I have some
scripts (but they use the suds library).

> This does sound like a worthwhile addition (especially if the SOAP
> stuff can be done using only core python libraries included in Python
> 2.4+)

I am not sure whether SOAPpy is included in the core Python libraries,
or whether it has been since Python 2.4. As far as I know, SOAPpy has
not been developed since 2005 (see http://pywebsvcs.sourceforge.net/).
I couldn't test this library, because I still haven't managed to get it
working behind an HTTP proxy :-(.

> > .. and transform it to a SeqRecord object.
>
> We still need a Bio.KEGG gene parser, see also:
> http://bioperl.org/wiki/KEGG_sequence_format
> http://lists.open-bio.org/pipermail/biopython/2008-January/004000.html
> Once that is done, a KEGG wrapper in Bio.SeqIO would make sense.

I am just curious: which object would a KEGG gene file be converted
into? A SeqRecord? And how, exactly? I suppose all the features would
go in SeqRecord.features... but is there any standard convention for
doing so? For example, the codon usage table, class, dblinks, and all
the other fields: how would they be stored?

> Peter

--
My blog on bioinformatics (now in English): http://bioinfoblog.it

From biopython at maubp.freeserve.co.uk Thu Mar 12 10:15:06 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 12 Mar 2009 14:15:06 +0000
Subject: [BioPython] can biopython query KEGG directly?
In-Reply-To: <5aa3b3570903120533w1d6bad6fy12b70ebf769deef2@mail.gmail.com> References: <5aa3b3570903101006q7db19fdq4845f77e45ebc1f7@mail.gmail.com> <320fb6e00903101108w3bb9b4bfl7ee6bc14bb2fd336@mail.gmail.com> <5aa3b3570903120533w1d6bad6fy12b70ebf769deef2@mail.gmail.com> Message-ID: <320fb6e00903120715n7ad57282h529150e22da826e9@mail.gmail.com> On Thu, Mar 12, 2009 at 12:33 PM, Giovanni Marco Dall'Olio wrote: >> We still need a Bio.KEGG gene parser, see also: >> http://bioperl.org/wiki/KEGG_sequence_format >> http://lists.open-bio.org/pipermail/biopython/2008-January/004000.html >> Once that is done, a KEGG wrapper in Bio.SeqIO would make sense. > > I am just curious, but in which object a Kegg gene file would be transposed? > A SeqRecord? And how, exactly? I suppose all the features will go in > SeqRecord.features... but is there any standard convention to do so? > For example, the codon usage table, class, dblinks, and all the other > fields.. how they would be stored? Bio.SeqIO only deals with SeqRecord objects. If we had a KEGG gene parser in Bio.KEGG (written in the same style as the rest of Bio.KEGG ideally), then it would make sense to add a KEGG gene format to Bio.SeqIO, where the KEGG gene records would be parsed using Bio.KEGG and then converted into SeqRecord objects. At a minimum this would mean their id/name/description and sequence - even just that would still be useful I feel. For any richer annotation, the convention is to mimic the GenBank parser as closely as possible. See http://biopython.org/wiki/SeqIO_dev Peter From matzke at berkeley.edu Sat Mar 14 00:59:37 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Fri, 13 Mar 2009 21:59:37 -0700 Subject: [BioPython] Getting protein structure names from primary IDs Message-ID: <49BB39B9.2080206@berkeley.edu> Hi all, This has got to be trivial, but I can't find a hint about the solution online. I want to: 1. 
Search NCBI's structure database for structures from a certain group

from Bio import Entrez
Entrez.email = "A.N.Other at example.com" # Always tell NCBI who you are
handle = Entrez.einfo()
record = Entrez.read(handle)
print "Search the structure database on Organism = Drosophila"
#handle = Entrez.esearch(db="structure", term="Drosophila")
handle = Entrez.esearch(db="structure", term="Drosophila[Orgn]")

pdb_record = Entrez.read(handle)
print pdb_record #["IdList"]

pdblist = pdb_record["IdList"]

OK, now I have a list of primary IDs for the protein structures from
Drosophila.

2. Download those structures. Apparently I have to do this from RCSB
and not NCBI? (NCBI efetch has no information on efetching from the
structure database, and I tried a few obvious methods on analogy to
other databases without result)

This will download from RCSB, but apparently you need the structure
name, not the NCBI primary ID.

from Bio.PDB import *
pdbl=PDBList()
pdbl.retrieve_pdb_file('1FAT')

So, how do I get from primary ID to structure name? I'm sure I'm
missing something obvious.

Cheers,
Nick

From matzke at berkeley.edu Sat Mar 14 01:05:47 2009
From: matzke at berkeley.edu (Nick Matzke)
Date: Fri, 13 Mar 2009 22:05:47 -0700
Subject: [BioPython] Getting protein structure names from primary IDs
In-Reply-To: <49BB39B9.2080206@berkeley.edu>
References: <49BB39B9.2080206@berkeley.edu>
Message-ID: <49BB3B2B.9080900@berkeley.edu>

Hi again -- Esummary was what I needed, so nevermind!

Sorry for the trouble,
Nick

From hlapp at gmx.net Sat Mar 14 18:59:57 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 14 Mar 2009 18:59:57 -0400
Subject: [BioPython] Google Summer of Code: application submitted, action needed
In-Reply-To:
References:
Message-ID: <71A1E85A-2007-4FAE-A03B-475000C5CD38@gmx.net>

Hi all,

I submitted the application yesterday for O|B|F participating in the
2009 Google Summer of Code as a mentoring organization.
The application is at http://docs.google.com/Doc?id=dhs98hzv_7zn8bxqjm and is also linked to from the ideas page at http://open-bio.org/wiki/Google_Summer_of_Code_2009 Now keep your fingers crossed, Google is slated to announce acceptances on March 18. This is the last cross-project message re: Summer of Code that addresses mentors and our projects; future messages that I'll post across projects will be primarily for students such as announcing whether we are accepted or not and issuing calls for application. **What we need most and right now is action from our projects' developers and from possible mentors.** Google admins will start reviewing organization applications on Monday. The ideas page has 6 project ideas right now - though the ideas are good ones, the quantity won't be particularly impressive to Google. Therefore, if you have an idea for a summer project for a student please use the C& template (it is commented out now but you'll see it when you pull the Ideas section into the editor) and put it up there ASAP. If you're not sure yet who'll mentor, put tentative names there. We don't need a full commitment from mentors until the student application period starts (March 23). Next, for all projects, the leads and/or volunteers should check the reference information for their project: http://open-bio.org/wiki/Google_Summer_of_Code_2009#Open-Bio_projects_involved I just culled these links from the various project websites - it'd be much appreciated if going forward everyone can lend a hand in this. Please review what's there and add or fix as you see fit. 
*These links must be correct and complete - otherwise potential students may not find you.* Finally, all prospective mentors, primary or secondary, committed or not, and anyone else who would like to volunteer to help out, should subscribe themselves ASAP to the mailing list for communicating GSoC- related administrivia: http://lists.open-bio.org/mailman/listinfo/gsoc I will *not* cross-post all administrative announcements or requests for information, and so you *will* miss information if you don't subscribe yourself there. (Note: students will be subscribed there only *after* acceptance). Those who are considering to mentor, primary or helping out, please also add yourselves to the Mentors section on the Ideas page (and check your link if you're already there): http://open-bio.org/wiki/Google_Summer_of_Code_2009#Mentors Cheers everyone, and fingers crossed! -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From mjldehoon at yahoo.com Sun Mar 15 06:25:43 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sun, 15 Mar 2009 03:25:43 -0700 (PDT) Subject: [BioPython] Bio.SwissProt.SProt Dictionary, index_file Message-ID: <653996.59295.qm@web62408.mail.re1.yahoo.com> Hi everybody, Does anybody use the Dictionary class or index_file function in Bio.SwissProt.SProt? As far as I can tell these functions are broken. If there are no users, I suggest we deprecate the Dictionary class and the index_file function in Bio.SwissProt.SProt. 
--Michiel

From biopython at maubp.freeserve.co.uk Mon Mar 16 09:40:13 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 16 Mar 2009 13:40:13 +0000
Subject: [BioPython] List of publications citing or using Biopython
Message-ID: <320fb6e00903160640j73289abbl51d9f8935184a760@mail.gmail.com>

Hi all,

I've been working on a listing of journal publications citing or using
Biopython for the website:
http://biopython.org/wiki/Publications

If you've published anything that qualifies that isn't listed, this is
a wiki page so you should be able to add it. If you are unsure if
something is appropriate, please ask here on the mailing list.

For publications from 2008 onwards I have tried to add a short note
saying which part(s) of Biopython were used - this should be easy to
write for your own recent papers ;)

If you try editing the page you should see how to add extra entries -
for anything in PubMed this is really easy. See the discussion page
for more details:
http://biopython.org/wiki/Talk:Publications

Peter

From matzke at berkeley.edu Mon Mar 16 15:31:57 2009
From: matzke at berkeley.edu (Nick Matzke)
Date: Mon, 16 Mar 2009 12:31:57 -0700
Subject: [BioPython] Entrez.einfo error?
Message-ID: <49BEA92D.7040905@berkeley.edu>

Hi all,

This exact code worked fine for me on Friday, I wonder if it could be a
temporary problem at Entrez? A similar problem seems to occur with
other Entrez queries.

Running biopython 1.49 in IPython...

============
from Bio import Entrez

Entrez.email = "matzke at berkeley.edu"

handle = Entrez.einfo(db="structure")

---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)

/bioinformatics/pyeg/ in ()

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/site-packages/Bio/Entrez/__init__.pyc in einfo(cgi, **keywds)
    195     variables = {}
    196     variables.update(keywds)
--> 197     return _open(cgi, variables)
    198
    199 def esummary(cgi=None, **keywds):

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/site-packages/Bio/Entrez/__init__.pyc in _open(cgi, params)
    320     options = urllib.urlencode(params, doseq=True)
    321     cgi += "?" + options
--> 322     handle = urllib.urlopen(cgi)
    323
    324     # Wrap the handle inside an UndoHandle.

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/urllib.pyc in urlopen(url, data, proxies)
     80         opener = _urlopener
     81     if data is None:
---> 82         return opener.open(url)
     83     else:
     84         return opener.open(url, data)

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/urllib.pyc in open(self, fullurl, data)
    188         try:
    189             if data is None:
--> 190                 return getattr(self, name)(url)
    191             else:
    192                 return getattr(self, name)(url, data)

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/urllib.pyc in open_http(self, url, data)
    323         if realhost: h.putheader('Host', realhost)
    324         for args in self.addheaders: h.putheader(*args)
--> 325         h.endheaders()
    326         if data is not None:
    327             h.send(data)

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/httplib.pyc in endheaders(self)
    858             raise CannotSendHeader()
    859
--> 860         self._send_output()
    861
    862     def request(self, method, url, body=None, headers={}):

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/httplib.pyc in _send_output(self)
    730         msg = "\r\n".join(self._buffer)
    731         del self._buffer[:]
--> 732         self.send(msg)
    733
    734     def putrequest(self, method, url, skip_host=0, skip_accept_encoding=0):

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/httplib.pyc in send(self, str)
    697         if self.sock is None:
    698             if self.auto_open:
--> 699                 self.connect()
    700             else:
    701                 raise NotConnected()

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/httplib.pyc in connect(self)
    665         msg = "getaddrinfo returns an empty list"
    666         for res in socket.getaddrinfo(self.host, self.port, 0,
--> 667                                       socket.SOCK_STREAM):
    668             af, socktype, proto, canonname, sa = res
    669             try:

IOError: [Errno socket error] (7, 'No address associated with nodename')
> /Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/httplib.py(667)connect()
    666         for res in socket.getaddrinfo(self.host, self.port, 0,
--> 667                                       socket.SOCK_STREAM):
    668             af, socktype, proto, canonname, sa = res

ipdb> record = Entrez.read(handle)
*** NameError: name 'Entrez' is not defined
============

From matzke at berkeley.edu Mon Mar 16 15:42:22 2009
From: matzke at berkeley.edu (Nick Matzke)
Date: Mon, 16 Mar 2009 12:42:22 -0700
Subject: [BioPython] Entrez.einfo error?
In-Reply-To: <49BEA92D.7040905@berkeley.edu>
References: <49BEA92D.7040905@berkeley.edu>
Message-ID: <49BEAB9E.7070707@berkeley.edu>

Looks like PubMed is down at the moment also, so it's all an NCBI
problem.

Cheers!
Nick

From biopython at maubp.freeserve.co.uk Mon Mar 16 15:52:30 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 16 Mar 2009 19:52:30 +0000
Subject: [BioPython] Entrez.einfo error?
In-Reply-To: <49BEA92D.7040905@berkeley.edu> References: <49BEA92D.7040905@berkeley.edu> Message-ID: <320fb6e00903161252s9f41eecx56853a0cc9a76882@mail.gmail.com> On Mon, Mar 16, 2009 at 7:31 PM, Nick Matzke wrote: > Hi all, > > This exact code worked fine for me on Friday, I wonder if it could be a > temporary problem at Entrez? A similar problem seems to occur with other > Entrez queries. > > Running biopython 1.49 in IPython... > > ============ > from Bio import Entrez > Entrez.email = "matzke at berkeley.edu" > handle = Entrez.einfo(db="structure") > --------------------------------------------------------------------------- > IOError Traceback (most recent call last) > ... Yes, I think you were experiencing a temporary problem, either at the NCBI or somewhere else on the network. Its working now on my machine right now. In general an IOError in Bio.Entrez is a good sign of a network issue, and for any complex task you may want to explicitly catch these exceptions. Peter From mgenome at gmail.com Tue Mar 17 08:02:42 2009 From: mgenome at gmail.com (mgenome) Date: Tue, 17 Mar 2009 21:02:42 +0900 Subject: [BioPython] How can I draw genome comparison figure to publish? Message-ID: I have the whole genome sequence of a phage to compare it's ORFs to those of other related phages. I want to draw a comparison figure of two or more genomes. Two genomes should be compared by their ORFs similarities calculated by BLASTP or stretcher etc. If there is a table like this ORF1, start, stop, strand, ORF2, start, stop, strand, similarity, genome1_ORF1, 1, 200, +, genome2_ORF1, 1, 300, -, 50 genome1_ORF2, 201, 400, +, genome2_ORF3, 320, 500, -, 90 .... the programs or library should draw as follows; ===> ===> .... | | | | | | <=== <=== .... Their different similarities should be represented by different colors of linker lines. I examined several programs, but I didn't find the program good enough to use for publication. 
ACT (Artemics) can draw comparison figure but it can not show ORFs well. inGeno is the program close to what I want. But It cannot compare multiple genomes and I want to draw ORF as arrows. I know GenomeDaigrams in python do not support comparison of ORFs in genomic level. Does anybody know a program and library to draw genome comparion figure showing ORF comparison. I known that it is stupid to want a perfect program to fulfill all my requirments, but I want to find program or library to fulfill a part of my requirements. Thank you in advance. Kyoung-Ho Kim, Korea. From lpritc at scri.ac.uk Tue Mar 17 08:42:36 2009 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Tue, 17 Mar 2009 12:42:36 +0000 Subject: [BioPython] How can I draw genome comparison figure to publish? In-Reply-To: Message-ID: Hi Kyoung-Ho, On 17/03/2009 12:02, "mgenome" wrote: > Two genomes should be compared by their ORFs similarities calculated by > BLASTP or stretcher etc. > > If there is a table like this > > ORF1, start, stop, strand, ORF2, start, stop, strand, similarity, > genome1_ORF1, 1, 200, +, genome2_ORF1, 1, 300, -, 50 > genome1_ORF2, 201, 400, +, genome2_ORF3, 320, 500, -, 90 > .... > > the programs or library should draw as follows; > ===> ===> .... > | | > | | > | | > <=== <=== .... > Their different similarities should be represented by different colors of > linker lines. > > I examined several programs, but I didn't find the program good enough to > use for publication. > ACT (Artemics) can draw comparison figure but it can not show ORFs well. > inGeno is the program close to what I want. But It cannot compare multiple > genomes and I want to draw ORF as arrows. I know GenomeDaigrams in python do > not support comparison of ORFs in genomic level. GenomeDiagram does not draw the linker lines you require, I'm afraid. The package I would use to do so is ACT, and I have published diagrams created using ACT (figure 3 in http://dx.doi.org/10.1073/pnas.0402424101). 
There is also M-GCAT (http://alggen.lsi.upc.es/recerca/align/mgcat/intro-mgcat.html), which is very similar to ACT, and perhaps so similar that it will have the same problems when generating publication-quality images to your liking. GCV (http://zamov.online.fr/projects/gct/) I've never tried. > Does anybody know a program and library to draw genome comparion figure > showing ORF comparison. I known that it is stupid to want a perfect program > to fulfill all my requirments, but I want to find program or library to > fulfill a part of my requirements. GenomeDiagram does not currently have a facility to indicate synteny in the way that you require using linker lines, so it may not be the tool you need just yet. However, it has been used to indicate the results of comparisons between ORFs on the whole-genome level, using the colours of the compared features to indicate the sequence identities of the matches (e.g. Figure 2 in http://dx.doi.org/10.1146/annurev.phyto.44.070505.143444 and http://apsjournals.apsnet.org/doi/abs/10.1094). Cheers, L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
From biopython at maubp.freeserve.co.uk Tue Mar 17 08:51:18 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 17 Mar 2009 12:51:18 +0000 Subject: [BioPython] How can I draw genome comparison figure to publish? In-Reply-To: References: Message-ID: <320fb6e00903170551u284b1f20v4a77fedd7bdbfbed@mail.gmail.com> On Tue, Mar 17, 2009 at 12:02 PM, mgenome wrote: > ... I examined several programs, but I didn't find the program good enough > to use for publication. > ACT (Artemics) can draw comparison figure but it can not show ORFs well. > inGeno is the program close to what I want. But It cannot compare multiple > genomes and I want to draw ORF as arrows. I know GenomeDaigrams in > python do not support comparison of ORFs in genomic level. Based on your description, I was going to suggest ACT (Artemics), but you have already considered this. GenomeDiagram has been integrated into Biopython and will be part of Biopython 1.50, and as part of this work it does now support drawing features (e.g. ORFs) as simple arrows. GenomeDiagram is very good at comparative genomics plots - but not the kind you are interested in. It wouldn't be very elegant, but you might be able to use GenomeDiagram to draw two linear genome diagrams, and then combine this and add the comparison lines on yourself with extra code using ReportLab directly. This would probably be quite a lot of work...
Peter

From biopython at maubp.freeserve.co.uk Tue Mar 17 12:52:23 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 17 Mar 2009 16:52:23 +0000
Subject: [BioPython] Biopython contributors and participants listings
Message-ID: <320fb6e00903170952t329332aer310906da64f49cb6@mail.gmail.com>

Hi all,

We're starting to prepare for the release of Biopython 1.50, so it seems a good occasion to update the Biopython contributors and participants listing.

I've just changed the formatting for the wiki page, and to me at least this looks much nicer now - you can look at the history and decide for yourselves:
http://biopython.org/wiki/Participants

I see some of you aren't on this participants wiki page and probably should be (e.g. Tiago), so could I encourage relevant people to add themselves?

Likewise, if you have contributed to the project and think you have been left out of the contributors file, please let us know:
http://biopython.org/SRC/biopython/CONTRIB
or:
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/CONTRIB?cvsroot=biopython

Peter

From biopython at maubp.freeserve.co.uk Tue Mar 17 13:38:58 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 17 Mar 2009 17:38:58 +0000
Subject: [BioPython] [Biopython-dev] PDB Parser error
In-Reply-To: <3715adb70903171034s2124de04k7e4ee719c188902a@mail.gmail.com>
References: <3715adb70903170830x61bb6e3bl4412a8cf1504d80c@mail.gmail.com> <320fb6e00903170901v6533910bl57ddd534dc05cf51@mail.gmail.com> <3715adb70903171034s2124de04k7e4ee719c188902a@mail.gmail.com>
Message-ID: <320fb6e00903171038m72127569m279801556e5b9551@mail.gmail.com>

On Tue, Mar 17, 2009 at 5:34 PM, Rodrigo faccioli wrote:
> Peter,
>
> Your suspicion was correct. When I received a database value, it was stored
> in a tuple. The solution was to convert the values to string
> objects. For this, I used the str command.
>
> Now, I can proceed with my tests.
>
> Thanks for your help.

OK, good luck.
Peter From mitlox at op.pl Wed Mar 18 05:05:58 2009 From: mitlox at op.pl (mitlox) Date: Wed, 18 Mar 2009 19:05:58 +1000 Subject: [BioPython] protein-ligand interactions Message-ID: <49C0B976.1020005@op.pl> Hello, I have a solved structure (1E8W) with a ligand and I would like to know which residues are within 3A of the ligand. This 3A is a cut off and should be using just for the C-alpha in each residue, but it would be great if I know which C-alpha belongs to a residue. I am newbie in Biopython/Python, maybe anyone know an example how is it possible? Thank you in advance. Best regards From p.j.a.cock at googlemail.com Wed Mar 18 05:31:14 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 18 Mar 2009 09:31:14 +0000 Subject: [BioPython] protein-ligand interactions In-Reply-To: <49C0B976.1020005@op.pl> References: <49C0B976.1020005@op.pl> Message-ID: <320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com> On Wed, Mar 18, 2009 at 9:05 AM, mitlox wrote: > > Hello, > I have a solved structure (1E8W) with a ligand and I would like to > know which residues are within 3A of the ligand. This 3A is a cut > off and should be using just for the C-alpha in each residue, but > it would be great if I know which C-alpha belongs to a residue. > > I am newbie in Biopython/Python, maybe anyone know an > example how is it possible? Hi, I've got a couple of PDB examples on my personal website, and although they need a little update to use NumPy instead of Numeric, I think the page on doing protein contact maps would be very informative: http://www.warwick.ac.uk/go/peter_cock/python/protein_contact_map/ In your case, for the protein in each residue you'll want to use just the C-alpha atom (in the residue's atom dictionary under the key "CA"), but I think you should loop over all the residues in the ligand in order to find the least distance. 
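As a rough illustration of the approach described above (take only the C-alpha of each protein residue, loop over every ligand atom, and keep the least distance), here is a minimal sketch. It deliberately uses plain coordinate tuples so it runs without a structure file; the function name `residues_near_ligand`, the residue identifiers and all coordinates are made up for the example and are not taken from 1E8W. With Bio.PDB the C-alpha coordinates would presumably come from `residue["CA"].get_coord()` and the ligand coordinates from the atoms of the ligand's hetero residue. It assumes Python 3.8 or later for `math.dist`.

```python
import math


def residues_near_ligand(ca_coords, ligand_coords, cutoff=3.0):
    """Return {residue_id: least_distance} for every residue whose
    C-alpha lies within `cutoff` angstroms of any ligand atom."""
    hits = {}
    for res_id, ca in ca_coords.items():
        # Least distance from this C-alpha to any atom of the ligand
        nearest = min(math.dist(ca, atom) for atom in ligand_coords)
        if nearest <= cutoff:
            hits[res_id] = nearest
    return hits


# Hypothetical coordinates, purely for illustration (not from a real PDB file)
ca_coords = {
    ("A", 10): (0.0, 0.0, 0.0),
    ("A", 11): (3.8, 0.0, 0.0),
    ("A", 12): (7.6, 0.0, 0.0),
}
ligand_coords = [(0.0, 2.0, 0.0), (4.0, 2.5, 0.0)]

near = residues_near_ligand(ca_coords, ligand_coords)
for res_id in sorted(near):
    print(res_id, round(near[res_id], 2))
```

Because the dictionary is keyed by residue, each distance stays attached to the residue its C-alpha belongs to, which is what the original question asked for. For a large structure a spatial index would likely be faster than this all-against-all loop, but for a single ligand the simple version is usually adequate.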
Peter

From p.j.a.cock at googlemail.com Wed Mar 18 08:36:06 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 18 Mar 2009 12:36:06 +0000
Subject: [BioPython] protein-ligand interactions
In-Reply-To: <320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com>
References: <49C0B976.1020005@op.pl> <320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com>
Message-ID: <320fb6e00903180536i5129c6ccu844bc193858321a4@mail.gmail.com>

> Hi,
>
> I've got a couple of PDB examples on my personal website, and although
> they need a little update to use NumPy instead of Numeric, I think the
> page on doing protein contact maps would be very informative:
> http://www.warwick.ac.uk/go/peter_cock/python/protein_contact_map/

I've updated those pages to use NumPy instead of Numeric - all very straight forward (apart from some issue with rpy for the graphics which isn't relevant to Biopython):

http://www.warwick.ac.uk/go/peter_cock/python/protein_contact_map/
http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/

Peter

From dalke at dalkescientific.com Wed Mar 18 11:34:59 2009
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 18 Mar 2009 16:34:59 +0100
Subject: [BioPython] Fwd: Available: 2 Bioinformatics positions in AstraZeneca
References: <45988AB300A3B1468F7CF2F9EF207579C6A3D7@SEMLRDEMBX01.rd.astrazeneca.net>
Message-ID:

For those interested, there's a couple of temporary bioinformatics positions at AstraZeneca/Mölndal (near Gothenburg). Reading the announcements, which are in Swedish, I see it's more biomedical informatics than sequence analysis (text mining, workflows, a decision system for medical researchers).

> Ads:
> https://www.poolia.se/sok-jobb/webcv/JobAd.aspx?jobadid=19008
> http://annonsoversikt.monster.se/getjob.aspx?JobID=79909293&cy=se&where=L%c3%a4n%3aV%c3%a4stra+G%c3%b6taland&lid=1398&re=95&pg=1&dv=1&AVSDM=2009-03-13+11%3a17%3a00&seq=11&fseo=1&isjs=1&re=1000
> https://sjobs.brassring.com/1053/ASP/TG/cim_jobdetail.asp?SID=%5edUuKAW_slp_rhc_DOlGOwdxDn_slp_rhc_PthlP/WlgiP85aWAkz/xRYSIbMXcsvZrHO0fJu5/PZdH3vw1QoLQAr5X3A_C_R__L_F_lA_slp_rhc_0Q7alykZpdfns2LzK3W8x8tde_slp_rhc_tU=&jobId=275215&type=search&partnerid=20054&siteid=5036

Also, if you are doing Python in the Gothenburg area, join us for GothPy, the Gothenburg Python user's group:
http://groups.google.com/group/gothpy

Andrew
dalke at dalkescientific.com

From n.j.loman at bham.ac.uk Wed Mar 18 13:59:09 2009
From: n.j.loman at bham.ac.uk (Nick Loman)
Date: Wed, 18 Mar 2009 17:59:09 +0000
Subject: [BioPython] [Fwd: Bioinformatician wanted]
Message-ID: <49C1366D.7070105@bham.ac.uk>

Hi all,

I hope biopython'ers will excuse me posting this job advert for a Research Fellow at University of Birmingham - the project referenced makes heavy use of Biopython. The position holder would interact with Biopython on a daily basis, and potentially be able to help the Biopython open source effort should they wish.

Cheers,

Nick.

Please pass this advert on to anyone who might be interested and suitable.

http://www.jobs.ac.uk/jobs/BO446/

Research Fellow
School of Immunity and Infection
*Fixed term for 33 months*

We are looking for a talented bioinformatician to assist in the development, maintenance and exploitation of an internationally renowned web-based microbial genomics facility, xBASE. The post holder will build on our existing achievements with xBASE (http://xbase.ac.uk ; Chaudhuri RR, Loman NJ, Snyder LA, Bailey CM, Stekel DJ, Pallen MJ. Nucleic Acids Res. 2008 36:D543-6). The work will be carried out under the supervision of Professor Mark Pallen (Medical School) in collaboration with Dr Dov Stekel (Biosciences).
The post holder will work within an attractive modern research environment in the University's newly established inter-disciplinary Centre for Systems Biology.

All candidates must have proficiency in programming within the Unix/Linux environment, including web-linked database design, development and management and use of languages such as Perl, PHP, C++, Python, Ruby or JAVA. Familiarity with BioPerl, BioSQL and MySQL is highly desirable. Applicants must possess the critical thinking skills needed to devise and carry out research projects and should have experience of analysing macromolecular sequence data. A PhD in a relevant subject area is desirable and will be required for appointment to a research fellowship.

A flair for design, particularly as applied to web-based resources, good team-working skills and an ability to work under their own initiative will provide an advantage, as will experience of research in molecular bacteriology, comparative genomics, molecular evolution and/or pathogenesis.

Informal enquiries may be addressed to Professor Mark Pallen on 0121 414 7163 or m.pallen at bham.ac.uk

Starting salary £27,183 a year, in the range of £27,183 to £35,469 a year (potential progression on performance once in post to £37,651).

The post will be offered on a fixed-term contract for a period up to two years and nine months, starting on or shortly after May 1st 2009. Interviews will be held in the week beginning Monday 30 March 2009.

Closing date: 23 March 2009
Reference: 39855

To download the details and submit an electronic application online visit: www.hr.bham.ac.uk/jobs alternatively information can be obtained from 0121 415 9000.

A University of Fairness and Diversity.
Mark Professor Mark Pallen Professor of Microbial Genomics Centre for Systems Biology Biosciences University of Birmingham, BIRMINGHAM, B15 2TT m.pallen at bham.ac.uk tel ++44(0)121 414 7163 Author: The Rough Guide to Evolution http://www.amazon.co.uk/Rough-Guide-Evolution-Science-Phenomena/dp/1858289467/ Blog http://roughguidetoevolution.blogspot.com feed://roughguidetoevolution.blogspot.com/feeds/posts/default "There is grandeur in this view of life, with its several powers, having been originally breathed into a few forms or into one; and that, whilst this planet has gone cycling on according to the fixed law of gravity, from so simple a beginning endless forms most beautiful and most wonderful have been, and are being evolved." Charles Darwin
From hlapp at gmx.net Wed Mar 18 14:45:50 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 18 Mar 2009 14:45:50 -0400 Subject: [BioPython] OBF application for Summer of Code has been rejected Message-ID: <44D1FAFD-B5D7-418B-9FDA-6945219A5481@gmx.net> I hope to find out later why, but our Google Summer of Code application as an umbrella org has been rejected. However, NESCent has been accepted. If you can give your project idea a phylogenetics/phyloinformatics focus, go and put it up on the NESCent ideas page at http://hackathon.nescent.org/Phyloinformatics_Summer_of_Code_2009 Do so pretty much **now** - we will start broadcasting and reaching out to students tonight and tomorrow. If someone comes to the site and they don't see a Bio* project that they would have been interested in, they may not check back for updates.
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From Yvan.Strahm at bccs.uib.no Wed Mar 18 14:47:58 2009
From: Yvan.Strahm at bccs.uib.no (Yvan.Strahm at bccs.uib.no)
Date: Wed, 18 Mar 2009 19:47:58 +0100
Subject: [BioPython] How can I get a more explicite error
Message-ID: <20090318194758.pgs14nxoowww4gck@webmail.uib.no>

Hello List,

I am trying to get a grip on Biopython and followed chapter 6 from the tutorial (http://www.biopython.org/DIST/docs/tutorial/Tutorial.html)

I ran this script:

from Bio.Blast import NCBIStandalone
import re
import sys

my_blast_db = "/export/scratch/yvans/BEE/Apis_mellifera_ligustica_complete_mitochondrial_genome.fasta"
my_blast_file = sys.argv[1]
my_blast_exe = "/Home/lundalm/yvans/src/blast-2.2.19/bin/blastall"

result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe, "blastn", my_blast_db, my_blast_file, gap_open=5, gap_extend=2, filter ='F', expectation=1000)

blast_results = result_handle.read()
my_results=sys.argv[1]+".xml"
save_file = open(my_results, "w")
save_file.write(blast_results)
save_file.close()

I got this error:

[yvans at lundalm BEE]$ python bioblast.py s_1_2_eland_extended.8000000.fta
Traceback (most recent call last):
  File "bioblast.py", line 16, in
    blast_results = result_handle.read()
SystemError: Objects/stringobject.c:4271: bad argument to internal function

This happens if the number of sequences blasted against the db is greater than 500000. The sequences are small reads from a Solexa sequencing project.

Is there a size limitation? Should I save (keep) only the sequences I am interested in into my_results instead of saving everything? And is there a way of running some tests before doing the result_handle.read()?

For now I will try keep_hits=1 as a BLAST parameter in order to reduce the size of my_results; we will see.
Thanks for your time and help Cheers, yvan From cjfields at illinois.edu Wed Mar 18 15:08:48 2009 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 18 Mar 2009 14:08:48 -0500 Subject: [BioPython] [BioSQL-l] OBF application for Summer of Code has been rejected In-Reply-To: <44D1FAFD-B5D7-418B-9FDA-6945219A5481@gmx.net> References: <44D1FAFD-B5D7-418B-9FDA-6945219A5481@gmx.net> Message-ID: Hilmar, The idea was floated on the google SOC list that language-specific organizations that have been accepted may potentially take bioinformatics-related applications. Specifically, Jonathan Leto (from The Perl Foundation) indicated that bioinformatics-related projects using BioPerl might be able to apply through them. Not sure about others (Python Software Foundation, etc) but might be worth checking into. Any idea on who's been accepted beyond NEScent? chris On Mar 18, 2009, at 1:45 PM, Hilmar Lapp wrote: > I hope to find out later why, but our Google Summer of Code > application as an umbrella org has been rejected. > > However, NESCent has been accepted. If you can give your project > idea a phylogenetics/phyloinformatics focus, go and put it up on the > NESCent ideas page at > > http://hackathon.nescent.org/Phyloinformatics_Summer_of_Code_2009 > > Do so pretty much **now** - we will start broadcasting and reaching > out to students tonight and tomorrow. If someone comes to the site > and they don't see a Bio* project that they would have been > interested in, they may not check back for updates. 
> > -hilmar > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l From chapmanb at 50mail.com Wed Mar 18 17:20:07 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 18 Mar 2009 17:20:07 -0400 Subject: [BioPython] How can I get a more explicite error In-Reply-To: <20090318194758.pgs14nxoowww4gck@webmail.uib.no> References: <20090318194758.pgs14nxoowww4gck@webmail.uib.no> Message-ID: <20090318212007.GM57054@sobchak.mgh.harvard.edu> Hi Yvan; > I try to get a grip on Biopython and followed the chapter 6 form the > tutorial (http://www.biopython.org/DIST/docs/tutorial/Tutorial.html) > > I run this script: [...] > blast_results = result_handle.read() [...] > [yvans at lundalm BEE]$ python bioblast.py s_1_2_eland_extended.8000000.fta > Traceback (most recent call last): > File "bioblast.py", line 16, in > blast_results = result_handle.read() > SystemError: Objects/stringobject.c:4271: bad argument to internal function > > if the number of sequence blasted agianst the db is greater than 500000. > The sequence are small reads from a solexa sequencing project. The result_handle.read() line is pulling the entire large BLAST result file into memory as a string. You will run out of memory with huge files, leading to the errors you are seeing. To limit the problem, run BLAST initially at the command line, and then process the resulting XML file with the BLAST parser as described here: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc56 This iterates over 1 record at a time, avoiding the memory issue. However, you should be using a short read aligner to map these reads to the genome. 
BLAST is not the right tool for this particular application; massive BLAST report files are going to be one of many problems you will run into analyzing the data. Here are a couple of popular aligners designed for the exact problem you are tackling: Bowtie: http://bowtie-bio.sourceforge.net/index.shtml Maq: http://maq.sourceforge.net/ Hope this helps, Brad From hlapp at gmx.net Wed Mar 18 18:50:26 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 18 Mar 2009 18:50:26 -0400 Subject: [BioPython] [BioSQL-l] OBF application for Summer of Code has been rejected In-Reply-To: References: <44D1FAFD-B5D7-418B-9FDA-6945219A5481@gmx.net> Message-ID: Yes, thanks for mentioning that, was going to do so too. The Perl Foundation and the Python foundation have been accepted. I guess there isn't a Java Foundation, and if there is a Ruby one it hasn't been accepted or hasn't applied. However, Ruby on Rails has been accepted. Don't know how open they would be a Bioruby project. -hilmar On Mar 18, 2009, at 3:08 PM, Chris Fields wrote: > Hilmar, > > The idea was floated on the google SOC list that language-specific > organizations that have been accepted may potentially take > bioinformatics-related applications. Specifically, Jonathan Leto > (from The Perl Foundation) indicated that bioinformatics-related > projects using BioPerl might be able to apply through them. Not > sure about others (Python Software Foundation, etc) but might be > worth checking into. > > Any idea on who's been accepted beyond NEScent? > > chris > > On Mar 18, 2009, at 1:45 PM, Hilmar Lapp wrote: > >> I hope to find out later why, but our Google Summer of Code >> application as an umbrella org has been rejected. >> >> However, NESCent has been accepted. 
If you can give your project >> idea a phylogenetics/phyloinformatics focus, go and put it up on >> the NESCent ideas page at >> >> http://hackathon.nescent.org/Phyloinformatics_Summer_of_Code_2009 >> >> Do so pretty much **now** - we will start broadcasting and reaching >> out to students tonight and tomorrow. If someone comes to the site >> and they don't see a Bio* project that they would have been >> interested in, they may not check back for updates. >> >> -hilmar >> >> -- >> =========================================================== >> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : >> =========================================================== >> >> >> >> _______________________________________________ >> BioSQL-l mailing list >> BioSQL-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biosql-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From yvan.strahm at bccs.uib.no Thu Mar 19 04:10:17 2009 From: yvan.strahm at bccs.uib.no (Yvan Strahm) Date: Thu, 19 Mar 2009 09:10:17 +0100 Subject: [BioPython] How can I get a more explicite error In-Reply-To: <20090318212007.GM57054@sobchak.mgh.harvard.edu> References: <20090318194758.pgs14nxoowww4gck@webmail.uib.no> <20090318212007.GM57054@sobchak.mgh.harvard.edu> Message-ID: <49C1FDE9.20305@bccs.uib.no> Hello Brad, Thanks for the help, much appreciated. I will look at bowtie and Maq. In fact I am interested into reads which are not in the reference and how they differ from the reference, how many reads have 1,2,3,.... indels/mismatch. Cheers, yvan Brad Chapman wrote: > Hi Yvan; > >> I try to get a grip on Biopython and followed the chapter 6 form the >> tutorial (http://www.biopython.org/DIST/docs/tutorial/Tutorial.html) >> >> I run this script: > [...] >> blast_results = result_handle.read() > [...] 
>> [yvans at lundalm BEE]$ python bioblast.py s_1_2_eland_extended.8000000.fta >> Traceback (most recent call last): >> File "bioblast.py", line 16, in >> blast_results = result_handle.read() >> SystemError: Objects/stringobject.c:4271: bad argument to internal function >> >> if the number of sequence blasted agianst the db is greater than 500000. >> The sequence are small reads from a solexa sequencing project. > > The result_handle.read() line is pulling the entire large BLAST result > file into memory as a string. You will run out of memory with huge files, > leading to the errors you are seeing. > > To limit the problem, run BLAST initially at the command line, > and then process the resulting XML file with the BLAST parser > as described here: > > http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc56 > > This iterates over 1 record at a time, avoiding the memory issue. > > However, you should be using a short read aligner to map these reads > to the genome. BLAST is not the right tool for this particular > application; massive BLAST report files are going to be one of many > problems you will run into analyzing the data. 
Here are a couple of > popular aligners designed for the exact problem you are tackling: > > Bowtie: http://bowtie-bio.sourceforge.net/index.shtml > Maq: http://maq.sourceforge.net/ > > Hope this helps, > Brad > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Thu Mar 19 06:47:05 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 19 Mar 2009 10:47:05 +0000 Subject: [BioPython] [Fwd: Bioinformatician wanted] In-Reply-To: <49C1366D.7070105@bham.ac.uk> References: <49C1366D.7070105@bham.ac.uk> Message-ID: <320fb6e00903190347v3a70b6b0w46033c5769b38aa5@mail.gmail.com> On Wed, Mar 18, 2009 at 5:59 PM, Nick Loman wrote: > Hi all, > > I hope biopython'ers will excuse me posting this job advert for a Research > Fellow at University of Birmingham - the project referenced makes heavy use > of Biopython. The position holder would interact with Biopython on a daily > basis, and potentially be able to help the Biopython open source effort > should they wish. > > Cheers, > > Nick. I have no objections to posting targeted and directly relevant academic jobs adverts here - in fact I rather like it. I would point out the job advert text itself doesn't actually mention Biopython - perhaps you can get HR to amend the copy linked to from the University job page updated to mention experience of Biopython, BioPerl or BioSQL being desirable? Peter P.S. Could you add links to Biopython, BioPerl and BioSQL to the xBase website, maybe on the about page? http://xbase.bham.ac.uk/about.pl P.P.S. Did you have a chance to try out the patch on Bug 2738 for speeding up loading GenBank files into BioSQL? http://bugzilla.open-bio.org/show_bug.cgi?id=2738 Cheers! 
From biopython at maubp.freeserve.co.uk Thu Mar 19 06:52:30 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 19 Mar 2009 10:52:30 +0000
Subject: [BioPython] How can I get a more explicite error
In-Reply-To: <20090318212007.GM57054@sobchak.mgh.harvard.edu>
References: <20090318194758.pgs14nxoowww4gck@webmail.uib.no> <20090318212007.GM57054@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00903190352rcbca60bi4d703dbf65bcd3b0@mail.gmail.com>

On Wed, Mar 18, 2009 at 9:20 PM, Brad Chapman wrote:
> The result_handle.read() line is pulling the entire large BLAST result
> file into memory as a string. You will run out of memory with huge files,
> leading to the errors you are seeing.

I think Brad is probably right about the memory issue - it is certainly something to be careful of. Instead of this:

blast_results = result_handle.read()
my_results=sys.argv[1]+".xml"
save_file = open(my_results, "w")
save_file.write(blast_results)
save_file.close()

You could try keeping only one line in memory at a time:

my_results=sys.argv[1]+".xml"
save_file = open(my_results, "w")
for line in result_handle :
    save_file.write(line)
save_file.close()

Or, we should get round to fixing Bug 2654 which would let you tell the BLAST tool to save the file itself, which would be much more elegant.
Do you want to add yourself as a CC to this bug, so you'll automatically be informed of any updates: http://bugzilla.open-bio.org/show_bug.cgi?id=2654 Peter From mitlox at op.pl Thu Mar 19 08:55:06 2009 From: mitlox at op.pl (mitlox) Date: Thu, 19 Mar 2009 22:55:06 +1000 Subject: [BioPython] protein-ligand interactions In-Reply-To: <320fb6e00903180536i5129c6ccu844bc193858321a4@mail.gmail.com> References: <49C0B976.1020005@op.pl> <320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com> <320fb6e00903180536i5129c6ccu844bc193858321a4@mail.gmail.com> Message-ID: <49C240AA.908@op.pl> I wrote this code: ------------------------------------------------------------------------------------------------ import Bio.PDB import numpy pdb_code = "1E8W" pdb_filename = "1E8W.pdb" #not the full cage! structure = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename) backBoneAtomNames = "N","CA","C","0", "CB" tempBackbone = [0,0,0,0,0] Backbone = [] backboneNo = 0 for atom in structure.get_atoms(): if (atom.get_name() == backBoneAtomNames[backboneNo]) and (backboneNo < len(backBoneAtomNames)): tempBackbone[backboneNo] = atom backboneNo+=1 elif atom.get_name() != backBoneAtomNames[backboneNo]: backboneNo = 0 elif len(backBoneAtomNames) == backboneNo: Backbone.extend(tempBackbone) for a in tempBackbone: print a ------------------------------------------------------------------------------------------------ to identified the backbone, but unfortunately it does not work. Maybe exist already to identified backbone in Biopython? 
Thank you in advance Peter Cock wrote: >> Hi, >> >> I've got a couple of PDB examples on my personal website, and although >> they need a little update to use NumPy instead of Numeric, I think the >> page on doing protein contact maps would be very informative: >> http://www.warwick.ac.uk/go/peter_cock/python/protein_contact_map/ >> > > I've updated those pages to use NumPy instead of Numeric - all very > straight forward (apart from some issue with rpy for the graphics which > isn't relevant to Biopython): > > http://www.warwick.ac.uk/go/peter_cock/python/protein_contact_map/ > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > > Peter > > From p.j.a.cock at googlemail.com Thu Mar 19 09:31:30 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 19 Mar 2009 13:31:30 +0000 Subject: [BioPython] protein-ligand interactions In-Reply-To: <49C240AA.908@op.pl> References: <49C0B976.1020005@op.pl> <320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com> <320fb6e00903180536i5129c6ccu844bc193858321a4@mail.gmail.com> <49C240AA.908@op.pl> Message-ID: <320fb6e00903190631u756d9452qa6ca213ca3d8a9e0@mail.gmail.com> On Thu, Mar 19, 2009 at 12:55 PM, mitlox wrote: > I wrote this code: > ------------------------------------------------------------------------------------------------ > import Bio.PDB > import numpy > > pdb_code = "1E8W" > pdb_filename = "1E8W.pdb" #not the full cage! That comment was about the fact that the PDB file 1XI4 only contains part of the full clathrin cage. > structure = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename) > backBoneAtomNames = "N","CA","C","0", "CB" > ... > ------------------------------------------------------------------------------------------------ > to identified the backbone, but unfortunately it does not work. > > Maybe exist already to identified backbone in Biopython? I don't understand what you were trying to do. 
Have you read the Bio.PDB documentation about the hierarchy of
structures, models, chains, residues and atoms?
http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf

This is how I would solve the original question, finding the distance
from each C-alpha carbon to the closest atom in the ligand:

import Bio.PDB
import numpy

pdb_code = "1E8W"
pdb_filename = "1E8W.pdb"

structure = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename)
model = structure[0]
chainA = model["A"]

def residue_dist_to_ligand(protein_residue, ligand_residue) :
    """Returns distance from the protein C-alpha to the closest ligand atom."""
    distances = []
    for atom in ligand_residue :
        diff_vector = protein_residue["CA"].coord - atom.coord
        distances.append(numpy.sqrt(numpy.sum(diff_vector * diff_vector)))
    return min(distances)

#From looking at the PDB file, ligand is last residue in chain A, named QUE
ligand_res = chainA.child_list[-1]
assert ligand_res.resname == "QUE"

for protein_res in chainA.child_list[:-1] :
    dist = residue_dist_to_ligand(protein_res, ligand_res)
    if dist < 5.0 :
        print protein_res.resname, protein_res.id[1], dist

This gives the following output:

ILE 881 3.64203
VAL 882 3.58559
ALA 885 4.62673
THR 886 4.95211
ILE 963 4.64252
ASP 964 3.08788

If you wanted to, it should be simple to change this to find the closest
distance between any part of each residue and any part of the ligand,
which I expect should give some distances less than 3A.
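
[Editorial sketch, not part of the original message.] The "any atom to any
atom" variant mentioned above amounts to taking the minimum over all pairs of
coordinates. The function name min_atom_distance and the coordinates below
are made up for illustration (they are not taken from 1E8W), and plain tuples
are used so the example runs without Biopython or NumPy.

```python
import math


def min_atom_distance(coords_a, coords_b):
    """Smallest Euclidean distance between any point in coords_a and any in coords_b."""
    return min(
        math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
        for a in coords_a
        for b in coords_b
    )


# Made-up coordinates in Angstroms, standing in for atom positions.
residue_atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.0, 0.0)]
ligand_atoms = [(5.0, 1.0, 0.0), (2.0, 4.0, 0.0)]

closest = min_atom_distance(residue_atoms, ligand_atoms)
print(round(closest, 3))
```

With Bio.PDB one would presumably pass [atom.coord for atom in protein_res]
and [atom.coord for atom in ligand_res], since iterating over a residue
yields its atoms as in the function above.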
Peter From mitlox at op.pl Fri Mar 20 08:18:48 2009 From: mitlox at op.pl (mitlox) Date: Fri, 20 Mar 2009 22:18:48 +1000 Subject: [BioPython] protein-ligand interactions In-Reply-To: <320fb6e00903190631u756d9452qa6ca213ca3d8a9e0@mail.gmail.com> References: <49C0B976.1020005@op.pl> <320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com> <320fb6e00903180536i5129c6ccu844bc193858321a4@mail.gmail.com> <49C240AA.908@op.pl> <320fb6e00903190631u756d9452qa6ca213ca3d8a9e0@mail.gmail.com> Message-ID: <49C389A8.5090703@op.pl> Thank you very much for your code, it works and the output is exactly for what I was looking for. I try to get a structureCA object to write out the results in a PDB file (outCA.pdb) like this: ATOM 5275 CA ILE A 881 17.242 57.141 22.062 1.00 38.49 C ATOM 5283 CA VAL A 882 16.292 57.880 25.678 1.00 38.90 C .... And the second reason for a structureCA object is that I do not want use: structureCA = Bio.PDB.PDBParser().get_structure(outCA.pdb, outCA.pdb) Unfortunately I get this error with the extension: ILE 881 3.64203 VAL 882 3.58559 ALA 885 4.62673 THR 886 4.95211 ILE 963 4.64252 ASP 964 3.08788 Traceback (most recent call last): File "interaction.py", line 31, in ? 
io.save('out.pdb') File "/usr/lib/python2.4/site-packages/biopython-1.49-py2.4-linux-i686.egg/Bio/PDB/PDBIO.py", line 121, in save for model in self.structure.get_list(): AttributeError: 'list' object has no attribute 'get_list' Here is the code: import Bio.PDB import numpy pdb_code = "1E8W" pdb_filename = "1E8W.pdb" structure = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename) model = structure[0] chainA = model["A"] structureCA = [] def residue_dist_to_ligand(protein_residue, ligand_residue) : """Returns distance from the protein C-alpha to the closest ligand atom.""" distances = [] for atom in ligand_residue : diff_vector = protein_residue["CA"].coord - atom.coord distances.append(numpy.sqrt(numpy.sum(diff_vector * diff_vector))) return min(distances) #From looking at the PDB file, ligand is last residue in chain A, named QUE ligand_res = chainA.child_list[-1] assert ligand_res.resname == "QUE" for protein_res in chainA.child_list[:-1] : dist = residue_dist_to_ligand(protein_res, ligand_res) if dist < 5.0 : print protein_res.resname, protein_res.id[1], dist structureCA.append(protein_res) io=Bio.PDB.PDBIO() io.set_structure(structureCA) io.save('outCA.pdb') How can I get a structureCA object of the results? Thank you in advance. Best regards From p.j.a.cock at googlemail.com Fri Mar 20 09:36:47 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 20 Mar 2009 13:36:47 +0000 Subject: [BioPython] protein-ligand interactions In-Reply-To: <49C389A8.5090703@op.pl> References: <49C0B976.1020005@op.pl> <320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com> <320fb6e00903180536i5129c6ccu844bc193858321a4@mail.gmail.com> <49C240AA.908@op.pl> <320fb6e00903190631u756d9452qa6ca213ca3d8a9e0@mail.gmail.com> <49C389A8.5090703@op.pl> Message-ID: <320fb6e00903200636oe48bb71u4cc72bf385ac8e9b@mail.gmail.com> On Fri, Mar 20, 2009 at 12:18 PM, mitlox wrote: > Thank you very much for your code, it works and the output is exactly for > what I was looking for. 
>
> I try to get a structureCA object to write out the results in a PDB file
> (outCA.pdb) like this:
> ATOM   5275  CA  ILE A 881      17.242  57.141  22.062  1.00 38.49
> C
> ATOM   5283  CA  VAL A 882      16.292  57.880  25.678  1.00 38.90
> C ....
>
> Unfortunately I get this error with ... Here is the code:
> ...
> structureCA = []
> ...
> io=Bio.PDB.PDBIO()
> io.set_structure(structureCA)
> io.save('outCA.pdb')

Your structureCA object is just a python list, containing Residue objects.
Instead you need to create a new object with the partial chain - which
can be done by creating structure, model and chain objects manually.

However, I suggest you re-read pages 5 and 6 of the Bio.PDB
documentation for the recommended approach:
http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf
In your case, you'll want to write your own selection class using the
residue distance to the ligand. I recognise this might seem rather
complicated for a python novice as you have to create your own
class - so here is my solution:

import Bio.PDB
import numpy

pdb_code = "1E8W"
pdb_filename = "1E8W.pdb"

structure = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename)
model = structure[0]
chainA = model["A"]

def residue_dist_to_ligand(protein_residue, ligand_residue) :
    """Returns distance from the protein C-alpha to the closest ligand atom."""
    distances = []
    for atom in ligand_residue :
        diff_vector = protein_residue["CA"].coord - atom.coord
        distances.append(numpy.sqrt(numpy.sum(diff_vector * diff_vector)))
    return min(distances)

class NearLigandSelect(Bio.PDB.Select):
    def __init__(self, distance_threshold, ligand_residue) :
        self.threshold = distance_threshold
        self.ligand_res = ligand_residue

    def accept_residue(self, residue):
        if residue == self.ligand_res :
            return True #change this to False if you don't want the ligand
        else :
            dist = residue_dist_to_ligand(residue, self.ligand_res)
            return dist < self.threshold

io=Bio.PDB.PDBIO()
io.set_structure(structure)
#From looking at the PDB file, ligand is last residue in chain A
ligand_res = chainA.child_list[-1]
#Going to use a distance threshold of 4A
io.save("near_ligand.pdb", NearLigandSelect(4, ligand_res))
print "Done"

Peter

From mitlox at op.pl Fri Mar 20 19:45:56 2009
From: mitlox at op.pl (mitlox)
Date: Sat, 21 Mar 2009 09:45:56 +1000
Subject: [BioPython] protein-ligand interactions
In-Reply-To: <320fb6e00903200636oe48bb71u4cc72bf385ac8e9b@mail.gmail.com>
References: <49C0B976.1020005@op.pl>
	<320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com>
	<320fb6e00903180536i5129c6ccu844bc193858321a4@mail.gmail.com>
	<49C240AA.908@op.pl>
	<320fb6e00903190631u756d9452qa6ca213ca3d8a9e0@mail.gmail.com>
	<49C389A8.5090703@op.pl>
	<320fb6e00903200636oe48bb71u4cc72bf385ac8e9b@mail.gmail.com>
Message-ID: <49C42AB4.7050404@op.pl>

Thank you very much for your solution.

Additionally, it would be nice to have a structure object with the same
information as in "near_ligand.pdb", so that I do not need to read a new
pdb file again:
structureMOD = Bio.PDB.PDBParser().get_structure("near", "near_ligand.pdb").

Is it possible to have both a "near_ligand.pdb" file and the same
structure object?

Thank you in advance.

Best regards

Peter Cock wrote:
> Your structureCA object is just a python list, containing Residue objects.
> Instead you need to create a new object with the partial chain - which
> can be done by creating structure, model and chain objects manually.
>
> However, I suggest you re-read pages 5 and 6 of the Bio.PDB
> documentation for the recommended approach:
> http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf
> In your case, you'll want to write your own selection class using the
> residue distance to the ligand.
I recognise this might seem rather > complicated for a python novice as you have to create your own > class - so here is my solution: > > import Bio.PDB > import numpy > > pdb_code = "1E8W" > pdb_filename = "1E8W.pdb" > > structure = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename) > model = structure[0] > chainA = model["A"] > > def residue_dist_to_ligand(protein_residue, ligand_residue) : > """Returns distance from the protein C-alpha to the closest ligand atom.""" > distances = [] > for atom in ligand_residue : > diff_vector = protein_residue["CA"].coord - atom.coord > distances.append(numpy.sqrt(numpy.sum(diff_vector * diff_vector))) > return min(distances) > > class NearLigandSelect(Bio.PDB.Select): > def __init__(self, distance_threshold, ligand_residue) : > self.threshold = distance_threshold > self.ligand_res = ligand_residue > > def accept_residue(self, residue): > if residue == self.ligand_res : > return True #change this to False if you don't want the ligand > else : > dist = residue_dist_to_ligand(residue, self.ligand_res) > return dist < self.threshold > > io=Bio.PDB.PDBIO() > io.set_structure(structure) > #From looking at the PDB file, ligand is last residue in chain A > ligand_res = chainA.child_list[-1] > #Going to use a distance theshold of 4A > io.save("near_ligand.pdb", NearLigandSelect(4, ligand_res)) > print "Done" > > Peter > > From mjldehoon at yahoo.com Sat Mar 21 00:54:08 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 20 Mar 2009 21:54:08 -0700 (PDT) Subject: [BioPython] Bio.Enzyme (was: Re: [Biopython-dev] Bio.ExPASy) In-Reply-To: <76595.11423.qm@web62404.mail.re1.yahoo.com> Message-ID: <517737.76119.qm@web62403.mail.re1.yahoo.com> I've created a simplified version of the parser in Bio.Enzyme in Bio.ExPASy.Enzyme. The idea behind it is to collect all parsers related to ExPASy databases in Bio.ExPASy so that they can be found more easily by users. 
Bio.ExPASy.Enzyme works essentially the same as Bio.Enzyme, but I've done a few things a bit differently. The biggest change is probably that Bio.Enzyme stores information as attributes to a record, whereas Bio.ExPASy.Enzyme has a Record derived from a dictionary, and stores information in the dictionary (same as Bio.Medline). Does anybody have any objection if Bio.ExPASy.Enzyme becomes the "official" parser for ExPASy's Enzyme database? If not, I'll modify the documentation and tests accordingly, and start the deprecation process for Bio.Enzyme. --Michiel --- On Sun, 3/15/09, Michiel de Hoon wrote: > From: Michiel de Hoon > Subject: [Biopython-dev] Bio.ExPASy > To: biopython-dev at biopython.org > Date: Sunday, March 15, 2009, 6:24 AM > Hi everybody, > > As discussed previously, I have moved the Bio.Prosite code > to Bio.ExPASy, and I've added a ScanProsite module to > Bio.ExPASy. I guess Bio.Enzyme should also move to > Bio.ExPASy. See > > http://biopython.org/DIST/docs/tutorial/Tutorial.proposal.html > > for the documentation of Biopython as currently in CVS. > > --Michiel. > > > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From lueck at ipk-gatersleben.de Tue Mar 24 05:34:19 2009 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Tue, 24 Mar 2009 10:34:19 +0100 Subject: [BioPython] Emboss eprimer3 Message-ID: <007101c9ac63$b51ebc80$1022a8c0@ipkgatersleben.de> Hi! I have some questions about eprimer3 from Emboss which I use over Python to design primers in a batch mode: 1) I'm using the GCclamp function (value=1). Is it possible to limit the G or C's at the end to maximum of one G or C? 2) Is there a setting to get the original primer3 output? The emboss output is for hundrets of primers not very usefull and many informations are missing. 
The primer 3 file looks like this: PRIMER_SEQUENCE_ID=HF15E08r SEQUENCE=GCATGTAATAATGCCAAAGCTCACAGCTGCAGTTGAATCTTGGGACCCGCGGAGCGAGAATGTACCAATCCATGTATGGGTACACCCATGGCTGCCAACTCTAGGGCAAAGGATAGATACACTGTGCCACTCTATCCGGTACAAGCTGAGTAGTGTCCTCCAATTATGGCAAGCTCACGATTCATCAGCTTATGCTGTGCTATCTCCATGGAAGGGTGTATTTGATCCAGCAAGTTGGGAAGACTTGATAGTGCGTTATATCATTCCTAAACTGAAAATGGCACTCCAGGAGTTCCAGATTAACCCAGCAAGCCAAAAGTTTGACCAGTTTAACTGGGTTATGATCTGGGCTTCTGCTGTCCCGGTACACCATATGGTCCATATGTTGGAAGTTGATTTCTTTAGCAAGTGGCAGCTGGTTTTGTACCATTGGCTGAGCTCACCAAATCCTGATTTCAATGAGATAATGAATTGGTAT PRIMER_PRODUCT_SIZE_RANGE=500-1000 450-500 400-450 350-400 300-350 250-300 200-250 150-200 PRIMER_OPT_TM=60.0 PRIMER_MIN_TM=58.0 PRIMER_MAX_TM=65.0 PRIMER_MAX_DIFF_TM=3.0 PRIMER_DNA_CONC=420 PRIMER_NUM_RETURN=1 PRIMER_PAIR_PENALTY=0.8691 PRIMER_LEFT_PENALTY=0.708329 PRIMER_RIGHT_PENALTY=0.160746 PRIMER_LEFT_SEQUENCE=GCATGTAATAATGCCAAAGC PRIMER_RIGHT_SEQUENCE=TTGAAATCAGGATTTGGTGA PRIMER_LEFT=0,20 PRIMER_RIGHT=458,20 PRIMER_LEFT_TM=59.292 PRIMER_RIGHT_TM=60.161 PRIMER_LEFT_GC_PERCENT=40.000 PRIMER_RIGHT_GC_PERCENT=35.000 PRIMER_LEFT_SELF_ANY=7.00 PRIMER_RIGHT_SELF_ANY=8.00 PRIMER_LEFT_SELF_END=2.00 PRIMER_RIGHT_SELF_END=2.00 PRIMER_LEFT_END_STABILITY=8.5000 PRIMER_RIGHT_END_STABILITY=7.9000 PRIMER_PAIR_COMPL_ANY=5.00 PRIMER_PAIR_COMPL_END=3.00 PRIMER_PRODUCT_SIZE=459 Thanks in advance! Stefanie From biopython at maubp.freeserve.co.uk Tue Mar 24 06:00:46 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 24 Mar 2009 10:00:46 +0000 Subject: [BioPython] Emboss eprimer3 In-Reply-To: <007101c9ac63$b51ebc80$1022a8c0@ipkgatersleben.de> References: <007101c9ac63$b51ebc80$1022a8c0@ipkgatersleben.de> Message-ID: <320fb6e00903240300v171c9d71t7a3655938dd5e27f@mail.gmail.com> 2009/3/24 Stefanie L?ck : > Hi! > > I have some questions about eprimer3 from Emboss which I use over Python to design primers in a batch mode: > > 1) I'm using the GCclamp function (value=1). 
Is it possible to limit the G or C's at the end to maximum of one G or C? OK, you're using the gcclamp argument (i.e. GC clamp), which is supported by the Bio.Emboss.Applications wrapper. http://emboss.sourceforge.net/apps/release/6.0/emboss/apps/eprimer3.html I don't know if there is a primer3 argument for limiting the G or C's at the end - have you asked on the EMBOSS mailing list? > 2) Is there a setting to get the original primer3 output? The emboss output is for hundrets of primers not very usefull and many informations are missing. >From reading the documentation there is a "fformat1" argument which *might* do what you want - you could try this out on the command line and see. Note that this argument is not currently supported in the Bio.Emboss.Applications wrapper, but that would be easy to add. If this argument doesn't do what you want, you'd have to ask the EMBOSS people about alternative output formats. Alternatively, you might investigate the original Whitehead version of primer3. Note that if you do succeed in changing the output format, you may need a new parser to read it. Peter From mitlox at op.pl Tue Mar 24 07:12:36 2009 From: mitlox at op.pl (mitlox) Date: Tue, 24 Mar 2009 21:12:36 +1000 Subject: [BioPython] Superimposer Message-ID: <49C8C024.60403@op.pl> Hello, I read that the Superimposer works only with the two lists of atoms which contain the same amount of atoms. So I decided to use "Combinatorial Extension (CE)". This program returns a rotation matrix and a translation vector. 
After the execution of CE I took the matrix and vector and tried to use it with Superimposer: ------------------------------------------------------------------------------ import sys import numpy from Bio.PDB import * pdb_fix = "../files/1z9g.pdb" pdb_mov = "../files/1z9g90.pdb" p=PDBParser() s1=p.get_structure("FIXED", pdb_fix) fixed=Selection.unfold_entities(s1, "A") s2=p.get_structure("MOVING", pdb_mov) moving=Selection.unfold_entities(s2, "A") rot=numpy.identity(3).astype('f') tran=numpy.array((1.0, 2.0, 3.0), 'f') tran[0] = -0.99996603; tran[1] = -2.00002559; tran[2] = -2.99998285 rot[0][0] = 0.19411441; rot[0][1] = -0.85385353; rot[0][2] = 0.48296351 rot[1][0] = 0.94858827; rot[1][1] = 0.28884874; rot[1][2] = 0.12940907 rot[2][0] = -0.24999979; rot[2][1] = 0.43301335; rot[2][2] = 0.86602514 for atom in moving: atom.transform(rot, tran) sup=Superimposer() sup.set_atoms(fixed, moving) print sup.rotran print sup.rms sup.apply(moving) print "Saving aligned structure as PDB file %s" % pdb_mov io=PDBIO() io.set_structure(s2) io.save(pdb_mov) print "Done" ------------------------------------------------------------------------------ Unfortunalaty "print sup.rotran" returns this: (array([[ 0.19411383, 0.94858824, -0.25000035], [-0.85385389, 0.28884841, 0.43301285], [ 0.4829631 , 0.12940999, 0.86602523]]), array([-0.06470776, 1.91446435, 3.21412203])) but this matrix and vector are no the same like above. What do I wrong? Thank you in advance. Best regards, From biopython at maubp.freeserve.co.uk Tue Mar 24 07:43:05 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 24 Mar 2009 11:43:05 +0000 Subject: [BioPython] Superimposer In-Reply-To: <49C8C024.60403@op.pl> References: <49C8C024.60403@op.pl> Message-ID: <320fb6e00903240443o42bfe160vaf2cd0547a91578d@mail.gmail.com> On Tue, Mar 24, 2009 at 11:12 AM, mitlox wrote: > Hello, > I read that the Superimposer works only with the two lists of atoms which > contain the same amount of atoms. 
> > So I decided to use "Combinatorial Extension (CE)". This program returns a > rotation matrix and a translation vector. > > After the execution of CE I took the matrix and vector and tried to use it > with Superimposer: Why? Once you know the transformation, why do you need to try and recreate it with the superimposer? Are you just doing this as a check? > ------------------------------------------------------------------------------ > import sys > import numpy > from Bio.PDB import * > > > pdb_fix = "../files/1z9g.pdb" > pdb_mov = "../files/1z9g90.pdb" > p=PDBParser() > s1=p.get_structure("FIXED", pdb_fix) > fixed=Selection.unfold_entities(s1, "A") > > s2=p.get_structure("MOVING", pdb_mov) > moving=Selection.unfold_entities(s2, "A") You should be loading in the ORGINAL pdb file here, as the moved one won't exist yet, and if it did, you'd apply the transformation twice. Note you should expect slight differences due to floating point calculations. Your input was: array([[ 0.19411442, -0.85385352, 0.4829635 ], [ 0.94858825, 0.28884873, 0.12940907], [-0.24999979, 0.43301335, 0.86602515]], dtype=float32) array([-0.99996603, -2.00002551, -2.99998283], dtype=float32), The output was: array([[ 0.19411439, 0.94858827, -0.24999978], [-0.85385353, 0.28884871, 0.43301335], [ 0.4829635 , 0.12940907, 0.86602514]]), array([-0.06473777, 1.91448618, 3.21410633]) The rotation looks transposed (backwards). The translation does look different... however, if you switch this line: sup.set_atoms(fixed, moving) to: sup.set_atoms(moving, fixed) then things agree. I suspect something is flipped in the logic of your script regarding the frames of reference. Also, at the end you do sup.apply(moving), but you have already manually moved these atoms, so won't your PDB file have them moved twice? 
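
[Editorial sketch, not part of the original message.] The "transposed"
rotation and the different-looking translation are what one would expect if
the Superimposer output is the inverse of the input transformation. Assuming
the right-multiplication convention new = coord . rot + tran (which, as far
as I know, is what Atom.transform uses), the inverse of (rot, tran) is
(rot^T, -tran . rot^T). The helper names apply_transform and
invert_transform are hypothetical; the numbers are the ones from the script
above.

```python
def apply_transform(point, rot, tran):
    """Right-multiply convention: new = point . rot + tran."""
    return [sum(point[i] * rot[i][j] for i in range(3)) + tran[j] for j in range(3)]


def invert_transform(rot, tran):
    """Inverse of (rot, tran) for a pure rotation: (rot^T, -tran . rot^T)."""
    rot_t = [[rot[j][i] for j in range(3)] for i in range(3)]
    inv_tran = [-sum(tran[i] * rot_t[i][j] for i in range(3)) for j in range(3)]
    return rot_t, inv_tran


rot = [[0.19411441, -0.85385353, 0.48296351],
       [0.94858827, 0.28884874, 0.12940907],
       [-0.24999979, 0.43301335, 0.86602514]]
tran = [-0.99996603, -2.00002559, -2.99998285]

inv_rot, inv_tran = invert_transform(rot, tran)
print([round(x, 4) for x in inv_tran])

# Round trip: applying the transform and then its inverse returns the start point.
start = [1.0, 2.0, 3.0]
back = apply_transform(apply_transform(start, rot, tran), inv_rot, inv_tran)
```

The inverse translation computed this way comes out close to the vector
printed by sup.rotran, which supports the reading that the two frames of
reference are swapped rather than that anything is numerically wrong.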
Peter From mitlox at op.pl Tue Mar 24 08:18:32 2009 From: mitlox at op.pl (mitlox) Date: Tue, 24 Mar 2009 22:18:32 +1000 Subject: [BioPython] Superimposer In-Reply-To: <320fb6e00903240443o42bfe160vaf2cd0547a91578d@mail.gmail.com> References: <49C8C024.60403@op.pl> <320fb6e00903240443o42bfe160vaf2cd0547a91578d@mail.gmail.com> Message-ID: <49C8CF98.30809@op.pl> Thank you for you email. I would like only rotate and translate a pdb file that I can see the result in a pdb viewer. Maybe I do not need the Superimposer object to rotate and translate a pdb file with known rotation matrix and translation vector? Do you know how could I rotate and translate a pdb file? Thank you in advance. Peter wrote: > On Tue, Mar 24, 2009 at 11:12 AM, mitlox wrote: > >> Hello, >> I read that the Superimposer works only with the two lists of atoms which >> contain the same amount of atoms. >> >> So I decided to use "Combinatorial Extension (CE)". This program returns a >> rotation matrix and a translation vector. >> >> After the execution of CE I took the matrix and vector and tried to use it >> with Superimposer: >> > > Why? Once you know the transformation, why do you need to try and > recreate it with the superimposer? Are you just doing this as a check? > > >> ------------------------------------------------------------------------------ >> import sys >> import numpy >> from Bio.PDB import * >> >> >> pdb_fix = "../files/1z9g.pdb" >> pdb_mov = "../files/1z9g90.pdb" >> p=PDBParser() >> s1=p.get_structure("FIXED", pdb_fix) >> fixed=Selection.unfold_entities(s1, "A") >> >> s2=p.get_structure("MOVING", pdb_mov) >> moving=Selection.unfold_entities(s2, "A") >> > > You should be loading in the ORGINAL pdb file here, as the moved one > won't exist yet, and if it did, you'd apply the transformation twice. > > Note you should expect slight differences due to floating point > calculations. 
Your input was: > > array([[ 0.19411442, -0.85385352, 0.4829635 ], > [ 0.94858825, 0.28884873, 0.12940907], > [-0.24999979, 0.43301335, 0.86602515]], dtype=float32) > array([-0.99996603, -2.00002551, -2.99998283], dtype=float32), > > The output was: > > array([[ 0.19411439, 0.94858827, -0.24999978], > [-0.85385353, 0.28884871, 0.43301335], > [ 0.4829635 , 0.12940907, 0.86602514]]), > array([-0.06473777, 1.91448618, 3.21410633]) > > The rotation looks transposed (backwards). The translation does look > different... however, if you switch this line: > sup.set_atoms(fixed, moving) > to: > sup.set_atoms(moving, fixed) > then things agree. I suspect something is flipped in the logic of > your script regarding the frames of reference. > > Also, at the end you do sup.apply(moving), but you have already > manually moved these atoms, so won't your PDB file have them moved > twice? > > Peter > > From biopython at maubp.freeserve.co.uk Tue Mar 24 08:41:53 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 24 Mar 2009 12:41:53 +0000 Subject: [BioPython] Superimposer In-Reply-To: <49C8CF98.30809@op.pl> References: <49C8C024.60403@op.pl> <320fb6e00903240443o42bfe160vaf2cd0547a91578d@mail.gmail.com> <49C8CF98.30809@op.pl> Message-ID: <320fb6e00903240541p5fa8e043wc3363b18b34af37b@mail.gmail.com> On Tue, Mar 24, 2009 at 12:18 PM, mitlox wrote: > Thank you for you email. ?I would like only rotate and translate a pdb file > that I can see the result in a pdb viewer. I see. > Maybe I do not need the Superimposer object to rotate and translate a pdb > file with known rotation matrix and translation vector? Correct. > Do you know how could I rotate and translate a pdb file? You've got most of the steps already. 
This is my suggestion:

import numpy
from Bio import PDB

pdb_fix = "1z9g.pdb"
pdb_mov = "1z9g_moved.pdb"

structure = PDB.PDBParser().get_structure("FIXED", pdb_fix)

rot=numpy.identity(3).astype('f')
tran=numpy.array((-0.99996603, -2.00002559, -2.99998285))
rot=numpy.array(((+0.19411441, -0.85385353, +0.48296351),
                 (+0.94858827, +0.28884874, +0.12940907),
                 (-0.24999979, +0.43301335, +0.86602514)))

print "Applying transformation..."
for atom in structure.get_atoms() :
    atom.transform(rot, tran)

print "Saving transformed structure as PDB file %s" % pdb_mov
io=PDB.PDBIO()
io.set_structure(structure)
io.save(pdb_mov)

print "Done"

NOTE - When giving a translation mapping as a translation vector and
a rotation matrix there is some ambiguity about which order to apply
them in. If the results using Bio.PDB don't match what you expect,
you may want to double check this first.
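
[Editorial sketch, not part of the original message.] The ordering ambiguity
in the note above is easy to see with a toy case: rotating and then
translating a point generally gives a different answer from translating and
then rotating it. The matrix, vector and helper names below are illustrative
only; the row-vector (right-multiply) form is assumed here because that is,
to my knowledge, how Atom.transform applies rot before adding tran.

```python
def rotate(point, rot):
    """Right-multiply a row vector by a 3x3 matrix."""
    return [sum(point[i] * rot[i][j] for i in range(3)) for j in range(3)]


def translate(point, tran):
    """Add a translation vector to a point."""
    return [p + t for p, t in zip(point, tran)]


# 90 degree rotation about z in row-vector form (illustrative values only).
rot = [[0.0, 1.0, 0.0],
       [-1.0, 0.0, 0.0],
       [0.0, 0.0, 1.0]]
tran = [1.0, 0.0, 0.0]
point = [1.0, 0.0, 0.0]

rotate_first = translate(rotate(point, rot), tran)
translate_first = rotate(translate(point, tran), rot)

print(rotate_first)
print(translate_first)
```

If another program reports its matrix for column vectors, or applies the
translation before the rotation, its numbers would need converting before
being passed to atom.transform.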
Peter, Not sure if this will be a problem for the BioPython wrapper for primer3, but the latest Primer3 version on Sourceforge (v2.0.0a) radically changes the various input parameters. I had to rewrite a bunch of code to handle those as well as older (v1) primer3 params. > Note that if you do succeed in changing the output format, you may > need a new parser to read it. > > Peter primer3 input and output is BoulderIO (which I think is an essentially obsolete format Lincoln Stein wrote up many years ago). It's very easy to parse, just simple key-value pairings. chris From nir at rosettadesigngroup.com Wed Mar 25 12:18:24 2009 From: nir at rosettadesigngroup.com (Nir London) Date: Wed, 25 Mar 2009 18:18:24 +0200 Subject: [BioPython] Rosetta Academic Training Webinar Message-ID: <88F0F36A-FC4D-4A9C-AC31-5B883C3F92CB@rosettadesigngroup.com> The Rosetta Design Group is proud to present the first webinar in the Rosetta Academic Workshop Series. For the first webinar, we have selected to focus on Protein-Protein Docking based on the answers to the interest poll. We hope this will be the first in a line of helpful and inspiring webinars to kick-off our Rosetta Academic Workshop Series. What: Protein-Protein Docking When: May 4th 2009, 0800-1000 AM EST Where: Your office! Click here for more details and registration (For non html emails: http://rosettadesigngroup.com/RDGLS/index.php?sid=54479&lang=en ) Pleas note: This is not a promotional webinar. Rosetta is open-source and freeware for academic and non-profit organizations and can be downloaded here from University of Washington's TechTransfer Digital Ventures. The majority of the webinar is concerned with Rosetta 2.3.0. Rosetta 3.0 is still a beta version. Hope to see you there, Nir London. 
Rosetta Design Group | http://rosettadesigngroup.com/ From biopython.chen at gmail.com Wed Mar 25 22:59:04 2009 From: biopython.chen at gmail.com (chen Ku) Date: Wed, 25 Mar 2009 19:59:04 -0700 Subject: [BioPython] how to retrieve data from PDB Message-ID: <4c2163890903251959u4bc7d6edjd6937de715747c33@mail.gmail.com> Dear all, I need your help in writing code to retrieve some of the pdb structures. Problem definition I just want to use some PDB file not all 50,000. > I want to apply one python code so that I can know transcription factor binding to DNA only out of all pdb data. So please guide me how to proceed for this.I raed some published article on this dataset and just want to do by python and not by manually.This is one of our course work in structural biology so trying by my own and taking some help of you all. I need a general code where I can check this kind of things by changing field name.Any help will be grateful for me as I am a beginner in python. Regards Chen From lueck at ipk-gatersleben.de Thu Mar 26 05:42:42 2009 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Thu, 26 Mar 2009 10:42:42 +0100 Subject: [BioPython] Emboss eprimer3 References: <007101c9ac63$b51ebc80$1022a8c0@ipkgatersleben.de> <320fb6e00903240300v171c9d71t7a3655938dd5e27f@mail.gmail.com> Message-ID: <005201c9adf7$357d47e0$1022a8c0@ipkgatersleben.de> Hi! I got a patch to add a '-originalformat' argument. If someone is interested too, I could send it to him or the mailing list. >>>Note that if you do succeed in changing the output format, you may need a >>>new parser to read it. This is no problem. I just need the data ;-) >>> I don't know if there is a primer3 argument for limiting the G or C's at >>> the end - have you asked on the EMBOSS mailing list? Yes, no answer yet. 
Kind regards Stefanie ----- Original Message ----- From: "Peter" To: "Stefanie L?ck" Cc: Sent: Tuesday, March 24, 2009 11:00 AM Subject: Re: [BioPython] Emboss eprimer3 2009/3/24 Stefanie L?ck : > Hi! > > I have some questions about eprimer3 from Emboss which I use over Python > to design primers in a batch mode: > > 1) I'm using the GCclamp function (value=1). Is it possible to limit the G > or C's at the end to maximum of one G or C? OK, you're using the gcclamp argument (i.e. GC clamp), which is supported by the Bio.Emboss.Applications wrapper. http://emboss.sourceforge.net/apps/release/6.0/emboss/apps/eprimer3.html I don't know if there is a primer3 argument for limiting the G or C's at the end - have you asked on the EMBOSS mailing list? > 2) Is there a setting to get the original primer3 output? The emboss > output is for hundrets of primers not very usefull and many informations > are missing. >From reading the documentation there is a "fformat1" argument which *might* do what you want - you could try this out on the command line and see. Note that this argument is not currently supported in the Bio.Emboss.Applications wrapper, but that would be easy to add. If this argument doesn't do what you want, you'd have to ask the EMBOSS people about alternative output formats. Alternatively, you might investigate the original Whitehead version of primer3. Note that if you do succeed in changing the output format, you may need a new parser to read it. 
Peter

From biopython at maubp.freeserve.co.uk Thu Mar 26 06:23:01 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 26 Mar 2009 10:23:01 +0000
Subject: [BioPython] Emboss eprimer3
In-Reply-To: <005201c9adf7$357d47e0$1022a8c0@ipkgatersleben.de>
References: <007101c9ac63$b51ebc80$1022a8c0@ipkgatersleben.de> <320fb6e00903240300v171c9d71t7a3655938dd5e27f@mail.gmail.com> <005201c9adf7$357d47e0$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00903260323p50f80c50w1ab07c8892518190@mail.gmail.com>

On Thu, Mar 26, 2009 at 9:42 AM, Stefanie Lück wrote:
> Hi!
>
> I got a patch to add a '-originalformat' argument. If someone is interested
> too, I could send it to him or the mailing list.

Could you file a bug on bugzilla please, and then (after the bug is filed) you can attach the patch. I'll look at this (if Brad doesn't first) - if you can also include a short example that would be excellent.

Thank you,

Peter

From biopython at maubp.freeserve.co.uk Thu Mar 26 07:04:29 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 26 Mar 2009 11:04:29 +0000
Subject: [BioPython] how to retrieve data from PDB
In-Reply-To: <4c2163890903251959u4bc7d6edjd6937de715747c33@mail.gmail.com>
References: <4c2163890903251959u4bc7d6edjd6937de715747c33@mail.gmail.com>
Message-ID: <320fb6e00903260404y7f5d5606jc46b4d3e87eeb9bb@mail.gmail.com>

On Thu, Mar 26, 2009 at 2:59 AM, chen Ku wrote:
> Dear all,
> I need your help in writing code to retrieve some of the pdb
> structures.
>
> Problem definition
> I just want to use some PDB file not all 50,000.
>
>> I want to apply one python code so that I can know transcription factor
> binding to DNA only out of all pdb data. So please guide me how to proceed
> for this.
According to the website, there are about 2250 protein structures in complex with nucleotides - and I assume some of these are for transcription factors with DNA:
http://www.pdb.org/pdb/statistics/contentGrowthChart.do?content=molType-protein-nucleic-complex&seqid=100

I assume you'll want to search the PDB for entries which are transcription factors binding to DNA, but I don't know enough about the PDB search options to advise you.

Peter

From jblanca at btc.upv.es Thu Mar 26 07:48:02 2009
From: jblanca at btc.upv.es (Jose Blanca)
Date: Thu, 26 Mar 2009 12:48:02 +0100
Subject: [BioPython] about the SeqRecord slicing
Message-ID: <200903261248.02279.jblanca@btc.upv.es>

Hi:
I'm working with the SeqRecord slicing from cvs and I think that the behaviour could be slightly changed. In fact that same opinion is written in the __getitem__ method:

    if isinstance(index, int) :
        #NOTE - The sequence level annotation like the id, name, etc
        #do not really apply to a single character.  However, should
        #we try and expose any per-letter-annotation here?  If so how?
        return self.seq[index]

I don't like the fact that the SeqRecord returns different classes depending on the index type. I think it is better to always return a SeqRecord because:
- It simplifies the interface. It's easier to deal with the SeqRecord class if its behaviour is simple. Otherwise the code that uses the SeqRecord has to check whether it got back a str or a SeqRecord.
- It loses the per-letter-annotation. I'm working with qualities and I'm interested in keeping them.
- It's redundant, because if we want to index the seq property we can do it with: seqrec.seq[index]

Best regards,

--
Jose M.
Blanca Postigo
Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)

From biopython at maubp.freeserve.co.uk Thu Mar 26 08:05:25 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 26 Mar 2009 12:05:25 +0000
Subject: [BioPython] about the SeqRecord slicing
In-Reply-To: <200903261248.02279.jblanca@btc.upv.es>
References: <200903261248.02279.jblanca@btc.upv.es>
Message-ID: <320fb6e00903260505j387279b7kfa4c69c33efe5487@mail.gmail.com>

On Thu, Mar 26, 2009 at 11:48 AM, Jose Blanca wrote:
> Hi:
> I'm working with the SeqRecord slicing from cvs and I think that the behaviour
> could be sligthly changed. In fact that same opinion is written in the
> __getitem__ method:
>
>     if isinstance(index, int) :
>         #NOTE - The sequence level annotation like the id, name, etc
>         #do not really apply to a single character.  However, should
>         #we try and expose any per-letter-annotation here?  If so how?
>         return self.seq[index]
>
> I don't like the fact that the SeqRecord returns different classes depending
> on the index type. I think is better to return always a SeqRecord because:
> - It simplifies the interface. It's easier to deal with the SeqRecord class if
> its behaviour is simple. Otherwise we have to check in the code that uses the
> SeqRecord if it's returning an str or a SeqRecord.
> - It looses the per-letter-annotation. I'm working with qualities and I'm
> interested in keeping them.
> - It's redundant because if we want to slice the seq property we can do it
> with: seqrec.seq[index]
> Best regards,

Hi Jose,

As we are talking about the CVS code, maybe this could have been on the dev mailing list, but as it's of general interest let's carry on here for now.
You note that (currently in CVS) the new SeqRecord slicing returns a SeqRecord for a slice, but a single letter string for a single integer index. This isn't so different from the Seq object - it returns a new Seq object for a slice, but a single letter string for a single integer index:

>>> from Bio.Seq import Seq
>>> s = Seq("ACGT")
>>> s
Seq('ACGT', Alphabet())
>>> s[0]
'A'
>>> s[0:3]
Seq('ACG', Alphabet())

More generally, consider lists in Python:

>>> x = [1,2,3,4,5]
>>> x[0]
1
>>> x[0:3]
[1, 2, 3]

So I don't agree with this expectation that slicing and indexing a SeqRecord should automatically both give a SeqRecord. Do you really want a SeqRecord for a single character string?

Can you give me an example of where you want to pull out a single character from a SeqRecord, and its quality? I would consider things like this quite elegant:

    for letter, quality in zip(record.seq, record.letter_annotations["phred_quality"]):
        #do stuff

Peter

From chapmanb at 50mail.com Thu Mar 26 08:40:45 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 26 Mar 2009 08:40:45 -0400
Subject: [BioPython] Emboss eprimer3
In-Reply-To: <005201c9adf7$357d47e0$1022a8c0@ipkgatersleben.de>
References: <007101c9ac63$b51ebc80$1022a8c0@ipkgatersleben.de> <320fb6e00903240300v171c9d71t7a3655938dd5e27f@mail.gmail.com> <005201c9adf7$357d47e0$1022a8c0@ipkgatersleben.de>
Message-ID: <20090326124045.GD21577@sobchak.mgh.harvard.edu>

Hi all;

Stefanie:
> I got a patch to add a '-originalformat' argument. If someone is interested
> too, I could send it to him or the mailing list.

Is this a patch to EMBOSS itself? If so, did the developers indicate it would be in future versions of EMBOSS? If that's the case, we can easily add this option to the commandline interface. You need a:

    _Option(["-originalformat"], ["input"], None, 0),

line in Bio.Emboss.Applications.Primer3Commandline.

> >>>Note that if you do succeed in changing the output format, you may need a
> >>>new parser to read it.
> > This is no problem. I just need the data ;-) Out of curiosity, what parameter did you find useful from that output that is not in the eprimer3 format output? > >>> I don't know if there is a primer3 argument for limiting the G or C's at > >>> the end - have you asked on the EMBOSS mailing list? > > Yes, no answer yet. What I do in cases like this is ask for more primers (-numreturn) and then post-parse them to pull out the ones that satisfy my additional criteria. The output is ordered by primer3's ranking, so the first one that passes the criteria would move on. If none are satisfactory, then you can also build in a logic to decide if any are good enough for your use (for example, 2 G/Cs at the end) and pick one from this remaining group with less stringency. Brad > > Kind regards > Stefanie > > > > ----- Original Message ----- > From: "Peter" > To: "Stefanie L?ck" > Cc: > Sent: Tuesday, March 24, 2009 11:00 AM > Subject: Re: [BioPython] Emboss eprimer3 > > > 2009/3/24 Stefanie L?ck : > > Hi! > > > > I have some questions about eprimer3 from Emboss which I use over Python > > to design primers in a batch mode: > > > > 1) I'm using the GCclamp function (value=1). Is it possible to limit the G > > or C's at the end to maximum of one G or C? > > OK, you're using the gcclamp argument (i.e. GC clamp), which is > supported by the Bio.Emboss.Applications wrapper. > http://emboss.sourceforge.net/apps/release/6.0/emboss/apps/eprimer3.html > > I don't know if there is a primer3 argument for limiting the G or C's > at the end - have you asked on the EMBOSS mailing list? > > > 2) Is there a setting to get the original primer3 output? The emboss > > output is for hundrets of primers not very usefull and many informations > > are missing. > > >From reading the documentation there is a "fformat1" argument which > *might* do what you want - you could try this out on the command line > and see. 
Note that this argument is not currently supported in the > Bio.Emboss.Applications wrapper, but that would be easy to add. If > this argument doesn't do what you want, you'd have to ask the EMBOSS > people about alternative output formats. Alternatively, you might > investigate the original Whitehead version of primer3. > > Note that if you do succeed in changing the output format, you may > need a new parser to read it. > > Peter > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Fri Mar 27 08:18:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 27 Mar 2009 12:18:04 +0000 Subject: [BioPython] how to retrieve data from PDB In-Reply-To: <4c2163890903261953k2f73613cvdc5d4bb497474f43@mail.gmail.com> References: <4c2163890903251959u4bc7d6edjd6937de715747c33@mail.gmail.com> <320fb6e00903260404y7f5d5606jc46b4d3e87eeb9bb@mail.gmail.com> <4c2163890903261953k2f73613cvdc5d4bb497474f43@mail.gmail.com> Message-ID: <320fb6e00903270518g4eb5150pc1ae6de65da1a72c@mail.gmail.com> On Fri, Mar 27, 2009 at 2:53 AM, chen Ku wrote: > Thank you so much for the guidance but I need the coding part in python to > retrieve the data. > > Any help will be helpful for me. Have a look at the Bio.PDB.PDBList module in Biopython - this may do what you want. Peter From p.j.a.cock at googlemail.com Fri Mar 27 13:31:55 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 27 Mar 2009 17:31:55 +0000 Subject: [BioPython] Biopython application note published Message-ID: <320fb6e00903271031k2bd31464k8aaa075f8de39c82@mail.gmail.com> Dear all, An Application Note describing Biopython has recently been accepted for publication in the Oxford Journal Bioinformatics. An advance copy of the Open Access article is available online: P.J.A. Cock, T. Antao, J.T. Chang, B.A. Chapman, C.J. Cox, A. Dalke, I. Friedberg, T. Hamelryck, F. 
Kauff, B. Wilczynski and M.J.L. de Hoon (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, doi:10.1093/bioinformatics/btp163 http://dx.doi.org/10.1093/bioinformatics/btp163 This was announced at the start of the week on our news page (to which you can subscribe using the RSS or Atom feeds), but was worth repeating for the mailing lists. See http://news.open-bio.org/news/2009/03/biopython-paper-published/ Peter From biopython at maubp.freeserve.co.uk Tue Mar 31 06:08:08 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 31 Mar 2009 11:08:08 +0100 Subject: [BioPython] how to retrieve data from PDB In-Reply-To: <4c2163890903310245oda7390bm829aee6f4f369478@mail.gmail.com> References: <4c2163890903251959u4bc7d6edjd6937de715747c33@mail.gmail.com> <320fb6e00903260404y7f5d5606jc46b4d3e87eeb9bb@mail.gmail.com> <4c2163890903261953k2f73613cvdc5d4bb497474f43@mail.gmail.com> <320fb6e00903270518g4eb5150pc1ae6de65da1a72c@mail.gmail.com> <4c2163890903310245oda7390bm829aee6f4f369478@mail.gmail.com> Message-ID: <320fb6e00903310308q38168dbfx447c78c6da5454ee@mail.gmail.com> On Tue, Mar 31, 2009 at 10:45 AM, chen Ku wrote: > Dear peter, > ????????????????? thanks for the idea.I think I need to download all the pdb > files first and then can use command on python mode. Can you please write > one syntax to start with or give me the practical documentation so that I > can try out and play with this PDBList. 
Hi Chen, To learn about the PDBList functionality, see page 4 of "The Biopython Structural Bioinformatics FAQ" - this has some examples: http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf You can also read about PDBList from the built in help, >>> from Bio import PDB >>> help(PDB.PDBList) Or online at http://biopython.org/DIST/docs/api/Bio.PDB.PDBList%27.PDBList-class.html If you really do want to download all 56,000+ PDB files (and I don't think this is a good idea), instead of using Python, you might also consider using the command line tool rsync, see: http://www.pdb.org/pdb/general_information/news_publications/newsletters/2003q3/focus_rsync.html However, as I said before, you only want transcription factors with DNA, so at most you'll need to download the 2250 protein structures in complex with nucleotides. I strongly urge you to find out more about searching the PDB in order to get a list of just the few PDB reference codes that you'll actually need - and download just those. Peter From hermifi at yahoo.com Tue Mar 31 23:56:22 2009 From: hermifi at yahoo.com (Hermella Woldemdihin) Date: Tue, 31 Mar 2009 20:56:22 -0700 (PDT) Subject: [BioPython] HELP! Message-ID: <513066.92437.qm@web111011.mail.gq1.yahoo.com> Hi everyone, I am trying to write a bio-python script that uses SwissProt accession numbers to download a sequence objects and then run remote blast with the sequences. Then download good hit sequences listed in Blast results and print their sequences.I am using a Windows based system with bio-python 2.5, if someone could help me out I would really appreciate it with some sample code or something. I just started learning python and have tried to follow the documentation and cookbook without much success, my programming experience is virtually non-existent. Thanks. 
Hermi

From rodrigo_faccioli at uol.com.br Thu Mar 5 04:04:07 2009
From: rodrigo_faccioli at uol.com.br (Rodrigo faccioli)
Date: Thu, 5 Mar 2009 01:04:07 -0300
Subject: [BioPython] Bio.Entez - Help
Message-ID: <3715adb70903042004h27ac6f03oeb384d3c89777226@mail.gmail.com>

I want to know where I can find examples about Bio.Entez. Specifically, I'm developing a program which has a protein primary sequence and I need to search its conserved domain and read it to show for user.

I'm reading this link http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc64 . However, I'm not understanding very well. I know that I will work with CDD database.

I made a simple example which is below.
from Bio import Entrez
Entrez.email = "rodrigo.faccioli at gmail.com" # Always tell NCBI who you are
handle = Entrez.esearch(db="cdd",
                        term="TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN")
record = Entrez.read(handle)
print record["IdList"]

Thanks for any help.

--
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structure Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-9366 Ext 229
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218

From biopython at maubp.freeserve.co.uk Thu Mar 5 10:42:02 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 5 Mar 2009 10:42:02 +0000
Subject: [BioPython] Bio.Entez - Help
In-Reply-To: <3715adb70903042004h27ac6f03oeb384d3c89777226@mail.gmail.com>
References: <3715adb70903042004h27ac6f03oeb384d3c89777226@mail.gmail.com>
Message-ID: <320fb6e00903050242v63a2f38cgc6eddfa3819814e4@mail.gmail.com>

On Thu, Mar 5, 2009 at 4:04 AM, Rodrigo faccioli wrote:
> I want to know where I can find examples about Bio.Entez. Specifically, I'm
> developing a program which has a protein primary sequence and I need to
> search its conserved domain and read it to show for user.
>
> I'm reading this link
> http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc64 . However,
> I'm not understanding very well. I know that I will work with CDD database.

The CDD database is one of several protein motif databases the NCBI make available for use with their tool RPS-BLAST. CDD is a composite database which includes domains from PFAM, SMART, KOG etc.

Have a look at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml with your example and you'll get a hit to pfam00321.

It sounds like what you want is a script which runs RPS-BLAST using your query protein against the CDD motif database.
You can run BLASTN, BLASTP etc online at the NCBI using a script, but as far as I know, the NCBI do not make RPS-BLAST (or PSI-BLAST) available in this way. I haven't checked this in recent months.

However, I have done this task myself using standalone BLAST installed on my computer, i.e. the tool rpsblast from the NCBI. You'll also need to install the databases (which are big - you'll need plenty of disk space and RAM). Once this is installed and working, you can call rpsblast from Biopython using the Bio.Blast.NCBIStandalone.rpsblast(...) function.

> I made a simple example which is below.
>
> from Bio import Entrez
> Entrez.email = "rodrigo.faccioli at gmail.com" # Always tell NCBI who you are
> handle = Entrez.esearch(db="cdd",
> term="TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN")
> record = Entrez.read(handle)
> print record["IdList"]
>
> Thanks for any helps.

I think if you use Entrez to access the CDD database, you can just access the domains themselves (using their names - not searching by sequence), e.g.

>>> from Bio import Entrez
>>> Entrez.email = "Your.Name.Here at example.com"
>>> handle = Entrez.esearch(db="cdd", term="pfam00321", retmode="XML")
>>> record = Entrez.read(handle)
>>> print record["IdList"]
['109381']

You can check this ID works via their website:
http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=109381

I've tried a few variations but efetch doesn't seem to support the CDD database (yet).
Peter

From biopython at maubp.freeserve.co.uk Thu Mar 5 12:26:13 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 5 Mar 2009 12:26:13 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
Message-ID: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com>

Hi All,

As the following examples show, and the python string method's docstring clearly states, the python string's count method uses a non-overlapping search:

>>> "AAA".count("A")
3
>>> "AAA".count("AA") # you might expect 2
1
>>> "BBBB".count("BB") # you might expect 3
2

Up until Biopython 1.44, the Seq object's count method only worked for single characters. From Biopython 1.45 onwards it accepted longer strings and followed the built in python string count behaviour. However, as Noel pointed out on Bug 2779 our docstring does not make it clear that this does a non-overlapping search. In fact, as Leighton suggests, one might expect the Seq object's count method to use an overlapping search.
http://bugzilla.open-bio.org/show_bug.cgi?id=2779

We should either:

(a) stick with the python string compatible behaviour (which has been a general principle for the Seq class), but document this issue more clearly as a non-overlapping search does run counter to some potential biological uses.

or,

(b) change the behaviour as Leighton suggests to do an overlapping search. This could break any code relying on the old python string-like behaviour.

What do people here think? Any preferences?

[I don't want to get into details about the implementation here on the main list]

Peter

From baoilleach at gmail.com Thu Mar 5 13:11:31 2009
From: baoilleach at gmail.com (Noel O'Boyle)
Date: Thu, 5 Mar 2009 13:11:31 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
Message-ID:

+1 for (b)

Seq.count() should behave like a biological sequence.
Here's an example in the wild of this type of analysis:
http://www.computational-genomics.net/case_studies/haemophilus_demo.html#14

It's from a bioinformatics textbook with example code in Matlab. I was helping a colleague who was trying to reproduce the analysis with BioPython. Everything was fine until the dimer frequencies were found to disagree. After implementing the count ourselves, we were able to reproduce the results. It was then we realised that BioPython was behaving in an unexpected and non-useful way.

- Noel

From biopython at maubp.freeserve.co.uk Thu Mar 5 13:26:10 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 5 Mar 2009 13:26:10 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To:
References:
Message-ID: <320fb6e00903050526r688eadcfv440602c32d294ee8@mail.gmail.com>

On Thu, Mar 5, 2009 at 1:11 PM, Noel O'Boyle wrote:
> +1 for (b)
>
> Seq.count() should behave like a biological sequence.
>
> Here's an example in the wild of this type of analysis:
> http://www.computational-genomics.net/case_studies/haemophilus_demo.html#14
>
> It's from a bioinformatics textbook with example code in Matlab. I was
> helping a colleague who was trying to reproduce the analysis with
> BioPython. Everything was fine until the dimer frequencies were found
> to disagree. After implementing the count ourselves, we were able to
> reproduce the results. It was then we realised that BioPython was
> behaving in an unexpected and non-useful way.

I agree that in this context it is not useful to have the Seq object count do a non-overlapping search. However, calling it "unexpected" is debatable, and probably depends on the user's background. If you already know Python before using Biopython, I would argue that the non-overlapping search is expected because that is what python strings do.
On the other hand, I'm sure many Biopython users learn Python and Biopython together - and one might still argue having strings and Seq objects do different things is unexpected. Overall between options (a) and (b), I'd pick consistency with the python string (a), even if it isn't ideal. There is another idea, let's call this option (c). Give the Seq object's count method an optional boolean argument to enable an overlapping search (which I would want to default to matching the python string behaviour). This makes switching between string and Seq objects easier, and makes the more useful (but probably slower) overlap aware count option quite accessible and discoverable. Peter From bartek at rezolwenta.eu.org Thu Mar 5 13:28:14 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Thu, 5 Mar 2009 14:28:14 +0100 Subject: [BioPython] The count method of a Seq (or MutableSeq) object In-Reply-To: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com> References: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com> Message-ID: <8b34ec180903050528m7a3815c8l3048046e42f0ce00@mail.gmail.com> On Thu, Mar 5, 2009 at 1:26 PM, Peter wrote: > (a) stick with the python string compatible behaviour (which has been > a general principle for the Seq class), but document this issue more > clearly as a non-overlapping search does run counter to some potential > biological uses. > > or, > > (b) Or change the behaviour as Leighton suggests to do an overlapping > search. ?This could break any code relying on the old python > string-like behaviour. > > What do people here think? ?Any preferences? > > [I don't want to get into details about the implementation here on the > main list] > I don't use the count method much, so I don't have a strong opinion on that. 
As Leighton pointed out, searching for sequences looks like a good job for Bio.Motif It's currently doable, but (since Bio.Motif mostly deals with more complex motifs than a single sequence) the interface is not polished and it's not optimized for performance. Currently the code to do this would look like this: m=Bio.Motif.Motif() m.add_instance(Seq("GG",m.alphabet)) for i in m.search_instances(your_long sequence): print "found GG at position",i If there is a need to keep backwards compatibility for .count(), I can make changes to Bio.Motif to make it easier for people to use it. -- Bartek From lpritc at scri.ac.uk Thu Mar 5 13:34:03 2009 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Thu, 05 Mar 2009 13:34:03 +0000 Subject: [BioPython] The count method of a Seq (or MutableSeq) object In-Reply-To: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com> Message-ID: Hi, On 05/03/2009 12:26, "Peter" wrote: > We should either: > > (a) stick with the python string compatible behaviour (which has been > a general principle for the Seq class), but document this issue more > clearly as a non-overlapping search does run counter to some potential > biological uses. > > or, > > (b) Or change the behaviour as Leighton suggests to do an overlapping > search. This could break any code relying on the old python > string-like behaviour. > > What do people here think? Any preferences? Not surprisingly, I favour (b). The intended domain of use for Seq is as a proxy for a biological entity and I think that, just as we extend methods to reflect useful biologically-themed operations, we should also override methods as appropriate to reflect those same themes. I can think of a number of run-of-the-mill use cases where we would want to know about the count of (potentially) overlapping matches of a subsequence in a biological sequence, for short sequence repeats (SSRs), restriction sites, protein sequence motifs, and so on. 
Also, if we want simply to test the expected number of occurrences of the dimer 'AA' in a larger sequence with a given base composition, a non-overlapping count() method will give a misleading answer, as it will underreport occurrences of 'AA' in odd-length runs of consecutive 'A's. I think that the overlapping approach (b) should at least be a default setting, even if we choose to make overlap/non-overlap an argument to the method. For some searches that potentially could have overlaps we might want to know what biological question is being asked before choosing which approach to take. We may, for example, desire different behaviour from query sequences like 'AGCCAG' depending on circumstances. This query on 'AGCCAGCCAG' will return 1 if there is no overlap is allowed, and 2 if an overlap is allowed. The same query on 'AGCCAGAGCCAG' will return 2 in both cases. If we care about 'AGCCAG' as a restriction site, then we would want an overlapping search. If we care about 'AGCCAG' as a simple repeat unit, then we might want a non-overlapping search instead (assuming that the circumstances of the search are such that this is a sensible answer). Having the option might be useful. A non-overlapping search might also be useful in those cases where existing code already corrects for nonintuitive behaviour of count(). This is only going to apply to code that has been produced since release 1.45, so may only have limited impact, if any. I would argue that, since a correction was needed, by parsimony the original behaviour was probably what required the change. On the whole, I think that an overlapping count() is the most intuitive and most likely use case. I see that there's an argument for consistency with string.count(), in that dyed-in-the-wool programmers might find it hard to shift mental gears from one to the other, but I'm not sure that it's a good argument, for the following reason. The following statements are true: A String is a Python sequence type. 
Its count() method returns a non-overlapping count of the query substring.

A List is a Python sequence type. Its count() method returns the number of elements that match the query.

A Tuple is a Python sequence type. It doesn't have a count() method, although you might imagine that it could stand to have one.

There isn't any cross-sequence object consistency regarding count(). Should we choose String-like or List-like behaviour when dealing with a MutableSeq? I don't think that we should seek consistency with String at the expense of utility or biological intuition, when:

A Seq/MutableSeq is a (Bio)Python sequence type. Its count() method returns the overlapping count of the query substring.

Fits nicely with the other three statements, in that none of them are consistent with any other ;)

L.

--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405

______________________________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.
The Scottish Crop Research Institute is a charitable company limited by guarantee.
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.

DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system.
Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any).
______________________________________________________________________

From mjldehoon at yahoo.com Thu Mar 5 14:49:10 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Thu, 5 Mar 2009 06:49:10 -0800 (PST)
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com>
Message-ID: <418103.38901.qm@web62405.mail.re1.yahoo.com>

I vote (b).
Another option is to continue to use count() for a Python-style count, and to add a new method that does an overlapping-type count. For this new method we'd need a clear but short name, and I can't think of anything now.

--Michiel.

--- On Thu, 3/5/09, Peter wrote:

> From: Peter
> Subject: [BioPython] The count method of a Seq (or MutableSeq) object
> To: "BioPython Mailing List"
> Date: Thursday, March 5, 2009, 7:26 AM
>
> Hi All,
>
> As the following examples show, and the python string method's
> docstring clearly states, the python string's count method uses a
> non-overlapping search:
>
> >>> "AAA".count("A")
> 3
> >>> "AAA".count("AA") # you might expect 2
> 1
> >>> "BBBB".count("BB") # you might expect 3
> 2
>
> Up until Biopython 1.44, the Seq object's count method only worked for
> single characters. From Biopython 1.45 onwards it accepted longer
> strings and followed the built in python string count behaviour.
> However, as Noel pointed out on Bug 2779 our docstring does not make
> it clear that this does a non-overlapping search. In fact, as
> Leighton suggests, one might expect the Seq object's count method to
> use an overlapping search.
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2779
>
> We should either:
>
> (a) stick with the python string compatible behaviour (which has been
> a general principle for the Seq class), but document this issue more
> clearly as a non-overlapping search does run counter to some potential
> biological uses.
>
> or,
>
> (b) Or change the behaviour as Leighton suggests to do an overlapping
> search. This could break any code relying on the old python
> string-like behaviour.
>
> What do people here think? Any preferences?
>
> [I don't want to get into details about the implementation here on the
> main list]
>
> Peter
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From biopython at maubp.freeserve.co.uk Thu Mar 5 15:05:39 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 5 Mar 2009 15:05:39 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <418103.38901.qm@web62405.mail.re1.yahoo.com>
References: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com> <418103.38901.qm@web62405.mail.re1.yahoo.com>
Message-ID: <320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>

On Thu, Mar 5, 2009 at 2:49 PM, Michiel de Hoon wrote:
>
> I vote (b).
> Another option is to continue to use count() for a Python-style count,
> and to add a new method that does an overlapping-type count. For this
> new method we'd need a clear but short name, and I can't think of
> anything now.
>
> --Michiel.

Did you like plan (c), which preserves the Python string style count as the default but offers the overlapping count via an optional argument?

i.e.
>>> from Bio.Seq import Seq
>>> nuc = Seq("AAAA")
>>> nuc.count("AA") #default is non-overlapping
2
>>> nuc.count("AA", overlap=True)
3
>>> nuc.count("AA", overlap=False)
2

Peter

From dalloliogm at gmail.com Thu Mar 5 15:10:59 2009
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Thu, 5 Mar 2009 16:10:59 +0100
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
References: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com> <418103.38901.qm@web62405.mail.re1.yahoo.com> <320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
Message-ID: <5aa3b3570903050710hb407258k6fca86cf1bf9520f@mail.gmail.com>

On Thu, Mar 5, 2009 at 4:05 PM, Peter wrote:
> On Thu, Mar 5, 2009 at 2:49 PM, Michiel de Hoon wrote:
>>
>> I vote (b).
>> Another option is to continue to use count() for a Python-style count,
>> and to add a new method that does an overlapping-type count. For this
>> new method we'd need a clear but short name, and I can't think of
>> anything now.
>>
>> --Michiel.
>
> Did you like plan (c), which preserves the Python string style count
> as the default but offers the overlapping count via an optional
> argument?
>
> i.e.
> >>> from Bio.Seq import Seq
> >>> nuc = Seq("AAAA")
> >>> nuc.count("AA") #default is non-overlapping
> 2
> >>> nuc.count("AA", overlap=True)
> 3
> >>> nuc.count("AA", overlap=False)
> 2

Imho this is the best solution. If I may say so, I expect a .count() method to act like the homonymous method in python strings. A good doctest example (similar to the existing one) would be nice, too.
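For illustration, plan (c) could be sketched as a standalone helper along these lines. The function name count_subseq is hypothetical and nothing here is Biopython API; only the overlap keyword is taken from Peter's example, and the input is simply treated as a plain string via str().

```python
def count_subseq(seq, sub, overlap=False):
    """Count occurrences of sub in seq (hypothetical helper, not Biopython API).

    overlap=False mimics str.count(); overlap=True also counts matches
    that share letters with the previous match.
    """
    text, sub = str(seq), str(sub)
    if not sub:
        raise ValueError("empty search string")
    # After a hit, resume one letter later (overlapping) or
    # just past the end of the hit (non-overlapping).
    step = 1 if overlap else len(sub)
    total = 0
    pos = text.find(sub)
    while pos != -1:
        total += 1
        pos = text.find(sub, pos + step)
    return total
```

With this sketch, count_subseq("AAAA", "AA") gives 2 and count_subseq("AAAA", "AA", overlap=True) gives 3, matching the session Peter describes.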
--
My blog on bioinformatics (now in English): http://bioinfoblog.it

From baoilleach at gmail.com Thu Mar 5 15:23:42 2009
From: baoilleach at gmail.com (Noel O'Boyle)
Date: Thu, 5 Mar 2009 15:23:42 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
References: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com> <418103.38901.qm@web62405.mail.re1.yahoo.com> <320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
Message-ID:

2009/3/5 Peter :
> On Thu, Mar 5, 2009 at 2:49 PM, Michiel de Hoon wrote:
>>
>> I vote (b).
>> Another option is to continue to use count() for a Python-style count,
>> and to add a new method that does an overlapping-type count. For this
>> new method we'd need a clear but short name, and I can't think of
>> anything now.
>>
>> --Michiel.
>
> Did you like plan (c), which preserves the Python string style count
> as the default but offers the overlapping count via an optional
> argument?
>
> i.e.
> >>> from Bio.Seq import Seq
> >>> nuc = Seq("AAAA")
> >>> nuc.count("AA") #default is non-overlapping
> 2
> >>> nuc.count("AA", overlap=True)
> 3
> >>> nuc.count("AA", overlap=False)
> 2
>
> Peter

I think we are arguing here over which should be the default value.

Several people here believe that behaviour analogous to Python's string.count will reduce bug reports and user confusion. However, no-one except Leighton has been able to come up with a single use case where the current behaviour is useful (and even that example, with respect, was flimsy). So we end up with a method which adheres magnificently to the principle of least surprise, but which is of no use to users. Aren't you trying to provide methods which are useful for biological analysis?
Isn't that the purpose of wrapping the string in the first place?

Noel (getting far too excited over painting this bikeshed)

From bsouthey at gmail.com Thu Mar 5 16:28:11 2009
From: bsouthey at gmail.com (Bruce Southey)
Date: Thu, 5 Mar 2009 10:28:11 -0600
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To:
References: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com> <418103.38901.qm@web62405.mail.re1.yahoo.com> <320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
Message-ID:

Hi,
This is a little deja vu as I feel this type of thing has come up before. While I cannot speak for anyone else, if I sound different to that, then I was obviously convinced by those arguments as that sounds better than I forgot :-)

More seriously, ignoring the reading frame or the genetic code when counting is rather bad form!

I cannot think of a relevant case involving a protein sequence - although counting pairs of cysteines in insulin-like sequences could be a situation of importance (related to disulphide bonds).

An example for nucleic sequences: counting 'TTT' in the made-up sequence 'TTTTTTTGG' can give two in frames 1 and 2 but only one in frame 3.

Also, a weaker concern is that having the sum of counts greater than or equal to the length of the sequence is not a desirable property unless the user is informed that duplicates were found. In the above case, five sounds rather wrong when one says that a DNA sequence of nine bases can produce five phenylalanines (TTT)! Yes, context is everything because 3 different results is not nice.

Don't get me wrong, I know that finding duplicates is important, just that it should not be here - there must be different functions.

Thus, I vote for (a) and I also prefer that the default syntax is consistent with the Python language. If this change is done, then all of Biopython must be revised to be consistent - like reading frames and similar discussion...
Bruce

From biopython at maubp.freeserve.co.uk Thu Mar 5 16:34:37 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 5 Mar 2009 16:34:37 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To:
References: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com> <418103.38901.qm@web62405.mail.re1.yahoo.com> <320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
Message-ID: <320fb6e00903050834i32bd8d64w672e53b6ef1dbf56@mail.gmail.com>

On Thu, Mar 5, 2009 at 4:28 PM, Bruce Southey wrote:
> Hi,
> This is a little deja vu as I feel this type of thing has come up
> before. While I can not speak for anyone else, if I sound different to
> that, then I was obviously convinced by those arguments as that
> sounds better than I forgot :-)
>
> More seriously, ignoring the reading frame or the genetic code when
> counting is rather bad form!

Why? In many situations they are irrelevant. Consider counting restriction enzyme digest sites for example, plus of course counting in any protein sequence.

> I can not think of a relevant case involving a protein sequence -
> although counting pairs of cysteines in insulin-like sequences could
> be a situation of importance (related to disulphide bonds).
>
> An example for nucleic sequences, counting 'TTT' in the madeup
> sequence 'TTTTTTTGG' can be two in frames 1 and 2 but only one in
> frame 3.

Giving an answer of 2 (using a non-overlapping search like the python string method) or 5 (using an overlapping search) are both valid expected outcomes for "TTT" in "TTTTTTTGG". Here you seem to want to count codons - which is by its nature a frame dependent task.
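The per-frame numbers in this exchange can be checked with a small frame-aware counter. This is only a sketch: count_codon is a hypothetical helper, not a Biopython function, and frames are numbered 0, 1, 2 here (corresponding to Bruce's frames 1, 2, 3).

```python
def count_codon(seq, codon, frame=0):
    """Count a codon in one forward reading frame (hypothetical helper).

    frame is 0, 1 or 2; only complete triplets starting at
    frame, frame + 3, frame + 6, ... are compared.
    """
    text = str(seq)
    return sum(
        1
        for i in range(frame, len(text) - 2, 3)
        if text[i:i + 3] == codon
    )


seq = "TTTTTTTGG"
# One count per forward frame; this evaluates to [2, 2, 1].
per_frame = [count_codon(seq, "TTT", frame) for frame in range(3)]
```

None of the three per-frame values equals the overlapping count of 5, and only two of them happen to coincide with the non-overlapping str.count() result of 2, which is why codon counting is treated as a separate, frame dependent question.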
Peter

From biopython at maubp.freeserve.co.uk Thu Mar 5 16:35:10 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 5 Mar 2009 16:35:10 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To:
References: <320fb6e00903050426h4cb79170x3298b206ecbe294a@mail.gmail.com> <418103.38901.qm@web62405.mail.re1.yahoo.com> <320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
Message-ID: <320fb6e00903050835h2c548083jda67b5f50fcfc842@mail.gmail.com>

On Thu, Mar 5, 2009 at 3:23 PM, Noel O'Boyle wrote:
> I think we are arguing here over which should be the default value.
>
> Several people here believe that behaviour analogous to Python's
> string.count will reduce bug reports and user confusion. However,
> no-one except Leighton has been able to come up with a single use case
> where the current behaviour is useful (and even that example, with
> respect, was flimsy). So we end up with a method which adheres
> magnificently to the principle of least surprise, but which is of no
> use to users. Aren't you trying to provide methods which are useful
> for biological analysis? Isn't that the purpose of wrapping the string
> in the first place?
>
> Noel (getting far too excited over painting this bikeshed)

If we hadn't been shipping Biopython with the old non-overlapping python-string-like count method for the last year, I would probably have been more willing to agree that the Seq count method could differ from the python string and use an overlapping search. However, changing it now also breaks backwards compatibility, which shouldn't be done lightly.

We could still do this (implementation discussion on the dev list or on Bug 2779), but will have to make this change very clear in the release notes.
Peter

From mjldehoon at yahoo.com Fri Mar 6 11:52:58 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Fri, 6 Mar 2009 03:52:58 -0800 (PST)
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com>
Message-ID: <791065.98994.qm@web62403.mail.re1.yahoo.com>

> > Another option is to continue to use count() for a Python-style count,
> > and to add a new method that does an overlapping-type count. For this
> > new method we'd need a clear but short name, and I can't think of
> > anything now.
>
> Did you like plan (c), which preserves the Python string style count
> as the default but offers the overlapping count via an optional
> argument?

It's also OK, but if we use a different method name we can leave count() untouched altogether.

--Michiel.

From biopython at maubp.freeserve.co.uk Fri Mar 6 12:07:57 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 6 Mar 2009 12:07:57 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <791065.98994.qm@web62403.mail.re1.yahoo.com>
References: <320fb6e00903050705mc940e8cy80fb545f6cb9ea5e@mail.gmail.com> <791065.98994.qm@web62403.mail.re1.yahoo.com>
Message-ID: <320fb6e00903060407u7383545fp80fc8b81899a33a7@mail.gmail.com>

On Fri, Mar 6, 2009 at 11:52 AM, Michiel de Hoon wrote:
>> > Another option is to continue to use count() for a Python-style count,
>> > and to add a new method that does an overlapping-type count. For this
>> > new method we'd need a clear but short name, and I can't think of
>> > anything now.
>>
>> Did you like plan (c), which preserves the Python string style count
>> as the default but offers the overlapping count via an optional
>> argument?
>
> It's also OK, but if we use a different method name we can leave count() untouched altogether.
Looking back, Sebastian Bassi raised this issue back in 2003 on this mailing list, and his overlap-aware count implementation is used internally by Bio.SeqUtils.MeltingTemp, see:

http://lists.open-bio.org/pipermail/biopython/2003-November/001741.html
http://lists.open-bio.org/pipermail/biopython/2003-November/001742.html
etc

Sebastian also posted an enhancement request for adding an overlap-aware counting method to the python base string, with "overcount" as a possible name. I don't know what happened to his bug report; it seems to have been marked private:

http://mail.python.org/pipermail/python-bugs-list/2003-November/021239.html

I don't really like the name "overcount", but as another suggestion how about "count_ol", which is short for count-with-overlaps?

Peter

From lpritc at scri.ac.uk Fri Mar 6 12:15:59 2009
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Fri, 06 Mar 2009 12:15:59 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <791065.98994.qm@web62403.mail.re1.yahoo.com>
Message-ID:

In the spirit of being blindingly obvious, how about:

Seq.overlapping_count()

;)

L.

On 06/03/2009 11:52, "Michiel de Hoon" wrote:

>
>>> Another option is to continue to use count() for a Python-style count,
>>> and to add a new method that does an overlapping-type count. For this
>>> new method we'd need a clear but short name, and I can't think of
>>> anything now.
>>>
>> Did you like plan (c), which preserves the Python string style count
>> as the default but offers the overlapping count via an optional
>> argument?
>>
> It's also OK, but if we use a different method name we can leave count()
> untouched altogether.
--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405

From chapmanb at 50mail.com Fri Mar 6 13:14:04 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 6 Mar 2009 08:14:04 -0500
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To:
References: <791065.98994.qm@web62403.mail.re1.yahoo.com>
Message-ID: <20090306131404.GJ69627@sobchak.mgh.harvard.edu>

Hey all;
Great discussion on this. My preference is for a new function, and I like Leighton's naming suggestion.
Also, unless someone has a use case for the current count() function, we should deprecate and eventually remove it. Overriding the string API where it makes sense is good, but here it seems to be creating confusion and not solving a problem. If someone needs the real string count, they can always do str(your_seq).count("GG").

Brad

From biopython at maubp.freeserve.co.uk Fri Mar 6 14:13:42 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 6 Mar 2009 14:13:42 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <20090306131404.GJ69627@sobchak.mgh.harvard.edu>
References: <791065.98994.qm@web62403.mail.re1.yahoo.com> <20090306131404.GJ69627@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com>

On Fri, Mar 6, 2009 at 1:14 PM, Brad Chapman wrote:
> Hey all;
> Great discussion on this. My preference is for a new function,
> and I like Leighton's naming suggestion.

Yes, "overlapping_count" is a reasonable choice. It's a bit long, but it is clear.

> Also, unless someone has a use case for the current count()
> function, we should deprecate and eventually remove it. Overriding
> the string API where it makes sense is good, but here it seems to be
> creating confusion and not solving a problem. If someone needs the
> real string count, they can always do str(your_seq).count("GG").
There is the very common use case of my_seq.count("A"), or similar, with single character search strings, and lots of code does this (both in Biopython and I'm sure users' scripts). For single letters of course, a non-overlapping count and an overlapping count do the same thing - deprecating the count method would cause a lot of unnecessary upheaval.

Ignoring that, given we want the Seq to generally behave like a python string, I think removing the count method would still be a bad idea.

[As a compromise, assuming we add an overlapping_count method and do a Biopython 1.50 beta release, the beta release could include a warning in the count method when used with a multi-character search string, suggesting the user might in fact need an overlapping count. Or is this a bit too crazy?]

Peter

From bsouthey at gmail.com Fri Mar 6 15:06:07 2009
From: bsouthey at gmail.com (Bruce Southey)
Date: Fri, 06 Mar 2009 09:06:07 -0600
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com>
References: <791065.98994.qm@web62403.mail.re1.yahoo.com> <20090306131404.GJ69627@sobchak.mgh.harvard.edu> <320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com>
Message-ID: <49B13BDF.9030908@gmail.com>

Peter wrote:
> On Fri, Mar 6, 2009 at 1:14 PM, Brad Chapman wrote:
>
>> Hey all;
>> Great discussion on this. My preference is for a new function,
>> and I like Leighton's naming suggestion.
>>
>
> Yes, "overlapping_count" is a reasonable choice. It's a bit long, but
> it is clear.
>
>
>> Also, unless someone has a use case for the current count()
>> function, we should deprecate and eventually remove it. Overriding
>> the string API where it makes sense is good, but here it seems to be
>> creating confusion and not solving a problem. If someone needs the
>> real string count, they can always do str(your_seq).count("GG").
>>

I have already given one use case where overlapping counts are totally inappropriate! Unique codon counting is extremely important in many areas including gene prediction (possible splicing sites) and molecular evolution (like codon usage). Another valid case given was DNA restriction sites, where you may want both overlapping and unique counts. For example, DNA may be digested by one enzyme that has unique sites in the sequence, followed by a second enzyme that has unique sites in the digested product but possibly duplicates in the original sequence.

I just do not understand your logic of requiring a conversion when the Seq object is designed to 'behave like a python string'.

>
> There is the very common use case of my_seq.count("A"), or similar,
> with single character search strings, and lots of code does this (both
> in Biopython and I'm sure users' scripts). For single letters of
> course, a non-overlapping count and an overlapping count do the same
> thing - deprecating the count method would cause a lot of unnecessary
> upheaval.
>
> Ignoring that, given we want the Seq to generally behave like a python
> string, I think removing the count method would still be a bad idea.
>

I agree.

> [As a compromise, assuming we add an overlapping_count method and do a
> Biopython 1.50 beta release, the beta release could include a warning
> in the count method when used with a multi-character search string,
> suggesting the user might in fact need an overlapping count. Or is
> this a bit too crazy?]
>

Yes it is too crazy and does not fit into the current established behavior of Biopython.
Bruce

From biopython at maubp.freeserve.co.uk Fri Mar 6 15:15:24 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 6 Mar 2009 15:15:24 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <49B13BDF.9030908@gmail.com>
References: <791065.98994.qm@web62403.mail.re1.yahoo.com> <20090306131404.GJ69627@sobchak.mgh.harvard.edu> <320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com> <49B13BDF.9030908@gmail.com>
Message-ID: <320fb6e00903060715w73d6efccqa5ffc9813ce3ec6a@mail.gmail.com>

On Fri, Mar 6, 2009 at 3:06 PM, Bruce Southey wrote:
> I have already given one use case where overlapping counts are totally
> inappropriate! Unique codon counting is extremely important in many areas
> including gene prediction (possible splicing sites) and molecular evolution
> (like codon usage).

For codon counting NEITHER the current non-overlapping count nor the suggested overlapping count would be suitable. So this doesn't really affect the overlapping versus non-overlapping debate.

Peter

From bsouthey at gmail.com Fri Mar 6 15:34:42 2009
From: bsouthey at gmail.com (Bruce Southey)
Date: Fri, 06 Mar 2009 09:34:42 -0600
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <320fb6e00903060715w73d6efccqa5ffc9813ce3ec6a@mail.gmail.com>
References: <791065.98994.qm@web62403.mail.re1.yahoo.com> <20090306131404.GJ69627@sobchak.mgh.harvard.edu> <320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com> <49B13BDF.9030908@gmail.com> <320fb6e00903060715w73d6efccqa5ffc9813ce3ec6a@mail.gmail.com>
Message-ID: <49B14292.6080806@gmail.com>

Peter wrote:
> On Fri, Mar 6, 2009 at 3:06 PM, Bruce Southey wrote:
>
>> I have already given one use case where overlapping counts are totally
>> inappropriate! Unique codon counting is extremely important in many areas
>> including gene prediction (possible splicing sites) and molecular evolution
>> (like codon usage).
>>
>
> For codon counting NEITHER the current non-overlapping count nor the
> suggested overlapping count would be suitable. So this doesn't really
> affect the overlapping versus non-overlapping debate.
>
> Peter
>

With due respect, this does not make any sense.

If it is a cDNA then I can count, say, the different lysine codons to find any usage bias using seq.count('AAA') / (seq.count('AAA') + seq.count('AAG')). (Actually I am more interested in the occurrence of specific multiple codons than single codons.)

If you want the forward frames then just seq[0:].count('AAA'), seq[1:].count('AAA') and seq[2:].count('AAA') for frames 1, 2, and 3, respectively.

As you pointed out single characters are not relevant, so what is relevant?

Bruce

From biopython at maubp.freeserve.co.uk Fri Mar 6 15:46:19 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 6 Mar 2009 15:46:19 +0000
Subject: [BioPython] The count method of a Seq (or MutableSeq) object
In-Reply-To: <49B14292.6080806@gmail.com>
References: <791065.98994.qm@web62403.mail.re1.yahoo.com> <20090306131404.GJ69627@sobchak.mgh.harvard.edu> <320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com> <49B13BDF.9030908@gmail.com> <320fb6e00903060715w73d6efccqa5ffc9813ce3ec6a@mail.gmail.com> <49B14292.6080806@gmail.com>
Message-ID: <320fb6e00903060746r309216e7t36d00434993a8cfb@mail.gmail.com>

On Fri, Mar 6, 2009 at 3:34 PM, Bruce Southey wrote:
>>
>> For codon counting NEITHER the current non-overlapping count nor the
>> suggested overlapping count would be suitable. So this doesn't really
>> affect the overlapping versus non-overlapping debate.
>>
>> Peter
>
> With due respect, this does not make any sense.
>
> If it is a cDNA then I can count say the different Lysine codons to find any
> usage bias using seq.count('AAA') / (seq.count('AAA') + seq.count('AAG')).
> (Actually I am more interested in the occurrence of specific multiple codons
> than single codons.)
If you have the (short) CDS "TAAAAAAAAAAG" which codes for "LKKK", then the codon count for "AAA" is 2 and the codon count for "AAG" is 1. Using the (standard python) non overlapping count method, "TAAAAAAAAAAG".count("AAA") = 3 and "TAAAAAAAAAAG".count("AAG") = 1 which does not do what you want. Using a hypothetical overlapping count method, "TAAAAAAAAAAG".overlapping_count("AAA") = 8 and "TAAAAAAAAAAG".overlapping_count("AAG") = 1 which does not do what you want. i.e. As I said, for codon counting NEITHER the current non-overlapping count nor the suggested overlapping count would be suitable. You seem to be asking for something different - a codon counting method, which is a special case of a non-overlapping count. Peter From lpritc at scri.ac.uk Fri Mar 6 15:47:37 2009 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Fri, 06 Mar 2009 15:47:37 +0000 Subject: [BioPython] The count method of a Seq (or MutableSeq) object In-Reply-To: <49B13BDF.9030908@gmail.com> Message-ID: On 06/03/2009 15:06, "Bruce Southey" wrote: > Peter wrote: >> On Fri, Mar 6, 2009 at 1:14 PM, Brad Chapman wrote: unless someone has a use case for the current count() >>> function, we should deprecate and eventually remove it. Overriding >>> the string API where it makes sense is good, but here it seems to be >>> creating confusion and not solving a problem. If someone needs the >>> real string count, they can always do str(your_seq).count("GG"). >>> > I have already given one user case where overlapping counts is totally > inappropriate! Unique codon counting is extremely important in many > areas including gene prediction (possible splicing sites) and molecular > evolution (like codon usage). We're not discussing codon counting though, we're discussing counting occurrences of an arbitrary substring in a sequence. They're not the same operation, even though they both involve counting. L. 
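To make the three kinds of counting in this thread concrete, here is a minimal sketch on plain Python strings that reproduces the numbers Peter quotes above. The names overlapping_count and codon_count are only illustrative helpers written for this note; they are not existing Seq methods, and the thread does not settle on an API.

```python
def overlapping_count(seq, sub):
    """Count every start position at which sub occurs in seq (overlaps allowed)."""
    count = 0
    start = seq.find(sub)
    while start != -1:
        count += 1
        # Step forward by one character only, so overlapping hits are found too.
        start = seq.find(sub, start + 1)
    return count


def codon_count(seq, codon, phase=0):
    """Count in-frame occurrences of codon, reading triplets from offset phase."""
    return sum(1 for i in range(phase, len(seq) - 2, 3) if seq[i:i + 3] == codon)


cds = "TAAAAAAAAAAG"  # read in frame: TAA AAA AAA AAG

print(cds.count("AAA"))               # standard non-overlapping count: 3
print(overlapping_count(cds, "AAA"))  # overlapping count: 8
print(codon_count(cds, "AAA"))        # in-frame codon count: 2
print(codon_count(cds, "AAG"))        # in-frame codon count: 1
```

Only the last two calls respect the reading frame, which is why neither substring count answers the codon usage question.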
-- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
______________________________________________________________________ From chapmanb at 50mail.com Fri Mar 6 22:46:39 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 6 Mar 2009 17:46:39 -0500 Subject: [BioPython] The count method of a Seq (or MutableSeq) object In-Reply-To: <320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com> References: <791065.98994.qm@web62403.mail.re1.yahoo.com> <20090306131404.GJ69627@sobchak.mgh.harvard.edu> <320fb6e00903060613j51b376dcpbb1c0029f3450106@mail.gmail.com> Message-ID: <20090306224639.GM69627@sobchak.mgh.harvard.edu> Me: > > Also, unless someone has a use case for the current count() > > function, we should deprecate and eventually remove it. Overriding > > the string API where it makes sense is good, but here it seems to be > > creating confusion and not solving a problem. If someone needs the > > real string count, they can always do str(your_seq).count("GG"). Bruce: > I have already given one user case where overlapping counts is totally > inappropriate! Unique codon counting Sorry, I was a bit terse in my previous e-mail. My thought on deprecation was actually based on your and Noel's emails; both of you presented cases where you had biological expectations for count which are not met by the standard string count behaviour. For Noel, this is handled by the proposed overlapping_count function. For your example, I think it would be better handled by functionality that returned a list of codons, like: Seq("ATGGAACAT").codon_list(phase=0) ["ATG", "GAA", "CAT"] Bruce: > I just do not understand you logic of requiring a conversion when the > Seq object is designed to 'behave like a python string'. This is representing a biological sequence, so I think where a biologist user's intuition opposes what a standard python string does we should evaluate for an option that is more in line with expectations. 
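For anyone wanting to try the codon_list idea today, a rough sketch follows. It is written as a stand-alone function on anything that converts to a string, because codon_list is only a proposed name here and not an existing Seq method; the phase check and the choice to drop a trailing partial codon are assumptions of this sketch.

```python
from collections import Counter


def codon_list(seq, phase=0):
    """Split seq into complete triplets, starting at offset phase (0, 1 or 2)."""
    if phase not in (0, 1, 2):
        raise ValueError("phase must be 0, 1 or 2")
    text = str(seq)
    # A trailing partial codon (one or two leftover letters) is dropped.
    return [text[i:i + 3] for i in range(phase, len(text) - 2, 3)]


print(codon_list("ATGGAACAT"))           # ['ATG', 'GAA', 'CAT']
print(codon_list("ATGGAACAT", phase=1))  # ['TGG', 'AAC']

# Codon usage then falls out of an ordinary Counter over the list.
usage = Counter(codon_list("ATGAAAAAGAAATAA"))
lysine_bias = usage["AAA"] / float(usage["AAA"] + usage["AAG"])
print(lysine_bias)
```

The float() call keeps the ratio from being truncated by integer division on Python 2.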
My point about the string was just that if you are thinking as a python programmer and really want python string behavior, it is pretty easy to get. Peter: > There is the very common use case of my_seq.count("A"), or similar, > with single character search strings, and lots of code does this (both > in Biopython and I'm sure users' scripts). For single letters of > course, a non-overlapping count and an overlapping count do the same > thing - deprecating the count method would cause a lot of unnecessary > upheaval. Good point; I totally overlooked that. I retract my suggestion. I do like your warning idea, but maybe we can get by here with documentation and by highlighting the alternative functions. It looked like you're already all over the documentation, so hopefully the new functionality will fix up any confusion. Thanks all, Brad From chapmanb at 50mail.com Sun Mar 8 16:29:41 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Sun, 8 Mar 2009 12:29:41 -0400 Subject: [BioPython] Initial work on a GFF parser Message-ID: <20090308162941.GA99653@kunkel> Hi all; Generic Feature Format (GFF) is a nice tab delimited file format that we don't have full support for in Biopython. Michael Hoffman contributed code to work with GFF MySQL databases (in Bio.GFF), but we don't have a GFF parser for the flatfiles. Looking back over the list archives, this has come up a couple of times without a finished solution being implemented. GFF suffers from the curse of being too easy to hack together a solution for parsing a very specific problem, while generating a good standard parser takes more work. Recently, Peter brought up GFF on the BioSQL mailing list, which made me interested in digging into GFF as an input and output flat file format for BioSQL databases. Towards this end I put together an initial implementation of a GFF (version 3) parser for Biopython.
A write up and the code are here: http://bcbio.wordpress.com/2009/03/08/initial-gff-parser-for-biopython/ As described in the post, the GFF interface will be a bit different from the standard SeqIO interface, since GFF stores features separately from the sequences and also doesn't require features for a record to be grouped together. As a result, the interface is up for discussion and the best path is to start with an implementation and see where it takes us. I'd be grateful for any feedback and code from those who are interested. We can discuss on the development mailing list or on the blog, and move towards getting stable full featured GFF parsing in Biopython. Brad From biopython at maubp.freeserve.co.uk Mon Mar 9 10:14:55 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 9 Mar 2009 10:14:55 +0000 Subject: [BioPython] Initial work on a GFF parser In-Reply-To: <20090308162941.GA99653@kunkel> References: <20090308162941.GA99653@kunkel> Message-ID: <320fb6e00903090314q19d64af2m4e37918fc3f5f164@mail.gmail.com> On Sun, Mar 8, 2009 at 4:29 PM, Brad Chapman wrote: > Hi all; > Generic Feature Format (GFF) is a nice tab delimited file format > that we don't have full support for in Biopython. Michael Hoffman > contributed code to work with GFF MySQL databases (in Bio.GFF), but > we don't have a GFF parser for the flatfiles. Looking back over the > list archives, this has come up a couple of times without a finished > solution being implemented. GFF suffers from the curse of being too easy > to hack together a solution for parsing a very specific problem, while > generating a good standard parser takes more work. You're right about creating a good general parser taking more work ;) See also enhancement Bug 2762, GFF capability in SeqIO, which has some discussion. Also, it wasn't clear from your blog if you are thinking about just GFF version 3, or something more general, coping with the assorted comparatively ill defined GFF2 variants. 
> Recently, Peter brought up GFF on the BioSQL mailing list, which > made me interested in digging into GFF as an input and output flat > file format for BioSQL databases. Towards this end I put together an > initial implementation of a GFF (version 3) parser for Biopython. A > write up and the code are here: > > http://bcbio.wordpress.com/2009/03/08/initial-gff-parser-for-biopython/ > > As described in the post, the GFF interface will be a bit different > from the standard SeqIO interface, since GFF stores features > separately from the sequences and also doesn't require features for > a record to be grouped together. Regarding where to put this code, if it isn't going to support the Bio.SeqIO interface then it shouldn't really go in Bio.SeqIO, but maybe Bio.GFF or Bio.GFF3 instead. However, you could still fit gff(3) files into Bio.SeqIO, it's just that the sequence may not be present. This would be similar to GenBank files, which usually have a long list of features plus the full sequence, but where the sequence itself may be missing - for example if there is just a CONTIG line. The same goes for QUAL files from sequencing, where there is never a sequence. As with GenBank files for a large genome/chromosome, for a typical GFF file for Bio.SeqIO we'd just return a single SeqRecord containing all the features - within the SeqIO API there is no way to offer memory efficient iteration over the features themselves. Maybe we need to invent Bio.FeatureIO for this? You could consider GenBank/EMBL feature tables, GFF files, NCBI protein tables, and probably a few other formats too. > As a result, the interface is up for discussion and the best path is to > start with an implementation and see where it takes us. I'd be grateful > for any feedback and code from those who are interested. We can discuss > on the development mailing list or on the blog, and move towards getting > stable full featured GFF parsing in Biopython.
From the blog post it sounds like you are using sub-features to store the parent/child relationship between say mRNAs and genes. This is elegant, but as I wrote on Bug 2762 comment 1, this isn't enough to cope with the general parent (part-of) relationships allowed in GFF files - for example an exon may have multiple parents. There is also the complication that when parsing GenBank files, a gene or CDS feature with a join-location ends up represented using sub-features (which probably would be represented with an explicit intron/exon structure in GFF files) [This is something I don't really like with the current object structure]. We'd want things to be fairly uniform between the parsers - for one thing our BioSQL code currently records a feature with subfeatures as a single feature in the database. Peter From chapmanb at 50mail.com Mon Mar 9 22:42:24 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 9 Mar 2009 18:42:24 -0400 Subject: [BioPython] Initial work on a GFF parser In-Reply-To: <320fb6e00903090314q19d64af2m4e37918fc3f5f164@mail.gmail.com> References: <20090308162941.GA99653@kunkel> <320fb6e00903090314q19d64af2m4e37918fc3f5f164@mail.gmail.com> Message-ID: <20090309224224.GA4481@sobchak.mgh.harvard.edu> Peter; Thanks much for the feedback. > See also enhancement Bug 2762, GFF capability in SeqIO, which has some > discussion. > > Also, it wasn't clear from your blog if you are thinking about just > GFF version 3, or something more general, coping with the assorted > comparatively ill defined GFF2 variants. Bug 2762 had a lot of good background and ideas which helped in getting started. I did take the sub_feature route instead of the flattened method Leighton suggested there. Right now this tackles GFF3. The hard part is going to be getting a framework in place, and then GFF2 or GTF (or GFF2.5 or whatever they call it) support could be added.
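For readers following along without the blog post open, the sub_feature idea can be sketched in a few lines of plain Python. To be clear, this is not the parser from the write up: the dictionaries, the parse_gff3 name and the limit_types filter are all made up for illustration, and details such as URL-escaped attribute values are ignored. It only shows how Parent attributes can be turned into nesting, including an exon with two parents.

```python
GFF3_EXAMPLE = """\
##gff-version 3
chr1\texample\tgene\t1\t900\t.\t+\t.\tID=gene1
chr1\texample\tmRNA\t1\t900\t.\t+\t.\tID=mrna1;Parent=gene1
chr1\texample\tmRNA\t1\t600\t.\t+\t.\tID=mrna2;Parent=gene1
chr1\texample\texon\t1\t300\t.\t+\t.\tID=exon1;Parent=mrna1,mrna2
"""


def parse_attributes(column):
    """Turn 'ID=x;Parent=a,b' into {'ID': ['x'], 'Parent': ['a', 'b']}."""
    attributes = {}
    for pair in column.strip().split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = value.split(",")
    return attributes


def parse_gff3(lines, limit_types=None):
    """Return top-level features, with children nested under every parent.

    limit_types, if given, is a set of feature types to keep (hypothetical
    filter); a kept feature whose parents were filtered out becomes top-level.
    """
    features = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        (seqid, source, ftype, start, end,
         score, strand, phase, attrs) = line.rstrip("\n").split("\t")
        if limit_types is not None and ftype not in limit_types:
            continue
        features.append({
            "seqid": seqid, "type": ftype,
            "start": int(start), "end": int(end), "strand": strand,
            "attributes": parse_attributes(attrs),
            "sub_features": [],
        })
    # Link only after reading everything, since parents need not come first.
    by_id = dict((f["attributes"]["ID"][0], f)
                 for f in features if "ID" in f["attributes"])
    top_level = []
    for feature in features:
        parents = [by_id[p] for p in feature["attributes"].get("Parent", [])
                   if p in by_id]
        if not parents:
            top_level.append(feature)
        for parent in parents:
            parent["sub_features"].append(feature)  # same object under each parent
    return top_level


genes = parse_gff3(GFF3_EXAMPLE.splitlines())
mrna1, mrna2 = genes[0]["sub_features"]
print(mrna1["sub_features"][0] is mrna2["sub_features"][0])  # True: one shared exon
```

Because linking happens only after the whole input has been read, the order of lines in the file does not matter, at the cost of holding every kept feature in memory.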
> Regarding where to put this code, if it isn't going to support the > Bio.SeqIO interface then it shouldn't really go in Bio.SeqIO, but > maybe Bio.GFF or Bio.GFF3 instead. > > However, you could still fit gff(3) files into Bio.SeqIO, its just > that the sequence may not be present. This would be similar GenBank > files usually have a long list of features plus the full sequence, but > the sequence itself may be missing - for example if there is a just a > CONTIG line. Or QUAL files from sequencing where there is never a > sequence. Yes, where it lives is a good topic for debate. For GFF files, you'd at least like the option to add new features to an existing sequence record, which is what I do here. It would be easy enough to create new blank records if one is not present initially. The difficult thing with adding this to the existing syntax is that the GFF files are not ordered for efficient iteration. You essentially have to parse the whole file, so something like this would handle the syntax: seq_dict = SeqIO.to_dict(SeqIO.parse(seq_handle, "fasta")) final_seq_dict = SeqIO.add_features(gff_handle, "gff3", initial_dict=seq_dict) Along these lines, I liked the way you did a sequence/quality dual iterator for quality output and think that works well when ordering of the records in multiple files is stable. > As with GenBank files for large genome/chromosome, for a typical GFF > file for Bio.SeqIO we'd just return a single SeqRecord containing all > the features - within the SeqIO API there is no way to offer memory > efficient iteration over the features themselves. > > Maybe we need to invent Bio.FeatureIO for this? You could consider > GenBank/EMBL feature tables, GFF files, NCBI protein tables, and > probably a few other formats too. 
FeatureIO is something BioPerl has; this page describes the status of GFF in BioPerl but is over a year old so things may have changed: http://www.bioperl.org/wiki/GFF_code_audit The iteration model still falls apart because of the undefined ordering of the file. That is why I settled on the filter approach to limit what you get to a reasonable memory size but still guarantee you've pulled all relevant features before building the parent/child relationships and features. This could also apply to data that comes off cluster runs where the output order will not necessarily correlate with the inputs. The filtering approach could also be useful for large GenBank files, as you could skip adding features and parsing locations for elements you are not interested in. If others find this approach intuitive, it would be worth looking at there as well. > From the blog post it sounds like you are using sub-features to store > the parent/child relationship between say mRNAs and genes. This is > elegant, but as I wrote on Bug 2762 comment 1, this isn't enough to > cope with the general parent (part-of) relationships allowed in GFF > files - for example an exon may have multiple parents. For these the exon is added as a sub_feature to all of its parents. The shared feature is the same one in memory. t_nested_multiparent_features in the test code demonstrates this. How we output it to BioSQL is up for debate but we should also be able to do some sharing there; duplication is also not too bad of an option if it makes it cleaner since these are not likely to be deeply nested. > There is also the complication that when parsing GenBank files, a gene > or CDS feature with a join-location ends up represented using > sub-features (which probably would be represented with an explicit > intron/exon structure in GFF files) [This is something I don't really > like with the current object structure]. 
We'd want things to be > fairly uniform between the parsers - for one thing our BioSQL code > currently records a feature with subfeatures as a single feature in > the database. BioSQL definitely needs work to handle sub_features more generally. The seqfeature_relationship table in BioSQL can handle these but it needs to be coded. I agree with you that the way we do it now is a little too GenBank specific. This is a bit of a larger project since we should coordinate with the other projects, but as long as we continue to support the same location mechanism they use currently it will be back-compatible with older code. Thanks again for the thoughts, Brad From hlapp at gmx.net Tue Mar 10 03:36:30 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 9 Mar 2009 23:36:30 -0400 Subject: [BioPython] Google Summer of Code: Call for Bio* Volunteers In-Reply-To: References: Message-ID: You may recall my message to the developer lists of several O|B|F projects in February about the idea of O|B|F applying to Google Summer of Code as a mentoring organization [1]. I felt that the response to this was very positive and encouraging. Although late (sorry, been swamped too much), I've now put up the skeleton of an ideas page at http://open-bio.org/wiki/Google_Summer_Code_2009 I basically modeled (in fact, largely copied) this page after the NESCent Phyloinformatics Summer of Code ideas pages, which I think worked pretty well. We can completely rework this, though - any feedback and suggestions are very much welcome. In the meantime, I need all developers to double check the information under 'Contact'. Would the open-bio-l mailing list indeed reach the prospective mentors and other devs? Will you be fine with students asking for feedback on their applications on the developers (i.e., this) list? Is there a blessed IRC channel where at least some of the prospective mentors hang out for students to ask questions during the time they apply?
I also need space for the reference information for all projects that will participate with at least one project idea (I would hope that that's all projects) to be added in the 'Open-Bio projects involved' section. ***** Most important of all, if you can volunteer to mentor a project, please post a project idea to the page in the respective section, using the idea template that's there already (copy, paste, and edit). ***** The deadline for organization applications is Friday this week, Mar 13, which is very soon. The ideas page is a major factor and component in how Google scores new mentoring organizations - the more we can show the resourcefulness and diversity of our member projects the more competitive I think we'll be. So all those who responded earlier with ideas or willingness to help out as primary or secondary mentors, I need you to think about and put up your idea(s) now. Cheers, -hilmar [1] http://tinyurl.com/ck7tqe -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From dalloliogm at gmail.com Tue Mar 10 17:06:27 2009 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Tue, 10 Mar 2009 18:06:27 +0100 Subject: [BioPython] can biopython query KEGG directly? Message-ID: <5aa3b3570903101006q7db19fdq4845f77e45ebc1f7@mail.gmail.com> Hi, is it possible to query the KEGG database with biopython? Actually I can do it with KEGG's WSDL APIs and the python suds library and it works very well, but I was wondering whether there is something more integrated with biopython. For example, if there is something similar to Entrez, that can automatically retrieve a sequence from NCBI and transform it to a SeqRecord object.
-- My blog on bioinformatics (now in English): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Tue Mar 10 18:08:01 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 10 Mar 2009 18:08:01 +0000 Subject: [BioPython] can biopython query KEGG directly? In-Reply-To: <5aa3b3570903101006q7db19fdq4845f77e45ebc1f7@mail.gmail.com> References: <5aa3b3570903101006q7db19fdq4845f77e45ebc1f7@mail.gmail.com> Message-ID: <320fb6e00903101108w3bb9b4bfl7ee6bc14bb2fd336@mail.gmail.com> On Tue, Mar 10, 2009 at 5:06 PM, Giovanni Marco Dall'Olio wrote: > Hi, > is it possible to query the KEGG database with biopython? I don't think there is any wrapper for the KEGG online API (yet). See: http://www.genome.jp/kegg/soap/doc/keggapi_manual.html This does sound like a worthwhile addition (especially if the SOAP stuff can be done using only core python libraries included in Python 2.4+) > .. and transform it to a SeqRecord object. We still need a Bio.KEGG gene parser, see also: http://bioperl.org/wiki/KEGG_sequence_format http://lists.open-bio.org/pipermail/biopython/2008-January/004000.html Once that is done, a KEGG wrapper in Bio.SeqIO would make sense. Peter From matzke at berkeley.edu Wed Mar 11 01:18:12 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Tue, 10 Mar 2009 18:18:12 -0700 Subject: [BioPython] GSoC project: Biogeographical and community phylogenetics for BioPython Message-ID: <49B71154.5060109@berkeley.edu> On the advice of Mauricio & Hilmar, I have posted a draft proposal for a Google Summer of Code project: Biogeographical and community phylogenetics for BioPython. http://open-bio.org/wiki/Google_Summer_Code_2009#Biogeographical_and_community_phylogenetics_for_BioPython Comments welcome on- or off-list. Cheers! PS: Also, additional suggestions for pertinent members would be appreciated. Nick -- ==================================================== Nicholas J. Matzke Ph.D. 
student, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From dalloliogm at gmail.com Thu Mar 12 12:33:04 2009 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 12 Mar 2009 13:33:04 +0100 Subject: [BioPython] can biopython query KEGG directly? In-Reply-To: <320fb6e00903101108w3bb9b4bfl7ee6bc14bb2fd336@mail.gmail.com> References: <5aa3b3570903101006q7db19fdq4845f77e45ebc1f7@mail.gmail.com> <320fb6e00903101108w3bb9b4bfl7ee6bc14bb2fd336@mail.gmail.com> Message-ID: <5aa3b3570903120533w1d6bad6fy12b70ebf769deef2@mail.gmail.com> On Tue, Mar 10, 2009 at 7:08 PM, Peter wrote: > On Tue, Mar 10, 2009 at 5:06 PM, Giovanni Marco Dall'Olio > wrote: > > Hi, > > is it possible to query the KEGG database with biopython? > > I don't think there is any wrapper for the KEGG online API (yet). 
See: > http://www.genome.jp/kegg/soap/doc/keggapi_manual.html Well, if someone is in a hurry to query KEGG with SOAP, I have some scripts (but they use the suds library). > > > This does sound like a worthwhile addition (especially if the SOAP > stuff can be done using only core python libraries included in Python > 2.4+) I am not sure whether the SOAPpy library is the one included in the core python libraries, or whether it has been since python 2.4. As far as I know, SOAPpy has not been developed since 2005 (see http://pywebsvcs.sourceforge.net/). I couldn't test this library, because I still haven't managed to get it working behind an http proxy :-(. > > > > .. and transform it to a SeqRecord object. > > We still need a Bio.KEGG gene parser, see also: > http://bioperl.org/wiki/KEGG_sequence_format > http://lists.open-bio.org/pipermail/biopython/2008-January/004000.html > Once that is done, a KEGG wrapper in Bio.SeqIO would make sense. > I am just curious, but into which object would a KEGG gene file be converted? A SeqRecord? And how, exactly? I suppose all the features will go in SeqRecord.features... but is there any standard convention to do so? For example, the codon usage table, class, dblinks, and all the other fields: how would they be stored? > > Peter > -- My blog on bioinformatics (now in English): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Thu Mar 12 14:15:06 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 12 Mar 2009 14:15:06 +0000 Subject: [BioPython] can biopython query KEGG directly?
In-Reply-To: <5aa3b3570903120533w1d6bad6fy12b70ebf769deef2@mail.gmail.com> References: <5aa3b3570903101006q7db19fdq4845f77e45ebc1f7@mail.gmail.com> <320fb6e00903101108w3bb9b4bfl7ee6bc14bb2fd336@mail.gmail.com> <5aa3b3570903120533w1d6bad6fy12b70ebf769deef2@mail.gmail.com> Message-ID: <320fb6e00903120715n7ad57282h529150e22da826e9@mail.gmail.com> On Thu, Mar 12, 2009 at 12:33 PM, Giovanni Marco Dall'Olio wrote: >> We still need a Bio.KEGG gene parser, see also: >> http://bioperl.org/wiki/KEGG_sequence_format >> http://lists.open-bio.org/pipermail/biopython/2008-January/004000.html >> Once that is done, a KEGG wrapper in Bio.SeqIO would make sense. > > I am just curious, but in which object a Kegg gene file would be transposed? > A SeqRecord? And how, exactly? I suppose all the features will go in > SeqRecord.features... but is there any standard convention to do so? > For example, the codon usage table, class, dblinks, and all the other > fields.. how they would be stored? Bio.SeqIO only deals with SeqRecord objects. If we had a KEGG gene parser in Bio.KEGG (written in the same style as the rest of Bio.KEGG ideally), then it would make sense to add a KEGG gene format to Bio.SeqIO, where the KEGG gene records would be parsed using Bio.KEGG and then converted into SeqRecord objects. At a minimum this would mean their id/name/description and sequence - even just that would still be useful I feel. For any richer annotation, the convention is to mimic the GenBank parser as closely as possible. See http://biopython.org/wiki/SeqIO_dev Peter From matzke at berkeley.edu Sat Mar 14 04:59:37 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Fri, 13 Mar 2009 21:59:37 -0700 Subject: [BioPython] Getting protein structure names from primary IDs Message-ID: <49BB39B9.2080206@berkeley.edu> Hi all, This has got to be trivial, but I can't find a hint about the solution online. I want to: 1. 
Search NCBI's structure database for structures from a certain group

from Bio import Entrez
Entrez.email = "A.N.Other at example.com" # Always tell NCBI who you are
handle = Entrez.einfo()
record = Entrez.read(handle)
print("Search the structure database on Organism = Drosophila")
#handle = Entrez.esearch(db="structure", term="Drosophila")
handle = Entrez.esearch(db="structure", term="Drosophila[Orgn]")

pdb_record = Entrez.read(handle)
print(pdb_record) #["IdList"]

pdblist = pdb_record["IdList"] # NCBI primary IDs, not structure names

OK, now I have a list of primary IDs for the protein structures from Drosophila.

2. Download those structures. Apparently I have to do this from RCSB and not NCBI? (The NCBI efetch documentation has no information on fetching from the structure database, and I tried a few obvious methods by analogy with other databases without result.)

This will download from RCSB, but apparently you need the structure name, not the NCBI primary ID.

from Bio.PDB import PDBList
pdbl = PDBList()
pdbl.retrieve_pdb_file('1FAT')

So, how do I get from primary ID to structure name? I'm sure I'm missing something obvious.

Cheers, Nick

-- ==================================================== Nicholas J. Matzke Ph.D. student, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong.
When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From matzke at berkeley.edu Sat Mar 14 05:05:47 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Fri, 13 Mar 2009 22:05:47 -0700 Subject: [BioPython] Getting protein structure names from primary IDs In-Reply-To: <49BB39B9.2080206@berkeley.edu> References: <49BB39B9.2080206@berkeley.edu> Message-ID: <49BB3B2B.9080900@berkeley.edu> Hi again -- Esummary was what I needed, so nevermind! Sorry for the trouble, Nick Nick Matzke wrote: > Hi all, > > This has got to be trivial, but I can't find a hint about the solution > online. > > I want to: > > 1. Search NCBI's structure database for structures from a certain group > > from Bio import Entrez > handle = Entrez.einfo() > record = Entrez.read(handle) > print "Search the structure database on Organism = Drosophila" > Entrez.email = "A.N.Other at example.com" # Always tell NCBI who you are > #handle = Entrez.esearch(db="structure", term="Drosophila") > handle = Entrez.esearch(db="structure", term="Drosophila[Orgn]") > > pdb_record = Entrez.read(handle) > print pdb_record #["IdList"] > > pdblist = pdb_record["IdList"] > > > > OK, now I have a list of primary IDs for the protein structures from > Drosophila. > > > > 2. Download those structures. Apparently I have to do this from RSCB > and not NCBI? (NCBI efetch has no information on efetching from the > structure database, and I tried a few obvious methods on analogy to > other databases without result) > > This will download from RSCB, but apparently you need the structure > name, not the NCBI primary ID. 
> > > from Bio.PDB import * > pdbl=PDBList() > pdbl.retrieve_pdb_file('1FAT') > > > So, how do I get from primary ID to structure name? I'm sure I'm > missing something obvious. > > Cheers, > Nick > > > > -- ==================================================== Nicholas J. Matzke Ph.D. student, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From hlapp at gmx.net Sat Mar 14 22:59:57 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 14 Mar 2009 18:59:57 -0400 Subject: [BioPython] Google Summer of Code: application submitted, action needed In-Reply-To: References: Message-ID: <71A1E85A-2007-4FAE-A03B-475000C5CD38@gmx.net> Hi all, I have submitted the application yesterday for O|B|F participating in the 2009 Google Summer of Code as a mentoring organization. 
The application is at http://docs.google.com/Doc?id=dhs98hzv_7zn8bxqjm and is also linked to from the ideas page at http://open-bio.org/wiki/Google_Summer_of_Code_2009 Now keep your fingers crossed, Google is slated to announce acceptances on March 18. This is the last cross-project message re: Summer of Code that addresses mentors and our projects; future messages that I'll post across projects will be primarily for students such as announcing whether we are accepted or not and issuing calls for application. **What we need most and right now is action from our projects' developers and from possible mentors.** Google admins will start reviewing organization applications on Monday. The ideas page has 6 project ideas right now - though the ideas are good ones, the quantity won't be particularly impressive to Google. Therefore, if you have an idea for a summer project for a student please use the C& template (it is commented out now but you'll see it when you pull the Ideas section into the editor) and put it up there ASAP. If you're not sure yet who'll mentor, put tentative names there. We don't need a full commitment from mentors until the student application period starts (March 23). Next, for all projects, the leads and/or volunteers should check the reference information for their project: http://open-bio.org/wiki/Google_Summer_of_Code_2009#Open-Bio_projects_involved I just culled these links from the various project websites - it'd be much appreciated if going forward everyone can lend a hand in this. Please review what's there and add or fix as you see fit. 
*These links must be correct and complete - otherwise potential students may not find you.* Finally, all prospective mentors, primary or secondary, committed or not, and anyone else who would like to volunteer to help out, should subscribe themselves ASAP to the mailing list for communicating GSoC- related administrivia: http://lists.open-bio.org/mailman/listinfo/gsoc I will *not* cross-post all administrative announcements or requests for information, and so you *will* miss information if you don't subscribe yourself there. (Note: students will be subscribed there only *after* acceptance). Those who are considering to mentor, primary or helping out, please also add yourselves to the Mentors section on the Ideas page (and check your link if you're already there): http://open-bio.org/wiki/Google_Summer_of_Code_2009#Mentors Cheers everyone, and fingers crossed! -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From mjldehoon at yahoo.com Sun Mar 15 10:25:43 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sun, 15 Mar 2009 03:25:43 -0700 (PDT) Subject: [BioPython] Bio.SwissProt.SProt Dictionary, index_file Message-ID: <653996.59295.qm@web62408.mail.re1.yahoo.com> Hi everybody, Does anybody use the Dictionary class or index_file function in Bio.SwissProt.SProt? As far as I can tell these functions are broken. If there are no users, I suggest we deprecate the Dictionary class and the index_file function in Bio.SwissProt.SProt. 
--Michiel From biopython at maubp.freeserve.co.uk Mon Mar 16 13:40:13 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 16 Mar 2009 13:40:13 +0000 Subject: [BioPython] List of publications citing or using Biopython Message-ID: <320fb6e00903160640j73289abbl51d9f8935184a760@mail.gmail.com> Hi all, I've been working on a listing of journal publications citing or using Biopython for the website: http://biopython.org/wiki/Publications If you've published anything that qualifies that isn't listed, this is a wiki page so you should be able to add it. If you are unsure if something is appropriate, please ask here on the mailing list. For publications from the 2008 onwards I have tried to add a short note saying which part(s) of Biopython were used - this should be easy to write for your own recent papers ;) If you try editing the page you should see how to add extra entries - for anything in PubMed this is really easy. See the discussion page for more details: http://biopython.org/wiki/Talk:Publications Peter From matzke at berkeley.edu Mon Mar 16 19:31:57 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Mon, 16 Mar 2009 12:31:57 -0700 Subject: [BioPython] Entrez.einfo error? Message-ID: <49BEA92D.7040905@berkeley.edu> Hi all, This exact code worked fine for me on Friday, I wonder if it could be a temporary problem at Entrez? A similar problem seems to occur with other Entrez queries. Running biopython 1.49 in IPython... 
============
from Bio import Entrez

Entrez.email = "matzke at berkeley.edu"

handle = Entrez.einfo(db="structure")

---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)

/bioinformatics/pyeg/ in ()

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/site-packages/Bio/Entrez/__init__.pyc in einfo(cgi, **keywds)
    195     variables = {}
    196     variables.update(keywds)
--> 197     return _open(cgi, variables)
    198
    199 def esummary(cgi=None, **keywds):

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/site-packages/Bio/Entrez/__init__.pyc in _open(cgi, params)
    320     options = urllib.urlencode(params, doseq=True)
    321     cgi += "?" + options
--> 322     handle = urllib.urlopen(cgi)
    323
    324     # Wrap the handle inside an UndoHandle.

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/urllib.pyc in urlopen(url, data, proxies)
     80         opener = _urlopener
     81     if data is None:
---> 82         return opener.open(url)
     83     else:
     84         return opener.open(url, data)

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/urllib.pyc in open(self, fullurl, data)
    188         try:
    189             if data is None:
--> 190                 return getattr(self, name)(url)
    191             else:
    192                 return getattr(self, name)(url, data)

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/urllib.pyc in open_http(self, url, data)
    323         if realhost: h.putheader('Host', realhost)
    324         for args in self.addheaders: h.putheader(*args)
--> 325         h.endheaders()
    326         if data is not None:
    327             h.send(data)

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/httplib.pyc in endheaders(self)
    858             raise CannotSendHeader()
    859
--> 860         self._send_output()
    861
    862     def request(self, method, url, body=None, headers={}):

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/httplib.pyc in _send_output(self)
    730         msg = "\r\n".join(self._buffer)
    731         del self._buffer[:]
--> 732         self.send(msg)
    733
    734     def putrequest(self, method, url, skip_host=0, skip_accept_encoding=0):

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/httplib.pyc in send(self, str)
    697         if self.sock is None:
    698             if self.auto_open:
--> 699                 self.connect()
    700             else:
    701                 raise NotConnected()

/Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/httplib.pyc in connect(self)
    665         msg = "getaddrinfo returns an empty list"
    666         for res in socket.getaddrinfo(self.host, self.port, 0,
--> 667                                       socket.SOCK_STREAM):
    668             af, socktype, proto, canonname, sa = res
    669             try:

IOError: [Errno socket error] (7, 'No address associated with nodename')
> /Library/Frameworks/Python.framework/Versions/4.1.30101/lib/python2.5/httplib.py(667)connect()
    666         for res in socket.getaddrinfo(self.host, self.port, 0,
--> 667                                       socket.SOCK_STREAM):
    668             af, socktype, proto, canonname, sa = res

ipdb> record = Entrez.read(handle)
*** NameError: name 'Entrez' is not defined
============

From matzke at berkeley.edu Mon Mar 16 19:42:22 2009
From: matzke at berkeley.edu (Nick Matzke)
Date: Mon, 16 Mar 2009 12:42:22 -0700
Subject: [BioPython] Entrez.einfo error?
In-Reply-To: <49BEA92D.7040905@berkeley.edu>
References: <49BEA92D.7040905@berkeley.edu>
Message-ID: <49BEAB9E.7070707@berkeley.edu>

Looks like PubMed is down at the moment also, so it's all an NCBI
problem.

Cheers!
Nick

Nick Matzke wrote:
> Hi all,
>
> This exact code worked fine for me on Friday, I wonder if it could be a
> temporary problem at Entrez? A similar problem seems to occur with
> other Entrez queries.
>
> Running biopython 1.49 in IPython...
>
> ============
> from Bio import Entrez
>
> Entrez.email = "matzke at berkeley.edu"
>
> handle = Entrez.einfo(db="structure")
>
> ---------------------------------------------------------------------------
> IOError                                   Traceback (most recent call last)
> ...

From biopython at maubp.freeserve.co.uk Mon Mar 16 19:52:30 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 16 Mar 2009 19:52:30 +0000
Subject: [BioPython] Entrez.einfo error?
In-Reply-To: <49BEA92D.7040905@berkeley.edu>
References: <49BEA92D.7040905@berkeley.edu>
Message-ID: <320fb6e00903161252s9f41eecx56853a0cc9a76882@mail.gmail.com>

On Mon, Mar 16, 2009 at 7:31 PM, Nick Matzke wrote:
> Hi all,
>
> This exact code worked fine for me on Friday, I wonder if it could be a
> temporary problem at Entrez? A similar problem seems to occur with other
> Entrez queries.
>
> Running biopython 1.49 in IPython...
>
> ============
> from Bio import Entrez
> Entrez.email = "matzke at berkeley.edu"
> handle = Entrez.einfo(db="structure")
> ---------------------------------------------------------------------------
> IOError                                   Traceback (most recent call last)
> ...

Yes, I think you were experiencing a temporary problem, either at the
NCBI or somewhere else on the network. It's working on my machine right
now.

In general an IOError in Bio.Entrez is a good sign of a network issue,
and for any complex task you may want to explicitly catch these
exceptions.

Peter

From mgenome at gmail.com Tue Mar 17 12:02:42 2009
From: mgenome at gmail.com (mgenome)
Date: Tue, 17 Mar 2009 21:02:42 +0900
Subject: [BioPython] How can I draw genome comparison figure to publish?
Message-ID:

I have the whole genome sequence of a phage and want to compare its ORFs
to those of other related phages. I want to draw a comparison figure of
two or more genomes.

Two genomes should be compared by their ORF similarities, calculated by
BLASTP or stretcher etc.

If there is a table like this

ORF1, start, stop, strand, ORF2, start, stop, strand, similarity,
genome1_ORF1, 1, 200, +, genome2_ORF1, 1, 300, -, 50
genome1_ORF2, 201, 400, +, genome2_ORF3, 320, 500, -, 90
....

the program or library should draw as follows:

===>     ===>    ....
 |        |
 |        |
 |        |
<===     <===    ....

The different similarities should be represented by different colors of
linker lines.

I examined several programs, but I didn't find one good enough to use
for publication.
ACT (Artemis) can draw a comparison figure, but it cannot show ORFs well.
inGeno is the program closest to what I want, but it cannot compare
multiple genomes, and I want to draw ORFs as arrows. I know that
GenomeDiagram in Python does not support comparison of ORFs at the
genomic level.

Does anybody know a program or library that draws a genome comparison
figure showing ORF comparisons? I know that it is stupid to want a
perfect program that fulfills all my requirements, but I would like to
find a program or library that fulfills at least part of them.

Thank you in advance.

Kyoung-Ho Kim, Korea.

From lpritc at scri.ac.uk Tue Mar 17 12:42:36 2009
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Tue, 17 Mar 2009 12:42:36 +0000
Subject: [BioPython] How can I draw genome comparison figure to publish?
In-Reply-To:
Message-ID:

Hi Kyoung-Ho,

On 17/03/2009 12:02, "mgenome" wrote:

> Two genomes should be compared by their ORFs similarities calculated by
> BLASTP or stretcher etc.
>
> If there is a table like this
>
> ORF1, start, stop, strand, ORF2, start, stop, strand, similarity,
> genome1_ORF1, 1, 200, +, genome2_ORF1, 1, 300, -, 50
> genome1_ORF2, 201, 400, +, genome2_ORF3, 320, 500, -, 90
> ....
>
> the programs or library should draw as follows;
> ===>     ===>    ....
>  |        |
>  |        |
>  |        |
> <===     <===    ....
> Their different similarities should be represented by different colors of
> linker lines.
>
> I examined several programs, but I didn't find the program good enough to
> use for publication.
> ACT (Artemics) can draw comparison figure but it can not show ORFs well.
> inGeno is the program close to what I want. But It cannot compare multiple
> genomes and I want to draw ORF as arrows. I know GenomeDaigrams in python do
> not support comparison of ORFs in genomic level.

GenomeDiagram does not draw the linker lines you require, I'm afraid.
The package I would use to do so is ACT, and I have published diagrams
created using ACT (figure 3 in http://dx.doi.org/10.1073/pnas.0402424101).
There is also M-GCAT (http://alggen.lsi.upc.es/recerca/align/mgcat/intro-mgcat.html), which is very similar to ACT, and perhaps so similar that it will have the same problems when generating publication-quality images to your liking. GCV (http://zamov.online.fr/projects/gct/) I've never tried. > Does anybody know a program and library to draw genome comparion figure > showing ORF comparison. I known that it is stupid to want a perfect program > to fulfill all my requirments, but I want to find program or library to > fulfill a part of my requirements. GenomeDiagram does not currently have a facility to indicate synteny in the way that you require using linker lines, so it may not be the tool you need just yet. However, it has been used to indicate the results of comparisons between ORFs on the whole-genome level, using the colours of the compared features to indicate the sequence identities of the matches (e.g. Figure 2 in http://dx.doi.org/10.1146/annurev.phyto.44.070505.143444 and http://apsjournals.apsnet.org/doi/abs/10.1094). Cheers, L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. 
If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). ______________________________________________________ From biopython at maubp.freeserve.co.uk Tue Mar 17 12:51:18 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 17 Mar 2009 12:51:18 +0000 Subject: [BioPython] How can I draw genome comparison figure to publish? In-Reply-To: References: Message-ID: <320fb6e00903170551u284b1f20v4a77fedd7bdbfbed@mail.gmail.com> On Tue, Mar 17, 2009 at 12:02 PM, mgenome wrote: > ... I examined several programs, but I didn't find the program good enough > to use for publication. > ACT (Artemics) can draw comparison figure but it can not show ORFs well. > inGeno is the program close to what I want. But It cannot compare multiple > genomes and I want to draw ORF as arrows. I know GenomeDaigrams in > python do not support comparison of ORFs in genomic level. Based on your description, I was going to suggest ACT (Artemics), but you have already considered this. GenomeDiagram has been integrated into Biopython and will be part of Biopython 1.50, and as part of this work it does now support drawing features (e.g. ORFs) as simple arrows. GenomeDiagram is very good at comparative genomics plots - but not the kind you are interested in. It wouldn't be very elegant, but you might be able to use GenomeDiagram to draw two linear genome diagrams, and then combine this and add the comparison lines on yourself with extra code using ReportLab directly. This would probably be quite a lot of work... 
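As a rough illustration only of the bookkeeping that such "extra code"
would need (this is plain Python, not GenomeDiagram or ReportLab code),
the sketch below parses the two example rows from the original question
and works out, for each ORF pair, the two midpoints a linker line would
join and a grey level derived from the similarity. The function names
and the grey-level mapping are made up for this example; the actual
drawing is left out.

```python
import csv
from io import StringIO

# Example rows copied from the table in the original question.
TABLE = """\
genome1_ORF1, 1, 200, +, genome2_ORF1, 1, 300, -, 50
genome1_ORF2, 201, 400, +, genome2_ORF3, 320, 500, -, 90
"""


def parse_links(text):
    """Return one dict per ORF pair: names, midpoints and similarity."""
    links = []
    for row in csv.reader(StringIO(text), skipinitialspace=True):
        if not row:
            continue
        name1, start1, stop1, strand1, name2, start2, stop2, strand2, sim = row
        links.append({
            "orf1": name1,
            "orf2": name2,
            # A linker line would run between these two midpoints.
            "mid1": (int(start1) + int(stop1)) / 2.0,
            "mid2": (int(start2) + int(stop2)) / 2.0,
            "similarity": float(sim),
        })
    return links


def similarity_to_grey(similarity):
    """Map 0-100 % similarity to a grey level (1.0 = white, 0.0 = black)."""
    clamped = max(0.0, min(100.0, similarity))
    return 1.0 - clamped / 100.0


links = parse_links(TABLE)
for link in links:
    print(link["orf1"], link["mid1"], "->", link["orf2"], link["mid2"],
          "grey=%.2f" % similarity_to_grey(link["similarity"]))
```

Each midpoint pair and grey level could then be handed to whatever
drawing layer ends up being used for the lines between the two tracks.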
Peter From biopython at maubp.freeserve.co.uk Tue Mar 17 16:52:23 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 17 Mar 2009 16:52:23 +0000 Subject: [BioPython] Biopython contributors and participants listings Message-ID: <320fb6e00903170952t329332aer310906da64f49cb6@mail.gmail.com> Hi all, We're starting to prepare for the release of Biopython 1.50, so its seems a good occasion to update the Biopython contributors and participants listing. I've just changed the formatting for the wiki page, and to me at least this looks much nicer now - you can look at the history and decide for yourselves: http://biopython.org/wiki/Participants I see some of you aren't on this participants wiki page and probably should be (e.g. Tiago), so could I encourage relevant people to add themselves. Likewise if you have contributed to the project and think you have been left out of the contributors file, please let us know: http://biopython.org/SRC/biopython/CONTRIB or: http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/CONTRIB?cvsroot=biopython Peter From biopython at maubp.freeserve.co.uk Tue Mar 17 17:38:58 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 17 Mar 2009 17:38:58 +0000 Subject: [BioPython] [Biopython-dev] PDB Parser error In-Reply-To: <3715adb70903171034s2124de04k7e4ee719c188902a@mail.gmail.com> References: <3715adb70903170830x61bb6e3bl4412a8cf1504d80c@mail.gmail.com> <320fb6e00903170901v6533910bl57ddd534dc05cf51@mail.gmail.com> <3715adb70903171034s2124de04k7e4ee719c188902a@mail.gmail.com> Message-ID: <320fb6e00903171038m72127569m279801556e5b9551@mail.gmail.com> On Tue, Mar 17, 2009 at 5:34 PM, Rodrigo faccioli wrote: > Peter, > > Your suspect was corrected. When I received a database value its was stored > in a Tuple data structure. The solution was converted them in string > objects. For this, I used str command. > > Now, I can proceed with my tests. > > Thanks for your help. OK, good luck. 
Peter From mitlox at op.pl Wed Mar 18 09:05:58 2009 From: mitlox at op.pl (mitlox) Date: Wed, 18 Mar 2009 19:05:58 +1000 Subject: [BioPython] protein-ligand interactions Message-ID: <49C0B976.1020005@op.pl> Hello, I have a solved structure (1E8W) with a ligand and I would like to know which residues are within 3A of the ligand. This 3A is a cut off and should be using just for the C-alpha in each residue, but it would be great if I know which C-alpha belongs to a residue. I am newbie in Biopython/Python, maybe anyone know an example how is it possible? Thank you in advance. Best regards From p.j.a.cock at googlemail.com Wed Mar 18 09:31:14 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 18 Mar 2009 09:31:14 +0000 Subject: [BioPython] protein-ligand interactions In-Reply-To: <49C0B976.1020005@op.pl> References: <49C0B976.1020005@op.pl> Message-ID: <320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com> On Wed, Mar 18, 2009 at 9:05 AM, mitlox wrote: > > Hello, > I have a solved structure (1E8W) with a ligand and I would like to > know which residues are within 3A of the ligand. This 3A is a cut > off and should be using just for the C-alpha in each residue, but > it would be great if I know which C-alpha belongs to a residue. > > I am newbie in Biopython/Python, maybe anyone know an > example how is it possible? Hi, I've got a couple of PDB examples on my personal website, and although they need a little update to use NumPy instead of Numeric, I think the page on doing protein contact maps would be very informative: http://www.warwick.ac.uk/go/peter_cock/python/protein_contact_map/ In your case, for the protein in each residue you'll want to use just the C-alpha atom (in the residue's atom dictionary under the key "CA"), but I think you should loop over all the residues in the ligand in order to find the least distance. 
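A minimal sketch of that loop, assuming the coordinates have already
been pulled out of the structure (with Bio.PDB that would be something
like residue["CA"].coord for each protein residue and atom.coord for
each ligand atom). The function name, the input layout and the toy
coordinates are all invented for illustration and are not taken from
1E8W:

```python
import math


def residues_near_ligand(ca_coords, ligand_coords, cutoff=3.0):
    """Return {residue_label: least distance} for C-alphas within cutoff.

    ca_coords     -- dict mapping a residue label to its C-alpha (x, y, z)
    ligand_coords -- list of (x, y, z) tuples, one per ligand atom
    """
    close = {}
    for label, ca in ca_coords.items():
        # Least distance from this C-alpha to any ligand atom.
        least = min(math.dist(ca, atom) for atom in ligand_coords)
        if least <= cutoff:
            close[label] = least
    return close


# Toy coordinates (hypothetical, not from a real structure).
ca_coords = {
    "ALA 1": (0.0, 0.0, 0.0),
    "GLY 2": (10.0, 0.0, 0.0),
}
ligand_coords = [(2.0, 0.0, 0.0), (2.0, 1.0, 0.0)]

hits = residues_near_ligand(ca_coords, ligand_coords)
for label, distance in sorted(hits.items()):
    print("%s  %.2f" % (label, distance))
```

Keeping the residue label as the dictionary key is what tells you which
C-alpha belongs to which residue in the result.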
Peter

From p.j.a.cock at googlemail.com Wed Mar 18 12:36:06 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 18 Mar 2009 12:36:06 +0000
Subject: [BioPython] protein-ligand interactions
In-Reply-To: <320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com>
References: <49C0B976.1020005@op.pl> <320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com>
Message-ID: <320fb6e00903180536i5129c6ccu844bc193858321a4@mail.gmail.com>

> Hi,
>
> I've got a couple of PDB examples on my personal website, and although
> they need a little update to use NumPy instead of Numeric, I think the
> page on doing protein contact maps would be very informative:
> http://www.warwick.ac.uk/go/peter_cock/python/protein_contact_map/

I've updated those pages to use NumPy instead of Numeric - all very
straightforward (apart from some issue with rpy for the graphics which
isn't relevant to Biopython):

http://www.warwick.ac.uk/go/peter_cock/python/protein_contact_map/
http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/

Peter

From dalke at dalkescientific.com Wed Mar 18 15:34:59 2009
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 18 Mar 2009 16:34:59 +0100
Subject: [BioPython] Fwd: Available: 2 Bioinformatics positions in AstraZeneca
References: <45988AB300A3B1468F7CF2F9EF207579C6A3D7@SEMLRDEMBX01.rd.astrazeneca.net>
Message-ID:

For those interested, there are a couple of temporary bioinformatics
positions at AstraZeneca/Mölndal (near Gothenburg). Reading the
announcements, which are in Swedish, I see it's more biomedical
informatics than sequence analysis (text mining, workflows, a decision
system for medical researchers).

> Ads:
> https://www.poolia.se/sok-jobb/webcv/JobAd.aspx?jobadid=19008
> http://annonsoversikt.monster.se/getjob.aspx?
> JobID=79909293&cy=se&where=L%c3%a4n%3aV%c3%a4stra+G%c3% > b6taland&lid=1398&re=95&pg=1&dv=1&AVSDM=2009-03-13+11%3a17% > 3a00&seq=11&fseo=1&isjs=1&re=1000 > https://sjobs.brassring.com/1053/ASP/TG/cim_jobdetail.asp?SID=% > 5edUuKAW_slp_rhc_DOlGOwdxDn_slp_rhc_PthlP/WlgiP85aWAkz/ > xRYSIbMXcsvZrHO0fJu5/ > PZdH3vw1QoLQAr5X3A_C_R__L_F_lA_slp_rhc_0Q7alykZpdfns2LzK3W8x8tde_slp_r > hc_tU=&jobId=275215&type=search&partnerid=20054&siteid=5036 Also, if you are doing Python in the Gothenburg area, join us for GothPy, the Gothenburg Python user's group: http://groups.google.com/ group/gothpy Andrew dalke at dalkescientific.com From n.j.loman at bham.ac.uk Wed Mar 18 17:59:09 2009 From: n.j.loman at bham.ac.uk (Nick Loman) Date: Wed, 18 Mar 2009 17:59:09 +0000 Subject: [BioPython] [Fwd: Bioinformatician wanted] Message-ID: <49C1366D.7070105@bham.ac.uk> Hi all, I hope biopython'ers will excuse me posting this job advert for a Research Fellow at University of Birmingham - the project referenced makes heavy use of Biopython. The position holder would interact with Biopython on a daily basis, and potentially be able to help the Biopython open source effort should they wish. Cheers, Nick. Please pass this advert on to anyone who might be interested and suitable. http://www.jobs.ac.uk/jobs/BO446/ Research Fellow School of Immunity and Infection *Fixed term for 33 months* We are looking for a talented bioinformatician to assist in the development, maintenance and exploitation of an internationally renowned web-based microbial genomics facility, xBASE. The post holder will build on our existing achievements with xBASE (http://xbase.ac.uk ; Chaudhuri RR, Loman NJ, Snyder LA, Bailey CM, Stekel DJ, Pallen MJ. Nucleic Acids Res. 2008 36:D543-6). The work will be carried out under the supervision of Professor Mark Pallen (Medical School) in collaboration with Dr Dov Stekel (Biosciences). 
The post holder will work within an attractive modern research
environment in the University's newly established inter-disciplinary
Centre for Systems Biology.

All candidates must have proficiency in programming within the
Unix/Linux environment, including web-linked database design,
development and management and use of languages such as Perl, PHP, C++,
Python, Ruby or JAVA. Familiarity with BioPerl, BioSQL and MySQL is
highly desirable. Applicants must possess the critical thinking skills
needed to devise and carry out research projects and should have
experience of analysing macromolecular sequence data. A PhD in a
relevant subject area is desirable and will be required for appointment
to a research fellowship. A flair for design, particularly as applied to
web-based resources, good team-working skills and an ability to work
under their own initiative will provide an advantage, as will experience
of research in molecular bacteriology, comparative genomics, molecular
evolution and/or pathogenesis.

Informal enquiries may be addressed to Professor Mark Pallen on
0121 414 7163 or m.pallen at bham.ac.uk

Starting salary £27,183 a year, in the range of £27,183 to £35,469 a
year (potential progression on performance once in post to £37,651).

The post will be offered on a fixed-term contract for a period up to two
years and nine months, starting on or shortly after May 1st 2009.

Interviews will be held in the week beginning Monday 30 March 2009.

Closing date: 23 March 2009
Reference: 39855

To download the details and submit an electronic application online
visit: www.hr.bham.ac.uk/jobs alternatively information can be obtained
from 0121 415 9000.

A University of Fairness and Diversity.
Mark Professor Mark Pallen Professor of Microbial Genomics Centre for Systems Biology Biosciences University of Birmingham, BIRMINGHAM, B15 2TT m.pallen at bham.ac.uk tel ++44(0)121 414 7163 Author: The Rough Guide to Evolution http://www.amazon.co.uk/Rough-Guide-Evolution-Science-Phenomena/dp/1858289467/ Blog http://roughguidetoevolution.blogspot.com feed://roughguidetoevolution.blogspot.com/feeds/posts/default "There is grandeur in this view of life, with its several powers, having been originally breathed into a few forms or into one; and that, whilst this planet has gone cycling on according to the fixed law of gravity, from so simple a beginning endless forms most beautiful and most wonderful have been, and are being evolved." Charles Darwin ______________________________________________________________________ This email has been scanned by the MessageLabs Email Security System. For more information please visit http://www.messagelabs.com/email ______________________________________________________________________ From hlapp at gmx.net Wed Mar 18 18:45:50 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 18 Mar 2009 14:45:50 -0400 Subject: [BioPython] OBF application for Summer of Code has been rejected Message-ID: <44D1FAFD-B5D7-418B-9FDA-6945219A5481@gmx.net> I hope to find out later why, but our Google Summer of Code application as an umbrella org has been rejected. However, NESCent has been accepted. If you can give your project idea a phylogenetics/phyloinformatics focus, go and put it up on the NESCent ideas page at http://hackathon.nescent.org/Phyloinformatics_Summer_of_Code_2009 Do so pretty much **now** - we will start broadcasting and reaching out to students tonight and tomorrow. If someone comes to the site and they don't see a Bio* project that they would have been interested in, they may not check back for updates. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From Yvan.Strahm at bccs.uib.no Wed Mar 18 18:47:58 2009 From: Yvan.Strahm at bccs.uib.no (Yvan.Strahm at bccs.uib.no) Date: Wed, 18 Mar 2009 19:47:58 +0100 Subject: [BioPython] How can I get a more explicite error Message-ID: <20090318194758.pgs14nxoowww4gck@webmail.uib.no> Hello List, I try to get a grip on Biopython and followed the chapter 6 form the tutorial (http://www.biopython.org/DIST/docs/tutorial/Tutorial.html) I run this script: from Bio.Blast import NCBIStandalone import re import sys my_blast_db = "/export/scratch/yvans/BEE/Apis_mellifera_ligustica_complete_mitochondrial_genome.fasta" my_blast_file = sys.argv[1] my_blast_exe = "/Home/lundalm/yvans/src/blast-2.2.19/bin/blastall" result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe, "blastn", my_blast_db, my_blast_file, gap_open=5, gap_extend=2, filter ='F', expectation=1000) blast_results = result_handle.read() my_results=sys.argv[1]+".xml" save_file = open(my_results, "w") save_file.write(blast_results) save_file.close() I got this error [yvans at lundalm BEE]$ python bioblast.py s_1_2_eland_extended.8000000.fta Traceback (most recent call last): File "bioblast.py", line 16, in blast_results = result_handle.read() SystemError: Objects/stringobject.c:4271: bad argument to internal function if the number of sequence blasted agianst the db is greater than 500000. The sequence are small reads from a solexa sequencing project. Is there a size limitation? And should I save(keep) only the sequence I am interested in into my_results instead of saving everything? And is there a way of running some tests before doinr the blast_result.read()? Now I try to use keep_hits=1 as a blast parameters in order to reduce the size of my_result, will see. 
Thanks for your time and help Cheers, yvan From cjfields at illinois.edu Wed Mar 18 19:08:48 2009 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 18 Mar 2009 14:08:48 -0500 Subject: [BioPython] [BioSQL-l] OBF application for Summer of Code has been rejected In-Reply-To: <44D1FAFD-B5D7-418B-9FDA-6945219A5481@gmx.net> References: <44D1FAFD-B5D7-418B-9FDA-6945219A5481@gmx.net> Message-ID: Hilmar, The idea was floated on the google SOC list that language-specific organizations that have been accepted may potentially take bioinformatics-related applications. Specifically, Jonathan Leto (from The Perl Foundation) indicated that bioinformatics-related projects using BioPerl might be able to apply through them. Not sure about others (Python Software Foundation, etc) but might be worth checking into. Any idea on who's been accepted beyond NEScent? chris On Mar 18, 2009, at 1:45 PM, Hilmar Lapp wrote: > I hope to find out later why, but our Google Summer of Code > application as an umbrella org has been rejected. > > However, NESCent has been accepted. If you can give your project > idea a phylogenetics/phyloinformatics focus, go and put it up on the > NESCent ideas page at > > http://hackathon.nescent.org/Phyloinformatics_Summer_of_Code_2009 > > Do so pretty much **now** - we will start broadcasting and reaching > out to students tonight and tomorrow. If someone comes to the site > and they don't see a Bio* project that they would have been > interested in, they may not check back for updates. 
> > -hilmar > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l From chapmanb at 50mail.com Wed Mar 18 21:20:07 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 18 Mar 2009 17:20:07 -0400 Subject: [BioPython] How can I get a more explicite error In-Reply-To: <20090318194758.pgs14nxoowww4gck@webmail.uib.no> References: <20090318194758.pgs14nxoowww4gck@webmail.uib.no> Message-ID: <20090318212007.GM57054@sobchak.mgh.harvard.edu> Hi Yvan; > I try to get a grip on Biopython and followed the chapter 6 form the > tutorial (http://www.biopython.org/DIST/docs/tutorial/Tutorial.html) > > I run this script: [...] > blast_results = result_handle.read() [...] > [yvans at lundalm BEE]$ python bioblast.py s_1_2_eland_extended.8000000.fta > Traceback (most recent call last): > File "bioblast.py", line 16, in > blast_results = result_handle.read() > SystemError: Objects/stringobject.c:4271: bad argument to internal function > > if the number of sequence blasted agianst the db is greater than 500000. > The sequence are small reads from a solexa sequencing project. The result_handle.read() line is pulling the entire large BLAST result file into memory as a string. You will run out of memory with huge files, leading to the errors you are seeing. To limit the problem, run BLAST initially at the command line, and then process the resulting XML file with the BLAST parser as described here: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc56 This iterates over 1 record at a time, avoiding the memory issue. However, you should be using a short read aligner to map these reads to the genome. 
BLAST is not the right tool for this particular application; massive BLAST report files are going to be one of many problems you will run into analyzing the data. Here are a couple of popular aligners designed for the exact problem you are tackling: Bowtie: http://bowtie-bio.sourceforge.net/index.shtml Maq: http://maq.sourceforge.net/ Hope this helps, Brad From hlapp at gmx.net Wed Mar 18 22:50:26 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 18 Mar 2009 18:50:26 -0400 Subject: [BioPython] [BioSQL-l] OBF application for Summer of Code has been rejected In-Reply-To: References: <44D1FAFD-B5D7-418B-9FDA-6945219A5481@gmx.net> Message-ID: Yes, thanks for mentioning that, was going to do so too. The Perl Foundation and the Python foundation have been accepted. I guess there isn't a Java Foundation, and if there is a Ruby one it hasn't been accepted or hasn't applied. However, Ruby on Rails has been accepted. Don't know how open they would be a Bioruby project. -hilmar On Mar 18, 2009, at 3:08 PM, Chris Fields wrote: > Hilmar, > > The idea was floated on the google SOC list that language-specific > organizations that have been accepted may potentially take > bioinformatics-related applications. Specifically, Jonathan Leto > (from The Perl Foundation) indicated that bioinformatics-related > projects using BioPerl might be able to apply through them. Not > sure about others (Python Software Foundation, etc) but might be > worth checking into. > > Any idea on who's been accepted beyond NEScent? > > chris > > On Mar 18, 2009, at 1:45 PM, Hilmar Lapp wrote: > >> I hope to find out later why, but our Google Summer of Code >> application as an umbrella org has been rejected. >> >> However, NESCent has been accepted. 
If you can give your project >> idea a phylogenetics/phyloinformatics focus, go and put it up on >> the NESCent ideas page at >> >> http://hackathon.nescent.org/Phyloinformatics_Summer_of_Code_2009 >> >> Do so pretty much **now** - we will start broadcasting and reaching >> out to students tonight and tomorrow. If someone comes to the site >> and they don't see a Bio* project that they would have been >> interested in, they may not check back for updates. >> >> -hilmar >> >> -- >> =========================================================== >> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : >> =========================================================== >> >> >> >> _______________________________________________ >> BioSQL-l mailing list >> BioSQL-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biosql-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From yvan.strahm at bccs.uib.no Thu Mar 19 08:10:17 2009 From: yvan.strahm at bccs.uib.no (Yvan Strahm) Date: Thu, 19 Mar 2009 09:10:17 +0100 Subject: [BioPython] How can I get a more explicite error In-Reply-To: <20090318212007.GM57054@sobchak.mgh.harvard.edu> References: <20090318194758.pgs14nxoowww4gck@webmail.uib.no> <20090318212007.GM57054@sobchak.mgh.harvard.edu> Message-ID: <49C1FDE9.20305@bccs.uib.no> Hello Brad, Thanks for the help, much appreciated. I will look at bowtie and Maq. In fact I am interested into reads which are not in the reference and how they differ from the reference, how many reads have 1,2,3,.... indels/mismatch. Cheers, yvan Brad Chapman wrote: > Hi Yvan; > >> I try to get a grip on Biopython and followed the chapter 6 form the >> tutorial (http://www.biopython.org/DIST/docs/tutorial/Tutorial.html) >> >> I run this script: > [...] >> blast_results = result_handle.read() > [...] 
>> [yvans at lundalm BEE]$ python bioblast.py s_1_2_eland_extended.8000000.fta >> Traceback (most recent call last): >> File "bioblast.py", line 16, in >> blast_results = result_handle.read() >> SystemError: Objects/stringobject.c:4271: bad argument to internal function >> >> if the number of sequence blasted agianst the db is greater than 500000. >> The sequence are small reads from a solexa sequencing project. > > The result_handle.read() line is pulling the entire large BLAST result > file into memory as a string. You will run out of memory with huge files, > leading to the errors you are seeing. > > To limit the problem, run BLAST initially at the command line, > and then process the resulting XML file with the BLAST parser > as described here: > > http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc56 > > This iterates over 1 record at a time, avoiding the memory issue. > > However, you should be using a short read aligner to map these reads > to the genome. BLAST is not the right tool for this particular > application; massive BLAST report files are going to be one of many > problems you will run into analyzing the data. 
Here are a couple of > popular aligners designed for the exact problem you are tackling: > > Bowtie: http://bowtie-bio.sourceforge.net/index.shtml > Maq: http://maq.sourceforge.net/ > > Hope this helps, > Brad > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Thu Mar 19 10:47:05 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 19 Mar 2009 10:47:05 +0000 Subject: [BioPython] [Fwd: Bioinformatician wanted] In-Reply-To: <49C1366D.7070105@bham.ac.uk> References: <49C1366D.7070105@bham.ac.uk> Message-ID: <320fb6e00903190347v3a70b6b0w46033c5769b38aa5@mail.gmail.com> On Wed, Mar 18, 2009 at 5:59 PM, Nick Loman wrote: > Hi all, > > I hope biopython'ers will excuse me posting this job advert for a Research > Fellow at University of Birmingham - the project referenced makes heavy use > of Biopython. The position holder would interact with Biopython on a daily > basis, and potentially be able to help the Biopython open source effort > should they wish. > > Cheers, > > Nick. I have no objections to posting targeted and directly relevant academic jobs adverts here - in fact I rather like it. I would point out the job advert text itself doesn't actually mention Biopython - perhaps you can get HR to amend the copy linked to from the University job page updated to mention experience of Biopython, BioPerl or BioSQL being desirable? Peter P.S. Could you add links to Biopython, BioPerl and BioSQL to the xBase website, maybe on the about page? http://xbase.bham.ac.uk/about.pl P.P.S. Did you have a chance to try out the patch on Bug 2738 for speeding up loading GenBank files into BioSQL? http://bugzilla.open-bio.org/show_bug.cgi?id=2738 Cheers! 
From biopython at maubp.freeserve.co.uk Thu Mar 19 10:52:30 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 19 Mar 2009 10:52:30 +0000
Subject: [BioPython] How can I get a more explicite error
In-Reply-To: <20090318212007.GM57054@sobchak.mgh.harvard.edu>
References: <20090318194758.pgs14nxoowww4gck@webmail.uib.no> <20090318212007.GM57054@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00903190352rcbca60bi4d703dbf65bcd3b0@mail.gmail.com>

On Wed, Mar 18, 2009 at 9:20 PM, Brad Chapman wrote:
> The result_handle.read() line is pulling the entire large BLAST result
> file into memory as a string. You will run out of memory with huge files,
> leading to the errors you are seeing.

I think Brad is probably right about the memory issue - it is certainly something to be careful of. Instead of this:

blast_results = result_handle.read()
my_results=sys.argv[1]+".xml"
save_file = open(my_results, "w")
save_file.write(blast_results)
save_file.close()

You could try keeping only one line in memory:

my_results=sys.argv[1]+".xml"
save_file = open(my_results, "w")
for line in result_handle :
    save_file.write(line)
save_file.close()

Or, we should get round to fixing Bug 2654 which would let you tell the BLAST tool to save the file itself, which would be much more elegant.
Do you want to add yourself as a CC to this bug, so you'll automatically be informed of any updates: http://bugzilla.open-bio.org/show_bug.cgi?id=2654 Peter From mitlox at op.pl Thu Mar 19 12:55:06 2009 From: mitlox at op.pl (mitlox) Date: Thu, 19 Mar 2009 22:55:06 +1000 Subject: [BioPython] protein-ligand interactions In-Reply-To: <320fb6e00903180536i5129c6ccu844bc193858321a4@mail.gmail.com> References: <49C0B976.1020005@op.pl> <320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com> <320fb6e00903180536i5129c6ccu844bc193858321a4@mail.gmail.com> Message-ID: <49C240AA.908@op.pl> I wrote this code: ------------------------------------------------------------------------------------------------ import Bio.PDB import numpy pdb_code = "1E8W" pdb_filename = "1E8W.pdb" #not the full cage! structure = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename) backBoneAtomNames = "N","CA","C","0", "CB" tempBackbone = [0,0,0,0,0] Backbone = [] backboneNo = 0 for atom in structure.get_atoms(): if (atom.get_name() == backBoneAtomNames[backboneNo]) and (backboneNo < len(backBoneAtomNames)): tempBackbone[backboneNo] = atom backboneNo+=1 elif atom.get_name() != backBoneAtomNames[backboneNo]: backboneNo = 0 elif len(backBoneAtomNames) == backboneNo: Backbone.extend(tempBackbone) for a in tempBackbone: print a ------------------------------------------------------------------------------------------------ to identified the backbone, but unfortunately it does not work. Maybe exist already to identified backbone in Biopython? 
Thank you in advance Peter Cock wrote: >> Hi, >> >> I've got a couple of PDB examples on my personal website, and although >> they need a little update to use NumPy instead of Numeric, I think the >> page on doing protein contact maps would be very informative: >> http://www.warwick.ac.uk/go/peter_cock/python/protein_contact_map/ >> > > I've updated those pages to use NumPy instead of Numeric - all very > straight forward (apart from some issue with rpy for the graphics which > isn't relevant to Biopython): > > http://www.warwick.ac.uk/go/peter_cock/python/protein_contact_map/ > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > > Peter > > From p.j.a.cock at googlemail.com Thu Mar 19 13:31:30 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 19 Mar 2009 13:31:30 +0000 Subject: [BioPython] protein-ligand interactions In-Reply-To: <49C240AA.908@op.pl> References: <49C0B976.1020005@op.pl> <320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com> <320fb6e00903180536i5129c6ccu844bc193858321a4@mail.gmail.com> <49C240AA.908@op.pl> Message-ID: <320fb6e00903190631u756d9452qa6ca213ca3d8a9e0@mail.gmail.com> On Thu, Mar 19, 2009 at 12:55 PM, mitlox wrote: > I wrote this code: > ------------------------------------------------------------------------------------------------ > import Bio.PDB > import numpy > > pdb_code = "1E8W" > pdb_filename = "1E8W.pdb" #not the full cage! That comment was about the fact that the PDB file 1XI4 only contains part of the full clathrin cage. > structure = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename) > backBoneAtomNames = "N","CA","C","0", "CB" > ... > ------------------------------------------------------------------------------------------------ > to identified the backbone, but unfortunately it does not work. > > Maybe exist already to identified backbone in Biopython? I don't understand what you were trying to do. 
Have you read the Bio.PDB documentation about the hierarchy of structures, models, chains, residues and atoms?
http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf

This is how I would solve the original question, finding the distance from the C-alpha carbon to the closest atom in the ligand:

import Bio.PDB
import numpy

pdb_code = "1E8W"
pdb_filename = "1E8W.pdb"

structure = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename)
model = structure[0]
chainA = model["A"]

def residue_dist_to_ligand(protein_residue, ligand_residue) :
    """Returns distance from the protein C-alpha to the closest ligand atom."""
    distances = []
    for atom in ligand_residue :
        diff_vector = protein_residue["CA"].coord - atom.coord
        distances.append(numpy.sqrt(numpy.sum(diff_vector * diff_vector)))
    return min(distances)

#From looking at the PDB file, ligand is last residue in chain A, named QUE
ligand_res = chainA.child_list[-1]
assert ligand_res.resname == "QUE"

for protein_res in chainA.child_list[:-1] :
    dist = residue_dist_to_ligand(protein_res, ligand_res)
    if dist < 5.0 :
        print protein_res.resname, protein_res.id[1], dist

This gives the following output:

ILE 881 3.64203
VAL 882 3.58559
ALA 885 4.62673
THR 886 4.95211
ILE 963 4.64252
ASP 964 3.08788

If you wanted to, it should be simple to change this to find the closest distance between any part of each residue and any part of the ligand, which I expect should give some distances less than 3A.
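
The "any part of the residue to any part of the ligand" variant mentioned above could be sketched roughly as follows. This is only an illustration on plain coordinate tuples, with a hypothetical helper name (closest_distance) and made-up coordinates; with Bio.PDB one would presumably build the two lists from atom.coord for each atom in the residue and in the ligand.

```python
import math

def closest_distance(residue_coords, ligand_coords):
    """Smallest distance between any residue atom and any ligand atom.

    Both arguments are sequences of (x, y, z) coordinates.
    """
    return min(
        math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
        for p in residue_coords
        for q in ligand_coords
    )

# Made-up coordinates purely for illustration (not taken from 1E8W).
residue_coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
ligand_coords = [(4.5, 4.0, 0.0), (10.0, 0.0, 0.0)]

print(closest_distance(residue_coords, ligand_coords))  # prints 5.0
```

Compared with the C-alpha version, every residue atom is considered, so the reported distance can only get smaller or stay the same.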
Peter From mitlox at op.pl Fri Mar 20 12:18:48 2009 From: mitlox at op.pl (mitlox) Date: Fri, 20 Mar 2009 22:18:48 +1000 Subject: [BioPython] protein-ligand interactions In-Reply-To: <320fb6e00903190631u756d9452qa6ca213ca3d8a9e0@mail.gmail.com> References: <49C0B976.1020005@op.pl> <320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com> <320fb6e00903180536i5129c6ccu844bc193858321a4@mail.gmail.com> <49C240AA.908@op.pl> <320fb6e00903190631u756d9452qa6ca213ca3d8a9e0@mail.gmail.com> Message-ID: <49C389A8.5090703@op.pl> Thank you very much for your code, it works and the output is exactly for what I was looking for. I try to get a structureCA object to write out the results in a PDB file (outCA.pdb) like this: ATOM 5275 CA ILE A 881 17.242 57.141 22.062 1.00 38.49 C ATOM 5283 CA VAL A 882 16.292 57.880 25.678 1.00 38.90 C .... And the second reason for a structureCA object is that I do not want use: structureCA = Bio.PDB.PDBParser().get_structure(outCA.pdb, outCA.pdb) Unfortunately I get this error with the extension: ILE 881 3.64203 VAL 882 3.58559 ALA 885 4.62673 THR 886 4.95211 ILE 963 4.64252 ASP 964 3.08788 Traceback (most recent call last): File "interaction.py", line 31, in ? 
io.save('out.pdb') File "/usr/lib/python2.4/site-packages/biopython-1.49-py2.4-linux-i686.egg/Bio/PDB/PDBIO.py", line 121, in save for model in self.structure.get_list(): AttributeError: 'list' object has no attribute 'get_list' Here is the code: import Bio.PDB import numpy pdb_code = "1E8W" pdb_filename = "1E8W.pdb" structure = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename) model = structure[0] chainA = model["A"] structureCA = [] def residue_dist_to_ligand(protein_residue, ligand_residue) : """Returns distance from the protein C-alpha to the closest ligand atom.""" distances = [] for atom in ligand_residue : diff_vector = protein_residue["CA"].coord - atom.coord distances.append(numpy.sqrt(numpy.sum(diff_vector * diff_vector))) return min(distances) #From looking at the PDB file, ligand is last residue in chain A, named QUE ligand_res = chainA.child_list[-1] assert ligand_res.resname == "QUE" for protein_res in chainA.child_list[:-1] : dist = residue_dist_to_ligand(protein_res, ligand_res) if dist < 5.0 : print protein_res.resname, protein_res.id[1], dist structureCA.append(protein_res) io=Bio.PDB.PDBIO() io.set_structure(structureCA) io.save('outCA.pdb') How can I get a structureCA object of the results? Thank you in advance. Best regards From p.j.a.cock at googlemail.com Fri Mar 20 13:36:47 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 20 Mar 2009 13:36:47 +0000 Subject: [BioPython] protein-ligand interactions In-Reply-To: <49C389A8.5090703@op.pl> References: <49C0B976.1020005@op.pl> <320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com> <320fb6e00903180536i5129c6ccu844bc193858321a4@mail.gmail.com> <49C240AA.908@op.pl> <320fb6e00903190631u756d9452qa6ca213ca3d8a9e0@mail.gmail.com> <49C389A8.5090703@op.pl> Message-ID: <320fb6e00903200636oe48bb71u4cc72bf385ac8e9b@mail.gmail.com> On Fri, Mar 20, 2009 at 12:18 PM, mitlox wrote: > Thank you very much for your code, it works and the output is exactly for > what I was looking for. 
> > I try to get a structureCA object to write out the results in a PDB file > (outCA.pdb) like this: > ATOM 5275 CA ILE A 881 17.242 57.141 22.062 1.00 38.49 > C > ATOM 5283 CA VAL A 882 16.292 57.880 25.678 1.00 38.90 > C .... > > Unfortunately I get this error with ... Here is the code: > ... > structureCA = [] > ... > io=Bio.PDB.PDBIO() > io.set_structure(structureCA) > io.save('outCA.pdb') Your structureCA object is just a python list, containing Residue objects. Instead you need to create a new object with the partial chain - which can be done by creating structure, model and chain objects manually. However, I suggest you re-read pages 5 and 6 of the Bio.PDB documentation for the recommend approach: http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf In your case, you'll want to write your own selection class using the residue distance to the ligand. I recognise this might seem rather complicated for a python novice as you have to create your own class - so here is my solution: import Bio.PDB import numpy pdb_code = "1E8W" pdb_filename = "1E8W.pdb" structure = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename) model = structure[0] chainA = model["A"] def residue_dist_to_ligand(protein_residue, ligand_residue) : """Returns distance from the protein C-alpha to the closest ligand atom.""" distances = [] for atom in ligand_residue : diff_vector = protein_residue["CA"].coord - atom.coord distances.append(numpy.sqrt(numpy.sum(diff_vector * diff_vector))) return min(distances) class NearLigandSelect(Bio.PDB.Select): def __init__(self, distance_threshold, ligand_residue) : self.threshold = distance_threshold self.ligand_res = ligand_residue def accept_residue(self, residue): if residue == self.ligand_res : return True #change this to False if you don't want the ligand else : dist = residue_dist_to_ligand(residue, self.ligand_res) return dist < self.threshold io=Bio.PDB.PDBIO() io.set_structure(structure) #From looking at the PDB
file, ligand is last residue in chain A ligand_res = chainA.child_list[-1] #Going to use a distance theshold of 4A io.save("near_ligand.pdb", NearLigandSelect(4, ligand_res)) print "Done" Peter From mitlox at op.pl Fri Mar 20 23:45:56 2009 From: mitlox at op.pl (mitlox) Date: Sat, 21 Mar 2009 09:45:56 +1000 Subject: [BioPython] protein-ligand interactions In-Reply-To: <320fb6e00903200636oe48bb71u4cc72bf385ac8e9b@mail.gmail.com> References: <49C0B976.1020005@op.pl> <320fb6e00903180231v749b29dao3d674cb1c569c2da@mail.gmail.com> <320fb6e00903180536i5129c6ccu844bc193858321a4@mail.gmail.com> <49C240AA.908@op.pl> <320fb6e00903190631u756d9452qa6ca213ca3d8a9e0@mail.gmail.com> <49C389A8.5090703@op.pl> <320fb6e00903200636oe48bb71u4cc72bf385ac8e9b@mail.gmail.com> Message-ID: <49C42AB4.7050404@op.pl> Thank you very much for your solution. Additionally It would be nice to have a structure object with the same information like in "near_ligand.pdb", that I do not need to read a new pdb file again: structureMOD = Bio.PDB.PDBParser().get_structure("near", "near_ligand.pdb"). It is possible to have both a "near_ligand.pdb" and the same structure object? Thank you in advance. Best regards Peter Cock wrote: > Your structureCA object is just a python list, containing Residue objects. > Instead you need to create a new object with the partial chain - which > can be done by creating structure, model and chain objects manually. > > However, I suggest you re-read pages 5 and 6 of the Bio.PDB > documentation for the recommend approach: > http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf > In your case, you'll want to write your own selection class using the > residue distance to the ligand. 
I recognise this might seem rather > complicated for a python novice as you have to create your own > class - so here is my solution: > > import Bio.PDB > import numpy > > pdb_code = "1E8W" > pdb_filename = "1E8W.pdb" > > structure = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename) > model = structure[0] > chainA = model["A"] > > def residue_dist_to_ligand(protein_residue, ligand_residue) : > """Returns distance from the protein C-alpha to the closest ligand atom.""" > distances = [] > for atom in ligand_residue : > diff_vector = protein_residue["CA"].coord - atom.coord > distances.append(numpy.sqrt(numpy.sum(diff_vector * diff_vector))) > return min(distances) > > class NearLigandSelect(Bio.PDB.Select): > def __init__(self, distance_threshold, ligand_residue) : > self.threshold = distance_threshold > self.ligand_res = ligand_residue > > def accept_residue(self, residue): > if residue == self.ligand_res : > return True #change this to False if you don't want the ligand > else : > dist = residue_dist_to_ligand(residue, self.ligand_res) > return dist < self.threshold > > io=Bio.PDB.PDBIO() > io.set_structure(structure) > #From looking at the PDB file, ligand is last residue in chain A > ligand_res = chainA.child_list[-1] > #Going to use a distance theshold of 4A > io.save("near_ligand.pdb", NearLigandSelect(4, ligand_res)) > print "Done" > > Peter > > From mjldehoon at yahoo.com Sat Mar 21 04:54:08 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 20 Mar 2009 21:54:08 -0700 (PDT) Subject: [BioPython] Bio.Enzyme (was: Re: [Biopython-dev] Bio.ExPASy) In-Reply-To: <76595.11423.qm@web62404.mail.re1.yahoo.com> Message-ID: <517737.76119.qm@web62403.mail.re1.yahoo.com> I've created a simplified version of the parser in Bio.Enzyme in Bio.ExPASy.Enzyme. The idea behind it is to collect all parsers related to ExPASy databases in Bio.ExPASy so that they can be found more easily by users. 
Bio.ExPASy.Enzyme works essentially the same as Bio.Enzyme, but I've done a few things a bit differently. The biggest change is probably that Bio.Enzyme stores information as attributes to a record, whereas Bio.ExPASy.Enzyme has a Record derived from a dictionary, and stores information in the dictionary (same as Bio.Medline). Does anybody have any objection if Bio.ExPASy.Enzyme becomes the "official" parser for ExPASy's Enzyme database? If not, I'll modify the documentation and tests accordingly, and start the deprecation process for Bio.Enzyme. --Michiel --- On Sun, 3/15/09, Michiel de Hoon wrote: > From: Michiel de Hoon > Subject: [Biopython-dev] Bio.ExPASy > To: biopython-dev at biopython.org > Date: Sunday, March 15, 2009, 6:24 AM > Hi everybody, > > As discussed previously, I have moved the Bio.Prosite code > to Bio.ExPASy, and I've added a ScanProsite module to > Bio.ExPASy. I guess Bio.Enzyme should also move to > Bio.ExPASy. See > > http://biopython.org/DIST/docs/tutorial/Tutorial.proposal.html > > for the documentation of Biopython as currently in CVS. > > --Michiel. > > > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From lueck at ipk-gatersleben.de Tue Mar 24 09:34:19 2009 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Tue, 24 Mar 2009 10:34:19 +0100 Subject: [BioPython] Emboss eprimer3 Message-ID: <007101c9ac63$b51ebc80$1022a8c0@ipkgatersleben.de> Hi! I have some questions about eprimer3 from Emboss which I use over Python to design primers in a batch mode: 1) I'm using the GCclamp function (value=1). Is it possible to limit the G or C's at the end to maximum of one G or C? 2) Is there a setting to get the original primer3 output? The emboss output is for hundrets of primers not very usefull and many informations are missing. 
The primer 3 file looks like this: PRIMER_SEQUENCE_ID=HF15E08r SEQUENCE=GCATGTAATAATGCCAAAGCTCACAGCTGCAGTTGAATCTTGGGACCCGCGGAGCGAGAATGTACCAATCCATGTATGGGTACACCCATGGCTGCCAACTCTAGGGCAAAGGATAGATACACTGTGCCACTCTATCCGGTACAAGCTGAGTAGTGTCCTCCAATTATGGCAAGCTCACGATTCATCAGCTTATGCTGTGCTATCTCCATGGAAGGGTGTATTTGATCCAGCAAGTTGGGAAGACTTGATAGTGCGTTATATCATTCCTAAACTGAAAATGGCACTCCAGGAGTTCCAGATTAACCCAGCAAGCCAAAAGTTTGACCAGTTTAACTGGGTTATGATCTGGGCTTCTGCTGTCCCGGTACACCATATGGTCCATATGTTGGAAGTTGATTTCTTTAGCAAGTGGCAGCTGGTTTTGTACCATTGGCTGAGCTCACCAAATCCTGATTTCAATGAGATAATGAATTGGTAT PRIMER_PRODUCT_SIZE_RANGE=500-1000 450-500 400-450 350-400 300-350 250-300 200-250 150-200 PRIMER_OPT_TM=60.0 PRIMER_MIN_TM=58.0 PRIMER_MAX_TM=65.0 PRIMER_MAX_DIFF_TM=3.0 PRIMER_DNA_CONC=420 PRIMER_NUM_RETURN=1 PRIMER_PAIR_PENALTY=0.8691 PRIMER_LEFT_PENALTY=0.708329 PRIMER_RIGHT_PENALTY=0.160746 PRIMER_LEFT_SEQUENCE=GCATGTAATAATGCCAAAGC PRIMER_RIGHT_SEQUENCE=TTGAAATCAGGATTTGGTGA PRIMER_LEFT=0,20 PRIMER_RIGHT=458,20 PRIMER_LEFT_TM=59.292 PRIMER_RIGHT_TM=60.161 PRIMER_LEFT_GC_PERCENT=40.000 PRIMER_RIGHT_GC_PERCENT=35.000 PRIMER_LEFT_SELF_ANY=7.00 PRIMER_RIGHT_SELF_ANY=8.00 PRIMER_LEFT_SELF_END=2.00 PRIMER_RIGHT_SELF_END=2.00 PRIMER_LEFT_END_STABILITY=8.5000 PRIMER_RIGHT_END_STABILITY=7.9000 PRIMER_PAIR_COMPL_ANY=5.00 PRIMER_PAIR_COMPL_END=3.00 PRIMER_PRODUCT_SIZE=459 Thanks in advance! Stefanie From biopython at maubp.freeserve.co.uk Tue Mar 24 10:00:46 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 24 Mar 2009 10:00:46 +0000 Subject: [BioPython] Emboss eprimer3 In-Reply-To: <007101c9ac63$b51ebc80$1022a8c0@ipkgatersleben.de> References: <007101c9ac63$b51ebc80$1022a8c0@ipkgatersleben.de> Message-ID: <320fb6e00903240300v171c9d71t7a3655938dd5e27f@mail.gmail.com> 2009/3/24 Stefanie L?ck : > Hi! > > I have some questions about eprimer3 from Emboss which I use over Python to design primers in a batch mode: > > 1) I'm using the GCclamp function (value=1). 
Is it possible to limit the G or C's at the end to maximum of one G or C? OK, you're using the gcclamp argument (i.e. GC clamp), which is supported by the Bio.Emboss.Applications wrapper. http://emboss.sourceforge.net/apps/release/6.0/emboss/apps/eprimer3.html I don't know if there is a primer3 argument for limiting the G or C's at the end - have you asked on the EMBOSS mailing list? > 2) Is there a setting to get the original primer3 output? The emboss output is for hundrets of primers not very usefull and many informations are missing. From reading the documentation there is a "fformat1" argument which *might* do what you want - you could try this out on the command line and see. Note that this argument is not currently supported in the Bio.Emboss.Applications wrapper, but that would be easy to add. If this argument doesn't do what you want, you'd have to ask the EMBOSS people about alternative output formats. Alternatively, you might investigate the original Whitehead version of primer3. Note that if you do succeed in changing the output format, you may need a new parser to read it. Peter From mitlox at op.pl Tue Mar 24 11:12:36 2009 From: mitlox at op.pl (mitlox) Date: Tue, 24 Mar 2009 21:12:36 +1000 Subject: [BioPython] Superimposer Message-ID: <49C8C024.60403@op.pl> Hello, I read that the Superimposer works only with the two lists of atoms which contain the same amount of atoms. So I decided to use "Combinatorial Extension (CE)". This program returns a rotation matrix and a translation vector.
After the execution of CE I took the matrix and vector and tried to use them with Superimposer:

------------------------------------------------------------------------------
import sys
import numpy
from Bio.PDB import *

pdb_fix = "../files/1z9g.pdb"
pdb_mov = "../files/1z9g90.pdb"
p=PDBParser()
s1=p.get_structure("FIXED", pdb_fix)
fixed=Selection.unfold_entities(s1, "A")

s2=p.get_structure("MOVING", pdb_mov)
moving=Selection.unfold_entities(s2, "A")

rot=numpy.identity(3).astype('f')
tran=numpy.array((1.0, 2.0, 3.0), 'f')

tran[0] = -0.99996603; tran[1] = -2.00002559; tran[2] = -2.99998285
rot[0][0] = 0.19411441; rot[0][1] = -0.85385353; rot[0][2] = 0.48296351
rot[1][0] = 0.94858827; rot[1][1] = 0.28884874; rot[1][2] = 0.12940907
rot[2][0] = -0.24999979; rot[2][1] = 0.43301335; rot[2][2] = 0.86602514

for atom in moving:
    atom.transform(rot, tran)

sup=Superimposer()
sup.set_atoms(fixed, moving)
print sup.rotran
print sup.rms
sup.apply(moving)

print "Saving aligned structure as PDB file %s" % pdb_mov
io=PDBIO()
io.set_structure(s2)
io.save(pdb_mov)

print "Done"
------------------------------------------------------------------------------

Unfortunately "print sup.rotran" returns this:

(array([[ 0.19411383,  0.94858824, -0.25000035],
       [-0.85385389,  0.28884841,  0.43301285],
       [ 0.4829631 ,  0.12940999,  0.86602523]]), array([-0.06470776,  1.91446435,  3.21412203]))

but this matrix and vector are not the same as the ones above. What am I doing wrong?

Thank you in advance.

Best regards,

From biopython at maubp.freeserve.co.uk Tue Mar 24 11:43:05 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 24 Mar 2009 11:43:05 +0000
Subject: [BioPython] Superimposer
In-Reply-To: <49C8C024.60403@op.pl>
References: <49C8C024.60403@op.pl>
Message-ID: <320fb6e00903240443o42bfe160vaf2cd0547a91578d@mail.gmail.com>

On Tue, Mar 24, 2009 at 11:12 AM, mitlox wrote:
> Hello,
> I read that the Superimposer works only with the two lists of atoms which
> contain the same amount of atoms.
>
> So I decided to use "Combinatorial Extension (CE)". This program returns a
> rotation matrix and a translation vector.
>
> After the execution of CE I took the matrix and vector and tried to use it
> with Superimposer:

Why? Once you know the transformation, why do you need to try and
recreate it with the superimposer? Are you just doing this as a check?

> ------------------------------------------------------------------------------
> import sys
> import numpy
> from Bio.PDB import *
>
>
> pdb_fix = "../files/1z9g.pdb"
> pdb_mov = "../files/1z9g90.pdb"
> p=PDBParser()
> s1=p.get_structure("FIXED", pdb_fix)
> fixed=Selection.unfold_entities(s1, "A")
>
> s2=p.get_structure("MOVING", pdb_mov)
> moving=Selection.unfold_entities(s2, "A")

You should be loading in the ORIGINAL pdb file here, as the moved one
won't exist yet, and if it did, you'd apply the transformation twice.

Note you should expect slight differences due to floating point
calculations. Your input was:

array([[ 0.19411442, -0.85385352,  0.4829635 ],
       [ 0.94858825,  0.28884873,  0.12940907],
       [-0.24999979,  0.43301335,  0.86602515]], dtype=float32)
array([-0.99996603, -2.00002551, -2.99998283], dtype=float32),

The output was:

array([[ 0.19411439,  0.94858827, -0.24999978],
       [-0.85385353,  0.28884871,  0.43301335],
       [ 0.4829635 ,  0.12940907,  0.86602514]]),
array([-0.06473777,  1.91448618,  3.21410633])

The rotation looks transposed (backwards). The translation does look
different... however, if you switch this line:
sup.set_atoms(fixed, moving)
to:
sup.set_atoms(moving, fixed)
then things agree. I suspect something is flipped in the logic of
your script regarding the frames of reference.

Also, at the end you do sup.apply(moving), but you have already
manually moved these atoms, so won't your PDB file have them moved
twice?
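
[Editorial sketch, not part of the original message.] The "transposed" rotation and the different translation quoted above are consistent with the Superimposer having recovered the inverse of the transformation that was applied by hand. The following plain-Python check illustrates this. It assumes the right-multiplying, row-vector convention (new = coord . rot + tran), which is my understanding of how Bio.PDB's Atom.transform behaves; the helper names transform and invert are hypothetical and not Biopython functions.

```python
# Rotation matrix and translation vector quoted in the thread.
ROT = [
    [0.19411441, -0.85385353, 0.48296351],
    [0.94858827, 0.28884874, 0.12940907],
    [-0.24999979, 0.43301335, 0.86602514],
]
TRAN = [-0.99996603, -2.00002559, -2.99998285]


def transform(coord, rot, tran):
    """Apply new = coord . rot + tran (assumed row-vector convention)."""
    return [
        sum(coord[k] * rot[k][j] for k in range(3)) + tran[j]
        for j in range(3)
    ]


def invert(rot, tran):
    """Return (rot_inv, tran_inv) undoing transform(coord, rot, tran).

    For a pure rotation the inverse matrix is the transpose, and the
    inverse translation is -(tran . rot_inv).
    """
    rot_inv = [[rot[j][i] for j in range(3)] for i in range(3)]
    tran_inv = [
        -sum(tran[k] * rot_inv[k][j] for k in range(3)) for j in range(3)
    ]
    return rot_inv, tran_inv


rot_inv, tran_inv = invert(ROT, TRAN)
print("inverse rotation:", rot_inv)
print("inverse translation:", tran_inv)

# Round trip: moving a point and then applying the inverse should
# return (approximately) the starting coordinates.
point = [10.0, -4.0, 2.5]
moved = transform(point, ROT, TRAN)
print("round trip:", transform(moved, rot_inv, tran_inv))
```

Under that assumption the inverse translation comes out close to the (-0.0647, 1.9145, 3.2141) vector printed by sup.rotran, so the Superimposer result describes the move from the transformed atoms back onto the fixed ones rather than the forward transformation.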
Peter From mitlox at op.pl Tue Mar 24 12:18:32 2009 From: mitlox at op.pl (mitlox) Date: Tue, 24 Mar 2009 22:18:32 +1000 Subject: [BioPython] Superimposer In-Reply-To: <320fb6e00903240443o42bfe160vaf2cd0547a91578d@mail.gmail.com> References: <49C8C024.60403@op.pl> <320fb6e00903240443o42bfe160vaf2cd0547a91578d@mail.gmail.com> Message-ID: <49C8CF98.30809@op.pl> Thank you for you email. I would like only rotate and translate a pdb file that I can see the result in a pdb viewer. Maybe I do not need the Superimposer object to rotate and translate a pdb file with known rotation matrix and translation vector? Do you know how could I rotate and translate a pdb file? Thank you in advance. Peter wrote: > On Tue, Mar 24, 2009 at 11:12 AM, mitlox wrote: > >> Hello, >> I read that the Superimposer works only with the two lists of atoms which >> contain the same amount of atoms. >> >> So I decided to use "Combinatorial Extension (CE)". This program returns a >> rotation matrix and a translation vector. >> >> After the execution of CE I took the matrix and vector and tried to use it >> with Superimposer: >> > > Why? Once you know the transformation, why do you need to try and > recreate it with the superimposer? Are you just doing this as a check? > > >> ------------------------------------------------------------------------------ >> import sys >> import numpy >> from Bio.PDB import * >> >> >> pdb_fix = "../files/1z9g.pdb" >> pdb_mov = "../files/1z9g90.pdb" >> p=PDBParser() >> s1=p.get_structure("FIXED", pdb_fix) >> fixed=Selection.unfold_entities(s1, "A") >> >> s2=p.get_structure("MOVING", pdb_mov) >> moving=Selection.unfold_entities(s2, "A") >> > > You should be loading in the ORGINAL pdb file here, as the moved one > won't exist yet, and if it did, you'd apply the transformation twice. > > Note you should expect slight differences due to floating point > calculations. 
Your input was:
>
> array([[ 0.19411442, -0.85385352,  0.4829635 ],
>        [ 0.94858825,  0.28884873,  0.12940907],
>        [-0.24999979,  0.43301335,  0.86602515]], dtype=float32)
> array([-0.99996603, -2.00002551, -2.99998283], dtype=float32),
>
> The output was:
>
> array([[ 0.19411439,  0.94858827, -0.24999978],
>        [-0.85385353,  0.28884871,  0.43301335],
>        [ 0.4829635 ,  0.12940907,  0.86602514]]),
> array([-0.06473777,  1.91448618,  3.21410633])
>
> The rotation looks transposed (backwards). The translation does look
> different... however, if you switch this line:
> sup.set_atoms(fixed, moving)
> to:
> sup.set_atoms(moving, fixed)
> then things agree. I suspect something is flipped in the logic of
> your script regarding the frames of reference.
>
> Also, at the end you do sup.apply(moving), but you have already
> manually moved these atoms, so won't your PDB file have them moved
> twice?
>
> Peter
>
>

From biopython at maubp.freeserve.co.uk Tue Mar 24 12:41:53 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 24 Mar 2009 12:41:53 +0000
Subject: [BioPython] Superimposer
In-Reply-To: <49C8CF98.30809@op.pl>
References: <49C8C024.60403@op.pl> <320fb6e00903240443o42bfe160vaf2cd0547a91578d@mail.gmail.com> <49C8CF98.30809@op.pl>
Message-ID: <320fb6e00903240541p5fa8e043wc3363b18b34af37b@mail.gmail.com>

On Tue, Mar 24, 2009 at 12:18 PM, mitlox wrote:
> Thank you for you email. I would like only rotate and translate a pdb file
> that I can see the result in a pdb viewer.

I see.

> Maybe I do not need the Superimposer object to rotate and translate a pdb
> file with known rotation matrix and translation vector?

Correct.

> Do you know how could I rotate and translate a pdb file?

You've got most of the steps already.
This is my suggestion:

import numpy
from Bio import PDB

pdb_fix = "1z9g.pdb"
pdb_mov = "1z9g_moved.pdb"
structure = PDB.PDBParser().get_structure("FIXED", pdb_fix)

# Translation vector and rotation matrix reported by CE
tran = numpy.array((-0.99996603, -2.00002559, -2.99998285))
rot = numpy.array(((+0.19411441, -0.85385353, +0.48296351),
                   (+0.94858827, +0.28884874, +0.12940907),
                   (-0.24999979, +0.43301335, +0.86602514)))

print("Applying transformation...")
for atom in structure.get_atoms():
    atom.transform(rot, tran)

print("Saving transformed structure as PDB file %s" % pdb_mov)
io = PDB.PDBIO()
io.set_structure(structure)
io.save(pdb_mov)
print("Done")

NOTE - When a transformation is given as a translation vector and a
rotation matrix, there is some ambiguity about which order to apply them
in. If the results using Bio.PDB don't match what you expect, you may
want to double check this first.

Peter

From cjfields at illinois.edu Tue Mar 24 16:51:32 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Tue, 24 Mar 2009 11:51:32 -0500
Subject: [BioPython] Emboss eprimer3
In-Reply-To: <320fb6e00903240300v171c9d71t7a3655938dd5e27f@mail.gmail.com>
References: <007101c9ac63$b51ebc80$1022a8c0@ipkgatersleben.de> <320fb6e00903240300v171c9d71t7a3655938dd5e27f@mail.gmail.com>
Message-ID: <656D2F16-80DD-4976-90FE-2BCB8802093E@illinois.edu>

On Mar 24, 2009, at 5:00 AM, Peter wrote:
> ...
> From reading the documentation there is a "fformat1" argument which
> *might* do what you want - you could try this out on the command line
> and see. Note that this argument is not currently supported in the
> Bio.Emboss.Applications wrapper, but that would be easy to add. If
> this argument doesn't do what you want, you'd have to ask the EMBOSS
> people about alternative output formats. Alternatively, you might
> investigate the original Whitehead version of primer3.
Peter, Not sure if this will be a problem for the BioPython wrapper for primer3, but the latest Primer3 version on Sourceforge (v2.0.0a) radically changes the various input parameters. I had to rewrite a bunch of code to handle those as well as older (v1) primer3 params. > Note that if you do succeed in changing the output format, you may > need a new parser to read it. > > Peter primer3 input and output is BoulderIO (which I think is an essentially obsolete format Lincoln Stein wrote up many years ago). It's very easy to parse, just simple key-value pairings. chris From nir at rosettadesigngroup.com Wed Mar 25 16:18:24 2009 From: nir at rosettadesigngroup.com (Nir London) Date: Wed, 25 Mar 2009 18:18:24 +0200 Subject: [BioPython] Rosetta Academic Training Webinar Message-ID: <88F0F36A-FC4D-4A9C-AC31-5B883C3F92CB@rosettadesigngroup.com> The Rosetta Design Group is proud to present the first webinar in the Rosetta Academic Workshop Series. For the first webinar, we have selected to focus on Protein-Protein Docking based on the answers to the interest poll. We hope this will be the first in a line of helpful and inspiring webinars to kick-off our Rosetta Academic Workshop Series. What: Protein-Protein Docking When: May 4th 2009, 0800-1000 AM EST Where: Your office! Click here for more details and registration (For non html emails: http://rosettadesigngroup.com/RDGLS/index.php?sid=54479&lang=en ) Pleas note: This is not a promotional webinar. Rosetta is open-source and freeware for academic and non-profit organizations and can be downloaded here from University of Washington's TechTransfer Digital Ventures. The majority of the webinar is concerned with Rosetta 2.3.0. Rosetta 3.0 is still a beta version. Hope to see you there, Nir London. 
Rosetta Design Group | http://rosettadesigngroup.com/ From biopython.chen at gmail.com Thu Mar 26 02:59:04 2009 From: biopython.chen at gmail.com (chen Ku) Date: Wed, 25 Mar 2009 19:59:04 -0700 Subject: [BioPython] how to retrieve data from PDB Message-ID: <4c2163890903251959u4bc7d6edjd6937de715747c33@mail.gmail.com> Dear all, I need your help in writing code to retrieve some of the pdb structures. Problem definition I just want to use some PDB file not all 50,000. > I want to apply one python code so that I can know transcription factor binding to DNA only out of all pdb data. So please guide me how to proceed for this.I raed some published article on this dataset and just want to do by python and not by manually.This is one of our course work in structural biology so trying by my own and taking some help of you all. I need a general code where I can check this kind of things by changing field name.Any help will be grateful for me as I am a beginner in python. Regards Chen From lueck at ipk-gatersleben.de Thu Mar 26 09:42:42 2009 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Thu, 26 Mar 2009 10:42:42 +0100 Subject: [BioPython] Emboss eprimer3 References: <007101c9ac63$b51ebc80$1022a8c0@ipkgatersleben.de> <320fb6e00903240300v171c9d71t7a3655938dd5e27f@mail.gmail.com> Message-ID: <005201c9adf7$357d47e0$1022a8c0@ipkgatersleben.de> Hi! I got a patch to add a '-originalformat' argument. If someone is interested too, I could send it to him or the mailing list. >>>Note that if you do succeed in changing the output format, you may need a >>>new parser to read it. This is no problem. I just need the data ;-) >>> I don't know if there is a primer3 argument for limiting the G or C's at >>> the end - have you asked on the EMBOSS mailing list? Yes, no answer yet. 
Kind regards Stefanie ----- Original Message ----- From: "Peter" To: "Stefanie L?ck" Cc: Sent: Tuesday, March 24, 2009 11:00 AM Subject: Re: [BioPython] Emboss eprimer3 2009/3/24 Stefanie L?ck : > Hi! > > I have some questions about eprimer3 from Emboss which I use over Python > to design primers in a batch mode: > > 1) I'm using the GCclamp function (value=1). Is it possible to limit the G > or C's at the end to maximum of one G or C? OK, you're using the gcclamp argument (i.e. GC clamp), which is supported by the Bio.Emboss.Applications wrapper. http://emboss.sourceforge.net/apps/release/6.0/emboss/apps/eprimer3.html I don't know if there is a primer3 argument for limiting the G or C's at the end - have you asked on the EMBOSS mailing list? > 2) Is there a setting to get the original primer3 output? The emboss > output is for hundrets of primers not very usefull and many informations > are missing. >From reading the documentation there is a "fformat1" argument which *might* do what you want - you could try this out on the command line and see. Note that this argument is not currently supported in the Bio.Emboss.Applications wrapper, but that would be easy to add. If this argument doesn't do what you want, you'd have to ask the EMBOSS people about alternative output formats. Alternatively, you might investigate the original Whitehead version of primer3. Note that if you do succeed in changing the output format, you may need a new parser to read it. 
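
[Editorial sketch, not part of the original message.] For the native primer3 output quoted earlier in this thread, the "new parser" can be quite small, since the format is one KEY=VALUE pair per line. The sketch below assumes that layout, with a line containing only "=" closing each record (the usual Boulder-IO convention). The function name parse_boulder is hypothetical and not part of Biopython; the sample values are copied from the file shown at the start of the thread.

```python
def parse_boulder(lines):
    """Yield one dict per record from primer3-style KEY=VALUE lines.

    Assumes a line holding only '=' terminates a record; a trailing
    record without a terminator is still yielded.
    """
    record = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line == "=":
            yield record
            record = {}
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("not a KEY=VALUE line: %r" % line)
        record[key] = value
    if record:
        yield record


example = """\
PRIMER_SEQUENCE_ID=HF15E08r
PRIMER_LEFT_SEQUENCE=GCATGTAATAATGCCAAAGC
PRIMER_RIGHT_SEQUENCE=TTGAAATCAGGATTTGGTGA
PRIMER_LEFT=0,20
PRIMER_RIGHT=458,20
PRIMER_PRODUCT_SIZE=459
=
"""

for rec in parse_boulder(example.splitlines()):
    # PRIMER_LEFT holds "start,length" as a comma-separated pair.
    start, length = (int(x) for x in rec["PRIMER_LEFT"].split(","))
    print(rec["PRIMER_SEQUENCE_ID"], rec["PRIMER_LEFT_SEQUENCE"], start, length)
```

All values are kept as strings; converting fields such as PRIMER_LEFT_TM to floats is left to the caller, because which keys are present depends on the primer3 version.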
Peter

From biopython at maubp.freeserve.co.uk Thu Mar 26 10:23:01 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 26 Mar 2009 10:23:01 +0000
Subject: [BioPython] Emboss eprimer3
In-Reply-To: <005201c9adf7$357d47e0$1022a8c0@ipkgatersleben.de>
References: <007101c9ac63$b51ebc80$1022a8c0@ipkgatersleben.de> <320fb6e00903240300v171c9d71t7a3655938dd5e27f@mail.gmail.com> <005201c9adf7$357d47e0$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00903260323p50f80c50w1ab07c8892518190@mail.gmail.com>

On Thu, Mar 26, 2009 at 9:42 AM, Stefanie Lück wrote:
> Hi!
>
> I got a patch to add a '-originalformat' argument. If someone is interested
> too, I could send it to him or the mailing list.

Could you file a bug on bugzilla please, and then (after the bug is
filed) you can attach the patch. I'll look at this (if Brad doesn't
first) - if you can also include a short example that would be excellent.

Thank you,

Peter

From biopython at maubp.freeserve.co.uk Thu Mar 26 11:04:29 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 26 Mar 2009 11:04:29 +0000
Subject: [BioPython] how to retrieve data from PDB
In-Reply-To: <4c2163890903251959u4bc7d6edjd6937de715747c33@mail.gmail.com>
References: <4c2163890903251959u4bc7d6edjd6937de715747c33@mail.gmail.com>
Message-ID: <320fb6e00903260404y7f5d5606jc46b4d3e87eeb9bb@mail.gmail.com>

On Thu, Mar 26, 2009 at 2:59 AM, chen Ku wrote:
> Dear all,
>                I need your help in writing code to retrieve some of the pdb
> structures.
>
> Problem definition
>  I just want to use some PDB file not all 50,000.
>
>> I want to apply one python code so that I can know transcription factor
> binding to DNA only out of all pdb data. So please guide me how to proceed
> for this.
According to the website, there are about 2250 protein structures in complex with nucleotides - and I assume some of these are for transcription factors with DNA: http://www.pdb.org/pdb/statistics/contentGrowthChart.do?content=molType-protein-nucleic-complex&seqid=100 I assume you'll want to search these PDB for entries which are transcription factors binding to DNA, but I don't know enough about the PDB search options to advise you. Peter From jblanca at btc.upv.es Thu Mar 26 11:48:02 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Thu, 26 Mar 2009 12:48:02 +0100 Subject: [BioPython] about the SeqRecord slicing Message-ID: <200903261248.02279.jblanca@btc.upv.es> Hi: I'm working with the SeqRecord slicing from cvs and I think that the behaviour could be sligthly changed. In fact that same opinion is written in the __getitem__ method: if isinstance(index, int) : #NOTE - The sequence level annotation like the id, name, etc #do not really apply to a single character. However, should #we try and expose any per-letter-annotation here? If so how? return self.seq[index] I don't like the fact that the SeqRecord returns different classes depending on the index type. I think is better to return always a SeqRecord because: - It simplifies the interface. It's easier to deal with the SeqRecord class if its behaviour is simple. Otherwise we have to check in the code that uses the SeqRecord if it's returning an str or a SeqRecord. - It looses the per-letter-annotation. I'm working with qualities and I'm interested in keeping them. - It's redundant because if we want to slice the seq property we can do it with: seqrec.seq[index] Best regards, -- Jose M. 
Blanca Postigo
Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)

From biopython at maubp.freeserve.co.uk Thu Mar 26 12:05:25 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 26 Mar 2009 12:05:25 +0000
Subject: [BioPython] about the SeqRecord slicing
In-Reply-To: <200903261248.02279.jblanca@btc.upv.es>
References: <200903261248.02279.jblanca@btc.upv.es>
Message-ID: <320fb6e00903260505j387279b7kfa4c69c33efe5487@mail.gmail.com>

On Thu, Mar 26, 2009 at 11:48 AM, Jose Blanca wrote:
> Hi:
> I'm working with the SeqRecord slicing from cvs and I think that the behaviour
> could be sligthly changed. In fact that same opinion is written in the
> __getitem__ method:
>
>        if isinstance(index, int) :
>            #NOTE - The sequence level annotation like the id, name, etc
>            #do not really apply to a single character.  However, should
>            #we try and expose any per-letter-annotation here?  If so how?
>            return self.seq[index]
>
> I don't like the fact that the SeqRecord returns different classes depending
> on the index type. I think is better to return always a SeqRecord because:
> - It simplifies the interface. It's easier to deal with the SeqRecord class if
> its behaviour is simple. Otherwise we have to check in the code that uses the
> SeqRecord if it's returning an str or a SeqRecord.
> - It looses the per-letter-annotation. I'm working with qualities and I'm
> interested in keeping them.
> - It's redundant because if we want to slice the seq property we can do it
> with: seqrec.seq[index]
> Best regards,

Hi Jose,

As we are talking about the CVS code, maybe this could have been on
the dev mailing list, but as it's of general interest let's carry on
here for now.
You note that (currently in CVS) the new SeqRecord slicing returns a
SeqRecord for a slice, but a single letter string for a single integer
index. This isn't so different from the Seq object - it returns a new
Seq object for a slice, but a single letter string for a single integer
index:

>>> from Bio.Seq import Seq
>>> s = Seq("ACGT")
>>> s
Seq('ACGT', Alphabet())
>>> s[0]
'A'
>>> s[0:3]
Seq('ACG', Alphabet())

More generally, consider lists in Python:

>>> x = [1,2,3,4,5]
>>> x[0]
1
>>> x[0:3]
[1, 2, 3]

So I don't agree with this expectation that slicing and indexing a
SeqRecord should automatically both give a SeqRecord. You really want a
SeqRecord for a single character string?

Can you give me an example of where you want to pull out a single
character from a SeqRecord, and its quality? I would consider things
like this quite elegant:

for letter, quality in zip(record.seq, record.letter_annotations["phred_quality"]):
    #do stuff

Peter

From chapmanb at 50mail.com Thu Mar 26 12:40:45 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 26 Mar 2009 08:40:45 -0400
Subject: [BioPython] Emboss eprimer3
In-Reply-To: <005201c9adf7$357d47e0$1022a8c0@ipkgatersleben.de>
References: <007101c9ac63$b51ebc80$1022a8c0@ipkgatersleben.de> <320fb6e00903240300v171c9d71t7a3655938dd5e27f@mail.gmail.com> <005201c9adf7$357d47e0$1022a8c0@ipkgatersleben.de>
Message-ID: <20090326124045.GD21577@sobchak.mgh.harvard.edu>

Hi all;

Stefanie:
> I got a patch to add a '-originalformat' argument. If someone is interested
> too, I could send it to him or the mailing list.

Is this a patch to EMBOSS itself? If so, did the developers indicate
it would be in future versions of EMBOSS? If that's the case, we can
easily add this option to the commandline interface. You need a:

_Option(["-originalformat"], ["input"], None, 0),

line in Bio.Emboss.Applications.Primer3Commandline.

> >>>Note that if you do succeed in changing the output format, you may need a
> >>>new parser to read it.
> > This is no problem. I just need the data ;-) Out of curiosity, what parameter did you find useful from that output that is not in the eprimer3 format output? > >>> I don't know if there is a primer3 argument for limiting the G or C's at > >>> the end - have you asked on the EMBOSS mailing list? > > Yes, no answer yet. What I do in cases like this is ask for more primers (-numreturn) and then post-parse them to pull out the ones that satisfy my additional criteria. The output is ordered by primer3's ranking, so the first one that passes the criteria would move on. If none are satisfactory, then you can also build in a logic to decide if any are good enough for your use (for example, 2 G/Cs at the end) and pick one from this remaining group with less stringency. Brad > > Kind regards > Stefanie > > > > ----- Original Message ----- > From: "Peter" > To: "Stefanie L?ck" > Cc: > Sent: Tuesday, March 24, 2009 11:00 AM > Subject: Re: [BioPython] Emboss eprimer3 > > > 2009/3/24 Stefanie L?ck : > > Hi! > > > > I have some questions about eprimer3 from Emboss which I use over Python > > to design primers in a batch mode: > > > > 1) I'm using the GCclamp function (value=1). Is it possible to limit the G > > or C's at the end to maximum of one G or C? > > OK, you're using the gcclamp argument (i.e. GC clamp), which is > supported by the Bio.Emboss.Applications wrapper. > http://emboss.sourceforge.net/apps/release/6.0/emboss/apps/eprimer3.html > > I don't know if there is a primer3 argument for limiting the G or C's > at the end - have you asked on the EMBOSS mailing list? > > > 2) Is there a setting to get the original primer3 output? The emboss > > output is for hundrets of primers not very usefull and many informations > > are missing. > > >From reading the documentation there is a "fformat1" argument which > *might* do what you want - you could try this out on the command line > and see. 
Note that this argument is not currently supported in the > Bio.Emboss.Applications wrapper, but that would be easy to add. If > this argument doesn't do what you want, you'd have to ask the EMBOSS > people about alternative output formats. Alternatively, you might > investigate the original Whitehead version of primer3. > > Note that if you do succeed in changing the output format, you may > need a new parser to read it. > > Peter > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Fri Mar 27 12:18:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 27 Mar 2009 12:18:04 +0000 Subject: [BioPython] how to retrieve data from PDB In-Reply-To: <4c2163890903261953k2f73613cvdc5d4bb497474f43@mail.gmail.com> References: <4c2163890903251959u4bc7d6edjd6937de715747c33@mail.gmail.com> <320fb6e00903260404y7f5d5606jc46b4d3e87eeb9bb@mail.gmail.com> <4c2163890903261953k2f73613cvdc5d4bb497474f43@mail.gmail.com> Message-ID: <320fb6e00903270518g4eb5150pc1ae6de65da1a72c@mail.gmail.com> On Fri, Mar 27, 2009 at 2:53 AM, chen Ku wrote: > Thank you so much for the guidance but I need the coding part in python to > retrieve the data. > > Any help will be helpful for me. Have a look at the Bio.PDB.PDBList module in Biopython - this may do what you want. Peter From p.j.a.cock at googlemail.com Fri Mar 27 17:31:55 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 27 Mar 2009 17:31:55 +0000 Subject: [BioPython] Biopython application note published Message-ID: <320fb6e00903271031k2bd31464k8aaa075f8de39c82@mail.gmail.com> Dear all, An Application Note describing Biopython has recently been accepted for publication in the Oxford Journal Bioinformatics. An advance copy of the Open Access article is available online: P.J.A. Cock, T. Antao, J.T. Chang, B.A. Chapman, C.J. Cox, A. Dalke, I. Friedberg, T. Hamelryck, F. 
Kauff, B. Wilczynski and M.J.L. de Hoon (2009) Biopython: freely
available Python tools for computational molecular biology and
bioinformatics. Bioinformatics, doi:10.1093/bioinformatics/btp163
http://dx.doi.org/10.1093/bioinformatics/btp163

This was announced at the start of the week on our news page (to which
you can subscribe using the RSS or Atom feeds), but was worth repeating
for the mailing lists. See
http://news.open-bio.org/news/2009/03/biopython-paper-published/

Peter

From biopython at maubp.freeserve.co.uk Tue Mar 31 10:08:08 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 31 Mar 2009 11:08:08 +0100
Subject: [BioPython] how to retrieve data from PDB
In-Reply-To: <4c2163890903310245oda7390bm829aee6f4f369478@mail.gmail.com>
References: <4c2163890903251959u4bc7d6edjd6937de715747c33@mail.gmail.com> <320fb6e00903260404y7f5d5606jc46b4d3e87eeb9bb@mail.gmail.com> <4c2163890903261953k2f73613cvdc5d4bb497474f43@mail.gmail.com> <320fb6e00903270518g4eb5150pc1ae6de65da1a72c@mail.gmail.com> <4c2163890903310245oda7390bm829aee6f4f369478@mail.gmail.com>
Message-ID: <320fb6e00903310308q38168dbfx447c78c6da5454ee@mail.gmail.com>

On Tue, Mar 31, 2009 at 10:45 AM, chen Ku wrote:
> Dear peter,
>                   thanks for the idea. I think I need to download all the pdb
> files first and then can use command on python mode. Can you please write
> one syntax to start with or give me the practical documentation so that I
> can try out and play with this PDBList.
Hi Chen, To learn about the PDBList functionality, see page 4 of "The Biopython Structural Bioinformatics FAQ" - this has some examples: http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf You can also read about PDBList from the built in help, >>> from Bio import PDB >>> help(PDB.PDBList) Or online at http://biopython.org/DIST/docs/api/Bio.PDB.PDBList%27.PDBList-class.html If you really do want to download all 56,000+ PDB files (and I don't think this is a good idea), instead of using Python, you might also consider using the command line tool rsync, see: http://www.pdb.org/pdb/general_information/news_publications/newsletters/2003q3/focus_rsync.html However, as I said before, you only want transcription factors with DNA, so at most you'll need to download the 2250 protein structures in complex with nucleotides. I strongly urge you to find out more about searching the PDB in order to get a list of just the few PDB reference codes that you'll actually need - and download just those. Peter