From akooser at unm.edu Thu Oct 7 23:06:00 2010 From: akooser at unm.edu (Ara Kooser) Date: Thu, 7 Oct 2010 21:06:00 -0600 Subject: [Biopython] Tutorial Question 7.4 alignment.title Message-ID: Hello all, I am a new user to Biopython. I've been working my way through the tutorial. I have a question about how the alignment.title works in the example given in section 7.4 of the tutorial. I wrote the following code: from Bio.Blast import NCBIXML E_VALUE_THRESH = 1e-30 result_handle = open("test.xml") blast_records = NCBIXML.parse(result_handle) blast_record = blast_records.next() for alignment in blast_record.alignments: for hsp in alignment.hsps: if hsp.expect < E_VALUE_THRESH: print '****Alignment****' print 'sequence:', alignment.title print 'e value:', hsp.expect print 'length:', alignment.length print 'start:', hsp.query_start print 'end:',hsp.query_end To look at a .xml file that was produced by BLAST. I was wondering if there was a way to break up the string for information produced by the: print 'sequence:', alignment.title Basically I would like the organisms name first, followed by the locus number. I wasn't sure how to split up the print command. I looked at the docs over at http://biopython.org/DIST/docs/api/ to see if there was a tag specifically for the locus number and organism name. Thank you for your time and help. Regards, Ara From biopython at maubp.freeserve.co.uk Fri Oct 8 05:30:58 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 8 Oct 2010 10:30:58 +0100 Subject: [Biopython] Tutorial Question 7.4 alignment.title In-Reply-To: References: Message-ID: On Fri, Oct 8, 2010 at 4:06 AM, Ara Kooser wrote: > Hello all, > > I am a new user to Biopython. I've been working my way through the > tutorial. I have a question about how the alignment.title works in the > example given in section 7.4 of the tutorial. 
I wrote the following code: > > from Bio.Blast import NCBIXML > > E_VALUE_THRESH = 1e-30 > > result_handle = open("test.xml") > blast_records = NCBIXML.parse(result_handle) > blast_record = blast_records.next() > > for alignment in blast_record.alignments: > for hsp in alignment.hsps: > if hsp.expect < E_VALUE_THRESH: > print '****Alignment****' > print 'sequence:', alignment.title > print 'e value:', hsp.expect > print 'length:', alignment.length > print 'start:', hsp.query_start > print 'end:',hsp.query_end > > To look at a .xml file that was produced by BLAST. I was wondering if there > was a way to break up the string for information produced by the: > > print 'sequence:', alignment.title > > Basically I would like the organisms name first, followed by the locus > number. I wasn't sure how to split up the print command. > > I looked at the docs over at http://biopython.org/DIST/docs/api/ to see if > there was a tag specifically for the locus number and organism name. > > Thank you for your time and help. > > Regards, > Ara Hi Ara, An example of the output you are getting and what you want would help, but I think this isn't possible in general. As I recall, the locus number and organism name information is just part of the original identifier and/or description in the FASTA file used to build the BLAST database. The NCBI tend to include the species in the description within square brackets - but this is just their convention, it is not a nicely tagged part of the BLAST output which the parser could spot. Basically I think you will have to parse the string yourself. Peter P.S. Alternatively if you want the organism name and have the GI number (or similar) this can be mapped to the organism via the NCBI taxonomy database (either online via Entrez or by parsing a downloaded copy of the mapping). 
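
A minimal sketch of "parsing the string yourself", assuming the hit title follows the usual NCBI layout of `gi|number|db|accession| description [Organism]`, with merged entries separated by " >". The helper name `split_title` is made up for illustration and is not part of Biopython. Since the square brackets are only a convention, titles that break it simply give None.

```python
import re


def split_title(title):
    """Return (organism, accession) from an NCBI-style BLAST hit title.

    Either element is None when it cannot be found.
    """
    # Identical sequences are merged into one title, separated by " >";
    # only the first entry is examined here.
    first = title.split(" >", 1)[0]

    # "gi|302529614|ref|ZP_07281956.1| ..." -> fourth "|"-separated field.
    fields = first.split("|")
    accession = fields[3] if len(fields) > 3 else None

    # By convention the organism is the last [bracketed] part of the entry.
    brackets = re.findall(r"\[([^\[\]]+)\]", first)
    organism = brackets[-1] if brackets else None
    return organism, accession


title = ("gi|302529614|ref|ZP_07281956.1| predicted protein "
         "[Streptomyces sp. AA4] >gi|302438509|gb|EFL10325.1| "
         "predicted protein [Streptomyces sp. AA4]")
organism, accession = split_title(title)
print("organism: %s" % organism)
print("locus: %s" % accession)
```

Inside the tutorial loop this would be called as `split_title(alignment.title)`; joining the two values with the HSP fields using tabs gives one spreadsheet-ready row per hit.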
From bratdaking at gmail.com Fri Oct 8 08:00:53 2010
From: bratdaking at gmail.com (Bart)
Date: Fri, 8 Oct 2010 14:00:53 +0200
Subject: [Biopython] NCBIWWW and megablast
Message-ID:

Hey,

I was wondering why the megablast option (the greedy extension) in the
qblast is left out in the NCBIWWW.py?
I want to map a sequence to the human genome, and to mimic the NCBI website
I need a gapcost setting of "0 0", with the megablast option set to True.
The fix was to add the following line
      ('LCASE_MASK',lcase_mask),
      ('MEGABLAST',megablast),
      ('MATRIX_NAME',matrix_name),
to the parameters list of the qblast def and add:
megablast=None,
to the arguments.
But is there a reason this setting has been left out (it is as far as I can
see the only setting from the NCBI api missing)?

Cheers,
Bart

From biopython at maubp.freeserve.co.uk Fri Oct 8 08:33:54 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 8 Oct 2010 13:33:54 +0100
Subject: [Biopython] NCBIWWW and megablast
In-Reply-To:
References:
Message-ID:

On Fri, Oct 8, 2010 at 1:00 PM, Bart wrote:
> Hey,
>
> I was wondering why the megablast option (the greedy extension) in the
> qblast is left out in the NCBIWWW.py?
> I want to map a sequence to the human genome, and to mimic the NCBI website
> I need a gapcost setting of "0 0", with the megablast option set to True.
> The fix was to add the following line
>       ('LCASE_MASK',lcase_mask),
>       ('MEGABLAST',megablast),
>       ('MATRIX_NAME',matrix_name),
> to the parameters list of the qblast def and add:
> megablast=None,
> to the arguments.
> But is there a reason this setting has been left out (it is as far as I can
> see the only setting from the NCBI api missing)?
>
> Cheers,
> Bart

Hi Bart,

Most likely this is a relatively recent addition to the NCBI API.

Could you turn that into a patch we could apply? Don't forget to
add the new option to the qblast function's docstring.
Thanks, Peter From akooser at unm.edu Fri Oct 8 11:45:31 2010 From: akooser at unm.edu (Ara Kooser) Date: Fri, 8 Oct 2010 09:45:31 -0600 Subject: [Biopython] Tutorial Question 7.4 alignment.title In-Reply-To: References:

Message-ID: Peter, Thanks for your reply. I started to fiddle around with parsing the string last night but haven't made much progress. At the moment the output looks like this: ****Alignment**** sequence: gi|302529614|ref|ZP_07281956.1| predicted protein [Streptomyces sp. AA4] >gi|302438509|gb|EFL10325.1| predicted protein [Streptomyces sp. AA4] e value: 1.89229e-46 length: 1109 start: 7 end: 414 So what I want from the sequence string is the following: [Streptomyces sp. AA4] ZP_07281956.1 printed out as separated lines like the rest of the output. After that is figured out I want to put all the information in columns so it can be read into a spreadsheet in OO so that it looks like this: Name Locus # E_value Length Start End Regards, Ara On Oct 8, 2010, at 3:30 AM, Peter wrote: > On Fri, Oct 8, 2010 at 4:06 AM, Ara Kooser wrote: >> Hello all, >> >> I am a new user to Biopython. I've been working my way through the >> tutorial. I have a question about how the alignment.title works in >> the >> example given in section 7.4 of the tutorial. I wrote the following >> code: >> >> from Bio.Blast import NCBIXML >> >> E_VALUE_THRESH = 1e-30 >> >> result_handle = open("test.xml") >> blast_records = NCBIXML.parse(result_handle) >> blast_record = blast_records.next() >> >> for alignment in blast_record.alignments: >> for hsp in alignment.hsps: >> if hsp.expect < E_VALUE_THRESH: >> print '****Alignment****' >> print 'sequence:', alignment.title >> print 'e value:', hsp.expect >> print 'length:', alignment.length >> print 'start:', hsp.query_start >> print 'end:',hsp.query_end >> >> To look at a .xml file that was produced by BLAST. I was wondering >> if there >> was a way to break up the string for information produced by the: >> >> print 'sequence:', alignment.title >> >> Basically I would like the organisms name first, followed by the >> locus >> number. I wasn't sure how to split up the print command. 
>> >> I looked at the docs over at http://biopython.org/DIST/docs/api/ to >> see if >> there was a tag specifically for the locus number and organism name. >> >> Thank you for your time and help. >> >> Regards, >> Ara > > Hi Ara, > > An example of the output you are getting and what you want > would help, but I think this isn't possible in general. > > As I recall, the locus number and organism name information is > just part of the original identifier and/or description in the FASTA > file used to build the BLAST database. The NCBI tend to include > the species in the description within square brackets - but this is > just their convention, it is not a nicely tagged part of the BLAST > output which the parser could spot. > > Basically I think you will have to parse the string yourself. > > Peter > > P.S. Alternatively if you want the organism name and have the > GI number (or similar) this can be mapped to the organism via > the NCBI taxonomy database (either online via Entrez or > by parsing a downloaded copy of the mapping). From biopython at maubp.freeserve.co.uk Fri Oct 8 11:56:26 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 8 Oct 2010 16:56:26 +0100 Subject: [Biopython] Tutorial Question 7.4 alignment.title In-Reply-To: References:

Message-ID:

On Fri, Oct 8, 2010 at 4:45 PM, Ara Kooser wrote:
> Peter,
>
> Thanks for your reply. I started to fiddle around with parsing the string
> last night but haven't made much progress.
>
> At the moment the output looks like this:
>
> ****Alignment****
> sequence: gi|302529614|ref|ZP_07281956.1| predicted protein [Streptomyces
> sp. AA4] >gi|302438509|gb|EFL10325.1| predicted protein [Streptomyces sp.
> AA4]
> e value: 1.89229e-46
> length: 1109
> start: 7
> end: 414
>
> So what I want from the sequence string is the following:
> [Streptomyces sp. AA4]
> ZP_07281956.1
>
> printed out as separated lines like the rest of the output.

You could do this with regular expressions (import re), or some simple
Python searching for the square brackets etc.

> After that is figured out I want to put all the information in columns so it
> can be read into a spreadsheet in OO so that it looks like this:
> Name    Locus # E_value Length  Start   End

It would be much simpler to ask BLAST to give you tabular output.
If you are using BLAST+ you can even specify which columns you
want (although this won't pull out the organism name for you).

Peter

From akooser at unm.edu Fri Oct 8 12:01:58 2010
From: akooser at unm.edu (Ara Kooser)
Date: Fri, 8 Oct 2010 10:01:58 -0600
Subject: [Biopython] Tutorial Question 7.4 alignment.title
In-Reply-To:
References:

Message-ID: <0F95C585-007A-43EE-95F0-B28941FA6301@unm.edu>

Peter,

Thank you for those suggestions. I hadn't thought of using BLAST+. I
will check that out this weekend.

Regards,
Ara

On Oct 8, 2010, at 9:56 AM, Peter wrote:

> On Fri, Oct 8, 2010 at 4:45 PM, Ara Kooser wrote:
>> Peter,
>>
>> Thanks for your reply. I started to fiddle around with parsing the
>> string
>> last night but haven't made much progress.
>>
>> At the moment the output looks like this:
>>
>> ****Alignment****
>> sequence: gi|302529614|ref|ZP_07281956.1| predicted protein
>> [Streptomyces
>> sp. AA4] >gi|302438509|gb|EFL10325.1| predicted protein
>> [Streptomyces sp.
>> AA4]
>> e value: 1.89229e-46
>> length: 1109
>> start: 7
>> end: 414
>>
>> So what I want from the sequence string is the following:
>> [Streptomyces sp. AA4]
>> ZP_07281956.1
>>
>> printed out as separated lines like the rest of the output.
>
> You could do this with regular expressions (import re), or some simple
> Python searching for the square brackets etc.
>
>> After that is figured out I want to put all the information in
>> columns so it
>> can be read into a spreadsheet in OO so that it looks like this:
>> Name Locus # E_value Length Start End
>
> It would be much simpler to ask BLAST to give you tabular output.
> If you are using BLAST+ you can even specify which columns you
> want (although this won't pull out the organism name for you).
>
> Peter

From mike.thon at gmail.com Sun Oct 10 02:09:42 2010
From: mike.thon at gmail.com (Michael Thon)
Date: Sun, 10 Oct 2010 08:09:42 +0200
Subject: [Biopython] parsing newick trees in memory
Message-ID: <23DD8F96-6944-4CD1-84D9-ABF0DE5DC3C3@gmail.com>

I have a String containing a tree in newick format and I want to turn it into biopython objects. The Bio.Phylo.read() function seems to only take file names or file handles as parameters. Is there any way to do this without actually saving the string to a file first?
Thanks
Mike

From stran104 at chapman.edu Sun Oct 10 03:37:55 2010
From: stran104 at chapman.edu (Matthew Strand)
Date: Sun, 10 Oct 2010 00:37:55 -0700
Subject: [Biopython] parsing newick trees in memory
In-Reply-To:
References: <23DD8F96-6944-4CD1-84D9-ABF0DE5DC3C3@gmail.com>
Message-ID:

You might find useful:
Newick: A python module for parsing trees in the Newick file format.
http://www.daimi.au.dk/~mailund/newick.html

Cheers,
Matt Strand

On Sat, Oct 9, 2010 at 11:09 PM, Michael Thon wrote:

> I have a String containing a tree in newick format and I want to turn it
> into biopython objects. The Bio.Phylo.read() function seems to only take
> file names or file handles as parameters. Is there any way to do this
> without actually saving the string to a file first?
> Thanks
> Mike
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From fkauff at biologie.uni-kl.de Sun Oct 10 06:15:45 2010
From: fkauff at biologie.uni-kl.de (Frank Kauff)
Date: Sun, 10 Oct 2010 12:15:45 +0200
Subject: [Biopython] parsing newick trees in memory
In-Reply-To: <23DD8F96-6944-4CD1-84D9-ABF0DE5DC3C3@gmail.com>
References: <23DD8F96-6944-4CD1-84D9-ABF0DE5DC3C3@gmail.com>
Message-ID:

If you don't want to use StringIO, then Nexus.Trees should be able to handle this:

>>> from Bio.Nexus import Trees
>>> tree='((a,b),c)'
>>> tobj=Trees.Tree(tree)
>>> tobj
>>> dir(tobj)
['_Tree__values_are_support', '__doc__', '__init__', '__module__',
'__str__', '_add_nodedata', '_add_subtree', '_get_id', '_get_values',
'_parse', '_walk', 'add', 'all_ids', 'branchlength2support', 'chain',
'collapse', 'collapse_genera', 'common_ancestor',
'convert_absolute_support', 'count_terminals', 'dataclass', 'display',
'distance', 'get_taxa', 'get_terminals', 'has_support', 'id',
'is_bifurcating', 'is_compatible', 'is_identical', 'is_internal',
'is_monophyletic', 'is_parent_of',
'is_preterminal', 'is_terminal', 'kill', 'link', 'max_support', 'merge_with_support', 'name', 'node', 'prune', 'randomize', 'root', 'root_with_outgroup', 'rooted', 'search_taxon', 'set_subtree', 'split', 'sum_branchlength', 'to_string', 'trace', 'unlink', 'unroot', 'weight'] >>> On Sun, 10 Oct 2010 08:09:42 +0200 Michael Thon wrote: > I have a String containing a tree in newick format and I >want to turn it into biopython objects. The >Bio.Phylo.read() function seems to only take file names >or file handles as parameters. Is there any way to do >this without actually saving the string to a file first? > Thanks > Mike > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From mike.thon at gmail.com Sun Oct 10 13:22:43 2010 From: mike.thon at gmail.com (Michael Thon) Date: Sun, 10 Oct 2010 19:22:43 +0200 Subject: [Biopython] parsing newick trees in memory In-Reply-To: References: <23DD8F96-6944-4CD1-84D9-ABF0DE5DC3C3@gmail.com> Message-ID: <4FDB38FA-DC71-4E96-8C50-B98130D62589@gmail.com> On Oct 10, 2010, at 12:15 PM, Frank Kauff wrote: > If you don't want to use StringIO, then Nexus.Trees should be able to handle this: I could not get StringIO to work in this case... that is, until I learned that I have to ensure that I can read from the beginning of the buffer: out_h = StringIO.StringIO() out_h.write(tree_text) out_h.seek(0) tree = Phylo.read(out_h, 'newick') print tree From biopython at maubp.freeserve.co.uk Sun Oct 10 16:10:30 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 10 Oct 2010 21:10:30 +0100 Subject: [Biopython] parsing newick trees in memory In-Reply-To: <4FDB38FA-DC71-4E96-8C50-B98130D62589@gmail.com> References: <23DD8F96-6944-4CD1-84D9-ABF0DE5DC3C3@gmail.com> <4FDB38FA-DC71-4E96-8C50-B98130D62589@gmail.com> Message-ID: On Sun, Oct 10, 2010 at 6:22 PM, Michael Thon wrote: > > I could not get StringIO to work in this case... 
that
> is, until I learned that I have to ensure that I can
> read from the beginning of the buffer:
>
>    out_h = StringIO.StringIO()
>    out_h.write(tree_text)
>    out_h.seek(0)
>    tree = Phylo.read(out_h, 'newick')
>    print tree
>

This way is shorter ;)

from StringIO import StringIO
from Bio import Phylo
tree = Phylo.read(StringIO(tree_text), 'newick')
print tree

Eric - we should probably have an example of using
StringIO in the Phylo chapter as we do in the SeqIO
chapter.

Peter

From eric.talevich at gmail.com Sun Oct 10 17:50:21 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sun, 10 Oct 2010 17:50:21 -0400
Subject: [Biopython] parsing newick trees in memory
In-Reply-To:
References: <23DD8F96-6944-4CD1-84D9-ABF0DE5DC3C3@gmail.com> <4FDB38FA-DC71-4E96-8C50-B98130D62589@gmail.com>
Message-ID:

On Sun, Oct 10, 2010 at 4:10 PM, Peter wrote:
>
> This way is shorter ;)
>
> from StringIO import StringIO
> from Bio import Phylo
> tree = Phylo.read(StringIO(tree_text), 'newick')
> print tree
>
> Eric - we should probably have an example of using
> StringIO in the Phylo chapter as we do in the SeqIO
> chapter.
>
> Peter
>

Sure. I added an example to the wiki page just now:
http://biopython.org/wiki/Phylo#read.28.29

-Eric

From mike.thon at gmail.com Mon Oct 11 04:29:14 2010
From: mike.thon at gmail.com (Michael Thon)
Date: Mon, 11 Oct 2010 10:29:14 +0200
Subject: [Biopython] saving tree data in phyloXML format
Message-ID: <7CA03745-3A1C-453F-8F7F-2E9372D4CBF3@gmail.com>

I am now reading my trees and creating tree objects. For each clade in the tree I am adding a node_id attribute and one Property object. When I print the tree using:

print tree

I can see the new information that I've added to the tree. When I try to save the tree in phyloXML format, the node_id attribute and the Property object are not serialized.
I'm saving the tree like this: PhyloXMLIO.write(tree, 'mytree.xml') Basically, what I'm trying to do is decorate the branches in the trees with some additional data (a node_id, branch labels, and a url) , and then render them in a web page, possibly using jsPhyloSVG (http://www.jsphylosvg.com). the examples on that website show tags containing and tags. I don't see an 'annotation' property in Bio.Phylo.PhyloXML.Clade but I'm hoping that there is some other property that maps to when I save the tree in phyloXML format. TIA Mike From biopython at maubp.freeserve.co.uk Mon Oct 11 05:03:37 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 11 Oct 2010 10:03:37 +0100 Subject: [Biopython] NCBIWWW and megablast In-Reply-To: References:

Message-ID:

On Fri, Oct 8, 2010 at 2:34 PM, Bart wrote:
> Hi Peter,
>
> hereby the patch.

Thanks Bart.

I'll make some minor changes - in particular if you add a new optional
argument to an existing function, it should be at the end. This is for
backwards compatibility in case anyone was supplying their arguments
by order. e.g. If you had this:

def f(a, b=None, c=True):
    ...

answer = f("test", None, False)

If you added a fourth argument with 'def f(a, b=None, c=True, d=10)'
then the above example is unaffected. However, if you insert the new
argument earlier, e.g. 'def f(a, b=None, d=10, c=True)' the example
breaks.

> I found another argument missing: PSSM. It is to add a PSI
> BLAST checkpoint. No clue how this should be done precisely,
> but I added the lines anyway.

Curious - last time we looked at this, the NCBI didn't seem to
support PSI-BLAST via the online API,
http://bugzilla.open-bio.org/show_bug.cgi?id=2496

> The third argument missing is the RESULTS_FILE parameter. I have
> left that one out as I assume that that one needs somewhat more
> alterations to be able to also download the file.
>
> Cheers,
> Bart

I think we should just put the PSSM and RESULTS_FILE parameters
into the code as comments (until we know how and if we can use them).

Peter

From eric.talevich at gmail.com Mon Oct 11 09:39:20 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 11 Oct 2010 09:39:20 -0400
Subject: [Biopython] saving tree data in phyloXML format
In-Reply-To: <7CA03745-3A1C-453F-8F7F-2E9372D4CBF3@gmail.com>
References: <7CA03745-3A1C-453F-8F7F-2E9372D4CBF3@gmail.com>
Message-ID:

Hi Mike,

Thanks for reporting this. I'll take a closer look at it tonight. A couple
things come to mind:

On Mon, Oct 11, 2010 at 4:29 AM, Michael Thon wrote:

> I am now reading my trees and creating tree objects. For each clade in the
> tree I am adding a node_id attribute and one Property object.
When I print > the tree using: > > print tree > > I can see the new information that I've added to the tree. When I try to > save the tree in phyloXML format, the node_id attribute and the Property > object are not serialized. I'm saving the tree like this: > > PhyloXMLIO.write(tree, 'mytree.xml') > The serializer expects node_id and url to be instances of the PhyloXML.Id and PhyloXML.Uri classes, respectively. Were you assigning plain strings to these attributes? http://www.biopython.org/DIST/docs/api/Bio.Phylo.PhyloXML.Id-class.html http://www.biopython.org/DIST/docs/api/Bio.Phylo.PhyloXML.Uri-class.html Example: from Bio.Phylo import PhyloXML # Get a clade... myclade = mytree.find_any(...) myclade.node_id = PhyloXML.Id("foo") myclade.uri = PhyloXML.Uri("http://foo-db.org") Phylo.write(mytree, "mytree.xml", "phyloxml") (*untested*) > > Basically, what I'm trying to do is decorate the branches in the trees with > some additional data (a node_id, branch labels, and a url) , and then render > them in a web page, possibly using jsPhyloSVG (http://www.jsphylosvg.com). > the examples on that website show tags containing and > tags. I don't see an 'annotation' property in Bio.Phylo.PhyloXML.Clade > but I'm hoping that there is some other property that maps to > when I save the tree in phyloXML format. According to phyloxml.org, the "annotation" element goes under Sequence, not Clade. http://www.phyloxml.org/documentation/version_1.10/phyloxml.xsd.html#h917087604 I wonder if jsPhyloSVG is relying on each clade having an "annotation" element for its display. If so, that could get tricky. 
You might be able to fool jsPhyloSVG by using the "Other" class in Bio.Phylo.PhyloXML: http://www.biopython.org/DIST/docs/api/Bio.Phylo.PhyloXML.Other-class.html Something like: myclade.other = [PhyloXML.Other("annotation", namespace="", children=[ PhyloXML.Other("desc", value="Base of many coffees"), PhyloXML.Other("uri", value="http://en.wikipedia.org/wiki/Espresso")])] (*untested*) Note that the Clade.other attribute is expected to be a list. This is low-level stuff meant for adding elements outside the phyloXML spec, so if it seems a little ugly... yes, it is. Hope that helps, -Eric From mike.thon at gmail.com Mon Oct 11 11:13:08 2010 From: mike.thon at gmail.com (Michael Thon) Date: Mon, 11 Oct 2010 17:13:08 +0200 Subject: [Biopython] saving tree data in phyloXML format In-Reply-To: References: <7CA03745-3A1C-453F-8F7F-2E9372D4CBF3@gmail.com> Message-ID: <02912A9C-E272-4948-96D5-042D3F1D72B7@gmail.com> > > Example: > > from Bio.Phylo import PhyloXML > # Get a clade... > myclade = mytree.find_any(...) > myclade.node_id = PhyloXML.Id("foo") > myclade.uri = PhyloXML.Uri("http://foo-db.org") > Phylo.write(mytree, "mytree.xml", "phyloxml") > > (*untested*) > I was assigning a string to node_id . I just switched to using a PhyloXML.Id. Note that the tree I'm adding data to is a Bio.Phylo.Newick.Tree . I found a method as_phyloxml() which seems to return a Phylogeny object. I tried this on my tree before I added the node_id and other stuff to the clades and now my phyloXML files look fine (i.e. the node_id appears in the file as I expect). > > > myclade.other = [PhyloXML.Other("annotation", namespace="", children=[ > PhyloXML.Other("desc", value="Base of many coffees"), > PhyloXML.Other("uri", > value="http://en.wikipedia.org/wiki/Espresso")])] > > (*untested*) > > Note that the Clade.other attribute is expected to be a list. This is low-level > stuff meant for adding elements outside the phyloXML spec, so if it seems a > little ugly... yes, it is. 
I will give this a try. Thanks for your help. Mike From akooser at unm.edu Mon Oct 11 16:09:16 2010 From: akooser at unm.edu (Ara Kooser) Date: Mon, 11 Oct 2010 14:09:16 -0600 Subject: [Biopython] CDS location from xml in Biopython Message-ID: Hello all, Thank you again for your help. I have my program up and running. One thing that is throwing me is I am trying to extract the location of the gene from the BLAST .xml file. I've dug through the .xml and can't seem to find the information. Do I need to have the CDS files in order to parse the location start and stop values.
So for instance, the record for modular polyketide synthase [Streptomyces sp. AA4] before the sequence data is CDS 1..5256 /locus_tag="StAA4_010100030484" / coded_by="complement(NZ_ACEV01000078.1:25146..40916)" /note="COG3321 Polyketide synthase modules and related proteins" /transl_table=11 /db_xref="CDD:33130" The start/stop values (25146:40916) aren't in the .xml is that correct? So I would need to add a separate code in Biopython to handle the CDS files? Thanks! Ara From akooser at unm.edu Mon Oct 11 18:30:04 2010 From: akooser at unm.edu (Ara Kooser) Date: Mon, 11 Oct 2010 16:30:04 -0600 Subject: [Biopython] Bio.GenBank .Scanner ? Message-ID: Hello all, I found a partial answer to my question. I've download all the GenBank files for Strep. sp. AA4. I am using SeqIO to look at the information in the files. The documentation recommends using SeqIO. I am searching for the tag that will only extract: CDS 1..5256 /locus_tag="StAA4_010100030484" / coded_by="complement(NZ_ACEV01000078.1:25146..40916)" /note="COG3321 Polyketide synthase modules and related proteins" /transl_table=11 /db_xref="CDD:33130" this /coded_by="complement(NZ_ACEV01000078.1:25146..40916)" line from the GenBank files. The api documentation on-line discusses the parse_feature which is what I think I need. I am not sure the best way to pull out that one line. My current code is: from Bio import SeqIO gb_file = "sequences.gp" for gb_record in SeqIO.parse(open(gb_file,"r"), "genbank"): gb_feature = gb_record.features[2] print gb_feature Thank you for your time and help. Ara From biopython at maubp.freeserve.co.uk Tue Oct 12 04:46:13 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 12 Oct 2010 09:46:13 +0100 Subject: [Biopython] Bio.GenBank .Scanner ? In-Reply-To: References: Message-ID: On Mon, Oct 11, 2010 at 11:30 PM, Ara Kooser wrote: > Hello all, > > ?I found a partial answer to my question. I've download all the GenBank > files for Strep. sp. AA4. 
I am using SeqIO to look at the information in the
> files. The documentation recommends using SeqIO. I am searching for the tag
> that will only extract:
>     CDS             1..5256
>                     /locus_tag="StAA4_010100030484"
>                     /coded_by="complement(NZ_ACEV01000078.1:25146..40916)"
>                     /note="COG3321 Polyketide synthase modules and related
>                     proteins"
>                     /transl_table=11
>                     /db_xref="CDD:33130"
>
> this /coded_by="complement(NZ_ACEV01000078.1:25146..40916)" line from the
> GenBank files.
> The api documentation on-line discusses the parse_feature which is what I
> think I need. I am not sure the best way to pull out that one line.

I would not recommend using Bio.GenBank.Scanner directly for this task.
If you did want to do this, you would create your own consumer class
(probably as a subclass of BaseGenBankConsumer) and use this with the
GenBankScanner object. Your consumer would ignore most of the parsing
events, and focus on the CDS coded_by qualifier information.

> My current code is:
> from Bio import SeqIO
> gb_file = "sequences.gp"
> for gb_record in SeqIO.parse(open(gb_file,"r"), "genbank"):
>    gb_feature = gb_record.features[2]
>    print gb_feature
>
>
> Thank you for your time and help.
> Ara

Try something along these lines:

from Bio import SeqIO
gb_file = "sequences.gp"
for gb_record in SeqIO.parse(open(gb_file,"r"), "genbank"):
    for gb_feature in gb_record.features:
        # Skip everything that is not a CDS feature
        if gb_feature.type != "CDS":
            continue
        print gb_feature.qualifiers

Now you will need some way to identify *which* of the potentially
many CDS features present in the GenBank file is the one you care
about. I would guess you got StAA4_010100030484 from the BLAST
hits, so you should filter on the locus_tag qualifier.
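
A minimal sketch of that filtering step, and of pulling the start/stop values out of the coded_by string. It uses a plain dictionary as a stand-in for `gb_feature.qualifiers`, which maps each qualifier name to a list of strings. The function name `parse_coded_by` and the regular expression are assumptions for illustration, not Biopython API, and only simple locations (optionally wrapped in complement(...)) are handled.

```python
import re

# Matches e.g. "NZ_ACEV01000078.1:25146..40916", optionally wrapped in
# complement(...). Fuzzy ends such as "<25146" or ">40916" are tolerated.
CODED_BY = re.compile(r"^(complement\()?([^:()]+):<?(\d+)\.\.>?(\d+)\)?$")


def parse_coded_by(value):
    """Return (accession, start, end, strand), or None if not understood.

    strand is -1 for complement(...) and +1 otherwise; join(...) and other
    compound locations are not handled and give None.
    """
    match = CODED_BY.match(value.strip())
    if match is None:
        return None
    complement, accession, start, end = match.groups()
    return accession, int(start), int(end), (-1 if complement else 1)


# Stand-in for gb_feature.qualifiers of one CDS feature.
qualifiers = {
    "locus_tag": ["StAA4_010100030484"],
    "coded_by": ["complement(NZ_ACEV01000078.1:25146..40916)"],
}

wanted = "StAA4_010100030484"  # e.g. taken from the BLAST hit
if wanted in qualifiers.get("locus_tag", []):
    for value in qualifiers.get("coded_by", []):
        print(parse_coded_by(value))
```

In the loop over `gb_record.features`, the same check would be applied to `gb_feature.qualifiers` for each CDS feature.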
There is a related example here, http://www.warwick.ac.uk/go/peter_cock/python/genbank/#indexing_features Peter From alexl at users.sourceforge.net Mon Oct 18 19:56:59 2010 From: alexl at users.sourceforge.net (Alex Lancaster) Date: Mon, 18 Oct 2010 19:56:59 -0400 Subject: [Biopython] API support for finding polymorphisms? Message-ID: Hi there, I have a number of already-aligned sequences from the same species and I can get them into Biopython as a multiple alignment using the following alignments=AlignIO.parse("foobar.fasta", "fasta") Before I go off and implement something by hand, I was wondering if there is any support for finding polymorphisms across the alignments, and, in particular, whether there is a way to "filter" the polymorphisms (as there are likely to be too many to look at manually) by type (e.g. stop codon gains/losses, frameshifts, or large deletions/ insertions). I know that other packages such as samtools have ways to find polymorphisms with respect to a specified reference sequence (and I'm not sure, but I don't think samtools will allow you to filter by type in any case), but I'd like to find a biopythonish solution. I did a quick look through the API, Cookbook etc., but didn't find anything that quite matches what I'm trying to do. Cheers, Alex From sdavis2 at mail.nih.gov Mon Oct 18 20:39:26 2010 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Mon, 18 Oct 2010 20:39:26 -0400 Subject: [Biopython] API support for finding polymorphisms? 
In-Reply-To: References: Message-ID: On Mon, Oct 18, 2010 at 7:56 PM, Alex Lancaster wrote: > Hi there, > > I have a number of already-aligned sequences from the same species and I > can get them into Biopython as a multiple alignment using the following > > alignments=AlignIO.parse("foobar.fasta", "fasta") > > Before I go off and implement something by hand, I was wondering if > there is any support for finding polymorphisms across the alignments, > and, in particular, whether there is a way to "filter" the polymorphisms > (as there are likely to be too many to look at manually) by type > (e.g. stop codon gains/losses, frameshifts, or large deletions/ > insertions). > > I know that other packages such as samtools have ways to find > polymorphisms with respect to a specified reference sequence (and I'm > not sure, but I don't think samtools will allow you to filter by type in > any case), but I'd like to find a biopythonish solution. > > I did a quick look through the API, Cookbook etc., but didn't find > anything that quite matches what I'm trying to do. > Hi, Alex. Are you working from short read data? If so, what platform? In what format are the aligned data? Sean From alexl at users.sourceforge.net Mon Oct 18 23:04:00 2010 From: alexl at users.sourceforge.net (Alex Lancaster) Date: Mon, 18 Oct 2010 23:04:00 -0400 (EDT) Subject: [Biopython] API support for finding polymorphisms? In-Reply-To: Message-ID: <86265363.4209.1287457440142.JavaMail.root@io.wi.mit.edu> ----- Original Message ----- > Hi, Alex. Are you working from short read data? If so, what platform? > In what format are the aligned data? 
Hi Sean, I'm actually working from yeast literature data released by Sanger: http://www.sanger.ac.uk/research/projects/genomeinformatics/sgrp.html The raw data is available via ftp in several formats including FASTQ and others, the PDF for more info: http://www.sanger.ac.uk/research/projects/genomeinformatics/sgrp_manual.pdf The original data is a mixture of Solexa/Illumina and ABI, different platforms for different yeast strains. They include a Perl script (alicat.pl) that can parse some of the alignments that they had performed already (including both sequence alignments with errors as well as imputed sequences with errors and missing data corrected). I have been working with the imputed alignments as I didn't want to go all the way back and re-align from scratch all the raw data. I could probably hack the Perl script to do some of what I need (it already has a facility to print out only polymorphic positions from the imputed alignments), but I would like a more robust Python-based solution. My first thought was to use the alicat.pl script to output the alignments and the imputed sequences, convert them into full sequences and then use Python-based solution from there to identify and classify the individual polymorphisms. At the moment, I'm only interested in looking at a couple of specific genes, so it's not a genome-wide survey (i.e. I only need to keep one or two genes and alignments in memory at once), but I'd like the solution to generalizable, so I could specify any yeast gene in the SGD and include polymorphisms in both promoters as well as coding regions. Alex From dejmail at gmail.com Wed Oct 20 09:19:14 2010 From: dejmail at gmail.com (Liam Thompson) Date: Wed, 20 Oct 2010 15:19:14 +0200 Subject: [Biopython] error in records Message-ID: hi everyone I am having problems seeing what is wrong with two genbank records of Hepatitis B Virus. When I cycle through a genbank file with multiple records, and these two are in it, it comes back with. 
Traceback (most recent call last): File "/media/0844588592/phd/lab_book/bioinformatics/typeseq_cds_split.py", line 13, in for records in SeqIO.parse("cts.gb", "gb"): File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 432, in parse_records record = self.parse(handle, do_features) File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 415, in parse if self.feed(handle, consumer, do_features): File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 387, in feed self._feed_feature_table(consumer, self.parse_features(skip=False)) File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 339, in _feed_feature_table consumer.location(location_string) File "/usr/lib/pymodules/python2.6/Bio/GenBank/__init__.py", line 673, in location LocationParser.parse(LocationParser.scan(location_line)) File "/usr/lib/pymodules/python2.6/Bio/GenBank/LocationParser.py", line 325, in parse return _cached_parser.parse(tokens) File "/usr/lib/pymodules/python2.6/Bio/Parsers/spark.py", line 203, in parse self.error(tokens[i-1]) IndexError: list index out of range I'm not sure what to make of this, especially as I've looked at the records for quite a while now and can't seem to figure out what peculiarity of the formatting upsets the parser. accessions X65259 and X85254. I would appreciate any tips or explanation of the above. Thanks Liam From biopython at maubp.freeserve.co.uk Wed Oct 20 09:49:20 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 20 Oct 2010 14:49:20 +0100 Subject: [Biopython] error in records In-Reply-To: References: Message-ID: On Wed, Oct 20, 2010 at 2:19 PM, Liam Thompson wrote: > hi everyone > > I am having problems seeing what is wrong with two genbank records of > Hepatitis B Virus. When I cycle through a genbank file with multiple > records, and these two are in it, it comes back with. 
> > Traceback (most recent call last): > File "/media/0844588592/phd/lab_book/bioinformatics/typeseq_cds_split.py", > line 13, in > for records in SeqIO.parse("cts.gb", "gb"): > ... > File "/usr/lib/pymodules/python2.6/Bio/Parsers/spark.py", line 203, in > parse > self.error(tokens[i-1]) > IndexError: list index out of range > > I'm not sure what to make of this, especially as I've looked at the records > for quite a while now and can't seem to figure out what peculiarity of the > formatting upsets the parser. Something about a feature location is causing the problem. From the traceback I infer that you are using an old version of Biopython since this was rewritten in Biopython 1.55 (it doesn't use Spark by default). > accessions X65259 and X85254. I would appreciate any tips or explanation > of the above. Looking at those two nucleotide GenBank records on the NCBI Entrez website, I see nothing wrong or suspicious, and the current version of Biopython reads them fine. Could you try updating to Biopython 1.55? Alternatively if it isn't too big you can email me your cts.gb example file *off the mailing list* and I'll try it here (in case there is something else wrong). Regards, Peter From biopython at maubp.freeserve.co.uk Wed Oct 20 10:44:47 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 20 Oct 2010 15:44:47 +0100 Subject: [Biopython] error in records In-Reply-To: References: Message-ID: On Wed, Oct 20, 2010 at 3:08 PM, Liam Thompson wrote: > Hi Peter > > Thanks for looking at it. I upgraded to biopy 1.55, from 1.54 and it made no > difference. There is still something funky going on. I have attached the > records in the zip textfile, they are the last 2 listed in the file. > > Thanks > Liam Hi Liam, I got the zipped GenBank file, thanks. The two problem records have been changed - at least, they don't match what I download from the NCBI today.
Running the example here the error message from X85254, Bio.GenBank.LocationParserError: /join(1816..1899,1903..2454) Hopefully you will agree that this is a much more helpful error message than you had before. Looking at the file, ... gene /join(1816..1899,1903..2454) /gene="precore-core" CDS /join(1816..1899,1903..2454) /gene="precore-core" /codon_start=1 ... You shouldn't have the leading slash on the join location (two cases, gene and CDS entry too). After fixing that by hand there is an error in X65259.1, ValueError: Sequence line mal-formed, ' 1 AACTCCACAA CCTTCCACCA AACTCTGCAA GATCCCAGAG TGAGAGGCCT GTATTTCCCT' You need another space at the start of that line. With those three fixes (removing two slashes, adding one space) then it seems to parse fine. Peter From dejmail at gmail.com Wed Oct 20 10:47:57 2010 From: dejmail at gmail.com (Liam Thompson) Date: Wed, 20 Oct 2010 16:47:57 +0200 Subject: [Biopython] error in records In-Reply-To: References: Message-ID: Aaaargh. Thanks Peter. Regards Liam On Wed, Oct 20, 2010 at 4:44 PM, Peter wrote: > On Wed, Oct 20, 2010 at 3:08 PM, Liam Thompson wrote: > > Hi Peter > > > > Thanks for looking at it. I upgraded to biopy 1.55, from 1.54 and it made > no > > difference. There is still something funky going on. I have attached the > > records in the zip textfile, they are the last 2 listed in the file. > > > > Thanks > > Liam > > Hi Liam, > > I got the zipped GenBank file, thanks. The two problem records > have been changed - at least, they don't match what I download > from the NCBI today. > > Running the example here the error message from X85254, > > Bio.GenBank.LocationParserError: /join(1816..1899,1903..2454) > > Hopefully you will agree that this is a much more helpful error > message than you had before. Looking at the file, > > ... > gene /join(1816..1899,1903..2454) > /gene="precore-core" > CDS /join(1816..1899,1903..2454) > /gene="precore-core" > /codon_start=1 > ... 
> > You shouldn't have the leading slash on the join location > (two cases, gene and CDS entry too). > > After fixing that by hand there is an error in X65259.1, > > ValueError: Sequence line mal-formed, ' 1 AACTCCACAA CCTTCCACCA > AACTCTGCAA GATCCCAGAG TGAGAGGCCT GTATTTCCCT' > > You need another space at the start of that line. > > With those three fixes (removing two slashes, adding one space) > then it seems to parse fine. > > Peter > From oriolebaltimore at gmail.com Wed Oct 20 12:16:43 2010 From: oriolebaltimore at gmail.com (Adrian Johnson) Date: Wed, 20 Oct 2010 12:16:43 -0400 Subject: [Biopython] Samtools Pileup format - NGS data Message-ID: Dear group, I am wondering about any functionality in BioPython that deals with annotation of SNPs identified through NGS pipelines. For instance if given a Pileup format : chr1 799195 * */+G 115 115 33 37 * +G chr1 811750 a G 36 36 60 3 Ggg AB? chr1 815761 C A 2 33 46 3 A.a CCC chr1 815777 C T 2 33 46 3 T.t CCC Now it would be very interesting to have a module that connects to NCBI or UCSC servers and compute the following questions: 1. Identify what mutation type at a given position on a chromosome ( 815777@ chr1). The mutation could be a synonymous, frame-shift etc. 2. Get gene name, accession and protein accession. 3. Get the type of amino-acid change such as Gly -> Ser 4. If this SNP is observed in dbSNP, 1000 genomes data and other mutation databases. 5. Get the allele frequencies from dbSNP for this SNP if found in dbSNP 6. Location of the SNP - viz. intron, 5'UTR, 3'UTR or splice site. A web service from Shedure lab is available for this type of questions. Given MAQ or Pileup format, this website reports answers to all the questions above. However, the website is slow and cannot be used in a pipeline. Any BioPython user or developer working on this kind of functionality? 
thanks Adrian From sdavis2 at mail.nih.gov Wed Oct 20 12:44:38 2010 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Wed, 20 Oct 2010 12:44:38 -0400 Subject: [Biopython] Samtools Pileup format - NGS data In-Reply-To: References: Message-ID: On Wed, Oct 20, 2010 at 12:16 PM, Adrian Johnson wrote: > Dear group, > > I am wondering about any functionality in BioPython that deals with > annotation of SNPs identified through NGS pipelines. > > For instance if given a Pileup format : > > chr1 799195 * */+G 115 115 33 37 * +G > chr1 811750 a G 36 36 60 3 Ggg AB? > chr1 815761 C A 2 33 46 3 A.a CCC > chr1 815777 C T 2 33 46 3 T.t CCC > > > Now it would be very interesting to have a module that connects to > NCBI or UCSC servers and compute the following questions: > > 1. Identify what mutation type at a given position on a chromosome ( > 815777@ chr1). The mutation could be a synonymous, frame-shift etc. > > 2. Get gene name, accession and protein accession. > > 3. Get the type of amino-acid change such as Gly -> Ser > > 4. If this SNP is observed in dbSNP, 1000 genomes data and other > mutation databases. > > 5. Get the allele frequencies from dbSNP for this SNP if found in dbSNP > > 6. Location of the SNP - viz. intron, 5'UTR, 3'UTR or splice site. > > > A web service from Shedure lab is available for this type of > questions. Given MAQ or Pileup format, this website reports answers to > all the questions above. However, the website is slow and cannot be > used in a pipeline. > > Any BioPython user or developer working on this kind of functionality? > > Hi, Adrian. You might look at the SIFT application. It can be downloaded and includes precomputed results for 1,2,3, and dbSNP part of 4 as several sqlite database files. We dump those databases out and use the text files directly. With BEDtools (and there python libraries like bxPython with similar functionality), number 6 is also quite straightforward (single command line, basically), also. 
If you have other tab-delimited text files with genomic things of interest, consider using tabix (from the samtools site) to index the compressed, sorted files. tabix includes a Python wrapper that allows nearly instantaneous overlap queries and returns rows from the text file. Sean From akooser at unm.edu Fri Oct 22 14:22:59 2010 From: akooser at unm.edu (Ara Kooser) Date: Fri, 22 Oct 2010 12:22:59 -0600 Subject: [Biopython] Lineage from GenBank files Question Message-ID: <17B3DBDB-9F36-4C8F-85AC-42A61ED23DEA@unm.edu> Hello all, I've been working on code to parse information from BLAST .xml files and GenBank files. I am interested in adding the taxonomy lineage information to the code. I was looking for the tag in the on-line documentation here: http://biopython.org/DIST/docs/api/Bio.GenBank.Scanner-module.html So in the code I am working on I have this line name_by_source_index = index_genbank_features(gb_record,"source","organism") which grabs the species Yersinia enterocolitica but not the whole lineage. I was wondering how do I grab the rest of the information like this: 1385 Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales; 1386 Enterobacteriaceae; Yersinia. Thank you! I do have a second question. Once I have a chunk of code running and made pretty what is the best way to submit it so it can be posted up in the Cookbook section. Ara From biopython at maubp.freeserve.co.uk Sat Oct 23 10:43:31 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 23 Oct 2010 15:43:31 +0100 Subject: [Biopython] Lineage from GenBank files Question In-Reply-To: <17B3DBDB-9F36-4C8F-85AC-42A61ED23DEA@unm.edu> References: <17B3DBDB-9F36-4C8F-85AC-42A61ED23DEA@unm.edu> Message-ID: On Fri, Oct 22, 2010 at 7:22 PM, Ara Kooser wrote: > Hello all, > > I've been working on a code to parse information from BLAST .xml files and > GenBank files. I am interested in adding the taxonomy lineage information to > the code.
> There are two approaches here, firstly the (limited) lineage in the GenBank flat files themselves, and secondly using the taxon ID or accession online with the NCBI Entrez API to get the full lineage. Taking an example, LOCUS NC_000932 154478 bp DNA circular PLN 15-APR-2009 DEFINITION Arabidopsis thaliana chloroplast, complete genome. ACCESSION NC_000932 VERSION NC_000932.1 GI:7525012 DBLINK Project:116 KEYWORDS . SOURCE chloroplast Arabidopsis thaliana (thale cress) ORGANISM Arabidopsis thaliana Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons; rosids; eurosids II; Brassicales; Brassicaceae; Arabidopsis. REFERENCE 1 (bases 1 to 154478) ... The lineage is in the header, the lines following the SOURCE and ORGANISM lines. This all gets recorded in the SeqRecord annotations dictionary: >>> from Bio import SeqIO >>> record = SeqIO.read("", "genbank") >>> record.annotations["source"] 'chloroplast Arabidopsis thaliana (thale cress)' >>> record.annotations["organism"] 'Arabidopsis thaliana' >>> record.annotations["taxonomy"] ['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Tracheophyta', 'Spermatophyta', 'Magnoliophyta', 'eudicotyledons', 'core eudicotyledons', 'rosids', 'eurosids II', 'Brassicales', 'Brassicaceae', 'Arabidopsis'] There is also some relevant information in any source feature (usually there is one and only one, and this will be the first feature), such as the taxon ID. > > I do have a second question. Once I have a chunk of code running > and made pretty what is the best way to submit it so it can be posted > up in the Cookbook section. 
> It is a wiki, just make sure you include [[Category:Cookbook]] and it will appear here: http://biopython.org/wiki/Category:Cookbook Peter From zaricdragoslav at gmail.com Mon Oct 25 10:46:51 2010 From: zaricdragoslav at gmail.com (Dragoslav Zaric) Date: Mon, 25 Oct 2010 18:46:51 +0400 Subject: [Biopython] Getting involved Message-ID: Dear friends, I will first introduce myself. My name is Dragoslav Zaric, I am from country Serbia, capital city Belgrade. I am currently located in Abu Dhabi, UAE, working as professional programmer. I have Master degree in Astrophysics and I work as professional programmer for more than 3 years. I am not in the Bioinformatics field, but I have enough interest, energy and will to contribute. I have read biopython manual, so I have good idea and overview of biopython framework. Also I have good knowledge of python and django framework. I wanna start contribute to biopython project so can you tell me is there some planned work to be done ? At first I can get involved in programming tasks that are not totally related to genetics, like parsing of files, accessing databases ... Also I have started to read AMAZING book on NCBI site and after finishing this book I am sure I can start to work on genetic problems in biopython. This is the book: http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=genomes This also gave me idea how to start to contribute to biopython. Does biopython have module just for searching NCBI bookshelf and books, like library module ?? Maybe I can start to work on this module ?? 
Kind regards -- Dragoslav Zaric Professional Programmer MSc Astrophysics From martin.djokovic at gmail.com Mon Oct 25 11:06:52 2010 From: martin.djokovic at gmail.com (martin djokovic) Date: Mon, 25 Oct 2010 11:06:52 -0400 Subject: [Biopython] writing a PDB----PLEASEEEE HELP Message-ID: OK I have looked at both the tutorial and the link below http://www2.warwick.ac.uk/fac/sci/moac/students/peter_cock/python/protein_superposition/ I am pretty new so I am not sure how to do this. I have 2 PDB files --lets call them A and B . A = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename) B = Bio.PDB.PDBParser().get_structure(pdb_code2, pdb_filename2) I pick the [CA] atoms of both these PDB's (each has 5 residues) then store them in two lists called ref_atoms and alt_atoms respectively super_imposer = Bio.PDB.Superimposer() super_imposer.set_atoms(ref_atoms, alt_atoms) super_imposer.apply(alt_atoms) (I move the alt_atoms above) Then write it out as follows io=Bio.PDB.PDBIO() io.set_structure(A) io.save('ans.pdb') My problem is I want a 3rd PDB called 'ans.pdb' that has two structures, namely A and B superimposed on each other but what I get is just a copy of A in the 'ans.pdb' What happened to the coordinates of B that I translated and rotated ????? What am I doing wrong?? Thanks Martin From fglaser at technion.ac.il Mon Oct 25 11:10:30 2010 From: fglaser at technion.ac.il (Fabian Glaser) Date: Mon, 25 Oct 2010 17:10:30 +0200 Subject: [Biopython] PDB bfactor Message-ID: Dear all, I am trying to modify the bfactor column of PDB files. That's not a problem with the atom.bfactor = XXX and PDB parser. But what I am interested in is to save 3 digits after the zero, that is 0.344, but the atom.bfactor object doesn't allow more than two (0.34). Is it possible to force atom.bfactor to write 3 digits? Thanks a lot, fabian -- ------------------------------------------------------ Fabian Glaser, PhD Bioinformatics Knowledge Unit, The Lorry I. 
Lokey Interdisciplinary Center for Life Sciences and Engineering Technion - Israel Institute of Technology Haifa 32000, ISRAEL E-mail: fglaser at technion.ac.il Tel: +972-4-8293701 Fax: +972-4-8225153 From anaryin at gmail.com Mon Oct 25 11:14:39 2010 From: anaryin at gmail.com (João Rodrigues) Date: Mon, 25 Oct 2010 17:14:39 +0200 Subject: [Biopython] PDB bfactor In-Reply-To: References: Message-ID: Hello Fabian, Please have a look here: http://www.wwpdb.org/documentation/format32/sect9.html#ATOM The PDB format is quite annoying because it's basically space-based. So, for the B factor, you have 5 characters allotted meaning it is usually YY.XX Therefore, although some programs might accept longer B factors, the Parser will reject it. Best! João Rodrigues From p.j.a.cock at googlemail.com Mon Oct 25 11:19:20 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 25 Oct 2010 16:19:20 +0100 Subject: [Biopython] PDB bfactor In-Reply-To: References: Message-ID: On Mon, Oct 25, 2010 at 4:10 PM, Fabian Glaser wrote: > Dear all, > > I am trying to modify the bfactor column of PDB files. That's not a > problem with the atom.bfactor = XXX and PDB parser. But what I am > interested in is to save 3 digits after the zero, that is 0.344, but > the atom.bfactor object doesn't allow more than two (0.34). > > Is it possible to force atom.bfactor to write 3 digits? > > Thanks a lot, > > fabian Hi, Do you mean the temperature factor in an ATOM line, columns 61 to 66? http://www.wwpdb.org/documentation/format32/sect9.html According to the spec that should be formatted as "Real (6.2)", which I think means at most six characters and fixed two decimal places, i.e. something like XXX.XX, which means you can't store 0.344 here, according to the spec you can only store 0.34.
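The fixed-width limit described above can be checked with plain string formatting. The sketch below is not Biopython code: format_bfactor is a hypothetical helper written only for illustration, and rescaling the value by 100 is just one possible workaround (an assumption on my part; any program reading the file would have to know the values were rescaled).

```python
def format_bfactor(value):
    """Format a temperature factor as the six-column Real(6.2) field."""
    text = "%6.2f" % value
    # Values that need more than six characters cannot be stored in the field.
    if len(text) > 6:
        raise ValueError("B-factor %r does not fit in six columns" % value)
    return text


# The third decimal place is lost when the raw value is written.
print(repr(format_bfactor(0.344)))
# Hypothetical workaround: store the value multiplied by 100 instead.
print(repr(format_bfactor(0.344 * 100)))
```

With the second approach the three significant digits survive in the file, at the cost of the column no longer holding a conventional B factor.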
Peter From p.j.a.cock at googlemail.com Mon Oct 25 11:21:30 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 25 Oct 2010 16:21:30 +0100 Subject: [Biopython] writing a PDB----PLEASEEEE HELP In-Reply-To: References: Message-ID: On Mon, Oct 25, 2010 at 4:06 PM, martin djokovic wrote: > OK I have looked at both the tutorial and the link below > http://www2.warwick.ac.uk/fac/sci/moac/students/peter_cock/python/protein_superposition/ > > I am pretty new so I am not sure how to do this. > > I have 2 PDB files --lets call them A and B . > > A = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename) > > B = Bio.PDB.PDBParser().get_structure(pdb_code2, pdb_filename2) > > I pick the [CA] atoms of both these PDB's (each has 5 residues) then store > them in two lists called ref_atoms and alt_atoms respectively > > super_imposer = Bio.PDB.Superimposer() > super_imposer.set_atoms(ref_atoms, alt_atoms) > super_imposer.apply(alt_atoms) > (I move the alt_atoms above) > > Then write it out as follows > > io=Bio.PDB.PDBIO() > io.set_structure(A) > io.save('ans.pdb') > > My problem is I want a 3rd PDB called 'ans.pdb' that has two structures, > namely A and B superimposed on each other but what I get is just a copy of A > in the 'ans.pdb' > What happened to the coordinates of B that I translated and rotated ????? > > What am I doing wrong?? > Thanks > Martin Hi, Could you clarify what you want out? Maybe you want a single PDB file containing two models (structure A and structure B)? Otherwise you could have problems with the two structures using clashing chain identifiers etc.
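If a single file holding two models is what is wanted, one rough way to get there is to save each structure to its own file with PDBIO (A as it is, B after the superimposition has been applied) and then wrap the coordinate records of each file in MODEL/ENDMDL records. The helper below is a plain-Python sketch of that wrapping step only. merge_as_models is a hypothetical name, not part of Bio.PDB, and it assumes each input text holds a single model.

```python
def merge_as_models(pdb_texts):
    """Combine several single-model PDB texts into one multi-model PDB text."""
    coordinate_records = ("ATOM", "HETATM", "TER")
    out = []
    for serial, text in enumerate(pdb_texts, start=1):
        # MODEL record: the serial number is right-justified in columns 11-14.
        out.append("MODEL     %4d" % serial)
        for line in text.splitlines():
            # Keep only coordinate records; headers and END lines are dropped.
            if line.startswith(coordinate_records):
                out.append(line)
        out.append("ENDMDL")
    out.append("END")
    return "\n".join(out) + "\n"


# Two tiny made-up inputs standing in for the saved A and B files.
pdb_a = "ATOM      1  CA  ALA A   1       0.000   0.000   0.000  1.00  0.00           C\nEND\n"
pdb_b = "ATOM      1  CA  ALA A   1       0.100   0.200   0.300  1.00  0.00           C\nEND\n"
print(merge_as_models([pdb_a, pdb_b]))
```

Because each structure sits in its own model, both can keep the same chain identifier without clashing. Whether a viewer shows the models overlaid depends on the viewer.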
Peter From martin.djokovic at gmail.com Mon Oct 25 11:26:45 2010 From: martin.djokovic at gmail.com (martin djokovic) Date: Mon, 25 Oct 2010 11:26:45 -0400 Subject: [Biopython] writing a PDB----PLEASEEEE HELP In-Reply-To: References: Message-ID: Hi Peter, I want a new PDB with structures A and B superimposed so that I can see them both at the same in the same file So at the end of the simulation/run I would have A and B (original) and the 'ans.pdb' with A as it was but B rotated and translated to be superimposed on A in the same PDB I want to try this first simple superimposition but actually I want to connect A and B together to make a longer strand The last residue of A and first residue of B are the same so I can use those coordinates to rotate/translate B then connect to A I can do that manually using SWISS PDB but I want to do it for many structures and its time consuing. On Mon, Oct 25, 2010 at 11:21 AM, Peter Cock wrote: > On Mon, Oct 25, 2010 at 4:06 PM, martin djokovic > wrote: > > OK I have looked at both the tutorial and the link below > > > http://www2.warwick.ac.uk/fac/sci/moac/students/peter_cock/python/protein_superposition/ > > > > I am pretty new so I am not sure how to do this. > > > > I have 2 PDB files --lets call them A and B . 
> > > > A = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename) > > > > B = Bio.PDB.PDBParser().get_structure(pdb_code2, pdb_filename2) > > > > I pick the [CA] atoms of both these PDB's (each has 5 residues) then > store > > them in two lists called ref_atoms and alt_atoms respectively > > > > super_imposer = Bio.PDB.Superimposer() > > super_imposer.set_atoms(ref_atoms, alt_atoms) > > super_imposer.apply(alt_atoms) > > (I move the alt_atoms above) > > > > Then write it out as follows > > > > io=Bio.PDB.PDBIO() > > io.set_structure(A) > > io.save('ans.pdb') > > > > My problem is I want a 3rd PDB called 'ans.pdb' that has two structures, > > namely A and B superimposed on each other but what I get is just a copy > of A > > in the 'ans.pdb' > > What happened to the coordinates of B that I translated and rotated ????? > > > > What am I doing wrong?? > > Thanks > > Martin > > Hi, > > Could you clarify what you want out? Maybe you want a single > PDB file containing two models (structure A and structure B)? > Otherwise you could have problems with the two structures > using clashing chain identifiers etc. > > Peter > From p.j.a.cock at googlemail.com Mon Oct 25 11:32:37 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 25 Oct 2010 16:32:37 +0100 Subject: [Biopython] writing a PDB----PLEASEEEE HELP In-Reply-To: References:

Message-ID: On Mon, Oct 25, 2010 at 4:26 PM, martin djokovic wrote: > Hi Peter, > I want a new PDB with structures A and B superimposed so that I can see them > both at the same time in the same file > > So at the end of the simulation/run I would have A and B (original) and the > 'ans.pdb' with A as it was but B rotated and translated to be superimposed > on A in the same PDB > I want to try this first simple superimposition but actually I want to > connect A and B together to make a longer strand > The last residue of A and first residue of B are the same so I can use those > coordinates to rotate/translate B then connect to A > I can do that manually using SWISS PDB but I want to do it for many > structures and it's time consuming. Won't that mean there would be a duplicate residue? i.e. The last residue of A and first residue of B are the same thing, but would be in the file twice. Anyway - the basic idea is you must create a Bio.PDB structure object with both A and B in it (perhaps as two chains in the same model), then write that to the PDB file. The details depend on how you want to do the combination - there is more than one way to represent A and B in the same PDB file (quite separate from how to do it in Biopython). Peter P.S. You could try writing out two separate PDB files and try simply concatenating them... it might do what you want. From martin.djokovic at gmail.com Mon Oct 25 11:36:18 2010 From: martin.djokovic at gmail.com (martin djokovic) Date: Mon, 25 Oct 2010 11:36:18 -0400 Subject: [Biopython] writing a PDB----PLEASEEEE HELP In-Reply-To: References:

Message-ID: So if we just focus on superimposing A and B for now- Are you saying it's impossible to do it while A and B are separate PDB's? They should be in the same PDB? Ok like the example in the Warwick University site for 1JOY?? Oh I see-I thought that I could do the same thing but do it for 2 separate PDB's--please confirm this. I was really getting confused as I tried to follow that as much as possible but do it using 2 files On Mon, Oct 25, 2010 at 11:32 AM, Peter Cock wrote: > On Mon, Oct 25, 2010 at 4:26 PM, martin djokovic > wrote: > > Hi Peter, > > I want a new PDB with structures A and B superimposed so that I can see > them > > both at the same time in the same file > > > > So at the end of the simulation/run I would have A and B (original) and > the > > 'ans.pdb' with A as it was but B rotated and translated to be > superimposed > > on A in the same PDB > > I want to try this first simple superimposition but actually I want to > > connect A and B together to make a longer strand > > The last residue of A and first residue of B are the same so I can use > those > > coordinates to rotate/translate B then connect to A > > I can do that manually using SWISS PDB but I want to do it for many > > structures and it's time consuming. > > Won't that mean there would be a duplicate residue? i.e. The last residue > of A and first residue of B are the same thing, but would be in the file > twice. > > Anyway - the basic idea is you must create a Bio.PDB structure object > with both A and B in it (perhaps as two chains in the same model), then > write that to the PDB file. The details depend on how you want to do > the combination - there is more than one way to represent A and B > in the same PDB file (quite separate from how to do it in Biopython). > > Peter > > P.S. You could try writing out two separate PDB files and try simply > concatenating them... it might do what you want.
> From p.j.a.cock at googlemail.com Mon Oct 25 11:59:48 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 25 Oct 2010 16:59:48 +0100 Subject: [Biopython] writing a PDB----PLEASEEEE HELP In-Reply-To: References:

Message-ID: On Mon, Oct 25, 2010 at 4:36 PM, martin djokovic wrote: > So if we just focus on superimposing A and B for now- > Are you saying its impossible to do it while A and B are seperate PDB's? > They should be in the same PDB? > Ok like the example in the warwick university site for 1JOY?? Oh I see-I > thought that I could do the same thing but do it for 2 seperate > PDB's--please confirm this. > I was really getting confused as I tried to follow that as much as possible > but do it using 2 files The example on my Warwick page uses a single PDB file containing multiple models, but you can superimpose two separate PDB files with Biopython. You (apparently) are stuck at a different task - combining two PDB files into one. You need to be careful about not creating a bad PDB file where residues or chains are multiply defined (e.g. if both your structures are called Chain A in Model 1). Peter From martin.djokovic at gmail.com Mon Oct 25 12:06:13 2010 From: martin.djokovic at gmail.com (martin djokovic) Date: Mon, 25 Oct 2010 12:06:13 -0400 Subject: [Biopython] writing a PDB----PLEASEEEE HELP In-Reply-To: References:

Message-ID: Thanks Peter but if I just want to superimpose A and B and get a 3rd PDB file that shows A and B superimposed how do I do it? Following the method I did (first post) just makes me a copy of A as the 3rd file - I still can't figure out what happened to the result of the "apply" command that should have rotated and translated the coordinates of B to superimpose them on A. I want to write the result (now having A and superimposed B) into a 3rd file. Not sure if you are understanding my question due to my poor explanation of the problem - I am sorry about that. Imagine if 1JOY only has 2 structures instead of 21 and they are 1JOY-A and 1JOY-B how do I superimpose them and get 1JOY-C that shows them nicely overlapping as in the final output in your Warwick example? Thanks once again, Martin On Mon, Oct 25, 2010 at 11:59 AM, Peter Cock wrote: > On Mon, Oct 25, 2010 at 4:36 PM, martin djokovic > wrote: > > So if we just focus on superimposing A and B for now- > > Are you saying it's impossible to do it while A and B are separate PDB's? > > They should be in the same PDB? > > Ok like the example in the Warwick University site for 1JOY?? Oh I see-I > > thought that I could do the same thing but do it for 2 separate > > PDB's--please confirm this. > > I was really getting confused as I tried to follow that as much as > possible > > but do it using 2 files > > The example on my Warwick page uses a single PDB file containing > multiple models, but you can superimpose two separate PDB files with > Biopython. > > You (apparently) are stuck at a different task - combining two PDB > files into one. You need to be careful about not creating a bad PDB > file where residues or chains are multiply defined (e.g. if both your > structures are called Chain A in Model 1).
> > Peter > From anaryin at gmail.com Mon Oct 25 12:09:12 2010 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Mon, 25 Oct 2010 18:09:12 +0200 Subject: [Biopython] writing a PDB----PLEASEEEE HELP In-Reply-To: References:

Message-ID: Dear Martin, To write a new PDB file you first must ensure that there are no naming clashes. This means that you can't have two chains named A, nor two residues numbered 12 in the same chain for example. For your particular problem, I'd suggest the following: Renaming the first (reference) structure chains to start from A. Then rename the following structures chains accordingly. Try for example using a combination of a for loop with the Python chr function (the character A is ord 65). Very crudely: i_chain = 65 for structure in list_of_structures: for chain in structure.get_chains(): chain.id = chr(i_chain) i_chain += 1 This should fix your chain problems. Afterwards, all you need is to combine all your chains in ONE structure object. I believe it to be easier. Just add your structure B information to your structure A information. Check the documentation on how to manipulate SMCRA objects. This will yield you a final structure object with both structure A and structure B which you can then output and save. Hope it helps! Best! João [...] Rodrigues http://doeidoei.wordpress.org On Mon, Oct 25, 2010 at 5:59 PM, Peter Cock wrote: > On Mon, Oct 25, 2010 at 4:36 PM, martin djokovic > wrote: > > So if we just focus on superimposing A and B for now- > > Are you saying its impossible to do it while A and B are seperate PDB's? > > They should be in the same PDB? > > Ok like the example in the warwick university site for 1JOY?? Oh I see-I > > thought that I could do the same thing but do it for 2 seperate > > PDB's--please confirm this. > > I was really getting confused as I tried to follow that as much as > possible > > but do it using 2 files > > The example on my Warwick page uses a single PDB file containing > multiple models, but you can superimpose two separate PDB files with > Biopython. > > You (apparently) are stuck at a different task - combining two PDB > files into one.
You need to be careful about not creating a bad PDB > file where residues or chains are multiply defined (e.g. if both your > structures are called Chain A in Model 1). > > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From martin.djokovic at gmail.com Mon Oct 25 12:16:16 2010 From: martin.djokovic at gmail.com (martin djokovic) Date: Mon, 25 Oct 2010 12:16:16 -0400 Subject: [Biopython] writing a PDB----PLEASEEEE HELP In-Reply-To: References:

Message-ID: João and Peter, I think I have confused everyone by bringing in my second problem. This is what I really want to do now: I have 2 PDB files, protein1.pdb and protein2.pdb. (I have stopped using A and B because people assume they are chain A and chain B. They are not; I just called the files A.pdb and B.pdb for convenience.) Both protein1.pdb and protein2.pdb have the same residues, and I want to superimpose protein2.pdb on protein1.pdb. Once I have rotated and translated the coordinates of protein2.pdb, I want to write a 3rd PDB file containing the coordinates of protein1.pdb and the translated/rotated coordinates of protein2.pdb. That's all. Thank you, Martin On Mon, Oct 25, 2010 at 12:09 PM, João Rodrigues wrote: > Dear Martin, > > To write a new PDB file you first must ensure that there are no naming > clashes. This means that you can't have two chains named A, nor two residues > numbered 12 in the same chain for example. > > For your particular problem, I'd suggest the following: > > Renaming the first (reference) structure chains to start from A. Then > rename the following structures chains accordingly. > > Try for example using a combination of a for loop with the Python chr > function (A character is ord 65). Very crudely: > > i_chain = 65 > for structure in list_of_structures: > for chain in structure: > chain.id = chr(i_chain) > i_chain += 1 > > This should fix your chain problems. Afterwards, all you need is to combine > all your chains in ONE structure object. I believe it to be easier. > > Just add your structure B information to your structure A information. Check > the documentationon how to manipulate SMCRA objects. > > This will yield you a final structure object with both structure A and > structure B which you can then output and save. > > Hope it helps! Best! > > João [...]
Rodrigues > http://doeidoei.wordpress.org > > > > On Mon, Oct 25, 2010 at 5:59 PM, Peter Cock wrote: > >> On Mon, Oct 25, 2010 at 4:36 PM, martin djokovic >> wrote: >> > So if we just focus on superimposing A and B for now- >> > Are you saying its impossible to do it while A and B are seperate PDB's? >> > They should be in the same PDB? >> > Ok like the example in the warwick university site for 1JOY?? Oh I see-I >> > thought that I could do the same thing but do it for 2 seperate >> > PDB's--please confirm this. >> > I was really getting confused as I tried to follow that as much as >> possible >> > but do it using 2 files >> >> The example on my Warwick page uses a single PDB file containing >> multiple models, but you can superimpose two separate PDB files with >> Biopython. >> >> You (apparently) are stuck at a different task - combining two PDB >> files into one. You need to be careful about not creating a bad PDB >> file where residues or chains are multiply defined (e.g. if both your >> structures are called Chain A in Model 1). >> >> Peter >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> > > From anaryin at gmail.com Mon Oct 25 12:19:06 2010 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Mon, 25 Oct 2010 18:19:06 +0200 Subject: [Biopython] writing a PDB----PLEASEEEE HELP In-Reply-To: References:

Message-ID: Just make sure the chains are different, and then do pretty much what we told you :)

1. Load protA and protB
2. Rotate protB onto protA
3. Merge both into one single structure object (refer to documentation)
4. Save new structure object with PDBIO.

Should work! J

From martin.djokovic at gmail.com Mon Oct 25 12:23:56 2010 From: martin.djokovic at gmail.com (martin djokovic) Date: Mon, 25 Oct 2010 12:23:56 -0400 Subject: [Biopython] writing a PDB----PLEASEEEE HELP In-Reply-To: References:

Message-ID: Thanks João and Peter for your help. No. 3 makes perfect sense now. Sorry for coming back over and over again; I think I can go from here :-) (fingers crossed) 3. Merge both into one single structure object (refer to documentation) Martin On Mon, Oct 25, 2010 at 12:19 PM, João Rodrigues wrote: > Just make sure the chains are different, and then do pretty much what we > told you :) > > 1. Load protA and protB > 2. Rotate protB onto protA > 3. Merge both into one single structure object (refer to documentation) > 4. Save new structure object with PDBIO. > > Should work! > > J > From anaryin at gmail.com Mon Oct 25 12:31:35 2010 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Mon, 25 Oct 2010 18:31:35 +0200 Subject: [Biopython] writing a PDB----PLEASEEEE HELP In-Reply-To: References:

Message-ID: No problem at all. Just make sure you understand section 8.3 of the Bio.PDB FAQ. If you do, then it's pretty easy to do what you want! Best! J From biopython at maubp.freeserve.co.uk Mon Oct 25 12:39:14 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 25 Oct 2010 17:39:14 +0100 Subject: [Biopython] Getting involved In-Reply-To: References: Message-ID: On Mon, Oct 25, 2010 at 3:46 PM, Dragoslav Zaric wrote: > Dear friends, > > I will first introduce myself. My name is Dragoslav Zaric, I am from > country Serbia, capital city Belgrade. I am currently located in Abu > Dhabi, UAE, working as professional programmer. I have Master > degree in Astrophysics and I work as professional programmer > for more than 3 years. Hello Dragoslav, > I am not in the Bioinformatics field, but I have enough interest, > energy and will to contribute. > > I have read biopython manual, so I have good idea and overview of > biopython framework. > Also I have good knowledge of python and django framework. > > I wanna start contribute to biopython project so can you tell me is > there some planned work to be done ? > > At first I can get involved in programming tasks that are not totally > related to genetics, like parsing of files, accessing databases ... There are lots of things people are working on part time, e.g. http://biopython.org/wiki/Active_projects There may be some easy to solve open bugs/enhancements on bugzilla, http://bugzilla.open-bio.org/buglist.cgi?product=Biopython&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED You could also look at the (unfunded) Google Summer of Code project ideas - although those tend to be larger pieces of work. I would normally recommend you work on something directly useful to the biology you are working on. Clearly that does not apply here. I guess you want to do Python coding as a hobby? Can you program in C and are you familiar with the C/Python API? 
We will need to look at porting our C code from Python 2 to Python 3, and this is quite complicated. > Also I have started to read AMAZING book on NCBI site and after > finishing this book I am sure I can start to work on genetic problems > in biopython. This is the book: > > http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=genomes > > This also gave me idea how to start to contribute to biopython. Does > biopython have module just for searching NCBI bookshelf and books, > like library module ?? Maybe I can start to work on this module ?? I think you can use the NCBI Entrez API here with "Books" as the database (see Biopython module Bio.Entrez). You could try this, and maybe write a cookbook example for our wiki. Kind regards, Peter From matsen at fhcrc.org Mon Oct 25 12:54:12 2010 From: matsen at fhcrc.org (Erick Matsen) Date: Mon, 25 Oct 2010 09:54:12 -0700 Subject: [Biopython] programmer position available in Seattle Message-ID: Dear Biopython community-- We are a new group at the Fred Hutchinson Cancer Research Center in Seattle, looking for an experienced programmer to write python libraries and scripts. We would like to find someone who can engage with problems and work independently, but who is also happy to cooperatively hammer out a spec and discuss implementation details. No biological experience necessary, but scientific curiosity strictly required. Long-time linux hacker a big plus. We develop methods for the evolutionary analysis of next-generation DNA sequence data for HIV research and the study of the human microbiome (i.e. bacteria that live on and inside of us). Current projects you would be working on include developing tools for reproducible research with SCons (+ cluster integration), putting together simulations to validate methodology, and code for assembling annotated packages of biological sequence data. We love good tools-- our favorites include OCaml, python, and git. 
All finalized code will be open source, and you will be required to feed as much code as possible into biopython or other relevant projects. The work environment will be very interactive, and we want to keep on finding better ways to do what we do. If you think it would be best to reimplement our pipeline in NetLogo, we'll listen! On the other hand we're serious about helping biologists with their data and sometimes that just means turning the crank. Working at the "Hutch" is fantastic, with great benefits and a lovely campus next to Lake Union, walking distance from downtown. Powerful computing resources and a helpful IT staff await you. You can find out a bit more at: http://matsen.fhcrc.org/ http://github.com/matsen/ http://github.com/nhoffman/ Starting salary $75K, with more possible for a perfect fit with heaps of experience. Point us to some good code you've written! A CV would be helpful as well. For people not in the Seattle area: we will consider remote work for a fixed period with the idea of moving to Seattle after that period is over. We may also be interested in contract-based work for small projects, and let us know if you are interested in that. Please respond directly. Thank you, Erick From David.Lapointe at umassmed.edu Mon Oct 25 11:49:08 2010 From: David.Lapointe at umassmed.edu (Lapointe, David) Date: Mon, 25 Oct 2010 11:49:08 -0400 Subject: [Biopython] writing a PDB----PLEASEEEE HELP In-Reply-To: References:

Message-ID: Hi Martin, If you look at the PDB structure files based on NMR determinations, you'll see that they contain different models. Perhaps you can format the file that way. For example look at 2GDT.pdb David -----Original Message----- From: biopython-bounces at lists.open-bio.org [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Peter Cock Sent: Monday, October 25, 2010 11:33 AM To: martin djokovic Cc: biopython at lists.open-bio.org Subject: Re: [Biopython] writing a PDB----PLEASEEEE HELP On Mon, Oct 25, 2010 at 4:26 PM, martin djokovic wrote: > Hi Peter, > I want a new PDB with structures A and B superimposed so that I can see them > both at the same in the same file > > So at the end of the simulation/run? I would have A and B (original) and the > 'ans.pdb' with A as it was but B rotated and translated to be superimposed > on A in the same PDB > I want to try this first simple superimposition but actually I want to > connect A and B together to make a longer strand > The last residue of A and first residue of B are the same so I can use those > coordinates to rotate/translate B then connect to A > I can do that manually using SWISS PDB but I want to do it for many > structures and its time consuing. Won't that mean there would be a duplicate residue? i.e. The last residue of A and first residue of B are the same thing, but would be in the file twice. Anyway - that basic idea is you must create a Bio.PDB structure object with both A and B in it (perhaps as two chains in the same model), then write that to the PDB file. The details depend on how you want to do the combination - there is more than one way to represent A and B in the same PDB file (quite separate from how to do it in Biopython). Peter P.S. You could trying writing out two separate PDB files and try simply concatenating them... it might do what you want. 
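The concatenation idea in the P.S., combined with the multi-model layout of NMR files mentioned above, can be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the function name `merge_as_models` is hypothetical, it uses plain Python string handling rather than Bio.PDB, each input is assumed to hold a single model whose coordinates have already been rotated and translated, and header records are simply dropped.

```python
def merge_as_models(pdb_texts):
    """Combine several single-model PDB texts into one multi-model text.

    Hypothetical helper for illustration; it does not superimpose anything,
    it only wraps each input's coordinate records in MODEL/ENDMDL.
    """
    # Record types that carry coordinates or chain terminators.
    coord_records = ("ATOM", "HETATM", "ANISOU", "TER")
    out = []
    for serial, text in enumerate(pdb_texts, start=1):
        # MODEL record: name in columns 1-6, serial right-justified in 11-14.
        out.append("MODEL".ljust(10) + "%4d" % serial)
        for line in text.splitlines():
            if line.startswith(coord_records):
                out.append(line)
        out.append("ENDMDL")
    out.append("END")
    return "\n".join(out) + "\n"
```

For two files on disk, one would read each file's text, pass both strings in a list, and write the returned string to a third file. Because every structure sits in its own model, identical chain IDs and residue numbers in the inputs do not clash.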
_______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From zaricdragoslav at gmail.com Mon Oct 25 16:32:46 2010 From: zaricdragoslav at gmail.com (Dragoslav Zaric) Date: Tue, 26 Oct 2010 00:32:46 +0400 Subject: [Biopython] Getting involved In-Reply-To: References: Message-ID: Dear Peter, I think that this: "Can you program in C and are you familiar with the C/Python API? We will need to look at porting our C code from Python 2 to Python 3, and this is quite complicated." is best idea for start. I can code in C, and have experience both with python 2.7 and 3. Will read tomorrow about C/Python API. Kind regards On Mon, Oct 25, 2010 at 8:39 PM, Peter wrote: > On Mon, Oct 25, 2010 at 3:46 PM, Dragoslav Zaric > wrote: >> Dear friends, >> >> I will first introduce myself. My name is Dragoslav Zaric, I am from >> country Serbia, capital city Belgrade. I am currently located in Abu >> Dhabi, UAE, working as professional programmer. I have Master >> degree in Astrophysics and I work as professional programmer >> for more than 3 years. > > Hello Dragoslav, > >> I am not in the Bioinformatics field, but I have enough interest, >> energy and will to contribute. >> >> I have read biopython manual, so I have good idea and overview of >> biopython framework. >> Also I have good knowledge of python and django framework. >> >> I wanna start contribute to biopython project so can you tell me is >> there some planned work to be done ? >> >> At first I can get involved in programming tasks that are not totally >> related to genetics, like parsing of files, accessing databases ... > > There are lots of things people are working on part time, e.g. 
> http://biopython.org/wiki/Active_projects > > There may be some easy to solve open bugs/enhancements > on bugzilla, > http://bugzilla.open-bio.org/buglist.cgi?product=Biopython&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED > > You could also look at the (unfunded) Google Summer of Code > project ideas - although those tend to be larger pieces of work. > > I would normally recommend you work on something directly > useful to the biology you are working on. Clearly that does not > apply here. I guess you want to do Python coding as a hobby? > > Can you program in C and are you familiar with the C/Python > API? We will need to look at porting our C code from Python 2 > to Python 3, and this is quite complicated. > >> Also I have started to read AMAZING book on NCBI site and after >> finishing this book I am sure I can start to work on genetic problems >> in biopython. This is the book: >> >> http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=genomes >> >> This also gave me idea how to start to contribute to biopython. Does >> biopython have module just for searching NCBI bookshelf and books, >> like library module ?? Maybe I can start to work on this module ?? > > I think you can use the NCBI Entrez API here with "Books" as the > database (see Biopython module Bio.Entrez). You could try this, > and maybe write a cookbook example for our wiki. > > Kind regards, > > Peter > -- Dragoslav Zaric Professional Programmer MSc Astrophysics From biopython at maubp.freeserve.co.uk Mon Oct 25 17:28:24 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 25 Oct 2010 22:28:24 +0100 Subject: [Biopython] Getting involved In-Reply-To: References:

Message-ID: On Mon, Oct 25, 2010 at 9:32 PM, Dragoslav Zaric wrote: > Dear Peter, > > I think that this: > > "Can you program in C and are you familiar with the C/Python > API? We will need to look at porting our C code from Python 2 > to Python 3, and this is quite complicated." > > is best idea for start. I can code in C, and have experience > both with python 2.7 and 3. Will read tomorrow about C/Python > API. > > Kind regards Hi Dragoslav, I'm glad you sound enthusiastic, and I hope you can make some progress... Our plan (following what the NumPy project are doing) is to have a single code base targeting Python 2.x. All the Python code is automatically converted using the 2to3 script into Python 3. There are a few special cases, but that work is mostly done now. All the C code will need to use #ifdef statements to make the same C file work on both Python 2 and Python 3. The bad news is that the basic API for writing C extension modules for Python has changed. What I suggest you do first, is make sure you can get the latest Biopython source code from git, compile it under Python 2, and run the unit tests. Then try 2to3 and running the tests under Python 3 (see the README file). Next I would trying updating one of the smaller C modules in Biopython to work on Python 3. You'll need to edit our setup.py to compile what you are working on (currently we compile none of the C code on Python 3). I don't yet have a feel for how much work this will be. Please sign up to the biopython-dev mailing list where we can discuss things in more detail. The main list is more for user support and general discussion. Thanks, and good luck! Peter From zaricdragoslav at gmail.com Mon Oct 25 19:34:29 2010 From: zaricdragoslav at gmail.com (Dragoslav Zaric) Date: Tue, 26 Oct 2010 03:34:29 +0400 Subject: [Biopython] Getting involved In-Reply-To: References:

Message-ID: Dear Peter, I have subscribed to biopython-dev mailing list and I have downloaded source code with git. kind regards On Tue, Oct 26, 2010 at 1:28 AM, Peter wrote: > On Mon, Oct 25, 2010 at 9:32 PM, Dragoslav Zaric wrote: >> Dear Peter, >> >> I think that this: >> >> "Can you program in C and are you familiar with the C/Python >> API? We will need to look at porting our C code from Python 2 >> to Python 3, and this is quite complicated." >> >> is best idea for start. I can code in C, and have experience >> both with python 2.7 and 3. Will read tomorrow about C/Python >> API. >> >> Kind regards > > Hi Dragoslav, > > I'm glad you sound enthusiastic, and I hope you can make > some progress... > > Our plan (following what the NumPy project are doing) is > to have a single code base targeting Python 2.x. > > All the Python code is automatically converted using the > 2to3 script into Python 3. There are a few special cases, > but that work is mostly done now. > > All the C code will need to use #ifdef statements to make > the same C file work on both Python 2 and Python 3. The > bad news is that the basic API for writing C extension > modules for Python has changed. > > What I suggest you do first, is make sure you can get > the latest Biopython source code from git, compile it > under Python 2, and run the unit tests. Then try 2to3 > and running the tests under Python 3 (see the README > file). > > Next I would trying updating one of the smaller C > modules in Biopython to work on Python 3. You'll > need to edit our setup.py to compile what you are > working on (currently we compile none of the C > code on Python 3). I don't yet have a feel for how > much work this will be. > > Please sign up to the biopython-dev mailing list where > we can discuss things in more detail. The main list is > more for user support and general discussion. > > Thanks, and good luck! 
> > Peter > -- Dragoslav Zaric Professional Programmer MSc Astrophysics From fglaser at technion.ac.il Tue Oct 26 01:43:09 2010 From: fglaser at technion.ac.il (Fabian Glaser) Date: Tue, 26 Oct 2010 07:43:09 +0200 Subject: [Biopython] PDB bfactor In-Reply-To: References:

Message-ID: Hi Joao, Yes I know that there are only 5 characters allotted, I was hoping to find a way to bypass that... I need more accuracy. Thanks a lot, Fabian On Mon, Oct 25, 2010 at 5:14 PM, João Rodrigues wrote: > Hello Fabian, > > Please have a look here: > http://www.wwpdb.org/documentation/format32/sect9.html#ATOM > > The PDB format is quite annoying because it's basically space-based. So, for > the B factor, you have 5 characters alloted meaning it is usually YY.XX > > Therefore, although some programs might accept longer B factors, the Parser > will reject it. > > Best! > > João Rodrigues > -- ------------------------------------------------------ Fabian Glaser, PhD Bioinformatics Knowledge Unit, The Lorry I. Lokey Interdisciplinary Center for Life Sciences and Engineering Technion - Israel Institute of Technology Haifa 32000, ISRAEL E-mail: fglaser at technion.ac.il Tel: +972-4-8293701 Fax: +972-4-8225153 From eric.talevich at gmail.com Tue Oct 26 12:24:19 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 26 Oct 2010 12:24:19 -0400 Subject: [Biopython] PDB bfactor In-Reply-To: References:

Message-ID: Hi Fabian, Since it's a limitation of the wwPDB format, you could try a different format, e.g. mmCIF or PDBML. http://pdbml.pdb.org/ Or, if the extra accuracy is only needed internally in your program, you could store the B factors, or even the whole structure's data, in a Python pickle or some other non-PDB serialization. Or a database. -Eric On Tue, Oct 26, 2010 at 1:43 AM, Fabian Glaser wrote: > Hi Joao, > > Yes I know that there are only 5 characters alloted, I was hoping to > find a way to bypass that... I need more accuracy. > > Thanks a lot, > > Fabian > > On Mon, Oct 25, 2010 at 5:14 PM, João Rodrigues wrote: > > Hello Fabian, > > > > Please have a look here: > > http://www.wwpdb.org/documentation/format32/sect9.html#ATOM > > > > The PDB format is quite annoying because it's basically space-based. So, > for > > the B factor, you have 5 characters alloted meaning it is usually YY.XX > > > > Therefore, although some programs might accept longer B factors, the > Parser > > will reject it. > > > > Best! > > > > João Rodrigues > > > > > > -- > ------------------------------------------------------ > Fabian Glaser, PhD > Bioinformatics Knowledge Unit, > The Lorry I. Lokey Interdisciplinary > Center for Life Sciences and Engineering > Technion - Israel Institute of Technology > Haifa 32000, ISRAEL > > E-mail: fglaser at technion.ac.il > Tel: +972-4-8293701 > Fax: +972-4-8225153 > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From martin.djokovic at gmail.com Tue Oct 26 13:05:59 2010 From: martin.djokovic at gmail.com (martin djokovic) Date: Tue, 26 Oct 2010 13:05:59 -0400 Subject: [Biopython] writing a PDB----PLEASEEEE HELP In-Reply-To: References:

Message-ID: Hello David and everyone else,

Yes David, at the moment I am actually doing that: formatting the PDB into 2 models, then superimposing them. Merging the structures when they are in 2 different PDB files is not working. I tried the following:

    from Bio.PDB.Model import Model
    model1=Model(2ndModel)
    structure1.add(model1)

It's not working. I think I might have to better understand the "structure" object. Is there a better way to merge two structures? Otherwise I will write the 2 PDB files as models into a single PDB and then superimpose.

Another question: maybe some of you use VMD. If I open 1JOY in VMD I can only see one of the models; I need to scroll using the scroll bar to see the others, and I cannot see them all at once. So for running the Warwick code and seeing its effect I use RASMOL. Is there a way to look at all the models at the same time without scrolling in VMD? I prefer VMD to RASMOL, so just wondering about that.

Thanks everyone for all the help.

On Mon, Oct 25, 2010 at 11:49 AM, Lapointe, David < David.Lapointe at umassmed.edu> wrote: > Hi Martin, > > If you look at the PDB structure files based on NMR determinations, you'll > see that they contain different models. Perhaps you can format the file that > way.
For example look at 2GDT.pdb > > David > > -----Original Message----- > From: biopython-bounces at lists.open-bio.org [mailto: > biopython-bounces at lists.open-bio.org] On Behalf Of Peter Cock > Sent: Monday, October 25, 2010 11:33 AM > To: martin djokovic > Cc: biopython at lists.open-bio.org > Subject: Re: [Biopython] writing a PDB----PLEASEEEE HELP > > On Mon, Oct 25, 2010 at 4:26 PM, martin djokovic > wrote: > > Hi Peter, > > I want a new PDB with structures A and B superimposed so that I can see > them > > both at the same in the same file > > > > So at the end of the simulation/run I would have A and B (original) and > the > > 'ans.pdb' with A as it was but B rotated and translated to be > superimposed > > on A in the same PDB > > I want to try this first simple superimposition but actually I want to > > connect A and B together to make a longer strand > > The last residue of A and first residue of B are the same so I can use > those > > coordinates to rotate/translate B then connect to A > > I can do that manually using SWISS PDB but I want to do it for many > > structures and its time consuing. > > Won't that mean there would be a duplicate residue? i.e. The last residue > of A and first residue of B are the same thing, but would be in the file > twice. > > Anyway - that basic idea is you must create a Bio.PDB structure object > with both A and B in it (perhaps as two chains in the same model), then > write that to the PDB file. The details depend on how you want to do > the combination - there is more than one way to represent A and B > in the same PDB file (quite separate from how to do it in Biopython). > > Peter > > P.S. You could trying writing out two separate PDB files and try simply > concatenating them... it might do what you want. 
> > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From eric.talevich at gmail.com Tue Oct 26 13:25:44 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 26 Oct 2010 13:25:44 -0400 Subject: [Biopython] writing a PDB----PLEASEEEE HELP In-Reply-To: References:

Message-ID: Hi Martin,

It sounds like VMD displays just one model at a time, unless you make a movie out of it. You could get around that by writing the two structures as two different chains within the same model.

But if the goal is just to view two structures superimposed, you'd be better off scripting your molecular viewer of choice to do it for you.

In PyMOL, to view 1ABC and 2XYZ (made up), write this to a file called 1ABC_vs_2XYZ.pml:

    # This is a generated PyMOL script
    load /path/to/1ABC.pdb, 1ABC
    load /path/to/2XYZ.pdb, 2XYZ
    align 1ABC, 2XYZ

    # Optional viewing tweaks
    hide all
    show cartoon, 1ABC
    show cartoon, 2XYZ
    reset

Then launch the viewer from the command line:

    pymol 1ABC_vs_2XYZ.pml

If the two structures don't have the same peptide sequence, you'll have better results with the "fit" function instead of "align": http://www.pymolwiki.org/index.php/Fit

I know VMD has something similar, but I don't have an example offhand.

Best, Eric

On Tue, Oct 26, 2010 at 1:05 PM, martin djokovic wrote: > Hello David and everyone else, > > Yes David at the moment I am actually doing that-formatting the PDB into 2 > models then superimposing them. > Merging the structures when they are in 2 differant PDB files is not > working. > I tried the following: > from Bio.PDB.Model import Model > model1=Model(2ndModel) > structure1.add(model1) > Its not working-I think I might have to better understand the object > "structure" > Is there a better way to merge two structures? Else I will write the 2 pdb > files as models into a single PDB then superimpose. > > Another question-maybe some of you use VMD-if I open 1JOY in VMD I can only > see one of the models, I need to scroll using the scroll bar to see the > others-I cannot see them all at once. > So for running the warwick code and seeing its eefect I use RASMOL, is > there > a way to look at all the models in the same time without scrolling in VMD? > I > prefer VMD to RASMOL so just wondering about that.
> > Thanks everyone foe all the help. > > > On Mon, Oct 25, 2010 at 11:49 AM, Lapointe, David < > David.Lapointe at umassmed.edu> wrote: > > > Hi Martin, > > > > If you look at the PDB structure files based on NMR determinations, > you'll > > see that they contain different models. Perhaps you can format the file > that > > way. For example look at 2GDT.pdb > > > > David > > > > -----Original Message----- > > From: biopython-bounces at lists.open-bio.org [mailto: > > biopython-bounces at lists.open-bio.org] On Behalf Of Peter Cock > > Sent: Monday, October 25, 2010 11:33 AM > > To: martin djokovic > > Cc: biopython at lists.open-bio.org > > Subject: Re: [Biopython] writing a PDB----PLEASEEEE HELP > > > > On Mon, Oct 25, 2010 at 4:26 PM, martin djokovic > > wrote: > > > Hi Peter, > > > I want a new PDB with structures A and B superimposed so that I can see > > them > > > both at the same in the same file > > > > > > So at the end of the simulation/run I would have A and B (original) > and > > the > > > 'ans.pdb' with A as it was but B rotated and translated to be > > superimposed > > > on A in the same PDB > > > I want to try this first simple superimposition but actually I want to > > > connect A and B together to make a longer strand > > > The last residue of A and first residue of B are the same so I can use > > those > > > coordinates to rotate/translate B then connect to A > > > I can do that manually using SWISS PDB but I want to do it for many > > > structures and its time consuing. > > > > Won't that mean there would be a duplicate residue? i.e. The last residue > > of A and first residue of B are the same thing, but would be in the file > > twice. > > > > Anyway - that basic idea is you must create a Bio.PDB structure object > > with both A and B in it (perhaps as two chains in the same model), then > > write that to the PDB file. 
The details depend on how you want to do > > the combination - there is more than one way to represent A and B > > in the same PDB file (quite separate from how to do it in Biopython). > > > > Peter > > > > P.S. You could trying writing out two separate PDB files and try simply > > concatenating them... it might do what you want. > > > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > > > > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From akooser at unm.edu Tue Oct 26 16:37:10 2010 From: akooser at unm.edu (Ara Kooser) Date: Tue, 26 Oct 2010 14:37:10 -0600 Subject: [Biopython] Lineage from GenBank files Question In-Reply-To: References: <17B3DBDB-9F36-4C8F-85AC-42A61ED23DEA@unm.edu> Message-ID: <707BBD96-B4E6-41E9-8C22-968A3EA03A1F@unm.edu> Peter, Thank you for your reply. I was able to figure out the code with your help. I had another question since I've been looking through the documentation on the GenPept files. I want to get the accession number. > > > > Taking an example, > > LOCUS NC_000932 154478 bp DNA circular PLN > 15-APR-2009 > DEFINITION Arabidopsis thaliana chloroplast, complete genome. > ACCESSION NC_000932 > VERSION NC_000932.1 GI:7525012 > DBLINK Project:116 I am guessing that this is also read into the records file. Is this a header so something like header.annotations? Thank you! Ara From m.stantoncook at gmail.com Wed Oct 27 01:44:54 2010 From: m.stantoncook at gmail.com (Mitchell Stanton-Cook) Date: Wed, 27 Oct 2010 16:44:54 +1100 Subject: [Biopython] writing a PDB----PLEASEEEE HELP In-Reply-To: References:

Message-ID: <4CC7BC56.5090508@gmail.com>

Hi,

I was playing around with writing additional residues to PDB files using biopython last week.

I've included a simple snippet that gives you the basic idea of adding residues/atoms...

Mitch

for model in struct:
    for chain in model:
        for i in range(0, len(t_l)):
            # create new residue
            new_res = Residue(t_l[i], 'TSR', '')
            for j in range(0, 4):
                #populate the residue with atoms
                cur_c = dat[i+fill+j]
                new_at = Atom(n_l[j], cur_c, 0, 1, ' ', n_l[j], \
                              int(str(90)+str(i)+str(j)), element='X')
                new_res.add(new_at)
            fill = fill +3
            chain.add(new_res)  # Add the residue to the chain
io=PDBIO()
io.set_structure(s1)
io.save(outname+'.pdb')

On 27/10/10 04:25, Eric Talevich wrote:
> Hi Martin,
>
> It sounds like VMD displays just one model at a time, unless you make a
> movie out of it. You could get around that by writing the two structures as
> two different chains within the same model.
>
> But if the goal is just to view two structures superimposed, you'd be better
> off scripting your molecular viewer of choice to do it for you.
>
> In PyMOL, to view 1ABC and 2XYZ (made up), write this to a file called
> 1ABC_vs_2XYZ.pml:
>
> # This is a generated PyMOL script
> load /path/to/1ABC.pdb, 1ABC
> load /path/to/2XYZ.pdb, 2XYZ
> align 1ABC, 2XYZ
>
> # Optional viewing tweaks
> hide all
> show cartoon, 1ABC
> show cartoon, 2XYZ
> reset
>
> Then launch the viewer from the command line:
> pymol 1ABC_vs_2XYZ.pml
>
> If the two structures don't have the same peptide sequence, you'll have
> better results with the "fit" function instead of "align":
> http://www.pymolwiki.org/index.php/Fit
>
> I know VMD has something similar, but I don't have an example offhand.
>
> Best,
> Eric
>
> On Tue, Oct 26, 2010 at 1:05 PM, martin djokovic wrote:
>
>> Hello David and everyone else,
>>
>> Yes David at the moment I am actually doing that-formatting the PDB into 2
>> models then superimposing them.
>> Merging the structures when they are in 2 differant PDB files is not >> working. >> I tried the following: >> from Bio.PDB.Model import Model >> model1=Model(2ndModel) >> structure1.add(model1) >> Its not working-I think I might have to better understand the object >> "structure" >> Is there a better way to merge two structures? Else I will write the 2 pdb >> files as models into a single PDB then superimpose. >> >> Another question-maybe some of you use VMD-if I open 1JOY in VMD I can only >> see one of the models, I need to scroll using the scroll bar to see the >> others-I cannot see them all at once. >> So for running the warwick code and seeing its eefect I use RASMOL, is >> there >> a way to look at all the models in the same time without scrolling in VMD? >> I >> prefer VMD to RASMOL so just wondering about that. >> >> Thanks everyone foe all the help. >> >> >> On Mon, Oct 25, 2010 at 11:49 AM, Lapointe, David< >> David.Lapointe at umassmed.edu> wrote: >> >> >>> Hi Martin, >>> >>> If you look at the PDB structure files based on NMR determinations, >>> >> you'll >> >>> see that they contain different models. Perhaps you can format the file >>> >> that >> >>> way. 
For example look at 2GDT.pdb >>> >>> David >>> >>> -----Original Message----- >>> From: biopython-bounces at lists.open-bio.org [mailto: >>> biopython-bounces at lists.open-bio.org] On Behalf Of Peter Cock >>> Sent: Monday, October 25, 2010 11:33 AM >>> To: martin djokovic >>> Cc: biopython at lists.open-bio.org >>> Subject: Re: [Biopython] writing a PDB----PLEASEEEE HELP >>> >>> On Mon, Oct 25, 2010 at 4:26 PM, martin djokovic >>> wrote: >>> >>>> Hi Peter, >>>> I want a new PDB with structures A and B superimposed so that I can see >>>> >>> them >>> >>>> both at the same in the same file >>>> >>>> So at the end of the simulation/run I would have A and B (original) >>>> >> and >> >>> the >>> >>>> 'ans.pdb' with A as it was but B rotated and translated to be >>>> >>> superimposed >>> >>>> on A in the same PDB >>>> I want to try this first simple superimposition but actually I want to >>>> connect A and B together to make a longer strand >>>> The last residue of A and first residue of B are the same so I can use >>>> >>> those >>> >>>> coordinates to rotate/translate B then connect to A >>>> I can do that manually using SWISS PDB but I want to do it for many >>>> structures and its time consuing. >>>> >>> Won't that mean there would be a duplicate residue? i.e. The last residue >>> of A and first residue of B are the same thing, but would be in the file >>> twice. >>> >>> Anyway - that basic idea is you must create a Bio.PDB structure object >>> with both A and B in it (perhaps as two chains in the same model), then >>> write that to the PDB file. The details depend on how you want to do >>> the combination - there is more than one way to represent A and B >>> in the same PDB file (quite separate from how to do it in Biopython). >>> >>> Peter >>> >>> P.S. You could trying writing out two separate PDB files and try simply >>> concatenating them... it might do what you want. 
>>> >>> _______________________________________________ >>> Biopython mailing list - Biopython at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython >>> >>> >>> >>> _______________________________________________ >>> Biopython mailing list - Biopython at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython >>> >>> >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> >> > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From biopython at maubp.freeserve.co.uk Wed Oct 27 04:56:59 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 27 Oct 2010 09:56:59 +0100 Subject: [Biopython] Lineage from GenBank files Question In-Reply-To: <707BBD96-B4E6-41E9-8C22-968A3EA03A1F@unm.edu> References: <17B3DBDB-9F36-4C8F-85AC-42A61ED23DEA@unm.edu> <707BBD96-B4E6-41E9-8C22-968A3EA03A1F@unm.edu> Message-ID: On Tue, Oct 26, 2010 at 9:37 PM, Ara Kooser wrote: > Peter, > > ? Thank you for your reply. I was able to figure out the code with your > help. I had another question since I've been looking through the > documentation on the GenPept files. I want to get the accession number. > > ... > > I am guessing that this is also read into the records file. Is this a header > so something like header.annotations? Well... something like that I suppose. Have you read the chapter in the tutorial on the SeqRecord object? Each sequence record in the GenBank file (i.e. LOCUS line to // line) becomes a SeqRecord object. Most of the header ends up in the SeqRecord's annotations dictionary - some special fields are used for the SeqRecord name, id, description and dbxrefs (database cross references). The feature table becomes a list of SeqFeature objects. Did you look at the annotations dictionary? 
>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_000932.gb", "genbank")
>>> print record.annotations.keys()
['comment', 'sequence_version', 'source', 'taxonomy', 'keywords', 'references', 'accessions', 'data_file_division', 'date', 'organism', 'gi']
>>> print record.annotations
{'comment': ..., 'gi': '7525012'}
>>> print record.annotations['gi']
7525012
>>> print record.annotations['accessions']
['NC_000932']

Also,

>>> record.name
'NC_000932'
>>> record.id
'NC_000932.1'

Peter

From martin.djokovic at gmail.com Wed Oct 27 09:02:10 2010 From: martin.djokovic at gmail.com (martin djokovic) Date: Wed, 27 Oct 2010 14:02:10 +0100 Subject: [Biopython] writing a PDB----PLEASEEEE HELP In-Reply-To: <4CC7BC56.5090508@gmail.com> References:

<4CC7BC56.5090508@gmail.com> Message-ID: Mitch, and everyone else. Thanks sooooo much for all your help guys. I really appreciate this. Martin On Wed, Oct 27, 2010 at 6:44 AM, Mitchell Stanton-Cook < m.stantoncook at gmail.com> wrote: > Hi, > > I was playing around with writing additional residues to PDB files using > biopython last week. > > I've included a simple snippet that gives you the basic idea of adding > residues/atoms... > > Mitch > > > for model in struct: > for chain in model: > for i in range(0, len(t_l)): > # create new residue > new_res = Residue(t_l[i], 'TSR', '') > for j in range(0, 4): > #populate the residue with atoms > cur_c = dat[i+fill+j] > new_at = Atom(n_l[j], cur_c, 0, 1, ' ', n_l[j], > \ > int(str(90)+str(i)+str(j)), > element='X') > new_res.add(new_at) > fill = fill +3 > chain.add(new_res) > # Add the residue to the chain > io=PDBIO() > io.set_structure(s1) > io.save(outname+'.pdb') > > > > On 27/10/10 04:25, Eric Talevich wrote: > >> Hi Martin, >> >> It sounds like VMD displays just one model at a time, unless you make a >> movie out of it. You could get around that by writing the two structures >> as >> two different chains within the same model. >> >> But if the goal is just to view two structures superimposed, you'd be >> better >> off scripting your molecular viewer of choice to do it for you. 
>> >> In PyMOL, to view 1ABC and 2XYZ (made up), write this to a file called >> 1ABC_vs_2XYZ.pml: >> >> # This is a generated PyMOL script >> load /path/to/1ABC.pdb, 1ABC >> load /path/to/2XYZ.pdb, 2XYZ >> align 1ABC, 2XYZ >> >> # Optional viewing tweaks >> hide all >> show cartoon, 1ABC >> show cartoon, 2XYZ >> reset >> >> >> Then launch the viewer from the command line: >> pymol 1ABC_vs_2XYZ.pml >> >> If the two structures don't have the same peptide sequence, you'll have >> better results with the "fit" function instead of "align": >> http://www.pymolwiki.org/index.php/Fit >> >> I know VMD has something similar, but I don't have an example offhand. >> >> Best, >> Eric >> >> >> On Tue, Oct 26, 2010 at 1:05 PM, martin djokovic >> wrote: >> >> >> >>> Hello David and everyone else, >>> >>> Yes David at the moment I am actually doing that-formatting the PDB into >>> 2 >>> models then superimposing them. >>> Merging the structures when they are in 2 differant PDB files is not >>> working. >>> I tried the following: >>> from Bio.PDB.Model import Model >>> model1=Model(2ndModel) >>> structure1.add(model1) >>> Its not working-I think I might have to better understand the object >>> "structure" >>> Is there a better way to merge two structures? Else I will write the 2 >>> pdb >>> files as models into a single PDB then superimpose. >>> >>> Another question-maybe some of you use VMD-if I open 1JOY in VMD I can >>> only >>> see one of the models, I need to scroll using the scroll bar to see the >>> others-I cannot see them all at once. >>> So for running the warwick code and seeing its eefect I use RASMOL, is >>> there >>> a way to look at all the models in the same time without scrolling in >>> VMD? >>> I >>> prefer VMD to RASMOL so just wondering about that. >>> >>> Thanks everyone foe all the help. 
>>> >>> >>> On Mon, Oct 25, 2010 at 11:49 AM, Lapointe, David< >>> David.Lapointe at umassmed.edu> wrote: >>> >>> >>> >>>> Hi Martin, >>>> >>>> If you look at the PDB structure files based on NMR determinations, >>>> >>>> >>> you'll >>> >>> >>>> see that they contain different models. Perhaps you can format the file >>>> >>>> >>> that >>> >>> >>>> way. For example look at 2GDT.pdb >>>> >>>> David >>>> >>>> -----Original Message----- >>>> From: biopython-bounces at lists.open-bio.org [mailto: >>>> biopython-bounces at lists.open-bio.org] On Behalf Of Peter Cock >>>> Sent: Monday, October 25, 2010 11:33 AM >>>> To: martin djokovic >>>> Cc: biopython at lists.open-bio.org >>>> Subject: Re: [Biopython] writing a PDB----PLEASEEEE HELP >>>> >>>> On Mon, Oct 25, 2010 at 4:26 PM, martin djokovic >>>> wrote: >>>> >>>> >>>>> Hi Peter, >>>>> I want a new PDB with structures A and B superimposed so that I can see >>>>> >>>>> >>>> them >>>> >>>> >>>>> both at the same in the same file >>>>> >>>>> So at the end of the simulation/run I would have A and B (original) >>>>> >>>>> >>>> and >>> >>> >>>> the >>>> >>>> >>>>> 'ans.pdb' with A as it was but B rotated and translated to be >>>>> >>>>> >>>> superimposed >>>> >>>> >>>>> on A in the same PDB >>>>> I want to try this first simple superimposition but actually I want to >>>>> connect A and B together to make a longer strand >>>>> The last residue of A and first residue of B are the same so I can use >>>>> >>>>> >>>> those >>>> >>>> >>>>> coordinates to rotate/translate B then connect to A >>>>> I can do that manually using SWISS PDB but I want to do it for many >>>>> structures and its time consuing. >>>>> >>>>> >>>> Won't that mean there would be a duplicate residue? i.e. The last >>>> residue >>>> of A and first residue of B are the same thing, but would be in the file >>>> twice. 
>>>> >>>> Anyway - that basic idea is you must create a Bio.PDB structure object >>>> with both A and B in it (perhaps as two chains in the same model), then >>>> write that to the PDB file. The details depend on how you want to do >>>> the combination - there is more than one way to represent A and B >>>> in the same PDB file (quite separate from how to do it in Biopython). >>>> >>>> Peter >>>> >>>> P.S. You could trying writing out two separate PDB files and try simply >>>> concatenating them... it might do what you want. >>>> >>>> _______________________________________________ >>>> Biopython mailing list - Biopython at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/biopython >>>> >>>> >>>> >>>> _______________________________________________ >>>> Biopython mailing list - Biopython at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/biopython >>>> >>>> >>>> >>> _______________________________________________ >>> Biopython mailing list - Biopython at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython >>> >>> >>> >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> >> > > From akooser at unm.edu Wed Oct 27 09:51:19 2010 From: akooser at unm.edu (Ara Kooser) Date: Wed, 27 Oct 2010 07:51:19 -0600 Subject: [Biopython] Lineage from GenBank files Question In-Reply-To: References: <17B3DBDB-9F36-4C8F-85AC-42A61ED23DEA@unm.edu> <707BBD96-B4E6-41E9-8C22-968A3EA03A1F@unm.edu> Message-ID: <0666BDD2-BC73-40ED-953C-4B6276DB3F29@unm.edu> Peter, I went through the docs on the wiki for SeqRecords. I realized I forgot to run the "print record.annotations" to get the correct words. Thank you for your help! Ara > > Did you look at the annotations dictionary? 
>
>>>> from Bio import SeqIO
>>>> record = SeqIO.read("NC_000932.gb", "genbank")
>>>> print record.annotations.keys()
> ['comment', 'sequence_version', 'source', 'taxonomy', 'keywords',
> 'references', 'accessions', 'data_file_division', 'date', 'organism',
> 'gi']
>>>> print record.annotations
> {'comment': ..., 'gi': '7525012'}
>>>> print record.annotations['gi']
> 7525012
>>>> print record.annotations['accessions']
> ['NC_000932']

From nje5 at georgetown.edu Wed Oct 27 10:27:56 2010 From: nje5 at georgetown.edu (Nathan Edwards) Date: Wed, 27 Oct 2010 10:27:56 -0400 Subject: [Biopython] writing a PDB----PLEASEEEE HELP In-Reply-To: References: Message-ID: <4CC836EC.8030300@georgetown.edu>

I needed (well decided that I wanted) to figure this out for the Bio.PDB lecture in my Python/Bioinformatics course a couple of weeks ago.

import Bio.PDB

parser = Bio.PDB.PDBParser()
structure1 = parser.get_structure("2WFI","2WFI.pdb")
structure2 = parser.get_structure("2GW2","2GW2.pdb")

# Quickly/easily extract amino-acid residue chains
ppb=Bio.PDB.PPBuilder()

# Figure out how the query and subject peptides correspond...
# Done manually based on sequence alignment here, but non-trivial in
# general.
# query has an extra residue at the front
# subject has two extra residues at the back
# Assume first peptide is the right one in each case,
# (presumes) single chain models.
query = ppb.build_peptides(structure1)[0][1:]
subject = ppb.build_peptides(structure2)[0][:-2]

# Get C alpha atoms for each
qatoms = [ r['CA'] for r in query ]
satoms = [ r['CA'] for r in subject ]

# Superimpose...
superimposer = Bio.PDB.Superimposer()
superimposer.set_atoms(qatoms, satoms)
print "Query and subject superimposed, RMS:", superimposer.rms

# Apply transformation to structure2
superimposer.apply(structure2.get_atoms())

# Magic - write two structures to one file as two models
# This PDBIO usage is not documented, but can be found
# embedded in Bio.PDB.PDBIO.py
#
outfile=open("out.pdb", "w")
io=Bio.PDB.PDBIO(1)
io.set_structure(structure1)
io.save(outfile)
io.set_structure(structure2)
io.save(outfile, write_end=1)
outfile.close()

I wanted to do this in order to visualize multiple proteins' alignments in PyMol, which I do not know well. Perhaps there is a PyMol specific way to show two PDB files (pre-aligned or aligned by PyMol) to visualize residues where two chains' structures do not line up.

It is possible to use the above approach, but it is necessary to set a special PyMol attribute to see both models. Load in the out.pdb file in PyMol, select menu item: Setting -> "Edit All..." and set "all_states" to "on" (second from the top).

Hope this helps,

- n

--
Dr. Nathan Edwards                      nje5 at georgetown.edu
Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center
Room 1215, Harris Building          Room 347, Basic Science
3300 Whitehaven St, NW              3900 Reservoir Road, NW
Washington DC 20007                 Washington DC 20007
Phone: 202-687-7042                 Phone: 202-687-1618
Fax: 202-687-0057                   Fax: 202-687-7186

From saikari78 at gmail.com Fri Oct 29 07:26:30 2010 From: saikari78 at gmail.com (saikari keitele) Date: Fri, 29 Oct 2010 12:26:30 +0100 Subject: [Biopython] Entrez.efetch problem when querying pccompound database Message-ID:

Hi,

I'm using BioPython to query the NCBI pccompound database.
I'm trying to retrieve the molecular weight of a compound given its InChIKey.
Getting the ID of the compound with esearch works fine. For instance:

Entrez.esearch(db="pccompound", term='"BSYNRYMUTXBXSQ-UHFFFAOYSA-N"[InChIKey]')

However, when I try to retrieve the record's content with efetch from the ID returned by esearch, like this:

Entrez.efetch(db="pcassay", id="2244")

I get the following response:

++++++++++

Error occurred: Report 'ASN1' not found in 'pccompound' presentation

db=pccompound
query_key=
report=
dispstart=
dispmax=
mode=html
WebEnv=

pmfetch need params:

(id=NNNNNN[,NNNN,etc]) or (query_key=NNN, where NNN - number in the history, 0 - clipboard content for current database)

db=db_name (mandatory)

report=[docsum, brief, abstract, citation, medline, asn.1, mlasn1, uilist, sgml, gen] (Optional; default is asn.1)

mode=[html, file, text, asn.1, xml] (Optional; default is html)

dispstart - first element to display, from 0 to count - 1, (Optional; default is 0)

dispmax - number of items to display (Optional; default is all elements, from dispstart)

See help.
++++++++++++++++++++++++++++++++++++++++++++

I've tried to use other return types and return modes, like for instance

Entrez.efetch(db="pcassay", id="2244", rettype="abstract", retmode="text")

but I have not succeeded in retrieving this compound's record's content.
Many thanks for any help on how to retrieve information on a compound from pccompound.

Best wishes

Saikari

From biopython at maubp.freeserve.co.uk Fri Oct 29 08:13:36 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 29 Oct 2010 13:13:36 +0100 Subject: [Biopython] Entrez.efetch problem when querying pccompound database In-Reply-To: References: Message-ID:

On Fri, Oct 29, 2010 at 12:26 PM, saikari keitele wrote:
> Hi,
>
> I'm using BioPython to query the NCBI pccompound database.
> I'm trying to retrieve the molecular weight of a compound given its
> InChIKey.
> Getting the ID of the compound with esearch works fine. For instance:
>
> Entrez.esearch(db="pccompound",
> term='"BSYNRYMUTXBXSQ-UHFFFAOYSA-N"[InChIKey]')
>
> However, when I try to retrieve the record's content with efetch from the ID
> returned by esearch, like this:
>
> Entrez.efetch(db="pcassay", id="2244")
>
> I get the following response:
> ...
> Error occurred: Report 'ASN1' not found in 'pccompound' presentation
> ...
>
> I've tried to use other return types and return modes, like for instance
>
> Entrez.efetch(db="pcassay", id="2244", rettype="abstract", retmode="text")
>
> but I have not succeeded in retrieving this compound's record's content.
> Many thanks for any help on how to retrieve information on a compound from
> pccompound.
>
> Best wishes
>
> Saikari

If you go to the webpage for this, http://www.ncbi.nlm.nih.gov/pcassay?term=2244 then you don't actually get any download links - rather it connects to the BioAssay server to retrieve data.

My guess is the NCBI don't support efetch for the pcassay database - you'll have to email them and ask.
Peter From mike.thon at gmail.com Sun Oct 31 12:03:31 2010 From: mike.thon at gmail.com (Michael Thon) Date: Sun, 31 Oct 2010 17:03:31 +0100 Subject: [Biopython] getting the parent of a Clade Message-ID: I have a Clade object and I need to access its parent clade. I thought that clade.root should do this but this seems to contain a reference to itself: (Pdb) main_clade == main_clade.root True Is there some other way? Thanks Mike From eric.talevich at gmail.com Sun Oct 31 13:57:07 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Sun, 31 Oct 2010 13:57:07 -0400 Subject: [Biopython] getting the parent of a Clade In-Reply-To: References: Message-ID: On Sun, Oct 31, 2010 at 12:03 PM, Michael Thon wrote: > I have a Clade object and I need to access its parent clade. I thought > that clade.root should do this but this seems to contain a reference to > itself: > > (Pdb) main_clade == main_clade.root > True > > Is there some other way? > Thanks > Mike > > Hi Mike, You can do this, assuming you have the original tree object (call it "tree"): parent = tree.get_path(main_clade)[-2] This is an O(n) operation on the tree, so if you need to do it repeatedly on a large tree, it's faster to call tree.get_path(clade) once outside the loop and then reuse the resulting list. Is the operation you're doing here part of something you'd like to see implemented as a tree method? -Eric From eric.talevich at gmail.com Sun Oct 31 15:23:25 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Sun, 31 Oct 2010 15:23:25 -0400 Subject: [Biopython] getting the parent of a Clade In-Reply-To: References:

Message-ID: On Sun, Oct 31, 2010 at 1:57 PM, Eric Talevich wrote: > On Sun, Oct 31, 2010 at 12:03 PM, Michael Thon wrote: > >> I have a Clade object and I need to access its parent clade. I thought >> that clade.root should do this but this seems to contain a reference to >> itself: >> >> (Pdb) main_clade == main_clade.root >> True >> >> Is there some other way? >> Thanks >> Mike >> >> > Hi Mike, > > You can do this, assuming you have the original tree object (call it > "tree"): > > parent = tree.get_path(main_clade)[-2] > > This is an O(n) operation on the tree, so if you need to do it repeatedly > on a large tree, it's faster to call tree.get_path(clade) once outside the > loop and then reuse the resulting list. > > Is the operation you're doing here part of something you'd like to see > implemented as a tree method? > > I added a cookbook entry on the Biopython wiki for this problem: http://biopython.org/wiki/Phylo_cookbook#Get_the_parent_of_a_clade Cheers, Eric
From bratdaking at gmail.com Fri Oct 8 12:00:53 2010 From: bratdaking at gmail.com (Bart) Date: Fri, 8 Oct 2010 14:00:53 +0200 Subject: [Biopython] NCBIWWW and megablast Message-ID: Hey, I was wondering why the megablast option (the greedy extension) in the qblast is left out in the NCBIWWW.py? I want to map a sequence to the human genome, and to mimic the NCBI website I need a gapcost setting of "0 0", with the megablast option set to True. The fix was to add the following line ('LCASE_MASK',lcase_mask), ('MEGABLAST',megablast), ('MATRIX_NAME',matrix_name), to the parameters list of the qblast def and add: megablast=None, to the arguments. But is there a reason this setting has been left out (it is as far as I can see the only setting from the NCBI api missing)? Cheers, Bart From biopython at maubp.freeserve.co.uk Fri Oct 8 12:33:54 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 8 Oct 2010 13:33:54 +0100 Subject: [Biopython] NCBIWWW and megablast In-Reply-To: References: Message-ID: On Fri, Oct 8, 2010 at 1:00 PM, Bart wrote: > Hey, > > I was wondering why the megablast option (the greedy extension) in the > qblast is left out in the NCBIWWW.py?
> I want to map a sequence to the human genome, and to mimic the NCBI website
> I need a gapcost setting of "0 0", with the megablast option set to True.
> The fix was to add the following line
>       ('LCASE_MASK',lcase_mask),
>       ('MEGABLAST',megablast),
>       ('MATRIX_NAME',matrix_name),
> to the parameters list of the qblast def and add:
> megablast=None,
> to the arguments.
> But is there a reason this setting has been left out (it is as far as I can
> see the only setting from the NCBI api missing)?
>
> Cheers,
> Bart

Hi Bart,

Most likely this is a relatively recent addition to the NCBI API. Could you turn that into a patch we could apply? Don't forget to add the new option to the qblast function's docstring.

Thanks,

Peter

From akooser at unm.edu Fri Oct 8 15:45:31 2010 From: akooser at unm.edu (Ara Kooser) Date: Fri, 8 Oct 2010 09:45:31 -0600 Subject: [Biopython] Tutorial Question 7.4 alignment.title In-Reply-To: References:

Message-ID: Peter, Thanks for your reply. I started to fiddle around with parsing the string last night but haven't made much progress. At the moment the output looks like this: ****Alignment**** sequence: gi|302529614|ref|ZP_07281956.1| predicted protein [Streptomyces sp. AA4] >gi|302438509|gb|EFL10325.1| predicted protein [Streptomyces sp. AA4] e value: 1.89229e-46 length: 1109 start: 7 end: 414 So what I want from the sequence string is the following: [Streptomyces sp. AA4] ZP_07281956.1 printed out as separated lines like the rest of the output. After that is figured out I want to put all the information in columns so it can be read into a spreadsheet in OO so that it looks like this: Name Locus # E_value Length Start End Regards, Ara On Oct 8, 2010, at 3:30 AM, Peter wrote: > On Fri, Oct 8, 2010 at 4:06 AM, Ara Kooser wrote: >> Hello all, >> >> I am a new user to Biopython. I've been working my way through the >> tutorial. I have a question about how the alignment.title works in >> the >> example given in section 7.4 of the tutorial. I wrote the following >> code: >> >> from Bio.Blast import NCBIXML >> >> E_VALUE_THRESH = 1e-30 >> >> result_handle = open("test.xml") >> blast_records = NCBIXML.parse(result_handle) >> blast_record = blast_records.next() >> >> for alignment in blast_record.alignments: >> for hsp in alignment.hsps: >> if hsp.expect < E_VALUE_THRESH: >> print '****Alignment****' >> print 'sequence:', alignment.title >> print 'e value:', hsp.expect >> print 'length:', alignment.length >> print 'start:', hsp.query_start >> print 'end:',hsp.query_end >> >> To look at a .xml file that was produced by BLAST. I was wondering >> if there >> was a way to break up the string for information produced by the: >> >> print 'sequence:', alignment.title >> >> Basically I would like the organisms name first, followed by the >> locus >> number. I wasn't sure how to split up the print command. 
>> >> I looked at the docs over at http://biopython.org/DIST/docs/api/ to >> see if >> there was a tag specifically for the locus number and organism name. >> >> Thank you for your time and help. >> >> Regards, >> Ara > > Hi Ara, > > An example of the output you are getting and what you want > would help, but I think this isn't possible in general. > > As I recall, the locus number and organism name information is > just part of the original identifier and/or description in the FASTA > file used to build the BLAST database. The NCBI tend to include > the species in the description within square brackets - but this is > just their convention, it is not a nicely tagged part of the BLAST > output which the parser could spot. > > Basically I think you will have to parse the string yourself. > > Peter > > P.S. Alternatively if you want the organism name and have the > GI number (or similar) this can be mapped to the organism via > the NCBI taxonomy database (either online via Entrez or > by parsing a downloaded copy of the mapping). From biopython at maubp.freeserve.co.uk Fri Oct 8 15:56:26 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 8 Oct 2010 16:56:26 +0100 Subject: [Biopython] Tutorial Question 7.4 alignment.title In-Reply-To: References:

Message-ID:

On Fri, Oct 8, 2010 at 4:45 PM, Ara Kooser wrote:
> Peter,
>
> Thanks for your reply. I started to fiddle around with parsing the string
> last night but haven't made much progress.
>
> At the moment the output looks like this:
>
> ****Alignment****
> sequence: gi|302529614|ref|ZP_07281956.1| predicted protein [Streptomyces
> sp. AA4] >gi|302438509|gb|EFL10325.1| predicted protein [Streptomyces sp.
> AA4]
> e value: 1.89229e-46
> length: 1109
> start: 7
> end: 414
>
> So what I want from the sequence string is the following:
> [Streptomyces sp. AA4]
> ZP_07281956.1
>
> printed out as separated lines like the rest of the output.

You could do this with regular expressions (import re), or some simple
Python searching for the square brackets etc.

> After that is figured out I want to put all the information in columns so it
> can be read into a spreadsheet in OO so that it looks like this:
> Name    Locus #    E_value    Length    Start    End

It would be much simpler to ask BLAST to give you tabular output.
If you are using BLAST+ you can even specify which columns you
want (although this won't pull out the organism name for you).

Peter

From akooser at unm.edu Fri Oct 8 16:01:58 2010
From: akooser at unm.edu (Ara Kooser)
Date: Fri, 8 Oct 2010 10:01:58 -0600
Subject: [Biopython] Tutorial Question 7.4 alignment.title
In-Reply-To:
References:
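[Editor's sketch] The regular-expression route Peter mentions can be roughly illustrated as below. This is only a minimal sketch, not part of Biopython: it assumes the title follows the NCBI convention seen in Ara's output (a "gi|number|db|accession|" identifier, with the organism in the last pair of square brackets of the first entry), and parse_title is a hypothetical helper name. It is written for Python 3, unlike the Python 2 code in the thread, and it returns the organism without the surrounding brackets.

```python
import re

# Assumption: NCBI-style identifiers of the form "gi|<number>|<db>|<accession>|".
# Titles built from other databases may not match, in which case None is returned.
ACCESSION_RE = re.compile(r"gi\|\d+\|\w+\|([^|]+)\|")
# Any text in square brackets; by NCBI convention the last one is the organism.
BRACKET_RE = re.compile(r"\[([^\[\]]+)\]")


def parse_title(title):
    """Return (organism, accession) from a BLAST hit title (hypothetical helper)."""
    # Merged titles list several entries separated by " >"; use only the first.
    first_entry = title.split(" >")[0]
    accession_match = ACCESSION_RE.match(first_entry)
    accession = accession_match.group(1) if accession_match else None
    bracketed = BRACKET_RE.findall(first_entry)
    organism = bracketed[-1] if bracketed else None
    return organism, accession


title = ("gi|302529614|ref|ZP_07281956.1| predicted protein [Streptomyces sp. AA4] "
         ">gi|302438509|gb|EFL10325.1| predicted protein [Streptomyces sp. AA4]")
organism, accession = parse_title(title)
print(organism)   # Streptomyces sp. AA4
print(accession)  # ZP_07281956.1
```

In the loop from the tutorial, the same helper could be called on alignment.title for each hit. Taking the last bracketed group rather than the first is a guess that copes with descriptions which themselves contain brackets; as Peter notes, the bracket layout is only a convention.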

Message-ID: <0F95C585-007A-43EE-95F0-B28941FA6301@unm.edu>

Peter,

Thank you for those suggestions. I hadn't thought of using BLAST+. I will
check that out this weekend.

Regards,
Ara

On Oct 8, 2010, at 9:56 AM, Peter wrote:

> On Fri, Oct 8, 2010 at 4:45 PM, Ara Kooser wrote:
>> Peter,
>>
>> Thanks for your reply. I started to fiddle around with parsing the string
>> last night but haven't made much progress.
>>
>> At the moment the output looks like this:
>>
>> ****Alignment****
>> sequence: gi|302529614|ref|ZP_07281956.1| predicted protein [Streptomyces sp. AA4] >gi|302438509|gb|EFL10325.1| predicted protein [Streptomyces sp. AA4]
>> e value: 1.89229e-46
>> length: 1109
>> start: 7
>> end: 414
>>
>> So what I want from the sequence string is the following:
>> [Streptomyces sp. AA4]
>> ZP_07281956.1
>>
>> printed out as separated lines like the rest of the output.
>
> You could do this with regular expressions (import re), or some simple
> python searching for the square brackets etc.
>
>> After that is figured out I want to put all the information in columns so it
>> can be read into a spreadsheet in OO so that it looks like this:
>> Name    Locus #    E_value    Length    Start    End
>
> It would be much simpler to ask BLAST to give you tabular output.
> If you are using BLAST+ you can even specify which columns you
> want (although this won't pull out the organism name for you).
>
> Peter

From mike.thon at gmail.com Sun Oct 10 06:09:42 2010
From: mike.thon at gmail.com (Michael Thon)
Date: Sun, 10 Oct 2010 08:09:42 +0200
Subject: [Biopython] parsing newick trees in memory
Message-ID: <23DD8F96-6944-4CD1-84D9-ABF0DE5DC3C3@gmail.com>

I have a String containing a tree in newick format and I want to turn it
into biopython objects. The Bio.Phylo.read() function seems to only take
file names or file handles as parameters. Is there any way to do this
without actually saving the string to a file first?
Thanks
Mike

From stran104 at chapman.edu Sun Oct 10 07:37:55 2010
From: stran104 at chapman.edu (Matthew Strand)
Date: Sun, 10 Oct 2010 00:37:55 -0700
Subject: [Biopython] parsing newick trees in memory
In-Reply-To:
References: <23DD8F96-6944-4CD1-84D9-ABF0DE5DC3C3@gmail.com>
Message-ID:

You might find this useful:
Newick: A python module for parsing trees in the Newick file format.
http://www.daimi.au.dk/~mailund/newick.html

Cheers,
Matt Strand

On Sat, Oct 9, 2010 at 11:09 PM, Michael Thon wrote:
> I have a String containing a tree in newick format and I want to turn it
> into biopython objects. The Bio.Phylo.read() function seems to only take
> file names or file handles as parameters. Is there any way to do this
> without actually saving the string to a file first?
> Thanks
> Mike
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From fkauff at biologie.uni-kl.de Sun Oct 10 10:15:45 2010
From: fkauff at biologie.uni-kl.de (Frank Kauff)
Date: Sun, 10 Oct 2010 12:15:45 +0200
Subject: [Biopython] parsing newick trees in memory
In-Reply-To: <23DD8F96-6944-4CD1-84D9-ABF0DE5DC3C3@gmail.com>
References: <23DD8F96-6944-4CD1-84D9-ABF0DE5DC3C3@gmail.com>
Message-ID:

If you don't want to use StringIO, then Nexus.Trees should be able to handle this:

>>> from Bio.Nexus import Trees
>>> tree='((a,b),c)'
>>> tobj=Trees.Tree(tree)
>>> tobj
>>> dir(tobj)
['_Tree__values_are_support', '__doc__', '__init__', '__module__',
'__str__', '_add_nodedata', '_add_subtree', '_get_id', '_get_values',
'_parse', '_walk', 'add', 'all_ids', 'branchlength2support', 'chain',
'collapse', 'collapse_genera', 'common_ancestor',
'convert_absolute_support', 'count_terminals', 'dataclass', 'display',
'distance', 'get_taxa', 'get_terminals', 'has_support', 'id',
'is_bifurcating', 'is_compatible', 'is_identical', 'is_internal',
'is_monophyletic', 'is_parent_of',
'is_preterminal', 'is_terminal', 'kill', 'link', 'max_support',
'merge_with_support', 'name', 'node', 'prune', 'randomize', 'root',
'root_with_outgroup', 'rooted', 'search_taxon', 'set_subtree', 'split',
'sum_branchlength', 'to_string', 'trace', 'unlink', 'unroot', 'weight']
>>>

On Sun, 10 Oct 2010 08:09:42 +0200 Michael Thon wrote:
> I have a String containing a tree in newick format and I
> want to turn it into biopython objects. The
> Bio.Phylo.read() function seems to only take file names
> or file handles as parameters. Is there any way to do
> this without actually saving the string to a file first?
> Thanks
> Mike

From mike.thon at gmail.com Sun Oct 10 17:22:43 2010
From: mike.thon at gmail.com (Michael Thon)
Date: Sun, 10 Oct 2010 19:22:43 +0200
Subject: [Biopython] parsing newick trees in memory
In-Reply-To:
References: <23DD8F96-6944-4CD1-84D9-ABF0DE5DC3C3@gmail.com>
Message-ID: <4FDB38FA-DC71-4E96-8C50-B98130D62589@gmail.com>

On Oct 10, 2010, at 12:15 PM, Frank Kauff wrote:

> If you don't want to use StringIO, then Nexus.Trees should be able to handle this:

I could not get StringIO to work in this case... that is, until I learned
that I have to ensure that I can read from the beginning of the buffer:

    out_h = StringIO.StringIO()
    out_h.write(tree_text)
    out_h.seek(0)
    tree = Phylo.read(out_h, 'newick')
    print tree

From biopython at maubp.freeserve.co.uk Sun Oct 10 20:10:30 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 10 Oct 2010 21:10:30 +0100
Subject: [Biopython] parsing newick trees in memory
In-Reply-To: <4FDB38FA-DC71-4E96-8C50-B98130D62589@gmail.com>
References: <23DD8F96-6944-4CD1-84D9-ABF0DE5DC3C3@gmail.com> <4FDB38FA-DC71-4E96-8C50-B98130D62589@gmail.com>
Message-ID:

On Sun, Oct 10, 2010 at 6:22 PM, Michael Thon wrote:
>
> I could not get StringIO to work in this case... that
> is, until I learned that I have to ensure that I can
> read from the beginning of the buffer:
>
>    out_h = StringIO.StringIO()
>    out_h.write(tree_text)
>    out_h.seek(0)
>    tree = Phylo.read(out_h, 'newick')
>    print tree
>

This way is shorter ;)

from StringIO import StringIO
from Bio import Phylo
tree = Phylo.read(StringIO(tree_text), 'newick')
print tree

Eric - we should probably have an example of using
StringIO in the Phylo chapter as we do in the SeqIO
chapter.

Peter

From eric.talevich at gmail.com Sun Oct 10 21:50:21 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sun, 10 Oct 2010 17:50:21 -0400
Subject: [Biopython] parsing newick trees in memory
In-Reply-To:
References: <23DD8F96-6944-4CD1-84D9-ABF0DE5DC3C3@gmail.com> <4FDB38FA-DC71-4E96-8C50-B98130D62589@gmail.com>
Message-ID:

On Sun, Oct 10, 2010 at 4:10 PM, Peter wrote:
>
> This way is shorter ;)
>
> from StringIO import StringIO
> from Bio import Phylo
> tree = Phylo.read(StringIO(tree_text), 'newick')
> print tree
>
> Eric - we should probably have an example of using
> StringIO in the Phylo chapter as we do in the SeqIO
> chapter.
>
> Peter
>

Sure. I added an example to the wiki page just now:
http://biopython.org/wiki/Phylo#read.28.29

-Eric

From mike.thon at gmail.com Mon Oct 11 08:29:14 2010
From: mike.thon at gmail.com (Michael Thon)
Date: Mon, 11 Oct 2010 10:29:14 +0200
Subject: [Biopython] saving tree data in phyloXML format
Message-ID: <7CA03745-3A1C-453F-8F7F-2E9372D4CBF3@gmail.com>

I am now reading my trees and creating tree objects. For each clade in the
tree I am adding a node_id attribute and one Property object. When I print
the tree using:

print tree

I can see the new information that I've added to the tree. When I try to
save the tree in phyloXML format, the node_id attribute and the Property
object are not serialized.
I'm saving the tree like this:

PhyloXMLIO.write(tree, 'mytree.xml')

Basically, what I'm trying to do is decorate the branches in the trees with
some additional data (a node_id, branch labels, and a URL), and then render
them in a web page, possibly using jsPhyloSVG (http://www.jsphylosvg.com).
The examples on that website show tags containing and tags. I don't see an
'annotation' property in Bio.Phylo.PhyloXML.Clade but I'm hoping that there
is some other property that maps to when I save the tree in phyloXML format.

TIA
Mike

From biopython at maubp.freeserve.co.uk Mon Oct 11 09:03:37 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 11 Oct 2010 10:03:37 +0100
Subject: [Biopython] NCBIWWW and megablast
In-Reply-To:
References: