From jttkim at googlemail.com Tue Oct 1 08:32:22 2013 From: jttkim at googlemail.com (Jan Kim) Date: Tue, 1 Oct 2013 13:32:22 +0100 Subject: [Biopython] Job: Bioinformatician in Livestock Virology Message-ID: <20131001123221.GD8804@LIN-2F308X1.iah.ac.uk> Bioinformatician in Livestock Virology (POST REF: IRC112153) ============================================================ Full details: http://www.pirbright.ac.uk/jobs/Jobs.aspx This post offers a unique opportunity to develop your own interdisciplinary research profile at the interface of experimental and theoretical bioscience applied to livestock virology at The Pirbright Institute. This post requires an understanding of the principles of bioinformatics and molecular biology, together with practical computing skills. Background knowledge of the biology of livestock animals and the viruses affecting them would also be beneficial. We expect candidates to have a firm grasp of at least one of the following areas: virus evolution; genomics and transcriptomics of viruses, vectors and hosts; or immunology and immunogenetics. We therefore encourage applications from biologists who are developing computer science and bioinformatics knowledge and from candidates with a computer science background who are developing knowledge in the biosciences. The Pirbright Institute, an institute of the Biotechnology and Biological Sciences Research Council (BBSRC), is a unique national center that works through its highly innovative fundamental and applied bioscience to enhance the UK capability to contain, control, and eliminate viral diseases of animals and viruses that spread from animals to humans. We thereby support the competitiveness of UK livestock and poultry producers, and improve the health and quality of life of both animals and people. The closing date for applying is 29 October 2013. Full details, including instructions for applying, can be found at http://www.pirbright.ac.uk/jobs/Jobs.aspx . 
Informal enquiries are welcome and should be sent to Jan T Kim, Head of Bioinformatics, jan.kim at pirbright.ac.uk.

From anaryin at gmail.com Tue Oct 8 17:22:42 2013
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 8 Oct 2013 23:22:42 +0200
Subject: [Biopython] (Bio.PDB) problems with NeighborSearch: error at levels above "A", residue index discrepancy with unfold_entities
In-Reply-To: <521FD389.1090207@eng.ucsd.edu>
References: <521FD389.1090207@eng.ucsd.edu>
Message-ID:

Dear James,

Regarding problem 1: what you describe runs fine on my machine, using Python 2.7.5 and an up-to-date Biopython git version. Your logic seems fine; maybe it's the version of Python you are using?

Regarding your second problem, the mismatched indexes: the Selection method returns a *list* of residues, whereas when you iterate over the neighbors and ask for their id, you get back the id of the residue. This id will only correspond to the Selection list index (plus one, since list indices start at 0) if your residues are numbered from 1 to N without gaps. If your protein starts at residue 3, then the first item given back by Selection has index 0 while in fact the id is 3. Does this make sense?

The warning occurs if you have chain breaks. There should be some gaps in your structure; starting at a number other than 1 does not normally raise this warning.

Cheers and sorry for the late reply,

João

2013/8/30 James Jensen

> Hello!
>
> I am writing a function that, given two chains in a PDB file, should
> return 1) the positions and identities of all residues that are in contact
> with (distance < 5 angstroms) a residue on the other chain, and 2) the
> amino acid sequences of the chains. I've been doing this with
> NeighborSearch.search_all(radius=5, level='A') and then for each atom
> pair, seeing what its parent residue is and whether the parent residues of
> the two atoms belong to different chains.
> This may seem like a roundabout
> way of doing it, but if I call search_all(radius=5, level='R'), or indeed
> with level=any level other than 'A', I get the error
>
>     TypeError: unorderable types: Residue() < Residue()
>
> So my first question is why it might be that search_all isn't working at
> higher levels.
>
> For the adjacent residue pairs I identify using NeighborSearch, I get each
> residue's position in its respective chain by residue.get_id()[1].
>
> I've noticed, however, that if I get the sequence of the chain using seq =
> Selection.unfold_entities(chain, 'R') and then reference (i.e.
> seq[index]) the amino acids using the indices returned by the
> NeighborSearch step, they are not the same residues that I get if during
> the NeighborSearch step I report residue.get_resname() for each adjacent
> residue.
>
> I've tried it with several proteins, and the problem is the same. Chains A
> and C of 2h62 are an example.
>
> I then noticed that the lowest residue ID number of the residues yielded
> from Selection.unfold_entities(chain, 'R') is not 1. For chain A, it's
> 11, and for chain C, it's 34. Not knowing why this was, I thought I'd try
> subtracting the lowest ID number from the indices returned by the
> NeighborSearch step (i.e. in chain A, 11 -> 0 so seq[0] would be the first
> residue, the one with ID 11). This happened to seem to work for chain A.
> However, it gives me negative indices for some of the contacts in chain C.
> This means that NeighborSearch can return residues that are not returned by
> unfold_entities(). The lowest residue ID returned by NeighborSearch for
> chain C was 24, whereas for unfold_entities() it was 34.
>
> For both chains A and C, I was given the warning
>
>     PDBConstructionWarning: WARNING: Chain [letter] is discontinuous at line [line number].
>
> In fact, I seem to get this warning for just about every chain of every
> structure I load.
Is this the reason that the first residues in the two > chains are at 11 and 34, rather than 1? If so, could it be that > NeighborSearch is able to work around the discontinuity while > unfold_entities is not? > > Any suggestions? > > Thanks for your time and help, > > James Jensen > ______________________________**_________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/**mailman/listinfo/biopython > From idoerg at gmail.com Wed Oct 9 09:03:10 2013 From: idoerg at gmail.com (Iddo Friedberg) Date: Wed, 9 Oct 2013 09:03:10 -0400 Subject: [Biopython] PyTennessee Message-ID: Hi all If there are any Biopython people in the area, there's a PyTennessee conference Feb 22-23 2014 in Nashville. They are taking calls for proposals now. http://www.pytennessee.org/speaking/cfp/ Thanks to Jeff Chang for the info. Best, Iddo -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. From rlinder at austin.utexas.edu Wed Oct 9 16:54:34 2013 From: rlinder at austin.utexas.edu (Randy Linder) Date: Wed, 9 Oct 2013 15:54:34 -0500 Subject: [Biopython] installing Biopython on Pyzo Message-ID: <5255C28A.20500@austin.utexas.edu> Hi, I'm using the Pyzo distro of Python 3.3 on the Windows 8 desktop and am having trouble installing Biopython. When I attempt to use the Windows installer it says it cannot find Python 3.3. in the registry. Is there a work around for this problem. Thanks in advance for any help you can provide. Randy Linder -- ___________________________________________ Randy Linder Associate Professor Office (512) 471-7825 Lab (512) 471-7826 FAX (512) 232-9529 _U.S. 
Postal Service address_: Section of Integrative Biology University of Texas 1 University Station #C0930 Austin, TX 78712 _FedEx, UPS, etc. address_: Section of Integrative Biology The University of Texas at Austin Biological Laboratories 404 2401 Whitis Austin, TX 78712 USA From bitsink at gmail.com Wed Oct 9 19:04:37 2013 From: bitsink at gmail.com (Nam Nguyen) Date: Wed, 9 Oct 2013 16:04:37 -0700 Subject: [Biopython] installing Biopython on Pyzo In-Reply-To: <5255C28A.20500@austin.utexas.edu> References: <5255C28A.20500@austin.utexas.edu> Message-ID: Hi Randy, I'm new to Biopython so my reply may be a complete non-sense. But, Pyzo's web site (http://www.pyzo.org/distro.html) says in its __future__ feature: - Activating the Pyzo distro so Windows installers (e.g. from gohlke) just work. On Wed, Oct 9, 2013 at 1:54 PM, Randy Linder wrote: > Hi, > > I'm using the Pyzo distro of Python 3.3 on the Windows 8 desktop and am > having trouble installing Biopython. When I attempt to use the Windows > installer it says it cannot find Python 3.3. in the registry. Is there a > work around for this problem. > > Thanks in advance for any help you can provide. > > Randy Linder > > > -- > > ______________________________**_____________ > Randy Linder > Associate Professor > Office (512) 471-7825 > Lab (512) 471-7826 > FAX (512) 232-9529 > > _U.S. Postal Service address_: > Section of Integrative Biology > University of Texas > 1 University Station #C0930 > Austin, TX 78712 > > _FedEx, UPS, etc. 
address_: > Section of Integrative Biology > The University of Texas at Austin > Biological Laboratories 404 > 2401 Whitis > Austin, TX 78712 > USA > > > ______________________________**_________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/**mailman/listinfo/biopython > From p.j.a.cock at googlemail.com Thu Oct 10 04:57:37 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 10 Oct 2013 09:57:37 +0100 Subject: [Biopython] installing Biopython on Pyzo In-Reply-To: References: <5255C28A.20500@austin.utexas.edu> Message-ID: Hi Randy, Nam, Yes, it sounds like Pyzo isn't recording itself in the Windows registry in the same way that the Official Python installers do, and therefore our installer can't find it. I've not heard of Pyzo but there are other Python bundles which include Biopython, e.g. the Enthought Python Distribution which is now branded as Canopy (although their package for Biopython is currently a bit out of date): https://www.enthought.com/products/canopy/package-index/ Regards, Peter On Thu, Oct 10, 2013 at 12:04 AM, Nam Nguyen wrote: > Hi Randy, > > I'm new to Biopython so my reply may be a complete non-sense. But, Pyzo's > web site (http://www.pyzo.org/distro.html) says in its __future__ feature: > > > - Activating the Pyzo distro so Windows installers (e.g. from gohlke) > just work. > > > > On Wed, Oct 9, 2013 at 1:54 PM, Randy Linder wrote: > >> Hi, >> >> I'm using the Pyzo distro of Python 3.3 on the Windows 8 desktop and am >> having trouble installing Biopython. When I attempt to use the Windows >> installer it says it cannot find Python 3.3. in the registry. Is there a >> work around for this problem. >> >> Thanks in advance for any help you can provide. >> >> Randy Linder >> >> >> -- >> >> ______________________________**_____________ >> Randy Linder >> Associate Professor >> Office (512) 471-7825 >> Lab (512) 471-7826 >> FAX (512) 232-9529 >> >> _U.S. 
Postal Service address_: >> Section of Integrative Biology >> University of Texas >> 1 University Station #C0930 >> Austin, TX 78712 >> >> _FedEx, UPS, etc. address_: >> Section of Integrative Biology >> The University of Texas at Austin >> Biological Laboratories 404 >> 2401 Whitis >> Austin, TX 78712 >> USA >> >> >> ______________________________**_________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/**mailman/listinfo/biopython >> > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From ericmajinglong at gmail.com Thu Oct 10 15:33:43 2013 From: ericmajinglong at gmail.com (Eric Ma) Date: Thu, 10 Oct 2013 15:33:43 -0400 Subject: [Biopython] How to do RNA-RNA hybridization searches? Message-ID: Hi everybody, I'm looking to try my hand at doing the following problem: I have a sequence of RNA, say, "RNA SEQUENCE 1": "...CGACGUAUGUUAUGAGCUGAUGGUCACGUGUCGAUGGCUAUAG..." I have another sequence of RNA, say "RNA SEQUENCE 2". I'd like to do 15-mer sliding window searches from RNA Sequence 1 (it'll take a long time, I'm sure, but nonetheless that's what I'd like to do), such that I search: "[*CGACGUAUGUUAUGA*]GCUGAUGGUCACGUGUCGAUGGCUAUAG" "C[*GACGUAUGUUAUGAG*]CUGAUGGUCACGUGUCGAUGGCUAUAG" "CG[*ACGUAUGUUAUGAGC*]UGAUGGUCACGUGUCGAUGGCUAUAG" etc. etc. along RNA Sequence 2, to see if I can find that same region present. I was wondering if there was some package that could do that, that either BioPython interfaces with, or is separately implemented as a Python package. Does anybody know if there is such a thing? Cheers, Eric ----------------------------------------------------------------------- Please consider the environment before printing this e-mail. Do you really need to print it? 
http://about.me/ericmjl

From chris.mit7 at gmail.com Thu Oct 10 15:59:00 2013
From: chris.mit7 at gmail.com (Chris Mitchell)
Date: Thu, 10 Oct 2013 15:59:00 -0400
Subject: [Biopython] How to do RNA-RNA hybridization searches?
In-Reply-To:
References:
Message-ID:

If you want to do it in python, it's fairly trivial.

a = 'CGACGUAUGUUAUGAGCUGAUGGUCACGUGUCGAUGGCUAUAG'
for i in xrange(0,len(a)-14):
    if a[i:i+15] in seq2:
        do this

or is there some reason you aren't taking this approach?

On Thu, Oct 10, 2013 at 3:33 PM, Eric Ma wrote:

> Hi everybody,
>
> I'm looking to try my hand at doing the following problem:
>
> I have a sequence of RNA, say, "RNA SEQUENCE 1":
> "...CGACGUAUGUUAUGAGCUGAUGGUCACGUGUCGAUGGCUAUAG..."
>
> I have another sequence of RNA, say "RNA SEQUENCE 2".
>
> I'd like to do 15-mer sliding window searches from RNA Sequence 1 (it'll
> take a long time, I'm sure, but nonetheless that's what I'd like to do),
> such that I search:
>
> "[*CGACGUAUGUUAUGA*]GCUGAUGGUCACGUGUCGAUGGCUAUAG"
> "C[*GACGUAUGUUAUGAG*]CUGAUGGUCACGUGUCGAUGGCUAUAG"
> "CG[*ACGUAUGUUAUGAGC*]UGAUGGUCACGUGUCGAUGGCUAUAG" etc. etc.
>
> along RNA Sequence 2, to see if I can find that same region present.
>
> I was wondering if there was some package that could do that, that either
> BioPython interfaces with, or is separately implemented as a Python
> package. Does anybody know if there is such a thing?
>
> Cheers,
> Eric
> -----------------------------------------------------------------------
> Please consider the environment before printing this e-mail. Do you really
> need to print it?
>
> http://about.me/ericmjl
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From ivangreg at gmail.com Thu Oct 10 16:00:54 2013
From: ivangreg at gmail.com (Ivan Gregoretti)
Date: Thu, 10 Oct 2013 16:00:54 -0400
Subject: [Biopython] How to do RNA-RNA hybridization searches?
In-Reply-To: References: Message-ID: I suggest that you try pairwise2. Ivan Ivan Gregoretti, PhD Bioinformatics On Thu, Oct 10, 2013 at 3:33 PM, Eric Ma wrote: > Hi everybody, > > I'm looking to try my hand at doing the following problem: > > I have a sequence of RNA, say, "RNA SEQUENCE 1": > "...CGACGUAUGUUAUGAGCUGAUGGUCACGUGUCGAUGGCUAUAG..." > > I have another sequence of RNA, say "RNA SEQUENCE 2". > > I'd like to do 15-mer sliding window searches from RNA Sequence 1 (it'll > take a long time, I'm sure, but nonetheless that's what I'd like to do), > such that I search: > > "[*CGACGUAUGUUAUGA*]GCUGAUGGUCACGUGUCGAUGGCUAUAG" > "C[*GACGUAUGUUAUGAG*]CUGAUGGUCACGUGUCGAUGGCUAUAG" > "CG[*ACGUAUGUUAUGAGC*]UGAUGGUCACGUGUCGAUGGCUAUAG" etc. etc. > > along RNA Sequence 2, to see if I can find that same region present. > > I was wondering if there was some package that could do that, that either > BioPython interfaces with, or is separately implemented as a Python > package. Does anybody know if there is such a thing? > > Cheers, > Eric > ----------------------------------------------------------------------- > Please consider the environment before printing this e-mail. Do you really > need to print it? > > http://about.me/ericmjl > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From davidsshin at lbl.gov Thu Oct 10 18:36:57 2013 From: davidsshin at lbl.gov (David Shin) Date: Thu, 10 Oct 2013 15:36:57 -0700 Subject: [Biopython] general scripting help Message-ID: Hi - I am trying to write a script to parse through 50 or so deltablast .xml files. File names are: xaa.xml xab.xml xac.xml ... I'm new (2 days) to python, biopython, and just trying to have something to show for a meeting tomorrow. I have my script working well enough for one file, I was wondering if there was a way to go thru each file separately and output according to file name. ie. 
I'm trying to replace "xaa" in lines 3 and 12 with a wildcard like x?? or even x*


# This part gets the length of the query and stores to a variable
from Bio import SeqIO
record = SeqIO.read("xaa", "fasta")
query_length = len(record)
#print "query length:", query_length

#This part gets the user's high and low percent identity cutoffs
high_percent_cutoff = float(input("Enter high percent cutoff: "))
low_percent_cutoff = float(input("Enter low percent cutoff: "))

# This part does the comparison to all the hits if
result_handle = open("xaa.xml")
from Bio.Blast import NCBIXML
blast_record = NCBIXML.read(result_handle)

for alignment in blast_record.alignments:
    for hsp in alignment.hsps:
        alignment_length = alignment.length
        identical_residues = hsp.identities
        percent_identity = float(identical_residues) / float(query_length)
        if alignment_length > query_length * 0.9 and alignment_length < query_length * 1.1 and percent_identity > low_percent_cutoff and percent_identity <= high_percent_cutoff:
            print "****Alignment****"
            print "sequence:", alignment.title
            print "query length:", query_length
            print "alighment length:", alignment.length
            print "identical residues:", identical_residues
            print "percent identity:", percent_identity
            print
            print "12345678901234567890123456789012345678901234567890123456789012345678901234567890"
            print hsp.query[:80]
            print hsp.match[:80]
            print hsp.sbjct[:80]


Thanks for any help.
Dave

From nathaniel.echols at gmail.com Thu Oct 10 18:53:16 2013
From: nathaniel.echols at gmail.com (Nat Echols)
Date: Thu, 10 Oct 2013 15:53:16 -0700
Subject: [Biopython] general scripting help
In-Reply-To:
References:
Message-ID:

Put all that code into a function with the file name or prefix as an argument, then iterate over possible files:

def extract_alignments (prefix) :
    # old code goes here - append ".xml" to prefix to get alignment file name

for prefix in ["xaa","xab","xac"] :
    extract_alignments(prefix)

Or you could do this:

import os.path
import sys
for file_name in sys.argv[1:] :
    prefix = os.path.splitext(file_name)[0]
    extract_alignments(prefix)

And run as:

python my_script.py x*.xml

Assuming you have a real OS installed, of course - I'm not sure whether Windows supports wildcards too.

-Nat

On Thu, Oct 10, 2013 at 3:36 PM, David Shin wrote:

> Hi -
>
> I am trying to write a script to parse through 50 or so deltablast .xml
> files.
>
> File names are:
> xaa.xml
> xab.xml
> xac.xml
> ...
>
>
> I'm new (2 days) to python, biopython, and just trying to have something to
> show for a meeting tomorrow. I have my script working well enough for one
> file, I was wondering if there was a way to go thru each file separately
> and output according to file name.
>
> ie. I'm trying to replace "xaa" in lines 3 and 12 with a wildcard like x??
> or even x* > > > # This part gets the length of the query and stores to a variable > from Bio import SeqIO > record = SeqIO.read("xaa", "fasta") > query_length = len(record) > #print "query length:", query_length > > #This part gets the user's high and low percent identity cutoffs > high_percent_cutoff = float(input("Enter high percent cutoff: ")) > low_percent_cutoff = float(input("Enter low percent cutoff: ")) > > # This part does the comparison to all the hits if > result_handle = open("xaa.xml") > from Bio.Blast import NCBIXML > blast_record = NCBIXML.read(result_handle) > > for alignment in blast_record.alignments: > for hsp in alignment.hsps: > alignment_length = alignment.length > identical_residues = hsp.identities > percent_identity = float(identical_residues) / float(query_length) > if alignment_length > query_length * 0.9 and alignment_length < > query_length * 1.1 and percent_identity > low_percent_cutoff and > percent_identity <= high_percent_cutoff: > print "****Alignment****" > print "sequence:", alignment.title > print "query length:", query_length > print "alighment length:", alignment.length > print "identical residues:", identical_residues > print "percent identity:", percent_identity > print > print > > "12345678901234567890123456789012345678901234567890123456789012345678901234567890" > print hsp.query[:80] > print hsp.match[:80] > print hsp.sbjct[:80] > > > Thanks for any help. > Dave > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From anaryin at gmail.com Thu Oct 10 18:54:26 2013 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 11 Oct 2013 00:54:26 +0200 Subject: [Biopython] general scripting help In-Reply-To: References: Message-ID: Hey David, If you put all your input files in one directory, all you need is a for loop and the os module (import os) and its listdir method. 
Then you can save the output to a file instead of printing to screen.

import os
from Bio import SeqIO

directory = '/home/dave/xmlfiles/'
for each_file in os.listdir(directory):
    # os.listdir returns bare file names, so prepend the directory
    input_path = os.path.join(directory, each_file)
    output_file = each_file + '_output'
    fhandle = open(output_file, 'w')
    record = SeqIO.read(input_path, "fasta")
    query_length = len(record)
    # Print to file, use string formatting to make life easier
    fhandle.write("query length: {0}\n".format(query_length))
    # etc etc
    fhandle.close()

You might want to have a look here: http://pythonforbiologists.com/

Cheers and good luck for tomorrow,

João

2013/10/11 David Shin

> Hi -
>
> I am trying to write a script to parse through 50 or so deltablast .xml
> files.
>
> File names are:
> xaa.xml
> xab.xml
> xac.xml
> ...
>
>
> I'm new (2 days) to python, biopython, and just trying to have something to
> show for a meeting tomorrow. I have my script working well enough for one
> file, I was wondering if there was a way to go thru each file separately
> and output according to file name.
>
> ie. I'm trying to replace "xaa" in lines 3 and 12 with a wildcard like x??
> or even x*
>
>
> # This part gets the length of the query and stores to a variable
> from Bio import SeqIO
> record = SeqIO.read("xaa", "fasta")
> query_length = len(record)
> #print "query length:", query_length
>
> #This part gets the user's high and low percent identity cutoffs
> high_percent_cutoff = float(input("Enter high percent cutoff: "))
> low_percent_cutoff = float(input("Enter low percent cutoff: "))
>
> # This part does the comparison to all the hits if
> result_handle = open("xaa.xml")
> from Bio.Blast import NCBIXML
> blast_record = NCBIXML.read(result_handle)
>
> for alignment in blast_record.alignments:
>     for hsp in alignment.hsps:
>         alignment_length = alignment.length
>         identical_residues = hsp.identities
>         percent_identity = float(identical_residues) / float(query_length)
>         if alignment_length > query_length * 0.9 and alignment_length < query_length * 1.1 and percent_identity > low_percent_cutoff and percent_identity <= high_percent_cutoff:
>             print "****Alignment****"
>             print "sequence:", alignment.title
>             print "query length:", query_length
>             print "alighment length:", alignment.length
>             print "identical residues:", identical_residues
>             print "percent identity:", percent_identity
>             print
>             print "12345678901234567890123456789012345678901234567890123456789012345678901234567890"
>             print hsp.query[:80]
>             print hsp.match[:80]
>             print hsp.sbjct[:80]
>
>
> Thanks for any help.
> Dave
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From bitsink at gmail.com Thu Oct 10 18:56:04 2013
From: bitsink at gmail.com (Nam Nguyen)
Date: Thu, 10 Oct 2013 15:56:04 -0700
Subject: [Biopython] general scripting help
In-Reply-To:
References:
Message-ID:

Module glob can help here too.

import glob
for filename in glob.glob('*.xml'):
    extract_alignments(filename)

You do not have to worry about "real OS".
Cheers, Nam On Thu, Oct 10, 2013 at 3:53 PM, Nat Echols wrote: > Put all that code into a function with the file name or prefix as an > argument, then iterate over possible files: > > def extract_alignments (prefix) : > # old code goes here - append ".xml" to prefix to get alignment file name > > for prefix in ["xaa","xab","xac"] : > extract_alignments(prefix) > > Or you could do this: > > import os.path > for file_name in sys.argv[1:] : > prefix = os.path.splitext(file_name)[0] > extract_alignments(prefix) > > And run as: > > python my_script.py x*.xml > > Assuming you have a real OS installed, of course - I'm not sure whether > Windows supports wildcards too. > > -Nat > > > > On Thu, Oct 10, 2013 at 3:36 PM, David Shin wrote: > > > Hi - > > > > I am trying to write a script to parse through 50 or so deltablast .xml > > files. > > > > File names are: > > xaa.xml > > xab.xml > > xac.xml > > ... > > > > > > I'm new (2 days) to python, biopython, and just trying to have something > to > > show for a meeting tomorrow. I have my script working well enough for one > > file, I was wondering if there was a way to go thru each file separately > > and output according to file name. > > > > ie. I'm trying to replace "xaa" in lines 3 and 12 with a wildcard like > x?? 
> > or even x* > > > > > > # This part gets the length of the query and stores to a variable > > from Bio import SeqIO > > record = SeqIO.read("xaa", "fasta") > > query_length = len(record) > > #print "query length:", query_length > > > > #This part gets the user's high and low percent identity cutoffs > > high_percent_cutoff = float(input("Enter high percent cutoff: ")) > > low_percent_cutoff = float(input("Enter low percent cutoff: ")) > > > > # This part does the comparison to all the hits if > > result_handle = open("xaa.xml") > > from Bio.Blast import NCBIXML > > blast_record = NCBIXML.read(result_handle) > > > > for alignment in blast_record.alignments: > > for hsp in alignment.hsps: > > alignment_length = alignment.length > > identical_residues = hsp.identities > > percent_identity = float(identical_residues) / > float(query_length) > > if alignment_length > query_length * 0.9 and alignment_length < > > query_length * 1.1 and percent_identity > low_percent_cutoff and > > percent_identity <= high_percent_cutoff: > > print "****Alignment****" > > print "sequence:", alignment.title > > print "query length:", query_length > > print "alighment length:", alignment.length > > print "identical residues:", identical_residues > > print "percent identity:", percent_identity > > print > > print > > > > > "12345678901234567890123456789012345678901234567890123456789012345678901234567890" > > print hsp.query[:80] > > print hsp.match[:80] > > print hsp.sbjct[:80] > > > > > > Thanks for any help. 
> > Dave
> > _______________________________________________
> > Biopython mailing list - Biopython at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython
> >
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From ericmajinglong at gmail.com Thu Oct 10 20:02:19 2013
From: ericmajinglong at gmail.com (Eric Ma)
Date: Thu, 10 Oct 2013 17:02:19 -0700 (PDT)
Subject: [Biopython] How to do RNA-RNA hybridization searches?
In-Reply-To:
References:
Message-ID: <1381449739213.990e2a93@Nodemailer>

I might try Ivan's approach. I was also trying to accommodate non-fully complementary portions. My bad for not stating this.

Thanks everybody!

Cheers,
Eric

Sent from a mobile device. Please pardon typo errors.

On Thu, Oct 10, 2013 at 3:59 PM, Chris Mitchell wrote:

> If you want to do it in python, it's fairly trivial.
> a = 'CGACGUAUGUUAUGAGCUGAUGGUCACGUGUCGAUGGCUAUAG'
> for i in xrange(0,len(a)-14):
>     if a[i:i+15] in seq2:
>         do this
> or is there some reason you aren't taking this approach?
> On Thu, Oct 10, 2013 at 3:33 PM, Eric Ma wrote:
>> Hi everybody,
>>
>> I'm looking to try my hand at doing the following problem:
>>
>> I have a sequence of RNA, say, "RNA SEQUENCE 1":
>> "...CGACGUAUGUUAUGAGCUGAUGGUCACGUGUCGAUGGCUAUAG..."
>>
>> I have another sequence of RNA, say "RNA SEQUENCE 2".
>>
>> I'd like to do 15-mer sliding window searches from RNA Sequence 1 (it'll
>> take a long time, I'm sure, but nonetheless that's what I'd like to do),
>> such that I search:
>>
>> "[*CGACGUAUGUUAUGA*]GCUGAUGGUCACGUGUCGAUGGCUAUAG"
>> "C[*GACGUAUGUUAUGAG*]CUGAUGGUCACGUGUCGAUGGCUAUAG"
>> "CG[*ACGUAUGUUAUGAGC*]UGAUGGUCACGUGUCGAUGGCUAUAG" etc. etc.
>>
>> along RNA Sequence 2, to see if I can find that same region present.
>> >> I was wondering if there was some package that could do that, that either >> BioPython interfaces with, or is separately implemented as a Python >> package. Does anybody know if there is such a thing? >> >> Cheers, >> Eric >> ----------------------------------------------------------------------- >> Please consider the environment before printing this e-mail. Do you really >> need to print it? >> >> http://about.me/ericmjl >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> From davidsshin at lbl.gov Thu Oct 10 20:24:14 2013 From: davidsshin at lbl.gov (David Shin) Date: Thu, 10 Oct 2013 17:24:14 -0700 Subject: [Biopython] general scripting help In-Reply-To: References: Message-ID: Thanks everyone, I'm going to try to get each to work. Also, I meant to say I've been on biopython 2 days... but have been trying to do general python tutorials for a couple of weeks now, and I would have never come up with those suggestions. Thanks again, and thanks for the link also. On Thu, Oct 10, 2013 at 3:56 PM, Nam Nguyen wrote: > Module glob can help here too. > > import glob > for filename in glob.glob('*.xml'): > extract_alignments(filename) > > You do not have to worry about "real OS". 
> > Cheers, > Nam > > > On Thu, Oct 10, 2013 at 3:53 PM, Nat Echols wrote: > >> Put all that code into a function with the file name or prefix as an >> argument, then iterate over possible files: >> >> def extract_alignments (prefix) : >> # old code goes here - append ".xml" to prefix to get alignment file >> name >> >> for prefix in ["xaa","xab","xac"] : >> extract_alignments(prefix) >> >> Or you could do this: >> >> import os.path >> for file_name in sys.argv[1:] : >> prefix = os.path.splitext(file_name)[0] >> extract_alignments(prefix) >> >> And run as: >> >> python my_script.py x*.xml >> >> Assuming you have a real OS installed, of course - I'm not sure whether >> Windows supports wildcards too. >> >> -Nat >> >> >> >> On Thu, Oct 10, 2013 at 3:36 PM, David Shin wrote: >> >> > Hi - >> > >> > I am trying to write a script to parse through 50 or so deltablast .xml >> > files. >> > >> > File names are: >> > xaa.xml >> > xab.xml >> > xac.xml >> > ... >> > >> > >> > I'm new (2 days) to python, biopython, and just trying to have >> something to >> > show for a meeting tomorrow. I have my script working well enough for >> one >> > file, I was wondering if there was a way to go thru each file separately >> > and output according to file name. >> > >> > ie. I'm trying to replace "xaa" in lines 3 and 12 with a wildcard like >> x?? 
>> > or even x*
>> >
>> >
>> > # This part gets the length of the query and stores to a variable
>> > from Bio import SeqIO
>> > record = SeqIO.read("xaa", "fasta")
>> > query_length = len(record)
>> > #print "query length:", query_length
>> >
>> > #This part gets the user's high and low percent identity cutoffs
>> > high_percent_cutoff = float(input("Enter high percent cutoff: "))
>> > low_percent_cutoff = float(input("Enter low percent cutoff: "))
>> >
>> > # This part does the comparison to all the hits if
>> > result_handle = open("xaa.xml")
>> > from Bio.Blast import NCBIXML
>> > blast_record = NCBIXML.read(result_handle)
>> >
>> > for alignment in blast_record.alignments:
>> >     for hsp in alignment.hsps:
>> >         alignment_length = alignment.length
>> >         identical_residues = hsp.identities
>> >         percent_identity = float(identical_residues) / float(query_length)
>> >         if (alignment_length > query_length * 0.9
>> >                 and alignment_length < query_length * 1.1
>> >                 and percent_identity > low_percent_cutoff
>> >                 and percent_identity <= high_percent_cutoff):
>> >             print "****Alignment****"
>> >             print "sequence:", alignment.title
>> >             print "query length:", query_length
>> >             print "alignment length:", alignment.length
>> >             print "identical residues:", identical_residues
>> >             print "percent identity:", percent_identity
>> >             print
>> >             print "12345678901234567890123456789012345678901234567890123456789012345678901234567890"
>> >             print hsp.query[:80]
>> >             print hsp.match[:80]
>> >             print hsp.sbjct[:80]
>> >
>> >
>> > Thanks for any help.
>> > Dave >> > _______________________________________________ >> > Biopython mailing list - Biopython at lists.open-bio.org >> > http://lists.open-bio.org/mailman/listinfo/biopython >> > >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> > > -- David Shin, Ph.D Lawrence Berkeley National Labs 1 Cyclotron Road MS 83-R0101 Berkeley, CA 94720 USA From golubchi at stats.ox.ac.uk Tue Oct 15 10:40:29 2013 From: golubchi at stats.ox.ac.uk (Tanya Golubchik) Date: Tue, 15 Oct 2013 15:40:29 +0100 Subject: [Biopython] Blast using Biopython Message-ID: <525D53DD.5040903@stats.ox.ac.uk> Hi guys, This is strictly speaking more about blast than biopython, but I was wondering if anyone has any tips on doing the following: searching for a hit in a nucleotide database using tblastn, but reporting the actual DNA sequence of the subject, rather than the translated protein sequence. Is there by any chance a way of extracting this from the XML output? What I'm finding is that blastn sometimes misses the edges, where substitutions close the ends of my hit result in a truncated hit (rather than a complete hit with a mismatch or two). The full hit is reported correctly by tblastn, but of course this returns the protein translation rather than the original nucleotide sequence. It's probably a long shot, but just wondering if anyone has ideas -- the brute force approach would be to get the start and stop positions from tblastn and then extract and re-align this fragment to my query, but that seems redundant given that blast has already done this for me... 
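A minimal sketch of the brute-force step described above, assuming the subject nucleotide sequence is already in memory as a plain string, that the hit start and stop are 1-based inclusive positions on the forward strand, and that a negative frame means the hit lies on the reverse strand. The function name and its arguments are made up for illustration; they are not part of BLAST or Biopython.

```python
def extract_subject_dna(subject_seq, start, end, frame):
    """Return the subject DNA under a tblastn hit, oriented like the query.

    Assumptions (not taken from the thread): start/end are 1-based,
    inclusive, forward-strand coordinates; frame < 0 means reverse strand.
    """
    low, high = min(start, end), max(start, end)
    # Convert 1-based inclusive coordinates to a Python slice.
    fragment = subject_seq[low - 1:high].upper()
    if frame < 0:
        # Reverse-complement so the fragment reads in the query's direction.
        complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
        fragment = "".join(complement.get(base, "N") for base in reversed(fragment))
    return fragment


subject = "AAATGGCCTTT"
print(extract_subject_dna(subject, 3, 8, 1))
print(extract_subject_dna(subject, 3, 8, -1))
```

The fragment returned this way could then be re-aligned to the query, which is the redundant step the message is hoping to avoid.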
Thanks
Tanya

From mmokrejs at fold.natur.cuni.cz Tue Oct 15 18:33:32 2013
From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs)
Date: Wed, 16 Oct 2013 00:33:32 +0200
Subject: [Biopython] Blast using Biopython
In-Reply-To: <525D53DD.5040903@stats.ox.ac.uk>
References: <525D53DD.5040903@stats.ox.ac.uk>
Message-ID: <525DC2BC.7050300@fold.natur.cuni.cz>

Hi Tanya,
I suppose you use the newer ncbi-tools++ suite. Try the legacy blastn from the ncbi-tools suite. The version numbering is the same ... I have had better experience with "blastall -p blastn" from the old suite. You can also try to find some switch to force the really old blastn algorithm buried in blastall (nowadays blastall uses the new algorithm, the one in the new ncbi-tools++ suite). However, experience shows that "blastall -p blastn" gives different results from blastn, although BOTH should in theory be using the new algorithm. If you can force the real predecessor of the algorithm in blastall, you have a third method to test.

blastall gives only limited results in its CSV-formatted output, and you cannot change the output columns. For me, the important results can only be parsed from the XML or plain-text output of blastall.

You can increase the reward for a match ("-r 2") to overcome some gaps at the ends, but whether that helps depends on your queries and on whether it gives you falsely widened alignments elsewhere. You have to test that.

Good luck,
Martin

Tanya Golubchik wrote:
> Hi guys,
>
> This is strictly speaking more about blast than biopython, but I was wondering if anyone has any tips on doing the following: searching for a hit in a nucleotide database using tblastn, but reporting the actual DNA sequence of the subject, rather than the translated protein sequence. Is there by any chance a way of extracting this from the XML output?
> > What I'm finding is that blastn sometimes misses the edges, where substitutions close the ends of my hit result in a truncated hit (rather than a complete hit with a mismatch or two). The full hit is reported correctly by tblastn, but of course this returns the protein translation rather than the original nucleotide sequence. It's probably a long shot, but just wondering if anyone has ideas -- the brute force approach would be to get the start and stop positions from tblastn and then extract and re-align this fragment to my query, but that seems redundant given that blast has already done this for me... From jordan.r.willis at Vanderbilt.Edu Tue Oct 15 19:56:56 2013 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Tue, 15 Oct 2013 23:56:56 +0000 Subject: [Biopython] Blast using Biopython In-Reply-To: <525D53DD.5040903@stats.ox.ac.uk> References: <525D53DD.5040903@stats.ox.ac.uk> Message-ID: Tanya, Does it have to be XML? Could you try -outfmt 7 and possibly request qseq and sseq which will return the aligned part of the sequence from the query and subject? J On Oct 15, 2013, at 9:40 AM, Tanya Golubchik wrote: > Hi guys, > > This is strictly speaking more about blast than biopython, but I was wondering if anyone has any tips on doing the following: searching for a hit in a nucleotide database using tblastn, but reporting the actual DNA sequence of the subject, rather than the translated protein sequence. Is there by any chance a way of extracting this from the XML output? > > What I'm finding is that blastn sometimes misses the edges, where substitutions close the ends of my hit result in a truncated hit (rather than a complete hit with a mismatch or two). The full hit is reported correctly by tblastn, but of course this returns the protein translation rather than the original nucleotide sequence. 
It's probably a long shot, but just wondering if anyone has ideas -- the brute force approach would be to get the start and stop positions from tblastn and then extract and re-align this fragment to my query, but that seems redundant given that blast has already done this for me...
>
> Thanks
> Tanya
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From ddufour at pcb.ub.cat Thu Oct 17 04:33:29 2013
From: ddufour at pcb.ub.cat (David Dufour Rausell)
Date: Thu, 17 Oct 2013 10:33:29 +0200
Subject: [Biopython] is_na()?
Message-ID: <6A16A013-BDD7-4785-9B12-86238EE1BA78@pcb.ub.cat>

Hello,

I would like to know if there is a function like is_aa(residue), but to test whether a residue is RNA. Basically, what I want to know is whether a chain is RNA or not. Thanks in advance.

David Dufour Rausell

Genome Biology Group - Centre Nacional d'Anàlisis Genòmic (CNAG)
Structural Genomics Group - Centre de Regulació Genòmica (CRG)

Parc Cientific de Barcelona - Torre I - Baldiri Reixac, 4 - 2a.p - 08028 - Barcelona
Tel +34 93 4020542
email ddufour at pcb.ub.cat

From p.j.a.cock at googlemail.com Thu Oct 17 04:49:15 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 17 Oct 2013 09:49:15 +0100
Subject: [Biopython] is_na()?
In-Reply-To: <6A16A013-BDD7-4785-9B12-86238EE1BA78@pcb.ub.cat>
References: <6A16A013-BDD7-4785-9B12-86238EE1BA78@pcb.ub.cat>
Message-ID:

On Thu, Oct 17, 2013 at 9:33 AM, David Dufour Rausell wrote:
> Hello,
>
> I would like to know if there is a function like is_aa(residue) but to test if
> a residue is RNA? Basically, what I want to know if a chain is RNA or not.
> Thanks in advance.
>
> David Dufour Rausell

Hello David,

Are you asking about residues in 3D structures (e.g. from Bio.PDB), letters in sequences (e.g. strings or Seq objects), or another context?

If you mean sequences and are using Seq objects, you may be able to look at the alphabet.
However, most file formats do not define this explicitly, but you can tell the SeqIO parsers via the optional alphabet argument. Peter From ddufour at pcb.ub.cat Thu Oct 17 06:07:45 2013 From: ddufour at pcb.ub.cat (David Dufour Rausell) Date: Thu, 17 Oct 2013 12:07:45 +0200 Subject: [Biopython] is_na()? In-Reply-To: <6A16A013-BDD7-4785-9B12-86238EE1BA78@pcb.ub.cat> References: <6A16A013-BDD7-4785-9B12-86238EE1BA78@pcb.ub.cat> Message-ID: Hello again, I forgot to mention that I'm working with PDB files, so I'm using the Bio.PDB module. I'm thinking in extracting the sequence from each chain and check if it is made by RNA residues, but any other idea will be very welcome. Thanks! David Dufour Rausell Genome Biology Group - Centre Nacional d'An?lisis Gen?mic (CNAG) Structural Genomics Group - Centre de Regulaci? Gen?mica (CRG) Parc Cientific de Barcelona - Torre I - Baldiri Reixac, 4 - 2a.p - 08028 - Barcelona Tel +34 93 4020542 email ddufour at pcb.ub.cat On Oct 17, 2013, at 10:33 AM, David Dufour Rausell wrote: > Hello, > > I would like to know if there is a function like is_aa(residue) but to test if a residue is RNA? Basically, what I want to know if a chain is RNA or not. Thanks in advance. > > > David Dufour Rausell > > Genome Biology Group - Centre Nacional d'An?lisis Gen?mic (CNAG) > Structural Genomics Group - Centre de Regulaci? Gen?mica (CRG) > > Parc Cientific de Barcelona - Torre I - Baldiri Reixac, 4 - 2a.p - 08028 - Barcelona > Tel +34 93 4020542 > email ddufour at pcb.ub.cat > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From harijay at gmail.com Thu Oct 17 07:44:08 2013 From: harijay at gmail.com (hari jayaram) Date: Thu, 17 Oct 2013 07:44:08 -0400 Subject: [Biopython] Downloadable html documentation? Message-ID: Hi, I recently started using a very fast offline OSX only documentation browser called Dash.app . 
It works great to very quickly search documentation . It has the ability to load in "docsets" and a set of instructions for how to build your own docset starting from html files. Is there an archive of the html api documentation for Biopython. Thanks Hari From p.j.a.cock at googlemail.com Thu Oct 17 07:52:05 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 17 Oct 2013 12:52:05 +0100 Subject: [Biopython] Downloadable html documentation? In-Reply-To: References: Message-ID: On Thu, Oct 17, 2013 at 12:44 PM, hari jayaram wrote: > Hi, > I recently started using a very fast offline OSX only documentation browser > called Dash.app . It works great to very quickly search documentation . > > It has the ability to load in "docsets" and a set of instructions for how > to build your own docset starting from html files. > > Is there an archive of the html api documentation for Biopython. > > Thanks > Hari All the API documentation is pulled from the docstrings in the Python code - can you just point Dash.app at the source code instead? Peter From jordan.r.willis at Vanderbilt.Edu Thu Oct 17 07:34:13 2013 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Thu, 17 Oct 2013 11:34:13 +0000 Subject: [Biopython] is_na()? In-Reply-To: References: <6A16A013-BDD7-4785-9B12-86238EE1BA78@pcb.ub.cat> Message-ID: I would just get the ID of the residue and ask if it's in the standard amino acid library. If you wanna use all Bio.PDB the Polypeptide class has a three_to_one function that only contains the naturally occurring 20 AA's by their 3 letter code. 
Do something like:

from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.Polypeptide import three_to_one
structure = PDBParser().get_structure('XXXX', 'you_structure.pdb')
residues = structure.get_residues()
for resi in residues:
    try:
        # three_to_one raises KeyError for names outside the 20 standard amino acids
        three_to_one(resi.get_resname())
    except KeyError:
        print "residue {0} {1} on chain {2} is not a standard amino acid".format(resi.get_id()[1],resi.get_resname(),resi.get_parent().get_id())

On Oct 17, 2013, at 5:07 AM, David Dufour Rausell wrote:
> Hello again,
>
> I forgot to mention that I'm working with PDB files, so I'm using the Bio.PDB module. I'm thinking in extracting the sequence from each chain and check if it is made by RNA residues, but any other idea will be very welcome.
>
> Thanks!
>
> David Dufour Rausell
>
> Genome Biology Group - Centre Nacional d'Anàlisis Genòmic (CNAG)
> Structural Genomics Group - Centre de Regulació Genòmica (CRG)
>
> Parc Cientific de Barcelona - Torre I - Baldiri Reixac, 4 - 2a.p - 08028 - Barcelona
> Tel +34 93 4020542
> email ddufour at pcb.ub.cat
>
> On Oct 17, 2013, at 10:33 AM, David Dufour Rausell wrote:
>
>> Hello,
>>
>> I would like to know if there is a function like is_aa(residue) but to test if a residue is RNA? Basically, what I want to know if a chain is RNA or not. Thanks in advance.
>>
>>
>> David Dufour Rausell
>>
>> Genome Biology Group - Centre Nacional d'Anàlisis Genòmic (CNAG)
>> Structural Genomics Group - Centre de Regulació
Gen?mica (CRG) >> >> Parc Cientific de Barcelona - Torre I - Baldiri Reixac, 4 - 2a.p - 08028 - Barcelona >> Tel +34 93 4020542 >> email ddufour at pcb.ub.cat >> >> >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From anaryin at gmail.com Thu Oct 17 08:52:43 2013 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Thu, 17 Oct 2013 14:52:43 +0200 Subject: [Biopython] is_na()? In-Reply-To: References: <6A16A013-BDD7-4785-9B12-86238EE1BA78@pcb.ub.cat> Message-ID: That list also has all the nucleic acids so it's a matter of parsing it and using only a portion of it. Cheers, Jo?o 2013/10/17 Willis, Jordan R > I would just get the ID of the residue and ask if it's in the standard > amino acid library. If you wanna use all Bio.PDB the Polypeptide class has > a three_to_one function that only contains the naturally occurring 20 AA's > by their 3 letter code. Do something like: > > from Bio.PDB.PDBParser import PDBParser > from Bio.PDB.Polypeptide import three_to_one > structure = PDBParser().get_structure('XXXX', 'you_structure.pdb') > residues = structure.get_residues() > for resi in residues: > try: > three_to_one(resi.get_resname()) > except KeyError: > print "residue {0} {1} on chain {2} is not a standard amino > acid".format(resi.get_id()[1],resi.get_resname(),resi.get_parent().get_id()) > > > > On Oct 17, 2013, at 5:07 AM, David Dufour Rausell > wrote: > > > Hello again, > > > > I forgot to mention that I'm working with PDB files, so I'm using the > Bio.PDB module. I'm thinking in extracting the sequence from each chain and > check if it is made by RNA residues, but any other idea will be very > welcome. > > > > Thanks! 
> > > > David Dufour Rausell > > > > Genome Biology Group - Centre Nacional d'An?lisis Gen?mic (CNAG) > > Structural Genomics Group - Centre de Regulaci? Gen?mica (CRG) > > > > Parc Cientific de Barcelona - Torre I - Baldiri Reixac, 4 - 2a.p - 08028 > - Barcelona > > Tel +34 93 4020542 > > email ddufour at pcb.ub.cat > > > > On Oct 17, 2013, at 10:33 AM, David Dufour Rausell wrote: > > > >> Hello, > >> > >> I would like to know if there is a function like is_aa(residue) but to > test if a residue is RNA? Basically, what I want to know if a chain is RNA > or not. Thanks in advance. > >> > >> > >> David Dufour Rausell > >> > >> Genome Biology Group - Centre Nacional d'An?lisis Gen?mic (CNAG) > >> Structural Genomics Group - Centre de Regulaci? Gen?mica (CRG) > >> > >> Parc Cientific de Barcelona - Torre I - Baldiri Reixac, 4 - 2a.p - > 08028 - Barcelona > >> Tel +34 93 4020542 > >> email ddufour at pcb.ub.cat > >> > >> > >> _______________________________________________ > >> Biopython mailing list - Biopython at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/biopython > > > > > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From golubchi at stats.ox.ac.uk Thu Oct 17 10:07:50 2013 From: golubchi at stats.ox.ac.uk (Tanya Golubchik) Date: Thu, 17 Oct 2013 15:07:50 +0100 Subject: [Biopython] Blast using Biopython In-Reply-To: <525DC2BC.7050300@fold.natur.cuni.cz> References: <525D53DD.5040903@stats.ox.ac.uk> <525DC2BC.7050300@fold.natur.cuni.cz> Message-ID: <525FEF36.7080008@stats.ox.ac.uk> Hi Martin, Using task=blastn seems to solve the problem! 
Thanks so much, I didn't realise that the default (megablast) behaviour is different even when the word size and other parameters are changed. Blastn seems to find the edges much more precisely than megablast. I haven't thoroughly tested it yet to make sure it doesn't break anything, but so far so good! Thanks Tanya On 15/10/13 23:33, Martin Mokrejs wrote: > Hi Tanya, > I suppose you use the newer ncbi-tools++ suite. Try the legacy blastn from the ncbi-tools suite. > The version numbering is same ... I have better experience with "blastall -p blastn" form the old > suite. You can also try to find some switch to force the really old blastn algorithm buried in > blastall (nowadays the blastall uses the new algorithm which is in the new ncbi-tools++ suite). > However, experience shows that "blastall -p blastn" gives different results compared to blastn > although BOTH should be in theory using the new algorithm. With the possibility to force the real > predecessor of the algorithm in blastall you have a third method to test. > > From blastall you get only limited results into CSV-formatted output, you cannot change the > output columns. For me important results can be only parsed from XML/plaintext results of blastall. > > You can increase the reward for a match "-r 2" to overcome some gaps on sides but depends what > queries you have and whether that does not give you elsewhere falsely widened alignments. You have > to test that. > > Good luck, > Martin > > > Tanya Golubchik wrote: >> Hi guys, >> >> This is strictly speaking more about blast than biopython, but I was wondering if anyone has any tips on doing the following: searching for a hit in a nucleotide database using tblastn, but reporting the actual DNA sequence of the subject, rather than the translated protein sequence. Is there by any chance a way of extracting this from the XML output? 
>> >> What I'm finding is that blastn sometimes misses the edges, where substitutions close the ends of my hit result in a truncated hit (rather than a complete hit with a mismatch or two). The full hit is reported correctly by tblastn, but of course this returns the protein translation rather than the original nucleotide sequence. It's probably a long shot, but just wondering if anyone has ideas -- the brute force approach would be to get the start and stop positions from tblastn and then extract and re-align this fragment to my query, but that seems redundant given that blast has already done this for me... From harijay at gmail.com Thu Oct 17 11:02:16 2013 From: harijay at gmail.com (hari jayaram) Date: Thu, 17 Oct 2013 11:02:16 -0400 Subject: [Biopython] Downloadable html documentation? In-Reply-To: References: Message-ID: Thanks Peter.. I am sure I can do that. I will do that and share the docset once it is generated on github. I find the app super speedy and easy to use. Strangely easier than a browser and online search. Hari On Thu, Oct 17, 2013 at 7:52 AM, Peter Cock wrote: > On Thu, Oct 17, 2013 at 12:44 PM, hari jayaram wrote: > > Hi, > > I recently started using a very fast offline OSX only documentation > browser > > called Dash.app . It works great to very quickly search documentation . > > > > It has the ability to load in "docsets" and a set of instructions for how > > to build your own docset starting from html files. > > > > Is there an archive of the html api documentation for Biopython. > > > > Thanks > > Hari > > All the API documentation is pulled from the docstrings in > the Python code - can you just point Dash.app at the source > code instead? 
>
> Peter
>

From jdjensen at eng.ucsd.edu Thu Oct 17 18:05:48 2013
From: jdjensen at eng.ucsd.edu (James Jensen)
Date: Thu, 17 Oct 2013 15:05:48 -0700
Subject: [Biopython] (Bio.PDB) problems with NeighborSearch: error at levels above "A", residue index discrepancy with unfold_entities
In-Reply-To:
References: <521FD389.1090207@eng.ucsd.edu>
Message-ID: <52605F3C.7080109@eng.ucsd.edu>

Hi, João,

A late reply is much better than no reply. I'm impressed you tracked this down and followed up, and I appreciate your help. And it took me a while to get around to revisiting this myself.

I was using Python 3.2 when I got the "unorderable types" error. For unrelated reasons, I ended up switching to Python 2.7.3, and now doing search_all at the residue level works.

The issue with the indexing is not that the residues' get_id() function returns a different number from the corresponding list index in the list returned by Selection.unfold_entities(). That is inconvenient, but I've been working around it. What puzzled me is that it appeared that the NeighborSearch was accessing residues that unfold_entities() wasn't accessing, although this wouldn't make sense because I used NeighborSearch on the results of a call to unfold_entities(). Let me check again; it could have been something I was doing wrong.

What do the chain breaks mean? Are they missing data, and if so, what is missing? And what are their consequences for working with the data? How would they be problematic for iterating over residues, calculating distances, returning the amino acid sequence of the structure, etc?

Thanks again,

James

On 10/08/2013 02:22 PM, João Rodrigues wrote:
> Dear James,
>
> Regarding problem 1. What you describe runs fine on my machine, using
> Python 2.7.5 and an up-to-date Biopython git version. Your logic seems
> fine, maybe it's the version of Python you are using?
>
> Regarding your second problem, that of the mismatched indexes.
The > Selection method returns a *list* of residues while when you iterate > over the neighbors and ask for their id it gives back the id of the > residue. This id will only correspond to the Selection list index if > your residues are numbered from 1 to N without gaps. If your protein > starts at residue 3, then the first item given back by Selection has > index 0 while in fact the id is 3. Does this make sense? > > The warning occurs if you have chain breaks. There should be some gaps > in your structure, starting at a number other than 1 does not raise > this warning normally. > > Cheers and sorry for the late reply, > > Jo?o > > > > 2013/8/30 James Jensen > > > Hello! > > I am writing a function that, given two chains in a PDB file, > should return 1) the positions and identities of all residues that > are in contact with (distance < 5 angstroms) a residue on the > other chain, and 2) the amino acid sequences of the chains. I've > been doing this with NeighborSearch.search_all(radius=5, > level='A') and then for each atom pair, seeing what its parent > residue is and whether the parent residues of the two atoms belong > to different chains. This may seem like a roundabout way of doing > it, but if I call search_all(radius=5, level='R'), or indeed with > level=any level other than 'A', I get the error > > TypeError: unorderable types: Residue() < Residue() > > So my first question is why it might be that search_all isn't > working at higher levels. > > For the adjacent residue pairs I identify using NeighborSearch, I > get each residue's position in its respective chain by > residue.get_id()[1]. > > I've noticed, however, that if I get the sequence of the chain > using seq = Selection.unfold_entities(chain, 'R') and then > reference (i.e. seq[index]) the amino acids using the indices > returned by the NeighborSearch step, they are not the same > residues that I get if during the NeighborSearch step I report > residue.get_resname() for each adjacent residue. 
> > I've tried it with several proteins, and the problem is the same. > Chains A and C of 2h62 are an example. > > I then noticed that the lowest residue ID number of the residues > yielded from Selection.unfold_entities(chain, 'R') is not 1. For > chain A, it's 11, and for chain C, it's 34. Not knowing why this > was, I thought I'd try subtracting the lowest ID number from the > indices returned by the NeighborSearch step (i.e. in chain A, 11 > -> 0 so seq[0] would be the first residue, the one with ID 11). > This happened to seem to work for chain A. However, it gives me > negative indices for some of the contacts in chain C. This means > that NeighborSearch can return residues that are not returned by > unfold_entities(). The lowest residue ID returned by > NeighborSearch for chain C was 24, whereas for unfold_entities() > it was 34. > > For both chains A and C, I was given the warning > > PDBConstructionWarning: WARNING: Chain [letter] is > discontinuous at line [line number]. > > In fact, I seem to get this warning for just about every chain of > every structure I load. Is this the reason that the first residues > in the two chains are at 11 and 34, rather than 1? If so, could it > be that NeighborSearch is able to work around the discontinuity > while unfold_entities is not? > > Any suggestions? > > Thanks for your time and help, > > James Jensen > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > From anaryin at gmail.com Thu Oct 17 19:25:42 2013 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 18 Oct 2013 01:25:42 +0200 Subject: [Biopython] (Bio.PDB) problems with NeighborSearch: error at levels above "A", residue index discrepancy with unfold_entities In-Reply-To: <52605F3C.7080109@eng.ucsd.edu> References: <521FD389.1090207@eng.ucsd.edu> <52605F3C.7080109@eng.ucsd.edu> Message-ID: Hey James, NeighborSearch. 
Unless you edit your structure, you should always get back the same atoms from whatever method you use to access that structure (unfold_entities, NeighborSearch, iteration, etc). If you want, paste your code here and I can try to explain what is going on; maybe that makes things a bit clearer. Also, if you can reproduce the weird behavior, paste the code on pastebin.com or write it here in the thread so we can try it on our machines too.

The behavior of get_id() is not inconvenient at all, at least from a biological point of view. You usually want the residue position in the amino acid sequence, not the computer science data structure index. You should always work with the indices from the PDB file, otherwise biologists will get quite mad at you if you start using the other numbers :)

Chain breaks. Chain breaks are literally a break (discontinuity) in the polypeptide chain. Sometimes you cannot get enough density from the x-ray experiment to accurately determine the position of particular atoms. You usually see this in lower-resolution structures (3 Å) or very mobile regions (loops). What happens is that you get a gap in the structure: say, from residue 14 to residue 20 there is nothing there. But in the sequence that went into the x-ray beam, these 5 residues (15-19) were there, so the numbering takes them into account as well.

As for implications... well, it depends on what you are doing with the structure. Calculating distances, iterating over residues, etc, will not be problematic at all. You will just 'miss' some residues because they are simply not there. You might want to pay particular attention if you are renumbering your structure, to make sure it 'respects' these gaps, for example.

2h62 has 4 different chains and they are indeed complete. I get the same warning, but for all chains, and the lines I get notified about are the first solvent molecules of that particular chain.
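A small sketch of the two points above: residue numbers are not list indices, and a chain break shows up as a jump in the numbering. It assumes the residue numbers of one chain have already been collected in order, for example with residue.get_id()[1] as used earlier in the thread, and it ignores insertion codes. The helper name is hypothetical.

```python
def find_numbering_gaps(residue_numbers):
    """Return (last_before_gap, first_after_gap) pairs for one chain."""
    gaps = []
    for previous, current in zip(residue_numbers, residue_numbers[1:]):
        if current - previous > 1:
            gaps.append((previous, current))
    return gaps


# Illustrative chain: starts at residue 11, residues 15-19 are missing.
numbers = [11, 12, 13, 14, 20, 21, 22]

# Map each PDB residue number to its position in the list, so that a
# number coming from a neighbor search can be turned into a list index.
index_by_number = {number: index for index, number in enumerate(numbers)}

print(find_numbering_gaps(numbers))
print(index_by_number[20])
```

With a lookup like this there is no need to subtract the lowest residue number from an id, which breaks as soon as the chain has a gap.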
The way StructureBuilder works is a bit silly indeed: it iterates over the lines of the PDB file and, when it finds a chain identifier different from the one on the previous line, it adds a new chain. If this chain already exists, it raises this warning. It's a bit silly in this case because HETATM records should not be taken into account here, since they always come at the end of the file.

If you can, submit a bug report or feature request in our tracker and I'll go over it when I have some free time.

Cheers,

João

2013/10/18 James Jensen

> Hi, João,
>
> A late reply is much better than no reply. I'm impressed you tracked this
> down and followed up, and I appreciate your help. And it took me a while to
> get around to revisiting this myself.
>
> I was using Python 3.2 when I got the "unorderable types" error. For
> unrelated reasons, I ended up switching to Python 2.7.3, and now doing
> search_all at the residue level works.
>
> The issue with the indexing is not that the residues' get_id() function
> returns a different number from the corresponding list index in the list
> returned by Selection.unfold_entities(). That is inconvenient, but I've
> been working around it. What puzzled me is that it appeared that the
> NeighborSearch was accessing residues that unfold_entities() wasn't
> accessing, although this wouldn't make sense because I used NeighborSearch
> on the results of a call to unfold_entities(). Let me check again; it could
> have been something I was doing wrong.
>
> What do the chain breaks mean? Are they missing data, and if so, what is
> missing? And what are their consequences for working with the data? How
> would they be problematic for iterating over residues, calculating
> distances, returning the amino acid sequence of the structure, etc?
>
> Thanks again,
>
> James
>
>
>
> On 10/08/2013 02:22 PM, João Rodrigues wrote:
>
> Dear James,
>
> Regarding problem 1.
What you describe runs fine on my machine, using > Python 2.7.5 and an up-to-date Biopython git version. You logic seems fine, > maybe it's the version of Python you are using? > > Regarding your second problem, that of the mismatched indexes. The > Selection method returns a *list* of residues while when you iterate over > the neighbors and ask for their id it gives back the id of the residue. > This id will only correspond to the Selection list index if your residues > are numbered from 1 to N without gaps. If your protein starts at residue 3, > then the first item given back by Selection has index 0 while in fact the > id is 3. Does this make sense? > > The warning occurs if you have chain breaks. There should be some gaps > in your structure, starting at a number other than 1 does not raise this > warning normally. > > Cheers and sorry for the late reply, > > Jo?o > > > > 2013/8/30 James Jensen > >> Hello! >> >> I am writing a function that, given two chains in a PDB file, should >> return 1) the positions and identities of all residues that are in contact >> with (distance < 5 angstroms) a residue on the other chain, and 2) the >> amino acid sequences of the chains. I've been doing this with >> NeighborSearch.search_all(radius=5, level='A') and then for each atom pair, >> seeing what its parent residue is and whether the parent residues of the >> two atoms belong to different chains. This may seem like a roundabout way >> of doing it, but if I call search_all(radius=5, level='R'), or indeed with >> level=any level other than 'A', I get the error >> >> TypeError: unorderable types: Residue() < Residue() >> >> So my first question is why it might be that search_all isn't working at >> higher levels. >> >> For the adjacent residue pairs I identify using NeighborSearch, I get >> each residue's position in its respective chain by residue.get_id()[1]. 
>> >> I've noticed, however, that if I get the sequence of the chain using seq >> = Selection.unfold_entities(chain, 'R') and then reference (i.e. >> seq[index]) the amino acids using the indices returned by the >> NeighborSearch step, they are not the same residues that I get if during >> the NeighborSearch step I report residue.get_resname() for each adjacent >> residue. >> >> I've tried it with several proteins, and the problem is the same. Chains >> A and C of 2h62 are an example. >> >> I then noticed that the lowest residue ID number of the residues yielded >> from Selection.unfold_entities(chain, 'R') is not 1. For chain A, it's 11, >> and for chain C, it's 34. Not knowing why this was, I thought I'd try >> subtracting the lowest ID number from the indices returned by the >> NeighborSearch step (i.e. in chain A, 11 -> 0 so seq[0] would be the first >> residue, the one with ID 11). This happened to seem to work for chain A. >> However, it gives me negative indices for some of the contacts in chain C. >> This means that NeighborSearch can return residues that are not returned by >> unfold_entities(). The lowest residue ID returned by NeighborSearch for >> chain C was 24, whereas for unfold_entities() it was 34. >> >> For both chains A and C, I was given the warning >> >> PDBConstructionWarning: WARNING: Chain [letter] is discontinuous >> at line [line number]. >> >> In fact, I seem to get this warning for just about every chain of every >> structure I load. Is this the reason that the first residues in the two >> chains are at 11 and 34, rather than 1? If so, could it be that >> NeighborSearch is able to work around the discontinuity while >> unfold_entities is not? >> >> Any suggestions? 
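The mismatch discussed in this thread arises because `residue.get_id()[1]` is the residue number from the PDB file, not a position in the list returned by `Selection.unfold_entities()`. A minimal sketch of one way around it follows, assuming only that each residue object offers a `get_id()` method as Bio.PDB residues do. The helper name `index_by_residue_id` is hypothetical and not part of Biopython.

```python
def index_by_residue_id(residues):
    """Map each residue's id to its position in the given list.

    Works with any objects that provide get_id(); for Bio.PDB residues
    the id is a (hetero flag, sequence number, insertion code) tuple.
    """
    return {residue.get_id(): position
            for position, residue in enumerate(residues)}


# Hypothetical usage with Bio.PDB (shown as comments, not executed here):
#   residues = Selection.unfold_entities(chain, "R")
#   lookup = index_by_residue_id(residues)
#   same_residue = residues[lookup[neighbor.get_id()]]
```

Looking residues up by their full id avoids assuming that numbering starts at 1 or runs without gaps.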
>> >> Thanks for your time and help, >> >> James Jensen >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> > > > From p.j.a.cock at googlemail.com Fri Oct 18 07:08:41 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 18 Oct 2013 12:08:41 +0100 Subject: [Biopython] (Bio.PDB) problems with NeighborSearch: error at levels above "A", residue index discrepancy with unfold_entities In-Reply-To: <52605F3C.7080109@eng.ucsd.edu> References: <521FD389.1090207@eng.ucsd.edu> <52605F3C.7080109@eng.ucsd.edu> Message-ID: On Thu, Oct 17, 2013 at 11:05 PM, James Jensen wrote: > > I was using Python 3.2 when I got the "unorderable types" error. For > unrelated reasons, I ended up switching to Python 2.7.3, and now doing > search_all at the residue level works. Can you reduce that to a short test case? It sounds like something we may need to address in the Python 2/3 compatibility. Thanks, Peter From ajingnk at gmail.com Sat Oct 19 11:16:21 2013 From: ajingnk at gmail.com (Jing Lu) Date: Sat, 19 Oct 2013 11:16:21 -0400 Subject: [Biopython] More efficient neighbor joining algorithm to build phylogenetic tree Message-ID: Hello! I am trying to build a large tree (~10000 nodes) from a distance matrix by neighbor joining algorithm. I just modify the existing code from: https://github.com/lijax/biopython/blob/master/Bio/Phylo/TreeConstruction.py . I thought this might be part of biopython in the future. However, the speed for function nj() (neigbhor joining) is slow. The computational complexity of this function is N**3, and the function takes about 1 day to build a tree with 1000 nodes. I am wondering whether there is any efficient algorithm for neighbor joining in biopython or python. Probably, I can write a function based on "fastphylo: Fast tools for phylogenetics" for biopython. 
Thanks, Jing From p.j.a.cock at googlemail.com Mon Oct 21 13:00:04 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 21 Oct 2013 18:00:04 +0100 Subject: [Biopython] More efficient neighbor joining algorithm to build phylogenetic tree In-Reply-To: References: Message-ID: On Sat, Oct 19, 2013 at 4:16 PM, Jing Lu wrote: > Hello! > > I am trying to build a large tree (~10000 nodes) from a distance matrix by > neighbor joining algorithm. I just modify the existing code from: > https://github.com/lijax/biopython/blob/master/Bio/Phylo/TreeConstruction.py > . > > I thought this might be part of biopython in the future. However, the speed > for function nj() (neigbhor joining) is slow. The computational complexity > of this function is N**3, and the function takes about 1 day to build a > tree with 1000 nodes. > > I am wondering whether there is any efficient algorithm for neighbor > joining in biopython or python. Probably, I can write a function based on > "fastphylo: Fast tools for phylogenetics" for biopython. Does it have to be in pure Python? Whenever I've needed a large tree with 1000s of sequences I have used a fast C implementation, with bootstrapping. Peter From bodington at gmail.com Thu Oct 24 03:33:42 2013 From: bodington at gmail.com (Dylan Bodington) Date: Thu, 24 Oct 2013 07:33:42 +0000 (UTC) Subject: [Biopython] Eftech and db='bioproject'... DTD problem? References: <1371051142.58860.YahooMailNeo@web164005.mail.gq1.yahoo.com> Message-ID: Hi, Is there any more news on this? I'm trying to work with a large list of bioprojects, and I've reached this same issue. Dylan Bodington School of Biosciences and Biotechnology Tokyo Institute of Technology Tokyo Japan From nicolas.joannin at gmail.com Thu Oct 24 04:10:17 2013 From: nicolas.joannin at gmail.com (Nicolas Joannin) Date: Thu, 24 Oct 2013 10:10:17 +0200 Subject: [Biopython] Eftech and db='bioproject'... DTD problem? 
In-Reply-To: References: <1371051142.58860.YahooMailNeo@web164005.mail.gq1.yahoo.com> Message-ID: Hi Dylan, I have not followed up on this matter. In any case, they have not contacted me to let me know. I would say that: either the problem doesn't exist anymore and that means it's fixed, or the problem is still there, and they haven't dealt with it yet. In the latter case, I would suggest emailing the help desk to ask about it: the more people actually ask for it, the quicker they might take care of it... Best regards, Nicolas Nicolas Joannin, Ph.D. Bioinformatics Center Kyoto University, Uji campus, Japan On Thu, Oct 24, 2013 at 9:33 AM, Dylan Bodington wrote: > Hi, > > Is there any more news on this? I'm trying to work with a large list of > bioprojects, and I've reached this same issue. > > Dylan Bodington > School of Biosciences and Biotechnology > Tokyo Institute of Technology > Tokyo > Japan > > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From jgrant at smith.edu Fri Oct 25 13:44:00 2013 From: jgrant at smith.edu (Jessica Grant) Date: Fri, 25 Oct 2013 13:44:00 -0400 Subject: [Biopython] codon bias Message-ID: Hello, I was wondering if anyone had some code to determine effective number of codons in a sequence. I'm working with an organism with a non-canonical genetic code, so I don't think I can use any of the standard packages. Thanks, Jessica From chris.mit7 at gmail.com Fri Oct 25 14:22:56 2013 From: chris.mit7 at gmail.com (Chris Mitchell) Date: Fri, 25 Oct 2013 14:22:56 -0400 Subject: [Biopython] codon bias In-Reply-To: References: Message-ID: Hi Jessica, Could you be somewhat more descriptive of your goals? Are you trying to do something like determine the coding likelihood of a sequence? You can just take the known coding sequences to empirically derive the codon usage frequency. 
That would be a simple script like:

from collections import Counter
fdict = Counter()
for i in xrange(0, len(sequence), 3):
    fdict[sequence[i:i+3]] += 1

Which would give you a dictionary of the counts, from which you can derive
the frequencies.

Chris

On Fri, Oct 25, 2013 at 1:44 PM, Jessica Grant wrote:

> Hello,
>
> I was wondering if anyone had some code to determine effective number of
> codons in a sequence. I'm working with an organism with a non-canonical
> genetic code, so I don't think I can use any of the standard packages.
>
> Thanks,
>
> Jessica

From jgrant at smith.edu Tue Oct 29 09:05:51 2013
From: jgrant at smith.edu (Jessica Grant)
Date: Tue, 29 Oct 2013 09:05:51 -0400
Subject: [Biopython] codon bias
In-Reply-To:
References:
Message-ID:

Hello again,

I am resending to clarify - I am wondering if anyone has implemented
Wright's Effective Number of Codons (as in Wright, F. 1990. The effective
number of codons used in a gene. Gene 87:23-29), or any improved method. I
have tried using codonW but got some wonky results. I am working with
transcriptome data from a non-model organism and want to look at the
relationships between ENc, GC3 and other statistics to tease out any
information about the data in my transcriptome.

Thanks,

Jessica

On Fri, Oct 25, 2013 at 1:44 PM, Jessica Grant wrote:

> Hello,
>
> I was wondering if anyone had some code to determine effective number of
> codons in a sequence. I'm working with an organism with a non-canonical
> genetic code, so I don't think I can use any of the standard packages.
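Wright's effective number of codons (Wright 1990, cited in this thread) can be computed for an arbitrary genetic code once the codon-to-amino-acid mapping is known. What follows is a rough sketch, not a validated implementation. The function name, the dictionary-based genetic code, and the handling of sparse data are assumptions made for illustration: amino acids seen fewer than twice are skipped, non-positive homozygosity values are dropped, and a degeneracy class with no usable data raises an error instead of being estimated. For each amino acid with n observed codons, F = (n * sum(p_i^2) - 1) / (n - 1), and Nc sums, over degeneracy classes, the number of amino acids in the class divided by the class's mean F.

```python
from collections import Counter, defaultdict

BASES = "TCAG"
# Standard genetic code in TCAG order; "*" marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def effective_number_of_codons(cds, codon_to_aa):
    """Estimate Wright's Nc for one in-frame coding sequence.

    codon_to_aa maps DNA codons (T, not U) to one-letter amino acids,
    with "*" for stops, so a non-canonical code is just a modified dict.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))

    # Group sense codons into synonymous families, one per amino acid.
    families = defaultdict(list)
    for codon, aa in codon_to_aa.items():
        if aa != "*":
            families[aa].append(codon)

    # Homozygosity F per amino acid, collected by degeneracy class.
    homozygosity = defaultdict(list)
    for codons in families.values():
        k = len(codons)
        n = sum(counts[c] for c in codons)
        if k == 1 or n < 2:
            continue  # assumption: too little data, skip this amino acid
        sum_p2 = sum((counts[c] / n) ** 2 for c in codons)
        f = (n * sum_p2 - 1) / (n - 1)
        if f > 0:  # assumption: drop non-positive estimates
            homozygosity[k].append(f)

    nc = 0.0
    class_sizes = Counter(len(codons) for codons in families.values())
    for k, n_amino_acids in class_sizes.items():
        if k == 1:
            nc += n_amino_acids  # single-codon amino acids count as 1 each
            continue
        values = homozygosity.get(k)
        if not values:
            raise ValueError("no usable data for %d-fold amino acids" % k)
        nc += n_amino_acids / (sum(values) / len(values))

    # Nc cannot exceed the number of sense codons in the code.
    sense_codons = sum(len(codons) for codons in families.values())
    return min(nc, float(sense_codons))
```

A non-canonical code can be passed as a modified copy of the table, for example `dict(STANDARD_CODE, TAA="Q", TAG="Q")` for a code in which TAA and TAG encode glutamine. Results from a sketch like this should be checked against an established tool before being relied on.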
> > Thanks, > > Jessica > > > From p.j.a.cock at googlemail.com Wed Oct 30 07:15:02 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 30 Oct 2013 11:15:02 +0000 Subject: [Biopython] codon bias In-Reply-To: References: Message-ID: On Tue, Oct 29, 2013 at 1:05 PM, Jessica Grant wrote: > Hello again, > > I am resending to clarify - I am wondering if anyone has > implemented Wright's Effective Number of Codons (as in Wright, F. 1990. > The effective number of codons used in a gene. Gene 87:23-29), or any > improved method. I have tried using codonW but got some wonky results. I > am working with transcriptome data from a non-model organism and want to > look at the relationships between ENc, GC3 and other statistics to tease > out any information about the data in my transcriptome. > > Thanks, > > Jessica > > On Fri, Oct 25, 2013 at 1:44 PM, Jessica Grant wrote: > >> Hello, >> >> I was wondering if anyone had some code to determine effective number of >> codons in a sequence. I'm working with an organism with a non-canonical >> genetic code, so I don't think I can use any of the standard packages. >> >> Thanks, >> >> Jessica >> I emailed Frank (we both work at the James Hutton Institute, although he is under the BioSS organisation): http://www.hutton.ac.uk/staff/frank-wright http://www.bioss.ac.uk/people/frank.html Frank suggests looking at the EMBOSS implementation 'chips', http://emboss.sourceforge.net/apps/release/6.5/emboss/apps/chips.html Peter From fernando.j at inbox.com Wed Oct 30 10:22:48 2013 From: fernando.j at inbox.com (john fernando) Date: Wed, 30 Oct 2013 06:22:48 -0800 Subject: [Biopython] generate phylogenetic tree Message-ID: <34F96A989B6.000011C3fernando.j@inbox.com> Hi, first off, I am very new to the bioinformatics/biopython world so this may come as a naive question, so I apologize in advance. I extracted some sequences of PDB, aligned them using BLOSUM62 and have "scores". 
I was wondering if anyone can give tips/advice on I can set about generating a phylogenetic tree of the results to graphically show the clusters of similar sequences? I want to do this for my 'own' substitution matrix (next step). I am asking not necessarily code but more tools that people have used that can do this using the "scores" I have calculated. Thank you, John ____________________________________________________________ FREE 3D EARTH SCREENSAVER - Watch the Earth right on your desktop! Check it out at http://www.inbox.com/earth From jgrant at smith.edu Wed Oct 30 11:30:19 2013 From: jgrant at smith.edu (Jessica Grant) Date: Wed, 30 Oct 2013 11:30:19 -0400 Subject: [Biopython] codon bias In-Reply-To: References: Message-ID: I have chips up and running! Thanks so much! On Wed, Oct 30, 2013 at 7:15 AM, Peter Cock wrote: > On Tue, Oct 29, 2013 at 1:05 PM, Jessica Grant wrote: > > Hello again, > > > > I am resending to clarify - I am wondering if anyone has > > implemented Wright's Effective Number of Codons (as in Wright, F. 1990. > > The effective number of codons used in a gene. Gene 87:23-29), or any > > improved method. I have tried using codonW but got some wonky results. > I > > am working with transcriptome data from a non-model organism and want to > > look at the relationships between ENc, GC3 and other statistics to tease > > out any information about the data in my transcriptome. > > > > Thanks, > > > > Jessica > > > > On Fri, Oct 25, 2013 at 1:44 PM, Jessica Grant wrote: > > > >> Hello, > >> > >> I was wondering if anyone had some code to determine effective number of > >> codons in a sequence. I'm working with an organism with a non-canonical > >> genetic code, so I don't think I can use any of the standard packages. 
> >>
> >> Thanks,
> >>
> >> Jessica
> >>
>
> I emailed Frank (we both work at the James Hutton Institute, although
> he is under the BioSS organisation):
> http://www.hutton.ac.uk/staff/frank-wright
> http://www.bioss.ac.uk/people/frank.html
>
> Frank suggests looking at the EMBOSS implementation 'chips',
> http://emboss.sourceforge.net/apps/release/6.5/emboss/apps/chips.html
>
> Peter
>

From ribozyme at ioz.ac.cn Wed Oct 30 23:03:21 2013
From: ribozyme at ioz.ac.cn (WU)
Date: Thu, 31 Oct 2013 11:03:21 +0800 (GMT+08:00)
Subject: [Biopython] generate phylogenetic tree
In-Reply-To:
References:
Message-ID:

To Mr. Fernando,

In Biopython there is a module, Bio.Phylo, which can draw trees. Bio.Phylo
doesn't infer trees from alignments itself, but third-party programs such
as PhyML do, and these are supported through the module
Bio.Phylo.Applications. Other software can also construct trees from
alignment results, including MrBayes and PHYLIP. See section 13.5 of
http://biopython.org/DIST/docs/tutorial/Tutorial.html for further
information.

Best wishes

Wu Qi

From eric.talevich at gmail.com Thu Oct 31 17:38:34 2013
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 31 Oct 2013 14:38:34 -0700
Subject: [Biopython] generate phylogenetic tree
In-Reply-To: <34F96A989B6.000011C3fernando.j@inbox.com>
References: <34F96A989B6.000011C3fernando.j@inbox.com>
Message-ID:

On Wed, Oct 30, 2013 at 7:22 AM, john fernando wrote:

> Hi,
>
> first off, I am very new to the bioinformatics/biopython world so this may
> come as a naive question, so I apologize in advance.
>
> I extracted some sequences of PDB, aligned them using BLOSUM62 and have
> "scores".
>
> I was wondering if anyone can give tips/advice on I can set about
> generating a phylogenetic tree of the results to graphically show the
> clusters of similar sequences?
>
> I want to do this for my 'own' substitution matrix (next step).
> > I am asking not necessarily code but more tools that people have used that > can do this using the "scores" I have calculated. > Thank you, > John > Hi John, To quickly get a tree to look at, given a multiple sequence alignment, I recommend FastTree. http://www.microbesonline.org/fasttree/ If you'd prefer a graphical program to start with, ClustalX and JalView are both capable of building trees with a neighbor-joining algorithm, among other things. http://www.clustal.org/clustal2/ http://www.jalview.org/ To view a large tree and apply your own highlighting and colorization, try Archaeopteryx. https://sites.google.com/site/cmzmasek/home/software/archaeopteryx Back on the command line, some of the EMBOSS tools allow you to supply your own scoring matrix, and so does Phylip, I think. http://emboss.sourceforge.net/ http://evolution.genetics.washington.edu/phylip.html If none of those work for you and you'd like to try building a tree from your own distance matrix using Biopython, this is possible with Yanbo Ye's recent work on another development branch: http://biopython.org/wiki/Phylo#Upcoming_GSoC_2013_features https://github.com/lijax/biopython/ Hope that helps, Eric From devaniranjan at gmail.com Thu Oct 31 22:04:21 2013 From: devaniranjan at gmail.com (George Devaniranjan) Date: Thu, 31 Oct 2013 22:04:21 -0400 Subject: [Biopython] generate phylogenetic tree In-Reply-To: References: <34F96A989B6.000011C3fernando.j@inbox.com> Message-ID: While I have never used PHYLIP a lot , I would really recommend their FAQ's, they give some great resources (both online and books ) to get you started. Eric has given some great tips too, hopefully all this will be of help to you-Good luck. On Thu, Oct 31, 2013 at 5:38 PM, Eric Talevich wrote: > On Wed, Oct 30, 2013 at 7:22 AM, john fernando > wrote: > > > Hi, > > > > first off, I am very new to the bioinformatics/biopython world so this > may > > come as a naive question, so I apologize in advance. 
> > > > I extracted some sequences of PDB, aligned them using BLOSUM62 and have > > "scores". > > > > I was wondering if anyone can give tips/advice on I can set about > > generating a phylogenetic tree of the results to graphically show the > > clusters of similar sequences? > > > > I want to do this for my 'own' substitution matrix (next step). > > > > I am asking not necessarily code but more tools that people have used > that > > can do this using the "scores" I have calculated. > > Thank you, > > John > > > > Hi John, > > To quickly get a tree to look at, given a multiple sequence alignment, I > recommend FastTree. > http://www.microbesonline.org/fasttree/ > > If you'd prefer a graphical program to start with, ClustalX and JalView are > both capable of building trees with a neighbor-joining algorithm, among > other things. > http://www.clustal.org/clustal2/ > http://www.jalview.org/ > > To view a large tree and apply your own highlighting and colorization, try > Archaeopteryx. > https://sites.google.com/site/cmzmasek/home/software/archaeopteryx > > Back on the command line, some of the EMBOSS tools allow you to supply your > own scoring matrix, and so does Phylip, I think. 
> http://emboss.sourceforge.net/
> http://evolution.genetics.washington.edu/phylip.html
>
> If none of those work for you and you'd like to try building a tree from
> your own distance matrix using Biopython, this is possible with Yanbo Ye's
> recent work on another development branch:
> http://biopython.org/wiki/Phylo#Upcoming_GSoC_2013_features
> https://github.com/lijax/biopython/
>
> Hope that helps,
> Eric
>

From idoerg at gmail.com Wed Oct 9 13:03:10 2013
From: idoerg at gmail.com (Iddo Friedberg)
Date: Wed, 9 Oct 2013 09:03:10 -0400
Subject: [Biopython] PyTennessee
Message-ID:

Hi all

If there are any Biopython people in the area, there's a PyTennessee
conference Feb 22-23 2014 in Nashville. They are taking calls for proposals
now.

http://www.pytennessee.org/speaking/cfp/

Thanks to Jeff Chang for the info.

Best,

Iddo

--
Iddo Friedberg
http://iddo-friedberg.net/contact.html
++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.>
++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----.
.>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>>
>>----.<--.>++++++.<<<<------------------------------------.
From rlinder at austin.utexas.edu Wed Oct 9 20:54:34 2013 From: rlinder at austin.utexas.edu (Randy Linder) Date: Wed, 9 Oct 2013 15:54:34 -0500 Subject: [Biopython] installing Biopython on Pyzo Message-ID: <5255C28A.20500@austin.utexas.edu> Hi, I'm using the Pyzo distro of Python 3.3 on the Windows 8 desktop and am having trouble installing Biopython. When I attempt to use the Windows installer it says it cannot find Python 3.3. in the registry. Is there a work around for this problem. Thanks in advance for any help you can provide. Randy Linder -- ___________________________________________ Randy Linder Associate Professor Office (512) 471-7825 Lab (512) 471-7826 FAX (512) 232-9529 _U.S. Postal Service address_: Section of Integrative Biology University of Texas 1 University Station #C0930 Austin, TX 78712 _FedEx, UPS, etc. address_: Section of Integrative Biology The University of Texas at Austin Biological Laboratories 404 2401 Whitis Austin, TX 78712 USA From bitsink at gmail.com Wed Oct 9 23:04:37 2013 From: bitsink at gmail.com (Nam Nguyen) Date: Wed, 9 Oct 2013 16:04:37 -0700 Subject: [Biopython] installing Biopython on Pyzo In-Reply-To: <5255C28A.20500@austin.utexas.edu> References: <5255C28A.20500@austin.utexas.edu> Message-ID: Hi Randy, I'm new to Biopython so my reply may be a complete non-sense. But, Pyzo's web site (http://www.pyzo.org/distro.html) says in its __future__ feature: - Activating the Pyzo distro so Windows installers (e.g. from gohlke) just work. On Wed, Oct 9, 2013 at 1:54 PM, Randy Linder wrote: > Hi, > > I'm using the Pyzo distro of Python 3.3 on the Windows 8 desktop and am > having trouble installing Biopython. When I attempt to use the Windows > installer it says it cannot find Python 3.3. in the registry. Is there a > work around for this problem. > > Thanks in advance for any help you can provide. 
> > Randy Linder > > > -- > > ______________________________**_____________ > Randy Linder > Associate Professor > Office (512) 471-7825 > Lab (512) 471-7826 > FAX (512) 232-9529 > > _U.S. Postal Service address_: > Section of Integrative Biology > University of Texas > 1 University Station #C0930 > Austin, TX 78712 > > _FedEx, UPS, etc. address_: > Section of Integrative Biology > The University of Texas at Austin > Biological Laboratories 404 > 2401 Whitis > Austin, TX 78712 > USA > > > ______________________________**_________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/**mailman/listinfo/biopython > From p.j.a.cock at googlemail.com Thu Oct 10 08:57:37 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 10 Oct 2013 09:57:37 +0100 Subject: [Biopython] installing Biopython on Pyzo In-Reply-To: References: <5255C28A.20500@austin.utexas.edu> Message-ID: Hi Randy, Nam, Yes, it sounds like Pyzo isn't recording itself in the Windows registry in the same way that the Official Python installers do, and therefore our installer can't find it. I've not heard of Pyzo but there are other Python bundles which include Biopython, e.g. the Enthought Python Distribution which is now branded as Canopy (although their package for Biopython is currently a bit out of date): https://www.enthought.com/products/canopy/package-index/ Regards, Peter On Thu, Oct 10, 2013 at 12:04 AM, Nam Nguyen wrote: > Hi Randy, > > I'm new to Biopython so my reply may be a complete non-sense. But, Pyzo's > web site (http://www.pyzo.org/distro.html) says in its __future__ feature: > > > - Activating the Pyzo distro so Windows installers (e.g. from gohlke) > just work. > > > > On Wed, Oct 9, 2013 at 1:54 PM, Randy Linder wrote: > >> Hi, >> >> I'm using the Pyzo distro of Python 3.3 on the Windows 8 desktop and am >> having trouble installing Biopython. When I attempt to use the Windows >> installer it says it cannot find Python 3.3. 
in the registry. Is there a >> work around for this problem. >> >> Thanks in advance for any help you can provide. >> >> Randy Linder >> >> >> -- >> >> ______________________________**_____________ >> Randy Linder >> Associate Professor >> Office (512) 471-7825 >> Lab (512) 471-7826 >> FAX (512) 232-9529 >> >> _U.S. Postal Service address_: >> Section of Integrative Biology >> University of Texas >> 1 University Station #C0930 >> Austin, TX 78712 >> >> _FedEx, UPS, etc. address_: >> Section of Integrative Biology >> The University of Texas at Austin >> Biological Laboratories 404 >> 2401 Whitis >> Austin, TX 78712 >> USA >> >> >> ______________________________**_________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/**mailman/listinfo/biopython >> > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From ericmajinglong at gmail.com Thu Oct 10 19:33:43 2013 From: ericmajinglong at gmail.com (Eric Ma) Date: Thu, 10 Oct 2013 15:33:43 -0400 Subject: [Biopython] How to do RNA-RNA hybridization searches? Message-ID: Hi everybody, I'm looking to try my hand at doing the following problem: I have a sequence of RNA, say, "RNA SEQUENCE 1": "...CGACGUAUGUUAUGAGCUGAUGGUCACGUGUCGAUGGCUAUAG..." I have another sequence of RNA, say "RNA SEQUENCE 2". I'd like to do 15-mer sliding window searches from RNA Sequence 1 (it'll take a long time, I'm sure, but nonetheless that's what I'd like to do), such that I search: "[*CGACGUAUGUUAUGA*]GCUGAUGGUCACGUGUCGAUGGCUAUAG" "C[*GACGUAUGUUAUGAG*]CUGAUGGUCACGUGUCGAUGGCUAUAG" "CG[*ACGUAUGUUAUGAGC*]UGAUGGUCACGUGUCGAUGGCUAUAG" etc. etc. along RNA Sequence 2, to see if I can find that same region present. I was wondering if there was some package that could do that, that either BioPython interfaces with, or is separately implemented as a Python package. 
Does anybody know if there is such a thing?

Cheers,
Eric
-----------------------------------------------------------------------
Please consider the environment before printing this e-mail. Do you really
need to print it?

http://about.me/ericmjl

From chris.mit7 at gmail.com Thu Oct 10 19:59:00 2013
From: chris.mit7 at gmail.com (Chris Mitchell)
Date: Thu, 10 Oct 2013 15:59:00 -0400
Subject: [Biopython] How to do RNA-RNA hybridization searches?
In-Reply-To:
References:
Message-ID:

If you want to do it in Python, it's fairly trivial.

a = 'CGACGUAUGUUAUGAGCUGAUGGUCACGUGUCGAUGGCUAUAG'
for i in xrange(0, len(a)-14):
    if a[i:i+15] in seq2:
        pass  # do this (handle the match here)

or is there some reason you aren't taking this approach?

On Thu, Oct 10, 2013 at 3:33 PM, Eric Ma wrote:

> Hi everybody,
>
> I'm looking to try my hand at doing the following problem:
>
> I have a sequence of RNA, say, "RNA SEQUENCE 1":
> "...CGACGUAUGUUAUGAGCUGAUGGUCACGUGUCGAUGGCUAUAG..."
>
> I have another sequence of RNA, say "RNA SEQUENCE 2".
>
> I'd like to do 15-mer sliding window searches from RNA Sequence 1 (it'll
> take a long time, I'm sure, but nonetheless that's what I'd like to do),
> such that I search:
>
> "[*CGACGUAUGUUAUGA*]GCUGAUGGUCACGUGUCGAUGGCUAUAG"
> "C[*GACGUAUGUUAUGAG*]CUGAUGGUCACGUGUCGAUGGCUAUAG"
> "CG[*ACGUAUGUUAUGAGC*]UGAUGGUCACGUGUCGAUGGCUAUAG" etc. etc.
>
> along RNA Sequence 2, to see if I can find that same region present.
>
> I was wondering if there was some package that could do that, that either
> BioPython interfaces with, or is separately implemented as a Python
> package. Does anybody know if there is such a thing?
>
> Cheers,
> Eric
> -----------------------------------------------------------------------
> Please consider the environment before printing this e-mail. Do you really
> need to print it?
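The sliding-window search discussed in this thread can be sketched in Python 3 without any extra package. This is an illustration under stated assumptions, not an established Biopython API: the names `shared_windows` and `reverse_complement_rna` are hypothetical, both sequences are plain strings over A, C, G and U, and only exact matches are reported. Indexing the windows of the second sequence once avoids rescanning it for every window of the first.

```python
def shared_windows(seq1, seq2, k=15):
    """Return (start_in_seq1, start_in_seq2, window) for every k-mer
    of seq1 that also occurs, exactly, in seq2."""
    # Index every k-mer of seq2 by its start positions.
    positions = {}
    for j in range(len(seq2) - k + 1):
        positions.setdefault(seq2[j:j + k], []).append(j)

    # Slide a k-wide window along seq1 and look each one up.
    hits = []
    for i in range(len(seq1) - k + 1):
        window = seq1[i:i + k]
        for j in positions.get(window, []):
            hits.append((i, j, window))
    return hits


def reverse_complement_rna(seq):
    """Reverse complement of an RNA string (A<->U, C<->G)."""
    return seq.translate(str.maketrans("ACGU", "UGCA"))[::-1]
```

Calling `shared_windows(seq1, seq2)` finds regions present in both sequences, as asked in the original question. If the goal is to find regions that could base-pair with each other, one option is to search against `reverse_complement_rna(seq2)` instead.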
From ivangreg at gmail.com Thu Oct 10 20:00:54 2013
From: ivangreg at gmail.com (Ivan Gregoretti)
Date: Thu, 10 Oct 2013 16:00:54 -0400
Subject: [Biopython] How to do RNA-RNA hybridization searches?
In-Reply-To:
References:
Message-ID:

I suggest that you try pairwise2 (the Bio.pairwise2 module).

Ivan

Ivan Gregoretti, PhD
Bioinformatics
From davidsshin at lbl.gov Thu Oct 10 22:36:57 2013
From: davidsshin at lbl.gov (David Shin)
Date: Thu, 10 Oct 2013 15:36:57 -0700
Subject: [Biopython] general scripting help
Message-ID:

Hi -

I am trying to write a script to parse through 50 or so deltablast .xml
files.

File names are:
xaa.xml
xab.xml
xac.xml
...

I'm new (2 days) to python, biopython, and just trying to have something to
show for a meeting tomorrow. I have my script working well enough for one
file, I was wondering if there was a way to go thru each file separately
and output according to file name.

ie. I'm trying to replace "xaa" in lines 3 and 12 with a wildcard like x??
or even x*

    # This part gets the length of the query and stores to a variable
    from Bio import SeqIO
    record = SeqIO.read("xaa", "fasta")
    query_length = len(record)
    #print "query length:", query_length

    #This part gets the user's high and low percent identity cutoffs
    high_percent_cutoff = float(input("Enter high percent cutoff: "))
    low_percent_cutoff = float(input("Enter low percent cutoff: "))

    # This part does the comparison to all the hits
    result_handle = open("xaa.xml")
    from Bio.Blast import NCBIXML
    blast_record = NCBIXML.read(result_handle)

    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            alignment_length = alignment.length
            identical_residues = hsp.identities
            percent_identity = float(identical_residues) / float(query_length)
            if (alignment_length > query_length * 0.9 and
                    alignment_length < query_length * 1.1 and
                    percent_identity > low_percent_cutoff and
                    percent_identity <= high_percent_cutoff):
                print "****Alignment****"
                print "sequence:", alignment.title
                print "query length:", query_length
                print "alignment length:", alignment.length
                print "identical residues:", identical_residues
                print "percent identity:", percent_identity
                print
                print "12345678901234567890123456789012345678901234567890123456789012345678901234567890"
                print hsp.query[:80]
                print hsp.match[:80]
                print hsp.sbjct[:80]

Thanks for any help.
Dave

From nathaniel.echols at gmail.com Thu Oct 10 22:53:16 2013
From: nathaniel.echols at gmail.com (Nat Echols)
Date: Thu, 10 Oct 2013 15:53:16 -0700
Subject: [Biopython] general scripting help
In-Reply-To:
References:
Message-ID:

Put all that code into a function with the file name or prefix as an
argument, then iterate over possible files:

    def extract_alignments(prefix):
        # old code goes here - append ".xml" to prefix to get alignment file name
        pass

    for prefix in ["xaa", "xab", "xac"]:
        extract_alignments(prefix)

Or you could do this:

    import os.path
    import sys
    for file_name in sys.argv[1:]:
        prefix = os.path.splitext(file_name)[0]
        extract_alignments(prefix)

And run as:

    python my_script.py x*.xml

Assuming you have a real OS installed, of course - I'm not sure whether
Windows supports wildcards too.

-Nat
From anaryin at gmail.com Thu Oct 10 22:54:26 2013
From: anaryin at gmail.com (João Rodrigues)
Date: Fri, 11 Oct 2013 00:54:26 +0200
Subject: [Biopython] general scripting help
In-Reply-To:
References:
Message-ID:

Hey David,

If you put all your input files in one directory, all you need is a for
loop and the os module (import os) with its listdir function.
Then you can save the output to a file instead of printing to screen.

    import os
    from Bio import SeqIO

    directory = '/home/dave/xmlfiles/'
    for each_file in os.listdir(directory):
        # os.listdir() returns bare file names, so prepend the directory
        input_file = os.path.join(directory, each_file)
        output_file = input_file + '_output'
        fhandle = open(output_file, 'w')
        record = SeqIO.read(input_file, "fasta")
        query_length = len(record)
        # Print to file, use string formatting to make life easier
        fhandle.write("query length: {0}\n".format(query_length))
        # ... the rest of the processing goes here ...
        fhandle.close()

You might want to have a look here: http://pythonforbiologists.com/

Cheers and good luck for tomorrow,

João
From bitsink at gmail.com Thu Oct 10 22:56:04 2013
From: bitsink at gmail.com (Nam Nguyen)
Date: Thu, 10 Oct 2013 15:56:04 -0700
Subject: [Biopython] general scripting help
In-Reply-To:
References:
Message-ID:

The glob module can help here too.

    import glob
    for filename in glob.glob('*.xml'):
        # note: filename still carries the ".xml" extension here
        extract_alignments(filename)

You do not have to worry about having a "real OS".
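A slightly fuller sketch of the glob approach, assuming the function from the earlier reply expects a prefix such as "xaa" rather than a full file name. The helper name xml_prefixes and the directory argument are made up for illustration; only the standard library is used (Python 3 syntax).

```python
import glob
import os


def xml_prefixes(directory="."):
    """Return the sorted prefixes ('xaa', 'xab', ...) of all *.xml files
    found directly inside directory."""
    pattern = os.path.join(directory, "*.xml")
    prefixes = []
    for path in sorted(glob.glob(pattern)):
        base = os.path.basename(path)                # e.g. 'xaa.xml'
        prefixes.append(os.path.splitext(base)[0])   # e.g. 'xaa'
    return prefixes


for prefix in xml_prefixes("."):
    # In the real script, call extract_alignments(prefix) here instead.
    print(prefix)
```

Because glob does the wildcard expansion inside Python, this behaves the same whether or not the shell expands x*.xml.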
Cheers,
Nam
From ericmajinglong at gmail.com Fri Oct 11 00:02:19 2013
From: ericmajinglong at gmail.com (Eric Ma)
Date: Thu, 10 Oct 2013 17:02:19 -0700 (PDT)
Subject: [Biopython] How to do RNA-RNA hybridization searches?
In-Reply-To:
References:
Message-ID: <1381449739213.990e2a93@Nodemailer>

I might try Ivan's approach. I was also trying to accommodate non-fully
complementary portions. My bad for not stating this.

Thanks everybody!

Cheers,
Eric

Sent from a mobile device. Please pardon typo errors.
From davidsshin at lbl.gov Fri Oct 11 00:24:14 2013
From: davidsshin at lbl.gov (David Shin)
Date: Thu, 10 Oct 2013 17:24:14 -0700
Subject: [Biopython] general scripting help
In-Reply-To:
References:
Message-ID:

Thanks everyone, I'm going to try to get each to work.

Also, I meant to say I've been on Biopython for 2 days... but I have been
trying to do general Python tutorials for a couple of weeks now, and I would
never have come up with those suggestions.

Thanks again, and thanks for the link also.
--
David Shin, Ph.D
Lawrence Berkeley National Labs
1 Cyclotron Road
MS 83-R0101
Berkeley, CA 94720
USA

From golubchi at stats.ox.ac.uk Tue Oct 15 14:40:29 2013
From: golubchi at stats.ox.ac.uk (Tanya Golubchik)
Date: Tue, 15 Oct 2013 15:40:29 +0100
Subject: [Biopython] Blast using Biopython
Message-ID: <525D53DD.5040903@stats.ox.ac.uk>

Hi guys,

This is strictly speaking more about blast than biopython, but I was
wondering if anyone has any tips on doing the following: searching for a hit
in a nucleotide database using tblastn, but reporting the actual DNA sequence
of the subject, rather than the translated protein sequence. Is there by any
chance a way of extracting this from the XML output?

What I'm finding is that blastn sometimes misses the edges, where
substitutions close to the ends of my hit result in a truncated hit (rather
than a complete hit with a mismatch or two). The full hit is reported
correctly by tblastn, but of course this returns the protein translation
rather than the original nucleotide sequence. It's probably a long shot, but
just wondering if anyone has ideas -- the brute force approach would be to
get the start and stop positions from tblastn and then extract and re-align
this fragment to my query, but that seems redundant given that blast has
already done this for me...
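The extraction step of the brute-force route described above (take the subject start and stop from the tblastn hit and cut that stretch out of the nucleotide subject) could look roughly like this. It is a sketch under assumptions that are not confirmed in this thread: coordinates are 1-based and inclusive as BLAST reports them, a negative subject frame means the hit lies on the reverse strand, and the subject DNA is already available as a plain string. The function name and the toy sequence are made up. With Bio.Blast.NCBIXML the inputs would presumably come from hsp.sbjct_start, hsp.sbjct_end and the subject element of hsp.frame.

```python
# Translation table used to complement a DNA string.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def subject_dna(subject_seq, start, end, frame):
    """Cut the nucleotides covered by a tblastn hit out of subject_seq.

    start/end are 1-based inclusive subject coordinates (in either order);
    a negative frame returns the reverse complement of the stretch.
    """
    lo, hi = min(start, end), max(start, end)
    fragment = subject_seq[lo - 1:hi]  # convert to 0-based, end-exclusive
    if frame < 0:
        fragment = fragment.translate(_COMPLEMENT)[::-1]
    return fragment


subject = "GGGATGAAACCCTAGGG"
print(subject_dna(subject, 4, 15, 1))    # forward-strand hit
print(subject_dna(subject, 4, 15, -1))   # same span on the reverse strand
```

Whether the start and end reported for reverse-frame hits come in ascending or descending order should be checked against real output; the min/max above is there so that either order works.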
Thanks
Tanya

From mmokrejs at fold.natur.cuni.cz Tue Oct 15 22:33:32 2013
From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs)
Date: Wed, 16 Oct 2013 00:33:32 +0200
Subject: [Biopython] Blast using Biopython
In-Reply-To: <525D53DD.5040903@stats.ox.ac.uk>
References: <525D53DD.5040903@stats.ox.ac.uk>
Message-ID: <525DC2BC.7050300@fold.natur.cuni.cz>

Hi Tanya,

I suppose you use the newer ncbi-tools++ suite. Try the legacy blastn from
the ncbi-tools suite. The version numbering is the same ... I have had
better experience with "blastall -p blastn" from the old suite. You can also
try to find some switch to force the really old blastn algorithm buried in
blastall (nowadays blastall uses the new algorithm, which is in the new
ncbi-tools++ suite). However, experience shows that "blastall -p blastn"
gives different results compared to blastn, although BOTH should in theory
be using the new algorithm. With the possibility of forcing the real
predecessor of the algorithm in blastall, you have a third method to test.

blastall gives you only limited results in its CSV-formatted output; you
cannot change the output columns. For me, the important results can only be
parsed from the XML/plaintext output of blastall.

You can increase the reward for a match ("-r 2") to overcome some gaps at
the ends, but it depends on what queries you have and on whether that gives
you falsely widened alignments elsewhere. You have to test that.

Good luck,
Martin
From jordan.r.willis at Vanderbilt.Edu Tue Oct 15 23:56:56 2013
From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R)
Date: Tue, 15 Oct 2013 23:56:56 +0000
Subject: [Biopython] Blast using Biopython
In-Reply-To: <525D53DD.5040903@stats.ox.ac.uk>
References: <525D53DD.5040903@stats.ox.ac.uk>
Message-ID:

Tanya,

Does it have to be XML? Could you try -outfmt 7 and request qseq and sseq,
which will return the aligned part of the sequence from the query and the
subject?

J
From ddufour at pcb.ub.cat Thu Oct 17 08:33:29 2013
From: ddufour at pcb.ub.cat (David Dufour Rausell)
Date: Thu, 17 Oct 2013 10:33:29 +0200
Subject: [Biopython] is_na()?
Message-ID: <6A16A013-BDD7-4785-9B12-86238EE1BA78@pcb.ub.cat>

Hello,

I would like to know if there is a function like is_aa(residue) but to test
if a residue is RNA. Basically, what I want to know is whether a chain is
RNA or not. Thanks in advance.

David Dufour Rausell

Genome Biology Group - Centre Nacional d'Anàlisis Genòmic (CNAG)
Structural Genomics Group - Centre de Regulació Genòmica (CRG)

Parc Cientific de Barcelona - Torre I - Baldiri Reixac, 4 - 2a.p - 08028 - Barcelona
Tel +34 93 4020542
email ddufour at pcb.ub.cat

From p.j.a.cock at googlemail.com Thu Oct 17 08:49:15 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 17 Oct 2013 09:49:15 +0100
Subject: [Biopython] is_na()?
In-Reply-To: <6A16A013-BDD7-4785-9B12-86238EE1BA78@pcb.ub.cat>
References: <6A16A013-BDD7-4785-9B12-86238EE1BA78@pcb.ub.cat>
Message-ID:

Hello David,

Are you asking about residues in 3D structures (e.g. from Bio.PDB), letters
in sequences (e.g. strings or Seq objects), or another context?

If you mean sequences and are using Seq objects, you may be able to look at
the alphabet.
However, most file formats do not define this explicitly, but you can tell
the SeqIO parsers via the optional alphabet argument.

Peter

From ddufour at pcb.ub.cat Thu Oct 17 10:07:45 2013
From: ddufour at pcb.ub.cat (David Dufour Rausell)
Date: Thu, 17 Oct 2013 12:07:45 +0200
Subject: [Biopython] is_na()?
In-Reply-To: <6A16A013-BDD7-4785-9B12-86238EE1BA78@pcb.ub.cat>
References: <6A16A013-BDD7-4785-9B12-86238EE1BA78@pcb.ub.cat>
Message-ID:

Hello again,

I forgot to mention that I'm working with PDB files, so I'm using the
Bio.PDB module. I'm thinking of extracting the sequence from each chain and
checking whether it is made of RNA residues, but any other idea will be very
welcome.

Thanks!

David Dufour Rausell

Genome Biology Group - Centre Nacional d'Anàlisis Genòmic (CNAG)
Structural Genomics Group - Centre de Regulació Genòmica (CRG)

Parc Cientific de Barcelona - Torre I - Baldiri Reixac, 4 - 2a.p - 08028 - Barcelona
Tel +34 93 4020542
email ddufour at pcb.ub.cat

From harijay at gmail.com Thu Oct 17 11:44:08 2013
From: harijay at gmail.com (hari jayaram)
Date: Thu, 17 Oct 2013 07:44:08 -0400
Subject: [Biopython] Downloadable html documentation?
Message-ID:

Hi,

I recently started using a very fast offline, OS X-only documentation
browser called Dash.app.
It works great for searching documentation very quickly.

It has the ability to load in "docsets" and comes with a set of instructions
for how to build your own docset starting from HTML files.

Is there an archive of the HTML API documentation for Biopython?

Thanks
Hari

From p.j.a.cock at googlemail.com Thu Oct 17 11:52:05 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 17 Oct 2013 12:52:05 +0100
Subject: [Biopython] Downloadable html documentation?
In-Reply-To:
References:
Message-ID:

All the API documentation is pulled from the docstrings in the Python code -
can you just point Dash.app at the source code instead?

Peter

From jordan.r.willis at Vanderbilt.Edu Thu Oct 17 11:34:13 2013
From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R)
Date: Thu, 17 Oct 2013 11:34:13 +0000
Subject: [Biopython] is_na()?
In-Reply-To:
References: <6A16A013-BDD7-4785-9B12-86238EE1BA78@pcb.ub.cat>
Message-ID:

I would just get the name of the residue and ask if it's in the standard
amino acid library. If you want to stay entirely within Bio.PDB, the
Polypeptide module has a three_to_one function that only covers the 20
naturally occurring amino acids by their three-letter codes.
Do something like:

    from Bio.PDB.PDBParser import PDBParser
    from Bio.PDB.Polypeptide import three_to_one

    structure = PDBParser().get_structure('XXXX', 'your_structure.pdb')
    residues = structure.get_residues()
    for resi in residues:
        try:
            # raises KeyError for anything that is not a standard amino acid
            three_to_one(resi.get_resname())
        except KeyError:
            print "residue {0} {1} on chain {2} is not a standard amino acid".format(
                resi.get_id()[1], resi.get_resname(), resi.get_parent().get_id())
From anaryin at gmail.com Thu Oct 17 12:52:43 2013
From: anaryin at gmail.com (João Rodrigues)
Date: Thu, 17 Oct 2013 14:52:43 +0200
Subject: [Biopython] is_na()?
In-Reply-To:
References: <6A16A013-BDD7-4785-9B12-86238EE1BA78@pcb.ub.cat>
Message-ID:

That list also has all the nucleic acids, so it's a matter of parsing it and
using only a portion of it.

Cheers,

João
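One way to approximate an is_na-style check is to compare residue names against the standard RNA names. This is a minimal sketch, not a Bio.PDB function: it assumes the four unmodified RNA residues are named A, C, G and U (as in current PDB files), strips any padding around the name, and treats a chain as RNA when at least min_fraction of its residues match. The function names and the 0.9 threshold are made up; modified nucleotides carry other names and would not be recognised.

```python
# Standard (unmodified) RNA residue names as they appear in PDB files.
RNA_RESNAMES = {"A", "C", "G", "U"}


def is_rna(resname):
    """True if a residue name (possibly space-padded) is a standard RNA base."""
    return resname.strip().upper() in RNA_RESNAMES


def chain_is_rna(resnames, min_fraction=0.9):
    """True if at least min_fraction of the residue names are RNA."""
    names = list(resnames)
    if not names:
        return False
    n_rna = sum(1 for name in names if is_rna(name))
    return n_rna / len(names) >= min_fraction


print(chain_is_rna(["  G", "  C", "  A", "  U"]))
print(chain_is_rna(["ALA", "GLY", "SER"]))
```

With Bio.PDB the names would come from the residues of a chain, for example chain_is_rna(res.get_resname() for res in chain).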
> > > > David Dufour Rausell > > > > Genome Biology Group - Centre Nacional d'An?lisis Gen?mic (CNAG) > > Structural Genomics Group - Centre de Regulaci? Gen?mica (CRG) > > > > Parc Cientific de Barcelona - Torre I - Baldiri Reixac, 4 - 2a.p - 08028 > - Barcelona > > Tel +34 93 4020542 > > email ddufour at pcb.ub.cat > > > > On Oct 17, 2013, at 10:33 AM, David Dufour Rausell wrote: > > > >> Hello, > >> > >> I would like to know if there is a function like is_aa(residue) but to > test if a residue is RNA? Basically, what I want to know if a chain is RNA > or not. Thanks in advance. > >> > >> > >> David Dufour Rausell > >> > >> Genome Biology Group - Centre Nacional d'An?lisis Gen?mic (CNAG) > >> Structural Genomics Group - Centre de Regulaci? Gen?mica (CRG) > >> > >> Parc Cientific de Barcelona - Torre I - Baldiri Reixac, 4 - 2a.p - > 08028 - Barcelona > >> Tel +34 93 4020542 > >> email ddufour at pcb.ub.cat > >> > >> > >> _______________________________________________ > >> Biopython mailing list - Biopython at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/biopython > > > > > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From golubchi at stats.ox.ac.uk Thu Oct 17 14:07:50 2013 From: golubchi at stats.ox.ac.uk (Tanya Golubchik) Date: Thu, 17 Oct 2013 15:07:50 +0100 Subject: [Biopython] Blast using Biopython In-Reply-To: <525DC2BC.7050300@fold.natur.cuni.cz> References: <525D53DD.5040903@stats.ox.ac.uk> <525DC2BC.7050300@fold.natur.cuni.cz> Message-ID: <525FEF36.7080008@stats.ox.ac.uk> Hi Martin, Using task=blastn seems to solve the problem! 
Thanks so much, I didn't realise that the default (megablast) behaviour is different even when the word size and other parameters are changed. Blastn seems to find the edges much more precisely than megablast. I haven't thoroughly tested it yet to make sure it doesn't break anything, but so far so good!

Thanks
Tanya

On 15/10/13 23:33, Martin Mokrejs wrote:
> Hi Tanya,
> I suppose you use the newer ncbi-tools++ suite. Try the legacy blastn from the ncbi-tools suite. The version numbering is the same ... I have better experience with "blastall -p blastn" from the old suite. You can also try to find some switch to force the really old blastn algorithm buried in blastall (nowadays blastall uses the new algorithm which is in the new ncbi-tools++ suite). However, experience shows that "blastall -p blastn" gives different results compared to blastn although BOTH should in theory be using the new algorithm. With the possibility to force the real predecessor of the algorithm in blastall you have a third method to test.
>
> From blastall you get only limited results in CSV-formatted output; you cannot change the output columns. For me, important results can only be parsed from the XML/plaintext results of blastall.
>
> You can increase the reward for a match ("-r 2") to overcome some gaps at the sides, but it depends on what queries you have and whether that gives you falsely widened alignments elsewhere. You have to test that.
>
> Good luck,
> Martin
>
> Tanya Golubchik wrote:
>> Hi guys,
>>
>> This is strictly speaking more about blast than biopython, but I was wondering if anyone has any tips on doing the following: searching for a hit in a nucleotide database using tblastn, but reporting the actual DNA sequence of the subject, rather than the translated protein sequence. Is there by any chance a way of extracting this from the XML output?
>>
>> What I'm finding is that blastn sometimes misses the edges, where substitutions close to the ends of my hit result in a truncated hit (rather than a complete hit with a mismatch or two). The full hit is reported correctly by tblastn, but of course this returns the protein translation rather than the original nucleotide sequence. It's probably a long shot, but just wondering if anyone has ideas -- the brute force approach would be to get the start and stop positions from tblastn and then extract and re-align this fragment to my query, but that seems redundant given that blast has already done this for me...

From harijay at gmail.com Thu Oct 17 15:02:16 2013
From: harijay at gmail.com (hari jayaram)
Date: Thu, 17 Oct 2013 11:02:16 -0400
Subject: [Biopython] Downloadable html documentation?
In-Reply-To:
References:
Message-ID:

Thanks Peter. I am sure I can do that. I will do that and share the docset on GitHub once it is generated. I find the app super speedy and easy to use; strangely, it is easier than a browser and online search.

Hari

On Thu, Oct 17, 2013 at 7:52 AM, Peter Cock wrote:
> On Thu, Oct 17, 2013 at 12:44 PM, hari jayaram wrote:
> > Hi,
> > I recently started using a very fast offline OSX-only documentation browser called Dash.app. It works great to very quickly search documentation.
> >
> > It has the ability to load in "docsets" and a set of instructions for how to build your own docset starting from html files.
> >
> > Is there an archive of the html api documentation for Biopython?
> >
> > Thanks
> > Hari
>
> All the API documentation is pulled from the docstrings in the Python code - can you just point Dash.app at the source code instead?
>
> Peter

From jdjensen at eng.ucsd.edu Thu Oct 17 22:05:48 2013
From: jdjensen at eng.ucsd.edu (James Jensen)
Date: Thu, 17 Oct 2013 15:05:48 -0700
Subject: [Biopython] (Bio.PDB) problems with NeighborSearch: error at levels above "A", residue index discrepancy with unfold_entities
In-Reply-To:
References: <521FD389.1090207@eng.ucsd.edu>
Message-ID: <52605F3C.7080109@eng.ucsd.edu>

Hi, João,

A late reply is much better than no reply. I'm impressed you tracked this down and followed up, and I appreciate your help. And it took me a while to get around to revisiting this myself.

I was using Python 3.2 when I got the "unorderable types" error. For unrelated reasons, I ended up switching to Python 2.7.3, and now doing search_all at the residue level works.

The issue with the indexing is not that the residues' get_id() function returns a different number from the corresponding list index in the list returned by Selection.unfold_entities(). That is inconvenient, but I've been working around it. What puzzled me is that it appeared that the NeighborSearch was accessing residues that unfold_entities() wasn't accessing, although this wouldn't make sense because I used NeighborSearch on the results of a call to unfold_entities(). Let me check again; it could have been something I was doing wrong.

What do the chain breaks mean? Are they missing data, and if so, what is missing? And what are their consequences for working with the data? How would they be problematic for iterating over residues, calculating distances, returning the amino acid sequence of the structure, etc?

Thanks again,

James

From anaryin at gmail.com Thu Oct 17 23:25:42 2013
From: anaryin at gmail.com (João Rodrigues)
Date: Fri, 18 Oct 2013 01:25:42 +0200
Subject: [Biopython] (Bio.PDB) problems with NeighborSearch: error at levels above "A", residue index discrepancy with unfold_entities
In-Reply-To: <52605F3C.7080109@eng.ucsd.edu>
References: <521FD389.1090207@eng.ucsd.edu> <52605F3C.7080109@eng.ucsd.edu>
Message-ID:

Hey James,

NeighborSearch.
Unless you edit your structure, you should always get back the same atoms from whatever method you use to access that structure (unfold_entities, NeighborSearch, iteration, etc). If you want, paste your code here and I can try to explain what is going on; maybe that makes things a bit clearer. Also, if you can reproduce the weird behavior, paste the code on pastebin.com or write it here in the thread so we can try it on our machines too.

The behavior of get_id() is not inconvenient at all, at least from a biological point of view. You usually want the residue position in the amino acid sequence, not the computer science data structure index. You should always work with the indices from the PDB file, otherwise biologists will get quite mad at you if you start using the other numbers :)

Chain breaks. Chain breaks are literally a break (discontinuity) in the polypeptide chain. Sometimes you cannot get enough density from the x-ray experiment to accurately determine the position of particular atoms. You usually see this in lower resolution structures (3 Å) or very mobile regions (loops). What happens is that you therefore get a gap in the structure: say, from residue 14 to residue 20 there is nothing there. But in the sequence that went in the x-ray beam, these 5 residues (15-19) were there, so you get the numbering taking them into account as well.

As for implications... well, it depends on what you are doing with the structure. Calculating distances, iterating over residues, etc, will not be problematic at all. You will just 'miss' some residues because they are just not there. You might want to pay particular attention if you are renumbering your structure, to make sure it 'respects' these gaps, for example.

2h62 has 4 different chains and they are indeed complete. I get the same warning but for all chains, and the lines I get notified about are the first solvent molecules of that particular chain. The way StructureBuilder works is a bit silly indeed: it iterates over the lines of the PDB file and when it finds a different chain identifier from the one it was reading in the line before, it adds a new chain. If this chain already exists, it raises this warning. It's a bit silly in this case because HETATM records should not be accounted for in this situation since they always come at the end of the file. If you can, submit a bug report or feature request in our tracker and I'll go over it when I have some free time.

Cheers,

João

From p.j.a.cock at googlemail.com Fri Oct 18 11:08:41 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 18 Oct 2013 12:08:41 +0100
Subject: [Biopython] (Bio.PDB) problems with NeighborSearch: error at levels above "A", residue index discrepancy with unfold_entities
In-Reply-To: <52605F3C.7080109@eng.ucsd.edu>
References: <521FD389.1090207@eng.ucsd.edu> <52605F3C.7080109@eng.ucsd.edu>
Message-ID:

On Thu, Oct 17, 2013 at 11:05 PM, James Jensen wrote:
>
> I was using Python 3.2 when I got the "unorderable types" error. For
> unrelated reasons, I ended up switching to Python 2.7.3, and now doing
> search_all at the residue level works.

Can you reduce that to a short test case? It sounds like something we may need to address in the Python 2/3 compatibility.

Thanks,

Peter

From ajingnk at gmail.com Sat Oct 19 15:16:21 2013
From: ajingnk at gmail.com (Jing Lu)
Date: Sat, 19 Oct 2013 11:16:21 -0400
Subject: [Biopython] More efficient neighbor joining algorithm to build phylogenetic tree
Message-ID:

Hello!

I am trying to build a large tree (~10000 nodes) from a distance matrix with the neighbor joining algorithm. I have just modified the existing code from:
https://github.com/lijax/biopython/blob/master/Bio/Phylo/TreeConstruction.py

I thought this might be part of biopython in the future. However, the function nj() (neighbor joining) is slow. The computational complexity of this function is O(N**3), and the function takes about 1 day to build a tree with 1000 nodes.

I am wondering whether there is any efficient algorithm for neighbor joining in biopython or python. Probably, I can write a function based on "fastphylo: Fast tools for phylogenetics" for biopython.
Thanks,
Jing

From p.j.a.cock at googlemail.com Mon Oct 21 17:00:04 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 21 Oct 2013 18:00:04 +0100
Subject: [Biopython] More efficient neighbor joining algorithm to build phylogenetic tree
In-Reply-To:
References:
Message-ID:

Does it have to be in pure Python? Whenever I've needed a large tree with 1000s of sequences I have used a fast C implementation, with bootstrapping.

Peter

From bodington at gmail.com Thu Oct 24 07:33:42 2013
From: bodington at gmail.com (Dylan Bodington)
Date: Thu, 24 Oct 2013 07:33:42 +0000 (UTC)
Subject: [Biopython] Eftech and db='bioproject'... DTD problem?
References: <1371051142.58860.YahooMailNeo@web164005.mail.gq1.yahoo.com>
Message-ID:

Hi,

Is there any more news on this? I'm trying to work with a large list of bioprojects, and I've reached this same issue.

Dylan Bodington
School of Biosciences and Biotechnology
Tokyo Institute of Technology
Tokyo
Japan

From nicolas.joannin at gmail.com Thu Oct 24 08:10:17 2013
From: nicolas.joannin at gmail.com (Nicolas Joannin)
Date: Thu, 24 Oct 2013 10:10:17 +0200
Subject: [Biopython] Eftech and db='bioproject'... DTD problem?
In-Reply-To:
References: <1371051142.58860.YahooMailNeo@web164005.mail.gq1.yahoo.com>
Message-ID:

Hi Dylan,

I have not followed up on this matter. In any case, they have not contacted me to let me know. I would say that either the problem no longer exists, which means it has been fixed, or the problem is still there and they haven't dealt with it yet. In the latter case, I would suggest emailing the help desk to ask about it: the more people ask for it, the quicker they might take care of it...

Best regards,
Nicolas

Nicolas Joannin, Ph.D.
Bioinformatics Center
Kyoto University, Uji campus, Japan

From jgrant at smith.edu Fri Oct 25 17:44:00 2013
From: jgrant at smith.edu (Jessica Grant)
Date: Fri, 25 Oct 2013 13:44:00 -0400
Subject: [Biopython] codon bias
Message-ID:

Hello,

I was wondering if anyone had some code to determine the effective number of codons in a sequence. I'm working with an organism with a non-canonical genetic code, so I don't think I can use any of the standard packages.

Thanks,

Jessica

From chris.mit7 at gmail.com Fri Oct 25 18:22:56 2013
From: chris.mit7 at gmail.com (Chris Mitchell)
Date: Fri, 25 Oct 2013 14:22:56 -0400
Subject: [Biopython] codon bias
In-Reply-To:
References:
Message-ID:

Hi Jessica,

Could you be somewhat more descriptive of your goals? Are you trying to do something like determine the coding likelihood of a sequence? You can just take the known coding sequences to empirically derive the codon usage frequency. That would be a simple script like:

    from collections import Counter

    # sequence: an in-frame coding sequence, as a string
    fdict = Counter()
    # step through whole codons only; a trailing partial codon is skipped
    for i in range(0, len(sequence) - 2, 3):
        fdict[sequence[i:i+3]] += 1

Which would give you a dictionary of the counts, from which you can derive the frequencies.

Chris

From jgrant at smith.edu Tue Oct 29 13:05:51 2013
From: jgrant at smith.edu (Jessica Grant)
Date: Tue, 29 Oct 2013 09:05:51 -0400
Subject: [Biopython] codon bias
In-Reply-To:
References:
Message-ID:

Hello again,

I am resending to clarify - I am wondering if anyone has implemented Wright's Effective Number of Codons (as in Wright, F. 1990. The effective number of codons used in a gene. Gene 87:23-29), or any improved method. I have tried using codonW but got some wonky results. I am working with transcriptome data from a non-model organism and want to look at the relationships between ENc, GC3 and other statistics to tease out any information about the data in my transcriptome.

Thanks,

Jessica

From p.j.a.cock at googlemail.com Wed Oct 30 11:15:02 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 30 Oct 2013 11:15:02 +0000
Subject: [Biopython] codon bias
In-Reply-To:
References:
Message-ID:

I emailed Frank Wright (we both work at the James Hutton Institute, although he is under the BioSS organisation):
http://www.hutton.ac.uk/staff/frank-wright
http://www.bioss.ac.uk/people/frank.html

Frank suggests looking at the EMBOSS implementation 'chips',
http://emboss.sourceforge.net/apps/release/6.5/emboss/apps/chips.html

Peter

From fernando.j at inbox.com Wed Oct 30 14:22:48 2013
From: fernando.j at inbox.com (john fernando)
Date: Wed, 30 Oct 2013 06:22:48 -0800
Subject: [Biopython] generate phylogenetic tree
Message-ID: <34F96A989B6.000011C3fernando.j@inbox.com>

Hi,

first off, I am very new to the bioinformatics/biopython world so this may come as a naive question, so I apologize in advance.

I extracted some sequences from PDB, aligned them using BLOSUM62 and have "scores".
I was wondering if anyone can give tips/advice on how I can set about generating a phylogenetic tree of the results to graphically show the clusters of similar sequences?

I want to do this for my 'own' substitution matrix (next step).

I am not necessarily asking for code, but rather for tools that people have used that can do this using the "scores" I have calculated.

Thank you,
John

From jgrant at smith.edu Wed Oct 30 15:30:19 2013
From: jgrant at smith.edu (Jessica Grant)
Date: Wed, 30 Oct 2013 11:30:19 -0400
Subject: [Biopython] codon bias
In-Reply-To:
References:
Message-ID:

I have chips up and running! Thanks so much!

From ribozyme at ioz.ac.cn Thu Oct 31 03:03:21 2013
From: ribozyme at ioz.ac.cn (WU)
Date: Thu, 31 Oct 2013 11:03:21 +0800 (GMT+08:00)
Subject: [Biopython] generate phylogenetic tree
In-Reply-To:
References:
Message-ID:

To Mr. fernando,

In biopython there is a module, Bio.Phylo, which can draw trees. Bio.Phylo doesn't infer trees from alignments itself, but there are third-party programs available that do, such as PhyML. These are supported through the module Bio.Phylo.Applications. Besides, there is also other software that constructs trees from alignment results, including MrBayes and PHYLIP. See section 13.5 of http://biopython.org/DIST/docs/tutorial/Tutorial.html for further information.

Best wishes
Wu Qi

From eric.talevich at gmail.com Thu Oct 31 21:38:34 2013
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 31 Oct 2013 14:38:34 -0700
Subject: [Biopython] generate phylogenetic tree
In-Reply-To: <34F96A989B6.000011C3fernando.j@inbox.com>
References: <34F96A989B6.000011C3fernando.j@inbox.com>
Message-ID:

Hi John,

To quickly get a tree to look at, given a multiple sequence alignment, I recommend FastTree.
http://www.microbesonline.org/fasttree/

If you'd prefer a graphical program to start with, ClustalX and JalView are both capable of building trees with a neighbor-joining algorithm, among other things.
http://www.clustal.org/clustal2/
http://www.jalview.org/

To view a large tree and apply your own highlighting and colorization, try Archaeopteryx.
https://sites.google.com/site/cmzmasek/home/software/archaeopteryx

Back on the command line, some of the EMBOSS tools allow you to supply your own scoring matrix, and so does Phylip, I think.
http://emboss.sourceforge.net/
http://evolution.genetics.washington.edu/phylip.html

If none of those work for you and you'd like to try building a tree from your own distance matrix using Biopython, this is possible with Yanbo Ye's recent work on another development branch:
http://biopython.org/wiki/Phylo#Upcoming_GSoC_2013_features
https://github.com/lijax/biopython/

Hope that helps,
Eric
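
For readers who want to see what "building a tree from your own distance matrix" involves, here is a minimal pure-Python sketch of textbook neighbor joining. It is only an illustration and is not the Biopython code from the branch linked above: the function name neighbor_joining and the frozenset-keyed distance table are assumptions made for this example, the root is placed arbitrarily at the midpoint of the last edge, and turning alignment "scores" into distances is a separate step that is not shown. It is also the plain cubic-time version, so it is only suitable for small inputs.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return a Newick string built by textbook neighbor joining.

    names: list of at least two leaf labels (hypothetical example input).
    dist:  dict mapping frozenset({x, y}) -> distance for every pair of names.
    """
    d = dict(dist)        # working copy of the distance table
    nodes = list(names)   # active nodes, stored as Newick fragments

    while len(nodes) > 2:
        n = len(nodes)
        # Sum of distances from each active node to all the others.
        r = {x: sum(d[frozenset((x, y))] for y in nodes if y != x)
             for x in nodes}
        # Join the pair that minimises Q = (n - 2) * d(i, j) - r(i) - r(j).
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        dij = d[frozenset((i, j))]
        # Branch lengths from i and j to the new internal node.
        li = 0.5 * dij + (r[i] - r[j]) / (2 * (n - 2))
        lj = dij - li
        new = "({}:{:g},{}:{:g})".format(i, li, j, lj)
        # Distances from the new internal node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((new, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij
                )
        nodes = [k for k in nodes if k not in (i, j)] + [new]

    # Two nodes left: connect them, splitting the last edge in half.
    a, b = nodes
    half = d[frozenset((a, b))] / 2
    return "({}:{:g},{}:{:g});".format(a, half, b, half)


# Small made-up example: five taxa with additive distances.
pairs = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
table = {frozenset(k): v for k, v in pairs.items()}
print(neighbor_joining(["a", "b", "c", "d", "e"], table))
```

The resulting Newick string can be saved to a file and opened in a tree viewer such as Archaeopteryx. For thousands of sequences, a dedicated compiled implementation is likely to be a better fit than a pure-Python loop like this one.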