From anaryin at gmail.com Fri Jun 1 03:03:33 2012
From: anaryin at gmail.com (João Rodrigues)
Date: Fri, 1 Jun 2012 09:03:33 +0200
Subject: [Biopython] geometry.py
In-Reply-To: <4FC7BDD0.2020905@usp.br>
References: <4FC7BDD0.2020905@usp.br>
Message-ID:

Hi Frederico,

From what I understand, those dimensions are those of the smallest ellipsoid that fits your structure. I would therefore not expect a perfect match. Ezgi can answer better for sure.

How do you calculate the dimensions yourself?

Best,

João

On 31 May 2012 at 20:55, "Frederico Moraes Ferreira" <ferreirafm at usp.br> wrote:

> Hi João,
> The radius of gyration (Rg) is running just fine. The values are in excellent
> agreement with those from some models I have tested.
> However, the maximum dimensions do not match at all. Did you orient the
> model before the tensor analysis?
> Fred
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From ferreirafm at usp.br Fri Jun 1 15:26:38 2012
From: ferreirafm at usp.br (Frederico Moraes Ferreira)
Date: Fri, 01 Jun 2012 16:26:38 -0300
Subject: [Biopython] geometry.py
In-Reply-To:
References: <4FC7BDD0.2020905@usp.br>
Message-ID: <4FC9176E.6080109@usp.br>

Hi João,

I was misled by the print commands at the end of your calculate_shape_param function. Looking more closely at your code, the outputs are the ellipsoid's semi-axes. Even so, if I double them, some large discrepancies appear when compared with the maximum dimension of the shape. I have tested those functions on a considerable number of PDB files and wouldn't expect a perfect match either. However, there are some discrepancies which should be investigated.
Here are some of them:

################################
File                     a       b       c       Rg      Anisotropy  Dmax
S_1AVSA_29_0004_1.pdb    36.30   76.0    170.03  42.43   0.51        133.07
S_1AVSA_29_0004_10.pdb   37.10   52.63   62.49   20.06   0.07        70.58
S_1AVSA_29_0004_11.pdb   38.94   52.74   81.96   23.47   0.18        104.99
S_1AVSA_29_0004_12.pdb   36.59   54.33   127.19  31.99   0.47        104.18
S_1AVSA_29_0004_13.pdb   33.14   52.26   76.41   21.99   0.19        81.21
S_1AVSA_29_0004_14.pdb   39.52   54.42   121.81  31.11   0.43        104.17
S_1AVSA_29_0004_16.pdb   34.10   53.17   176.35  41.89   0.69        131.53
S_1AVSA_29_0004_17.pdb   31.73   74.31   101.41  28.99   0.23        90.38
#################################

All the Best,

Fred

P.S.: here is the code to calculate Dmax (https://gist.github.com/2854563)

On 01-06-2012 04:03, João Rodrigues wrote:
>
> Hi Frederico,
>
> From what I understand, those dimensions are those of the smallest
> ellipsoid that fits your structure. I would therefore not expect a
> perfect match. Ezgi can answer better for sure.
>
> How do you calculate the dimensions yourself?
>
> Best,
>
> João
>
> On 31 May 2012 at 20:55, "Frederico Moraes Ferreira" wrote:
>
> Hi João,
> The radius of gyration (Rg) is running just fine. The values are in
> excellent agreement with those from some models I have tested.
> However, the maximum dimensions do not match at all. Did you orient
> the model before the tensor analysis?
> Fred
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From ezzgikaraca at gmail.com Wed Jun 6 06:51:12 2012
From: ezzgikaraca at gmail.com (ezgi karaca)
Date: Wed, 6 Jun 2012 12:51:12 +0200
Subject: [Biopython] geometry.py
Message-ID:

Hello Frederico,

In the code, we had a small scaling error and a typo; both are fixed now. You can get the current version from the same link:
http://nmr.chem.uu.nl/~joao/f/geometry.py

Just to make things clear, this is how we calculate the dimensions:

1. We calculate the gyration tensor of the protein (the geometrical one; we don't consider the masses).

2. We diagonalize the gyration tensor and get its eigenvalues.

3. We take the square roots of the eigenvalues, which correspond to the lengths of the semi-axes of the ellipsoid.

We got this procedure from the following reference: *Vondrasek J (2011) Gyration- and Inertia-Tensor-Based Collective Coordinates for Metadynamics. Application on the Conformational Behavior of Polyalanine Peptides and Trp-Cage Folding - The Journal of Physical Chemistry A*

As João has indicated, Dmax and the maximum axis length should not be exactly the same, since we average out the coordinates while calculating the gyration tensor. But, of course, they should at least be close to each other. So, please test the current version to see how well it matches your calculations; hopefully this version will give us more reasonable results!

Cheers,

Ezgi

From ferreirafm at usp.br Wed Jun 6 10:22:35 2012
From: ferreirafm at usp.br (Frederico Moraes Ferreira)
Date: Wed, 06 Jun 2012 11:22:35 -0300
Subject: [Biopython] geometry.py
In-Reply-To:
References:
Message-ID: <4FCF67AB.2060505@usp.br>

Dear Ezgi,

Thanks for your explanation and references. Here are the calculations for the same PDB set as in my previous message. I haven't read the references yet.
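[Note on the three-step procedure described in this thread.] The square roots of the gyration-tensor eigenvalues are root-mean-square extents along the principal axes, so they need not enclose every atom: for a uniform rod of length L the largest one is L/sqrt(12), roughly 0.29 L, while Dmax is L. The sketch below is only a toy illustration in plain Python, not the code in geometry.py. The function names are made up here, and the points lie along x so that the tensor is already diagonal and no eigen-solver is needed.

```python
import math
from itertools import combinations


def gyration_tensor(coords):
    """Return the 3x3 unweighted gyration tensor of a list of (x, y, z) points."""
    n = len(coords)
    centroid = [sum(p[i] for p in coords) / n for i in range(3)]
    centered = [[p[i] - centroid[i] for i in range(3)] for p in coords]
    return [[sum(q[i] * q[j] for q in centered) / n for j in range(3)]
            for i in range(3)]


def radius_of_gyration(coords):
    """Rg is the square root of the tensor's trace (the sum of its eigenvalues)."""
    s = gyration_tensor(coords)
    return math.sqrt(s[0][0] + s[1][1] + s[2][2])


def dmax(coords):
    """Largest pairwise distance, by brute force (O(n^2))."""
    return max(math.dist(p, q) for p, q in combinations(coords, 2))


# Toy "rod": 101 points evenly spaced along x from 0 to 100.
rod = [(float(x), 0.0, 0.0) for x in range(101)]
tensor = gyration_tensor(rod)

# The rod lies on a coordinate axis, so the tensor is diagonal and its only
# non-zero eigenvalue is tensor[0][0].
largest_semi_axis = math.sqrt(tensor[0][0])

print("sqrt(largest eigenvalue):", round(largest_semi_axis, 2))
print("Rg:", round(radius_of_gyration(rod), 2))
print("Dmax:", dmax(rod))
```

For an elongated model one would therefore expect Dmax to be several times the largest square-root eigenvalue rather than twice it. For a general structure the tensor is not diagonal and an eigenvalue routine is needed for step 2.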
However, according to João's definition, if the semi-axes are those of the smallest ellipsoid that fits the PDB structure, then at least one of the axes must necessarily match Dmax. Otherwise some atoms lie outside the ellipsoid, which contradicts that definition. If you are interested, we can stay in touch and discuss this matter a bit more.

All the best,

Fred

P.S.: semi-axes not doubled!

##############################################
File                     a       b       c       Rg      Anisotropy  Dmax
S_1AVSA_29_0004_1.pdb    8.12    16.99   38.02   42.43   0.51        133.07
S_1AVSA_29_0004_10.pdb   8.30    11.77   13.97   20.06   0.07        70.58
S_1AVSA_29_0004_11.pdb   8.71    11.79   18.33   23.47   0.18        105.00
S_1AVSA_29_0004_12.pdb   8.18    12.15   28.44   31.99   0.47        104.18
S_1AVSA_29_0004_13.pdb   7.41    11.69   17.08   21.99   0.19        81.21
S_1AVSA_29_0004_14.pdb   8.84    12.17   27.24   31.11   0.43        104.17
S_1AVSA_29_0004_15.pdb   10.23   12.39   18.6    24.58   0.13        84.32
S_1AVSA_29_0004_16.pdb   7.62    11.89   39.43   41.89   0.69        131.53
S_1AVSA_29_0004_17.pdb   7.10    16.62   22.68   28.99   0.23        90.38
S_1AVSA_29_0004_18.pdb   8.73    11.69   20.16   24.89   0.24        89.03
S_1AVSA_29_0004_19.pdb   6.97    11.02   16.6    21.11   0.2         75.23
#############################################

On 06-06-2012 07:51, ezgi karaca wrote:
> Hello Frederico,
>
> In the code, we had a small scaling error and a typo; both are fixed
> now. You can get the current version from the same link:
> http://nmr.chem.uu.nl/~joao/f/geometry.py
>
> Just to make things clear, this is how we calculate the dimensions:
>
> 1.
We calculate the gyration tensor of the protein (the geometrical > one, we don't consider the masses) > > 2. We diagonalize the gyration tensor and get the eigenvalues of it > > 3. We take the square roots of the eigenvalues, and those correspond > to the length of the semi-axis of ellipsoid > > We got this procedure from the following reference: *Vondrasek J > (2011) Gyration- and Inertia-Tensor-Based Collective Coordinates for > Metadynamics. Application on the Conformational Behavior of > Polyalanine Peptides and Trp-Cage Folding - The Journal of Physical > Chemistry A* > > As Jo?o has indicated, Dmax and the maximum axis length should not be > exactly the same, since we average out the coordinates while > calculating the gyration tensor. But, of course they should at least > be close to each other. So, please test the current version to see how > well it matches with your calculations and hopefully this version will > give us more reasonable results! > > Cheers, > > Ezgi > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > From p.j.a.cock at googlemail.com Sun Jun 10 06:24:20 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 10 Jun 2012 11:24:20 +0100 Subject: [Biopython] EU-codefest In-Reply-To: <20120609195912.GA27963@thebird.nl> References: <20120609195912.GA27963@thebird.nl> Message-ID: Dear Biopythoneers, Some of you might like to attend an Open-Bio Hackathon in Italy this summer - 19 and 20 July 2012, in Lodi. This is about a week after BOSC and the pre-BOSC CodeFest in California http://www.open-bio.org/wiki/BOSC_2012 Peter ---------- Forwarded message ---------- From: *Pjotr Prins* Date: Saturday, June 9, 2012 Subject: EU-codefest To: cjfields at illinois.edu Cc: p.j.a.cock at googlemail.com Hi Chris and Peter, Would you mind sending a reminder of the EU-codefest to your lists? 
Registration form is up:

http://www.open-bio.org/wiki/EU_Codefest_2012

Three main topics will be worked on during the CodeFest:

NGS and high performance parsers for OpenBio projects.
RDF and semantic web for bioinformatics.
Bioinformatics pipelines definition, execution and distribution.

Other tracks are welcome!

Pj.

From clements at galaxyproject.org Sun Jun 10 13:33:25 2012
From: clements at galaxyproject.org (Dave Clements)
Date: Sun, 10 Jun 2012 10:33:25 -0700
Subject: [Biopython] GCC2012 Early Registration ENDS THIS MONDAY JUNE 11
In-Reply-To:
References:
Message-ID:

Hello all,

Just a *final* reminder that early registration for the 2012 Galaxy Community Conference (GCC2012) *closes on Monday June 11* (which is probably *today* when you read this). Registering early saves 36 to 42% on registration costs, and allows you to sign up for the GCC2012 Training Day and book discounted conference lodging *before they fill up*. *Register today.*

GCC2012 will be held July 25-27, in Chicago, Illinois, United States. This year GCC2012 features a full day of tutorial sessions with 3 parallel tracks, each featuring four 90-minute workshops and covering 10 different topics, including the newly added Variant and SNP Analysis, RNA-Seq Analysis, and Galaxy Code Architecture sessions. The two-day main meeting includes over 25 talks by Galaxy community members and Galaxy developers addressing the challenges of integrating, analyzing, and sharing the diverse and very large datasets that are now typical in biomedical research.

GCC2012 is an opportunity to share best practices with, and learn from, a large community of researchers and support staff who are facing the challenges of data-intensive biology. Galaxy is an open web-based platform for data intensive biomedical research that is widely used and deployed at research organizations of all sizes around the world.

See you in Chicago!
Dave Clements, on behalf of the GCC2012 Organizing Committee

Links:
http://galaxyproject.org/GCC2012
http://galaxyproject.org/
http://getgalaxy.org/
http://usegalaxy.org/
http://galaxyproject.org/wiki/

From barry_finzel at yahoo.com Fri Jun 15 17:55:29 2012
From: barry_finzel at yahoo.com (Barry Finzel)
Date: Fri, 15 Jun 2012 14:55:29 -0700 (PDT)
Subject: [Biopython] Bio.PDB DisorderedResidue Usage
Message-ID: <1339797329.23297.YahooMailClassic@web130201.mail.mud.yahoo.com>

In reading a PDB file with a DisorderedResidue, I find I cannot select the "first" of the individual Residue objects wrapped in the DisorderedEntity wrapper as they were read.

Using:

from Bio.PDB import *
parser = PDBParser()
struct = parser.get_structure('input', '1fjl.pdb')

PDB structure 1fjl contains a disordered residue defined in the following records:

SEQRES   1 D   14   DA  DA  DT  DA  DA  DT  DC  DT  DG  DA  DT  DT  DA
SEQRES   2 D   14   DC
ATOM   1691  P  A DT D   8      21.334 116.347  17.134  0.50 31.45           P
ATOM   1692  OP1A DT D   8      20.849 116.344  18.534  0.50 36.10           O
ATOM   1693  OP2A DT D   8      20.849 117.360  16.179  0.50 32.93           O
ATOM   1694  O5'A DT D   8      22.893 116.496  17.162  0.50 30.63           O
..
ATOM   1711  P  B DA D   8      21.278 115.687  17.543  0.50 30.90           P
ATOM   1712  OP1B DA D   8      20.886 115.137  18.859  0.50 32.04           O
ATOM   1713  OP2B DA D   8      20.643 116.935  17.025  0.50 27.28           O
ATOM   1714  O5'B DA D   8      22.825 115.928  17.500  0.50 26.48           O

The PDB always encodes the sequence of the FIRST variant in the SEQRES card (in this case, DT), but there seems to be no way to unwrap the two Residue.Residue objects wrapped in a Residue.DisorderedResidue to identify which of the two residues this would be. Unlike the DisorderedAtoms, which are selected by ALTLOC code, the DisorderedResidues are selected by residue name ('DT' or 'DA' in this case).
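[One possible workaround, sketched under assumptions.] The coordinate records can be scanned directly to note which residue name appears first at each (chain, residue number, insertion code) position, and that name can then be handed to the DisorderedResidue. The helper name first_resname_per_position is hypothetical, and the premise that the first variant in the coordinate records is the one listed in SEQRES is taken from the description above, not verified here. The sample lines are three of the records quoted above.

```python
def first_resname_per_position(pdb_lines):
    """Map (chain, resseq, icode) -> residue name of the first ATOM/HETATM record seen.

    Uses the fixed PDB columns: resName 18-20, chainID 22, resSeq 23-26, iCode 27.
    """
    first = {}
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            key = (line[21], int(line[22:26]), line[26:27])
            # setdefault keeps the name from the earliest record at this position
            first.setdefault(key, line[17:20].strip())
    return first


sample = [
    "ATOM   1691  P  A DT D   8      21.334 116.347  17.134  0.50 31.45           P",
    "ATOM   1692  OP1A DT D   8      20.849 116.344  18.534  0.50 36.10           O",
    "ATOM   1711  P  B DA D   8      21.278 115.687  17.543  0.50 30.90           P",
]
first = first_resname_per_position(sample)
print(first)

# With Bio.PDB, the name could then be used to pick the wrapped residue
# (untested against 1fjl; shown only as a comment):
#   residue = struct[0]["D"][8]                       # the DisorderedResidue
#   residue.disordered_select(first[("D", 8, " ")])   # select by residue name
```

This keeps the choice independent of the order in which the wrapper happens to store its children, at the cost of reading the file a second time.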
I have an application where I need to be certain that the Residue instance I select matches the residue type of the SEQRES card entry (i.e., the FIRST one in the file). Is there any way to do this? I would have thought that keying the Residue instances in the DisorderedResidue on ALTLOC (as in DisorderedAtom) would have been a better way to handle this.

Barry Finzel
University of Minnesota

From Anita.Norman at slu.se Thu Jun 21 12:08:24 2012
From: Anita.Norman at slu.se (Anita Norman)
Date: Thu, 21 Jun 2012 18:08:24 +0200
Subject: [Biopython] extremely long execution time
Message-ID:

Hello Biopythoners!

I am working with fastq files, and though I have processed them with many different scripts, I am now for the first time running into the problem that one script will take 8+ days to process one file. I figure I must be doing something wrong.

Here is what I am trying to do:

I have a file with a list of record ids (~3 million ids): recidfile
I have two paired files from which the record ids originally came (~10 million records each): infile
I wish to create two files (withfile and withoutfile) from each of the paired files, run individually. One of the new files will have the records whose ids are in the list and the other the records whose ids are not in the list.

I have tried doing this with and without the FastqGeneralIterator, but both methods will require at least 8 days.

Here is my code with the FastqGeneralIterator:

from time import time
from Bio.SeqIO.QualityIO import FastqGeneralIterator

start = time()

recids = open(recidfile, 'r')
for item in recids: recidlist.append(item[0:-2])

handle1 = open(withfile, 'w')
handle2 = open(withoutfile, 'w')

for header, seq, qual in FastqGeneralIterator(open(infile)):
    if header[:-1] in recidlist:
        handle1.write('%s\n%s\n+\n%s\n' %(header, seq, qual))
    else:
        handle2.write('%s\n%s\n+\n%s\n' %(header, seq, qual))

Can anyone advise me on how I can possibly make this go faster? I would prefer 8 minutes over 8 days.
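[A sketch of one likely fix, under assumptions.] Testing membership in a Python list of about 3 million ids scans the list for every one of the 10 million records; a set makes each test constant time on average. The code below uses only the standard library so it runs without Biopython: read_fastq is a simplified stand-in for FastqGeneralIterator (strict four-line records only), and split_fastq and the sample ids are made up for illustration. Note that the title is written back with a leading '@', which FastqGeneralIterator strips.

```python
import io


def read_fastq(handle):
    """Yield (title, seq, qual) from a strict four-lines-per-record FASTQ handle."""
    while True:
        title = handle.readline().rstrip("\n")
        if not title:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield title[1:], seq, qual  # drop the leading '@'


def split_fastq(in_handle, wanted_ids, with_handle, without_handle):
    """Write records whose title is in wanted_ids to one handle, the rest to another."""
    wanted = set(wanted_ids)  # set lookup instead of scanning a list
    for title, seq, qual in read_fastq(in_handle):
        out = with_handle if title in wanted else without_handle
        out.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))


# Tiny in-memory demonstration with two made-up reads.
source = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\n!!!!\n")
kept, dropped = io.StringIO(), io.StringIO()
split_fastq(source, ["r2"], kept, dropped)
print(kept.getvalue(), end="")
```

With real files the three handles would come from open(), and any trimming of the ids (such as the slicing used above) would be applied before building the set.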
Thanks in advance

Anita

From p.j.a.cock at googlemail.com Thu Jun 21 12:21:07 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 21 Jun 2012 17:21:07 +0100
Subject: [Biopython] extremely long execution time
In-Reply-To:
References:
Message-ID:

On Thu, Jun 21, 2012 at 5:08 PM, Anita Norman wrote:
> Hello Biopythoners!
>
> ...
>
> Here is my code with the fastqGeneralIterator:
>
> from time import time
> from Bio.SeqIO.QualityIO import FastqGeneralIterator
>
> start = time()
>
> recids = open(recidfile, 'r')
> for item in recids: recidlist.append(item[0:-2])

You must be leaving out a line to define recidlist - but I'll assume it was just:

recidlist = []

Notice that the "Filtering a sequence file" example in the tutorial uses a set, http://biopython.org/DIST/docs/tutorial/Tutorial.html and says "Note that we use a Python set rather than a list, this makes testing membership faster."

So try this:

recidset = set([])
for item in recids: recidset.add(item[0:-2])

(and later use recidset instead of recidlist)

Peter

From p.j.a.cock at googlemail.com Fri Jun 22 17:29:43 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 22 Jun 2012 22:29:43 +0100
Subject: [Biopython] extremely long execution time
In-Reply-To:
References:
Message-ID:

Hi Anita,

Thank you for letting us know how you got on - I'm impressed just how much of a difference it made :)

Peter

On Fri, Jun 22, 2012 at 7:56 PM, Anita Norman wrote:
> Hi Peter,
>
> Thanks so much for your quick and helpful response. What a difference it
> makes. Now one entire file runs in less than one minute.
>
> All the best and happy midsummer!
>
> Anita
>
>
>
> On 21/06/2012 18:21, "Peter Cock" wrote:
>
>>On Thu, Jun 21, 2012 at 5:08 PM, Anita Norman wrote:
>>> Hello Biopythoners!
>>>
>>> ...
>>> >>> Here is my code with the fastqGeneralIterator: >>> >>> from time import time >>> from Bio.SeqIO.QualityIO import FastqGeneralIterator >>> >>> start = time() >>> >>> recids = open(recidfile, 'r') >>> for item in recids: recidlist.append(item[0:-2]) >> >>You must be leaving out a line to define recidlist - but I'll assume >>it was just: >> >>recidlist = [] >> >>Notice the "Filtering a sequence file" in the tutorial uses a set, >>http://biopython.org/DIST/docs/tutorial/Tutorial.html and says >>"Note that we use a Python set rather than a list, this makes >>testing membership faster." >> >>So try this: >> >>recidset = set([]) >>for item in recids: recidset.add(item[0:-2]) >> >>(and later use recidset instead of recidlist) >> >>Peter > From p.j.a.cock at googlemail.com Mon Jun 25 14:22:12 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 25 Jun 2012 19:22:12 +0100 Subject: [Biopython] Biopython 1.60 Message-ID: Dear Biopythoneers, Biopython 1.60 is out: http://news.open-bio.org/news/2012/06/biopython-1-60-released/ Thank you to everyone who has contributed. Peter P.S. We're on Twitter as @Biopython - see also @obf_news https://twitter.com/#!/biopython From David.Lapointe at umassmed.edu Mon Jun 25 16:57:47 2012 From: David.Lapointe at umassmed.edu (Lapointe, David) Date: Mon, 25 Jun 2012 20:57:47 +0000 Subject: [Biopython] Biopython 1.60 Message-ID: <86BFEB1DFA6CB3448DB8AB1FC52F405909D651@ummscsmbx06.ad.umassmed.edu> Great! I have EMBOSS installed but the tests always fail the EMBOSS tests. They are on the $PATH. David -- David Lapointe, Ph.D. 
Director Scientific Computing/Information Services University of Massachusetts Medical School 55 Lake Avenue N Worcester MA 01655 508-856-5141 (v) ' the lyf so short, the craft so long to lerne' From zhigangwu.bgi at gmail.com Mon Jun 25 19:10:39 2012 From: zhigangwu.bgi at gmail.com (Zhigang Wu) Date: Tue, 26 Jun 2012 07:10:39 +0800 Subject: [Biopython] Biopython 1.60 In-Reply-To: References: Message-ID: <69156DC5-445E-45B9-B352-7E816F8A90C3@gmail.com> Great. Thanks! Sent from my iPhone On Jun 26, 2012, at 2:22 AM, Peter Cock wrote: > Dear Biopythoneers, > > Biopython 1.60 is out: > http://news.open-bio.org/news/2012/06/biopython-1-60-released/ > > Thank you to everyone who has contributed. > > Peter > > P.S. We're on Twitter as @Biopython - see also @obf_news > https://twitter.com/#!/biopython > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Tue Jun 26 04:27:53 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 26 Jun 2012 09:27:53 +0100 Subject: [Biopython] Biopython 1.60 In-Reply-To: <86BFEB1DFA6CB3448DB8AB1FC52F405909D651@ummscsmbx06.ad.umassmed.edu> References: <86BFEB1DFA6CB3448DB8AB1FC52F405909D651@ummscsmbx06.ad.umassmed.edu> Message-ID: On Monday, June 25, 2012, Lapointe, David wrote: > Great! > > I have EMBOSS installed but the tests always fail the EMBOSS tests. They > are on the $PATH. > > David > > Hi David, Could you tell us which version of EMBOSS you have, your OS, and copy and paste the error from the tests please? My guess is a minor change between versions of EMBOSS. 
Thanks,

Peter

From p.j.a.cock at googlemail.com Tue Jun 26 08:40:50 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 26 Jun 2012 13:40:50 +0100
Subject: [Biopython] test_Emboss.py, was: Biopython 1.6.0
Message-ID:

On Tue, Jun 26, 2012 at 1:11 PM, David Lapointe wrote:
> Ubuntu 10.04
> Python 2.6.5
> Emboss 6.3.1
>
> These are the tests that fail, always. It may be because I have a
> wrapper around the application (setting up the EMBOSS_VARS)
>
> ======================================================================
> FAIL: needle with the asis trick, output to a file.
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "test_Emboss.py", line 538, in test_needle_file
>     self.assertTrue(os.path.isfile(filename))
> AssertionError
>
> ======================================================================
> FAIL: water with the asis trick, output to a file.
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "test_Emboss.py", line 479, in test_water_file
>     self.run_water(cline)
>   File "test_Emboss.py", line 462, in run_water
>     self.assertTrue(os.path.isfile(cline.outfile))
> AssertionError
>
>
> On Tuesday, June 26, 2012, Peter wrote:
>
>     On Monday, June 25, 2012, Lapointe, David wrote:
>
>         Great!
>
>         I have EMBOSS installed but the tests always fail the EMBOSS
>         tests. They are on the $PATH.
>
>         David
>

Thanks David,

I've made a small change to give a more informative error in future:
https://github.com/biopython/biopython/commit/1d1f2a45658f808e22b8d0dbdcf2e6f825581fd7

However, I think the problem is probably that your wrapper script fails to escape filenames with spaces. Both those failing tests use output files with a space in their name.
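[An illustration of the suspected failure mode, not of the actual wrapper, which was not posted.] If a wrapper joins its arguments into one string and lets a shell split them again, a filename containing a space arrives as two words and the expected output file is never created. The sketch uses shlex from the standard library, which follows POSIX shell splitting rules; the filename is made up.

```python
import shlex

args = ["-outfile", "my output.needle"]  # hypothetical filename containing a space

# Naive wrapper: join the arguments with spaces and let the shell re-split them.
naive = "needle " + " ".join(args)

# Safer: quote each argument before it goes back through a shell.
quoted = "needle " + " ".join(shlex.quote(a) for a in args)

naive_words = shlex.split(naive)
quoted_words = shlex.split(quoted)

print(naive_words)   # the filename is split into two separate words
print(quoted_words)  # the filename survives as a single word
```

In a bash wrapper the equivalent precaution is to forward the arguments as "$@" with the double quotes, rather than as an unquoted $@ or $*.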
This might be useful to you: http://www.biostars.org/post/show/18642/bash-dollar-at-variable-loses-quote-characters/ Regards, Peter From linlifeng at gmail.com Wed Jun 27 17:20:46 2012 From: linlifeng at gmail.com (Lifeng Lin) Date: Wed, 27 Jun 2012 16:20:46 -0500 Subject: [Biopython] download sequences by date from Genbank Message-ID: Hi folks, Is there an elegant way of downloading sequences from Genbank and using date as a cutoff? I am trying to maintain an up-to-date local version of all sequences for a certain number of species. When "synching" with Genbank, all i can think of is retrieving all GI numbers for these species once again, compare them with what i have locally, and generate a list of new sequences and append them. I have a hunch that there might be a better way of doing this, for example, if there is a date filter that we can apply for Genbank download, then all the trouble for comparisons would be saved. Any suggestions? best, L. From mjldehoon at yahoo.com Wed Jun 27 21:01:59 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 27 Jun 2012 18:01:59 -0700 (PDT) Subject: [Biopython] download sequences by date from Genbank In-Reply-To: Message-ID: <1340845319.35121.YahooMailClassic@web161206.mail.bf1.yahoo.com> Hi Lifeng, Have a look at esearch in the NCBI E-Utilities: http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch You can access the E-Utilities and parse the results with Bio.Entrez as described in the Biopython manual. Best, -Michiel --- On Wed, 6/27/12, Lifeng Lin wrote: > From: Lifeng Lin > Subject: [Biopython] download sequences by date from Genbank > To: biopython at lists.open-bio.org > Date: Wednesday, June 27, 2012, 5:20 PM > Hi folks, > > Is there an elegant way of downloading sequences from > Genbank and using > date as a cutoff? > > I am trying to maintain an up-to-date local version of all > sequences for a > certain number of species. 
When "synching" with Genbank, all > i can think of > is retrieving all GI numbers for these species once again, > compare them > with what i have locally, and generate a list of new > sequences and append > them. I have a hunch that there might be a better way of > doing this, for > example, if there is a date filter that we can apply for > Genbank download, > then all the trouble for comparisons would be saved. > > Any suggestions? > > best, > L. > _______________________________________________ > Biopython mailing list? -? Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From arklenna at gmail.com Sat Jun 30 01:50:15 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Sat, 30 Jun 2012 01:50:15 -0400 Subject: [Biopython] Variant interface Message-ID: Hi all, I'm working on Biopython for Google summer of code; my project is to create an interface between Biopython and various existing tools for handling sequence variants (including VCF format). I am seeking feedback from variant users. What could my interface offer that would make it easier to use variants with Biopython? For example, I am planning on a function that will essentially skim through a large file to give a general overview of its contents. More specifically, in what ways should variant data be able to interact with existing parts of Biopython (such as SeqFeature, SeqRecord)? Looking forward to any thoughts you share. Cheers, Lenna github.com/lennax From chris.mit7 at gmail.com Sat Jun 30 10:47:10 2012 From: chris.mit7 at gmail.com (Chris Mitchell) Date: Sat, 30 Jun 2012 10:47:10 -0400 Subject: [Biopython] Variant interface In-Reply-To: References: Message-ID: Hi Lenna, Here are some features of the VCF/GFF parser I wrote that I use: The choice to be used as an iterator for parsing huge files or store the entire vcf in memory. 
isVariant(chromosome, position, *arg) -- returns if it is a variant, optional arg if it's a user-supplied variant writeVariant(file) -- writes the vcf object to a given file handle getVariants() -- generator for all the vcf objects, similar method for GFF exists getAttribute(attribute, *arg) -- get objects with a given attribute, optional arg is to get objects with a given attribute equal to arg getChildren() -- gets child objects of GFF if exists getParent() -- gets parent object of GFF if exists getXXX() -- gets the standard info for any VCF/GFF object like SeqId, Start, End, Alt, Ref, VCF type, etc. addAttribute(key, value) -- adds a feature to a given GFF/VCF object removeAttribute(key) -- removes feature Optional keywords: filter = [string,string...] -- only keep variants with the keys in filter filterOnly = Bool -- only keep features specified in filter in our object (so if we have 20 key-value attributes, just keep the keys in filter) keyDelim = string -- for compatibility with non-standard vcf/GFF formats that don't use '=' for the key-value separator fast = [string, string...] -- if we parse the file in memory, keep the keys in this list in a dictionary for immediate access to the vcf/GFF objects exclude = [string, string...] 
-- exclude entries with these keys cols = (int,int) -- what cols to use -- useful for parsing in GFF files that have been merged with bedtools random = Bool -- if we're treating the file as an iterable, stores the object's file position coordinates in a dictionary for random access to objects Some usage cases this helps me with: Parsing through a file and adding/removing annotations (for instance if I want to add the coding transcript affected by a VCF to the file itself) Trim down files based on several criteria to a smaller more informative file Being able to read a file only to the point I care about (random access methods that can index attributes as well as the normal identifiers/iterator) Immediate access to an attribute I care about Hope that helps Chris On Sat, Jun 30, 2012 at 1:50 AM, Lenna Peterson wrote: > Hi all, > > > I'm working on Biopython for Google summer of code; my project is to create > an interface between Biopython and various existing tools for handling > sequence variants (including VCF format). > > > I am seeking feedback from variant users. What could my interface offer > that would make it easier to use variants with Biopython? For example, I am > planning on a function that will essentially skim through a large file to > give a general overview of its contents. More specifically, in what ways > should variant data be able to interact with existing parts of Biopython > (such as SeqFeature, SeqRecord)? > > > Looking forward to any thoughts you share. 
> > > Cheers, > > > Lenna > > github.com/lennax > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From anaryin at gmail.com Fri Jun 1 07:03:33 2012 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 1 Jun 2012 09:03:33 +0200 Subject: [Biopython] geometry.py In-Reply-To: <4FC7BDD0.2020905@usp.br> References: <4FC7BDD0.2020905@usp.br> Message-ID: Hi Frederico, >From what I understand, those dimensions are those of the smallest ellipsoid that fits your structure. I would therefore not expect a perfect match. Ezgi can answer better for sure. How do you calculate the dimensions yourself? Best, Jo?o No dia 31 de Mai de 2012 20:55, "Frederico Moraes Ferreira" < ferreirafm at usp.br> escreveu: > Hi Jo?o, > The gyration radio (Rg) is running just fine. They are in excellent > agreement with those from some models I have tested. > However, the maximum dimensions do not match at all. Did orientate the > model before tensor analysis? > Fred > > ______________________________**_________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/**mailman/listinfo/biopython > From ferreirafm at usp.br Fri Jun 1 19:26:38 2012 From: ferreirafm at usp.br (Frederico Moraes Ferreira) Date: Fri, 01 Jun 2012 16:26:38 -0300 Subject: [Biopython] geometry.py In-Reply-To: References: <4FC7BDD0.2020905@usp.br> Message-ID: <4FC9176E.6080109@usp.br> Hi Jo?o, I was pretended by your print commands in the end of the calculate_shape_param function. Looking closer to your code, the outputs are the ellipsoid' s semi-axes. Even though, if I double them some large discrepancies appear when compared with the maximum dimension of the shape. I have tested those functions for a considerable number of pdbs and wouldn't expecting a perfect match either. However, there are some discrepancies which should be investigated. 
Here goes some of them: ################################ S_1AVSA_29_0004_1.pdb #Dimensions(a,b,c) #Rg #Anisotropy 36.30 76.0 170.03 42.43 0.51 Dmax: 133.07 S_1AVSA_29_0004_10.pdb #Dimensions(a,b,c) #Rg #Anisotropy 37.10 52.63 62.49 20.06 0.07 Dmax: 70.58 S_1AVSA_29_0004_11.pdb #Dimensions(a,b,c) #Rg #Anisotropy 38.94 52.74 81.96 23.47 0.18 Dmax: 104.99 S_1AVSA_29_0004_12.pdb #Dimensions(a,b,c) #Rg #Anisotropy 36.59 54.33 127.19 31.99 0.47 Dmax: 104.18 S_1AVSA_29_0004_13.pdb #Dimensions(a,b,c) #Rg #Anisotropy 33.14 52.26 76.41 21.99 0.19 Dmax: 81.21 S_1AVSA_29_0004_14.pdb #Dimensions(a,b,c) #Rg #Anisotropy 39.52 54.42 121.81 31.11 0.43 Dmax: 104.17 S_1AVSA_29_0004_16.pdb #Dimensions(a,b,c) #Rg #Anisotropy 34.10 53.17 176.35 41.89 0.69 Dmax: 131.53 S_1AVSA_29_0004_17.pdb #Dimensions(a,b,c) #Rg #Anisotropy 31.73 74.31 101.41 28.99 0.23 Dmax: 90.38 ################################# All the Best, Fred P.S.: here goes code to calculate Dmax (https://gist.github.com/2854563) Em 01-06-2012 04:03, Jo?o Rodrigues escreveu: > > Hi Frederico, > > From what I understand, those dimensions are those of the smallest > ellipsoid that fits your structure. I would therefore not expect a > perfect match. Ezgi can answer better for sure. > > How do you calculate the dimensions yourself? > > Best, > > Jo?o > > No dia 31 de Mai de 2012 20:55, "Frederico Moraes Ferreira" > > escreveu: > > Hi Jo?o, > The gyration radio (Rg) is running just fine. They are in > excellent agreement with those from some models I have tested. > However, the maximum dimensions do not match at all. Did orientate > the model before tensor analysis? 
> Fred
>
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
>
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From ezzgikaraca at gmail.com  Wed Jun  6 10:51:12 2012
From: ezzgikaraca at gmail.com (ezgi karaca)
Date: Wed, 6 Jun 2012 12:51:12 +0200
Subject: [Biopython] geometry.py
Message-ID:

Hello Frederico,

In the code, we had a small scaling error and a typo; both are fixed
now. You can get the current version from the same link:
http://nmr.chem.uu.nl/~joao/f/geometry.py

Just to make things clear, this is how we calculate the dimensions:

1. We calculate the gyration tensor of the protein (the geometrical one;
we don't consider the masses)

2. We diagonalize the gyration tensor and get its eigenvalues

3. We take the square roots of the eigenvalues, and those correspond to
the lengths of the semi-axes of the ellipsoid

We got this procedure from the following reference: *Vondrasek J (2011)
Gyration- and Inertia-Tensor-Based Collective Coordinates for
Metadynamics. Application on the Conformational Behavior of Polyalanine
Peptides and Trp-Cage Folding - The Journal of Physical Chemistry A*

As João has indicated, Dmax and the maximum axis length should not be
exactly the same, since we average out the coordinates while calculating
the gyration tensor. But, of course, they should at least be close to
each other. So, please test the current version to see how well it
matches your calculations; hopefully this version will give us more
reasonable results!

Cheers,

Ezgi

From ferreirafm at usp.br  Wed Jun  6 14:22:35 2012
From: ferreirafm at usp.br (Frederico Moraes Ferreira)
Date: Wed, 06 Jun 2012 11:22:35 -0300
Subject: [Biopython] geometry.py
In-Reply-To:
References:
Message-ID: <4FCF67AB.2060505@usp.br>

Dear Ezgi,
Thanks for your explanation and references. Here are the calculations
for the same PDB set from my previous message. I haven't read the
references yet.
However, according to João's definition, if the semi-axes are those of
the smallest ellipsoid that fits the PDB structure, then at least one of
the axes must necessarily match Dmax. Otherwise, some atoms lie outside
the ellipsoid, which contradicts the primary definition.
If you are interested, we can stay in touch and discuss this matter a
bit more.
All the best,
Fred

P.S.: semi-axes not doubled!

##############################################
S_1AVSA_29_0004_1.pdb
#Dimensions(a,b,c)  #Rg  #Anisotropy
8.12 16.99 38.02  42.43  0.51
Dmax: 133.07
S_1AVSA_29_0004_10.pdb
#Dimensions(a,b,c)  #Rg  #Anisotropy
8.30 11.77 13.97  20.06  0.07
Dmax: 70.58
S_1AVSA_29_0004_11.pdb
#Dimensions(a,b,c)  #Rg  #Anisotropy
8.71 11.79 18.33  23.47  0.18
Dmax: 105.00
S_1AVSA_29_0004_12.pdb
#Dimensions(a,b,c)  #Rg  #Anisotropy
8.18 12.15 28.44  31.99  0.47
Dmax: 104.18
S_1AVSA_29_0004_13.pdb
#Dimensions(a,b,c)  #Rg  #Anisotropy
7.41 11.69 17.08  21.99  0.19
Dmax: 81.21
S_1AVSA_29_0004_14.pdb
#Dimensions(a,b,c)  #Rg  #Anisotropy
8.84 12.17 27.24  31.11  0.43
Dmax: 104.17
S_1AVSA_29_0004_15.pdb
#Dimensions(a,b,c)  #Rg  #Anisotropy
10.23 12.39 18.6  24.58  0.13
Dmax: 84.32
S_1AVSA_29_0004_16.pdb
#Dimensions(a,b,c)  #Rg  #Anisotropy
7.62 11.89 39.43  41.89  0.69
Dmax: 131.53
S_1AVSA_29_0004_17.pdb
#Dimensions(a,b,c)  #Rg  #Anisotropy
7.10 16.62 22.68  28.99  0.23
Dmax: 90.38
S_1AVSA_29_0004_18.pdb
#Dimensions(a,b,c)  #Rg  #Anisotropy
8.73 11.69 20.16  24.89  0.24
Dmax: 89.03
S_1AVSA_29_0004_19.pdb
#Dimensions(a,b,c)  #Rg  #Anisotropy
6.97 11.02 16.6  21.11  0.2
Dmax: 75.23
#############################################

On 06-06-2012 07:51, ezgi karaca wrote:
> Hello Frederico,
>
> In the code, we had a small scaling error and a typo; both are fixed
> now. You can get the current version from the same link:
> http://nmr.chem.uu.nl/~joao/f/geometry.py
>
> Just to make things clear, this is how we calculate the dimensions:
>
> 1. We calculate the gyration tensor of the protein (the geometrical
> one; we don't consider the masses)
>
> 2. We diagonalize the gyration tensor and get its eigenvalues
>
> 3. We take the square roots of the eigenvalues, and those correspond
> to the lengths of the semi-axes of the ellipsoid
>
> We got this procedure from the following reference: *Vondrasek J
> (2011) Gyration- and Inertia-Tensor-Based Collective Coordinates for
> Metadynamics. Application on the Conformational Behavior of
> Polyalanine Peptides and Trp-Cage Folding - The Journal of Physical
> Chemistry A*
>
> As João has indicated, Dmax and the maximum axis length should not be
> exactly the same, since we average out the coordinates while
> calculating the gyration tensor. But, of course, they should at least
> be close to each other. So, please test the current version to see how
> well it matches your calculations; hopefully this version will
> give us more reasonable results!
>
> Cheers,
>
> Ezgi
>
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
>

From p.j.a.cock at googlemail.com  Sun Jun 10 10:24:20 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 10 Jun 2012 11:24:20 +0100
Subject: [Biopython] EU-codefest
In-Reply-To: <20120609195912.GA27963@thebird.nl>
References: <20120609195912.GA27963@thebird.nl>
Message-ID:

Dear Biopythoneers,

Some of you might like to attend an Open-Bio Hackathon in Italy
this summer - 19 and 20 July 2012, in Lodi. This is about a week
after BOSC and the pre-BOSC CodeFest in California
http://www.open-bio.org/wiki/BOSC_2012

Peter

---------- Forwarded message ----------
From: *Pjotr Prins*
Date: Saturday, June 9, 2012
Subject: EU-codefest
To: cjfields at illinois.edu
Cc: p.j.a.cock at googlemail.com

Hi Chris and Peter,

Would you mind sending a reminder of the EU-codefest to your lists?
Registration form is up:

  http://www.open-bio.org/wiki/EU_Codefest_2012

Three main topics will be worked on during the CodeFest:

  NGS and high performance parsers for OpenBio projects.
  RDF and semantic web for bioinformatics.
  Bioinformatics pipelines definition, execution and distribution.

Other tracks are welcome!

Pj.

From clements at galaxyproject.org  Sun Jun 10 17:33:25 2012
From: clements at galaxyproject.org (Dave Clements)
Date: Sun, 10 Jun 2012 10:33:25 -0700
Subject: [Biopython] GCC2012 Early Registration ENDS THIS MONDAY JUNE 11
In-Reply-To:
References:
Message-ID:

Hello all,

Just a *final* reminder that early registration for the 2012 Galaxy
Community Conference (GCC2012) *closes on Monday June 11* (which is
probably *today* when you read this). Registering early saves 36 to 42%
on registration costs, and allows you to sign up for the GCC2012
Training Day and book discounted conference lodging *before they fill
up*. *Register today.*

GCC2012 will be held July 25-27, in Chicago, Illinois, United States.
This year GCC2012 features a full day of tutorial sessions with 3
parallel tracks, each featuring four 90-minute workshops and covering 10
different topics, including the newly added Variant and SNP Analysis,
RNA-Seq Analysis, and Galaxy Code Architecture sessions.

The two-day main meeting includes over 25 talks by Galaxy community
members and Galaxy developers addressing the challenges of integrating,
analyzing, and sharing the diverse and very large datasets that are now
typical in biomedical research.

GCC2012 is an opportunity to share best practices with, and learn from,
a large community of researchers and support staff who are facing the
challenges of data-intensive biology. Galaxy is an open web-based
platform for data intensive biomedical research that is widely used and
deployed at research organizations of all sizes around the world.

See you in Chicago!
Dave Clements, on behalf of the GCC2012 Organizing Committee

Links:
http://galaxyproject.org/GCC2012
http://galaxyproject.org/
http://getgalaxy.org/
http://usegalaxy.org/
http://galaxyproject.org/wiki/

From barry_finzel at yahoo.com  Fri Jun 15 21:55:29 2012
From: barry_finzel at yahoo.com (Barry Finzel)
Date: Fri, 15 Jun 2012 14:55:29 -0700 (PDT)
Subject: [Biopython] Bio.PDB DisorderedResidue Usage
Message-ID: <1339797329.23297.YahooMailClassic@web130201.mail.mud.yahoo.com>

In reading a PDB file with a DisorderedResidue, I find I cannot select
the "first" of the individual Residue objects wrapped in the
DisorderedEntity wrapper as they were read. Using:

from Bio.PDB import *
parser = PDBParser()
struct = parser.get_structure('input','1fjl.pdb')

PDB structure 1fjl contains a disordered residue defined in the
following records:

SEQRES   1 D   14   DA  DA  DT  DA  DA  DT  DC  DT  DG  DA  DT  DT  DA
SEQRES   2 D   14   DC
ATOM   1691  P  A DT D   8      21.334 116.347  17.134  0.50 31.45           P
ATOM   1692  OP1A DT D   8      20.849 116.344  18.534  0.50 36.10           O
ATOM   1693  OP2A DT D   8      20.849 117.360  16.179  0.50 32.93           O
ATOM   1694  O5'A DT D   8      22.893 116.496  17.162  0.50 30.63           O
..
ATOM   1711  P  B DA D   8      21.278 115.687  17.543  0.50 30.90           P
ATOM   1712  OP1B DA D   8      20.886 115.137  18.859  0.50 32.04           O
ATOM   1713  OP2B DA D   8      20.643 116.935  17.025  0.50 27.28           O
ATOM   1714  O5'B DA D   8      22.825 115.928  17.500  0.50 26.48           O

The PDB always encodes the sequence of the FIRST variant in the SEQRES
card (in this case, DT), but there seems to be no way to unwrap the two
Residue.Residue objects wrapped in a Residue.DisorderedResidue to
identify which of the two residues this would be. Unlike the
DisorderedAtoms, which are selected by ALTLOC code, the
DisorderedResidues are selected by residue name ('DT' or 'DA' in this
case).
I have an application where I need to be certain that the Residue
instance I select matches the residue type of the SEQRES card entry
(e.g., the FIRST one in the file). Is there any way to do this? I would
have thought that keying the Residue instances in the DisorderedResidue
on ALTLOC (as in DisorderedAtom) would have been a better way to handle
this.

Barry Finzel
University of Minnesota

From Anita.Norman at slu.se  Thu Jun 21 16:08:24 2012
From: Anita.Norman at slu.se (Anita Norman)
Date: Thu, 21 Jun 2012 18:08:24 +0200
Subject: [Biopython] extremely long execution time
Message-ID:

Hello Biopythoners!

I am working with fastq files, and although I have processed them with
many different scripts, I have now for the first time run into the
problem that one script will take 8+ days to process one file. I figure
I must be doing something wrong.

Here is what I am trying to do:

I have a file with a list of record ids (~3 mil rec ids): recidfile
I have two paired files from which the record ids originally came
(~10 mil recs each): infile
I wish to create two files (withfile and withoutfile) from each of the
paired files, run individually. One of the new files will have the
records whose ids are in the list and the other the records whose ids
are not in the list.

I have tried doing this with and without the FastqGeneralIterator, but
both methods will require at least 8 days.

Here is my code with the FastqGeneralIterator:

from time import time
from Bio.SeqIO.QualityIO import FastqGeneralIterator

start = time()

recids = open(recidfile, 'r')
for item in recids: recidlist.append(item[0:-2])

handle1 = open(withfile, 'w')
handle2 = open(withoutfile, 'w')

for header, seq, qual in FastqGeneralIterator(open(infile)):
    if header[:-1] in recidlist:
        handle1.write('%s\n%s\n+\n%s\n' %(header, seq, qual))
    else:
        handle2.write('%s\n%s\n+\n%s\n' %(header, seq, qual))

Can anyone advise me on how I can possibly make this go faster? I would
prefer 8 minutes over 8 days.
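
A likely reason for the slowdown in the script above is the membership test: `header[:-1] in recidlist` scans a Python list from the start, so with ~3 million ids and ~10 million records the loop can need up to roughly 3 x 10^13 comparisons. A minimal sketch of the same split using a set (hash lookup) is shown below. It uses only the standard library so that it can run on its own; `read_fastq` and `split_by_ids` are hypothetical helper names, and `read_fastq` is a simplified stand-in for `FastqGeneralIterator` that assumes strict four-line records.

```python
import io


def read_fastq(handle):
    """Yield (title, seq, qual) tuples from a four-line-per-record FASTQ handle.

    Simplified stand-in for Bio.SeqIO.QualityIO.FastqGeneralIterator;
    like that iterator, the title is returned without the leading '@'.
    """
    while True:
        header = handle.readline()
        if not header:
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:].rstrip("\n"), seq, qual


def split_by_ids(records, wanted_ids, with_handle, without_handle):
    """Write each record to one of two handles depending on its title."""
    wanted = set(wanted_ids)  # set membership is a hash lookup, not a scan
    for title, seq, qual in records:
        out = with_handle if title in wanted else without_handle
        out.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))


# Tiny in-memory example (hypothetical data) instead of real files.
fastq = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nJJJJ\n")
kept, dropped = io.StringIO(), io.StringIO()
split_by_ids(read_fastq(fastq), ["r1"], kept, dropped)
print(kept.getvalue())
print(dropped.getvalue())
```

With real data the two `StringIO` objects would be replaced by open file handles, and the titles would need the same trimming that the original script applies to make the ids comparable.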
Thanks in advance
Anita

From p.j.a.cock at googlemail.com  Thu Jun 21 16:21:07 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 21 Jun 2012 17:21:07 +0100
Subject: [Biopython] extremely long execution time
In-Reply-To:
References:
Message-ID:

On Thu, Jun 21, 2012 at 5:08 PM, Anita Norman wrote:
> Hello Biopythoners!
>
> ...
>
> Here is my code with the fastqGeneralIterator:
>
> from time import time
> from Bio.SeqIO.QualityIO import FastqGeneralIterator
>
> start = time()
>
> recids = open(recidfile, 'r')
> for item in recids: recidlist.append(item[0:-2])

You must be leaving out a line to define recidlist - but I'll assume
it was just:

recidlist = []

Notice the "Filtering a sequence file" in the tutorial uses a set,
http://biopython.org/DIST/docs/tutorial/Tutorial.html and says
"Note that we use a Python set rather than a list, this makes
testing membership faster."

So try this:

recidset = set([])
for item in recids: recidset.add(item[0:-2])

(and later use recidset instead of recidlist)

Peter

From p.j.a.cock at googlemail.com  Fri Jun 22 21:29:43 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 22 Jun 2012 22:29:43 +0100
Subject: [Biopython] extremely long execution time
In-Reply-To:
References:
Message-ID:

Hi Anita,

Thank you for letting us know how you got on - I'm impressed
just how much of a difference it made :)

Peter

On Fri, Jun 22, 2012 at 7:56 PM, Anita Norman wrote:
> Hi Peter,
>
> Thanks so much for your quick and helpful response. What a difference it
> makes. Now one entire file runs in less than one minute.
>
> All the best and happy midsummer!
>
> Anita
>
>
>
> On 21/06/2012 18:21, "Peter Cock" wrote:
>
>>On Thu, Jun 21, 2012 at 5:08 PM, Anita Norman wrote:
>>> Hello Biopythoners!
>>>
>>> ...
>>>
>>> Here is my code with the fastqGeneralIterator:
>>>
>>> from time import time
>>> from Bio.SeqIO.QualityIO import FastqGeneralIterator
>>>
>>> start = time()
>>>
>>> recids = open(recidfile, 'r')
>>> for item in recids: recidlist.append(item[0:-2])
>>
>>You must be leaving out a line to define recidlist - but I'll assume
>>it was just:
>>
>>recidlist = []
>>
>>Notice the "Filtering a sequence file" in the tutorial uses a set,
>>http://biopython.org/DIST/docs/tutorial/Tutorial.html and says
>>"Note that we use a Python set rather than a list, this makes
>>testing membership faster."
>>
>>So try this:
>>
>>recidset = set([])
>>for item in recids: recidset.add(item[0:-2])
>>
>>(and later use recidset instead of recidlist)
>>
>>Peter
>

From p.j.a.cock at googlemail.com  Mon Jun 25 18:22:12 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 25 Jun 2012 19:22:12 +0100
Subject: [Biopython] Biopython 1.60
Message-ID:

Dear Biopythoneers,

Biopython 1.60 is out:
http://news.open-bio.org/news/2012/06/biopython-1-60-released/

Thank you to everyone who has contributed.

Peter

P.S. We're on Twitter as @Biopython - see also @obf_news
https://twitter.com/#!/biopython

From David.Lapointe at umassmed.edu  Mon Jun 25 20:57:47 2012
From: David.Lapointe at umassmed.edu (Lapointe, David)
Date: Mon, 25 Jun 2012 20:57:47 +0000
Subject: [Biopython] Biopython 1.60
Message-ID: <86BFEB1DFA6CB3448DB8AB1FC52F405909D651@ummscsmbx06.ad.umassmed.edu>

Great!

I have EMBOSS installed, but the EMBOSS tests always fail. The EMBOSS
programs are on the $PATH.

David
--
David Lapointe, Ph.D.
Director Scientific Computing/Information Services
University of Massachusetts Medical School
55 Lake Avenue N
Worcester MA 01655
508-856-5141 (v)

' the lyf so short, the craft so long to lerne'

From zhigangwu.bgi at gmail.com  Mon Jun 25 23:10:39 2012
From: zhigangwu.bgi at gmail.com (Zhigang Wu)
Date: Tue, 26 Jun 2012 07:10:39 +0800
Subject: [Biopython] Biopython 1.60
In-Reply-To:
References:
Message-ID: <69156DC5-445E-45B9-B352-7E816F8A90C3@gmail.com>

Great. Thanks!

Sent from my iPhone

On Jun 26, 2012, at 2:22 AM, Peter Cock wrote:

> Dear Biopythoneers,
>
> Biopython 1.60 is out:
> http://news.open-bio.org/news/2012/06/biopython-1-60-released/
>
> Thank you to everyone who has contributed.
>
> Peter
>
> P.S. We're on Twitter as @Biopython - see also @obf_news
> https://twitter.com/#!/biopython
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From p.j.a.cock at googlemail.com  Tue Jun 26 08:27:53 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 26 Jun 2012 09:27:53 +0100
Subject: [Biopython] Biopython 1.60
In-Reply-To: <86BFEB1DFA6CB3448DB8AB1FC52F405909D651@ummscsmbx06.ad.umassmed.edu>
References: <86BFEB1DFA6CB3448DB8AB1FC52F405909D651@ummscsmbx06.ad.umassmed.edu>
Message-ID:

On Monday, June 25, 2012, Lapointe, David wrote:

> Great!
>
> I have EMBOSS installed, but the EMBOSS tests always fail. The EMBOSS
> programs are on the $PATH.
>
> David
>
>
Hi David,

Could you tell us which version of EMBOSS you have, your OS, and copy and
paste the error from the tests please?

My guess is a minor change between versions of EMBOSS.
Thanks,

Peter

From p.j.a.cock at googlemail.com  Tue Jun 26 12:40:50 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 26 Jun 2012 13:40:50 +0100
Subject: [Biopython] test_Emboss.py, was: Biopython 1.6.0
Message-ID:

On Tue, Jun 26, 2012 at 1:11 PM, David Lapointe wrote:
> Ubuntu 10.04
> Python 2.6.5
> Emboss 6.3.1
>
> These are the tests that fail, always. It may be because I have a
> wrapper around the application ( setting up the EMBOSS_VARS)
>
> ======================================================================
> FAIL: needle with the asis trick, output to a file.
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "test_Emboss.py", line 538, in test_needle_file
>     self.assertTrue(os.path.isfile(filename))
> AssertionError
>
> ======================================================================
> FAIL: water with the asis trick, output to a file.
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "test_Emboss.py", line 479, in test_water_file
>     self.run_water(cline)
>   File "test_Emboss.py", line 462, in run_water
>     self.assertTrue(os.path.isfile(cline.outfile))
> AssertionError
>
>
> On Tuesday, June 26, 2012, Peter wrote:
>
>   On Monday, June 25, 2012, Lapointe, David wrote:
>
>       Great!
>
>       I have EMBOSS installed, but the EMBOSS tests always fail.
>       The EMBOSS programs are on the $PATH.
>
>       David
>

Thanks David,

I've made a small change to give a more informative error in future:
https://github.com/biopython/biopython/commit/1d1f2a45658f808e22b8d0dbdcf2e6f825581fd7

However, I think the problem is probably that your wrapper script fails
to escape filenames with spaces. Both those failing tests use output
files with a space in their name.
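
The suspected failure mode can be illustrated without a shell. The sketch below is an assumption about what such a wrapper might do, not David's actual script: `forward_unquoted` imitates a wrapper that re-expands its arguments without quoting, `forward_quoted` imitates one that passes each argument through intact, and the command line and file name are made up for illustration.

```python
def forward_unquoted(args):
    """Imitate a wrapper that expands its arguments without quoting.

    The arguments are joined and re-split on whitespace, so a space
    inside one argument turns into an argument boundary.
    """
    return " ".join(args).split()


def forward_quoted(args):
    """Imitate a wrapper that forwards every argument unchanged."""
    return list(args)


# Hypothetical command line; only the space in the output name matters.
cmd = ["needle", "-outfile", "temp with space.needle"]

broken = forward_unquoted(cmd)
intact = forward_quoted(cmd)
print(broken)  # the file name has been split into three separate words
print(intact)  # the file name is still a single argument
```

In the unquoted case the program receives an output name of just `temp`, so a test that checks for the file with the full name would fail in the way the tracebacks show.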
This might be useful to you:
http://www.biostars.org/post/show/18642/bash-dollar-at-variable-loses-quote-characters/

Regards,

Peter

From linlifeng at gmail.com  Wed Jun 27 21:20:46 2012
From: linlifeng at gmail.com (Lifeng Lin)
Date: Wed, 27 Jun 2012 16:20:46 -0500
Subject: [Biopython] download sequences by date from Genbank
Message-ID:

Hi folks,

Is there an elegant way of downloading sequences from Genbank and using
date as a cutoff?

I am trying to maintain an up-to-date local version of all sequences for a
certain number of species. When "synching" with Genbank, all i can think of
is retrieving all GI numbers for these species once again, compare them
with what i have locally, and generate a list of new sequences and append
them. I have a hunch that there might be a better way of doing this, for
example, if there is a date filter that we can apply for Genbank download,
then all the trouble for comparisons would be saved.

Any suggestions?

best,
L.

From mjldehoon at yahoo.com  Thu Jun 28 01:01:59 2012
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Wed, 27 Jun 2012 18:01:59 -0700 (PDT)
Subject: [Biopython] download sequences by date from Genbank
In-Reply-To:
Message-ID: <1340845319.35121.YahooMailClassic@web161206.mail.bf1.yahoo.com>

Hi Lifeng,

Have a look at esearch in the NCBI E-Utilities:
http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch
You can access the E-Utilities and parse the results with Bio.Entrez as
described in the Biopython manual.

Best,
-Michiel

--- On Wed, 6/27/12, Lifeng Lin wrote:

> From: Lifeng Lin
> Subject: [Biopython] download sequences by date from Genbank
> To: biopython at lists.open-bio.org
> Date: Wednesday, June 27, 2012, 5:20 PM
> Hi folks,
>
> Is there an elegant way of downloading sequences from
> Genbank and using
> date as a cutoff?
>
> I am trying to maintain an up-to-date local version of all
> sequences for a
> certain number of species.
> When "synching" with Genbank, all
> i can think of
> is retrieving all GI numbers for these species once again,
> compare them
> with what i have locally, and generate a list of new
> sequences and append
> them. I have a hunch that there might be a better way of
> doing this, for
> example, if there is a date filter that we can apply for
> Genbank download,
> then all the trouble for comparisons would be saved.
>
> Any suggestions?
>
> best,
> L.
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From arklenna at gmail.com  Sat Jun 30 05:50:15 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Sat, 30 Jun 2012 01:50:15 -0400
Subject: [Biopython] Variant interface
Message-ID:

Hi all,

I'm working on Biopython for Google summer of code; my project is to create
an interface between Biopython and various existing tools for handling
sequence variants (including VCF format).

I am seeking feedback from variant users. What could my interface offer
that would make it easier to use variants with Biopython? For example, I am
planning on a function that will essentially skim through a large file to
give a general overview of its contents. More specifically, in what ways
should variant data be able to interact with existing parts of Biopython
(such as SeqFeature, SeqRecord)?

Looking forward to any thoughts you share.

Cheers,

Lenna

github.com/lennax

From chris.mit7 at gmail.com  Sat Jun 30 14:47:10 2012
From: chris.mit7 at gmail.com (Chris Mitchell)
Date: Sat, 30 Jun 2012 10:47:10 -0400
Subject: [Biopython] Variant interface
In-Reply-To:
References:
Message-ID:

Hi Lenna,

Here are some features of the VCF/GFF parser I wrote that I use:

The choice to be used as an iterator for parsing huge files or store the
entire vcf in memory.
isVariant(chromosome, position, *arg) -- returns if it is a variant,
optional arg if it's a user-supplied variant
writeVariant(file) -- writes the vcf object to a given file handle
getVariants() -- generator for all the vcf objects, similar method for GFF
exists
getAttribute(attribute, *arg) -- get objects with a given attribute,
optional arg is to get objects with a given attribute equal to arg
getChildren() -- gets child objects of GFF if exists
getParent() -- gets parent object of GFF if exists
getXXX() -- gets the standard info for any VCF/GFF object like SeqId,
Start, End, Alt, Ref, VCF type, etc.
addAttribute(key, value) -- adds a feature to a given GFF/VCF object
removeAttribute(key) -- removes feature

Optional keywords:
filter = [string,string...] -- only keep variants with the keys in filter
filterOnly = Bool -- only keep features specified in filter in our object
(so if we have 20 key-value attributes, just keep the keys in filter)
keyDelim = string -- for compatibility with non-standard vcf/GFF formats
that don't use '=' for the key-value separator
fast = [string, string...] -- if we parse the file in memory, keep the keys
in this list in a dictionary for immediate access to the vcf/GFF objects
exclude = [string, string...] -- exclude entries with these keys
cols = (int,int) -- what cols to use -- useful for parsing in GFF files
that have been merged with bedtools
random = Bool -- if we're treating the file as an iterable, stores the
object's file position coordinates in a dictionary for random access to
objects

Some usage cases this helps me with:
Parsing through a file and adding/removing annotations (for instance if I
want to add the coding transcript affected by a VCF to the file itself)
Trim down files based on several criteria to a smaller more informative file
Being able to read a file only to the point I care about (random access
methods that can index attributes as well as the normal
identifiers/iterator)
Immediate access to an attribute I care about

Hope that helps

Chris

On Sat, Jun 30, 2012 at 1:50 AM, Lenna Peterson wrote:

> Hi all,
>
>
> I'm working on Biopython for Google summer of code; my project is to create
> an interface between Biopython and various existing tools for handling
> sequence variants (including VCF format).
>
>
> I am seeking feedback from variant users. What could my interface offer
> that would make it easier to use variants with Biopython? For example, I am
> planning on a function that will essentially skim through a large file to
> give a general overview of its contents. More specifically, in what ways
> should variant data be able to interact with existing parts of Biopython
> (such as SeqFeature, SeqRecord)?
>
>
> Looking forward to any thoughts you share.
>
>
> Cheers,
>
>
> Lenna
>
> github.com/lennax
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
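
The iterator mode and the `filter` keyword from Chris's list can be sketched in a few lines of plain Python. This is neither Chris's parser nor a Biopython API: `iter_variants` and `required_keys` are hypothetical names, and the reader assumes tab-separated VCF data lines with the eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and `key=value` pairs separated by `;` in INFO.

```python
import io


def iter_variants(handle, required_keys=()):
    """Yield one dict per VCF data line, reading the file lazily.

    Only records whose INFO column contains every key in
    `required_keys` are yielded (a rough analogue of a `filter` keyword).
    """
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, the header line, and blanks
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _ident, ref, alt = fields[:5]
        info = {}
        for item in fields[7].split(";"):
            key, _sep, value = item.partition("=")
            info[key] = value  # flag-style keys get an empty string
        if all(key in info for key in required_keys):
            yield {
                "chrom": chrom,
                "pos": int(pos),
                "ref": ref,
                "alt": alt.split(","),
                "info": info,
            }


# Made-up two-record example held in memory instead of a real file.
vcf_text = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\t.\tA\tG\t50\tPASS\tDP=10;AF=0.5\n"
    "1\t200\trs1\tC\tT,G\t99\tPASS\tDP=7\n"
)
with_af = list(iter_variants(io.StringIO(vcf_text), required_keys=["AF"]))
everything = list(iter_variants(io.StringIO(vcf_text)))
print(len(with_af), len(everything))
```

Because the function is a generator, a huge file is never held in memory; wrapping the call in `list(...)` as above gives the in-memory mode when the file is small enough.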