From mictadlo at gmail.com Thu Mar 1 02:50:23 2012
From: mictadlo at gmail.com (Mic)
Date: Thu, 1 Mar 2012 17:50:23 +1000
Subject: [Biopython] coverage calculating from BAM
Message-ID:

Hello,
How is it possible to calculate coverage from a BAM file in a format
such as "10x coverage"?

Thank you in advance.

From mictadlo at gmail.com Thu Mar 1 03:14:04 2012
From: mictadlo at gmail.com (Mic)
Date: Thu, 1 Mar 2012 18:14:04 +1000
Subject: [Biopython] Google Summer of Code
Message-ID:

Hello,
Is it possible to use PyPy with:
* BioPython
* Pysam
* Matplotlib
* etc

If not, then it might be a good idea to get support for it with the help
of Google Summer of Code, because PyPy is getting faster and faster.

Cheers,

From p.j.a.cock at googlemail.com Thu Mar 1 06:00:01 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 1 Mar 2012 11:00:01 +0000
Subject: [Biopython] samtools does not return correct exit code
In-Reply-To:
References:
Message-ID:

On Thu, Mar 1, 2012 at 1:40 AM, Mic wrote:
> Hello,
> Samtools does not return the correct exit code:
>
> import subprocess
> import logging
> import sys
>
> def run_cmd(args):
>     if subprocess.call(args,shell=True) != 0:
>         print 'hello'
>         logging.error("Error copying sequence file args='%s'" %
> str(args))
>         return 1
>     print 'e', sys.stderr
>     print 'o', sys.stdout
>     return 0
>
>
> def runSamtools( cmd ):
>     '''run a samtools command'''
>
>     try:
>         retcode = subprocess.call(cmd, shell=True)
>         print retcode
>         if retcode < 0:
>             print >>sys.stderr, "Child was terminated by signal", -retcode
>     except OSError, e:
>         print >>sys.stderr, "Execution failed:", e
>
> print run_cmd("samtools faidx ex1.fa")
> print runSamtools("samtools faidx ex1.fa")
>
> print 'Hello still alive'
>
>
> and as output I got:
>
> $ python p3.py
> open: No such file or directory
> [_razf_open] fail to open ex1.fa
> [fai_build] fail to open the FASTA file ex1.fa
> e <open file '<stderr>', mode 'w' at 0x7ffa4658d270>
> o <open file '<stdout>', mode 'w' at 0x7ffa4658d1e0>
> 0
> open: No such file or directory
> [_razf_open] fail to open ex1.fa
> [fai_build] fail to open the FASTA file ex1.fa
> 0
> None
> Hello still alive
>
> How can I make sure that all samtools commands were executed successfully?
>
> Thank you in advance.

Hi Mic,

General Bioinformatics with Python questions are fine on the Biopython
mailing list, but I think this query might be better asked elsewhere.

Are you saying the samtools binary returns an error code 0 (success)
even when it fails? If so, that should be raised as a bug with samtools.

Alternatively pysam has support built in for calling the samtools commands.
I'm not sure exactly how that works internally (e.g. via subprocess or by
a C API call), but ask on the pysam mailing list.

Regards,

Peter

From p.j.a.cock at googlemail.com Thu Mar 1 06:02:33 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 1 Mar 2012 11:02:33 +0000
Subject: [Biopython] coverage calculating from BAM
In-Reply-To:
References:
Message-ID:

On Thu, Mar 1, 2012 at 7:50 AM, Mic wrote:
> Hello,
> How is it possible to calculate coverage from a BAM file in a format
> such as "10x coverage"?
>
> Thank you in advance.

Normally we'd talk about coverage as it varies along the genome,
perhaps using a sliding window. This is often represented using
a wiggle file or a BigWig file - and there are scripts for computing
these from SAM/BAM alignments.

Are you looking for a single number for the entire BAM file?

Peter

P.S. What does this have to do with pysam or Biopython?
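
To make the distinction between a single genome-wide figure and coverage
along the genome concrete, here is a minimal sketch. It assumes the per-base
depths have already been exported as tab-separated text with three columns
(reference name, 1-based position, depth), which is the layout written by
`samtools depth`. The function names `parse_depth`, `mean_coverage` and
`windowed_coverage` are hypothetical and not part of pysam or Biopython, and
the windows are fixed and non-overlapping rather than truly sliding. Unlike
the Python 2 snippets elsewhere in this thread, it is written for Python 3.

```python
from collections import defaultdict


def parse_depth(lines):
    """Yield (chrom, pos, depth) from 'chrom<TAB>pos<TAB>depth' lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        chrom, pos, depth = line.split("\t")[:3]
        yield chrom, int(pos), int(depth)


def mean_coverage(lines, genome_length=None):
    """Return one mean depth value (the "10x" style figure).

    If genome_length is given it is used as the denominator, so positions
    missing from the input count as zero depth. Otherwise only the
    positions actually listed are averaged.
    """
    total = 0
    seen = 0
    for _chrom, _pos, depth in parse_depth(lines):
        total += depth
        seen += 1
    denominator = genome_length if genome_length is not None else seen
    return total / denominator if denominator else 0.0


def windowed_coverage(lines, window=1000):
    """Return mean depth per fixed window, keyed by (chrom, window_start).

    Each sum is divided by the full window size, so the last window on a
    reference (which may be shorter) is underestimated in this sketch.
    """
    sums = defaultdict(int)
    for chrom, pos, depth in parse_depth(lines):
        start = (pos - 1) // window * window + 1  # 1-based window start
        sums[(chrom, start)] += depth
    return {key: total / window for key, total in sums.items()}
```

With an open text file handle as `lines`, `mean_coverage(lines)` gives the
single number asked about, while `windowed_coverage(lines, window=1000)`
gives values that could be written out as a wiggle-style track. Passing
`genome_length` matters when the depth file omits uncovered positions.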

From p.j.a.cock at googlemail.com Thu Mar 1 06:11:31 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 1 Mar 2012 11:11:31 +0000
Subject: [Biopython] Google Summer of Code
In-Reply-To:
References:
Message-ID:

On Thu, Mar 1, 2012 at 8:14 AM, Mic wrote:
> Hello,
> Is it possible to use PyPy with:
> * BioPython
> * Pysam
> * Matplotlib
> * etc
>
> If not, then it might be a good idea to get support for it with the help
> of Google Summer of Code, because PyPy is getting faster and faster.

Most of Biopython is working under PyPy (ignoring the C extensions,
much like our situation under Jython). This was mentioned in the
release notice for Biopython 1.59 - early adopters may be able to find
other problems that we're not aware of from the unit tests:
http://news.open-bio.org/news/2012/02/biopython-1-59-released/

I doubt there is enough work here alone to make a GSoC project.

I'm not sure about pysam under PyPy - but I would be interested
to know, because here interfacing with the samtools C code is the
essence of pysam. My impression from the PyPy mailing lists is that
calling external C libraries from PyPy is another area of active work.

For matplotlib, you would need NumPy under PyPy. That is an
area of active work for the PyPy team, who are currently trying to
re-implement a pure-Python version of NumPy which they are
calling NumPyPy (originally it was called micronumpy), sufficient
for other libraries using just the Python numpy API to run. A
problem with this is that many Python libraries also use the NumPy
C API (e.g. bits of Biopython). See for example:

http://morepypy.blogspot.com/2012/01/numpypy-status-update.html
http://technicaldiscovery.blogspot.com/2011/10/thoughts-on-porting-numpy-to-pypy.html

I suggest reading the PyPy and NumPy mailing list archives for
more about this.

Peter

From mictadlo at gmail.com Thu Mar 1 06:56:57 2012
From: mictadlo at gmail.com (Mic)
Date: Thu, 1 Mar 2012 21:56:57 +1000
Subject: [Biopython] samtools does not return correct exit code
In-Reply-To:
References:
Message-ID:

Thank you, pysam has similar problems and I have already posted bug reports:
http://code.google.com/p/pysam/issues/detail?id=89
http://code.google.com/p/pysam/issues/detail?id=90

I am going to post the problem on the samtools mailing list.

Cheers,

On Thu, Mar 1, 2012 at 9:00 PM, Peter Cock wrote:
> On Thu, Mar 1, 2012 at 1:40 AM, Mic wrote:
> > Hello,
> > Samtools does not return the correct exit code:
> >
> > import subprocess
> > import logging
> > import sys
> >
> > def run_cmd(args):
> >     if subprocess.call(args,shell=True) != 0:
> >         print 'hello'
> >         logging.error("Error copying sequence file args='%s'" %
> > str(args))
> >         return 1
> >     print 'e', sys.stderr
> >     print 'o', sys.stdout
> >     return 0
> >
> >
> > def runSamtools( cmd ):
> >     '''run a samtools command'''
> >
> >     try:
> >         retcode = subprocess.call(cmd, shell=True)
> >         print retcode
> >         if retcode < 0:
> >             print >>sys.stderr, "Child was terminated by signal", -retcode
> >     except OSError, e:
> >         print >>sys.stderr, "Execution failed:", e
> >
> > print run_cmd("samtools faidx ex1.fa")
> > print runSamtools("samtools faidx ex1.fa")
> >
> > print 'Hello still alive'
> >
> >
> > and as output I got:
> >
> > $ python p3.py
> > open: No such file or directory
> > [_razf_open] fail to open ex1.fa
> > [fai_build] fail to open the FASTA file ex1.fa
> > e <open file '<stderr>', mode 'w' at 0x7ffa4658d270>
> > o <open file '<stdout>', mode 'w' at 0x7ffa4658d1e0>
> > 0
> > open: No such file or directory
> > [_razf_open] fail to open ex1.fa
> > [fai_build] fail to open the FASTA file ex1.fa
> > 0
> > None
> > Hello still alive
> >
> > How can I make sure that all samtools commands were executed successfully?
> >
> > Thank you in advance.
>
> Hi Mic,
>
> General Bioinformatics with Python questions are fine on the Biopython
> mailing list, but I think this query might be better asked elsewhere.
>
> Are you saying the samtools binary returns an error code 0 (success)
> even when it fails? If so, that should be raised as a bug with samtools.
>
> Alternatively pysam has support built in for calling the samtools commands.
> I'm not sure exactly how that works internally (e.g. via subprocess or by
> a C API call), but ask on the pysam mailing list.
>
> Regards,
>
> Peter
>

From mrrizkalla at gmail.com Fri Mar 2 08:41:21 2012
From: mrrizkalla at gmail.com (Mariam Reyad Rizkallah)
Date: Fri, 2 Mar 2012 15:41:21 +0200
Subject: [Biopython] Bio.Phylo bugs & pain points
In-Reply-To:
References:
Message-ID:

Dear Biopython list,

I am facing a similar problem with Phylo in the context of your thread.
I am using Biopython 1.58 on a 32-bit Ubuntu system. I have created a
Newick tree using the phyml command line and want to visualize it using
Bio.Phylo. I can read the Newick file and run draw_ascii and
draw_graphviz perfectly, but not draw(). I have networkx and pylab
installed.

my_view_tree = Phylo.read("myseq.phy_phyml_tree.txt", "newick")
Phylo.draw_ascii(my_view_tree)
my_view_tree_xml = my_view_tree.as_phyloxml()
Phylo.draw(my_view_tree_xml, do_show=True, show_confidence=True, axes=None)

*Error:*

Traceback (most recent call last):
  File "itree/itree2/iTree2.py", line 563, in view_tree
    Phylo.draw(my_view_tree_xml, do_show=True, show_confidence=True, axes=None)
AttributeError: 'module' object has no attribute 'draw'

Thank you.

On Sat, Feb 18, 2012 at 7:11 PM, Eric Talevich wrote:
> On Sat, Feb 18, 2012 at 11:34 AM, Eric Talevich wrote:
>
> > So -- do the trees drawn by Phylo.draw() look right?
> >
>
>
> Here's how to get a quick tree, using a test file from the Biopython source
> distribution:
>
> >>> from Bio import Phylo
> >>> tree = Phylo.read("Tests/PhyloXML/apaf.xml", "phyloxml")
> >>> Phylo.draw(tree)
>
>
> If you don't have the Tests/ directory, you can use any other Newick, Nexus
> or PhyloXML tree; just change the file name and format name in the call to
> Phylo.read().
>
> Thanks,
> Eric
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From mrrizkalla at gmail.com Fri Mar 2 08:48:15 2012
From: mrrizkalla at gmail.com (Mariam Reyad Rizkallah)
Date: Fri, 2 Mar 2012 15:48:15 +0200
Subject: [Biopython] Bio.Phylo AttributeError: 'module' object has no attribute 'draw'
Message-ID:

>
> Dear Biopython list,
>
> I am facing a similar problem with Phylo in the context of your thread. I am
> using Biopython 1.58 on a 32-bit Ubuntu system. I have created a Newick tree
> using the phyml command line and want to visualize it using Bio.Phylo. I can
> read the Newick file and run draw_ascii and draw_graphviz perfectly, but not
> draw(). I have networkx and pylab installed.
>
> my_view_tree = Phylo.read("myseq.phy_phyml_tree.txt", "newick")
> Phylo.draw_ascii(my_view_tree)
> my_view_tree_xml = my_view_tree.as_phyloxml()
> Phylo.draw(my_view_tree_xml, do_show=True, show_confidence=True, axes=None)
>
> *Error:*
>
> Traceback (most recent call last):
>   File "itree/itree2/iTree2.py", line 563, in view_tree
>     Phylo.draw(my_view_tree_xml, do_show=True, show_confidence=True,
> axes=None)
> AttributeError: 'module' object has no attribute 'draw'
>
> Thank you.
Mariam

From eric.talevich at gmail.com Fri Mar 2 09:53:24 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 2 Mar 2012 09:53:24 -0500
Subject: [Biopython] Bio.Phylo AttributeError: 'module' object has no attribute 'draw'
In-Reply-To:
References:
Message-ID:

On Fri, Mar 2, 2012 at 8:48 AM, Mariam Reyad Rizkallah wrote:
> >
> > Dear Biopython list,
> >
> > I am facing a similar problem with Phylo in the context of your thread.
> > I am using Biopython 1.58 on a 32-bit Ubuntu system. I have created a
> > Newick tree using the phyml command line and want to visualize it using
> > Bio.Phylo. I can read the Newick file and run draw_ascii and
> > draw_graphviz perfectly, but not draw(). I have networkx and pylab
> > installed.
> >
> > my_view_tree = Phylo.read("myseq.phy_phyml_tree.txt", "newick")
> > Phylo.draw_ascii(my_view_tree)
> > my_view_tree_xml = my_view_tree.as_phyloxml()
> > Phylo.draw(my_view_tree_xml, do_show=True, show_confidence=True, axes=None)
> >
> > *Error:*
> >
> > Traceback (most recent call last):
> >   File "itree/itree2/iTree2.py", line 563, in view_tree
> >     Phylo.draw(my_view_tree_xml, do_show=True, show_confidence=True,
> > axes=None)
> > AttributeError: 'module' object has no attribute 'draw'
> >
> > Thank you.
>
> Mariam
>

Hi Mariam,

Would you mind checking the version number of Biopython within the
interpreter or script you're using? Like this:

import Bio
print Bio.__version__

The function Phylo.draw was part of Biopython 1.58, so the simplest
explanation is that your script is using a different, older installation
of Biopython that's also installed on your system.

Alternatively, to get the best experience with Phylo.draw I'd recommend
updating to the current Biopython 1.59.

Hope that helps,
Eric

From mrrizkalla at gmail.com Fri Mar 2 10:27:51 2012
From: mrrizkalla at gmail.com (Mariam Reyad Rizkallah)
Date: Fri, 2 Mar 2012 17:27:51 +0200
Subject: [Biopython] Bio.Phylo AttributeError: 'module' object has no attribute 'draw'
In-Reply-To:
References:
Message-ID:

Hi Eric,

I have just upgraded to 1.59; when I checked the version, it was 1.56!
I can't believe that I didn't pay attention to that!

Thank you very much.

Mariam

On Fri, Mar 2, 2012 at 4:53 PM, Eric Talevich wrote:
> On Fri, Mar 2, 2012 at 8:48 AM, Mariam Reyad Rizkallah <
> mrrizkalla at gmail.com> wrote:
>
>> >
>> > Dear Biopython list,
>> >
>> > I am facing a similar problem with Phylo in the context of your thread.
>> > I am using Biopython 1.58 on a 32-bit Ubuntu system. I have created a
>> > Newick tree using the phyml command line and want to visualize it using
>> > Bio.Phylo. I can read the Newick file and run draw_ascii and
>> > draw_graphviz perfectly, but not draw(). I have networkx and pylab
>> > installed.
>> >
>> > my_view_tree = Phylo.read("myseq.phy_phyml_tree.txt", "newick")
>> > Phylo.draw_ascii(my_view_tree)
>> > my_view_tree_xml = my_view_tree.as_phyloxml()
>> > Phylo.draw(my_view_tree_xml, do_show=True, show_confidence=True, axes=None)
>> >
>> > *Error:*
>> >
>> > Traceback (most recent call last):
>> >   File "itree/itree2/iTree2.py", line 563, in view_tree
>> >     Phylo.draw(my_view_tree_xml, do_show=True, show_confidence=True,
>> > axes=None)
>> > AttributeError: 'module' object has no attribute 'draw'
>> >
>> > Thank you.
>>
>> Mariam
>>
>
> Hi Mariam,
>
> Would you mind checking the version number of Biopython within the
> interpreter or script you're using? Like this:
>
> import Bio
> print Bio.__version__
>
>
> The function Phylo.draw was part of Biopython 1.58, so the simplest
> explanation is that your script is using a different, older installation of
> Biopython that's also installed on your system.
>
> Alternatively, to get the best experience with Phylo.draw I'd recommend
> updating to the current Biopython 1.59.
>
> Hope that helps,
> Eric
>

From MatatTHC at gmx.de Sun Mar 4 05:44:56 2012
From: MatatTHC at gmx.de (Matthias Bernt)
Date: Sun, 4 Mar 2012 11:44:56 +0100
Subject: [Biopython] Degenerated Codons
In-Reply-To:
References: <20120222150602.313750@gmx.net>
Message-ID:

Hi,

it's now implemented and tested. I would like to post it as a Cookbook
entry. How can I do this?

Matthias

2012/2/24 Peter Cock
> On Thu, Feb 23, 2012 at 9:09 PM, Matthias Bernt wrote:
> > hi peter,
> >
> > Thank you for the suggestions. I will try to create the functions as
> > suggested.
> > Should I post them here?
>
> Sure - or on the wiki under a new 'Cookbook' entry?
> http://biopython.org/wiki/Category:Cookbook
>
> > I think we keep it as it is at the moment. Performance is not so
> > important for me .. so far.
> > Optimisation can still be done later.
>
> Of course :)
>
> Do you know the quote "premature optimization is the root of all evil"?
>
> Peter
>

From mmokrejs at fold.natur.cuni.cz Sun Mar 4 13:09:28 2012
From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs)
Date: Sun, 04 Mar 2012 19:09:28 +0100
Subject: [Biopython] Write FASTA sequence on a single line
Message-ID: <4F53AFD8.1090506@fold.natur.cuni.cz>

Hi,
is there an option to tell the FASTA writer to write output with the
sequence on a single line (so that a FASTA entry would span just
two lines altogether)? I expect such files would also be faster to
parse with SeqIO, because one would avoid calls for each line in
the FASTA input file.

In my code I have
for _record in SeqIO.parse(fastah, 'fasta'):

which boils down to Biopython's:
append(line.rstrip().replace(" ","").replace("\r",""))

for every line with _sequence_.

Thank you for comments,
Martin

From w.arindrarto at gmail.com Sun Mar 4 13:46:51 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Sun, 4 Mar 2012 19:46:51 +0100
Subject: [Biopython] Write FASTA sequence on a single line
In-Reply-To: <4F53AFD8.1090506@fold.natur.cuni.cz>
References: <4F53AFD8.1090506@fold.natur.cuni.cz>
Message-ID:

Hi Martin,

A quick glance at Bio.SeqIO.FastaIO.FastaWriter shows that there is
indeed an option to set the line wrapping length. However, the regular
writing function that calls FastaWriter (SeqIO.write) only accepts three
parameters (sequence, handle, and format), so if you really want to use
Biopython's fasta writer, you should call FastaWriter directly.

For example, as shown in the docs:

from Bio.SeqIO.FastaIO import FastaWriter
writer = FastaWriter(open(outfile, 'w'), wrap=0)
writer.write_file(records)

Alternatively, you can iterate over the records manually and write them
to the output file like so:

with open(outfile, 'w') as target:
    for rec in records:  # records is the list containing your SeqRecord objects
        target.write('>%s\n' % rec.id)
        target.write('%s\n' % rec.seq.tostring())

Hope that helps!
Bow

On Sun, Mar 4, 2012 at 19:09, Martin Mokrejs wrote:
> Hi,
> is there an option to tell the FASTA writer to write output with the
> sequence on a single line (so that a FASTA entry would span just
> two lines altogether)? I expect such files would also be faster to
> parse with SeqIO, because one would avoid calls for each line in
> the FASTA input file.
>
> In my code I have
> for _record in SeqIO.parse(fastah, 'fasta'):
>
> which boils down to Biopython's:
> append(line.rstrip().replace(" ","").replace("\r",""))
>
> for every line with _sequence_.
>
> Thank you for comments,
> Martin
>

From mmokrejs at fold.natur.cuni.cz Sun Mar 4 14:15:27 2012
From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs)
Date: Sun, 04 Mar 2012 20:15:27 +0100
Subject: [Biopython] Write FASTA sequence on a single line
In-Reply-To:
References: <4F53AFD8.1090506@fold.natur.cuni.cz>
Message-ID: <4F53BF4F.4010907@fold.natur.cuni.cz>

Hi Willis and Wibowo,
yes, I also write the new FASTA files myself, obeying the Biopython
writer. Sometimes I even parse them myself. But mainly I wanted this
issue raised and to get it implemented in Biopython. And I hope that
the argument that parsing these files is faster will be valued as well.
Not to mention the fact that one can use grep(1) to search through the
sequences, which is impossible if the sequences are split over several
lines.

I would even say that one-line sequences should be the default. ;))
Or at least if len() is below e.g. 2000. ;)

But thanks for the pointer to the direct FastaWriter use. I forgot about
this and just had the feeling there was a way ... ;)
Martin

Wibowo Arindrarto wrote:
> Hi Martin,
>
> A quick glance at Bio.SeqIO.FastaIO.FastaWriter shows that there is indeed an option to set the line wrapping length.
> However, the regular writing function that calls FastaWriter (SeqIO.write) only accepts three parameters (sequence, handle, and format), so if you really want to use Biopython's fasta writer, you should call FastaWriter directly.
>
> For example, as shown in the docs:
>
> from Bio.SeqIO.FastaIO import FastaWriter
> writer = FastaWriter(open(outfile, 'w'), wrap=0)
> writer.write_file(records)
>
> Alternatively, you can iterate over the records manually and write them to the output file like so:
>
> with open(outfile, 'w') as target:
>     for rec in records:  # records is the list containing your SeqRecord objects
>         target.write('>%s\n' % rec.id)
>         target.write('%s\n' % rec.seq.tostring())
>
>
> Hope that helps!
> Bow
>
>
> On Sun, Mar 4, 2012 at 19:09, Martin Mokrejs wrote:
>
>     Hi,
>     is there an option to tell the FASTA writer to write output with the
>     sequence on a single line (so that a FASTA entry would span just
>     two lines altogether)? I expect such files would also be faster to
>     parse with SeqIO, because one would avoid calls for each line in
>     the FASTA input file.
>
>     In my code I have
>     for _record in SeqIO.parse(fastah, 'fasta'):
>
>     which boils down to Biopython's:
>     append(line.rstrip().replace(" ","").replace("\r",""))
>
>     for every line with _sequence_.
>
>     Thank you for comments,
>     Martin
>

From p.j.a.cock at googlemail.com Sun Mar 4 14:15:36 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 4 Mar 2012 19:15:36 +0000
Subject: [Biopython] Write FASTA sequence on a single line
In-Reply-To:
References: <4F53AFD8.1090506@fold.natur.cuni.cz>
Message-ID:

On Sun, Mar 4, 2012 at 6:46 PM, Wibowo Arindrarto wrote:
> Hi Martin,
>
> A quick glance at Bio.SeqIO.FastaIO.FastaWriter shows
> that there is indeed an option to set the line wrapping length.
> However, the regular writing function that calls FastaWriter
> (SeqIO.write) only accepts three parameters (sequence,
> handle, and format), so if you really want to use Biopython's
> fasta writer, you should call FastaWriter directly.

Exactly. The top level SeqIO API is file format neutral, so
if you want to do something format specific, you have to
import and use the underlying parser/writer directly - in
this case Bio.SeqIO.FastaIO.FastaWriter as you showed.

Peter

From jordan.r.willis at Vanderbilt.Edu Sun Mar 4 13:35:58 2012
From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R)
Date: Sun, 4 Mar 2012 12:35:58 -0600
Subject: [Biopython] Write FASTA sequence on a single line
In-Reply-To: <4F53AFD8.1090506@fold.natur.cuni.cz>
References: <4F53AFD8.1090506@fold.natur.cuni.cz>
Message-ID:

I don't think there is an option, but you could possibly write it in
the SeqIO class. I have to do this all the time and I just write it to
a file manually:

with open(outputfile, 'a') as out:
    for _record in SeqIO.parse(fastah, 'fasta'):
        # header line, then the whole sequence on one line
        out.write(">%s\n%s\n" % (_record.id, _record.seq))

You can of course format these to output to a file any way you want.

Jordan

On Mar 4, 2012, at 12:09 PM, Martin Mokrejs wrote:
> Hi,
> is there an option to tell the FASTA writer to write output with the
> sequence on a single line (so that a FASTA entry would span just
> two lines altogether)? I expect such files would also be faster to
> parse with SeqIO, because one would avoid calls for each line in
> the FASTA input file.
>
> In my code I have
> for _record in SeqIO.parse(fastah, 'fasta'):
>
> which boils down to Biopython's:
> append(line.rstrip().replace(" ","").replace("\r",""))
>
> for every line with _sequence_.
>
> Thank you for comments,
> Martin
>

From p.j.a.cock at googlemail.com Mon Mar 5 04:39:56 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 5 Mar 2012 09:39:56 +0000
Subject: [Biopython] Degenerated Codons
In-Reply-To:
References: <20120222150602.313750@gmx.net>
Message-ID:

On Sun, Mar 4, 2012 at 10:44 AM, Matthias Bernt wrote:
> Hi,
>
> it's now implemented and tested. I would like to post it as a Cookbook
> entry. How can I do this?
>
> Matthias

It is a wiki, so register an account and you should be able to edit
and add pages. Put this under the 'Cookbook' category, which just
means adding [[category:Cookbook]] to the end, and it will then
automatically appear here:

http://biopython.org/wiki/Category:Cookbook

Thanks,

Peter

From MatatTHC at gmx.de Mon Mar 5 10:57:13 2012
From: MatatTHC at gmx.de (Matthias Bernt)
Date: Mon, 5 Mar 2012 16:57:13 +0100
Subject: [Biopython] Degenerated Codons
In-Reply-To:
References: <20120222150602.313750@gmx.net>
Message-ID:

Hi,

when I try to register
(http://biopython.org/wiki/Special:UserLogin/signup) I get an error:

"""
Permission error
You do not have permission to create this user account, for the
following reason:
The action you have requested is limited to users in the group: Administrators.
"""

Matthias

2012/3/5 Peter Cock
>
> On Sun, Mar 4, 2012 at 10:44 AM, Matthias Bernt wrote:
> > Hi,
> >
> > it's now implemented and tested. I would like to post it as a Cookbook
> > entry. How can I do this?
> >
> > Matthias
>
> It is a wiki, so register an account and you should be able to edit
> and add pages.
> Put this under the 'Cookbook' category, which just
> means adding [[category:Cookbook]] to the end, and it will then
> automatically appear here:
>
> http://biopython.org/wiki/Category:Cookbook
>
> Thanks,
>
> Peter

From p.j.a.cock at googlemail.com Mon Mar 5 11:12:42 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 5 Mar 2012 16:12:42 +0000
Subject: [Biopython] Degenerated Codons
In-Reply-To:
References: <20120222150602.313750@gmx.net>
Message-ID:

On Mon, Mar 5, 2012 at 3:57 PM, Matthias Bernt wrote:
> Hi,
>
> when I try to register
> (http://biopython.org/wiki/Special:UserLogin/signup) I get an error:
>
> """
> Permission error
> You do not have permission to create this user account, for the
> following reason:
> The action you have requested is limited to users in the group: Administrators.
> """
>
> Matthias

Very odd. That shouldn't happen - other people have managed
to create accounts recently. I can try to create an account for
you if you like - email me directly with desired username and
email details.

Peter

From jordan.r.willis at Vanderbilt.Edu Mon Mar 5 22:37:30 2012
From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R)
Date: Mon, 5 Mar 2012 21:37:30 -0600
Subject: [Biopython] MultiProcess SeqIO objects
Message-ID: <0B7618F3-3F18-4838-92CD-A533CAE4117D@Vanderbilt.Edu>

Hello BioPython,

I was wondering if anyone has used the multiprocessing tool in
conjunction with Biopython type objects? Here is my problem: I have 60
million sequences given in fastq format and I want to multiprocess
these without having to iterate through the list multiple times.

So I have something like this:

from multiprocessing import Pool
from Bio import SeqIO

input_handle = open("huge_fastaqf_file.fastq")


def convert_to_fasta(input):
    return [[record.id, record.seq.reverse_complement()] for record in SeqIO.parse(input, 'fastq')]

p = Pool(processes=4)
g = p.map(convert_to_fasta, input_handle)

for i in g:
    print i[0], i[1]

Unfortunately, it seems to divide up the handle by all the names and
tries to make the input to the function convert_to_fasta the first line
of input. What I want it to do is divide up the fastq object and run my
function on 4 processors.

I can't figure out how in the world to do this though.

Thanks,
jordan

From from.d.putto at gmail.com Tue Mar 6 06:50:40 2012
From: from.d.putto at gmail.com (Sheila the angel)
Date: Tue, 6 Mar 2012 12:50:40 +0100
Subject: [Biopython] access ModBase using Biopython
Message-ID:

Hi all,
Is it possible to access ModBase using Biopython?
How can I retrieve a homology model for a sequence from databases
like ModBase/SWISS-MODEL using Biopython?

Thanks
--
Sheila

From anaryin at gmail.com Tue Mar 6 06:53:43 2012
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 6 Mar 2012 12:53:43 +0100
Subject: [Biopython] access ModBase using Biopython
In-Reply-To:
References:
Message-ID:

Hi Sheila,

It is not possible to access ModBase through Biopython. You would have
to write your own script to interact with the webpage.

Best,

João [...] Rodrigues
http://nmr.chem.uu.nl/~joao

On 6 March 2012 12:50, Sheila the angel wrote:
> Hi all,
> Is it possible to access ModBase using Biopython?
> How can I retrieve a homology model for a sequence from databases
> like ModBase/SWISS-MODEL using Biopython?
>
> Thanks
> --
> Sheila
>

From chapmanb at 50mail.com Tue Mar 6 06:55:13 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 06 Mar 2012 06:55:13 -0500
Subject: [Biopython] MultiProcess SeqIO objects
In-Reply-To: <0B7618F3-3F18-4838-92CD-A533CAE4117D@Vanderbilt.Edu>
References: <0B7618F3-3F18-4838-92CD-A533CAE4117D@Vanderbilt.Edu>
Message-ID: <871up6ueou.fsf@fastmail.fm>

Jordan;

> I was wondering if anyone has used the multiprocessing tool in
> conjunction with Biopython type objects? Here is my problem: I have 60
> million sequences given in fastq format and I want to multiprocess
> these without having to iterate through the list multiple times.

Are you trying to make the parsing run in parallel, or some
downstream processing happen in parallel? The latter is definitely
preferable if you are looking for speed-ups, since the parsing will be
primarily IO bound.

You can make the processing faster by avoiding SeqIO objects,
since the conversion of quality scores will take the most time.
Here is a working example:

from multiprocessing import Pool
from Bio.SeqIO.QualityIO import FastqGeneralIterator
from Bio.Seq import Seq

def do_something_with_record(info):
    name, seq = info
    return name, seq

def convert_to_fasta(in_handle):
    for rec_id, seq, _ in FastqGeneralIterator(in_handle):
        yield rec_id, str(Seq(seq).reverse_complement())

with open("example.fastq") as input_handle:
    p = Pool(processes=4)
    g = p.map(do_something_with_record, convert_to_fasta(input_handle))
    for i in g:
        print i

Hope this helps,
Brad

> So I have something like this:
>
> from multiprocessing import Pool
> from Bio import SeqIO
>
> input_handle = open("huge_fastaqf_file.fastq")
>
>
> def convert_to_fasta(input):
>     return [[record.id, record.seq.reverse_complement()] for record in SeqIO.parse(input, 'fastq')]
>
> p = Pool(processes=4)
> g = p.map(convert_to_fasta, input_handle)
>
> for i in g:
>     print i[0], i[1]
>
> Unfortunately, it seems to divide up the handle by all the names and
> tries to make the input to the function convert_to_fasta the first line
> of input. What I want it to do is divide up the fastq object and run my
> function on 4 processors.
>
> I can't figure out how in the world to do this though.

From rbuels at gmail.com Tue Mar 6 11:00:00 2012
From: rbuels at gmail.com (Robert Buels)
Date: Tue, 06 Mar 2012 11:00:00 -0500
Subject: [Biopython] Google Summer of Code organization application submitted
Message-ID: <4F563480.1080808@gmail.com>

Hi all,

I'd like to let you know that I had a look through the GSoC wiki pages,
and they look pretty good. Thank you very much, everyone who worked on
them. I went ahead and submitted our application to Google for
participation in GSoC 2012.

If you have any more ideas for projects that a Google-funded intern
might work on this summer, now is the time to add them to the wiki at
http://www.open-bio.org/wiki/Google_Summer_of_Code, and on your
project's wiki page that is linked from there.
Google will most likely be evaluating these ideas within the next couple of days. If you are interested in helping out with Google Summer of Code this year, now is the time to make sure you are listed on your project's wiki as a prospective mentor. Also, please make sure you are a member of both the OBF GSoC and OBF GSoC-Mentors email lists[1][2]. The better those project idea pages are, the stronger our case for getting Google funding will be. Thanks a lot for all the hard work you community members have put in so far! Rob ---- Robert Buels (prospective) 2012 OBF GSoC Organization Admin [1] http://lists.open-bio.org/mailman/listinfo/gsoc [2] http://lists.open-bio.org/mailman/listinfo/gsoc-mentors From open-bio at wvr7.me.uk Tue Mar 6 14:47:56 2012 From: open-bio at wvr7.me.uk (Giles Weaver) Date: Tue, 06 Mar 2012 19:47:56 +0000 Subject: [Biopython] Job opportunity: Head of Bioinformatics at Institute for Animal Health (Surrey, UK) Message-ID: <1e04a705a4f5f350a28309dde7ee0376@wvr7.me.uk> Dear All. Please pass the following on to anyone who may be interested. Note the closing date is the 16th March (next Friday!). For a pretty version of the advert without mangled formatting please see http://www.jobs.ac.uk/job/ADZ114/head-of-bioinformatics/. Thanks, Giles HEAD OF BIOINFORMATICS DRIVE AND SUPPORT QUANTITATIVE RESEARCH INTO THE VIRAL DISEASES OF ANIMALS £42,769-£47,521 Ref: IRC43544 BASED: INSTITUTE FOR ANIMAL HEALTH, PIRBRIGHT LABORATORY, SURREY Leading the bioinformatics team, you will provide support to IAH scientists involved in quantitative biology, but will also have the opportunity to pursue your own research. Areas of current interest include modelling of virus evolution and host immune responses using next-generation sequencing data; _in silico_ analysis of host genetics and genomics data; and learning and predicting networks of biomolecular interactions from post-genomic data sets.
In this high-profile role, you'll be expected to seek funds for new projects and continue your excellent track record of publication. Building collaborative research links with other members of IAH is encouraged. Holding a PhD or equivalent in a relevant branch of the biosciences, you will have experience in a recognised R&D environment. The ability to develop and manage relational databases is essential, so we would expect proficiency in MySQL (or similar), languages such as Perl or Python, and familiarity with R, Bioconductor or another statistical program. Experience of writing grant applications and managing staff would be helpful. The Institute for Animal Health (IAH) is an institute of the Biotechnology and Biological Sciences Research Council (BBSRC). We work to enhance the UK's capability to contain, control, and eliminate viral diseases in animals through highly innovative fundamental and applied bioscience. Informal enquiries about the post can be made to Simon Gubbins, Head of Mathematical Biology (simon.gubbins at iah.ac.uk [1]) APPLICATIONS ARE HANDLED BY THE RCUK SHARED SERVICES CENTRE; TO APPLY PLEASE VISIT OUR JOB BOARD AT HTTPS://EXT.SSC.RCUK.AC.UK [2] AND COMPLETE AN ONLINE APPLICATION FORM. APPLICANTS WHO WOULD LIKE TO RECEIVE THIS ADVERT IN AN ALTERNATIVE FORMAT (E.G. LARGE PRINT, BRAILLE, AUDIO OR HARD COPY), OR WHO ARE UNABLE TO APPLY ONLINE SHOULD CONTACT US BY TELEPHONE ON 01793 867003, PLEASE QUOTE REFERENCE NUMBER IRC43544. FOR MORE INFORMATION ABOUT THE IAH GO TO CLOSING DATE: 16TH MARCH 2012. 
Links: ------ [1] mailto:simon.gubbins at iah.ac.uk [2] https://ext.ssc.rcuk.ac.uk/ From mnemonico at posthocergopropterhoc.net Tue Mar 6 18:39:13 2012 From: mnemonico at posthocergopropterhoc.net (A M Torres, Hugo) Date: Tue, 6 Mar 2012 20:39:13 -0300 Subject: [Biopython] Entrez and SeqIO "no records found in handle" In-Reply-To: References: <1330032923.65491.YahooMailNeo@web15106.mail.cnb.yahoo.com> Message-ID: This one bit me while following the cookbook tonight. Explicitly setting retmode= 'xml' or 'html' fails. I think I read somewhere that 'text' is expected to break every now and then. Seems its the only retmode option that remains functional. -- .''`. Hugo A. M. Torres : :' : `. `' ?Talk is cheap, `- show me the code. ? -- L. Torvalds. 2012/2/23 Peter Cock > 2012/2/23 ??(Feng GAO) : > > Hi all, > > We have some python code using gi number to get record from Genbank. > > Part of the code is: > > > > handle = Entrez.efetch(db="protein", id=ID, rettype="gb") > > record = SeqIO.read(handle,"genbank") > > > > We have had no problem with this code > > until this week when we started getting "ValueError: No records found > in handle". > > Anyone have an idea how to fix it now? Thanks! > > Feng > > Try using an explicit retmode="text" in the efetch call. > The NCBI changed the defaults with EFetch 2.0, which > went live earlier this month. You're probably getting > XML back instead. > > Note to self: I wonder if the Biopython tutorial examples > need to be updated as well... 
> > Peter > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > From p.j.a.cock at googlemail.com Wed Mar 7 03:59:47 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 7 Mar 2012 08:59:47 +0000 Subject: [Biopython] Entrez and SeqIO "no records found in handle" In-Reply-To: References: <1330032923.65491.YahooMailNeo@web15106.mail.cnb.yahoo.com> Message-ID: On Tuesday, March 6, 2012, A M Torres, Hugo < mnemonico at posthocergopropterhoc.net> wrote: > This one bit me while following the cookbook tonight. > > Explicitly setting retmode= 'xml' or 'html' fails. > > I think I read somewhere that 'text' is expected to break every now and > then. Seems its the only retmode option that remains functional. > Do you have any specific examples? Even before this change Entrez calls would sometimes time out or fail with network problems. Peter From mnemonico at posthocergopropterhoc.net Wed Mar 7 11:18:37 2012 From: mnemonico at posthocergopropterhoc.net (A M Torres, Hugo) Date: Wed, 7 Mar 2012 13:18:37 -0300 Subject: [Biopython] meddling with GeneDiagram Message-ID: Fellow biopythoneers, I've been playing around with GenomeDiagram trying to draw a gene's features. My impression is that this is a very nifty tool indeed. However I see a problem in the way I can draw a gene, though it might just be my inexperience as a user: the sigils don't automatically distinguish between a FeatureLocation with a fuzzy position (i.e. BeforePosition(0)) and a feature with an exact position (i.e. ExactPosition(6475)). As an example, suppose I would like to draw the genes from a SeqRecord object built from the TP53 genbank file:

# Imports assumed; they were not shown in the original message
from reportlab.lib import colors
from Bio.Graphics import GenomeDiagram

def draw_gene(seqrec):
    diagram = GenomeDiagram.Diagram(seqrec.id)
    gene_track = diagram.new_track(1, name='Genes: ')
    gene_set = gene_track.new_set()

    genes = (i for i in seqrec.features if i.type == 'gene')
    color = colors.green
    for gene in genes:
        # Label angle depends on the strand
        if gene.strand == 1:
            angle = 0
        else:
            angle = 180
        gene_set.add_feature(gene,
                             sigil='ARROW',
                             color=color,
                             arrowshaft_height=1,
                             arrowhead_length=0.2,
                             label=True,
                             label_size=14,
                             label_angle=angle,
                             )
    diagram.draw(format='linear',
                 pagesize='A4',
                 fragments=1,
                 start=0,
                 end=len(seqrec)
                 )
    diagram.write('gene_diagram.svg', 'SVG')

The resulting image looks like gene_diagram.svg. There seems to be a WRAP53 gene on the minus strand and the sigil represents it as a whole gene, but it's only a portion of it. Maybe we could represent that it's just a piece by drawing the arrowhead pointing inwards instead of outwards as in gene_arrow.png. Is that possible to implement? -- .''`. Hugo A. M. Torres : :' : `. `' "Talk is cheap, `- show me the code." -- L. Torvalds. -------------- next part -------------- A non-text attachment was scrubbed... Name: gene_arrow.png Type: image/png Size: 8525 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: gene_diagram.svg Type: image/svg+xml Size: 2316 bytes Desc: not available URL: From p.j.a.cock at googlemail.com Wed Mar 7 11:39:19 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 7 Mar 2012 16:39:19 +0000 Subject: [Biopython] meddling with GeneDiagram In-Reply-To: References: Message-ID: On Wed, Mar 7, 2012 at 4:18 PM, A M Torres, Hugo wrote: > Fellow biopythoneers, > > I've been playing around with GenomeDiagram trying to draw a gene's > features. My impression is that this is a very nifty tool indeed. Complex though ;) > However I see a problem in the way I can draw a gene, though it might > just be my inexperience as a user: the sigils don't > automatically distinguish between a FeatureLocation with a fuzzy > position (i.e. BeforePosition(0)) and a feature with an exact position (i.e. > ExactPosition(6475)). No, it doesn't.
> As example suppose I would like to draw the genes from a SeqRecord object > built from the TP53 genbank file: > ... > > The resulting image looks like gene_diagram.svg. There seems to be a WRAP53 > gene on the minus strand and the sigil represents it as awhole gene. but > its only a portion of it. Maybe we could represent its just a piece by > drawing the arrowhead pointing inwards instead of outwards as in > gene_arrow.png. > > Is that possible to implement? Somewhat related is cropping of features only partly in view, and a general 'jaggy' feature for showing truncation in some why. Leighton and I did discuss the later and there is an implementation on a branch which didn't make it into Biopython 1.59 but could be in the next release. At its simplest this is a sigil with a jagged edge at both ends, useful for marking things like NNNNN regions in scaffolds/supercontigs, or even perhaps repeat regions. Dealing with the left and right ends of sigils generically would be more powerful though, and more complex. That would be required for your example - arrow head at one end, so kind of truncation marker at the other. We've also talked about other wish list ideas like exons and links, frame aware placement, frame less placement, etc. All these kinds of things only make sense a "high zoom" or if drawing small genomes like viruses - while original GenomeDiagram targeted entire bacteria ("low zoom", or "zoomed out") where you only needed and wanted a simple box for each gene. Peter From p.j.a.cock at googlemail.com Wed Mar 7 12:32:57 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 7 Mar 2012 17:32:57 +0000 Subject: [Biopython] Entrez and SeqIO "no records found in handle" In-Reply-To: References: <1330032923.65491.YahooMailNeo@web15106.mail.cnb.yahoo.com> Message-ID: On Wed, Mar 7, 2012 at 5:22 PM, A M Torres, Hugo wrote: > Hi Peter. 
> > More specificaly for instance, > > handle = Entrez.efetch(db='nucleotide', > rettype='gb', > retmode='xml', > ?????????????????????????? id=gene) > record = SeqIO.read(handle, 'gb') > handle.close() > > ==================================================== > ... ValueError: No records found in handle Hi Hugo, Getting an error here is good - there were no GenBank formatted records in your file (while there should have been an XML record). Perhaps if we expect this to be a common error a more specific exception would be nicer? e.g. ValueError("This is XML, not GenBank plain text") Maybe I don't understand what you are querying? Peter From nathaniel.echols at gmail.com Wed Mar 7 20:08:32 2012 From: nathaniel.echols at gmail.com (Nat Echols) Date: Wed, 7 Mar 2012 18:08:32 -0700 Subject: [Biopython] server for automatic high-quality search & alignment? Message-ID: Hi list-- Does anyone know of a remotely callable web service similar to HHPred in functionality - i.e. capable of running a homology search against the PDB, and returning high-quality alignments? We're using NCBI BLAST for this right now and will probably use the EBI's WU-BLAST server in the future, but these are considered inferior to HHPred for weak homologs. Unfortunately HHPred isn't something we can use from Python, at least not in production code. thanks, Nat From mnemonico at posthocergopropterhoc.net Wed Mar 7 21:26:39 2012 From: mnemonico at posthocergopropterhoc.net (A M Torres, Hugo) Date: Wed, 7 Mar 2012 23:26:39 -0300 Subject: [Biopython] meddling with GeneDiagram In-Reply-To: References: Message-ID: On Wed, Mar 7, 2012 at 5:29 PM, Peter Cock wrote: > Did you mean to go off the list? No! My mistake. I meant reply to all. Been a little too distracted today. I will replay to all now. 
> > > On Wednesday, March 7, 2012, A M Torres, Hugo < > mnemonico at posthocergopropterhoc.net> wrote: > > Hi Peter > > > >> Somewhat related is cropping of features only partly in view, and a > >> general 'jaggy' feature for showing truncation in some why. Leighton > >> and I did discuss the later and there is an implementation on a branch > >> which didn't make it into Biopython 1.59 but could be in the next > release. > >> At its simplest this is a sigil with a jagged edge at both ends, useful > >> for marking things like NNNNN regions in scaffolds/supercontigs, or > >> even perhaps repeat regions. > > > > Neat. Thats exactly what I am needing, a third kind of sigil to represent > > fuzzy ends. Glad to know it will be available. > > > >> > >> Dealing with the left and right ends of > >> sigils generically would be more powerful though, and more complex. > >> That would be required for your example - arrow head at one end, so > >> kind of truncation marker at the other. > > > > Yea that would work exactly as I'd expect. One sigil for each end, > > the shaft and the arrowhead. > > It would have a lot of uses :) > > > > Then we could automatically test whether or not to replace the > > user chosen sigil with the 'fuzzy' one: > > > > if not isinstance(gene.location.end, ExactPosition): > > if gene.strand == -1: > > shaft_sigil = 'FUZZY' > > else: > > arrowhead_sigil = 'FUZZY' > >> > >> We've also talked about other wish list ideas like exons and links, > >> frame aware placement, frame less placement, etc. All these kinds > >> of things only make sense a "high zoom" or if drawing small genomes > >> like viruses - while original GenomeDiagram targeted entire bacteria > >> ("low zoom", or "zoomed out") where you only needed and wanted > >> a simple box for each gene. > > > > I see. Sounds like pretty interesting stuff. Maybe I could help out > > but I will need some tutoring. Never worked on a an open source > > collaborative project before. 
Is the next-release code hosted on > > someplace like github? > > Yes, it is on GitHub - have a look at the links on our wiki pages. > https://github.com/biopython/biopython Alright, great. I have forked myself a copy. > If I could learn how to get and use the code without interfering > with my working installation of biopython (maybe using something > like virtualenv?) I've never used virtualenv, but I hear good things about it. > > Are you on Windows, Mac or Linux? > Debian testing (tends to be very up-to-date) > > I would gladly contribute some work. Let me know if I can be of hand. > > I don't want to put you off, but the GenomeDiagram code is > pretty complex... And right now probably only two people > can really be said to understand it (Leighton and myself). > No problem. I might try and have a look. I will try to use the virtualenv thing to experiment without breaking the system's biopython. I'll try first to contribute some small code changes just to get the hang of it. Then if you guys decide some of the changes are worthwhile you can incorporate them in the main project. This should be fun. > There are also two semi-duplicated areas of code, for > drawing linear and circular diagrams. In general, drawing > signals on circular diagrams is a LOT harder to implement. > > Right now the most important thing is actually the documentation, > something I managed to do a bit more of recently: > http://news.open-bio.org/news/2012/03/cross-links-in-genomediagram/ > > It is the graph functions that need doing next - perhaps by > adapting Leighton's old documentation from before GD > was integrated into Biopython. I mean bar charts, line > graphs and heat maps. > > Peter -- -- .''`. Hugo A. M. Torres : :' : `. `' ?Talk is cheap, `- show me the code. ? -- L. Torvalds. From idoerg at gmail.com Wed Mar 7 22:01:08 2012 From: idoerg at gmail.com (Iddo Friedberg) Date: Wed, 7 Mar 2012 22:01:08 -0500 Subject: [Biopython] server for automatic high-quality search & alignment? 
In-Reply-To: References: Message-ID: Did you try ffas? On Wed, Mar 7, 2012 at 8:08 PM, Nat Echols wrote: > Hi list-- > > Does anyone know of a remotely callable web service similar to HHPred > in functionality - i.e. capable of running a homology search against > the PDB, and returning high-quality alignments? We're using NCBI > BLAST for this right now and will probably use the EBI's WU-BLAST > server in the future, but these are considered inferior to HHPred for > weak homologs. Unfortunately HHPred isn't something we can use from > Python, at least not in production code. > > thanks, > Nat > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. From p.j.a.cock at googlemail.com Thu Mar 8 05:06:32 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 8 Mar 2012 10:06:32 +0000 Subject: [Biopython] Entrez and SeqIO "no records found in handle" In-Reply-To: References: <1330032923.65491.YahooMailNeo@web15106.mail.cnb.yahoo.com> Message-ID: On Wed, Mar 7, 2012 at 8:25 PM, A M Torres, Hugo wrote: >> >> Hi Hugo, >> >> Getting an error here is good - there were no GenBank formatted records >> in your file (while there should have been an XML record). Perhaps if >> we expect this to be a common error a more specific exception would >> be nicer? e.g. ValueError("This is XML, not GenBank plain text") >> >> Maybe I don't understand what you are querying? >> >> Peter > > You are right. I was expecting to SeqIO to read xml. If I want to parse xml > it seems I should have used Entrez.read instead. > > Sorry for the noise. 
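A minimal sketch of the more specific error suggested in the quoted text above, assuming the fetched payload is available as a string. The helper name `require_genbank_text` and its checks are hypothetical and not part of Biopython; the sketch relies only on GenBank flat files beginning with a LOCUS line and XML or HTML payloads beginning with `<`.

```python
def require_genbank_text(data):
    """Return data unchanged if it looks like a GenBank flat file.

    Hypothetical helper: raises a more specific ValueError when the
    payload is evidently XML or HTML, rather than leaving the parser
    to report that no records were found.
    """
    start = data.lstrip()[:100]
    if start.startswith("<"):
        raise ValueError("This looks like XML or HTML, not GenBank plain text")
    if not start.startswith("LOCUS"):
        raise ValueError("Expected a GenBank record starting with 'LOCUS'")
    return data
```

In practice the simpler route is the one already given in this thread: pass retmode="text" to Entrez.efetch when the result goes to SeqIO, and use Entrez.read when XML is what is wanted.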
No problem - thanks for clarifying this, Peter From ming.xue at boehringer-ingelheim.com Mon Mar 12 18:18:38 2012 From: ming.xue at boehringer-ingelheim.com (ming.xue at boehringer-ingelheim.com) Date: Mon, 12 Mar 2012 18:18:38 -0400 Subject: [Biopython] efetch only returns 20 records of pubmed In-Reply-To: References: Message-ID: <015E2E9BC1647A45B40F286F2BE01C4195FC7C@nahexm101.am.boehringer.com> Hello, I used the biopython 1.5.9 to download some pubmed abstracts. The query from browser showed 409580 records. But I only got the count of 20 from record["IdList"] and they matched the records on the first page from browser. Am I blocked by NCBI or there is a parameter for page I missed? from Bio import Entrez Entrez.email = 'my.email at domain.com' query = Entrez.esearch(db="pubmed", term="publisher[sb]") record = Entrez.read(query) print len(record["IdList"]) Thanks for your comments, Ming From winda002 at student.otago.ac.nz Mon Mar 12 18:55:20 2012 From: winda002 at student.otago.ac.nz (David Winter) Date: Tue, 13 Mar 2012 11:55:20 +1300 Subject: [Biopython] efetch only returns 20 records of pubmed In-Reply-To: <015E2E9BC1647A45B40F286F2BE01C4195FC7C@nahexm101.am.boehringer.com> References: <015E2E9BC1647A45B40F286F2BE01C4195FC7C@nahexm101.am.boehringer.com> Message-ID: <4F5E7ED8.80800@student.otago.ac.nz> Hi Min, I think "retmax" is the parameter you are looking for. If you plan on making some huge query, be sure to do it outside of peak times (US) and think about using the WebEnv features ("Using the history and WebEnv" section of the tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial) if you want to download a lot of data. Cheers, David On 3/13/2012 11:18 AM, ming.xue at boehringer-ingelheim.com wrote: > Hello, > > I used the biopython 1.5.9 to download some pubmed abstracts. The query from > browser showed 409580 records. But I only got the count of 20 from > record["IdList"] and they matched the records on the first page from browser. 
> Am I blocked by NCBI or there is a parameter for page I missed? > > from Bio import Entrez > Entrez.email = 'my.email at domain.com' > > query = Entrez.esearch(db="pubmed", term="publisher[sb]") > record = Entrez.read(query) > print len(record["IdList"]) > > Thanks for your comments, > Ming > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > From ming.xue at boehringer-ingelheim.com Mon Mar 12 22:53:48 2012 From: ming.xue at boehringer-ingelheim.com (ming.xue at boehringer-ingelheim.com) Date: Mon, 12 Mar 2012 22:53:48 -0400 Subject: [Biopython] efetch only returns 20 records of pubmed In-Reply-To: <4F5E7ED8.80800@student.otago.ac.nz> References: <015E2E9BC1647A45B40F286F2BE01C4195FC7C@nahexm101.am.boehringer.com> <4F5E7ED8.80800@student.otago.ac.nz> Message-ID: <015E2E9BC1647A45B40F286F2BE01C4195FC8D@nahexm101.am.boehringer.com> Hi David, Thanks for the quick tips and I certainly missed the tutorials. But I had more serious problem as I think I got denied. During my test of the examples in the section 8.15 of the Tutorials, my simple command of Entrez.einfo(db='pubmed') failed at 9:45 pm US EDT but the same command worked fine on my personal computer with a different IP. I emailed NCBI for clarification. Thanks, Ming -----Original Message----- From: biopython-bounces at lists.open-bio.org [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of David Winter Sent: Monday, March 12, 2012 6:55 PM To: biopython at lists.open-bio.org Subject: Re: [Biopython] efetch only returns 20 records of pubmed Hi Min, I think "retmax" is the parameter you are looking for. If you plan on making some huge query, be sure to do it outside of peak times (US) and think about using the WebEnv features ("Using the history and WebEnv" section of the tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial) if you want to download a lot of data. 
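As a rough illustration of the retmax advice: ESearch returns only the first 20 IDs unless retmax is raised, and the standard E-utilities parameter retstart sets the offset, so a large result set can be walked in windows. The generator name `batch_windows` is an assumption made for this sketch, and the Entrez call appears only in a comment because it needs network access.

```python
def batch_windows(total, batch_size):
    """Yield (retstart, retmax) pairs that together cover `total` records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        # The last window may be shorter than batch_size.
        yield start, min(batch_size, total - start)


# Assumed usage with Biopython (not executed here):
#   total = int(record["Count"])   # from a first Entrez.esearch result
#   for start, size in batch_windows(total, 500):
#       handle = Entrez.esearch(db="pubmed", term="publisher[sb]",
#                               retstart=start, retmax=size)
```

For very large downloads the history/WebEnv route mentioned above is still the better choice, since it avoids repeating the search for every window.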
Cheers, David On 3/13/2012 11:18 AM, ming.xue at boehringer-ingelheim.com wrote: > Hello, > > I used the biopython 1.5.9 to download some pubmed abstracts. The query from > browser showed 409580 records. But I only got the count of 20 from > record["IdList"] and they matched the records on the first page from browser. > Am I blocked by NCBI or there is a parameter for page I missed? > > from Bio import Entrez > Entrez.email = 'my.email at domain.com' > > query = Entrez.esearch(db="pubmed", term="publisher[sb]") > record = Entrez.read(query) > print len(record["IdList"]) > > Thanks for your comments, > Ming > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From zhigang.wu at email.ucr.edu Thu Mar 15 12:32:29 2012 From: zhigang.wu at email.ucr.edu (Zhigang Wu) Date: Thu, 15 Mar 2012 09:32:29 -0700 Subject: [Biopython] Documentation typo found Message-ID: Hi biopython community, Here I am reporting a minor typo in the tutorial of Bio.Entrez ( http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc114) and hope biopython administrator who have appropriate editing permission of that page to correct it. In the middle of above page, there are several lines of codes illustrating how to retrieve the information like author, source and title. I have pasted the original code below, in which the typo "CO" has been highlighted in red color, which should be corrected to "SO" Original Version: >>> for record in records: ... print "title:", record.get("TI", "?") ... print "authors:", record.get("AU", "?") ... print "source:", record.get("*CO*", "?") ... print Should be corrected to: >>> for record in records: ... print "title:", record.get("TI", "?") ... print "authors:", record.get("AU", "?") ... 
print "source:", record.get("*SO*", "?") ... print Zhigang PhD candidate in Plant Biology Department of Botany and Plant Sciences University of California Riverside, CA From zhigangwu.bgi at gmail.com Thu Mar 15 12:46:51 2012 From: zhigangwu.bgi at gmail.com (Zhigang Wu) Date: Thu, 15 Mar 2012 09:46:51 -0700 Subject: [Biopython] Bio.Entrez documentation typo found Message-ID: Hi biopython community, Sorry for duplicate posting if you see this post a second time. Here I am reporting a minor typo in the tutorial of Bio.Entrez ( http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc114) and hope biopython administrator who have appropriate editing permission of that page to correct it. In the middle of above page, there are several lines of codes illustrating how to retrieve the information like author, source and title. I have pasted the original code below, in which the typo "CO" has been highlighted in red color, which should be corrected to "SO" Original Version: >>> for record in records: ... print "title:", record.get("TI", "?") ... print "authors:", record.get("AU", "?") ... print "source:", record.get("*CO*", "?") ... print Should be corrected to: >>> for record in records: ... print "title:", record.get("TI", "?") ... print "authors:", record.get("AU", "?") ... print "source:", record.get("*SO*", "?") ... 
print Zhigang PhD candidate in Plant Biology Department of Botany and Plant Sciences University of California Riverside, CA From p.j.a.cock at googlemail.com Thu Mar 15 12:52:20 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 15 Mar 2012 16:52:20 +0000 Subject: [Biopython] Documentation typo found In-Reply-To: References: Message-ID: On Thu, Mar 15, 2012 at 4:32 PM, Zhigang Wu wrote: > Hi biopython community, > > Here I am reporting a minor typo in the tutorial of Bio.Entrez ( > http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc114) and hope > biopython administrator who have appropriate editing permission of that > page to correct it. In case you or anyone else reading is interested, the tutorial HTML and PDF files are generated from the LaTeX file Doc/Tutorial.tex in the Biopython source code: https://github.com/biopython/biopython/blob/master/Doc/Tutorial.tex LaTeX is an old markup language which is still very popular in the areas of mathematics and physics because of its excellent formula support. See http://www.latex-project.org/ for background. > In the middle of above page, there are several lines of codes illustrating > how to retrieve the information like author, source and title. I have > pasted the original code below, in which the typo "CO" has been highlighted > in red color, which should be corrected to "SO" I think colors and special fonts in HTML emails may get turned into plain text by the mailing list. But I understood. > Original Version: > >>>> for record in records: > ... ? ? print "title:", record.get("TI", "?") > ... ? ? print "authors:", record.get("AU", "?") > ... ? ? print "source:", record.get("*CO*", "?") > ... ? ? print > > > Should be corrected to: > >>>> for record in records: > ... ? ? print "title:", record.get("TI", "?") > ... ? ? print "authors:", record.get("AU", "?") > ... ? ? print "source:", record.get("*SO*", "?") > ... ? ? 
print > > Zhigang This is in the Pubmed and Medline parsing example from the Entrez chapter, and yes, you are quite right. Fixed: https://github.com/biopython/biopython/commit/998bffdc7a67297b22fac96e1c810297a32f0e36 Thank you, Peter From golubchi at stats.ox.ac.uk Fri Mar 16 08:13:29 2012 From: golubchi at stats.ox.ac.uk (Tanya Golubchik) Date: Fri, 16 Mar 2012 12:13:29 +0000 Subject: [Biopython] Bio.Phylo: writing newick trees with internal node names Message-ID: <4F632E69.8010906@stats.ox.ac.uk> Dear all, I may be missing something in the documentation, but I can't seem to figure out how to write newick trees with internal node names (preserved as plain text). I think in some versions of Bio.Phylo a call to tree.format('newick') accomplished this by default, but currently I can't replicate this behaviour and get a tree with unnamed internal nodes. Any pointers would be appreciated! Thanks Tanya From rbuels at gmail.com Fri Mar 16 15:49:16 2012 From: rbuels at gmail.com (Robert Buels) Date: Fri, 16 Mar 2012 12:49:16 -0700 Subject: [Biopython] Google Summer of Code is *ON* for OBF projects! Message-ID: <4F63993C.7050809@gmail.com> Hi all, Great news: Google announced today that the Open Bioinformatics Foundation has been accepted as a mentoring organization for this summer's Google Summer of Code! GSoC is a Google-sponsored student internship program for open-source projects, open to students from around the world (not just US residents). Students are paid a $5000 USD stipend to work as a developer on an open-source project for the summer. For more on GSoC, see GSoC 2012 FAQ at http://goo.gl/kNv48 Student applications are due April 6, 2012 at 19:00 UTC. Students who are interested in participating should look at the OBF's GSoC page at http://open-bio.org/wiki/Google_Summer_of_Code, which lists project ideas, and whom to contact about applying. 
For current developers on OBF projects, please consider volunteering to be a mentor if you have not already, and contribute project ideas. Just list your name and project ideas on OBF wiki and on the relevant project's GSoC wiki page. Thanks to all who helped make OBF's application to GSoC a success, and let's have a great, productive summer of code! Rob Buels OBF GSoC 2012 Administrator From mictadlo at gmail.com Mon Mar 19 00:29:40 2012 From: mictadlo at gmail.com (Mic) Date: Mon, 19 Mar 2012 14:29:40 +1000 Subject: [Biopython] [Bio-bwa-help] making a "libbwa" In-Reply-To: <4F61CAF8.7020507@crs4.it> References: <4F61CAF8.7020507@crs4.it> Message-ID: +1 BioRuby did it in the following way: https://github.com/fstrozzi/bioruby-bwa/blob/master/ext/mkrf_conf.rb https://github.com/fstrozzi/bioruby-bwa/wiki Maybe it could be integrated to be Biopython? Cheers, On Thu, Mar 15, 2012 at 8:56 PM, Luca Pireddu wrote: > Hello list, > > I'd like the discuss the idea of refactoring BWA to separate the > alignment logic from the rest of the code base, thus resulting in a > libbwa alignment library which could be used through the regular command > line interface or through other means, as one saw fit. > > I'm one of the developers of Seal (http://biodoop-seal.sf.net/), a suite > of Hadoop-based tools for the processing of sequencing data. Within the > Seal suite, we have the Seqal program for read mapping which at this > time contains a "fork" of the BWA 0.5.10 code which we've patched in a > few points and then built as a library that we can use within our > application. In this way, we can feed the alignment algorithm with read > data we have pre-loaded in memory instead of files in a supported > format, and we can also fetch the alignment results directory from BWA's > memory structures rather than the regular output files. 
The resulting > library, as long as it has a stable API, provides much more flexibility > than a command-line program, allowing it to be used more easily and > elegantly in settings different from the regular fastq to sam/bam > workflow/application. Seqal is a concrete example. As another example, > we've built a Python interface for the library allowing us to easily use > it in scripts and testing. > > If there is interest for this idea, especially on the part of Heng, we > could discuss a viable API. Myself and my colleagues are certainly > willing to propose a first draft and even contribute the patches > necessary to implement it. > > Looking forward to hearing from you, > > -- > Luca Pireddu > CRS4 - Distributed Computing Group > Loc. Pixina Manna Edificio 1 > 09010 Pula (CA), Italy > Tel: +39 0709250452 > > > ------------------------------------------------------------------------------ > This SF email is sponsosred by: > Try Windows Azure free for 90 days Click Here > http://p.sf.net/sfu/sfd2d-msazure > _______________________________________________ > Bio-bwa-help mailing list > Bio-bwa-help at lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/bio-bwa-help > From hernan.morales at gmail.com Mon Mar 19 22:52:46 2012 From: hernan.morales at gmail.com (=?UTF-8?Q?Hern=C3=A1n_Morales_Durand?=) Date: Mon, 19 Mar 2012 23:52:46 -0300 Subject: [Biopython] SeqIO fasta "fakes" recognition In-Reply-To: <4F467992.9060205@unifi.it> References: <4F4663E1.5010206@unifi.it> <4F467992.9060205@unifi.it> Message-ID: Why don't you filter for file names ended with fasta known extensions? (.fa, .fasta, etc.) 2012/2/23 Marco Galardini > On 02/23/2012 05:35 PM, Eric Talevich wrote: > >> >> I suppose there's always: >> >> try: >> record = SeqIO.read("gigo.png", "fasta") >> assert str(record.seq).isalpha() >> except: >> # complain... 
>>
> Thanks for the hint, I've implemented this (using the parse method) and
> I'll see how it performs (I guess it will have some overhead).
>
> Marco
>
> --
> -------------------------------------------------
> Marco Galardini
> DBE - Department of Evolutionary Biology
> University of Florence - Italy
>
> e-mail: marco.galardini at unifi.it
> www: http://www.unifi.it/dblage/CMpro-v-p-51.html
> phone: +39 055 2288249
> mobile: +39 340 2808041
> -------------------------------------------------
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

--
Hernán Morales
Institute of Veterinary Genetics.
National Scientific and Technical Research Council (CONICET).
La Plata (1900), Buenos Aires, Argentina.
Telephone: +54 (0221) 421-1799.
Internal: 422
Fax: 425-7980 or 421-1799.

From chaitanya.talnikar at iitb.ac.in  Tue Mar 20 16:21:50 2012
From: chaitanya.talnikar at iitb.ac.in (Chaitanya Talnikar)
Date: Tue, 20 Mar 2012 20:21:50 +0000
Subject: [Biopython] GSoC Application: "Representation and manipulation of genomic variants"
Message-ID:

Hi,

I, Chaitanya Talnikar, am a third-year undergraduate student of
Chemical Engineering at the Indian Institute of Technology. I would like
to work on the project "Representation and manipulation of genomic
variants". As I understand it, this project is about human genomic
variation and about constructing a representation of the variations that
would account for most of the variation file formats. I would like to
know how much I should write in the proposal. Would a description of the
internal representation be sufficient?

Talking about my experience, I have a good grasp of Python. I have done
courses on Molecular Biology and Computational Biology in which I learnt
the sequence alignment algorithms and the classification of proteins
based on hidden Markov models of insertion, deletion and mutations in
proteins.
I have used bioinformatics in the projects that I've done in the field of systems biology and made extensive use of NCBI blast and other utilities. I have worked on several projects related to programming. Some of projects whose code is online are: LTanks (http://code.google.com/p/ltanks/) This is a clone of a windows game. It has had around 200 downloads. OffApt (http://code.google.com/p/offapt/) This is a software that allows people to download ubuntu packages from windows, it includes a dependency resolver for debs and also a downloader. This project required a lot of file parsing and string manipulations I've mainly used python to solve the mathematical problems at http://projecteuler.net From eric.talevich at gmail.com Tue Mar 20 18:11:25 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 20 Mar 2012 18:11:25 -0400 Subject: [Biopython] Bio.Phylo: writing newick trees with internal node names In-Reply-To: <4F632E69.8010906@stats.ox.ac.uk> References: <4F632E69.8010906@stats.ox.ac.uk> Message-ID: Hi Tanya, In case you haven't solved this yet, could you post some small portion of the Newick tree you're working with? It's possible that your tree is oddly formatted, and the Newick parser isn't picking up the internal node labels in the first place. In the current version of Biopython (1.59), this seems to work fine: from Bio import Phylo # Example file from our test suite tree = Phylo.read("Tests/Nexus/int_node_labels.nwk", "newick") # Print to the console print tree.format("newick") # To write the tree to a file, this is preferred: Phylo.write(tree, "my_new_file.nwk", "newick") Cheers, Eric On Fri, Mar 16, 2012 at 8:13 AM, Tanya Golubchik wrote: > Dear all, > > I may be missing something in the documentation, but I can't seem to > figure out how to write newick trees with internal node names (preserved > as plain text). 
> I think in some versions of Bio.Phylo a call to
> tree.format('newick') accomplished this by default, but currently I
> can't replicate this behaviour and get a tree with unnamed internal nodes.
>
> Any pointers would be appreciated!
>
> Thanks
> Tanya
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From wangdanburnett at 163.com  Wed Mar 21 00:45:45 2012
From: wangdanburnett at 163.com (=?GB2312?B?zfXx9Q==?=)
Date: Wed, 21 Mar 2012 12:45:45 +0800
Subject: [Biopython] join biopython mailing list
Message-ID: <4F695CF9.4020406@163.com>

Hello, biopython users, developers and maintainers:
I'm new to Biopython and writing from the Chinese mainland. But... I don't
know how to join the mailing list. Could someone help me?
Thanks a lot.
Wang Dan from China

From guillaume.bayot at gmail.com  Wed Mar 21 05:19:21 2012
From: guillaume.bayot at gmail.com (Guillaume Bayot)
Date: Wed, 21 Mar 2012 10:19:21 +0100
Subject: [Biopython] join biopython mailing list
In-Reply-To: <4F695CF9.4020406@163.com>
References: <4F695CF9.4020406@163.com>
Message-ID:

Hello,

You can subscribe to the discussion list here
http://lists.open-bio.org/mailman/listinfo/biopython

On 21/03/2012 05:45, Wang Dan wrote:
> Hello, biopython users, developers and maintainers:
> I'm new to Biopython and writing from the Chinese mainland. But... I don't
> know how to join the mailing list. Could someone help me?
> Thanks a lot.
Wang Dan from China _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.orghttp://lists.open-bio.org/mailman/listinfo/biopython From golubchi at stats.ox.ac.uk Wed Mar 21 06:12:27 2012 From: golubchi at stats.ox.ac.uk (Tanya Golubchik) Date: Wed, 21 Mar 2012 10:12:27 +0000 Subject: [Biopython] Bio.Phylo: writing newick trees with internal node names In-Reply-To: References: <4F632E69.8010906@stats.ox.ac.uk> Message-ID: <4F69A98B.3040504@stats.ox.ac.uk> Hi Eric, It works in Biopython 1.59; I was using 1.58 originally. So problem solved. There's a few other strange things in Phylo that I can't work out, though -- for instance, what happens to 'PhyloXML.Other' attributes -- I can write these on the tree, and save the tree, but it can't be re-opened because the parser rejects it as improperly formatted. The documentation is a bit vague on this; in particular, passing None to 'attributes' when creating Phylo.Other objects fails, while passing an empty dictionary works... what is meant to be in 'attributes' when creating an Other object? 
Also, the 'is_aligned' sequence property disappears when a tree is saved
in phyloxml format and then read back using Phylo.read:

>>> print tree
Phylogeny(rooted=True, branch_length_unit='SNV')
    Clade(branch_length=0.0, name='N1')
        Clade(branch_length=0.0, name='C00000761')
            BranchColor(blue=0, green=128, red=0)
            Sequence(type='dna')
                MolSeq(value='CCTTTCTATGTTCTGGACTGACGTTAAACGA', is_aligned=True)
        Clade(branch_length=0.0, name='C00000763')
            BranchColor(blue=0, green=0, red=255)
            Sequence(type='dna')
                MolSeq(value='CCTTTcTATGTtCTGGACTGACGTTAAACGA', is_aligned=True)

>>> Phylo.write(tree, myfile, 'phyloxml')
1
>>> tree2 = Phylo.read(myfile, 'phyloxml')
>>> print tree2
Phylogeny(rooted=True, branch_length_unit='SNV')
    Clade(branch_length=0.0, name='N1')
        Clade(branch_length=0.0, name='C00000761')
            BranchColor(blue=0, green=128, red=0)
            Sequence(type='dna')
                MolSeq(value='CCTTTCTATGTTCTGGACTGACGTTAAACGA')
        Clade(branch_length=0.0, name='C00000763')
            BranchColor(blue=0, green=0, red=255)
            Sequence(type='dna')
                MolSeq(value='CCTTTcTATGTtCTGGACTGACGTTAAACGA')

Cheers
Tanya

On 20/03/12 22:11, Eric Talevich wrote:
> Hi Tanya,
>
> In case you haven't solved this yet, could you post some small portion
> of the Newick tree you're working with? It's possible that your tree is
> oddly formatted, and the Newick parser isn't picking up the internal
> node labels in the first place.
>
> In the current version of Biopython (1.59), this seems to work fine:
>
> from Bio import Phylo
> # Example file from our test suite
> tree = Phylo.read("Tests/Nexus/int_node_labels.nwk", "newick")
> # Print to the console
> print tree.format("newick")
> # To write the tree to a file, this is preferred:
> Phylo.write(tree, "my_new_file.nwk", "newick")
>
>
> Cheers,
> Eric
>
>
> On Fri, Mar 16, 2012 at 8:13 AM, Tanya Golubchik
> wrote:
>
>     Dear all,
>
>     I may be missing something in the documentation, but I can't seem to
>     figure out how to write newick trees with internal node names (preserved
>     as plain text).
>     I think in some versions of Bio.Phylo a call to
>     tree.format('newick') accomplished this by default, but currently I
>     can't replicate this behaviour and get a tree with unnamed internal
>     nodes.
>
>     Any pointers would be appreciated!
>
>     Thanks
>     Tanya
>     _______________________________________________
>     Biopython mailing list - Biopython at lists.open-bio.org
>     http://lists.open-bio.org/mailman/listinfo/biopython
>
>

From p.j.a.cock at googlemail.com  Wed Mar 21 06:14:38 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 21 Mar 2012 10:14:38 +0000
Subject: [Biopython] join biopython mailing list
In-Reply-To: <4F695CF9.4020406@163.com>
References: <4F695CF9.4020406@163.com>
Message-ID:

2012/3/21 Wang Dan:
> Hello, biopython users ,developers and maintainers:
> I'm a new guy to use biopython from the Chinese mainland. But...I don't
> know how to join the mailing list? Could someone help me?
> Thanks a lot.
> Wang Dan from China

I just checked the mail server, and you are (now) subscribed.

Welcome!

Peter

From wangdanburnett at 163.com  Wed Mar 21 06:06:33 2012
From: wangdanburnett at 163.com (=?GB2312?B?zfXx9Q==?=)
Date: Wed, 21 Mar 2012 18:06:33 +0800
Subject: [Biopython] join biopython mailing list
In-Reply-To:
References: <4F695CF9.4020406@163.com>
Message-ID: <4F69A829.8000001@163.com>

On 2012-03-21 17:19, Guillaume Bayot wrote:
> Hello,
>
> You can subscribe to the discussion list here
> http://lists.open-bio.org/mailman/listinfo/biopython
>
>
Thanks a lot. I've found the page.
Don
> On 21/03/2012 05:45, Wang Dan wrote:
>> Hello, biopython users ,developers and maintainers:
>> I'm a new guy to use biopython from the Chinese mainland. But...I don't
>> know how to join the mailing list? Could someone help me?
>> Thanks a lot.
>> Wang Dan from China >> >> >> >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython From reece at harts.net Wed Mar 21 12:01:57 2012 From: reece at harts.net (Reece Hart) Date: Wed, 21 Mar 2012 09:01:57 -0700 Subject: [Biopython] GSoC Application: "Representation and manipulation of genomic variants" In-Reply-To: References: Message-ID: On Tue, Mar 20, 2012 at 1:21 PM, Chaitanya Talnikar < chaitanya.talnikar at iitb.ac.in> wrote: > I, Chaitanya Talnikar, am a third year Undergraduate student of > Chemical Engineering at Indian Institute of Technology. I would like > to work on the project "Representation and manipulation of genomic > variants". ... Would a description of the internal > representation be sufficient? > Hi Chaitanya- I'm glad that you're interested in this project. There are many aspects of variant representation that a student (or perhaps even multiple students) might work on. Do not feel that you must tackle the entire project description. Before you spend a lot of time on an application, I suggest that you start a new thread with a short description of what you'd like to accomplish and questions you have. The BioPython community is a nurturing environment and I'm sure you'll get some good suggestions about scoping the project. Ultimately, we all want to help you write a successful application that results in an important contribution to the community. You'll be the first to initiate such a discussion, which gives you a wide-open opportunity. Does this answer give you enough to proceed with initiating a discussion? 
Thanks,
Reece

From golubchi at stats.ox.ac.uk  Thu Mar 22 11:20:11 2012
From: golubchi at stats.ox.ac.uk (Tanya Golubchik)
Date: Thu, 22 Mar 2012 15:20:11 +0000
Subject: [Biopython] Phylo.draw - font size
Message-ID: <4F6B432B.1050901@stats.ox.ac.uk>

Hi guys,

Does anyone know how to change the font size of the text annotations on
the current figure from Phylo.draw (ie the node names)? Changing
rcParams['font.size'] changes the axes but not the annotations.

Thanks
Tanya

From eric.talevich at gmail.com  Thu Mar 22 15:55:42 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 22 Mar 2012 15:55:42 -0400
Subject: [Biopython] Phylo.draw - font size
In-Reply-To: <4F6B432B.1050901@stats.ox.ac.uk>
References: <4F6B432B.1050901@stats.ox.ac.uk>
Message-ID:

On Thu, Mar 22, 2012 at 11:20 AM, Tanya Golubchik wrote:
> Hi guys,
>
> Does anyone know how to change the font size of the text annotations on
> the current figure from Phylo.draw (ie the node names)? Changing
> rcParams['font.size'] changes the axes but not the annotations.
>

There doesn't seem to be a great way to do this directly, but you can
scale the entire image on-screen by changing the figure.dpi value. It
defaults to 80dpi, so this will magnify everything by 50%:

>>> rcParams['figure.dpi'] = 120

Alternatively, you can edit the source of Bio/Phylo/_utils.py at lines
347 (taxon labels) and 357 (confidence/support values), or copy the
entire _utils.draw function into your own code and edit the same lines
there.

If this feels ridiculous (as it probably does), I can add font_size and
branch_width_scale keyword arguments in the next release. Would that
help? Any other options you'd like to see, keeping in mind that this
function isn't meant to compete with standalone programs like
Archaeopteryx?

From eric.talevich at gmail.com  Thu Mar 22 19:29:58 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 22 Mar 2012 19:29:58 -0400
Subject: [Biopython] Bio.Phylo: writing newick trees with internal node names
In-Reply-To: <4F69A98B.3040504@stats.ox.ac.uk>
References: <4F632E69.8010906@stats.ox.ac.uk> <4F69A98B.3040504@stats.ox.ac.uk>
Message-ID:

On Wed, Mar 21, 2012 at 6:12 AM, Tanya Golubchik wrote:
> There's a few other strange things in Phylo that I can't work out,
> though -- for instance, what happens to 'PhyloXML.Other' attributes -- I
> can write these on the tree, and save the tree, but it can't be
> re-opened because the parser rejects it as improperly formatted. The
> documentation is a bit vague on this; in particular, passing None to
> 'attributes' when creating Phylo.Other objects fails, while passing an
> empty dictionary works... what is meant to be in 'attributes' when
> creating an Other object?

The Other element is somewhat vaguely defined in the PhyloXML
specification, too; it's meant to allow defining new XML elements
without updating the official spec. The 'attributes' attribute
translates directly to the attributes of the new XML element you're
creating. It should be a dictionary of strings-to-strings (somewhat like
the 'annotations' attribute of SeqRecord). Something like:

>>> other = PhyloXML.Other("img", attributes={"src": "foo.png"})
>>> mytree.other.append(other)
>>> print mytree.format("phyloxml")

I see there was a bug here, where the PhyloXML.Other constructor should
initialize 'attributes' to an empty dictionary if it's not provided.
Fixed in the trunk:
https://github.com/biopython/biopython/commit/9e3fec461b189fe77b10db6de0c88df5b77e5bb0

> Also, the 'is_aligned' sequence property disappears when a tree is saved
> in phyloxml format and then read back using Phylo.read:
>
> >>> print tree
> Phylogeny(rooted=True, branch_length_unit='SNV')
>     Clade(branch_length=0.0, name='N1')
>         Clade(branch_length=0.0, name='C00000761')
>             BranchColor(blue=0, green=128, red=0)
>             Sequence(type='dna')
>                 MolSeq(value='CCTTTCTATGTTCTGGACTGACGTTAAACGA', is_aligned=True)
>         Clade(branch_length=0.0, name='C00000763')
>             BranchColor(blue=0, green=0, red=255)
>             Sequence(type='dna')
>                 MolSeq(value='CCTTTcTATGTtCTGGACTGACGTTAAACGA', is_aligned=True)
>
> >>> Phylo.write(tree, myfile, 'phyloxml')
> 1
> >>> tree2 = Phylo.read(myfile, 'phyloxml')
> >>> print tree2
> Phylogeny(rooted=True, branch_length_unit='SNV')
>     Clade(branch_length=0.0, name='N1')
>         Clade(branch_length=0.0, name='C00000761')
>             BranchColor(blue=0, green=128, red=0)
>             Sequence(type='dna')
>                 MolSeq(value='CCTTTCTATGTTCTGGACTGACGTTAAACGA')
>         Clade(branch_length=0.0, name='C00000763')
>             BranchColor(blue=0, green=0, red=255)
>             Sequence(type='dna')
>                 MolSeq(value='CCTTTcTATGTtCTGGACTGACGTTAAACGA')
>

This looks like a bug, too. (Thanks for finding these!) I don't
immediately see the cause of the problem; I'll try to take a crack at it
soon.

From zhigang.wu at email.ucr.edu  Fri Mar 23 15:00:12 2012
From: zhigang.wu at email.ucr.edu (Zhigang Wu)
Date: Fri, 23 Mar 2012 12:00:12 -0700
Subject: [Biopython] Google Summer of Code (GSoC)
Message-ID:

Hi Biopython community,

I am Zhigang Wu, a third-year graduate student at UC Riverside with a
research focus on miRNA evolution. I am interested in implementing the
Biopython SearchIO module, which would parse the reports produced by
currently popular sequence search and alignment tools such as NCBI
BLAST+, FASTA and HMMER3. I was a BioPerl user until one year ago; since
then I have been a Biopython user. I have been using BioPerl's SearchIO
extensively in my research project. BioPerl's SearchIO module provides a
common API capable of handling all popular formats and is great.
I'd like to write one in Python.

As mentioned briefly, I have approximately one year of Perl programming
experience and one year of Python programming experience; I occasionally
write C++ programs, and I also have a bit of experience with R.

Right now, I am preparing my proposal, which is due by April 6. I am
listing below the core methods that the Biopythonic SearchIO module is
going to support. For the sake of consistency, the methods are very
similar to those of the existing SeqIO and AlignIO modules.

1. SearchIO.parse(handle, format): a generator function.
2. SearchIO.to_dict(iterator): takes an iterator argument produced by
   the SearchIO.parse(...) function.
3. SearchIO.read(handle, format): provides fast access to a report that
   has only one record.
4. SearchIO.write(....): writes output in the specified format.
5. SearchIO.convert(...): provides conversion between different formats.
6. ...

I'd like to hear any feedback or suggestions on these methods, or on any
format that is popular in your research field and that you want to see
supported in the Biopythonic SearchIO module.

Regards,

Zhigang Wu

From ferreirafm at usp.br  Fri Mar 23 17:55:27 2012
From: ferreirafm at usp.br (ferreirafm at usp.br)
Date: Fri, 23 Mar 2012 18:55:27 -0300
Subject: [Biopython] remove list redundancy
Message-ID: <20120323185527.47476o6f3ticn2of@webmail.usp.br>

Hi Biopy users,
I have a multi-sequence fasta file which I've read as a list. Is there a
clever way/method to remove redundant sequences?
Thanks in advance,
Fred

### CODE:
def redundancy(fastafile):
    f=open(fastafile, 'r')
    record = list(SeqIO.parse(f,"fasta"))
    new_rec = record
    f.close
    print len(record)
    for i in range(len(record)):
        for j in range(len(record)):
            if i < j:
                if record[i].seq == record[j].seq:
                    del new_rec[j]
    print len(new_rec)

### RESULTS:
$ redundancy.py -run all_emm_fake.fasta
823
/usr/lib64/python2.7/site-packages/Bio/Seq.py:197: FutureWarning: In
future comparing Seq objects will use string comparison (not object
comparison). Incompatible alphabets will trigger a warning (not an
exception). In the interim please use id(seq1)==id(seq2) or
str(seq1)==str(seq2) to make your code explicit and to avoid this warning.
  "and to avoid this warning.", FutureWarning)
823

### EXPECTING:
Worse, the function above is not working. I was expecting 823 before and
822 after running it.

From idoerg at gmail.com  Fri Mar 23 18:19:27 2012
From: idoerg at gmail.com (Iddo Friedberg)
Date: Fri, 23 Mar 2012 18:19:27 -0400
Subject: [Biopython] remove list redundancy
In-Reply-To: <20120323185527.47476o6f3ticn2of@webmail.usp.br>
References: <20120323185527.47476o6f3ticn2of@webmail.usp.br>
Message-ID:

Python assigns by reference, not by value. So you can have the following:

>>> a=[1,2,3]
>>> b=a
>>> print b
[1, 2, 3]
>>> del b[1]
>>> print a
[1, 3]
>>>

So if you remove an item from list b, it will remove it from a as well.
Which is why in your case, record and new_rec end up the same, since they
were the same to start off with.

Furthermore, in your loop, you are changing the length of "record" which is
the target of a for loop. Never a good idea and yields unexpected results.
Finally, the index "j" you are using points to one thing in record, but
will point to another thing in new_rec.

You can do an assignment by value using the copy module
new_rec=copy.copy(record)

That will create a completely new copy of record in new_rec.

That still won't solve the problem that you have in the shifting place "j"
points to in the loop though.

I would suggest building a list of non-redundant sequences rather than
deleting from a list of redundant sequences.


HTH,

Iddo

On Fri, Mar 23, 2012 at 5:55 PM, wrote:

> Hi Biopy users,
> I have a mult-sequence fasta file which I've read as a list. Is there a
> clever way/method to remove redundant sequences?
> Thanks in advance,
> Fred
>
> ### CODE:
> def redundancy(fastafile):
>     f=open(fastafile, 'r')
>     record = list(SeqIO.parse(f,"fasta"))
>     new_rec = record
>     f.close
>     print len(record)
>     for i in range(len(record)):
>         for j in range(len(record)):
>             if i < j:
>                 if record[i].seq == record[j].seq:
>                     del new_rec[j]
>     print len(new_rec)
>
>
> ### RESULTS:
> $ redundancy.py -run all_emm_fake.fasta
> 823
> /usr/lib64/python2.7/site-packages/Bio/Seq.py:197: FutureWarning: In
> future comparing Seq objects will use string comparison (not object
> comparison). Incompatible alphabets will trigger a warning (not an
> exception). In the interim please use id(seq1)==id(seq2) or
> str(seq1)==str(seq2) to make your code explicit and to avoid this warning.
>   "and to avoid this warning.", FutureWarning)
> 823
>
> ### EXPECTING:
> Worse, the function above is not working. I was expecting 823 before and
> 822 after running it.
>
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

--
Iddo Friedberg
http://iddo-friedberg.net/contact.html
++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.>
++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----.
.>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>>
>>----.<--.>++++++.<<<<------------------------------------.
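
The aliasing behaviour described in this reply can be reproduced with plain lists. What follows is only a minimal sketch: the short strings stand in for sequence records, and the variable names are made up for the illustration. It avoids print statements so that it also runs on current Python.

```python
import copy

# Hypothetical stand-ins for the sequences read from a FASTA file.
records = ["ACGT", "TTGA", "ACGT"]

alias = records                # a second name for the SAME list object
snapshot = copy.copy(records)  # a new list object (shallow copy)

del alias[1]                   # this also removes the item from `records`

same_object = alias is records         # one list, two names
independent = snapshot is not records  # the copy is a separate list
```

Note that copy.copy makes a shallow copy: the new list is independent, but its elements are the same objects as before. That is enough when only the list itself is modified, as in the original function.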

From w.arindrarto at gmail.com  Fri Mar 23 18:23:13 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Fri, 23 Mar 2012 23:23:13 +0100
Subject: [Biopython] remove list redundancy
In-Reply-To:
References: <20120323185527.47476o6f3ticn2of@webmail.usp.br>
Message-ID:

Hi Ferreira,

As Iddo has mentioned, it's better to build a new list containing
unique records instead. Here's my shot at a method like that:


from Bio import SeqIO
from Bio.SeqUtils.CheckSum import seguid

# Returns a list containing unique SeqRecord list.
def remove_redundant(fastafile):
    records = SeqIO.parse(fastafile, 'fasta')

    # new list container
    unique_records = []
    # unique sequence checksum container
    checksum_container = []

    for record in records:
        checksum = seguid(record.seq)
        if checksum not in checksum_container:
            unique_records.append(record)

    return unique_records

I assume your Fasta file is not very big since you opted to load
everything into memory in your initial script. If it were big, you
could change the method into a generator to save memory, by writing
this instead:

    # previous lines...
    for record in records:
        checksum = seguid(record.seq)
        if checksum not in checksum_container:
            yield record

And iterating over the function like so:

    for unique_record in remove_redundant(fastafile):
        # process the records here


Hope that helps,
---
Bow


On Fri, Mar 23, 2012 at 23:19, Iddo Friedberg wrote:
> Python assigns by reference, not by value. So you can have the following:
>
> >>> a=[1,2,3]
> >>> b=a
> >>> print b
> [1, 2, 3]
> >>> del b[1]
> >>> print a
> [1, 3]
> >>>
>
> So if you remove an item from list b, it will remove it from a as well.
> Which is why in your case, record and new_rec end up the same, since they
> were the same to start off with.
>
> Furthermore, in your loop, you are changing the length of "record" which is
> the target of a for loop. Never a good idea and yields unexpected results.
>> >> >> >> >> ______________________________**_________________ >> Biopython mailing list ?- ?Biopython at lists.open-bio.org >> http://lists.open-bio.org/**mailman/listinfo/biopython >> > > > > -- > Iddo Friedberg > http://iddo-friedberg.net/contact.html > ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> > ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. > .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>>----.<--.>++++++.<<<<------------------------------------. > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From nuin at genedrift.org Fri Mar 23 18:27:54 2012 From: nuin at genedrift.org (nuin at genedrift.org) Date: Fri, 23 Mar 2012 22:27:54 +0000 Subject: [Biopython] remove list redundancy In-Reply-To: <20120323185527.47476o6f3ticn2of@webmail.usp.br> References: <20120323185527.47476o6f3ticn2of@webmail.usp.br> Message-ID: <1428368114-1332541675-cardhu_decombobulator_blackberry.rim.net-2070697673-@b2.c21.bise6.blackberry> Not a BioPython solution per se but you can uniquify your list using a set. HTH Paulo Sent from my BlackBerry device on the Rogers Wireless Network -----Original Message----- From: ferreirafm at usp.br Sender: biopython-bounces at lists.open-bio.org Date: Fri, 23 Mar 2012 18:55:27 To: Subject: [Biopython] remove list redundancy Hi Biopy users, I have a mult-sequence fasta file which I've read as a list. Is there a clever way/method to remove redundant sequences? 
Thanks in advance,
Fred

### CODE:
def redundancy(fastafile):
    f=open(fastafile, 'r')
    record = list(SeqIO.parse(f,"fasta"))
    new_rec = record
    f.close
    print len(record)
    for i in range(len(record)):
        for j in range(len(record)):
            if i < j:
                if record[i].seq == record[j].seq:
                    del new_rec[j]
    print len(new_rec)

### RESULTS:
$ redundancy.py -run all_emm_fake.fasta
823
/usr/lib64/python2.7/site-packages/Bio/Seq.py:197: FutureWarning: In
future comparing Seq objects will use string comparison (not object
comparison). Incompatible alphabets will trigger a warning (not an
exception). In the interim please use id(seq1)==id(seq2) or
str(seq1)==str(seq2) to make your code explicit and to avoid this warning.
  "and to avoid this warning.", FutureWarning)
823

### EXPECTING:
Worse, the function above is not working. I was expecting 823 before and
822 after running it.

_______________________________________________
Biopython mailing list - Biopython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython

From w.arindrarto at gmail.com  Fri Mar 23 18:39:59 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Fri, 23 Mar 2012 23:39:59 +0100
Subject: [Biopython] remove list redundancy
In-Reply-To:
References: <20120323185527.47476o6f3ticn2of@webmail.usp.br>
Message-ID:

Ferreira, I just realized I missed one important line:

        if checksum not in checksum_container:
            unique_records.append(record)

should be:

        if checksum not in checksum_container:
            checksum_container.append(checksum)
            unique_records.append(record)

Basically what the method does is it only adds the sequence record to
the unique_records list only if its sequence checksum is not present
in checksum_container already.

And apologies for the double mass-post everyone.

Have a nice weekend,
---
Bow

On Fri, Mar 23, 2012 at 23:23, Wibowo Arindrarto wrote:
> Hi Ferreira,
>
> As Iddo have mentioned, it's better to build a new list containing
> unique records instead.
>> .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>>>----.<--.>++++++.<<<<------------------------------------. >> _______________________________________________ >> Biopython mailing list ?- ?Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython From ferreirafm at usp.br Fri Mar 23 19:35:54 2012 From: ferreirafm at usp.br (ferreirafm at usp.br) Date: Fri, 23 Mar 2012 20:35:54 -0300 Subject: [Biopython] remove list redundancy In-Reply-To: <20120323185527.47476o6f3ticn2of@webmail.usp.br> References: <20120323185527.47476o6f3ticn2of@webmail.usp.br> Message-ID: <20120323203554.14654j9m0wvpho0q@webmail.usp.br> Thanks everyone for helping. Have I weekend. Fred Citando ferreirafm at usp.br: > Hi Biopy users, > I have a mult-sequence fasta file which I've read as a list. Is > there a clever way/method to remove redundant sequences? > Thanks in advance, > Fred > > ### CODE: > def redundancy(fastafile): > f=open(fastafile, 'r') > record = list(SeqIO.parse(f,"fasta")) > new_rec = record > f.close > print len(record) > for i in range(len(record)): > for j in range(len(record)): > if i < j: > if record[i].seq == record[j].seq: > del new_rec[j] > print len(new_rec) > > > ### RESULTS: > $ redundancy.py -run all_emm_fake.fasta > 823 > /usr/lib64/python2.7/site-packages/Bio/Seq.py:197: FutureWarning: In > future comparing Seq objects will use string comparison (not object > comparison). Incompatible alphabets will trigger a warning (not an > exception). In the interim please use id(seq1)==id(seq2) or > str(seq1)==str(seq2) to make your code explicit and to avoid this > warning. > "and to avoid this warning.", FutureWarning) > 823 > > ### EXPECTING: > Worse, the function above is not working. I was expecting 823 before > and 822 after running it. 
> > > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From eric.talevich at gmail.com Fri Mar 23 21:21:37 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 23 Mar 2012 21:21:37 -0400 Subject: [Biopython] Phylo.draw - font size In-Reply-To: References: <4F6B432B.1050901@stats.ox.ac.uk> Message-ID: On Thu, Mar 22, 2012 at 3:55 PM, Eric Talevich wrote: > On Thu, Mar 22, 2012 at 11:20 AM, Tanya Golubchik > wrote: >> Hi guys, >> >> Does anyone know how to change the font size of the text annotations on >> the current figure from Phylo.draw (ie the node names)? Changing >> rcParams['font.size'] changes the axes but not the annotations. >> > > There doesn't seem to be a great way to do this directly, but you can > scale the entire image on-screen by changing the figure.dpi value. It > defaults to 80dpi, so this will magnify everything by 50%: > >>>> rcParams['figure.dpi'] = 120 > > Alternatively, you can edit the source of Bio/Phylo/_utils.py at lines > 347 (taxon labels) and 357 (confidence/support values), or copy the > entire _utils.draw function into your own code and edit the same lines > there. > > If this feels ridiculous (as it probably does), I can add font_size > and branch_width_scale keyword arguments in the next release. Would > that help? Any other options you'd like to see, keeping in mind that > this function isn't meant to compete with standalone programs like > Archaeopteryx? I've fixed this in the trunk: https://github.com/biopython/biopython/commit/e25a1b8bde6c9adba7db92bfe13d1bd4320cadcf Now rcParams["font.size"] will scale the fonts as expected (though the proportions are still hard-coded), and rcParams["lines.linewidth"] will scale the lines, e.g. if you set the line width to 2 in rcParams, a branch with width=2 will be displayed with a width of 4 pixels, and width=0.5 will be displayed with 1 pixel. 
Other changes are still in progress. Cheers, Eric From chaitanya.talnikar at iitb.ac.in Sun Mar 25 14:25:25 2012 From: chaitanya.talnikar at iitb.ac.in (Chaitanya Talnikar) Date: Sun, 25 Mar 2012 23:55:25 +0530 Subject: [Biopython] GSoC project application: Representation and manipulation of genomic variants: Some questions Message-ID: Hi again, I have a few questions on this topic. 1. For the implementation of variants what would be better, to create a new SeqVariant class from scratch or to extend the SeqFeature class to accomodate variants? I guess a separate class would be better. 2. While looking at the Biopython wiki I came across an implementation of GFF at https://github.com/chapmanb/bcbb/tree/master/gff As GVF is an extension of GFF3, this module could be used for reading GVF's too. Is this module a good start to modify it to support GVFs? 3. I've been going through the VCF documentation and SNPs, insertions and deletions can be represented just like it is done in VCF, the object would have a start position, length of reference sequence(no need to store this sequence) and a list of alternate sequence objects. I have to still look into the SV(Structural variants), rearrangements and imprecise variant information, so this representation is only for SNPs and small indels. The GVF has a very similar format for small indels and SNPs, just that it provides an extra end position column which is not required if we have the reference sequence. Regards, Chaitanya Talnikar Undergraduate Student Department of Chemical Engineering IIT Bombay From chris.mit7 at gmail.com Sun Mar 25 15:13:47 2012 From: chris.mit7 at gmail.com (Chris Mitchell) Date: Sun, 25 Mar 2012 15:13:47 -0400 Subject: [Biopython] GSOC 2012: Representation and manipulation of genomic variants Message-ID: Hey everyone, I'm interested in undertaking this project. 
I'm currently a PhD student in Biochemical, Cellular, & Molecular
Biology at Johns Hopkins School of Medicine, and I've been a hobby
programmer for several years. I primarily code in Python and C++. I'm a
core developer of Mudlet, which is in C++ and has a fair user base. For
Python, I have nothing published for general consumption yet, though I
will more than likely be putting out a Mass Spectrometry toolset in the
upcoming year.

I'm currently working on large -omics based data (whole genome alignments,
RNA-Seq) so I have a flavor of what formats end users will encounter (I've
worked with Illumina & Complete Genomics RNA-Seq and genome assemblies, and
Affy arrays for SNPs/CNVs) and more importantly, I know how the end user
will want to utilize the data. By far, I see the biggest hurdle is to
arrange several types of data representations into a universal reference
frame (for instance bam files being 0 based, sam being 1 based, CG vcf
files being 0 based, closed interval versus half open, etc etc etc). I've
written parsers for my own use that interconvert between formats and can
read/output GFF/VCF files, and this would be a great opportunity to expand
on my existing toolset and get valuable feedback from others in the
community.

Thanks,
Chris

From p.j.a.cock at googlemail.com  Sun Mar 25 15:59:41 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 25 Mar 2012 20:59:41 +0100
Subject: [Biopython] GSOC 2012: Representation and manipulation of genomic variants
In-Reply-To:
References:
Message-ID:

On Sun, Mar 25, 2012 at 8:13 PM, Chris Mitchell wrote:
> Hey everyone,
>
> I'm interested in undertaking this project. I'm currently a PhD student in
> Biochemical, Cellular, & Molecular Biology at Johns Hopkins School of
> Medicine, and I've been a hobby programmer for several years. I primarily
> code in Python and C++.

Great - and your background sounds good.
> I'm currently working on large -omics based data (whole genome alignments, > RNA-Seq) so I have a flavor of what formats end users will encounter (I've > worked with Illumina & Complete Genomics RNA-Seq and genome assemblies, and > Affy arrays for SNPs/CNVs) and more importantly, I know how the end user > will want to utilize the data. ?By far, I see the biggest hurdle is to > arrange several types of data representations into a universal reference > frame (for instance bam files being 0 based, sam being 1 based, CG vcf > files being 0 based, closed interval versus half open, etc etc etc). That's easy - we're Python programers therefore any parsed data structure should be converted to used Python counting. Peter From rbuels at gmail.com Sun Mar 25 16:09:50 2012 From: rbuels at gmail.com (Robert Buels) Date: Sun, 25 Mar 2012 13:09:50 -0700 Subject: [Biopython] Announcing OBF Summer of Code - please forward! Message-ID: <4F6F7B8E.1050903@gmail.com> Hi all, Here's an advertising-ready announcement for OBF's Summer of Code, thanks to Christian Zmasek and Hilmar Lapp for their excellent writing. Student applications are due April 6! Please spread it widely, we need to reach lots of students with it! Rob Buels OBF GSoC 2012 Admin ============================================================ *** Please disseminate widely at your local institutions *** *** including posting to message and job boards, so that *** *** we reach as many students as possible. *** ============================================================ OPEN BIOINFORMATICS FOUNDATION SUMMER OF CODE 2011 Applications due 19:00 UTC, April 6, 2012. http://www.open-bio.org/wiki/Google_Summer_of_Code The Open Bioinformatics Foundation Summer of Code program provides a unique opportunity for undergraduate, masters, and PhD students to obtain hands-on experience writing and extending open-source software for bioinformatics under the mentorship of experienced developers from around the world. 
The program is the participation of the Open Bioinformatics Foundation (OBF) as a mentoring organization in the Google Summer of Code(tm) (http://code.google.com/soc/). Students successfully completing the 3 month program receive a $5,000 USD stipend, and may work entirely from their home or home institution. Participation is open to students from any country in the world except countries subject to US trade restrictions. Each student will have at least one dedicated mentor to show them the ropes and help them complete their project. The Open Bioinformatics Foundation is particularly seeking students interested in both bioinformatics (computational biology) and software development. Some initial project ideas are listed on the website. These range from sequence search I/O in BioPython to lightweight sequence objects and lazy parsing in BioPerl, a next-generation BioRuby interface to Ensembl to developing cloud-optimized versions of BioJava modules. All project ideas are flexible and many can be adjusted in scope to match the skills of the student. We also particularly welcome and encourage students proposing their own project ideas; historically some of the most successful Summer of Code projects are ones proposed by the students themselves. TO APPLY: Apply online at the Google Summer of Code website (http://socghop.appspot.com/), where you will also find GSoC program rules and eligibility requirements. The 12-day application period for students runs from Monday, March 26 through Friday, April 6th, 2012. INQUIRIES: We strongly encourage all interested students to get in touch with us with their ideas as early on as possible. See the OBF GSoC page for contact details. 
2012 OBF Summer of Code: http://www.open-bio.org/wiki/Google_Summer_of_Code Google Summer of Code FAQ: http://www.google-melange.com/document/show/gsoc_program/google/gsoc2012/faqs From ankeshth at gmail.com Mon Mar 26 00:31:26 2012 From: ankeshth at gmail.com (Ankesh Thakur) Date: Mon, 26 Mar 2012 10:01:26 +0530 Subject: [Biopython] Query for GSoc projects on SearchIO and Representation and manipulation of genomic variants In-Reply-To: References: Message-ID: Dear Sir, I am a student of Biological Sciences and bioengineering at Indian Institute of Technology, Kanpur (IIT Kanpur). I am willing to write codes for Biopython during this summer. I am not very much clear about the goals of this project. I want to know more about the suggested projects, like what else I need to do apart from conversion of one file format to other and showing the data on the console in human readable form. I have no prior experience with bio modules of python. I have arround than seven months experience with python git hub. And I have done Molecular biology, Genetics and Bio-chemistry courses. I would like to learn Biopython, BioPerl( if required) and other necessary tools during this summer. Eagerly waiting for your reply. Regards, Ankesh Kumar Thakur. From p.j.a.cock at googlemail.com Mon Mar 26 05:19:18 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 26 Mar 2012 10:19:18 +0100 Subject: [Biopython] Query for GSoc projects on SearchIO and Representation and manipulation of genomic variants In-Reply-To: References: Message-ID: On Mon, Mar 26, 2012 at 5:31 AM, Ankesh Thakur wrote: > Dear Sir, > ? I am a student of Biological Sciences and bioengineering at Indian > Institute of Technology, Kanpur (IIT Kanpur). I am willing to write > codes for Biopython during this summer. I am not very much clear about > the goals of this project. 
I want to know more about the suggested > projects, like what else I need to do apart from conversion of one file > format to other and showing the data on the console in human readable > form. > > ? I have no prior experience with bio modules of python. I have arround than > seven months experience with python git hub. And I have done Molecular > biology, Genetics and Bio-chemistry courses. I would like to learn > Biopython, BioPerl( if required) and other necessary tools during this > summer. Eagerly waiting for your reply. > > Regards, > Ankesh Kumar Thakur. Hello Ankesh, Both the SearchIO and genomic variant GSoC project ideas are more than just file format conversion and 'pretty printing' at the console. An essential part of this is designing a suitable object representation for efficient use of the data. That probably means creating objects (Python classes). This will require both a good understanding of the meaning of the data being represented (e.g. how are BLAST search results structured) but also how to design Python objects. For the SearchIO project, I went into a lot more detail on the Biopython development mailing list last week: http://lists.open-bio.org/pipermail/biopython-dev/2012-March/009468.html Peter From chapmanb at 50mail.com Mon Mar 26 07:07:36 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 26 Mar 2012 07:07:36 -0400 Subject: [Biopython] GSoC project application: Representation and manipulation of genomic variants: Some questions In-Reply-To: References: Message-ID: <87398vmxhj.fsf@fastmail.fm> Chaitanya; Thanks for the interest and specific questions. > 1. For the implementation of variants what would be better, to create > a new SeqVariant class from scratch or to extend the SeqFeature class > to accomodate variants? I guess a separate class would be better. My preference would be to see how far the SeqFeature class can take you before implementing a new class. 
It should be general enough to handle variant data, but the bigger challenge might be designing a lightweight representation that is compatible with existing SeqFeatures. > 2. While looking at the Biopython wiki I came across an implementation > of GFF at > https://github.com/chapmanb/bcbb/tree/master/gff > As GVF is an extension of GFF3, this module could be used for reading > GVF's too. Is this module a good start to modify it to support GVFs? That would be perfect. We're hoping to merge this into the Biopython code base before the next release. There is also an existing VCF parser we'd love to use here: https://github.com/jamescasbon/PyVCF > 3. I've been going through the VCF documentation and SNPs, insertions > and deletions can be represented just like it is done in VCF, the > object would have a start position, length of reference sequence(no > need to store this sequence) and a list of alternate sequence objects. > I have to still look into the SV(Structural variants), rearrangements > and imprecise variant information, so this representation is only for > SNPs and small indels. The GVF has a very similar format for small > indels and SNPs, just that it provides an extra end position column > which is not required if we have the reference sequence. This sounds good. My general suggestion is to start writing your proposal as soon as possible. A concrete first draft will help with more detailed comments. The wiki has good information on the project plan: http://open-bio.org/wiki/Google_Summer_of_Code#When_you_apply and the NESCent wiki has some examples of well-written proposals from previous years: http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2012#Writing_your_application One of the key aspects is having a detailed week-by-week outline of your plans for the summer. 
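The small-variant representation described in point 3 of the quoted message (a start position, the length of the reference allele, and a list of alternate alleles) could be sketched roughly as follows. This is only an illustration under stated assumptions: `SmallVariant` and its fields are hypothetical names, not part of Biopython, PyVCF, or pysam, and positions are assumed to be 0-based. It needs Python 3.7 or later for `dataclasses`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SmallVariant:
    """Hypothetical container for a SNP or small indel (not a Biopython class)."""
    chrom: str
    start: int        # assumed 0-based start of the reference allele
    ref_length: int   # number of reference bases replaced (0 for a pure insertion)
    alts: List[str] = field(default_factory=list)

    @property
    def end(self) -> int:
        # Half-open end coordinate, derived from start and length rather than stored
        return self.start + self.ref_length

    def kind(self, alt: str) -> str:
        """Classify one alternate allele by comparing its length to the reference."""
        if len(alt) == self.ref_length == 1:
            return "SNP"
        if len(alt) > self.ref_length:
            return "insertion"
        if len(alt) < self.ref_length:
            return "deletion"
        return "substitution"


snp = SmallVariant("chr1", 99, 1, ["T"])
deletion = SmallVariant("chr1", 99, 3, ["A"])
```

Because `end` is computed from `start` and `ref_length`, the extra end-position column that GVF carries would not need to be stored separately in such a sketch.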
Thanks again for the interest, Brad From chapmanb at 50mail.com Mon Mar 26 07:02:29 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 26 Mar 2012 07:02:29 -0400 Subject: [Biopython] Google Summer of Code (GSoC) In-Reply-To: References: Message-ID: <8762drmxq2.fsf@fastmail.fm> Zhigang; > I am, Zhigang Wu, a third year graduate student in UC Riverside with a > research focus on miRNA evolution. I am interested in implementing the > Biopython SearchIO module, which is used to parse the blast reports from > currently popular sequence alignment tools like NCBI BLAST+, FASTA, HMMER3 > and etc. Welcome. Thanks for the introduction and your interest in the SearchIO project. > Right now, I am preparing my proposal that is due by April 6. I am listing > below the core methods that the Biopythonic SearchIO module is going to > support. For the sake of consistency, the moethods are very similar to > existing SeqIO and > AlignIOmodules. > > 1. SearchIO.parse(handle, format), is a generator function. > 2. SearchIO.to_dict(iterator): this function takes in an iterator > arguments which is produced by SearchIO.parse(...) function. > 3. SearchIO.read(handle, format): provide fasta access to blast report > have only one record > 4. SearchIO.write(....) outputs specified blast output > 5. SearchIO.convert(...) provide format conversion between different > formats > 6. ... > > I'd like to hear back from you any feedback or suggestions on the method or > any format that in your research field is considered to be popular and you > want it to be supported in Biopythonic SearchIO module. This all sounds great. My suggestion would be to make your project proposal available once you have a first draft, and then folks will have more detailed comments. 
The wiki has good information on the project plan: http://open-bio.org/wiki/Google_Summer_of_Code#When_you_apply and the NESCent wiki has some examples of well-written proposals from previous years: http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2012#Writing_your_application One of the key aspects is having a detailed week-by-week outline of your plans for the summer. Thanks again, Brad From chapmanb at 50mail.com Mon Mar 26 07:16:27 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 26 Mar 2012 07:16:27 -0400 Subject: [Biopython] GSOC 2012: Representation and manipulation of genomic variants In-Reply-To: References: Message-ID: <87wr67liic.fsf@fastmail.fm> Chris; Welcome and thanks for the interest in the project. > I'm currently working on large -omics based data (whole genome alignments, > RNA-Seq) so I have a flavor of what formats end users will encounter (I've > worked with Illumina & Complete Genomics RNA-Seq and genome assemblies, and > Affy arrays for SNPs/CNVs) and more importantly, I know how the end user > will want to utilize the data. By far, I see the biggest hurdle is to > arrange several types of data representations into a universal reference > frame (for instance bam files being 0 based, sam being 1 based, CG vcf > files being 0 based, closed interval versus half open, etc etc etc). I've > written parsers for my own use that interconvert between formats and can > read/output GFF/VCF files, and this would be a great opportunity to expand > on my existing toolset and get valuable feedback from others in the > community. I agree with Peter: you want to convert everything to standard Python 0-based internally. The goal is to have a consistent data structure so you can code independent of the input/output formats. There are some existing VCF and GFF parsers we were targeting for inclusion: https://github.com/jamescasbon/PyVCF http://biopython.org/wiki/GFF_Parsing but it would be great to see code you've written as well. 
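The advice to convert everything to Python-style 0-based coordinates internally can be illustrated with a small sketch. The helper names below are hypothetical and not part of Biopython; the sketch assumes the input is a 1-based closed interval (as used by GFF) and that the internal form is 0-based and half-open, like a Python slice.

```python
def one_based_closed_to_python(start, end):
    """Convert a 1-based closed interval [start, end] to 0-based half-open."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Only the start shifts; the closed end equals the half-open end
    return start - 1, end


def python_to_one_based_closed(start, end):
    """Convert a 0-based half-open interval [start, end) back to 1-based closed."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


sequence = "ACGTACGT"
# Bases 3 to 5 in 1-based closed coordinates are "GTA"
py_start, py_end = one_based_closed_to_python(3, 5)
assert sequence[py_start:py_end] == "GTA"
```

Doing the conversion once at parse time, and the reverse once at write time, would let the rest of the code slice sequences directly without per-format offsets.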
I am repeating myself, but my general suggestion is to start writing your proposal as soon as possible. A concrete first draft will help with more detailed comments. The wiki has good information on the project plan: http://open-bio.org/wiki/Google_Summer_of_Code#When_you_apply and the NESCent wiki has some examples of well-written proposals from previous years: http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2012#Writing_your_application One of the key aspects is having a detailed week-by-week outline of your plans for the summer. Brad From reece at harts.net Mon Mar 26 08:07:25 2012 From: reece at harts.net (Reece Hart) Date: Mon, 26 Mar 2012 05:07:25 -0700 Subject: [Biopython] GSOC 2012: Representation and manipulation of genomic variants In-Reply-To: References: Message-ID: Hi Chris- Great. The only thing I have to add to what Peter and Brad said is that you should feel free to refine your proposal with us (GSoC mentors) and/or the BioPython community. -Reece From cjfields at illinois.edu Mon Mar 26 13:24:08 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 26 Mar 2012 17:24:08 +0000 Subject: [Biopython] Query for GSoc projects on SearchIO and Representation and manipulation of genomic variants In-Reply-To: References: Message-ID: On Mar 26, 2012, at 4:19 AM, Peter Cock wrote: > On Mon, Mar 26, 2012 at 5:31 AM, Ankesh Thakur wrote: >> Dear Sir, >> I am a student of Biological Sciences and bioengineering at Indian >> Institute of Technology, Kanpur (IIT Kanpur). I am willing to write >> codes for Biopython during this summer. I am not very much clear about >> the goals of this project. I want to know more about the suggested >> projects, like what else I need to do apart from conversion of one file >> format to other and showing the data on the console in human readable >> form. >> >> I have no prior experience with bio modules of python. I have arround than >> seven months experience with python git hub. 
>> And I have done Molecular
>> biology, Genetics and Bio-chemistry courses. I would like to learn
>> Biopython, BioPerl( if required) and other necessary tools during this
>> summer. Eagerly waiting for your reply.
>>
>> Regards,
>> Ankesh Kumar Thakur.
>
> Hello Ankesh,
>
> Both the SearchIO and genomic variant GSoC project ideas are
> more than just file format conversion and 'pretty printing' at the
> console. An essential part of this is designing a suitable object
> representation for efficient use of the data. That probably means
> creating objects (Python classes). This will require both a good
> understanding of the meaning of the data being represented
> (e.g. how are BLAST search results structured) but also how
> to design Python objects.
>
> For the SearchIO project, I went into a lot more detail on the
> Biopython development mailing list last week:
> http://lists.open-bio.org/pipermail/biopython-dev/2012-March/009468.html
>
> Peter

Might be a good opportunity to go over what works in the BioPerl SearchIO
implementations, what doesn't, etc. The vast majority of the speed issues
we (BioPerl) have seen with SearchIO seem to have much more to do with
object generation than with parsing (I think Ruby has the same issue).

BioPerl's SearchIO is summarized in the HOWTO:

http://www.bioperl.org/wiki/HOWTO:SearchIO

Simple enough: each report is divided up into one or more Results, each of
which can have multiple Hits, each of which in turn can have multiple HSPs.
HSPs are also paired SeqFeatures, one for the query, one for the hit (I
think this was implemented later).

Some basic notes about the BLAST parser design (SAX-like), written by
Steve Chervitz during the time this was drawn up, are here:

https://github.com/bioperl/bioperl-live/blob/master/Bio/SearchIO/blast.pm#L2440

This doesn't apply to all SearchIO parsers, but it gives an idea of the
thoughts behind it.
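The Result / Hit / HSP hierarchy described above could be modelled in Python along these lines. This is a minimal sketch, not BioPerl's API and not any existing Biopython module: the class and attribute names are hypothetical, and the example values are made up purely to exercise the structure.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class HSP:
    """One high-scoring pair: matching ranges on the query and on the hit."""
    query_start: int
    query_end: int
    hit_start: int
    hit_end: int
    evalue: float


@dataclass
class Hit:
    """One database sequence matched by the query; holds one or more HSPs."""
    hit_id: str
    hsps: List[HSP] = field(default_factory=list)


@dataclass
class Result:
    """Everything reported for a single query; holds zero or more Hits."""
    query_id: str
    hits: List[Hit] = field(default_factory=list)

    def iter_hsps(self) -> Iterator[HSP]:
        # Walk the Result -> Hit -> HSP hierarchy in report order
        for hit in self.hits:
            yield from hit.hsps


# Illustrative values only
result = Result("query1", [
    Hit("subjA", [HSP(0, 50, 10, 60, 1e-20), HSP(80, 120, 200, 240, 1e-5)]),
    Hit("subjB", [HSP(5, 40, 0, 35, 1e-3)]),
])
best = min(result.iter_hsps(), key=lambda hsp: hsp.evalue)
```

Given the observation that object creation, rather than parsing, dominated run time in BioPerl, a real implementation might want to build Hit and HSP objects lazily; the sketch above builds them eagerly for simplicity.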
chris From mictadlo at gmail.com Tue Mar 27 00:33:08 2012 From: mictadlo at gmail.com (Mic) Date: Tue, 27 Mar 2012 14:33:08 +1000 Subject: [Biopython] GSoC project application: Representation and manipulation of genomic variants: Some questions In-Reply-To: <87398vmxhj.fsf@fastmail.fm> References: <87398vmxhj.fsf@fastmail.fm> Message-ID: Hello, http://code.google.com/p/pysam/downloads/detail?name=pysam-0.5.tar.gz&can=2&q= *added vcf parsing* What is the difference between pysam's VCF and PyVCF?* * On Mon, Mar 26, 2012 at 9:07 PM, Brad Chapman wrote: > > Chaitanya; > Thanks for the interest and specific questions. > > > 1. For the implementation of variants what would be better, to create > > a new SeqVariant class from scratch or to extend the SeqFeature class > > to accomodate variants? I guess a separate class would be better. > > My preference would be to see how far the SeqFeature class can take you > before implementing a new class. It should be general enough to handle > variant data, but the bigger challenge might be designing a lightweight > representation that is compatible with existing SeqFeatures. > > > 2. While looking at the Biopython wiki I came across an implementation > > of GFF at > > https://github.com/chapmanb/bcbb/tree/master/gff > > As GVF is an extension of GFF3, this module could be used for reading > > GVF's too. Is this module a good start to modify it to support GVFs? > > That would be perfect. We're hoping to merge this into the Biopython > code base before the next release. There is also an existing VCF parser > we'd love to use here: > > https://github.com/jamescasbon/PyVCF > > > 3. I've been going through the VCF documentation and SNPs, insertions > > and deletions can be represented just like it is done in VCF, the > > object would have a start position, length of reference sequence(no > > need to store this sequence) and a list of alternate sequence objects. 
> > I have to still look into the SV(Structural variants), rearrangements > > and imprecise variant information, so this representation is only for > > SNPs and small indels. The GVF has a very similar format for small > > indels and SNPs, just that it provides an extra end position column > > which is not required if we have the reference sequence. > > This sounds good. My general suggestion is to start writing your > proposal as soon as possible. A concrete first draft will help with more > detailed comments. The wiki has good information on the project plan: > > http://open-bio.org/wiki/Google_Summer_of_Code#When_you_apply > > and the NESCent wiki has some examples of well-written proposals from > previous years: > > > http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2012#Writing_your_application > > One of the key aspects is having a detailed week-by-week outline of your > plans for the summer. > > Thanks again for the interest, > Brad > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From chapmanb at 50mail.com Tue Mar 27 06:20:26 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 27 Mar 2012 06:20:26 -0400 Subject: [Biopython] GSoC project application: Representation and manipulation of genomic variants: Some questions In-Reply-To: References: <87398vmxhj.fsf@fastmail.fm> Message-ID: <87vclqjqfp.fsf@fastmail.fm> Mic; > http://code.google.com/p/pysam/downloads/detail?name=pysam-0.5.tar.gz&can=2&q= > *added vcf parsing* > What is the difference between pysam's VCF and PyVCF?* Good point, thanks for mentioning this. pysam's VCF is also worth exploring as a base for the variant representation. I added links to it and the other resources on the GSoC project description page. 
Thanks, Brad From chaitanya.talnikar at iitb.ac.in Tue Mar 27 14:57:45 2012 From: chaitanya.talnikar at iitb.ac.in (Chaitanya Talnikar) Date: Wed, 28 Mar 2012 00:27:45 +0530 Subject: [Biopython] GSoC project application: Representation and manipulation of genomic variants: Some questions In-Reply-To: <87398vmxhj.fsf@fastmail.fm> References: <87398vmxhj.fsf@fastmail.fm> Message-ID: Hi, I have uploaded the first draft of my project proposal. I will add more sections to the project plan in a day or two. Just wanted to have the initial draft up. I hope to write a better proposal with your feedback. Regards, Chaitanya On Mon, Mar 26, 2012 at 4:37 PM, Brad Chapman wrote: > > Chaitanya; > Thanks for the interest and specific questions. > >> 1. For the implementation of variants what would be better, to create >> a new SeqVariant class from scratch or to extend the SeqFeature class >> to accomodate variants? I guess a separate class would be better. > > My preference would be to see how far the SeqFeature class can take you > before implementing a new class. It should be general enough to handle > variant data, but the bigger challenge might be designing a lightweight > representation that is compatible with existing SeqFeatures. > >> 2. While looking at the Biopython wiki I came across an implementation >> of GFF at >> https://github.com/chapmanb/bcbb/tree/master/gff >> As GVF is an extension of GFF3, this module could be used for reading >> GVF's too. Is this module a good start to modify it to support GVFs? > > That would be perfect. We're hoping to merge this into the Biopython > code base before the next release. There is also an existing VCF parser > we'd love to use here: > > https://github.com/jamescasbon/PyVCF > >> 3. 
I've been going through the VCF documentation and SNPs, insertions >> and deletions can be represented just like it is done in VCF, the >> object would have a start position, length of reference sequence(no >> need to store this sequence) and a list of alternate sequence objects. >> I have to still look into the SV(Structural variants), rearrangements >> and imprecise variant information, so this representation is only for >> SNPs and small indels. The GVF has a very similar format for small >> indels and SNPs, just that it provides an extra end position column >> which is not required if we have the reference sequence. > > This sounds good. My general suggestion is to start writing your > proposal as soon as possible. A concrete first draft will help with more > detailed comments. The wiki has good information on the project plan: > > http://open-bio.org/wiki/Google_Summer_of_Code#When_you_apply > > and the NESCent wiki has some examples of well-written proposals from > previous years: > > http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2012#Writing_your_application > > One of the key aspects is having a detailed week-by-week outline of your > plans for the summer. > > Thanks again for the interest, > Brad From chapmanb at 50mail.com Tue Mar 27 20:43:33 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 27 Mar 2012 20:43:33 -0400 Subject: [Biopython] GSoC project application: Representation and manipulation of genomic variants: Some questions In-Reply-To: References: <87398vmxhj.fsf@fastmail.fm> Message-ID: <874nt9k11m.fsf@fastmail.fm> Chaitanya; The easiest way to work on your proposal is to write it in a public Google Doc and then share with the list. I don't yet have access to all of the Melange GSoC project and I'd imagine others who might have thoughts are in the same boat. As a side benefit it's also much easier to collaborate on editing and notes. Brad > Hi, > I have uploaded the first draft of my project proposal. 
I will add > more sections to the project plan in a day or two. Just wanted to have > the initial draft up. I hope to write a better proposal with your > feedback. > > Regards, > Chaitanya > > On Mon, Mar 26, 2012 at 4:37 PM, Brad Chapman wrote: > > > > Chaitanya; > > Thanks for the interest and specific questions. > > > >> 1. For the implementation of variants what would be better, to create > >> a new SeqVariant class from scratch or to extend the SeqFeature class > >> to accomodate variants? I guess a separate class would be better. > > > > My preference would be to see how far the SeqFeature class can take you > > before implementing a new class. It should be general enough to handle > > variant data, but the bigger challenge might be designing a lightweight > > representation that is compatible with existing SeqFeatures. > > > >> 2. While looking at the Biopython wiki I came across an implementation > >> of GFF at > >> https://github.com/chapmanb/bcbb/tree/master/gff > >> As GVF is an extension of GFF3, this module could be used for reading > >> GVF's too. Is this module a good start to modify it to support GVFs? > > > > That would be perfect. We're hoping to merge this into the Biopython > > code base before the next release. There is also an existing VCF parser > > we'd love to use here: > > > > https://github.com/jamescasbon/PyVCF > > > >> 3. I've been going through the VCF documentation and SNPs, insertions > >> and deletions can be represented just like it is done in VCF, the > >> object would have a start position, length of reference sequence(no > >> need to store this sequence) and a list of alternate sequence objects. > >> I have to still look into the SV(Structural variants), rearrangements > >> and imprecise variant information, so this representation is only for > >> SNPs and small indels. 
The GVF has a very similar format for small > >> indels and SNPs, just that it provides an extra end position column > >> which is not required if we have the reference sequence. > > > > This sounds good. My general suggestion is to start writing your > > proposal as soon as possible. A concrete first draft will help with more > > detailed comments. The wiki has good information on the project plan: > > > > http://open-bio.org/wiki/Google_Summer_of_Code#When_you_apply > > > > and the NESCent wiki has some examples of well-written proposals from > > previous years: > > > > http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2012#Writing_your_application > > > > One of the key aspects is having a detailed week-by-week outline of your > > plans for the summer. > > > > Thanks again for the interest, > > Brad From chaitanya.talnikar at iitb.ac.in Wed Mar 28 06:19:04 2012 From: chaitanya.talnikar at iitb.ac.in (Chaitanya Talnikar) Date: Wed, 28 Mar 2012 15:49:04 +0530 Subject: [Biopython] GSoC project application: Representation and manipulation of genomic variants: Some questions In-Reply-To: <874nt9k11m.fsf@fastmail.fm> References: <87398vmxhj.fsf@fastmail.fm> <874nt9k11m.fsf@fastmail.fm> Message-ID: Here's the google doc link, I have made it editable too. https://docs.google.com/document/d/12N1aEzagMZ8akc1mrfP4MxHdILT2wapjENJOoxZBIh0/edit On Wed, Mar 28, 2012 at 6:13 AM, Brad Chapman wrote: > > Chaitanya; > The easiest way to work on your proposal is to write it in a > public Google Doc and then share with the list. I don't yet have access > to all of the Melange GSoC project and I'd imagine others who might > have thoughts are in the same boat. As a side benefit it's also much > easier to collaborate on editing and notes. > > Brad > >> Hi, >> I have uploaded the first draft of my project proposal. I will add >> more sections to the project plan in a day or two. Just wanted to have >> the initial draft up. 
I hope to write a better proposal with your >> feedback. >> >> Regards, >> Chaitanya >> >> On Mon, Mar 26, 2012 at 4:37 PM, Brad Chapman wrote: >> > >> > Chaitanya; >> > Thanks for the interest and specific questions. >> > >> >> 1. For the implementation of variants what would be better, to create >> >> a new SeqVariant class from scratch or to extend the SeqFeature class >> >> to accomodate variants? I guess a separate class would be better. >> > >> > My preference would be to see how far the SeqFeature class can take you >> > before implementing a new class. It should be general enough to handle >> > variant data, but the bigger challenge might be designing a lightweight >> > representation that is compatible with existing SeqFeatures. >> > >> >> 2. While looking at the Biopython wiki I came across an implementation >> >> of GFF at >> >> https://github.com/chapmanb/bcbb/tree/master/gff >> >> As GVF is an extension of GFF3, this module could be used for reading >> >> GVF's too. Is this module a good start to modify it to support GVFs? >> > >> > That would be perfect. We're hoping to merge this into the Biopython >> > code base before the next release. There is also an existing VCF parser >> > we'd love to use here: >> > >> > https://github.com/jamescasbon/PyVCF >> > >> >> 3. I've been going through the VCF documentation and SNPs, insertions >> >> and deletions can be represented just like it is done in VCF, the >> >> object would have a start position, length of reference sequence(no >> >> need to store this sequence) and a list of alternate sequence objects. >> >> I have to still look into the SV(Structural variants), rearrangements >> >> and imprecise variant information, so this representation is only for >> >> SNPs and small indels. The GVF has a very similar format for small >> >> indels and SNPs, just that it provides an extra end position column >> >> which is not required if we have the reference sequence. >> > >> > This sounds good. 
My general suggestion is to start writing your
>> > proposal as soon as possible. A concrete first draft will help with more
>> > detailed comments. The wiki has good information on the project plan:
>> >
>> > http://open-bio.org/wiki/Google_Summer_of_Code#When_you_apply
>> >
>> > and the NESCent wiki has some examples of well-written proposals from
>> > previous years:
>> >
>> > http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2012#Writing_your_application
>> >
>> > One of the key aspects is having a detailed week-by-week outline of your
>> > plans for the summer.
>> >
>> > Thanks again for the interest,
>> > Brad

From ferreirafm at usp.br Wed Mar 28 08:54:44 2012
From: ferreirafm at usp.br (ferreirafm at usp.br)
Date: Wed, 28 Mar 2012 09:54:44 -0300
Subject: [Biopython] help with NCBIXML.parse
Message-ID: <20120328095444.663777w48sm3kpp0@webmail.usp.br>

Hi there,
What am I doing wrong in the following piece of code?
Thanks in advance,
Fred

#### CODE ####
def blast_cmd(query_seq):
    outf = open('blast_out.xml', 'w')
    for subj_seq in glob.iglob('emm*.fasta'):
        blast_cline = NcbiblastpCommandline(cmd = "blastp",
                                            task = "blastp-short",
                                            query = query_seq,
                                            subject = subj_seq,
                                            ungapped = True,
                                            comp_based_stats = "0",
                                            max_target_seqs = "1",
                                            matrix = "PAM30",
                                            outfmt = "5")
        stdout, stderr = blast_cline()
        outf.write(stdout)
    outf.close()
    handle = open("blast_out.xml")
    blast_records = NCBIXML.parse(handle)
    for record in blast_records:
        print record

#### RESULTS ####
$ run_blast.py --blast query.fasta
Traceback (most recent call last):
  File "/home/ferreirafm/bin/redundancy.py", line 121, in <module>
    main()
  File "/home/ferreirafm/bin/redundancy.py", line 106, in main
    blast_cmd(query_seq)
  File "/home/ferreirafm/bin/redundancy.py", line 63, in blast_cmd
    for record in blast_records:
  File "/usr/lib64/python2.7/site-packages/Bio/Blast/NCBIXML.py", line 652, in parse
    expat_parser.Parse(text, False)
xml.parsers.expat.ExpatError: junk after document element: line 88, column 14

From
p.j.a.cock at googlemail.com Wed Mar 28 09:08:28 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 28 Mar 2012 14:08:28 +0100 Subject: [Biopython] help with NCBIXML.parse In-Reply-To: <20120328095444.663777w48sm3kpp0@webmail.usp.br> References: <20120328095444.663777w48sm3kpp0@webmail.usp.br> Message-ID: On Wed, Mar 28, 2012 at 1:54 PM, wrote: > Hi there, > What I'm doing wrong in the following piece of code? > Thanks in advance, > Fred You seem to be calling BLAST multiple times in a loop and trying to give it SeqRecord objects. It wants FASTA files, and you can call BLAST once with a single FASTA query file (containing multiple records) and a single database or FASTA subject file (also containing multiple records). As to the specific error, did you look at your blast_out.xml file and what it said on line 88? Peter From ferreirafm at usp.br Wed Mar 28 10:03:51 2012 From: ferreirafm at usp.br (ferreirafm at usp.br) Date: Wed, 28 Mar 2012 11:03:51 -0300 Subject: [Biopython] help with NCBIXML.parse In-Reply-To: References: <20120328095444.663777w48sm3kpp0@webmail.usp.br> Message-ID: <20120328110351.6152656322vvivd3@webmail.usp.br> Hi Peter, Thanks for answer. Citando Peter Cock : > You seem to be calling BLAST multiple times in a loop and > trying to give it SeqRecord objects. Yes, because I want just only one hit per sequence. If someone has a overcome to this, it would be great. If a run it with a multiple fasta file, I'll take several hits per sequence. 
Like this:

P02977  emm1.22.pep   100.00  2  0  0  15  16   90   91  9.4    9.2
P02977  emm1.22.pep   100.00  2  0  0  14  15  104  105  9.4    9.2
P02977  emm1-2.3.pep   62.50  8  3  0   8  15  196  203  0.033  17.5
P02977  emm1.23.pep    62.50  8  3  0   8  15  196  203  0.033  17.5
P02977  emm1-2.4.pep  100.00  2  0  0  15  16   99  100  5.0    9.2
P02977  emm1.24.pep   100.00  2  0  0  15  16   88   89  7.5    9.2
P02977  emm1.24.pep   100.00  2  0  0  14  15  102  103  7.5    9.2
P02977  emm1.25.pep   100.00  2  0  0  15  16   81   82  4.3    9.2

> It wants FASTA files,
> and you can call BLAST once with a single FASTA query
> file (containing multiple records) and a single database or
> FASTA subject file (also containing multiple records).
>
> As to the specific error, did you look at your blast_out.xml
> file and what it said on line 88?
>

line 88 is a second "header" of the xml file. It seems xmlparse can't handle it.

> Peter
>

From p.j.a.cock at googlemail.com Wed Mar 28 10:19:05 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 28 Mar 2012 15:19:05 +0100
Subject: [Biopython] help with NCBIXML.parse
In-Reply-To: <20120328110351.6152656322vvivd3@webmail.usp.br>
References: <20120328095444.663777w48sm3kpp0@webmail.usp.br>
 <20120328110351.6152656322vvivd3@webmail.usp.br>
Message-ID:

On Wed, Mar 28, 2012 at 3:03 PM, wrote:
> Quoting Peter Cock :
>> You seem to be calling BLAST multiple times in a loop and
>> trying to give it SeqRecord objects.
>
>
> Yes, because I want just only one hit per sequence. If someone has a
> overcome to this, it would be great. If a run it with a multiple fasta file,
> I'll take several hits per sequence. Like this:
>
> ...

Try using the -max_target_seqs argument.

>> As to the specific error, did you look at your blast_out.xml
>> file and what it said on line 88?
>
> line 88 is a second "header" of the xml file. It seems xmlparse can't handle
> it.
>
>
> "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
>

That is not allowed in XML.
On re-reading your code, I see this happens because you are effectively concatenating the output for several BLAST runs (via stdout) into the one file. Historically the NCBI BLAST tools used to do something like this but with on a new line, so we do have some special case code to cope with that. You could try making this small change: outf.write(stdout) to: outf.write(stdout) outf.write("\n") That might work. However that isn't an elegant solution because if it works it relies on some special case code in Biopython for an NCBI bug. Instead you could parse each output inside the for loop? Peter From ferreirafm at usp.br Wed Mar 28 10:59:52 2012 From: ferreirafm at usp.br (ferreirafm at usp.br) Date: Wed, 28 Mar 2012 11:59:52 -0300 Subject: [Biopython] help with NCBIXML.parse In-Reply-To: References: <20120328095444.663777w48sm3kpp0@webmail.usp.br> <20120328110351.6152656322vvivd3@webmail.usp.br> Message-ID: <20120328115952.76164qh9rj1aopyg@webmail.usp.br> Citando Peter Cock : > Try using the -max_target_seqs argument. I have already tried it. I issued blast with -max_target_seq 1 on a muit-fasta file. See resulats my last post. > > That is not allowed in XML. On re-reading your code, I see > this happens because you are effectively concatenating the > output for several BLAST runs (via stdout) into the one file. > > Historically the NCBI BLAST tools used to do something like > this but with on a new line, so we do > have some special case code to cope with that. You could > try making this small change: > > outf.write(stdout) > > to: > > outf.write(stdout) > outf.write("\n") Yep, it works. > > That might work. However that isn't an elegant solution > because if it works it relies on some special case code > in Biopython for an NCBI bug. > > Instead you could parse each output inside the for loop? 
That's a solution, but this way I would have to do it several times,
which would be even less Pythonic.

>
> Peter
>

From p.j.a.cock at googlemail.com Wed Mar 28 11:26:46 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 28 Mar 2012 16:26:46 +0100
Subject: [Biopython] help with NCBIXML.parse
In-Reply-To: <20120328115952.76164qh9rj1aopyg@webmail.usp.br>
References: <20120328095444.663777w48sm3kpp0@webmail.usp.br>
 <20120328110351.6152656322vvivd3@webmail.usp.br>
 <20120328115952.76164qh9rj1aopyg@webmail.usp.br>
Message-ID:

On Wed, Mar 28, 2012 at 3:59 PM, wrote:
> Quoting Peter Cock :
>
>> Try using the -max_target_seqs argument.
>
> I have already tried it.
> I issued blast with -max_target_seq 1 on a multi-FASTA file. See the results in my
> last post.
>

What does 'print blast_cline' give? i.e. What is the actual command
being called.

Which version of NCBI BLAST+ are you using?

Peter

From ferreirafm at usp.br Wed Mar 28 11:32:38 2012
From: ferreirafm at usp.br (ferreirafm at usp.br)
Date: Wed, 28 Mar 2012 12:32:38 -0300
Subject: [Biopython] help with NCBIXML.parse
In-Reply-To:
References: <20120328095444.663777w48sm3kpp0@webmail.usp.br>
 <20120328110351.6152656322vvivd3@webmail.usp.br>
 <20120328115952.76164qh9rj1aopyg@webmail.usp.br>
Message-ID: <20120328123238.63565uv5g501qrkm@webmail.usp.br>

> Quoting Peter Cock :
>
> What does 'print blast_cline' give? i.e. What is the actual command
> being called.

A long list of:

...

>
> Which version of NCBI BLAST+ are you using?
2.2.26+ > > Peter > From p.j.a.cock at googlemail.com Wed Mar 28 11:37:40 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 28 Mar 2012 16:37:40 +0100 Subject: [Biopython] help with NCBIXML.parse In-Reply-To: <20120328123238.63565uv5g501qrkm@webmail.usp.br> References: <20120328095444.663777w48sm3kpp0@webmail.usp.br> <20120328110351.6152656322vvivd3@webmail.usp.br> <20120328115952.76164qh9rj1aopyg@webmail.usp.br> <20120328123238.63565uv5g501qrkm@webmail.usp.br> Message-ID: On Wed, Mar 28, 2012 at 4:32 PM, wrote: > > > >> Citando Peter Cock : >> >> >> What does 'print blast_cline' give? i.e. What is the actual command >> being called. > > > A long list of: > > > > > > > ... That's the BLAST parser's output - I mean the NcbiblastpCommandline object you assigned to variable blast_cline. >> Which version of NCBI BLAST+ are you using? > > 2.2.26+ Huh, that is the latest. I'm still using 2.2.25+ here. Peter From ferreirafm at usp.br Wed Mar 28 13:31:38 2012 From: ferreirafm at usp.br (ferreirafm at usp.br) Date: Wed, 28 Mar 2012 14:31:38 -0300 Subject: [Biopython] help with NCBIXML.parse In-Reply-To: References: <20120328095444.663777w48sm3kpp0@webmail.usp.br> <20120328110351.6152656322vvivd3@webmail.usp.br> <20120328115952.76164qh9rj1aopyg@webmail.usp.br> <20120328123238.63565uv5g501qrkm@webmail.usp.br> Message-ID: <20120328143138.46662ec6ytufdgca@webmail.usp.br> Citando Peter Cock : > That's the BLAST parser's output - I mean the NcbiblastpCommandline > object you assigned to variable blast_cline. In my last post I meant "It passed" the print loop. However, I don't know what to do with this. I was waiting for the blast results from the alignment when printing a blast record. It isn't it? > Huh, that is the latest. I'm still using 2.2.25+ here. > > Peter > and...??? 
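
Peter's diagnosis earlier in this thread (several complete XML reports written back to back into one file) can be reproduced with the standard library alone. The sketch below is only an illustration and not Biopython's own handling: `split_xml_documents` is a hypothetical helper name, and the two tiny documents merely stand in for real BLAST XML reports. It shows that the concatenation does not parse as one document, while each piece parses on its own, which is the idea behind parsing each BLAST output inside the loop.

```python
import xml.etree.ElementTree as ET

XML_DECL = '<?xml version="1.0"?>'


def split_xml_documents(text):
    """Split text holding several concatenated XML documents into a list.

    Hypothetical helper: it assumes every document starts with an
    XML declaration ("<?xml ...?>").
    """
    parts = text.split("<?xml")
    # The chunk before the first declaration is empty; drop it.
    return ["<?xml" + part for part in parts[1:] if part.strip()]


# Two toy documents standing in for two BLAST reports (not real BLAST output).
doc = XML_DECL + "\n<BlastOutput><hit>{}</hit></BlastOutput>\n"
combined = doc.format("emm1.22") + doc.format("emm1.23")

# Parsing the concatenation as a single document fails.
try:
    ET.fromstring(combined)
    parsed_as_one = True
except ET.ParseError:
    parsed_as_one = False

# Parsing each document separately works.
hits = [ET.fromstring(d).findtext("hit") for d in split_xml_documents(combined)]
print(parsed_as_one, hits)
```

In the script from this thread, the same effect could presumably be had without any splitting by parsing each `stdout` string as soon as BLAST returns it, for example by wrapping it in a `StringIO` handle and passing that to `NCBIXML.read`, instead of appending every report to one file first.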
From alfonso.esposito1983 at hotmail.it Thu Mar 29 10:12:53 2012
From: alfonso.esposito1983 at hotmail.it (fonz esposito)
Date: Thu, 29 Mar 2012 16:12:53 +0200
Subject: [Biopython] Blast sequences and SNPs detection
Message-ID:

Dear All,

I am Alfonso Esposito, a PhD student in environmental microbiology, and I am quite new to the Python community. I am trying to figure out how to write a script, but I am going mad. I need a script that takes as input a FASTA file with N sequences, BLASTs them against the nucleotide collection at NCBI, and delivers an output file listing each SNP or gap with the corresponding nucleotide position (for example, position 123 A->G, or a gap between 145 and 146)...

Thanks everybody, and I hope to receive your answer.

Regards
Alfonso

From p.j.a.cock at googlemail.com Thu Mar 29 10:27:24 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 29 Mar 2012 15:27:24 +0100
Subject: [Biopython] Blast sequences and SNPs detection
In-Reply-To:
References:
Message-ID:

On Thu, Mar 29, 2012 at 3:12 PM, fonz esposito wrote:
>
> Dear All,
>
> I am Alfonso Esposito, a PhD student in environmental microbiology, and I am
> quite new to the Python community. I am trying to figure out how to write a script,
> but I am going mad. I need a script that takes as input a FASTA file with N
> sequences, BLASTs them against the nucleotide collection at NCBI, and delivers an
> output file listing each SNP or gap with the corresponding nucleotide position (for
> example, position 123 A->G, or a gap between 145 and 146)... Thanks everybody,
> and I hope to receive your answer.

Hello Alfonso,

I am confused about your aim here. Surely a dedicated SNP detection tool
would be more appropriate than BLAST? BLAST finds similar sequences;
it doesn't find SNPs. Are you hoping to take the matched sequences and
look up their annotation for SNPs?
Or are you wanting to treat BLAST pairwise sequence alignments as if there were alternative strains/alleles and interpret the differences as SNPs? Perhaps you plan to restrict your BLAST search to a known accession/reference genome? Also if your FASTA file with N sequence in it is actually high throughput sequencing reads (e.g. Illumina reads), you probably want to start with a mapping tool like BWA to do the alignment, not BLAST. Peter From zhigang.wu at email.ucr.edu Thu Mar 29 12:51:41 2012 From: zhigang.wu at email.ucr.edu (Zhigang Wu) Date: Thu, 29 Mar 2012 09:51:41 -0700 Subject: [Biopython] Biopython GSoC Proposal Message-ID: Hi Biopython community, Here I am posting my draft of proposal, in which I have proposed to implement the SearchIO module. Please follow the link to access it https://docs.google.com/document/d/15fkPAZfN2Ln8nMJr4Ad7lMscaGbKOiTaXcGpxxvIe3A/edit Any comments and remarks are welcome. Zhigang From p.j.a.cock at googlemail.com Thu Mar 29 14:31:21 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 29 Mar 2012 19:31:21 +0100 Subject: [Biopython] Blast sequences and SNPs detection In-Reply-To: References: Message-ID: I'm assuming Fonz meant to send this to the list, my reply is below. On Thu, Mar 29, 2012 at 7:21 PM, fonz esposito wrote: > Dear Peter and dear all, > > first of all thanks for answering me so quickly, then I will try to explain > better my problem: I have sequences from DGGE bands, they have some > mistakes, mainly invalid basecall so I need to blast every single sequence > (after trimming the first and last bases from the AB1) on NCBI, and then > compare it to the best hit, checking out every mismatch. This could be > automated, I did with biopython the blast and I can process the output but I > did not manage to indicate the exact nucleotide number and what the mismatch > is, and when there is a gap I don't exactly know how to tell the program to > output the gap location in the original sequence I blasted. 
> > I hope that I was clearer now, let me know if you can help me > > Alfonso So these are 'Sanger' capillary reads, and while you may have lots I'm guessing this is under 100 in all? In that case using BLAST is probably going to be OK - although depending on how many sequences you have you might want to run that locally rather than at the NCBI. Which database are you intending to search against? i.e. Do you know what organism your bands should be from (or even what kind of organism)? What are you trying to do with any suspect bases where your sequences differ from those in the database? I personally (if the number of sequences was quite small) might think about working directly from BLAST pairwise alignment to go back to the chromatogram in Chromas (or an equivalent tool) to see if the base call can be manually corrected, or is the difference appears to be real. Peter P.S. You can read the (trimmed) sequences from ABI/AB1 files directly within Biopython 1.58 or later. From chapmanb at 50mail.com Thu Mar 29 21:13:46 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 29 Mar 2012 21:13:46 -0400 Subject: [Biopython] GSoC project application: Representation and manipulation of genomic variants: Some questions In-Reply-To: References: <87398vmxhj.fsf@fastmail.fm> <874nt9k11m.fsf@fastmail.fm> Message-ID: <87wr626gc5.fsf@fastmail.fm> Chaitanya; Thanks for making this available. It's a great start and you need to work from here on being much more detailed in your project plan. I left specific comments in-line in the proposal. Let us know when you have a revised version and we can work more. Thanks again, Brad > Here's the google doc link, I have made it editable too. > > https://docs.google.com/document/d/12N1aEzagMZ8akc1mrfP4MxHdILT2wapjENJOoxZBIh0/edit > > On Wed, Mar 28, 2012 at 6:13 AM, Brad Chapman wrote: > > > > Chaitanya; > > The easiest way to work on your proposal is to write it in a > > public Google Doc and then share with the list. 
I don't yet have access > > to all of the Melange GSoC project and I'd imagine others who might > > have thoughts are in the same boat. As a side benefit it's also much > > easier to collaborate on editing and notes. > > > > Brad > > > >> Hi, > >> I have uploaded the first draft of my project proposal. I will add > >> more sections to the project plan in a day or two. Just wanted to have > >> the initial draft up. I hope to write a better proposal with your > >> feedback. > >> > >> Regards, > >> Chaitanya > >> > >> On Mon, Mar 26, 2012 at 4:37 PM, Brad Chapman wrote: > >> > > >> > Chaitanya; > >> > Thanks for the interest and specific questions. > >> > > >> >> 1. For the implementation of variants what would be better, to create > >> >> a new SeqVariant class from scratch or to extend the SeqFeature class > >> >> to accomodate variants? I guess a separate class would be better. > >> > > >> > My preference would be to see how far the SeqFeature class can take you > >> > before implementing a new class. It should be general enough to handle > >> > variant data, but the bigger challenge might be designing a lightweight > >> > representation that is compatible with existing SeqFeatures. > >> > > >> >> 2. While looking at the Biopython wiki I came across an implementation > >> >> of GFF at > >> >> https://github.com/chapmanb/bcbb/tree/master/gff > >> >> As GVF is an extension of GFF3, this module could be used for reading > >> >> GVF's too. Is this module a good start to modify it to support GVFs? > >> > > >> > That would be perfect. We're hoping to merge this into the Biopython > >> > code base before the next release. There is also an existing VCF parser > >> > we'd love to use here: > >> > > >> > https://github.com/jamescasbon/PyVCF > >> > > >> >> 3. 
I've been going through the VCF documentation and SNPs, insertions > >> >> and deletions can be represented just like it is done in VCF, the > >> >> object would have a start position, length of reference sequence(no > >> >> need to store this sequence) and a list of alternate sequence objects. > >> >> I have to still look into the SV(Structural variants), rearrangements > >> >> and imprecise variant information, so this representation is only for > >> >> SNPs and small indels. The GVF has a very similar format for small > >> >> indels and SNPs, just that it provides an extra end position column > >> >> which is not required if we have the reference sequence. > >> > > >> > This sounds good. My general suggestion is to start writing your > >> > proposal as soon as possible. A concrete first draft will help with more > >> > detailed comments. The wiki has good information on the project plan: > >> > > >> > http://open-bio.org/wiki/Google_Summer_of_Code#When_you_apply > >> > > >> > and the NESCent wiki has some examples of well-written proposals from > >> > previous years: > >> > > >> > http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2012#Writing_your_application > >> > > >> > One of the key aspects is having a detailed week-by-week outline of your > >> > plans for the summer. > >> > > >> > Thanks again for the interest, > >> > Brad From chapmanb at 50mail.com Thu Mar 29 21:15:27 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 29 Mar 2012 21:15:27 -0400 Subject: [Biopython] Biopython GSoC Proposal In-Reply-To: References: Message-ID: <87ty166g9c.fsf@fastmail.fm> Zhigang; > Here I am posting my draft of proposal, in which I have proposed to > implement the SearchIO module. Please follow the link to access it > https://docs.google.com/document/d/15fkPAZfN2Ln8nMJr4Ad7lMscaGbKOiTaXcGpxxvIe3A/edit Thanks for putting this together. You've got an excellent start. I added comments in the document on specific areas. 
Let us know if you have any questions or need followup on any points. Thanks again, Brad From anna.kostikova at gmail.com Fri Mar 30 10:49:22 2012 From: anna.kostikova at gmail.com (Anna Kostikova) Date: Fri, 30 Mar 2012 16:49:22 +0200 Subject: [Biopython] SLEN analogue in entrez.efetch/entrez.esearch Message-ID: Dear list members, Is there a parameter in entrez.efetch/entrez.esearch which would allow to only look for and download records with the maximum sequence length of ? e.g. an analogue to SLEN parameter of the web interface of the NCBI website. Thanks a lot in advance, Anna From p.j.a.cock at googlemail.com Fri Mar 30 11:10:45 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 30 Mar 2012 16:10:45 +0100 Subject: [Biopython] SLEN analogue in entrez.efetch/entrez.esearch In-Reply-To: References: Message-ID: On Fri, Mar 30, 2012 at 3:49 PM, Anna Kostikova wrote: > Dear list members, > > Is there a parameter in entrez.efetch/entrez.esearch which would allow > to only look for and download records with the maximum sequence length > of ? e.g. an analogue to SLEN parameter of the web > interface of the NCBI website. > > Thanks a lot in advance, > Anna For esearch, have you checked the available search fields using einfo - shown in the Biopython Tutorial and also here: http://news.open-bio.org/news/2009/06/ncbi-einfo-biopython/ Both the nucleotide and protein databases do include SLEN as a search field for sequence length. Have you tried including something like 123[SLEN] in your Entrez search term? For efetch with a sequence database you can use seq_start and seq_stop to retrieve just part of the sequence. 
But that would just crop it: http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchseq_help.html Peter From anna.kostikova at gmail.com Fri Mar 30 11:28:17 2012 From: anna.kostikova at gmail.com (Anna Kostikova) Date: Fri, 30 Mar 2012 17:28:17 +0200 Subject: [Biopython] SLEN analogue in entrez.efetch/entrez.esearch In-Reply-To: References: Message-ID: Thanks a lot Peter for the advice and the link - super useful trick! a very straightaway solution, indeed:) handle = Entrez.esearch(db="nucleotide",term=organism +'[ORGN] AND 100:1000[SLEN]') Thanks a lot again, Anna 2012/3/30 Peter Cock : > On Fri, Mar 30, 2012 at 3:49 PM, Anna Kostikova > wrote: >> Dear list members, >> >> Is there a parameter in entrez.efetch/entrez.esearch which would allow >> to only look for and download records with the maximum sequence length >> of ? e.g. an analogue to SLEN parameter of the web >> interface of the NCBI website. >> >> Thanks a lot in advance, >> Anna > > For esearch, have you checked the available search fields using > einfo - shown in the Biopython Tutorial and also here: > http://news.open-bio.org/news/2009/06/ncbi-einfo-biopython/ > > Both the nucleotide and protein databases do include SLEN as a > search field for sequence length. Have you tried including something > like 123[SLEN] in your Entrez search term? > > For efetch with a sequence database you can use seq_start and seq_stop > to retrieve just part of the sequence. But that would just crop it: > http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchseq_help.html > > Peter From p.j.a.cock at googlemail.com Fri Mar 30 11:41:55 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 30 Mar 2012 16:41:55 +0100 Subject: [Biopython] SLEN analogue in entrez.efetch/entrez.esearch In-Reply-To: References: Message-ID: On Fri, Mar 30, 2012 at 4:28 PM, Anna Kostikova wrote: > Thanks a lot Peter for the advice and the link - super useful trick! 
> a very straightaway solution, indeed:) > > handle = Entrez.esearch(db="nucleotide",term=organism +'[ORGN] AND > 100:1000[SLEN]') > > Thanks a lot again, > Anna Thanks for letting us know it worked :) How did you find the range trick you're using for the length search? Peter From anna.kostikova at gmail.com Fri Mar 30 12:55:39 2012 From: anna.kostikova at gmail.com (Anna Kostikova) Date: Fri, 30 Mar 2012 18:55:39 +0200 Subject: [Biopython] SLEN analogue in entrez.efetch/entrez.esearch In-Reply-To: References: Message-ID: > How did you find the range trick you're using for the length search? the idea came thanks to you. Essentially, the presence of SLEN term in your blog post on 'NCBI Entrez EInfo' pushed me to try the same syntax I'd use in perl or NCBI web interface. And it worked :) Anna 2012/3/30 Peter Cock : > On Fri, Mar 30, 2012 at 4:28 PM, Anna Kostikova > wrote: >> Thanks a lot Peter for the advice and the link - super useful trick! >> a very straightaway solution, indeed:) >> >> handle = Entrez.esearch(db="nucleotide",term=organism +'[ORGN] AND >> 100:1000[SLEN]') >> >> Thanks a lot again, >> Anna > > Thanks for letting us know it worked :) > > How did you find the range trick you're using for the length search? > > Peter From chris.mit7 at gmail.com Sat Mar 31 00:41:32 2012 From: chris.mit7 at gmail.com (Chris Mitchell) Date: Sat, 31 Mar 2012 00:41:32 -0400 Subject: [Biopython] GSOC Genome Variants proposal Message-ID: Hey everyone, Here's a draft of my proposal: https://docs.google.com/document/d/1DNm8NQmnP4fH8KvF9v107mo4__V-FqWtx8Vgsz5yNns/edit I've allowed comments to be put in. Please tear it to shreds :). 
Thanks, Chris From reece at harts.net Sat Mar 31 16:26:05 2012 From: reece at harts.net (Reece Hart) Date: Sat, 31 Mar 2012 13:26:05 -0700 Subject: [Biopython] GSOC Genome Variants proposal In-Reply-To: References: Message-ID: On Fri, Mar 30, 2012 at 9:41 PM, Chris Mitchell wrote: > Here's a draft of my proposal: > > > https://docs.google.com/document/d/1DNm8NQmnP4fH8KvF9v107mo4__V-FqWtx8Vgsz5yNns/edit > Thanks Chris. I'm reading this proposal and others this weekend. Thanks for submitting! -Reece From igorrcosta at hotmail.com Sat Mar 31 23:04:28 2012 From: igorrcosta at hotmail.com (Igor Rodrigues da Costa) Date: Sun, 1 Apr 2012 03:04:28 +0000 Subject: [Biopython] Back translation support in Biopython Message-ID: Hi, I am interested in participating in GSoC this summer. I would like to know if there is community support for a new project: Extending Seq class to add support to back translation of proteins (something like this: http://www.bork.embl.de/pal2nal/ ). If this project isn't strong enough at its own, it could be added to any existing project, or it could be complemented with others suggestions from the community. 
Thanks for your attention,Igor From mictadlo at gmail.com Thu Mar 1 01:40:02 2012 From: mictadlo at gmail.com (Mic) Date: Thu, 1 Mar 2012 11:40:02 +1000 Subject: [Biopython] samtools does not return correct exit code Message-ID: Hallo, Samtools does not return correct the exit code: import subprocess import logging import sys def run_cmd(args): if subprocess.call(args,shell=True) != 0: print 'hello' logging.error("Error copying sequence file args='%s'" % str(args)) return 1 print 'e', sys.stderr print 'o', sys.stdout return 0 def runSamtools( cmd ): '''run a samtools command''' try: retcode = subprocess.call(cmd, shell=True) print retcode if retcode < 0: print >>sys.stderr, "Child was terminated by signal", -retcode except OSError, e: print >>sys.stderr, "Execution failed:", e print run_cmd("samtools faidx ex1.fa") print runSamtools("samtools faidx ex1.fa") print 'Hello still alive' and as output I got: $ python p3.py open: No such file or directory [_razf_open] fail to open ex1.fa [fai_build] fail to open the FASTA file ex1.fa e ', mode 'w' at 0x7ffa4658d270> o ', mode 'w' at 0x7ffa4658d1e0> 0 open: No such file or directory [_razf_open] fail to open ex1.fa [fai_build] fail to open the FASTA file ex1.fa 0 None Hello still alive How can I get sure that all samtools commands were executed successfully? Thank you in advance. From mictadlo at gmail.com Thu Mar 1 07:50:23 2012 From: mictadlo at gmail.com (Mic) Date: Thu, 1 Mar 2012 17:50:23 +1000 Subject: [Biopython] coverage calculating from BAM Message-ID: Hello, How is it possible to calculate coverage from a BAM file in format eg. 10x coverage? Thank you in advance. 
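A note on the single-number reading of the question above: a figure such as "10x" is usually the mean depth, that is, the total number of aligned bases divided by the reference length. The following is a minimal Python 3 sketch of that arithmetic only. It assumes the aligned read lengths have already been extracted from the BAM file by some other means; the function name and its inputs are hypothetical and are not part of Biopython or pysam.

```python
def mean_coverage(read_lengths, reference_length):
    """Return mean depth: total aligned bases / reference length.

    read_lengths     -- iterable of aligned read lengths (hypothetical input)
    reference_length -- length of the reference sequence in bases
    """
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    return sum(read_lengths) / reference_length


# 1000 reads of 100 bases aligned to a 10,000 base reference
depth = mean_coverage([100] * 1000, 10000)
print("%.1fx coverage" % depth)
```

This ignores clipping, unmapped reads and uneven coverage, so it is only a rough summary figure.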
From mictadlo at gmail.com Thu Mar 1 08:14:04 2012
From: mictadlo at gmail.com (Mic)
Date: Thu, 1 Mar 2012 18:14:04 +1000
Subject: [Biopython] Google Summer of Code
Message-ID:

Hello,
Is it possible to use PyPy with:
* BioPython
* Pysam
* Matplotlib
* etc

If not, then it might be a good idea to get support for it with the help of
Google Summer of Code, because PyPy is getting faster and faster.

Cheers,

From p.j.a.cock at googlemail.com Thu Mar 1 11:00:01 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 1 Mar 2012 11:00:01 +0000
Subject: [Biopython] samtools does not return correct exit code
In-Reply-To:
References:
Message-ID:

On Thu, Mar 1, 2012 at 1:40 AM, Mic wrote:
> Hello,
> Samtools does not return the correct exit code:
>
> import subprocess
> import logging
> import sys
>
> def run_cmd(args):
>     if subprocess.call(args,shell=True) != 0:
>         print 'hello'
>         logging.error("Error copying sequence file args='%s'" %
>                       str(args))
>         return 1
>     print 'e', sys.stderr
>     print 'o', sys.stdout
>     return 0
>
>
> def runSamtools( cmd ):
>     '''run a samtools command'''
>
>     try:
>         retcode = subprocess.call(cmd, shell=True)
>         print retcode
>         if retcode < 0:
>             print >>sys.stderr, "Child was terminated by signal", -retcode
>     except OSError, e:
>         print >>sys.stderr, "Execution failed:", e
>
> print run_cmd("samtools faidx ex1.fa")
> print runSamtools("samtools faidx ex1.fa")
>
> print 'Hello still alive'
>
>
> and as output I got:
>
> $ python p3.py
> open: No such file or directory
> [_razf_open] fail to open ex1.fa
> [fai_build] fail to open the FASTA file ex1.fa
> e <open file '<stderr>', mode 'w' at 0x7ffa4658d270>
> o <open file '<stdout>', mode 'w' at 0x7ffa4658d1e0>
> 0
> open: No such file or directory
> [_razf_open] fail to open ex1.fa
> [fai_build] fail to open the FASTA file ex1.fa
> 0
> None
> Hello still alive
>
> How can I make sure that all samtools commands were executed successfully?
>
> Thank you in advance.

Hi Mic,

General Bioinformatics with Python questions are fine on the Biopython
mailing list, but I think this query might be better asked elsewhere.

Are you saying the samtools binary returns an error code 0 (success)
even when it fails? If so, that should be raised as a bug with samtools.

Alternatively pysam has support built in for calling the samtools commands.
I'm not sure exactly how that works internally (e.g. via subprocess or by
a C API call), but ask on the pysam mailing list.

Regards,

Peter

From p.j.a.cock at googlemail.com Thu Mar 1 11:02:33 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 1 Mar 2012 11:02:33 +0000
Subject: [Biopython] coverage calculating from BAM
In-Reply-To:
References:
Message-ID:

On Thu, Mar 1, 2012 at 7:50 AM, Mic wrote:
> Hello,
> How is it possible to calculate coverage from a BAM file in format eg.
> 10x coverage?
>
> Thank you in advance.

Normally we'd talk about coverage as it varies along the genome,
perhaps using a sliding window. This is often represented using
a wiggle file or a BigWig file - and there are scripts for computing
these from SAM/BAM alignments.

Are you looking for a single number for the entire BAM file?

Peter

P.S. What does this have to do with pysam or Biopython?
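To illustrate the sliding-window view of coverage described in the reply above, here is a minimal Python 3 sketch. It assumes the per-base depths are already available as a list of integers; how they are obtained from the BAM file is not covered here, and the function name, window size and step are hypothetical choices for illustration rather than anything from pysam or Biopython.

```python
def windowed_coverage(depths, window=100, step=100):
    """Return a list of (start, mean_depth) tuples, one per window.

    depths -- per-base depth values along one reference (hypothetical input)
    window -- number of bases averaged in each window
    step   -- distance between consecutive window starts
    """
    results = []
    for start in range(0, len(depths), step):
        chunk = depths[start:start + window]  # last window may be shorter
        results.append((start, sum(chunk) / len(chunk)))
    return results


# Toy example: five positions, windows of two bases
for start, mean_depth in windowed_coverage([10, 10, 20, 20, 0], window=2, step=2):
    print(start, mean_depth)
```

Setting step smaller than window gives overlapping windows; a wiggle-style track is essentially this kind of table written out per reference sequence.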
From p.j.a.cock at googlemail.com Thu Mar 1 11:11:31 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 1 Mar 2012 11:11:31 +0000
Subject: [Biopython] Google Summer of Code
In-Reply-To:
References:
Message-ID:

On Thu, Mar 1, 2012 at 8:14 AM, Mic wrote:
> Hello,
> Is it possible to use PyPy with:
> * BioPython
> * Pysam
> * Matplotlib
> * etc
>
> If not, then it might be a good idea to get support for it with the help of
> Google Summer of Code, because PyPy is getting faster and faster.

Most of Biopython is working under PyPy (ignoring the C
extensions, much like our situation under Jython). This was
mentioned in the release notice for Biopython 1.59 - early
adopters may be able to find other problems that we're not
aware of from the unit tests:
http://news.open-bio.org/news/2012/02/biopython-1-59-released/

I doubt there is enough work here alone to make a GSoC project.

I'm not sure about pysam under PyPy - but I would be interested
to know, because here interfacing with the samtools C code is
the essence of pysam. My impression from the PyPy mailing lists
is that calling external C libraries from PyPy is another area of
active work.

For matplotlib, you would need NumPy under PyPy. That is an
area of active work for the PyPy team who are currently trying
to re-implement a pure-python version of NumPy which they
are calling NumPyPy (originally it was called micronumpy)
sufficient for other libraries using just the Python numpy API
to run. A problem with this is that many Python libraries also use
the NumPy C API (e.g. bits of Biopython). See for example:

http://morepypy.blogspot.com/2012/01/numpypy-status-update.html
http://technicaldiscovery.blogspot.com/2011/10/thoughts-on-porting-numpy-to-pypy.html

I suggest reading the PyPy and NumPy mailing list archives for
more about this.
Peter From mictadlo at gmail.com Thu Mar 1 11:56:57 2012 From: mictadlo at gmail.com (Mic) Date: Thu, 1 Mar 2012 21:56:57 +1000 Subject: [Biopython] samtools does not return correct exit code In-Reply-To: References: Message-ID: Thank you, pysam has similar problems and I posted already a bug report. http://code.google.com/p/pysam/issues/detail?id=89 http://code.google.com/p/pysam/issues/detail?id=90 I am going to post on samtools mailing list the problem. Cheers, On Thu, Mar 1, 2012 at 9:00 PM, Peter Cock wrote: > On Thu, Mar 1, 2012 at 1:40 AM, Mic wrote: > > Hallo, > > Samtools does not return correct the exit code: > > > > import subprocess > > import logging > > import sys > > > > def run_cmd(args): > > if subprocess.call(args,shell=True) != 0: > > print 'hello' > > logging.error("Error copying sequence file args='%s'" % > > str(args)) > > return 1 > > print 'e', sys.stderr > > print 'o', sys.stdout > > return 0 > > > > > > def runSamtools( cmd ): > > '''run a samtools command''' > > > > try: > > retcode = subprocess.call(cmd, shell=True) > > print retcode > > if retcode < 0: > > print >>sys.stderr, "Child was terminated by signal", -retcode > > except OSError, e: > > print >>sys.stderr, "Execution failed:", e > > > > print run_cmd("samtools faidx ex1.fa") > > print runSamtools("samtools faidx ex1.fa") > > > > print 'Hello still alive' > > > > > > and as output I got: > > > > $ python p3.py > > open: No such file or directory > > [_razf_open] fail to open ex1.fa > > [fai_build] fail to open the FASTA file ex1.fa > > e ', mode 'w' at 0x7ffa4658d270> > > o ', mode 'w' at 0x7ffa4658d1e0> > > 0 > > open: No such file or directory > > [_razf_open] fail to open ex1.fa > > [fai_build] fail to open the FASTA file ex1.fa > > 0 > > None > > Hello still alive > > > > How can I get sure that all samtools commands were executed successfully? > > > > Thank you in advance. 
> > Hi Mic, > > General Bioinformatics with Python questions are fine on the Biopython > mailing list, but I think this qurey might be better asked elsewhere. > > Are you saying the samtools binary returns an error code 0 (success) > even when it fails? If so, that should be raised as a bug with samtools. > > Alternatively pysam has support built in for calling the samtools commands. > I'm not sure exactly how that works internally (e.g. via subprocess or by > a C API call), but ask on the pysam mailing list. > > Regards, > > Peter > From mrrizkalla at gmail.com Fri Mar 2 13:41:21 2012 From: mrrizkalla at gmail.com (Mariam Reyad Rizkallah) Date: Fri, 2 Mar 2012 15:41:21 +0200 Subject: [Biopython] Bio.Phylo bugs & pain points In-Reply-To: References: Message-ID: Dear Biopython list, I am facing similar problem with Phylo in the context of your thread. I am using Biopython 1.58 - Ubuntu 32 bit system. I have created a newick using phyml command-line and want to visualize it using Bio.Phylo. I read the newick, draw_ascii and draw_graphiz perfectly but not draw(). I have networkx, and pylab installed. my_view_tree = Phylo.read("myseq.phy_phyml_tree.txt", "newick") Phylo.draw_ascii(my_view_tree) my_view_tree_xml = my_view_tree.as_phyloxml() Phylo.draw(my_view_tree_xml, do_show=True, show_confidence=True, axes=None) *Error:* Traceback (most recent call last): File "itree/itree2/iTree2.py", line 563, in view_tree Phylo.draw(my_view_tree_xml, do_show=True, show_confidence=True, axes=None) AttributeError: 'module' object has no attribute 'draw' Thank you. On Sat, Feb 18, 2012 at 7:11 PM, Eric Talevich wrote: > On Sat, Feb 18, 2012 at 11:34 AM, Eric Talevich >wrote: > > > So -- do the trees drawn by Phylo.draw() look right? 
> > > > > Here's how to get a quick tree, using a test file from the Biopython source > distribution: > > >>> from Bio import Phylo > >>> tree = Phylo.read("Tests/PhyloXML/apaf.xml", "phyloxml") > >>> Phylo.draw(tree) > > > If you don't have the Tests/ directory, you can use any other Newick, Nexus > or PhyloXML tree; just change the file name and format name in the call to > Phylo.read(). > > Thanks, > Eric > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From mrrizkalla at gmail.com Fri Mar 2 13:48:15 2012 From: mrrizkalla at gmail.com (Mariam Reyad Rizkallah) Date: Fri, 2 Mar 2012 15:48:15 +0200 Subject: [Biopython] Bio.Phylo AttributeError: 'module' object has no attribute 'draw' Message-ID: > > Dear Biopython list, > > I am facing similar problem with Phylo in the context of your thread. I am > using Biopython 1.58 - Ubuntu 32 bit system. I have created a newick using > phyml command-line and want to visualize it using Bio.Phylo. I read the > newick, draw_ascii and draw_graphiz perfectly but not draw(). I have > networkx, and pylab installed. > > my_view_tree = Phylo.read("myseq.phy_phyml_tree.txt", "newick") > Phylo.draw_ascii(my_view_tree) > my_view_tree_xml = my_view_tree.as_phyloxml() > Phylo.draw(my_view_tree_xml, do_show=True, show_confidence=True, axes=None) > > *Error:* > > Traceback (most recent call last): > File "itree/itree2/iTree2.py", line 563, in view_tree > Phylo.draw(my_view_tree_xml, do_show=True, show_confidence=True, > axes=None) > AttributeError: 'module' object has no attribute 'draw' > > Thank you. 
Mariam From eric.talevich at gmail.com Fri Mar 2 14:53:24 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 2 Mar 2012 09:53:24 -0500 Subject: [Biopython] Bio.Phylo AttributeError: 'module' object has no attribute 'draw' In-Reply-To: References: Message-ID: On Fri, Mar 2, 2012 at 8:48 AM, Mariam Reyad Rizkallah wrote: > > > > Dear Biopython list, > > > > I am facing similar problem with Phylo in the context of your thread. I > am > > using Biopython 1.58 - Ubuntu 32 bit system. I have created a newick > using > > phyml command-line and want to visualize it using Bio.Phylo. I read the > > newick, draw_ascii and draw_graphiz perfectly but not draw(). I have > > networkx, and pylab installed.
> > > > my_view_tree = Phylo.read("myseq.phy_phyml_tree.txt", "newick") > > Phylo.draw_ascii(my_view_tree) > > my_view_tree_xml = my_view_tree.as_phyloxml() > > Phylo.draw(my_view_tree_xml, do_show=True, show_confidence=True, > axes=None) > > > > *Error:* > > > > Traceback (most recent call last): > > File "itree/itree2/iTree2.py", line 563, in view_tree > > Phylo.draw(my_view_tree_xml, do_show=True, show_confidence=True, > > axes=None) > > AttributeError: 'module' object has no attribute 'draw' > > > > > Thank you. > > Mariam > > Hi Mariam, Would you mind checking the version number of Biopython within the interpreter or script you're using? Like this: import Bio print Bio.__version__ The function Phylo.draw was part of Biopython 1.58, so the simplest explanation is that your script is using a different, older installation of Biopython that's also installed on your system. Alternatively, to get the best experience with Phylo.draw I'd recommend updating to the current Biopython 1.59. Hope that helps, Eric From mrrizkalla at gmail.com Fri Mar 2 15:27:51 2012 From: mrrizkalla at gmail.com (Mariam Reyad Rizkallah) Date: Fri, 2 Mar 2012 17:27:51 +0200 Subject: [Biopython] Bio.Phylo AttributeError: 'module' object has no attribute 'draw' In-Reply-To: References: Message-ID: Hi Eric, I was just upgraded to 1.59 and when I checked the version, it was 1.56!!!!! I can't believe that I didn't pay attention to that! Thank you very much. Mariam On Fri, Mar 2, 2012 at 4:53 PM, Eric Talevich wrote: > On Fri, Mar 2, 2012 at 8:48 AM, Mariam Reyad Rizkallah < > mrrizkalla at gmail.com> wrote: > >> > >> > Dear Biopython list, >> > >> > I am facing similar problem with Phylo in the context of your thread. I >> am >> > using Biopython 1.58 - Ubuntu 32 bit system. I have created a newick >> using >> > phyml command-line and want to visualize it using Bio.Phylo. I read the >> > newick, draw_ascii and draw_graphiz perfectly but not draw(). I have >> > networkx, and pylab installed. 
>> > >> > my_view_tree = Phylo.read("myseq.phy_phyml_tree.txt", "newick") >> > Phylo.draw_ascii(my_view_tree) >> > my_view_tree_xml = my_view_tree.as_phyloxml() >> > Phylo.draw(my_view_tree_xml, do_show=True, show_confidence=True, >> axes=None) >> > >> > *Error:* >> >> > >> > Traceback (most recent call last): >> > File "itree/itree2/iTree2.py", line 563, in view_tree >> > Phylo.draw(my_view_tree_xml, do_show=True, show_confidence=True, >> > axes=None) >> > AttributeError: 'module' object has no attribute 'draw' >> > >> > >> Thank you. >> >> Mariam >> >> > Hi Mariam, > > Would you mind checking the version number of Biopython within the > interpreter or script you're using? Like this: > > import Bio > print Bio.__version__ > > > The function Phylo.draw was part of Biopython 1.58, so the simplest > explanation is that your script is using a different, older installation of > Biopython that's also installed on your system. > > Alternatively, to get the best experience with Phylo.draw I'd recommend > updating to the current Biopython 1.59. > > Hope that helps, > Eric > From MatatTHC at gmx.de Sun Mar 4 10:44:56 2012 From: MatatTHC at gmx.de (Matthias Bernt) Date: Sun, 4 Mar 2012 11:44:56 +0100 Subject: [Biopython] Degenerated Codons In-Reply-To: References: <20120222150602.313750@gmx.net> Message-ID: Hi, its now implemented and tested. I would like to post it as Cookbook entry. How can I do this? Matthias 2012/2/24 Peter Cock > On Thu, Feb 23, 2012 at 9:09 PM, Matthias Bernt wrote: > > hi peter, > > > > Thank you for the suggestions. I will try to create the functions as > > suggested. > > Should I post them here? > > Sure - or on the wiki under a new 'Cookbook' entry? > http://biopython.org/wiki/Category:Cookbook > > > I think we keep it as it is at the moment. Performance is not so > important > > for me .. so far. > > Optimisation can still be done later. > > Of course :) > > Do you know the quote "premature optimization is the root of all evil"? 
>
> Peter
>

From mmokrejs at fold.natur.cuni.cz Sun Mar 4 18:09:28 2012
From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs)
Date: Sun, 04 Mar 2012 19:09:28 +0100
Subject: [Biopython] Write FASTA sequence on a single line
Message-ID: <4F53AFD8.1090506@fold.natur.cuni.cz>

Hi,
is there an option to tell the FASTA writer to write output with a
sequence on a single line (so that a FASTA entry would span just
two lines altogether)? I see it should be faster to eventually
parse using SeqIO because one would avoid calls for each line in
the FASTA input file.

In my code I have
for _record in SeqIO.parse(fastah, 'fasta'):

which boils down to biopython's:
append(line.rstrip().replace(" ","").replace("\r",""))

per every line with _sequence_.

Thank you for comments,
Martin

From w.arindrarto at gmail.com Sun Mar 4 18:46:51 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Sun, 4 Mar 2012 19:46:51 +0100
Subject: [Biopython] Write FASTA sequence on a single line
In-Reply-To: <4F53AFD8.1090506@fold.natur.cuni.cz>
References: <4F53AFD8.1090506@fold.natur.cuni.cz>
Message-ID:

Hi Martin,

A quick glance at Bio.SeqIO.FastaIO.FastaWriter shows that there is indeed
an option to set the line wrapping length. However, the regular writing
function that calls FastaWriter (SeqIO.write) only accepts three parameters
(sequence, handle, and format), so if you really want to use Biopython's
fasta writer, you should call FastaWriter directly.

For example, as shown in the docs:

from Bio.SeqIO.FastaIO import FastaWriter
writer = FastaWriter(open(outfile, 'w'), wrap=0)
writer.write_file(records)

Alternatively, you can iterate over the records manually and write them to
the output file like so:

with open(outfile, 'w') as target:
    for rec in records:  # records is the list containing your SeqRecord objects
        target.write('>%s\n' % rec.id)
        target.write('%s\n' % rec.seq.tostring())

Hope that helps!
Bow On Sun, Mar 4, 2012 at 19:09, Martin Mokrejs wrote: > Hi, > is there an option to tell FASTA writer to write output with a > sequence on a single line (so that a FASTA entry would span just > two lines altogether)? I see it should be faster to eventually > parse using SeqIO because one would avoid calls for each line in > the FASTAinput file. > > In my code I have > for _record in SeqIO.parse(fastah, 'fasta'): > > which boils down to biopython's: > append(line.rstrip().replace(" ","").replace("\r","")) > > per every line with _sequence_. > > Thank you for comments, > Martin > > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From mmokrejs at fold.natur.cuni.cz Sun Mar 4 19:15:27 2012 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Sun, 04 Mar 2012 20:15:27 +0100 Subject: [Biopython] Write FASTA sequence on a single line In-Reply-To: References: <4F53AFD8.1090506@fold.natur.cuni.cz> Message-ID: <4F53BF4F.4010907@fold.natur.cuni.cz> Hi Willis and Wibowo, yes, I also write the new fasta files myself, obeying the biopythons writer. Sometimes I even parse them myself. But mainly I wanted this issue raised up and get it implemented into biopython. And I hope that the argument that parsing of these files is faster will be valued as well. Not talking about the fact that one can use grep(1) to search through the sequences, which is impossible if the sequences are split over several lines. I would even say that one-line sequences should be default. ;)) Or at least if len() is below e.g. 2000. ;) But thanks for pointer to the direct FastaWriter use. I forgot about this and just had the feeling there was a way ... ;) Martin Wibowo Arindrarto wrote: > Hi Martin, > > A quick glance at Bio.SeqIO.FastaIO.FastaWriter shows that there is indeed an option to set the line wrapping length. 
However, the regular writing function that calls FastaWriter (SeqIO.write) only accepts three parameters (sequence, handle, and format), so if you really want to use Biopython's fasta writer, you should call FastaWriter directly. > > For example, as shown in the docs: > > from Bio.SeqIO.FastaIO import FastaWriter > writer = FastaWriter(open(outfile, 'w'), wrap=0) > writer.write_file(records) > > Alternatively, you can iterate over the records manually and write them to the output file like so: > > with open(outfile, 'w') as target: > for rec in records: # records is the list containing your SeqRecord objects > target.write('>%s\n' % rec.id ) > target.write('%s\n' % rec.seq.tostring()) > > > Hope that helps! > Bow > > > On Sun, Mar 4, 2012 at 19:09, Martin Mokrejs > wrote: > > Hi, > is there an option to tell FASTA writer to write output with a > sequence on a single line (so that a FASTA entry would span just > two lines altogether)? I see it should be faster to eventually > parse using SeqIO because one would avoid calls for each line in > the FASTAinput file. > > In my code I have > for _record in SeqIO.parse(fastah, 'fasta'): > > which boils down to biopython's: > append(line.rstrip().replace(" ","").replace("\r","")) > > per every line with _sequence_. > > Thank you for comments, > Martin > > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > From p.j.a.cock at googlemail.com Sun Mar 4 19:15:36 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 4 Mar 2012 19:15:36 +0000 Subject: [Biopython] Write FASTA sequence on a single line In-Reply-To: References: <4F53AFD8.1090506@fold.natur.cuni.cz> Message-ID: On Sun, Mar 4, 2012 at 6:46 PM, Wibowo Arindrarto wrote: > Hi Martin, > > A quick glance at Bio.SeqIO.FastaIO.FastaWriter shows > that there is indeed an option to set the line wrapping length. 
> However, the regular writing function that calls FastaWriter
> (SeqIO.write) only accepts three parameters (sequence,
> handle, and format), so if you really want to use Biopython's
> fasta writer, you should call FastaWriter directly.

Exactly. The top level SeqIO API is file format neutral, so
if you want to do something format specific, you have to
import and use the underlying parser/writer directly - in
this case Bio.SeqIO.FastaIO.FastaWriter as you showed.

Peter

From jordan.r.willis at Vanderbilt.Edu Sun Mar 4 18:35:58 2012
From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R)
Date: Sun, 4 Mar 2012 12:35:58 -0600
Subject: [Biopython] Write FASTA sequence on a single line
In-Reply-To: <4F53AFD8.1090506@fold.natur.cuni.cz>
References: <4F53AFD8.1090506@fold.natur.cuni.cz>
Message-ID:

I don't think there is an option, but you could possibly write it in the SeqIO class. I have to do this all the time and I just write it to a file manually:

for _record in SeqIO.parse(fastah, 'fasta'):
    # one header line and one sequence line per record
    open(outputfile, 'a').write(">" + _record.id + "\n" + str(_record.seq) + "\n")

You can of course format these to output to a file any way you want.

Jordan

On Mar 4, 2012, at 12:09 PM, Martin Mokrejs wrote:

> Hi,
> is there an option to tell FASTA writer to write output with a
> sequence on a single line (so that a FASTA entry would span just
> two lines altogether)? I see it should be faster to eventually
> parse using SeqIO because one would avoid calls for each line in
> the FASTAinput file.
>
> In my code I have
> for _record in SeqIO.parse(fastah, 'fasta'):
>
> which boils down to biopython's:
> append(line.rstrip().replace(" ","").replace("\r",""))
>
> per every line with _sequence_.
> > Thank you for comments, > Martin > > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From p.j.a.cock at googlemail.com Mon Mar 5 09:39:56 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 5 Mar 2012 09:39:56 +0000 Subject: [Biopython] Degenerated Codons In-Reply-To: References: <20120222150602.313750@gmx.net> Message-ID: On Sun, Mar 4, 2012 at 10:44 AM, Matthias Bernt wrote: > Hi, > > its now implemented and tested. I would like to post it as Cookbook entry. > How can I do this? > > Matthias It is a wiki, so register an account and you should be able to edit and add pages. Put this under the 'Cookbook' category, which just means adding [[category:Cookbook]] to the end, and it will then automatically appear here: http://biopython.org/wiki/Category:Cookbook Thanks, Peter From MatatTHC at gmx.de Mon Mar 5 15:57:13 2012 From: MatatTHC at gmx.de (Matthias Bernt) Date: Mon, 5 Mar 2012 16:57:13 +0100 Subject: [Biopython] Degenerated Codons In-Reply-To: References: <20120222150602.313750@gmx.net> Message-ID: Hi, when I try to register (http://biopython.org/wiki/Special:UserLogin/signup) I get an error: """ Permission error You do not have permission to create this user account, for the following reason: The action you have requested is limited to users in the group: Administrators. """ Matthias 2012/3/5 Peter Cock > > On Sun, Mar 4, 2012 at 10:44 AM, Matthias Bernt wrote: > > Hi, > > > > its now implemented and tested. I would like to post it as Cookbook entry. > > How can I do this? > > > > Matthias > > It is a wiki, so register an account and you should be able to edit > and add pages. 
Put this under the 'Cookbook' category, which just > means adding [[category:Cookbook]] to the end, and it will then > automatically appear here: > > http://biopython.org/wiki/Category:Cookbook > > Thanks, > > Peter From p.j.a.cock at googlemail.com Mon Mar 5 16:12:42 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 5 Mar 2012 16:12:42 +0000 Subject: [Biopython] Degenerated Codons In-Reply-To: References: <20120222150602.313750@gmx.net> Message-ID: On Mon, Mar 5, 2012 at 3:57 PM, Matthias Bernt wrote: > Hi, > > when I try to register > (http://biopython.org/wiki/Special:UserLogin/signup) I get an error: > > """ > Permission error > You do not have permission to create this user account, for the > following reason: > The action you have requested is limited to users in the group: Administrators. > """ > > Matthias Very odd. That shouldn't happen - other people have managed to create accounts recently. I can try to create an account for you if you like - email me directly with desired username and email details. Peter From jordan.r.willis at Vanderbilt.Edu Tue Mar 6 03:37:30 2012 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Mon, 5 Mar 2012 21:37:30 -0600 Subject: [Biopython] MultiProcess SeqIO objects Message-ID: <0B7618F3-3F18-4838-92CD-A533CAE4117D@Vanderbilt.Edu> Hello BioPython, I was wondering if anyone has used the multiprocessing tool in conjunction with Biopython type objects? Here is my problem, I have 60 million sequences given in fastq format and I want to multiprocess these without having to iterate through the list multiple times. 
So I have something like this:

from multiprocessing import Pool
from Bio import SeqIO

input_handle = open("huge_fastaqf_file.fastq")


def convert_to_fasta(input):
    return [[record.id, record.seq.reverse_complement()] for record in SeqIO.parse(input, 'fastq')]

p = Pool(processes=4)
g = p.map(convert_to_fasta, input_handle)

for i in g:
    print i[0], i[1]

Unfortunately, it seems to divide up the handle by all the names and tries to make the input to the function convert_to_fasta the first line of input. What I want it to do is divide up the fastq object and run my function on 4 processors.

I can't figure out how in the world to do this though.

Thanks,
jordan

From from.d.putto at gmail.com Tue Mar 6 11:50:40 2012
From: from.d.putto at gmail.com (Sheila the angel)
Date: Tue, 6 Mar 2012 12:50:40 +0100
Subject: [Biopython] access ModBase using Biopython
Message-ID:

Hi all,
Is it possible to access ModBase using Biopython?
How can I retrieve a homology model by sequence from databases
like ModBase/SWISS-MODEL using Biopython?

Thanks
--
Sheila

From anaryin at gmail.com Tue Mar 6 11:53:43 2012
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 6 Mar 2012 12:53:43 +0100
Subject: [Biopython] access ModBase using Biopython
In-Reply-To:
References:
Message-ID:

Hi Sheila,

It is not possible to access ModBase through Biopython. You would have to
write your own script to interact with the webpage.

Best,

João [...] Rodrigues
http://nmr.chem.uu.nl/~joao

On 6 March 2012 12:50, Sheila the angel wrote:

> Hi all,
> Is it possible to access ModBase using Biopython?
> How can I retrieve homology model using sequence from the databases
> like ModBase/SWISS-MODEL using biopython?
>
> Thanks
> --
> Sheila

From chapmanb at 50mail.com Tue Mar 6 11:55:13 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 06 Mar 2012 06:55:13 -0500
Subject: [Biopython] MultiProcess SeqIO objects
In-Reply-To: <0B7618F3-3F18-4838-92CD-A533CAE4117D@Vanderbilt.Edu>
References: <0B7618F3-3F18-4838-92CD-A533CAE4117D@Vanderbilt.Edu>
Message-ID: <871up6ueou.fsf@fastmail.fm>

Jordan;

> I was wondering if anyone has used the multiprocessing tool in
> conjunction with Biopython type objects? Here is my problem, I have 60
> million sequences given in fastq format and I want to multiprocess
> these without having to iterate through the list multiple times.

Are you trying to make the parsing run in parallel, or some downstream
processing happen in parallel? The latter is definitely preferable if you
are looking for speed ups since the parsing will be primarily IO
bound. You can make the processing faster by avoiding using SeqIO
objects since the conversion of quality scores will take the most time.
Here is a working example:

from multiprocessing import Pool
from Bio.SeqIO.QualityIO import FastqGeneralIterator
from Bio.Seq import Seq

def do_something_with_record(info):
    name, seq = info
    return name, seq

def convert_to_fasta(in_handle):
    for rec_id, seq, _ in FastqGeneralIterator(in_handle):
        yield rec_id, str(Seq(seq).reverse_complement())

with open("example.fastq") as input_handle:
    p = Pool(processes=4)
    g = p.map(do_something_with_record, convert_to_fasta(input_handle))
for i in g:
    print i

Hope this helps,
Brad

> So I have something like this:
>
> from multiprocessing import Pool
> from Bio import SeqIO
>
> input_handle = open("huge_fastaqf_file.fastq")
>
>
> def convert_to_fasta(input):
>     return [[record.id, record.seq.reverse_complement()] for record in SeqIO.parse(input, 'fastq')]
>
> p = Pool(processes=4)
> g = p.map(convert_to_fasta, input_handle)
>
> for i in g:
>     print i[0], i[1]
>
> Unfortunately, it seems to divide up the handle by all the names and tries to make the input to the function convert_to_fasta the first line of input. What I want it to do is divide up the fastq object and run my function on 4 processors.
>
> I can't figure out how in the world to do this though.

From rbuels at gmail.com Tue Mar 6 16:00:00 2012
From: rbuels at gmail.com (Robert Buels)
Date: Tue, 06 Mar 2012 11:00:00 -0500
Subject: [Biopython] Google Summer of Code organization application submitted
Message-ID: <4F563480.1080808@gmail.com>

Hi all,

I'd like to let you know that I had a look through the GSoC wiki pages, and they look pretty good. Thank you very much, everyone who worked on them. I went ahead and submitted our application to Google for participation in GSoC 2012.

If you have any more ideas for projects that a Google-funded intern might work on this summer, now is the time to add them to the wiki at http://www.open-bio.org/wiki/Google_Summer_of_Code, and on your project's wiki page that is linked from there.
Google will most likely be evaluating these ideas within the next couple
of days.

If you are interested in helping out with Google Summer of Code this
year, now is the time to make sure you are listed on your project's wiki
as a prospective mentor. Also, please make sure you are a member of both
the OBF GSoC and OBF GSoC-Mentors email lists[1][2].

The better those project idea pages are, the stronger our case for
getting Google funding will be. Thanks a lot for all the hard work you
community members have put in so far!

Rob

----
Robert Buels
(prospective) 2012 OBF GSoC Organization Admin

[1] http://lists.open-bio.org/mailman/listinfo/gsoc
[2] http://lists.open-bio.org/mailman/listinfo/gsoc-mentors

From open-bio at wvr7.me.uk Tue Mar 6 19:47:56 2012
From: open-bio at wvr7.me.uk (Giles Weaver)
Date: Tue, 06 Mar 2012 19:47:56 +0000
Subject: [Biopython] Job opportunity: Head of Bioinformatics at Institute for Animal Health (Surrey, UK)
Message-ID: <1e04a705a4f5f350a28309dde7ee0376@wvr7.me.uk>

Dear All.

Please pass the following onto anyone who may be interested. Note the
closing date is the 16th March (next Friday!). For a pretty version of
the advert without mangled formatting please see
http://www.jobs.ac.uk/job/ADZ114/head-of-bioinformatics/.

Thanks,

Giles

HEAD OF BIOINFORMATICS

DRIVE AND SUPPORT QUANTITATIVE RESEARCH INTO THE VIRAL DISEASES OF ANIMALS

£42,769-£47,521
Ref: IRC43544

BASED: INSTITUTE FOR ANIMAL HEALTH, PIRBRIGHT LABORATORY, SURREY

Leading the bioinformatics team, you will provide support to IAH
scientists involved in quantitative biology, but will also have the
opportunity to pursue your own research. Areas of current interest
include modelling of virus evolution and host immune responses using
next-generation sequencing data; _in silico_ analysis of host genetics
and genomics data; and learning and predicting networks of biomolecular
interactions from post-genomic data sets.
In this high-profile role, you'll be expected to seek funds for new projects and continue your excellent track record of publication. Building collaborative research links with other members of IAH is encouraged. Holding a PhD or equivalent in a relevant branch of the biosciences, you will have experience in a recognised R&D environment. The ability to develop and manage relational databases is essential, so we would expect proficiency in MySQL (or similar), languages such as Perl or Python, and familiarity with R, Bioconductor or another statistical program. Experience of writing grant applications and managing staff would be helpful. The Institute for Animal Health (IAH) is an institute of the Biotechnology and Biological Sciences Research Council (BBSRC). We work to enhance the UK's capability to contain, control, and eliminate viral diseases in animals through highly innovative fundamental and applied bioscience. Informal enquiries about the post can be made to Simon Gubbins, Head of Mathematical Biology (simon.gubbins at iah.ac.uk [1]) APPLICATIONS ARE HANDLED BY THE RCUK SHARED SERVICES CENTRE; TO APPLY PLEASE VISIT OUR JOB BOARD AT HTTPS://EXT.SSC.RCUK.AC.UK [2] AND COMPLETE AN ONLINE APPLICATION FORM. APPLICANTS WHO WOULD LIKE TO RECEIVE THIS ADVERT IN AN ALTERNATIVE FORMAT (E.G. LARGE PRINT, BRAILLE, AUDIO OR HARD COPY), OR WHO ARE UNABLE TO APPLY ONLINE SHOULD CONTACT US BY TELEPHONE ON 01793 867003, PLEASE QUOTE REFERENCE NUMBER IRC43544. FOR MORE INFORMATION ABOUT THE IAH GO TO CLOSING DATE: 16TH MARCH 2012. 
Links: ------ [1] mailto:simon.gubbins at iah.ac.uk [2] https://ext.ssc.rcuk.ac.uk/ From mnemonico at posthocergopropterhoc.net Tue Mar 6 23:39:13 2012 From: mnemonico at posthocergopropterhoc.net (A M Torres, Hugo) Date: Tue, 6 Mar 2012 20:39:13 -0300 Subject: [Biopython] Entrez and SeqIO "no records found in handle" In-Reply-To: References: <1330032923.65491.YahooMailNeo@web15106.mail.cnb.yahoo.com> Message-ID: This one bit me while following the cookbook tonight. Explicitly setting retmode= 'xml' or 'html' fails. I think I read somewhere that 'text' is expected to break every now and then. Seems its the only retmode option that remains functional. -- .''`. Hugo A. M. Torres : :' : `. `' ?Talk is cheap, `- show me the code. ? -- L. Torvalds. 2012/2/23 Peter Cock > 2012/2/23 ??(Feng GAO) : > > Hi all, > > We have some python code using gi number to get record from Genbank. > > Part of the code is: > > > > handle = Entrez.efetch(db="protein", id=ID, rettype="gb") > > record = SeqIO.read(handle,"genbank") > > > > We have had no problem with this code > > until this week when we started getting "ValueError: No records found > in handle". > > Anyone have an idea how to fix it now? Thanks! > > Feng > > Try using an explicit retmode="text" in the efetch call. > The NCBI changed the defaults with EFetch 2.0, which > went live earlier this month. You're probably getting > XML back instead. > > Note to self: I wonder if the Biopython tutorial examples > need to be updated as well... 
>
> Peter
>
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
>

From p.j.a.cock at googlemail.com Wed Mar 7 08:59:47 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 7 Mar 2012 08:59:47 +0000
Subject: [Biopython] Entrez and SeqIO "no records found in handle"
In-Reply-To:
References: <1330032923.65491.YahooMailNeo@web15106.mail.cnb.yahoo.com>
Message-ID:

On Tuesday, March 6, 2012, A M Torres, Hugo <mnemonico at posthocergopropterhoc.net> wrote:
> This one bit me while following the cookbook tonight.
>
> Explicitly setting retmode= 'xml' or 'html' fails.
>
> I think I read somewhere that 'text' is expected to break every now and
> then. Seems it's the only retmode option that remains functional.
>

Do you have any specific examples? Even before this change Entrez calls
would sometimes time out or fail with network problems.

Peter

From mnemonico at posthocergopropterhoc.net Wed Mar 7 16:18:37 2012
From: mnemonico at posthocergopropterhoc.net (A M Torres, Hugo)
Date: Wed, 7 Mar 2012 13:18:37 -0300
Subject: [Biopython] meddling with GeneDiagram
Message-ID:

Fellow biopythoneers,

I've been playing around with GenomeDiagram trying to draw a gene's
features. My impression is that this is a very nifty tool indeed.

However, I see a problem in the way I can draw a gene, though it might
just be my inexperience as a user: the sigils don't automatically
distinguish between a FeatureLocation with a fuzzy position (e.g.
BeforePosition(0)) and a feature with an exact position (e.g.
ExactPosition(6475)).

As an example, suppose I would like to draw the genes from a SeqRecord
object built from the TP53 genbank file:

    from reportlab.lib import colors
    from Bio.Graphics import GenomeDiagram

    def draw_gene(seqrec):
        diagram = GenomeDiagram.Diagram(seqrec.id)
        gene_track = diagram.new_track(1, name='Genes: ')
        gene_set = gene_track.new_set()

        # Only the 'gene' features are drawn.
        genes = (i for i in seqrec.features if i.type == 'gene')
        color = colors.green
        for gene in genes:
            # Flip the label on the minus strand.
            if gene.strand == 1:
                angle = 0
            else:
                angle = 180
            gene_set.add_feature(gene,
                                 sigil='ARROW',
                                 color=color,
                                 arrowshaft_height=1,
                                 arrowhead_length=0.2,
                                 label=True,
                                 label_size=14,
                                 label_angle=angle,
                                 )
        diagram.draw(format='linear',
                     pagesize='A4',
                     fragments=1,
                     start=0,
                     end=len(seqrec)
                     )
        diagram.write('gene_diagram.svg', 'SVG')

The resulting image looks like gene_diagram.svg. There seems to be a
WRAP53 gene on the minus strand and the sigil represents it as a whole
gene, but it's only a portion of it. Maybe we could show that it's just
a piece by drawing the arrowhead pointing inwards instead of outwards,
as in gene_arrow.png.

Is that possible to implement?

--
 .''`.   Hugo A. M. Torres
: :'  :
`. `'    "Talk is cheap,
  `-      show me the code." -- L. Torvalds.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: gene_arrow.png
Type: image/png
Size: 8525 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gene_diagram.svg
Type: image/svg+xml
Size: 2316 bytes
Desc: not available
URL:

From p.j.a.cock at googlemail.com Wed Mar 7 16:39:19 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 7 Mar 2012 16:39:19 +0000
Subject: [Biopython] meddling with GeneDiagram
In-Reply-To:
References:
Message-ID:

On Wed, Mar 7, 2012 at 4:18 PM, A M Torres, Hugo wrote:
> Fellow biopythoneers,
>
> I've been playing around with GenomeDiagram trying to draw a gene's
> features. My impression is that this is a very nifty tool indeed.

Complex though ;)

> However, I see a problem in the way I can draw a gene, though it might
> just be my inexperience as a user: the sigils don't
> automatically distinguish between a FeatureLocation with a fuzzy
> position (e.g. BeforePosition(0)) and a feature with an exact position (e.g.
> ExactPosition(6475)).

No, it doesn't.
> As an example, suppose I would like to draw the genes from a SeqRecord object
> built from the TP53 genbank file:
> ...
>
> The resulting image looks like gene_diagram.svg. There seems to be a WRAP53
> gene on the minus strand and the sigil represents it as a whole gene, but
> it's only a portion of it. Maybe we could show that it's just a piece by
> drawing the arrowhead pointing inwards instead of outwards as in
> gene_arrow.png.
>
> Is that possible to implement?

Somewhat related is cropping of features only partly in view, and a
general 'jaggy' feature for showing truncation in some way. Leighton
and I did discuss the latter and there is an implementation on a branch
which didn't make it into Biopython 1.59 but could be in the next release.
At its simplest this is a sigil with a jagged edge at both ends, useful
for marking things like NNNNN regions in scaffolds/supercontigs, or
even perhaps repeat regions.

Dealing with the left and right ends of sigils generically would be
more powerful though, and more complex. That would be required for
your example - arrow head at one end, some kind of truncation marker
at the other.

We've also talked about other wish list ideas like exons and links,
frame aware placement, frame less placement, etc. All these kinds
of things only make sense at "high zoom" or if drawing small genomes
like viruses - while the original GenomeDiagram targeted entire bacteria
("low zoom", or "zoomed out") where you only needed and wanted
a simple box for each gene.

Peter

From p.j.a.cock at googlemail.com Wed Mar 7 17:32:57 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 7 Mar 2012 17:32:57 +0000
Subject: [Biopython] Entrez and SeqIO "no records found in handle"
In-Reply-To:
References: <1330032923.65491.YahooMailNeo@web15106.mail.cnb.yahoo.com>
Message-ID:

On Wed, Mar 7, 2012 at 5:22 PM, A M Torres, Hugo wrote:
> Hi Peter.
>
> More specifically, for instance,
>
> handle = Entrez.efetch(db='nucleotide',
>                        rettype='gb',
>                        retmode='xml',
>                        id=gene)
> record = SeqIO.read(handle, 'gb')
> handle.close()
>
> ====================================================
> ... ValueError: No records found in handle

Hi Hugo,

Getting an error here is good - there were no GenBank formatted records
in your file (while there should have been an XML record). Perhaps if
we expect this to be a common error a more specific exception would
be nicer? e.g. ValueError("This is XML, not GenBank plain text")

Maybe I don't understand what you are querying?

Peter

From nathaniel.echols at gmail.com Thu Mar 8 01:08:32 2012
From: nathaniel.echols at gmail.com (Nat Echols)
Date: Wed, 7 Mar 2012 18:08:32 -0700
Subject: [Biopython] server for automatic high-quality search & alignment?
Message-ID:

Hi list--

Does anyone know of a remotely callable web service similar to HHPred
in functionality - i.e. capable of running a homology search against
the PDB, and returning high-quality alignments? We're using NCBI
BLAST for this right now and will probably use the EBI's WU-BLAST
server in the future, but these are considered inferior to HHPred for
weak homologs. Unfortunately HHPred isn't something we can use from
Python, at least not in production code.

thanks,
Nat

From mnemonico at posthocergopropterhoc.net Thu Mar 8 02:26:39 2012
From: mnemonico at posthocergopropterhoc.net (A M Torres, Hugo)
Date: Wed, 7 Mar 2012 23:26:39 -0300
Subject: [Biopython] meddling with GeneDiagram
In-Reply-To:
References:
Message-ID:

On Wed, Mar 7, 2012 at 5:29 PM, Peter Cock wrote:
> Did you mean to go off the list?

No! My mistake. I meant to reply to all. Been a little too distracted
today. I will reply to all now.
> > > On Wednesday, March 7, 2012, A M Torres, Hugo < > mnemonico at posthocergopropterhoc.net> wrote: > > Hi Peter > > > >> Somewhat related is cropping of features only partly in view, and a > >> general 'jaggy' feature for showing truncation in some why. Leighton > >> and I did discuss the later and there is an implementation on a branch > >> which didn't make it into Biopython 1.59 but could be in the next > release. > >> At its simplest this is a sigil with a jagged edge at both ends, useful > >> for marking things like NNNNN regions in scaffolds/supercontigs, or > >> even perhaps repeat regions. > > > > Neat. Thats exactly what I am needing, a third kind of sigil to represent > > fuzzy ends. Glad to know it will be available. > > > >> > >> Dealing with the left and right ends of > >> sigils generically would be more powerful though, and more complex. > >> That would be required for your example - arrow head at one end, so > >> kind of truncation marker at the other. > > > > Yea that would work exactly as I'd expect. One sigil for each end, > > the shaft and the arrowhead. > > It would have a lot of uses :) > > > > Then we could automatically test whether or not to replace the > > user chosen sigil with the 'fuzzy' one: > > > > if not isinstance(gene.location.end, ExactPosition): > > if gene.strand == -1: > > shaft_sigil = 'FUZZY' > > else: > > arrowhead_sigil = 'FUZZY' > >> > >> We've also talked about other wish list ideas like exons and links, > >> frame aware placement, frame less placement, etc. All these kinds > >> of things only make sense a "high zoom" or if drawing small genomes > >> like viruses - while original GenomeDiagram targeted entire bacteria > >> ("low zoom", or "zoomed out") where you only needed and wanted > >> a simple box for each gene. > > > > I see. Sounds like pretty interesting stuff. Maybe I could help out > > but I will need some tutoring. Never worked on a an open source > > collaborative project before. 
Is the next-release code hosted on > > someplace like github? > > Yes, it is on GitHub - have a look at the links on our wiki pages. > https://github.com/biopython/biopython Alright, great. I have forked myself a copy. > If I could learn how to get and use the code without interfering > with my working installation of biopython (maybe using something > like virtualenv?) I've never used virtualenv, but I hear good things about it. > > Are you on Windows, Mac or Linux? > Debian testing (tends to be very up-to-date) > > I would gladly contribute some work. Let me know if I can be of hand. > > I don't want to put you off, but the GenomeDiagram code is > pretty complex... And right now probably only two people > can really be said to understand it (Leighton and myself). > No problem. I might try and have a look. I will try to use the virtualenv thing to experiment without breaking the system's biopython. I'll try first to contribute some small code changes just to get the hang of it. Then if you guys decide some of the changes are worthwhile you can incorporate them in the main project. This should be fun. > There are also two semi-duplicated areas of code, for > drawing linear and circular diagrams. In general, drawing > signals on circular diagrams is a LOT harder to implement. > > Right now the most important thing is actually the documentation, > something I managed to do a bit more of recently: > http://news.open-bio.org/news/2012/03/cross-links-in-genomediagram/ > > It is the graph functions that need doing next - perhaps by > adapting Leighton's old documentation from before GD > was integrated into Biopython. I mean bar charts, line > graphs and heat maps. > > Peter -- -- .''`. Hugo A. M. Torres : :' : `. `' ?Talk is cheap, `- show me the code. ? -- L. Torvalds. From idoerg at gmail.com Thu Mar 8 03:01:08 2012 From: idoerg at gmail.com (Iddo Friedberg) Date: Wed, 7 Mar 2012 22:01:08 -0500 Subject: [Biopython] server for automatic high-quality search & alignment? 
In-Reply-To: References: Message-ID: Did you try ffas? On Wed, Mar 7, 2012 at 8:08 PM, Nat Echols wrote: > Hi list-- > > Does anyone know of a remotely callable web service similar to HHPred > in functionality - i.e. capable of running a homology search against > the PDB, and returning high-quality alignments? We're using NCBI > BLAST for this right now and will probably use the EBI's WU-BLAST > server in the future, but these are considered inferior to HHPred for > weak homologs. Unfortunately HHPred isn't something we can use from > Python, at least not in production code. > > thanks, > Nat > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. From p.j.a.cock at googlemail.com Thu Mar 8 10:06:32 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 8 Mar 2012 10:06:32 +0000 Subject: [Biopython] Entrez and SeqIO "no records found in handle" In-Reply-To: References: <1330032923.65491.YahooMailNeo@web15106.mail.cnb.yahoo.com> Message-ID: On Wed, Mar 7, 2012 at 8:25 PM, A M Torres, Hugo wrote: >> >> Hi Hugo, >> >> Getting an error here is good - there were no GenBank formatted records >> in your file (while there should have been an XML record). Perhaps if >> we expect this to be a common error a more specific exception would >> be nicer? e.g. ValueError("This is XML, not GenBank plain text") >> >> Maybe I don't understand what you are querying? >> >> Peter > > You are right. I was expecting to SeqIO to read xml. If I want to parse xml > it seems I should have used Entrez.read instead. > > Sorry for the noise. 
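The more specific exception suggested earlier in this thread could be sketched roughly as follows. This is only an illustration and not Biopython's behaviour: `looks_like_xml` and `check_not_xml` are hypothetical helper names, and the check simply peeks at the start of the downloaded data before it is handed to a parser.

```python
import io


def looks_like_xml(text):
    """Return True if the text appears to start with XML markup."""
    return text.lstrip().startswith("<")


def check_not_xml(handle):
    """Read the handle fully and return a fresh text handle on the data.

    Raises ValueError with a specific message if the data looks like XML,
    for example because efetch was called with retmode="xml".
    """
    data = handle.read()
    if looks_like_xml(data):
        raise ValueError("This is XML, not GenBank plain text")
    return io.StringIO(data)


# Intended use (assumes Biopython and network access, so left as comments):
# handle = Entrez.efetch(db="nucleotide", id=gene, rettype="gb", retmode="text")
# record = SeqIO.read(check_not_xml(handle), "genbank")
```

Reading the whole response into memory is fine for single records; for large downloads a peek at the first line would be the gentler choice.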
No problem - thanks for clarifying this,

Peter

From ming.xue at boehringer-ingelheim.com Mon Mar 12 22:18:38 2012
From: ming.xue at boehringer-ingelheim.com (ming.xue at boehringer-ingelheim.com)
Date: Mon, 12 Mar 2012 18:18:38 -0400
Subject: [Biopython] efetch only returns 20 records of pubmed
In-Reply-To:
References:
Message-ID: <015E2E9BC1647A45B40F286F2BE01C4195FC7C@nahexm101.am.boehringer.com>

Hello,

I used Biopython 1.59 to download some PubMed abstracts. The query from
the browser showed 409580 records, but I only got a count of 20 from
record["IdList"], and they matched the records on the first page in the
browser. Am I blocked by NCBI, or is there a paging parameter I missed?

    from Bio import Entrez
    Entrez.email = 'my.email at domain.com'

    query = Entrez.esearch(db="pubmed", term="publisher[sb]")
    record = Entrez.read(query)
    print len(record["IdList"])

Thanks for your comments,
Ming

From winda002 at student.otago.ac.nz Mon Mar 12 22:55:20 2012
From: winda002 at student.otago.ac.nz (David Winter)
Date: Tue, 13 Mar 2012 11:55:20 +1300
Subject: [Biopython] efetch only returns 20 records of pubmed
In-Reply-To: <015E2E9BC1647A45B40F286F2BE01C4195FC7C@nahexm101.am.boehringer.com>
References: <015E2E9BC1647A45B40F286F2BE01C4195FC7C@nahexm101.am.boehringer.com>
Message-ID: <4F5E7ED8.80800@student.otago.ac.nz>

Hi Ming,

I think "retmax" is the parameter you are looking for. If you plan on
making some huge query, be sure to do it outside of peak times (US) and
think about using the WebEnv features ("Using the history and WebEnv"
section of the tutorial:
http://biopython.org/DIST/docs/tutorial/Tutorial) if you want to
download a lot of data.

Cheers,
David

On 3/13/2012 11:18 AM, ming.xue at boehringer-ingelheim.com wrote:
> Hello,
>
> I used the biopython 1.5.9 to download some pubmed abstracts. The query from
> browser showed 409580 records. But I only got the count of 20 from
> record["IdList"] and they matched the records on the first page from browser.
> Am I blocked by NCBI or there is a parameter for page I missed? > > from Bio import Entrez > Entrez.email = 'my.email at domain.com' > > query = Entrez.esearch(db="pubmed", term="publisher[sb]") > record = Entrez.read(query) > print len(record["IdList"]) > > Thanks for your comments, > Ming > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > From ming.xue at boehringer-ingelheim.com Tue Mar 13 02:53:48 2012 From: ming.xue at boehringer-ingelheim.com (ming.xue at boehringer-ingelheim.com) Date: Mon, 12 Mar 2012 22:53:48 -0400 Subject: [Biopython] efetch only returns 20 records of pubmed In-Reply-To: <4F5E7ED8.80800@student.otago.ac.nz> References: <015E2E9BC1647A45B40F286F2BE01C4195FC7C@nahexm101.am.boehringer.com> <4F5E7ED8.80800@student.otago.ac.nz> Message-ID: <015E2E9BC1647A45B40F286F2BE01C4195FC8D@nahexm101.am.boehringer.com> Hi David, Thanks for the quick tips and I certainly missed the tutorials. But I had more serious problem as I think I got denied. During my test of the examples in the section 8.15 of the Tutorials, my simple command of Entrez.einfo(db='pubmed') failed at 9:45 pm US EDT but the same command worked fine on my personal computer with a different IP. I emailed NCBI for clarification. Thanks, Ming -----Original Message----- From: biopython-bounces at lists.open-bio.org [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of David Winter Sent: Monday, March 12, 2012 6:55 PM To: biopython at lists.open-bio.org Subject: Re: [Biopython] efetch only returns 20 records of pubmed Hi Min, I think "retmax" is the parameter you are looking for. If you plan on making some huge query, be sure to do it outside of peak times (US) and think about using the WebEnv features ("Using the history and WebEnv" section of the tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial) if you want to download a lot of data. 
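The `retmax` advice above can be sketched as follows. This is a hedged illustration only: `batch_starts` and `fetch_all_ids` are hypothetical helper names, the batch size is an arbitrary choice, and NCBI may cap how many records a single search can page through, so the E-utilities documentation should be checked before relying on it for very large result sets.

```python
def batch_starts(total, batch_size):
    """Return the retstart offsets needed to page through `total` records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return list(range(0, total, batch_size))


def fetch_all_ids(term, db="pubmed", batch_size=500, email="my.email@example.com"):
    """Collect the IDs matching `term` by paging with retstart/retmax.

    Needs Biopython and network access; the import is done here so that
    the paging helper above can be used without either.
    """
    from Bio import Entrez

    Entrez.email = email
    # First ask only for the total number of hits.
    handle = Entrez.esearch(db=db, term=term, retmax=0)
    total = int(Entrez.read(handle)["Count"])
    handle.close()

    ids = []
    for start in batch_starts(total, batch_size):
        handle = Entrez.esearch(db=db, term=term,
                                retstart=start, retmax=batch_size)
        ids.extend(Entrez.read(handle)["IdList"])
        handle.close()
    return ids
```

For downloads of this size the history/WebEnv route mentioned above is likely the better fit, since it avoids repeating the search for every batch.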
Cheers, David On 3/13/2012 11:18 AM, ming.xue at boehringer-ingelheim.com wrote: > Hello, > > I used the biopython 1.5.9 to download some pubmed abstracts. The query from > browser showed 409580 records. But I only got the count of 20 from > record["IdList"] and they matched the records on the first page from browser. > Am I blocked by NCBI or there is a parameter for page I missed? > > from Bio import Entrez > Entrez.email = 'my.email at domain.com' > > query = Entrez.esearch(db="pubmed", term="publisher[sb]") > record = Entrez.read(query) > print len(record["IdList"]) > > Thanks for your comments, > Ming > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From zhigang.wu at email.ucr.edu Thu Mar 15 16:32:29 2012 From: zhigang.wu at email.ucr.edu (Zhigang Wu) Date: Thu, 15 Mar 2012 09:32:29 -0700 Subject: [Biopython] Documentation typo found Message-ID: Hi biopython community, Here I am reporting a minor typo in the tutorial of Bio.Entrez ( http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc114) and hope biopython administrator who have appropriate editing permission of that page to correct it. In the middle of above page, there are several lines of codes illustrating how to retrieve the information like author, source and title. I have pasted the original code below, in which the typo "CO" has been highlighted in red color, which should be corrected to "SO" Original Version: >>> for record in records: ... print "title:", record.get("TI", "?") ... print "authors:", record.get("AU", "?") ... print "source:", record.get("*CO*", "?") ... print Should be corrected to: >>> for record in records: ... print "title:", record.get("TI", "?") ... print "authors:", record.get("AU", "?") ... 
print "source:", record.get("*SO*", "?") ... print Zhigang PhD candidate in Plant Biology Department of Botany and Plant Sciences University of California Riverside, CA From zhigangwu.bgi at gmail.com Thu Mar 15 16:46:51 2012 From: zhigangwu.bgi at gmail.com (Zhigang Wu) Date: Thu, 15 Mar 2012 09:46:51 -0700 Subject: [Biopython] Bio.Entrez documentation typo found Message-ID: Hi biopython community, Sorry for duplicate posting if you see this post a second time. Here I am reporting a minor typo in the tutorial of Bio.Entrez ( http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc114) and hope biopython administrator who have appropriate editing permission of that page to correct it. In the middle of above page, there are several lines of codes illustrating how to retrieve the information like author, source and title. I have pasted the original code below, in which the typo "CO" has been highlighted in red color, which should be corrected to "SO" Original Version: >>> for record in records: ... print "title:", record.get("TI", "?") ... print "authors:", record.get("AU", "?") ... print "source:", record.get("*CO*", "?") ... print Should be corrected to: >>> for record in records: ... print "title:", record.get("TI", "?") ... print "authors:", record.get("AU", "?") ... print "source:", record.get("*SO*", "?") ... 
print Zhigang PhD candidate in Plant Biology Department of Botany and Plant Sciences University of California Riverside, CA From p.j.a.cock at googlemail.com Thu Mar 15 16:52:20 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 15 Mar 2012 16:52:20 +0000 Subject: [Biopython] Documentation typo found In-Reply-To: References: Message-ID: On Thu, Mar 15, 2012 at 4:32 PM, Zhigang Wu wrote: > Hi biopython community, > > Here I am reporting a minor typo in the tutorial of Bio.Entrez ( > http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc114) and hope > biopython administrator who have appropriate editing permission of that > page to correct it. In case you or anyone else reading is interested, the tutorial HTML and PDF files are generated from the LaTeX file Doc/Tutorial.tex in the Biopython source code: https://github.com/biopython/biopython/blob/master/Doc/Tutorial.tex LaTeX is an old markup language which is still very popular in the areas of mathematics and physics because of its excellent formula support. See http://www.latex-project.org/ for background. > In the middle of above page, there are several lines of codes illustrating > how to retrieve the information like author, source and title. I have > pasted the original code below, in which the typo "CO" has been highlighted > in red color, which should be corrected to "SO" I think colors and special fonts in HTML emails may get turned into plain text by the mailing list. But I understood. > Original Version: > >>>> for record in records: > ... ? ? print "title:", record.get("TI", "?") > ... ? ? print "authors:", record.get("AU", "?") > ... ? ? print "source:", record.get("*CO*", "?") > ... ? ? print > > > Should be corrected to: > >>>> for record in records: > ... ? ? print "title:", record.get("TI", "?") > ... ? ? print "authors:", record.get("AU", "?") > ... ? ? print "source:", record.get("*SO*", "?") > ... ? ? 
print > > Zhigang This is in the Pubmed and Medline parsing example from the Entrez chapter, and yes, you are quite right. Fixed: https://github.com/biopython/biopython/commit/998bffdc7a67297b22fac96e1c810297a32f0e36 Thank you, Peter From golubchi at stats.ox.ac.uk Fri Mar 16 12:13:29 2012 From: golubchi at stats.ox.ac.uk (Tanya Golubchik) Date: Fri, 16 Mar 2012 12:13:29 +0000 Subject: [Biopython] Bio.Phylo: writing newick trees with internal node names Message-ID: <4F632E69.8010906@stats.ox.ac.uk> Dear all, I may be missing something in the documentation, but I can't seem to figure out how to write newick trees with internal node names (preserved as plain text). I think in some versions of Bio.Phylo a call to tree.format('newick') accomplished this by default, but currently I can't replicate this behaviour and get a tree with unnamed internal nodes. Any pointers would be appreciated! Thanks Tanya From rbuels at gmail.com Fri Mar 16 19:49:16 2012 From: rbuels at gmail.com (Robert Buels) Date: Fri, 16 Mar 2012 12:49:16 -0700 Subject: [Biopython] Google Summer of Code is *ON* for OBF projects! Message-ID: <4F63993C.7050809@gmail.com> Hi all, Great news: Google announced today that the Open Bioinformatics Foundation has been accepted as a mentoring organization for this summer's Google Summer of Code! GSoC is a Google-sponsored student internship program for open-source projects, open to students from around the world (not just US residents). Students are paid a $5000 USD stipend to work as a developer on an open-source project for the summer. For more on GSoC, see GSoC 2012 FAQ at http://goo.gl/kNv48 Student applications are due April 6, 2012 at 19:00 UTC. Students who are interested in participating should look at the OBF's GSoC page at http://open-bio.org/wiki/Google_Summer_of_Code, which lists project ideas, and whom to contact about applying. 
For current developers on OBF projects, please consider volunteering to be a mentor if you have not already, and contribute project ideas. Just list your name and project ideas on OBF wiki and on the relevant project's GSoC wiki page. Thanks to all who helped make OBF's application to GSoC a success, and let's have a great, productive summer of code! Rob Buels OBF GSoC 2012 Administrator From mictadlo at gmail.com Mon Mar 19 04:29:40 2012 From: mictadlo at gmail.com (Mic) Date: Mon, 19 Mar 2012 14:29:40 +1000 Subject: [Biopython] [Bio-bwa-help] making a "libbwa" In-Reply-To: <4F61CAF8.7020507@crs4.it> References: <4F61CAF8.7020507@crs4.it> Message-ID: +1 BioRuby did it in the following way: https://github.com/fstrozzi/bioruby-bwa/blob/master/ext/mkrf_conf.rb https://github.com/fstrozzi/bioruby-bwa/wiki Maybe it could be integrated to be Biopython? Cheers, On Thu, Mar 15, 2012 at 8:56 PM, Luca Pireddu wrote: > Hello list, > > I'd like the discuss the idea of refactoring BWA to separate the > alignment logic from the rest of the code base, thus resulting in a > libbwa alignment library which could be used through the regular command > line interface or through other means, as one saw fit. > > I'm one of the developers of Seal (http://biodoop-seal.sf.net/), a suite > of Hadoop-based tools for the processing of sequencing data. Within the > Seal suite, we have the Seqal program for read mapping which at this > time contains a "fork" of the BWA 0.5.10 code which we've patched in a > few points and then built as a library that we can use within our > application. In this way, we can feed the alignment algorithm with read > data we have pre-loaded in memory instead of files in a supported > format, and we can also fetch the alignment results directory from BWA's > memory structures rather than the regular output files. 
The resulting
> library, as long as it has a stable API, provides much more flexibility
> than a command-line program, allowing it to be used more easily and
> elegantly in settings different from the regular fastq to sam/bam
> workflow/application. Seqal is a concrete example. As another example,
> we've built a Python interface for the library allowing us to easily use
> it in scripts and testing.
>
> If there is interest for this idea, especially on the part of Heng, we
> could discuss a viable API. Myself and my colleagues are certainly
> willing to propose a first draft and even contribute the patches
> necessary to implement it.
>
> Looking forward to hearing from you,
>
> --
> Luca Pireddu
> CRS4 - Distributed Computing Group
> Loc. Pixina Manna Edificio 1
> 09010 Pula (CA), Italy
> Tel: +39 0709250452
>
> _______________________________________________
> Bio-bwa-help mailing list
> Bio-bwa-help at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/bio-bwa-help
>

From hernan.morales at gmail.com Tue Mar 20 02:52:46 2012
From: hernan.morales at gmail.com (Hernán Morales Durand)
Date: Mon, 19 Mar 2012 23:52:46 -0300
Subject: [Biopython] SeqIO fasta "fakes" recognition
In-Reply-To: <4F467992.9060205@unifi.it>
References: <4F4663E1.5010206@unifi.it> <4F467992.9060205@unifi.it>
Message-ID:

Why don't you filter for file names ending with known FASTA extensions
(.fa, .fasta, etc.)?

2012/2/23 Marco Galardini

> On 02/23/2012 05:35 PM, Eric Talevich wrote:
>
>>
>> I suppose there's always:
>>
>> try:
>>     record = SeqIO.read("gigo.png", "fasta")
>>     assert str(record.seq).isalpha()
>> except:
>>     # complain...
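A check along these lines can also be done without a parser. The sketch below rests on an assumption about what counts as a "fake": `looks_like_fasta` is a hypothetical helper that requires a `>` header line followed by lines made only of letters, `*` or `-`, and it inspects just the first few lines of the file.

```python
import os
import tempfile


def looks_like_fasta(path, max_lines=50):
    """Heuristically decide whether `path` starts like a FASTA file."""
    allowed = set("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz*-")
    seen_header = False
    seen_sequence = False
    try:
        with open(path, "r") as handle:
            for number, line in enumerate(handle):
                if number >= max_lines:
                    break
                line = line.strip()
                if not line:
                    continue
                if line.startswith(">"):
                    seen_header = True
                elif not seen_header or not set(line) <= allowed:
                    # Sequence before any header, or non-sequence characters.
                    return False
                else:
                    seen_sequence = True
    except UnicodeDecodeError:
        # Binary files such as images usually fail to decode as text.
        return False
    return seen_header and seen_sequence


if __name__ == "__main__":
    # Tiny self-check using a temporary file.
    with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
        tmp.write(">seq1\nACGTACGT\n")
    print(looks_like_fasta(tmp.name))
    os.remove(tmp.name)
```

Unlike a file-extension filter, this looks at the content, and unlike a full parse it stops after `max_lines`, which keeps the overhead small on large files.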
>> >> >> Thanks for the hint, I've implemented this (using the parse method) and > i'll see how it will perform (i guess it will had some overhead). > > Marco > > -- > ------------------------------**------------------- > Marco Galardini > DBE - Department of Evolutionary Biology > University of Florence - Italy > > e-mail: marco.galardini at unifi.it > www: http://www.unifi.it/dblage/**CMpro-v-p-51.html > phone: +39 055 2288249 > mobile: +39 340 2808041 > ------------------------------**------------------- > > ______________________________**_________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/**mailman/listinfo/biopython > -- Hern?n Morales Institute of Veterinary Genetics. National Scientific and Technical Research Council (CONICET). La Plata (1900), Buenos Aires, Argentina. Telephone: +54 (0221) 421-1799. Internal: 422 Fax: 425-7980 or 421-1799. From chaitanya.talnikar at iitb.ac.in Tue Mar 20 20:21:50 2012 From: chaitanya.talnikar at iitb.ac.in (Chaitanya Talnikar) Date: Tue, 20 Mar 2012 20:21:50 +0000 Subject: [Biopython] GSoC Application: "Representation and manipulation of genomic variants" Message-ID: Hi, I, Chaitanya Talnikar, am a third year Undergraduate student of Chemical Engineering at Indian Institute of Technology. I would like to work on the project "Representation and manipulation of genomic variants". As I understand this project is on human genomic variations and constructing a representation of the variations that would account for most of the variation file formats. I would like to how much should I write in the proposal. Would a description of the internal representation be sufficient? Talking about my experience, I have a good grasp of python. I have done courses on Molecular Biology and Computational Biology in which I learnt the sequence alignment algorithms and the classification of proteins based on hidden markov models of insertion, deletion and mutations in proteins. 
I have used bioinformatics in the projects that I've done in the field of systems biology and made extensive use of NCBI blast and other utilities. I have worked on several projects related to programming. Some of projects whose code is online are: LTanks (http://code.google.com/p/ltanks/) This is a clone of a windows game. It has had around 200 downloads. OffApt (http://code.google.com/p/offapt/) This is a software that allows people to download ubuntu packages from windows, it includes a dependency resolver for debs and also a downloader. This project required a lot of file parsing and string manipulations I've mainly used python to solve the mathematical problems at http://projecteuler.net From eric.talevich at gmail.com Tue Mar 20 22:11:25 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 20 Mar 2012 18:11:25 -0400 Subject: [Biopython] Bio.Phylo: writing newick trees with internal node names In-Reply-To: <4F632E69.8010906@stats.ox.ac.uk> References: <4F632E69.8010906@stats.ox.ac.uk> Message-ID: Hi Tanya, In case you haven't solved this yet, could you post some small portion of the Newick tree you're working with? It's possible that your tree is oddly formatted, and the Newick parser isn't picking up the internal node labels in the first place. In the current version of Biopython (1.59), this seems to work fine: from Bio import Phylo # Example file from our test suite tree = Phylo.read("Tests/Nexus/int_node_labels.nwk", "newick") # Print to the console print tree.format("newick") # To write the tree to a file, this is preferred: Phylo.write(tree, "my_new_file.nwk", "newick") Cheers, Eric On Fri, Mar 16, 2012 at 8:13 AM, Tanya Golubchik wrote: > Dear all, > > I may be missing something in the documentation, but I can't seem to > figure out how to write newick trees with internal node names (preserved > as plain text). 
I think in some versions of Bio.Phylo a call to > tree.format('newick') accomplished this by default, but currently I > can't replicate this behaviour and get a tree with unnamed internal nodes. > > Any pointers would be appreciated! > > Thanks > Tanya > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From wangdanburnett at 163.com Wed Mar 21 04:45:45 2012 From: wangdanburnett at 163.com (=?GB2312?B?zfXx9Q==?=) Date: Wed, 21 Mar 2012 12:45:45 +0800 Subject: [Biopython] join biopython mailing list Message-ID: <4F695CF9.4020406@163.com> Hello, biopython users ,developers and maintainers: I'm a new guy to use biopython from the Chinese mainland. But...I don't know how to join the mailing list? Could someone help me? Thanks a lot. Wang Dan from China From guillaume.bayot at gmail.com Wed Mar 21 09:19:21 2012 From: guillaume.bayot at gmail.com (Guillaume Bayot) Date: Wed, 21 Mar 2012 10:19:21 +0100 Subject: [Biopython] join biopython mailing list In-Reply-To: <4F695CF9.4020406@163.com> References: <4F695CF9.4020406@163.com> Message-ID: Hello, You can subscribe to the discussion list here http://lists.open-bio.org/mailman/listinfo/biopython Le 21/03/2012 05:45, ?? a ?crit : Hello, biopython users ,developers and maintainers: I'm a new guy to use biopython from the Chinese mainland. But...I don't know how to join the mailing list? Could someone help me? Thanks a lot. 
Wang Dan from China _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.orghttp://lists.open-bio.org/mailman/listinfo/biopython From golubchi at stats.ox.ac.uk Wed Mar 21 10:12:27 2012 From: golubchi at stats.ox.ac.uk (Tanya Golubchik) Date: Wed, 21 Mar 2012 10:12:27 +0000 Subject: [Biopython] Bio.Phylo: writing newick trees with internal node names In-Reply-To: References: <4F632E69.8010906@stats.ox.ac.uk> Message-ID: <4F69A98B.3040504@stats.ox.ac.uk> Hi Eric, It works in Biopython 1.59; I was using 1.58 originally. So problem solved. There's a few other strange things in Phylo that I can't work out, though -- for instance, what happens to 'PhyloXML.Other' attributes -- I can write these on the tree, and save the tree, but it can't be re-opened because the parser rejects it as improperly formatted. The documentation is a bit vague on this; in particular, passing None to 'attributes' when creating Phylo.Other objects fails, while passing an empty dictionary works... what is meant to be in 'attributes' when creating an Other object? 
Also, the 'is_aligned' sequence property disappears when a tree is saved in phyloxml format and then read back using Phylo.read: >>> print tree Phylogeny(rooted=True, branch_length_unit='SNV') Clade(branch_length=0.0, name='N1') Clade(branch_length=0.0, name='C00000761') BranchColor(blue=0, green=128, red=0) Sequence(type='dna') MolSeq(value='CCTTTCTATGTTCTGGACTGACGTTAAACGA', is_aligned=True) Clade(branch_length=0.0, name='C00000763') BranchColor(blue=0, green=0, red=255) Sequence(type='dna') MolSeq(value='CCTTTcTATGTtCTGGACTGACGTTAAACGA', is_aligned=True) >>> Phylo.write(tree, myfile, 'phyloxml') 1 >>> tree2 = Phylo.read(myfile, 'phyloxml') >>> print tree2 Phylogeny(rooted=True, branch_length_unit='SNV') Clade(branch_length=0.0, name='N1') Clade(branch_length=0.0, name='C00000761') BranchColor(blue=0, green=128, red=0) Sequence(type='dna') MolSeq(value='CCTTTCTATGTTCTGGACTGACGTTAAACGA') Clade(branch_length=0.0, name='C00000763') BranchColor(blue=0, green=0, red=255) Sequence(type='dna') MolSeq(value='CCTTTcTATGTtCTGGACTGACGTTAAACGA') Cheers Tanya On 20/03/12 22:11, Eric Talevich wrote: > Hi Tanya, > > In case you haven't solved this yet, could you post some small portion > of the Newick tree you're working with? It's possible that your tree is > oddly formatted, and the Newick parser isn't picking up the internal > node labels in the first place. > > In the current version of Biopython (1.59), this seems to work fine: > > from Bio import Phylo > # Example file from our test suite > tree = Phylo.read("Tests/Nexus/int_node_labels.nwk", "newick") > # Print to the console > print tree.format("newick") > # To write the tree to a file, this is preferred: > Phylo.write(tree, "my_new_file.nwk", "newick") > > > Cheers, > Eric > > > On Fri, Mar 16, 2012 at 8:13 AM, Tanya Golubchik > > wrote: > > Dear all, > > I may be missing something in the documentation, but I can't seem to > figure out how to write newick trees with internal node names (preserved > as plain text). 
I think in some versions of Bio.Phylo a call to > tree.format('newick') accomplished this by default, but currently I > can't replicate this behaviour and get a tree with unnamed internal > nodes. > > Any pointers would be appreciated! > > Thanks > Tanya > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > From p.j.a.cock at googlemail.com Wed Mar 21 10:14:38 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 21 Mar 2012 10:14:38 +0000 Subject: [Biopython] join biopython mailing list In-Reply-To: <4F695CF9.4020406@163.com> References: <4F695CF9.4020406@163.com> Message-ID: 2012/3/21 ?? : > Hello, biopython users ,developers and maintainers: > I'm a new guy to use biopython from the Chinese mainland. But...I don't > know how to join the mailing list? Could someone help me? > Thanks a lot. > Wang Dan from China I just checked the mail server, and you are (now) subscribed. Welcome! Peter From wangdanburnett at 163.com Wed Mar 21 10:06:33 2012 From: wangdanburnett at 163.com (=?GB2312?B?zfXx9Q==?=) Date: Wed, 21 Mar 2012 18:06:33 +0800 Subject: [Biopython] join biopython mailing list In-Reply-To: References: <4F695CF9.4020406@163.com> Message-ID: <4F69A829.8000001@163.com> ? 2012?03?21? 17:19, Guillaume Bayot ??: > Hello, > > You can subscribe to the discussion list here > http://lists.open-bio.org/mailman/listinfo/biopython > > Thanks a lot. I?ve found the page. Don > Le 21/03/2012 05:45, ?? a ?crit : >> Hello, biopython users ,developers and maintainers: >> I'm a new guy to use biopython from the Chinese mainland. But...I don't >> know how to join the mailing list? Could someone help me? >> Thanks a lot. 
>> Wang Dan from China >> >> >> >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython From reece at harts.net Wed Mar 21 16:01:57 2012 From: reece at harts.net (Reece Hart) Date: Wed, 21 Mar 2012 09:01:57 -0700 Subject: [Biopython] GSoC Application: "Representation and manipulation of genomic variants" In-Reply-To: References: Message-ID: On Tue, Mar 20, 2012 at 1:21 PM, Chaitanya Talnikar < chaitanya.talnikar at iitb.ac.in> wrote: > I, Chaitanya Talnikar, am a third year Undergraduate student of > Chemical Engineering at Indian Institute of Technology. I would like > to work on the project "Representation and manipulation of genomic > variants". ... Would a description of the internal > representation be sufficient? > Hi Chaitanya- I'm glad that you're interested in this project. There are many aspects of variant representation that a student (or perhaps even multiple students) might work on. Do not feel that you must tackle the entire project description. Before you spend a lot of time on an application, I suggest that you start a new thread with a short description of what you'd like to accomplish and questions you have. The BioPython community is a nurturing environment and I'm sure you'll get some good suggestions about scoping the project. Ultimately, we all want to help you write a successful application that results in an important contribution to the community. You'll be the first to initiate such a discussion, which gives you a wide-open opportunity. Does this answer give you enough to proceed with initiating a discussion? 
Thanks,
Reece

From golubchi at stats.ox.ac.uk  Thu Mar 22 15:20:11 2012
From: golubchi at stats.ox.ac.uk (Tanya Golubchik)
Date: Thu, 22 Mar 2012 15:20:11 +0000
Subject: [Biopython] Phylo.draw - font size
Message-ID: <4F6B432B.1050901@stats.ox.ac.uk>

Hi guys,

Does anyone know how to change the font size of the text annotations on
the current figure from Phylo.draw (i.e. the node names)? Changing
rcParams['font.size'] changes the axes but not the annotations.

Thanks
Tanya

From eric.talevich at gmail.com  Thu Mar 22 19:55:42 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 22 Mar 2012 15:55:42 -0400
Subject: [Biopython] Phylo.draw - font size
In-Reply-To: <4F6B432B.1050901@stats.ox.ac.uk>
References: <4F6B432B.1050901@stats.ox.ac.uk>
Message-ID: 

On Thu, Mar 22, 2012 at 11:20 AM, Tanya Golubchik wrote:

> Hi guys,
>
> Does anyone know how to change the font size of the text annotations on
> the current figure from Phylo.draw (ie the node names)? Changing
> rcParams['font.size'] changes the axes but not the annotations.
>

There doesn't seem to be a great way to do this directly, but you can
scale the entire image on-screen by changing the figure.dpi value. It
defaults to 80dpi, so this will magnify everything by 50%:

>>> rcParams['figure.dpi'] = 120

Alternatively, you can edit the source of Bio/Phylo/_utils.py at lines
347 (taxon labels) and 357 (confidence/support values), or copy the
entire _utils.draw function into your own code and edit the same lines
there.

If this feels ridiculous (as it probably does), I can add font_size and
branch_width_scale keyword arguments in the next release. Would that
help? Any other options you'd like to see, keeping in mind that this
function isn't meant to compete with standalone programs like
Archaeopteryx?

From eric.talevich at gmail.com  Thu Mar 22 23:29:58 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 22 Mar 2012 19:29:58 -0400
Subject: [Biopython] Bio.Phylo: writing newick trees with internal node names
In-Reply-To: <4F69A98B.3040504@stats.ox.ac.uk>
References: <4F632E69.8010906@stats.ox.ac.uk> <4F69A98B.3040504@stats.ox.ac.uk>
Message-ID: 

On Wed, Mar 21, 2012 at 6:12 AM, Tanya Golubchik wrote:

> There's a few other strange things in Phylo that I can't work out,
> though -- for instance, what happens to 'PhyloXML.Other' attributes -- I
> can write these on the tree, and save the tree, but it can't be
> re-opened because the parser rejects it as improperly formatted. The
> documentation is a bit vague on this; in particular, passing None to
> 'attributes' when creating Phylo.Other objects fails, while passing an
> empty dictionary works... what is meant to be in 'attributes' when
> creating an Other object?

The Other element is somewhat vaguely defined in the PhyloXML
specification, too; it's meant to allow defining new XML elements
without updating the official spec. The 'attributes' attribute
translates directly to the attributes of the new XML element you're
creating. It should be a dictionary of strings-to-strings (somewhat like
the 'annotations' attribute of SeqRecord). Something like:

>>> other = PhyloXML.Other("img", attributes={"src": "foo.png"})
>>> mytree.other.append(other)
>>> print mytree.format("phyloxml")

I see there was a bug here, where the PhyloXML.Other constructor should
initialize 'attributes' to an empty dictionary if it's not provided.
Fixed in the trunk:
https://github.com/biopython/biopython/commit/9e3fec461b189fe77b10db6de0c88df5b77e5bb0

> Also, the 'is_aligned' sequence property disappears when a tree is saved
> in phyloxml format and then read back using Phylo.read:
>
> >>> print tree
> Phylogeny(rooted=True, branch_length_unit='SNV')
>     Clade(branch_length=0.0, name='N1')
>         Clade(branch_length=0.0, name='C00000761')
>             BranchColor(blue=0, green=128, red=0)
>             Sequence(type='dna')
>                 MolSeq(value='CCTTTCTATGTTCTGGACTGACGTTAAACGA',
> is_aligned=True)
>         Clade(branch_length=0.0, name='C00000763')
>             BranchColor(blue=0, green=0, red=255)
>             Sequence(type='dna')
>                 MolSeq(value='CCTTTcTATGTtCTGGACTGACGTTAAACGA',
> is_aligned=True)
>
> >>> Phylo.write(tree, myfile, 'phyloxml')
> 1
> >>> tree2 = Phylo.read(myfile, 'phyloxml')
> >>> print tree2
> Phylogeny(rooted=True, branch_length_unit='SNV')
>     Clade(branch_length=0.0, name='N1')
>         Clade(branch_length=0.0, name='C00000761')
>             BranchColor(blue=0, green=128, red=0)
>             Sequence(type='dna')
>                 MolSeq(value='CCTTTCTATGTTCTGGACTGACGTTAAACGA')
>         Clade(branch_length=0.0, name='C00000763')
>             BranchColor(blue=0, green=0, red=255)
>             Sequence(type='dna')
>                 MolSeq(value='CCTTTcTATGTtCTGGACTGACGTTAAACGA')
>

This looks like a bug, too. (Thanks for finding these!) I don't
immediately see the cause of the problem; I'll try to take a crack at it
soon.

From zhigang.wu at email.ucr.edu  Fri Mar 23 19:00:12 2012
From: zhigang.wu at email.ucr.edu (Zhigang Wu)
Date: Fri, 23 Mar 2012 12:00:12 -0700
Subject: [Biopython] Google Summer of Code (GSoC)
Message-ID: 

Hi Biopython community,

I am Zhigang Wu, a third-year graduate student at UC Riverside with a
research focus on miRNA evolution. I am interested in implementing the
Biopython SearchIO module, which would parse the reports from currently
popular sequence alignment tools such as NCBI BLAST+, FASTA and HMMER3.
I was a BioPerl user until one year ago; since then I have been a
Biopython user. I have been using BioPerl's SearchIO extensively in my
research project. BioPerl's SearchIO module provides a common API
capable of handling all popular formats and is great.
I'd like to write one in Python.

As mentioned briefly, I have approximately one year of Perl programming
experience and one year of Python programming experience; I also
occasionally write C++ programs. Other than this, I have a bit of
experience with R.

Right now, I am preparing my proposal, which is due by April 6. I am
listing below the core methods that the Biopythonic SearchIO module is
going to support. For the sake of consistency, the methods are very
similar to those of the existing SeqIO and AlignIO modules.

1. SearchIO.parse(handle, format): a generator function.
2. SearchIO.to_dict(iterator): takes an iterator argument produced by
   the SearchIO.parse(...) function.
3. SearchIO.read(handle, format): provides fast access to a BLAST report
   that has only one record.
4. SearchIO.write(....): outputs the specified BLAST output.
5. SearchIO.convert(...): provides conversion between different formats.
6. ...

I'd like to hear back from you with any feedback or suggestions on the
methods, or on any format that is considered popular in your research
field and that you want supported in the Biopythonic SearchIO module.

Regards,
Zhigang Wu

From ferreirafm at usp.br  Fri Mar 23 21:55:27 2012
From: ferreirafm at usp.br (ferreirafm at usp.br)
Date: Fri, 23 Mar 2012 18:55:27 -0300
Subject: [Biopython] remove list redundancy
Message-ID: <20120323185527.47476o6f3ticn2of@webmail.usp.br>

Hi Biopy users,
I have a multi-sequence FASTA file which I've read as a list. Is there a
clever way/method to remove redundant sequences?
Thanks in advance,
Fred

### CODE:
def redundancy(fastafile):
    f=open(fastafile, 'r')
    record = list(SeqIO.parse(f,"fasta"))
    new_rec = record
    f.close
    print len(record)
    for i in range(len(record)):
        for j in range(len(record)):
            if i < j:
                if record[i].seq == record[j].seq:
                    del new_rec[j]
    print len(new_rec)

### RESULTS:
$ redundancy.py -run all_emm_fake.fasta
823
/usr/lib64/python2.7/site-packages/Bio/Seq.py:197: FutureWarning: In
future comparing Seq objects will use string comparison (not object
comparison). Incompatible alphabets will trigger a warning (not an
exception). In the interim please use id(seq1)==id(seq2) or
str(seq1)==str(seq2) to make your code explicit and to avoid this
warning.
  "and to avoid this warning.", FutureWarning)
823

### EXPECTING:
Worse, the function above is not working. I was expecting 823 before and
822 after running it.

From idoerg at gmail.com  Fri Mar 23 22:19:27 2012
From: idoerg at gmail.com (Iddo Friedberg)
Date: Fri, 23 Mar 2012 18:19:27 -0400
Subject: [Biopython] remove list redundancy
In-Reply-To: <20120323185527.47476o6f3ticn2of@webmail.usp.br>
References: <20120323185527.47476o6f3ticn2of@webmail.usp.br>
Message-ID: 

Python assigns by reference, not by value. So you can have the following:

>>> a=[1,2,3]
>>> b=a
>>> print b
[1, 2, 3]
>>> del b[1]
>>> print a
[1, 3]
>>>

So if you remove an item from list b, it will remove it from a as well.
Which is why in your case, record and new_rec end up the same, since they
were the same to start off with.

Furthermore, in your loop, you are changing the length of "record" which is
the target of a for loop. Never a good idea and yields unexpected results.
Finally, the index "j" you are using points to one thing in record, but
will point to another thing in new_rec.

You can do an assignment by value using the copy module
new_rec=copy.copy(record)

That will create a completely new copy of record in new_rec.
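
The difference between binding a second name to a list and copying the
list can be checked with a short sketch. The list of strings below is a
made-up stand-in for the SeqRecord list in the question. Note also that
copy.copy makes a shallow copy: the new list is a separate object, but
the items inside it are shared with the original.

```python
import copy

# Hypothetical stand-in for the list of records read from the FASTA file.
records = ["ACGT", "TTGA", "ACGT"]

alias = records               # a second name for the SAME list object
shallow = copy.copy(records)  # a NEW list object holding the same items

del alias[1]                  # deleting through the alias changes `records` too

print(records)                # the original list lost its second item
print(shallow)                # the copy still has all three items
print(alias is records, shallow is records)
```

Deleting through the alias shortens the original list, while the copy
keeps its own length, which is why the two counts printed in the question
came out equal.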

That still won't solve the problem that you have in the shifting place "j"
points to in the loop though.

I would suggest building a list of non-redundant sequences rather than
deleting from a list of redundant sequences.

HTH,

Iddo

On Fri, Mar 23, 2012 at 5:55 PM, wrote:

> Hi Biopy users,
> I have a mult-sequence fasta file which I've read as a list. Is there a
> clever way/method to remove redundant sequences?
> Thanks in advance,
> Fred
>
> ### CODE:
> def redundancy(fastafile):
>     f=open(fastafile, 'r')
>     record = list(SeqIO.parse(f,"fasta"))
>     new_rec = record
>     f.close
>     print len(record)
>     for i in range(len(record)):
>         for j in range(len(record)):
>             if i < j:
>                 if record[i].seq == record[j].seq:
>                     del new_rec[j]
>     print len(new_rec)
>
>
> ### RESULTS:
> $ redundancy.py -run all_emm_fake.fasta
> 823
> /usr/lib64/python2.7/site-packages/Bio/Seq.py:197: FutureWarning: In
> future comparing Seq objects will use string comparison (not object
> comparison). Incompatible alphabets will trigger a warning (not an
> exception). In the interim please use id(seq1)==id(seq2) or
> str(seq1)==str(seq2) to make your code explicit and to avoid this warning.
>  "and to avoid this warning.", FutureWarning)
> 823
>
> ### EXPECTING:
> Worse, the function above is not working. I was expecting 823 before and
> 822 after running it.
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

-- 
Iddo Friedberg
http://iddo-friedberg.net/contact.html
++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.>
++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----.
.>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>>
>>----.<--.>++++++.<<<<------------------------------------.
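
A minimal sketch of the suggestion to build a non-redundant list instead
of deleting in place. To stay runnable without Biopython it assumes each
record is a hypothetical (name, sequence) tuple and uses a made-up helper
name, unique_by_sequence; with SeqRecord objects the key would presumably
be str(record.seq), as the FutureWarning quoted above recommends.

```python
def unique_by_sequence(records):
    """Keep the first record seen for each distinct sequence, in input order."""
    seen = set()    # sequences that have already been kept
    unique = []     # the non-redundant records
    for name, seq in records:
        if seq not in seen:
            seen.add(seq)
            unique.append((name, seq))
    return unique


# Hypothetical data: emm3 repeats the sequence of emm1.
records = [("emm1", "ACGT"), ("emm2", "TTGA"), ("emm3", "ACGT")]
unique = unique_by_sequence(records)
print(len(records), len(unique))
```

Because the input list is never modified, no index shifts while looping,
and the set makes each membership check cheap even for a few thousand
sequences.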

From w.arindrarto at gmail.com  Fri Mar 23 22:23:13 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Fri, 23 Mar 2012 23:23:13 +0100
Subject: [Biopython] remove list redundancy
In-Reply-To: 
References: <20120323185527.47476o6f3ticn2of@webmail.usp.br>
Message-ID: 

Hi Ferreira,

As Iddo has mentioned, it's better to build a new list containing
unique records instead. Here's my shot at a method like that:

from Bio import SeqIO
from Bio.SeqUtils.CheckSum import seguid

# Returns a list of unique SeqRecord objects.
def remove_redundant(fastafile):
    records = SeqIO.parse(fastafile, 'fasta')

    # new list container
    unique_records = []
    # unique sequence checksum container
    checksum_container = []

    for record in records:
        checksum = seguid(record.seq)
        if checksum not in checksum_container:
            unique_records.append(record)

    return unique_records

I assume your FASTA file is not very big since you opted to load
everything into memory in your initial script. If it were big, you
could change the method into a generator to save memory, by writing
this instead:

    # previous lines...
    for record in records:
        checksum = seguid(record.seq)
        if checksum not in checksum_container:
            yield record

And iterating over the function like so:

    for unique_record in remove_redundant(fastafile):
        # process the records here

Hope that helps,
---
Bow

On Fri, Mar 23, 2012 at 23:19, Iddo Friedberg wrote:
> Python assigns by reference, not by value. So you can have the following:
>
> >>> a=[1,2,3]
> >>> b=a
> >>> print b
> [1, 2, 3]
> >>> del b[1]
> >>> print a
> [1, 3]
> >>>
>
> So if you remove an item from list b, it will remove it from a as well.
> Which is why in your case, record and new_rec end up the same, since they
> were the same to start off with.
>
> Furthermore, in your loop, you are changing the length of "record" which is
> the target of a for loop. Never a good idea and yields unexpected results.
> Finally, the index "j" you are using points to one thing in record, but > will point to another thing in new_rec. > > You can do an assignment by value using the copy module > new_rec=copy.copy(record) > > That will create a completely new copy of record in new_rec. > > That still won't solve the problem that you have in the shifting place "j" > points to in the loop though. > > I would suggest building a list of non-redundant sequences rather than > deleting from a list of redundant sequences. > > > HTH, > > Iddo > > On Fri, Mar 23, 2012 at 5:55 PM, wrote: > >> Hi Biopy users, >> I have a mult-sequence fasta file which I've read as a list. Is there a >> clever way/method to remove redundant sequences? >> Thanks in advance, >> Fred >> >> ### CODE: >> ? ?def redundancy(fastafile): >> ? ?f=open(fastafile, 'r') >> ? ?record = list(SeqIO.parse(f,"fasta")) >> ? ?new_rec = record >> ? ?f.close >> ? ?print len(record) >> ? ?for i in range(len(record)): >> ? ? ? ?for j in range(len(record)): >> ? ? ? ? ? ?if i < j: >> ? ? ? ? ? ? ? ?if record[i].seq == record[j].seq: >> ? ? ? ? ? ? ? ? ? ?del new_rec[j] >> ? ? print len(new_rec) >> >> >> ### RESULTS: >> $ redundancy.py -run all_emm_fake.fasta >> 823 >> /usr/lib64/python2.7/site-**packages/Bio/Seq.py:197: FutureWarning: In >> future comparing Seq objects will use string comparison (not object >> comparison). Incompatible alphabets will trigger a warning (not an >> exception). In the interim please use id(seq1)==id(seq2) or >> str(seq1)==str(seq2) to make your code explicit and to avoid this warning. >> ?"and to avoid this warning.", FutureWarning) >> 823 >> >> ### EXPECTING: >> Worse, the function above is not working. I was expecting 823 before and >> 822 after running it. 
>> >> >> >> >> ______________________________**_________________ >> Biopython mailing list ?- ?Biopython at lists.open-bio.org >> http://lists.open-bio.org/**mailman/listinfo/biopython >> > > > > -- > Iddo Friedberg > http://iddo-friedberg.net/contact.html > ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> > ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. > .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>>----.<--.>++++++.<<<<------------------------------------. > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From nuin at genedrift.org Fri Mar 23 22:27:54 2012 From: nuin at genedrift.org (nuin at genedrift.org) Date: Fri, 23 Mar 2012 22:27:54 +0000 Subject: [Biopython] remove list redundancy In-Reply-To: <20120323185527.47476o6f3ticn2of@webmail.usp.br> References: <20120323185527.47476o6f3ticn2of@webmail.usp.br> Message-ID: <1428368114-1332541675-cardhu_decombobulator_blackberry.rim.net-2070697673-@b2.c21.bise6.blackberry> Not a BioPython solution per se but you can uniquify your list using a set. HTH Paulo Sent from my BlackBerry device on the Rogers Wireless Network -----Original Message----- From: ferreirafm at usp.br Sender: biopython-bounces at lists.open-bio.org Date: Fri, 23 Mar 2012 18:55:27 To: Subject: [Biopython] remove list redundancy Hi Biopy users, I have a mult-sequence fasta file which I've read as a list. Is there a clever way/method to remove redundant sequences? 
Thanks in advance,
Fred

### CODE:
def redundancy(fastafile):
    f=open(fastafile, 'r')
    record = list(SeqIO.parse(f,"fasta"))
    new_rec = record
    f.close
    print len(record)
    for i in range(len(record)):
        for j in range(len(record)):
            if i < j:
                if record[i].seq == record[j].seq:
                    del new_rec[j]
    print len(new_rec)

### RESULTS:
$ redundancy.py -run all_emm_fake.fasta
823
/usr/lib64/python2.7/site-packages/Bio/Seq.py:197: FutureWarning: In
future comparing Seq objects will use string comparison (not object
comparison). Incompatible alphabets will trigger a warning (not an
exception). In the interim please use id(seq1)==id(seq2) or
str(seq1)==str(seq2) to make your code explicit and to avoid this
warning.
  "and to avoid this warning.", FutureWarning)
823

### EXPECTING:
Worse, the function above is not working. I was expecting 823 before and
822 after running it.

_______________________________________________
Biopython mailing list - Biopython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython

From w.arindrarto at gmail.com  Fri Mar 23 22:39:59 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Fri, 23 Mar 2012 23:39:59 +0100
Subject: [Biopython] remove list redundancy
In-Reply-To: 
References: <20120323185527.47476o6f3ticn2of@webmail.usp.br>
Message-ID: 

Ferreira,

I just realized I missed one important line:

        if checksum not in checksum_container:
            unique_records.append(record)

should be:

        if checksum not in checksum_container:
            checksum_container.append(checksum)
            unique_records.append(record)

Basically, the method adds the sequence record to the unique_records
list only if its sequence checksum is not already present in
checksum_container.

And apologies for the double mass-post, everyone.

Have a nice weekend,
---
Bow

On Fri, Mar 23, 2012 at 23:23, Wibowo Arindrarto wrote:
> Hi Ferreira,
>
> As Iddo has mentioned, it's better to build a new list containing
> unique records instead.
Here's my shot at a method like that: > > > from Bio import SeqIO > from Bio.SeqUtils.CheckSum import seguid > > # Returns a list containing unique SeqRecord list. > def remove_redundant(fastafile): > ? ?records = SeqIO.parse(fastafile, 'fasta') > > ? ?# new list container > ? ?unique_records = [] > ? ?# unique sequence checksum container > ? ?checksum_container = [] > > ? ?for record in records: > ? ? ? ?checksum = seguid(record.seq) > ? ? ? ?if checksum not in checksum_container: > ? ? ? ? ? ?unique_records.append(record) > > ? ?return unique_records > > I assume your Fasta file is not very big since you opted to load > everything into memory in your initial script. If it were big, you > could change the method into a generator to save memory, by writing > this instead: > > ? ?# previous lines... > ? ?for record in records: > ? ? ? ?checksum = seguid(record.seq) > ? ? ? ?if checksum not in checksum_container: > ? ? ? ? ? ?yield record > > And iterating over the function like so: > > ? ?for unique_record in remove_redundant(fastafile): > ? ? ? ?# process the records here > > > Hope that helps, > --- > Bow > > > On Fri, Mar 23, 2012 at 23:19, Iddo Friedberg wrote: >> Python assigns by reference, not by value. So you can have the following: >> >>>>> a=[1,2,3] >>>>> b=a >>>>> print b >> [1, 2, 3] >>>>> del b[1] >>>>> print a >> [1, 3] >>>>> >> >> So if you remove an item from list b, it will remove it from a as well. >> Which is why in your case, record and new_rec end up the same, since they >> were the same to start off with. >> >> Furthermore, in your loop, you are changing the length of "record" which is >> the target of a for loop. Never a good idea and yields unexpected results. >> Finally, the index "j" you are using points to one thing in record, but >> will point to another thing in new_rec. >> >> You can do an assignment by value using the copy module >> new_rec=copy.copy(record) >> >> That will create a completely new copy of record in new_rec. 
>>
>> That still won't solve the problem that you have in the shifting place "j"
>> points to in the loop though.
>>
>> I would suggest building a list of non-redundant sequences rather than
>> deleting from a list of redundant sequences.
>>
>>
>> HTH,
>>
>> Iddo
>>
>> On Fri, Mar 23, 2012 at 5:55 PM, wrote:
>>
>>> Hi Biopy users,
>>> I have a mult-sequence fasta file which I've read as a list. Is there a
>>> clever way/method to remove redundant sequences?
>>> Thanks in advance,
>>> Fred
>>>
>>> ### CODE:
>>> def redundancy(fastafile):
>>>     f=open(fastafile, 'r')
>>>     record = list(SeqIO.parse(f,"fasta"))
>>>     new_rec = record
>>>     f.close
>>>     print len(record)
>>>     for i in range(len(record)):
>>>         for j in range(len(record)):
>>>             if i < j:
>>>                 if record[i].seq == record[j].seq:
>>>                     del new_rec[j]
>>>     print len(new_rec)
>>>
>>>
>>> ### RESULTS:
>>> $ redundancy.py -run all_emm_fake.fasta
>>> 823
>>> /usr/lib64/python2.7/site-packages/Bio/Seq.py:197: FutureWarning: In
>>> future comparing Seq objects will use string comparison (not object
>>> comparison). Incompatible alphabets will trigger a warning (not an
>>> exception). In the interim please use id(seq1)==id(seq2) or
>>> str(seq1)==str(seq2) to make your code explicit and to avoid this warning.
>>>   "and to avoid this warning.", FutureWarning)
>>> 823
>>>
>>> ### EXPECTING:
>>> Worse, the function above is not working. I was expecting 823 before and
>>> 822 after running it.
>>>
>>
>> --
>> Iddo Friedberg
>> http://iddo-friedberg.net/contact.html
>> ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.>
>> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----.
>> .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>>
>> >>----.<--.>++++++.<<<<------------------------------------.

From ferreirafm at usp.br Fri Mar 23 23:35:54 2012
From: ferreirafm at usp.br (ferreirafm at usp.br)
Date: Fri, 23 Mar 2012 20:35:54 -0300
Subject: [Biopython] remove list redundancy
In-Reply-To: <20120323185527.47476o6f3ticn2of@webmail.usp.br>
References: <20120323185527.47476o6f3ticn2of@webmail.usp.br>
Message-ID: <20120323203554.14654j9m0wvpho0q@webmail.usp.br>

Thanks everyone for helping. Have a nice weekend.
Fred

Quoting ferreirafm at usp.br:

> Hi Biopy users,
> I have a mult-sequence fasta file which I've read as a list. Is
> there a clever way/method to remove redundant sequences?
> Thanks in advance,
> Fred
>
> ### CODE:
> def redundancy(fastafile):
>     f=open(fastafile, 'r')
>     record = list(SeqIO.parse(f,"fasta"))
>     new_rec = record
>     f.close
>     print len(record)
>     for i in range(len(record)):
>         for j in range(len(record)):
>             if i < j:
>                 if record[i].seq == record[j].seq:
>                     del new_rec[j]
>     print len(new_rec)
>
>
> ### RESULTS:
> $ redundancy.py -run all_emm_fake.fasta
> 823
> /usr/lib64/python2.7/site-packages/Bio/Seq.py:197: FutureWarning: In
> future comparing Seq objects will use string comparison (not object
> comparison). Incompatible alphabets will trigger a warning (not an
> exception). In the interim please use id(seq1)==id(seq2) or
> str(seq1)==str(seq2) to make your code explicit and to avoid this
> warning.
>   "and to avoid this warning.", FutureWarning)
> 823
>
> ### EXPECTING:
> Worse, the function above is not working. I was expecting 823 before
> and 822 after running it.
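
Pulling the suggestions in this thread together, here is a minimal sketch of the "build a new list" approach. It is not the code posted by anyone above: it keeps a set of sequence strings already seen (an assumption made here, in place of the seguid checksum list), and it uses plain (name, sequence) tuples as hypothetical stand-ins for SeqRecord objects so that it runs without Biopython. It is written for Python 3.

```python
def unique_by_sequence(records):
    """Yield each (name, sequence) pair whose sequence has not been seen yet.

    `records` is any iterable of (name, sequence) tuples; the tuples are a
    hypothetical stand-in for Biopython SeqRecord objects.
    """
    seen = set()  # sequence strings already emitted
    for name, sequence in records:
        key = str(sequence)  # compare as plain strings, not as objects
        if key not in seen:
            seen.add(key)  # remember the sequence so later copies are skipped
            yield name, sequence


records = [
    ("seq1", "ACGT"),
    ("seq2", "GGCC"),
    ("seq3", "ACGT"),  # duplicate of seq1
    ("seq4", "TTAA"),
]
unique = list(unique_by_sequence(records))
print(len(records), len(unique))
```

Nothing is deleted from the input list, so the index-shifting and aliasing problems described above do not arise. With real SeqRecord objects the key would presumably be str(record.seq), in line with the FutureWarning in the original post.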
From eric.talevich at gmail.com Sat Mar 24 01:21:37 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 23 Mar 2012 21:21:37 -0400
Subject: [Biopython] Phylo.draw - font size
In-Reply-To:
References: <4F6B432B.1050901@stats.ox.ac.uk>
Message-ID:

On Thu, Mar 22, 2012 at 3:55 PM, Eric Talevich wrote:
> On Thu, Mar 22, 2012 at 11:20 AM, Tanya Golubchik
> wrote:
>> Hi guys,
>>
>> Does anyone know how to change the font size of the text annotations on
>> the current figure from Phylo.draw (ie the node names)? Changing
>> rcParams['font.size'] changes the axes but not the annotations.
>>
>
> There doesn't seem to be a great way to do this directly, but you can
> scale the entire image on-screen by changing the figure.dpi value. It
> defaults to 80dpi, so this will magnify everything by 50%:
>
> >>> rcParams['figure.dpi'] = 120
>
> Alternatively, you can edit the source of Bio/Phylo/_utils.py at lines
> 347 (taxon labels) and 357 (confidence/support values), or copy the
> entire _utils.draw function into your own code and edit the same lines
> there.
>
> If this feels ridiculous (as it probably does), I can add font_size
> and branch_width_scale keyword arguments in the next release. Would
> that help? Any other options you'd like to see, keeping in mind that
> this function isn't meant to compete with standalone programs like
> Archaeopteryx?

I've fixed this in the trunk:
https://github.com/biopython/biopython/commit/e25a1b8bde6c9adba7db92bfe13d1bd4320cadcf

Now rcParams["font.size"] will scale the fonts as expected (though the proportions are still hard-coded), and rcParams["lines.linewidth"] will scale the lines, e.g. if you set the line width to 2 in rcParams, a branch with width=2 will be displayed with a width of 4 pixels, and width=0.5 will be displayed with 1 pixel.
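
The line-width rule above is plain multiplication, which the following sketch spells out. The helper name displayed_width is hypothetical and is not part of Bio.Phylo or matplotlib; it only models the arithmetic described in the message (Python 3).

```python
def displayed_width(branch_width, base_linewidth):
    """Model of the scaling rule: drawn width = branch width x base line width.

    `base_linewidth` plays the role of rcParams["lines.linewidth"];
    `branch_width` is the per-branch width attribute.
    """
    return branch_width * base_linewidth


base = 2  # the value one would assign to rcParams["lines.linewidth"]
for width in (2, 0.5):
    print(width, "->", displayed_width(width, base))
```

This reproduces the two cases quoted above: width 2 gives 4 and width 0.5 gives 1 when the base line width is 2.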
Other changes are still in progress.

Cheers,
Eric

From chaitanya.talnikar at iitb.ac.in Sun Mar 25 18:25:25 2012
From: chaitanya.talnikar at iitb.ac.in (Chaitanya Talnikar)
Date: Sun, 25 Mar 2012 23:55:25 +0530
Subject: [Biopython] GSoC project application: Representation and manipulation of genomic variants: Some questions
Message-ID:

Hi again,

I have a few questions on this topic.

1. For the implementation of variants, which would be better: creating a new SeqVariant class from scratch, or extending the SeqFeature class to accommodate variants? I guess a separate class would be better.

2. While looking at the Biopython wiki I came across an implementation of GFF at
https://github.com/chapmanb/bcbb/tree/master/gff
As GVF is an extension of GFF3, this module could be used for reading GVF files too. Would this module be a good starting point to modify for GVF support?

3. I've been going through the VCF documentation. SNPs, insertions and deletions can be represented just as they are in VCF: the object would have a start position, the length of the reference sequence (no need to store the sequence itself) and a list of alternate sequence objects. I still have to look into SVs (structural variants), rearrangements and imprecise variant information, so this representation covers only SNPs and small indels. GVF has a very similar format for small indels and SNPs, except that it provides an extra end position column, which is not required if we have the reference sequence.

Regards,
Chaitanya Talnikar
Undergraduate Student
Department of Chemical Engineering
IIT Bombay

From chris.mit7 at gmail.com Sun Mar 25 19:13:47 2012
From: chris.mit7 at gmail.com (Chris Mitchell)
Date: Sun, 25 Mar 2012 15:13:47 -0400
Subject: [Biopython] GSOC 2012: Representation and manipulation of genomic variants
Message-ID:

Hey everyone,

I'm interested in undertaking this project.
I'm currently a PhD student in Biochemical, Cellular, & Molecular Biology at Johns Hopkins School of Medicine, and I've been a hobby programmer for several years. I primarily code in Python and C++. I'm a core developer of Mudlet, which is written in C++ and has a fair user base. For Python, I have nothing published for general consumption yet, though I will more than likely be putting out a mass spectrometry toolset in the upcoming year.

I'm currently working on large -omics based data (whole genome alignments, RNA-Seq), so I have a flavor of what formats end users will encounter (I've worked with Illumina & Complete Genomics RNA-Seq and genome assemblies, and Affy arrays for SNPs/CNVs) and, more importantly, I know how the end user will want to utilize the data. By far the biggest hurdle I see is arranging the several types of data representation into a universal reference frame (for instance bam files being 0 based, sam being 1 based, CG vcf files being 0 based, closed interval versus half open, etc.). I've written parsers for my own use that interconvert between formats and can read/output GFF/VCF files, and this would be a great opportunity to expand on my existing toolset and get valuable feedback from others in the community.

Thanks,
Chris

From p.j.a.cock at googlemail.com Sun Mar 25 19:59:41 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 25 Mar 2012 20:59:41 +0100
Subject: [Biopython] GSOC 2012: Representation and manipulation of genomic variants
In-Reply-To:
References:
Message-ID:

On Sun, Mar 25, 2012 at 8:13 PM, Chris Mitchell wrote:
> Hey everyone,
>
> I'm interested in undertaking this project. I'm currently a PhD student in
> Biochemical, Cellular, & Molecular Biology at Johns Hopkins School of
> Medicine, and I've been a hobby programmer for several years. I primarily
> code in Python and C++.

Great - and your background sounds good.
> I'm currently working on large -omics based data (whole genome alignments,
> RNA-Seq) so I have a flavor of what formats end users will encounter (I've
> worked with Illumina & Complete Genomics RNA-Seq and genome assemblies, and
> Affy arrays for SNPs/CNVs) and more importantly, I know how the end user
> will want to utilize the data. By far, I see the biggest hurdle is to
> arrange several types of data representations into a universal reference
> frame (for instance bam files being 0 based, sam being 1 based, CG vcf
> files being 0 based, closed interval versus half open, etc etc etc).

That's easy - we're Python programmers, therefore any parsed data structure should be converted to use Python counting.

Peter

From rbuels at gmail.com Sun Mar 25 20:09:50 2012
From: rbuels at gmail.com (Robert Buels)
Date: Sun, 25 Mar 2012 13:09:50 -0700
Subject: [Biopython] Announcing OBF Summer of Code - please forward!
Message-ID: <4F6F7B8E.1050903@gmail.com>

Hi all,

Here's an advertising-ready announcement for OBF's Summer of Code, thanks to Christian Zmasek and Hilmar Lapp for their excellent writing. Student applications are due April 6! Please spread it widely, we need to reach lots of students with it!

Rob Buels
OBF GSoC 2012 Admin

============================================================
*** Please disseminate widely at your local institutions ***
*** including posting to message and job boards, so that ***
*** we reach as many students as possible.               ***
============================================================

OPEN BIOINFORMATICS FOUNDATION SUMMER OF CODE 2012

Applications due 19:00 UTC, April 6, 2012.
http://www.open-bio.org/wiki/Google_Summer_of_Code

The Open Bioinformatics Foundation Summer of Code program provides a unique opportunity for undergraduate, masters, and PhD students to obtain hands-on experience writing and extending open-source software for bioinformatics under the mentorship of experienced developers from around the world.
The program is the participation of the Open Bioinformatics Foundation (OBF) as a mentoring organization in the Google Summer of Code(tm) (http://code.google.com/soc/).

Students successfully completing the 3 month program receive a $5,000 USD stipend, and may work entirely from their home or home institution. Participation is open to students from any country in the world except countries subject to US trade restrictions. Each student will have at least one dedicated mentor to show them the ropes and help them complete their project.

The Open Bioinformatics Foundation is particularly seeking students interested in both bioinformatics (computational biology) and software development. Some initial project ideas are listed on the website. These range from sequence search I/O in BioPython to lightweight sequence objects and lazy parsing in BioPerl, a next-generation BioRuby interface to Ensembl to developing cloud-optimized versions of BioJava modules. All project ideas are flexible and many can be adjusted in scope to match the skills of the student. We also particularly welcome and encourage students proposing their own project ideas; historically some of the most successful Summer of Code projects are ones proposed by the students themselves.

TO APPLY: Apply online at the Google Summer of Code website (http://socghop.appspot.com/), where you will also find GSoC program rules and eligibility requirements. The 12-day application period for students runs from Monday, March 26 through Friday, April 6th, 2012.

INQUIRIES: We strongly encourage all interested students to get in touch with us with their ideas as early on as possible. See the OBF GSoC page for contact details.
2012 OBF Summer of Code:
http://www.open-bio.org/wiki/Google_Summer_of_Code
Google Summer of Code FAQ:
http://www.google-melange.com/document/show/gsoc_program/google/gsoc2012/faqs

From ankeshth at gmail.com Mon Mar 26 04:31:26 2012
From: ankeshth at gmail.com (Ankesh Thakur)
Date: Mon, 26 Mar 2012 10:01:26 +0530
Subject: [Biopython] Query for GSoc projects on SearchIO and Representation and manipulation of genomic variants
In-Reply-To:
References:
Message-ID:

Dear Sir,

I am a student of Biological Sciences and Bioengineering at the Indian Institute of Technology, Kanpur (IIT Kanpur). I am willing to write code for Biopython during this summer. I am not very clear about the goals of this project. I want to know more about the suggested projects, such as what else I need to do apart from converting one file format to another and showing the data on the console in human-readable form.

I have no prior experience with the bio modules of Python. I have around seven months of experience with Python and GitHub, and I have taken courses in molecular biology, genetics and biochemistry. I would like to learn Biopython, BioPerl (if required) and other necessary tools during this summer. Eagerly waiting for your reply.

Regards,
Ankesh Kumar Thakur.

From p.j.a.cock at googlemail.com Mon Mar 26 09:19:18 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 26 Mar 2012 10:19:18 +0100
Subject: [Biopython] Query for GSoc projects on SearchIO and Representation and manipulation of genomic variants
In-Reply-To:
References:
Message-ID:

On Mon, Mar 26, 2012 at 5:31 AM, Ankesh Thakur wrote:
> Dear Sir,
>   I am a student of Biological Sciences and bioengineering at Indian
> Institute of Technology, Kanpur (IIT Kanpur). I am willing to write
> codes for Biopython during this summer. I am not very much clear about
> the goals of this project.
> I want to know more about the suggested
> projects, like what else I need to do apart from conversion of one file
> format to other and showing the data on the console in human readable
> form.
>
>   I have no prior experience with bio modules of python. I have arround than
> seven months experience with python git hub. And I have done Molecular
> biology, Genetics and Bio-chemistry courses. I would like to learn
> Biopython, BioPerl( if required) and other necessary tools during this
> summer. Eagerly waiting for your reply.
>
> Regards,
> Ankesh Kumar Thakur.

Hello Ankesh,

Both the SearchIO and genomic variant GSoC project ideas are more than just file format conversion and 'pretty printing' at the console. An essential part of this is designing a suitable object representation for efficient use of the data. That probably means creating objects (Python classes). This will require both a good understanding of the meaning of the data being represented (e.g. how are BLAST search results structured) but also how to design Python objects.

For the SearchIO project, I went into a lot more detail on the Biopython development mailing list last week:
http://lists.open-bio.org/pipermail/biopython-dev/2012-March/009468.html

Peter

From chapmanb at 50mail.com Mon Mar 26 11:07:36 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 26 Mar 2012 07:07:36 -0400
Subject: [Biopython] GSoC project application: Representation and manipulation of genomic variants: Some questions
In-Reply-To:
References:
Message-ID: <87398vmxhj.fsf@fastmail.fm>

Chaitanya;
Thanks for the interest and specific questions.

> 1. For the implementation of variants what would be better, to create
> a new SeqVariant class from scratch or to extend the SeqFeature class
> to accomodate variants? I guess a separate class would be better.

My preference would be to see how far the SeqFeature class can take you before implementing a new class.
It should be general enough to handle variant data, but the bigger challenge might be designing a lightweight representation that is compatible with existing SeqFeatures. > 2. While looking at the Biopython wiki I came across an implementation > of GFF at > https://github.com/chapmanb/bcbb/tree/master/gff > As GVF is an extension of GFF3, this module could be used for reading > GVF's too. Is this module a good start to modify it to support GVFs? That would be perfect. We're hoping to merge this into the Biopython code base before the next release. There is also an existing VCF parser we'd love to use here: https://github.com/jamescasbon/PyVCF > 3. I've been going through the VCF documentation and SNPs, insertions > and deletions can be represented just like it is done in VCF, the > object would have a start position, length of reference sequence(no > need to store this sequence) and a list of alternate sequence objects. > I have to still look into the SV(Structural variants), rearrangements > and imprecise variant information, so this representation is only for > SNPs and small indels. The GVF has a very similar format for small > indels and SNPs, just that it provides an extra end position column > which is not required if we have the reference sequence. This sounds good. My general suggestion is to start writing your proposal as soon as possible. A concrete first draft will help with more detailed comments. The wiki has good information on the project plan: http://open-bio.org/wiki/Google_Summer_of_Code#When_you_apply and the NESCent wiki has some examples of well-written proposals from previous years: http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2012#Writing_your_application One of the key aspects is having a detailed week-by-week outline of your plans for the summer. 
Thanks again for the interest,
Brad

From chapmanb at 50mail.com Mon Mar 26 11:02:29 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 26 Mar 2012 07:02:29 -0400
Subject: [Biopython] Google Summer of Code (GSoC)
In-Reply-To:
References:
Message-ID: <8762drmxq2.fsf@fastmail.fm>

Zhigang;

> I am, Zhigang Wu, a third year graduate student in UC Riverside with a
> research focus on miRNA evolution. I am interested in implementing the
> Biopython SearchIO module, which is used to parse the blast reports from
> currently popular sequence alignment tools like NCBI BLAST+, FASTA, HMMER3
> and etc.

Welcome. Thanks for the introduction and your interest in the SearchIO project.

> Right now, I am preparing my proposal that is due by April 6. I am listing
> below the core methods that the Biopythonic SearchIO module is going to
> support. For the sake of consistency, the moethods are very similar to
> existing SeqIO and
> AlignIO modules.
>
> 1. SearchIO.parse(handle, format), is a generator function.
> 2. SearchIO.to_dict(iterator): this function takes in an iterator
> arguments which is produced by SearchIO.parse(...) function.
> 3. SearchIO.read(handle, format): provide fasta access to blast report
> have only one record
> 4. SearchIO.write(....) outputs specified blast output
> 5. SearchIO.convert(...) provide format conversion between different
> formats
> 6. ...
>
> I'd like to hear back from you any feedback or suggestions on the method or
> any format that in your research field is considered to be popular and you
> want it to be supported in Biopythonic SearchIO module.

This all sounds great. My suggestion would be to make your project proposal available once you have a first draft, and then folks will have more detailed comments.
The wiki has good information on the project plan:

http://open-bio.org/wiki/Google_Summer_of_Code#When_you_apply

and the NESCent wiki has some examples of well-written proposals from previous years:

http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2012#Writing_your_application

One of the key aspects is having a detailed week-by-week outline of your plans for the summer.

Thanks again,
Brad

From chapmanb at 50mail.com Mon Mar 26 11:16:27 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 26 Mar 2012 07:16:27 -0400
Subject: [Biopython] GSOC 2012: Representation and manipulation of genomic variants
In-Reply-To:
References:
Message-ID: <87wr67liic.fsf@fastmail.fm>

Chris;
Welcome and thanks for the interest in the project.

> I'm currently working on large -omics based data (whole genome alignments,
> RNA-Seq) so I have a flavor of what formats end users will encounter (I've
> worked with Illumina & Complete Genomics RNA-Seq and genome assemblies, and
> Affy arrays for SNPs/CNVs) and more importantly, I know how the end user
> will want to utilize the data. By far, I see the biggest hurdle is to
> arrange several types of data representations into a universal reference
> frame (for instance bam files being 0 based, sam being 1 based, CG vcf
> files being 0 based, closed interval versus half open, etc etc etc). I've
> written parsers for my own use that interconvert between formats and can
> read/output GFF/VCF files, and this would be a great opportunity to expand
> on my existing toolset and get valuable feedback from others in the
> community.

I agree with Peter: you want to convert everything to standard Python 0-based coordinates internally. The goal is to have a consistent data structure so you can code independent of the input/output formats. There are some existing VCF and GFF parsers we were targeting for inclusion:

https://github.com/jamescasbon/PyVCF
http://biopython.org/wiki/GFF_Parsing

but it would be great to see code you've written as well.
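
To make the 0-based convention above concrete, here is a minimal sketch of converting between 1-based closed intervals (as used by GFF3, for example) and Python-style 0-based half-open intervals. The function names are hypothetical and are not taken from Biopython, PyVCF or any parser mentioned in this thread (Python 3).

```python
def one_based_closed_to_python(start, end):
    """Convert a 1-based, closed interval [start, end] to 0-based, half-open."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Only the start shifts: the closed end equals the half-open end.
    return start - 1, end


def python_to_one_based_closed(start, end):
    """Convert a 0-based, half-open interval [start, end) to 1-based, closed."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


seq = "ACGTACGT"
start, end = one_based_closed_to_python(3, 5)  # bases 3..5 in 1-based counting
print(start, end, seq[start:end])
```

Once every parser converts on the way in and every writer converts on the way out, the internal data structure can be sliced directly with ordinary Python indexing, and the interval length is simply end - start.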
I am repeating myself, but my general suggestion is to start writing your proposal as soon as possible. A concrete first draft will help with more detailed comments. The wiki has good information on the project plan: http://open-bio.org/wiki/Google_Summer_of_Code#When_you_apply and the NESCent wiki has some examples of well-written proposals from previous years: http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2012#Writing_your_application One of the key aspects is having a detailed week-by-week outline of your plans for the summer. Brad From reece at harts.net Mon Mar 26 12:07:25 2012 From: reece at harts.net (Reece Hart) Date: Mon, 26 Mar 2012 05:07:25 -0700 Subject: [Biopython] GSOC 2012: Representation and manipulation of genomic variants In-Reply-To: References: Message-ID: Hi Chris- Great. The only thing I have to add to what Peter and Brad said is that you should feel free to refine your proposal with us (GSoC mentors) and/or the BioPython community. -Reece From cjfields at illinois.edu Mon Mar 26 17:24:08 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 26 Mar 2012 17:24:08 +0000 Subject: [Biopython] Query for GSoc projects on SearchIO and Representation and manipulation of genomic variants In-Reply-To: References: Message-ID: On Mar 26, 2012, at 4:19 AM, Peter Cock wrote: > On Mon, Mar 26, 2012 at 5:31 AM, Ankesh Thakur wrote: >> Dear Sir, >> I am a student of Biological Sciences and bioengineering at Indian >> Institute of Technology, Kanpur (IIT Kanpur). I am willing to write >> codes for Biopython during this summer. I am not very much clear about >> the goals of this project. I want to know more about the suggested >> projects, like what else I need to do apart from conversion of one file >> format to other and showing the data on the console in human readable >> form. >> >> I have no prior experience with bio modules of python. I have arround than >> seven months experience with python git hub. 
>> And I have done Molecular
>> biology, Genetics and Bio-chemistry courses. I would like to learn
>> Biopython, BioPerl( if required) and other necessary tools during this
>> summer. Eagerly waiting for your reply.
>>
>> Regards,
>> Ankesh Kumar Thakur.
>
> Hello Ankesh,
>
> Both the SearchIO and genomic variant GSoC project ideas are
> more than just file format conversion and 'pretty printing' at the
> console. An essential part of this is designing a suitable object
> representation for efficient use of the data. That probably means
> creating objects (Python classes). This will require both a good
> understanding of the meaning of the data being represented
> (e.g. how are BLAST search results structured) but also how
> to design Python objects.
>
> For the SearchIO project, I went into a lot more detail on the
> Biopython development mailing list last week:
> http://lists.open-bio.org/pipermail/biopython-dev/2012-March/009468.html
>
> Peter

This might be a good opportunity to go over what works in the BioPerl SearchIO implementations, what doesn't, etc. The vast majority of the speed issues we (BioPerl) have seen with SearchIO seem to have much more to do with object generation than with parsing (I think Ruby has the same issue).

BioPerl's SearchIO is summarized in the HOWTO:

http://www.bioperl.org/wiki/HOWTO:SearchIO

Simple enough: each report is divided up into one or more Results, each of which can have multiple Hits, each of which in turn can have multiple HSPs. HSPs are also paired SeqFeatures, one for the query, one for the hit (I think this was implemented later).

Some basic notes about the BLAST parser design (SAX-like), written by Steve Chervitz during the time this was drawn up, are here:

https://github.com/bioperl/bioperl-live/blob/master/Bio/SearchIO/blast.pm#L2440

This doesn't apply to all SearchIO parsers, but it gives an idea of the thoughts behind it.
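
The Result / Hit / HSP nesting described above could be sketched in Python roughly as follows. All class and attribute names here are hypothetical illustrations; they are neither BioPerl's API nor a proposed Biopython one (Python 3.7+ for dataclasses).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HSP:
    """One high-scoring segment pair: paired coordinates on query and hit."""
    query_start: int
    query_end: int
    hit_start: int
    hit_end: int
    evalue: float


@dataclass
class Hit:
    """One database sequence matched by the query; holds its HSPs."""
    hit_id: str
    hsps: List[HSP] = field(default_factory=list)


@dataclass
class Result:
    """All hits for a single query within a search report."""
    query_id: str
    hits: List[Hit] = field(default_factory=list)


def best_evalue(result: Result) -> Optional[float]:
    """Return the smallest e-value over every HSP, or None if there are none."""
    return min(
        (hsp.evalue for hit in result.hits for hsp in hit.hsps),
        default=None,
    )


result = Result(
    query_id="query1",
    hits=[
        Hit("hitA", [HSP(0, 50, 10, 60, 1e-20), HSP(60, 90, 100, 130, 1e-5)]),
        Hit("hitB", [HSP(5, 40, 0, 35, 1e-8)]),
    ],
)
print(len(result.hits), best_evalue(result))
```

Given the remark above that object generation, rather than parsing, dominated the run time in BioPerl, a real design might want to build Hit and HSP objects lazily; this sketch does not attempt that.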
chris

From mictadlo at gmail.com Tue Mar 27 04:33:08 2012
From: mictadlo at gmail.com (Mic)
Date: Tue, 27 Mar 2012 14:33:08 +1000
Subject: [Biopython] GSoC project application: Representation and manipulation of genomic variants: Some questions
In-Reply-To: <87398vmxhj.fsf@fastmail.fm>
References: <87398vmxhj.fsf@fastmail.fm>
Message-ID:

Hello,
http://code.google.com/p/pysam/downloads/detail?name=pysam-0.5.tar.gz&can=2&q=
*added vcf parsing*
What is the difference between pysam's VCF and PyVCF?

On Mon, Mar 26, 2012 at 9:07 PM, Brad Chapman wrote:
>
> Chaitanya;
> Thanks for the interest and specific questions.
>
> > 1. For the implementation of variants what would be better, to create
> > a new SeqVariant class from scratch or to extend the SeqFeature class
> > to accomodate variants? I guess a separate class would be better.
>
> My preference would be to see how far the SeqFeature class can take you
> before implementing a new class. It should be general enough to handle
> variant data, but the bigger challenge might be designing a lightweight
> representation that is compatible with existing SeqFeatures.
>
> > 2. While looking at the Biopython wiki I came across an implementation
> > of GFF at
> > https://github.com/chapmanb/bcbb/tree/master/gff
> > As GVF is an extension of GFF3, this module could be used for reading
> > GVF's too. Is this module a good start to modify it to support GVFs?
>
> That would be perfect. We're hoping to merge this into the Biopython
> code base before the next release. There is also an existing VCF parser
> we'd love to use here:
>
> https://github.com/jamescasbon/PyVCF
>
> > 3. I've been going through the VCF documentation and SNPs, insertions
> > and deletions can be represented just like it is done in VCF, the
> > object would have a start position, length of reference sequence(no
> > need to store this sequence) and a list of alternate sequence objects.
> > I have to still look into the SV(Structural variants), rearrangements > > and imprecise variant information, so this representation is only for > > SNPs and small indels. The GVF has a very similar format for small > > indels and SNPs, just that it provides an extra end position column > > which is not required if we have the reference sequence. > > This sounds good. My general suggestion is to start writing your > proposal as soon as possible. A concrete first draft will help with more > detailed comments. The wiki has good information on the project plan: > > http://open-bio.org/wiki/Google_Summer_of_Code#When_you_apply > > and the NESCent wiki has some examples of well-written proposals from > previous years: > > > http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2012#Writing_your_application > > One of the key aspects is having a detailed week-by-week outline of your > plans for the summer. > > Thanks again for the interest, > Brad > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From chapmanb at 50mail.com Tue Mar 27 10:20:26 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 27 Mar 2012 06:20:26 -0400 Subject: [Biopython] GSoC project application: Representation and manipulation of genomic variants: Some questions In-Reply-To: References: <87398vmxhj.fsf@fastmail.fm> Message-ID: <87vclqjqfp.fsf@fastmail.fm> Mic; > http://code.google.com/p/pysam/downloads/detail?name=pysam-0.5.tar.gz&can=2&q= > *added vcf parsing* > What is the difference between pysam's VCF and PyVCF?* Good point, thanks for mentioning this. pysam's VCF is also worth exploring as a base for the variant representation. I added links to it and the other resources on the GSoC project description page. 
Thanks, Brad From chaitanya.talnikar at iitb.ac.in Tue Mar 27 18:57:45 2012 From: chaitanya.talnikar at iitb.ac.in (Chaitanya Talnikar) Date: Wed, 28 Mar 2012 00:27:45 +0530 Subject: [Biopython] GSoC project application: Representation and manipulation of genomic variants: Some questions In-Reply-To: <87398vmxhj.fsf@fastmail.fm> References: <87398vmxhj.fsf@fastmail.fm> Message-ID: Hi, I have uploaded the first draft of my project proposal. I will add more sections to the project plan in a day or two. Just wanted to have the initial draft up. I hope to write a better proposal with your feedback. Regards, Chaitanya On Mon, Mar 26, 2012 at 4:37 PM, Brad Chapman wrote: > > Chaitanya; > Thanks for the interest and specific questions. > >> 1. For the implementation of variants what would be better, to create >> a new SeqVariant class from scratch or to extend the SeqFeature class >> to accomodate variants? I guess a separate class would be better. > > My preference would be to see how far the SeqFeature class can take you > before implementing a new class. It should be general enough to handle > variant data, but the bigger challenge might be designing a lightweight > representation that is compatible with existing SeqFeatures. > >> 2. While looking at the Biopython wiki I came across an implementation >> of GFF at >> https://github.com/chapmanb/bcbb/tree/master/gff >> As GVF is an extension of GFF3, this module could be used for reading >> GVF's too. Is this module a good start to modify it to support GVFs? > > That would be perfect. We're hoping to merge this into the Biopython > code base before the next release. There is also an existing VCF parser > we'd love to use here: > > https://github.com/jamescasbon/PyVCF > >> 3. 
I've been going through the VCF documentation and SNPs, insertions >> and deletions can be represented just like it is done in VCF, the >> object would have a start position, length of reference sequence(no >> need to store this sequence) and a list of alternate sequence objects. >> I have to still look into the SV(Structural variants), rearrangements >> and imprecise variant information, so this representation is only for >> SNPs and small indels. The GVF has a very similar format for small >> indels and SNPs, just that it provides an extra end position column >> which is not required if we have the reference sequence. > > This sounds good. My general suggestion is to start writing your > proposal as soon as possible. A concrete first draft will help with more > detailed comments. The wiki has good information on the project plan: > > http://open-bio.org/wiki/Google_Summer_of_Code#When_you_apply > > and the NESCent wiki has some examples of well-written proposals from > previous years: > > http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2012#Writing_your_application > > One of the key aspects is having a detailed week-by-week outline of your > plans for the summer. > > Thanks again for the interest, > Brad From chapmanb at 50mail.com Wed Mar 28 00:43:33 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 27 Mar 2012 20:43:33 -0400 Subject: [Biopython] GSoC project application: Representation and manipulation of genomic variants: Some questions In-Reply-To: References: <87398vmxhj.fsf@fastmail.fm> Message-ID: <874nt9k11m.fsf@fastmail.fm> Chaitanya; The easiest way to work on your proposal is to write it in a public Google Doc and then share with the list. I don't yet have access to all of the Melange GSoC project and I'd imagine others who might have thoughts are in the same boat. As a side benefit it's also much easier to collaborate on editing and notes. Brad > Hi, > I have uploaded the first draft of my project proposal. 
I will add > more sections to the project plan in a day or two. Just wanted to have > the initial draft up. I hope to write a better proposal with your > feedback. > > Regards, > Chaitanya > > On Mon, Mar 26, 2012 at 4:37 PM, Brad Chapman wrote: > > > > Chaitanya; > > Thanks for the interest and specific questions. > > > >> 1. For the implementation of variants what would be better, to create > >> a new SeqVariant class from scratch or to extend the SeqFeature class > >> to accomodate variants? I guess a separate class would be better. > > > > My preference would be to see how far the SeqFeature class can take you > > before implementing a new class. It should be general enough to handle > > variant data, but the bigger challenge might be designing a lightweight > > representation that is compatible with existing SeqFeatures. > > > >> 2. While looking at the Biopython wiki I came across an implementation > >> of GFF at > >> https://github.com/chapmanb/bcbb/tree/master/gff > >> As GVF is an extension of GFF3, this module could be used for reading > >> GVF's too. Is this module a good start to modify it to support GVFs? > > > > That would be perfect. We're hoping to merge this into the Biopython > > code base before the next release. There is also an existing VCF parser > > we'd love to use here: > > > > https://github.com/jamescasbon/PyVCF > > > >> 3. I've been going through the VCF documentation and SNPs, insertions > >> and deletions can be represented just like it is done in VCF, the > >> object would have a start position, length of reference sequence(no > >> need to store this sequence) and a list of alternate sequence objects. > >> I have to still look into the SV(Structural variants), rearrangements > >> and imprecise variant information, so this representation is only for > >> SNPs and small indels. 
The GVF has a very similar format for small > >> indels and SNPs, just that it provides an extra end position column > >> which is not required if we have the reference sequence. > > > > This sounds good. My general suggestion is to start writing your > > proposal as soon as possible. A concrete first draft will help with more > > detailed comments. The wiki has good information on the project plan: > > > > http://open-bio.org/wiki/Google_Summer_of_Code#When_you_apply > > > > and the NESCent wiki has some examples of well-written proposals from > > previous years: > > > > http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2012#Writing_your_application > > > > One of the key aspects is having a detailed week-by-week outline of your > > plans for the summer. > > > > Thanks again for the interest, > > Brad From chaitanya.talnikar at iitb.ac.in Wed Mar 28 10:19:04 2012 From: chaitanya.talnikar at iitb.ac.in (Chaitanya Talnikar) Date: Wed, 28 Mar 2012 15:49:04 +0530 Subject: [Biopython] GSoC project application: Representation and manipulation of genomic variants: Some questions In-Reply-To: <874nt9k11m.fsf@fastmail.fm> References: <87398vmxhj.fsf@fastmail.fm> <874nt9k11m.fsf@fastmail.fm> Message-ID: Here's the google doc link, I have made it editable too. https://docs.google.com/document/d/12N1aEzagMZ8akc1mrfP4MxHdILT2wapjENJOoxZBIh0/edit On Wed, Mar 28, 2012 at 6:13 AM, Brad Chapman wrote: > > Chaitanya; > The easiest way to work on your proposal is to write it in a > public Google Doc and then share with the list. I don't yet have access > to all of the Melange GSoC project and I'd imagine others who might > have thoughts are in the same boat. As a side benefit it's also much > easier to collaborate on editing and notes. > > Brad > >> Hi, >> I have uploaded the first draft of my project proposal. I will add >> more sections to the project plan in a day or two. Just wanted to have >> the initial draft up. 
I hope to write a better proposal with your >> feedback. >> >> Regards, >> Chaitanya >> >> On Mon, Mar 26, 2012 at 4:37 PM, Brad Chapman wrote: >> > >> > Chaitanya; >> > Thanks for the interest and specific questions. >> > >> >> 1. For the implementation of variants what would be better, to create >> >> a new SeqVariant class from scratch or to extend the SeqFeature class >> >> to accomodate variants? I guess a separate class would be better. >> > >> > My preference would be to see how far the SeqFeature class can take you >> > before implementing a new class. It should be general enough to handle >> > variant data, but the bigger challenge might be designing a lightweight >> > representation that is compatible with existing SeqFeatures. >> > >> >> 2. While looking at the Biopython wiki I came across an implementation >> >> of GFF at >> >> https://github.com/chapmanb/bcbb/tree/master/gff >> >> As GVF is an extension of GFF3, this module could be used for reading >> >> GVF's too. Is this module a good start to modify it to support GVFs? >> > >> > That would be perfect. We're hoping to merge this into the Biopython >> > code base before the next release. There is also an existing VCF parser >> > we'd love to use here: >> > >> > https://github.com/jamescasbon/PyVCF >> > >> >> 3. I've been going through the VCF documentation and SNPs, insertions >> >> and deletions can be represented just like it is done in VCF, the >> >> object would have a start position, length of reference sequence(no >> >> need to store this sequence) and a list of alternate sequence objects. >> >> I have to still look into the SV(Structural variants), rearrangements >> >> and imprecise variant information, so this representation is only for >> >> SNPs and small indels. The GVF has a very similar format for small >> >> indels and SNPs, just that it provides an extra end position column >> >> which is not required if we have the reference sequence. >> > >> > This sounds good. 
My general suggestion is to start writing your
>> > proposal as soon as possible. A concrete first draft will help with more
>> > detailed comments. The wiki has good information on the project plan:
>> >
>> > http://open-bio.org/wiki/Google_Summer_of_Code#When_you_apply
>> >
>> > and the NESCent wiki has some examples of well-written proposals from
>> > previous years:
>> >
>> > http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2012#Writing_your_application
>> >
>> > One of the key aspects is having a detailed week-by-week outline of your
>> > plans for the summer.
>> >
>> > Thanks again for the interest,
>> > Brad

From ferreirafm at usp.br Wed Mar 28 12:54:44 2012
From: ferreirafm at usp.br (ferreirafm at usp.br)
Date: Wed, 28 Mar 2012 09:54:44 -0300
Subject: [Biopython] help with NCBIXML.parse
Message-ID: <20120328095444.663777w48sm3kpp0@webmail.usp.br>

Hi there,
What am I doing wrong in the following piece of code?
Thanks in advance,
Fred

#### CODE ####
def blast_cmd(query_seq):
    outf = open('blast_out.xml', 'w')
    for subj_seq in glob.iglob('emm*.fasta'):
        blast_cline = NcbiblastpCommandline(cmd = "blastp",
                                            task = "blastp-short",
                                            query = query_seq,
                                            subject = subj_seq,
                                            ungapped = True,
                                            comp_based_stats = "0",
                                            max_target_seqs = "1",
                                            matrix = "PAM30",
                                            outfmt = "5")
        stdout, stderr = blast_cline()
        outf.write(stdout)
    outf.close()
    handle = open("blast_out.xml")
    blast_records = NCBIXML.parse(handle)
    for record in blast_records:
        print record

#### RESULTS ####
$ run_blast.py --blast query.fasta
Traceback (most recent call last):
  File "/home/ferreirafm/bin/redundancy.py", line 121, in
    main()
  File "/home/ferreirafm/bin/redundancy.py", line 106, in main
    blast_cmd(query_seq)
  File "/home/ferreirafm/bin/redundancy.py", line 63, in blast_cmd
    for record in blast_records:
  File "/usr/lib64/python2.7/site-packages/Bio/Blast/NCBIXML.py", line 652, in parse
    expat_parser.Parse(text, False)
xml.parsers.expat.ExpatError: junk after document element: line 88, column 14

From
p.j.a.cock at googlemail.com Wed Mar 28 13:08:28 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 28 Mar 2012 14:08:28 +0100
Subject: [Biopython] help with NCBIXML.parse
In-Reply-To: <20120328095444.663777w48sm3kpp0@webmail.usp.br>
References: <20120328095444.663777w48sm3kpp0@webmail.usp.br>
Message-ID:

On Wed, Mar 28, 2012 at 1:54 PM, wrote:
> Hi there,
> What am I doing wrong in the following piece of code?
> Thanks in advance,
> Fred

You seem to be calling BLAST multiple times in a loop and
trying to give it SeqRecord objects. It wants FASTA files,
and you can call BLAST once with a single FASTA query
file (containing multiple records) and a single database or
FASTA subject file (also containing multiple records).

As to the specific error, did you look at your blast_out.xml
file and what it said on line 88?

Peter

From ferreirafm at usp.br Wed Mar 28 14:03:51 2012
From: ferreirafm at usp.br (ferreirafm at usp.br)
Date: Wed, 28 Mar 2012 11:03:51 -0300
Subject: [Biopython] help with NCBIXML.parse
In-Reply-To:
References: <20120328095444.663777w48sm3kpp0@webmail.usp.br>
Message-ID: <20120328110351.6152656322vvivd3@webmail.usp.br>

Hi Peter,
Thanks for the answer.

Quoting Peter Cock:
> You seem to be calling BLAST multiple times in a loop and
> trying to give it SeqRecord objects.

Yes, because I want only one hit per sequence. If someone has a
workaround for this, it would be great. If I run it with a multi-FASTA
file, I get several hits per sequence.
Like this:

P02977  emm1.22.pep   100.00  2  0  0  15  16   90   91  9.4    9.2
P02977  emm1.22.pep   100.00  2  0  0  14  15  104  105  9.4    9.2
P02977  emm1-2.3.pep   62.50  8  3  0   8  15  196  203  0.033  17.5
P02977  emm1.23.pep    62.50  8  3  0   8  15  196  203  0.033  17.5
P02977  emm1-2.4.pep  100.00  2  0  0  15  16   99  100  5.0    9.2
P02977  emm1.24.pep   100.00  2  0  0  15  16   88   89  7.5    9.2
P02977  emm1.24.pep   100.00  2  0  0  14  15  102  103  7.5    9.2
P02977  emm1.25.pep   100.00  2  0  0  15  16   81   82  4.3    9.2

> It wants FASTA files,
> and you can call BLAST once with a single FASTA query
> file (containing multiple records) and a single database or
> FASTA subject file (also containing multiple records).
>
> As to the specific error, did you look at your blast_out.xml
> file and what it said on line 88?
>

Line 88 is a second "header" of the XML file. It seems xmlparse can't handle it.

> Peter
>

From p.j.a.cock at googlemail.com Wed Mar 28 14:19:05 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 28 Mar 2012 15:19:05 +0100
Subject: [Biopython] help with NCBIXML.parse
In-Reply-To: <20120328110351.6152656322vvivd3@webmail.usp.br>
References: <20120328095444.663777w48sm3kpp0@webmail.usp.br> <20120328110351.6152656322vvivd3@webmail.usp.br>
Message-ID:

On Wed, Mar 28, 2012 at 3:03 PM, wrote:
> Quoting Peter Cock:
>> You seem to be calling BLAST multiple times in a loop and
>> trying to give it SeqRecord objects.
>
>
> Yes, because I want only one hit per sequence. If someone has a
> workaround for this, it would be great. If I run it with a multi-FASTA file,
> I get several hits per sequence. Like this:
>
> ...

Try using the -max_target_seqs argument.

>> As to the specific error, did you look at your blast_out.xml
>> file and what it said on line 88?
>
> Line 88 is a second "header" of the XML file. It seems xmlparse can't handle
> it.
>
>
> "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
>

That is not allowed in XML.
On re-reading your code, I see
this happens because you are effectively concatenating the
output for several BLAST runs (via stdout) into the one file.

Historically the NCBI BLAST tools used to do something like
this but with on a new line, so we do
have some special case code to cope with that. You could
try making this small change:

outf.write(stdout)

to:

outf.write(stdout)
outf.write("\n")

That might work. However that isn't an elegant solution
because if it works it relies on some special case code
in Biopython for an NCBI bug.

Instead you could parse each output inside the for loop?

Peter

From ferreirafm at usp.br Wed Mar 28 14:59:52 2012
From: ferreirafm at usp.br (ferreirafm at usp.br)
Date: Wed, 28 Mar 2012 11:59:52 -0300
Subject: [Biopython] help with NCBIXML.parse
In-Reply-To:
References: <20120328095444.663777w48sm3kpp0@webmail.usp.br> <20120328110351.6152656322vvivd3@webmail.usp.br>
Message-ID: <20120328115952.76164qh9rj1aopyg@webmail.usp.br>

Quoting Peter Cock:
> Try using the -max_target_seqs argument.

I have already tried it.
I ran BLAST with -max_target_seqs 1 on a multi-FASTA file. See the results in my
last post.

>
> That is not allowed in XML. On re-reading your code, I see
> this happens because you are effectively concatenating the
> output for several BLAST runs (via stdout) into the one file.
>
> Historically the NCBI BLAST tools used to do something like
> this but with on a new line, so we do
> have some special case code to cope with that. You could
> try making this small change:
>
> outf.write(stdout)
>
> to:
>
> outf.write(stdout)
> outf.write("\n")

Yep, it works.

>
> That might work. However that isn't an elegant solution
> because if it works it relies on some special case code
> in Biopython for an NCBI bug.
>
> Instead you could parse each output inside the for loop?
That's a solution, but this way I would have to do it several times,
which would be even less Pythonic.

>
> Peter
>

From p.j.a.cock at googlemail.com Wed Mar 28 15:26:46 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 28 Mar 2012 16:26:46 +0100
Subject: [Biopython] help with NCBIXML.parse
In-Reply-To: <20120328115952.76164qh9rj1aopyg@webmail.usp.br>
References: <20120328095444.663777w48sm3kpp0@webmail.usp.br> <20120328110351.6152656322vvivd3@webmail.usp.br> <20120328115952.76164qh9rj1aopyg@webmail.usp.br>
Message-ID:

On Wed, Mar 28, 2012 at 3:59 PM, wrote:
> Quoting Peter Cock:
>
>> Try using the -max_target_seqs argument.
>
> I have already tried it.
> I ran BLAST with -max_target_seqs 1 on a multi-FASTA file. See the results in my
> last post.
>

What does 'print blast_cline' give? i.e. What is the actual command
being called.

Which version of NCBI BLAST+ are you using?

Peter

From ferreirafm at usp.br Wed Mar 28 15:32:38 2012
From: ferreirafm at usp.br (ferreirafm at usp.br)
Date: Wed, 28 Mar 2012 12:32:38 -0300
Subject: [Biopython] help with NCBIXML.parse
In-Reply-To:
References: <20120328095444.663777w48sm3kpp0@webmail.usp.br> <20120328110351.6152656322vvivd3@webmail.usp.br> <20120328115952.76164qh9rj1aopyg@webmail.usp.br>
Message-ID: <20120328123238.63565uv5g501qrkm@webmail.usp.br>

> Quoting Peter Cock:
>
> What does 'print blast_cline' give? i.e. What is the actual command
> being called.

A long list of:

...

>
> Which version of NCBI BLAST+ are you using?
2.2.26+

>
> Peter
>

From p.j.a.cock at googlemail.com Wed Mar 28 15:37:40 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 28 Mar 2012 16:37:40 +0100
Subject: [Biopython] help with NCBIXML.parse
In-Reply-To: <20120328123238.63565uv5g501qrkm@webmail.usp.br>
References: <20120328095444.663777w48sm3kpp0@webmail.usp.br> <20120328110351.6152656322vvivd3@webmail.usp.br> <20120328115952.76164qh9rj1aopyg@webmail.usp.br> <20120328123238.63565uv5g501qrkm@webmail.usp.br>
Message-ID:

On Wed, Mar 28, 2012 at 4:32 PM, wrote:
>
>> Quoting Peter Cock:
>>
>> What does 'print blast_cline' give? i.e. What is the actual command
>> being called.
>
> A long list of:
>
> ...

That's the BLAST parser's output - I mean the NcbiblastpCommandline
object you assigned to variable blast_cline.

>> Which version of NCBI BLAST+ are you using?
>
> 2.2.26+

Huh, that is the latest. I'm still using 2.2.25+ here.

Peter

From ferreirafm at usp.br Wed Mar 28 17:31:38 2012
From: ferreirafm at usp.br (ferreirafm at usp.br)
Date: Wed, 28 Mar 2012 14:31:38 -0300
Subject: [Biopython] help with NCBIXML.parse
In-Reply-To:
References: <20120328095444.663777w48sm3kpp0@webmail.usp.br> <20120328110351.6152656322vvivd3@webmail.usp.br> <20120328115952.76164qh9rj1aopyg@webmail.usp.br> <20120328123238.63565uv5g501qrkm@webmail.usp.br>
Message-ID: <20120328143138.46662ec6ytufdgca@webmail.usp.br>

Quoting Peter Cock:
> That's the BLAST parser's output - I mean the NcbiblastpCommandline
> object you assigned to variable blast_cline.

In my last post I meant that it got past the print loop. However, I don't know
what to do with this. I was expecting to see the BLAST alignment results
when printing a blast record. Isn't that how it works?

> Huh, that is the latest. I'm still using 2.2.25+ here.
>
> Peter
>

and...???
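
Editorial sketch for the thread above (not from the original posters):
the ExpatError arises because several XML reports were written back to
back into one file. One possible way to cope, assuming each report
starts with its own <?xml ...?> declaration, is to split the text at
those declarations and parse each piece on its own. The helper name
split_xml_documents and the tiny <report> stand-in documents are made up
for illustration, and the code is Python 3 whereas the thread uses
Python 2. With real BLAST output, each piece would presumably be wrapped
in a StringIO and handed to NCBIXML.read instead of ElementTree.

```python
import re
import xml.etree.ElementTree as ET

# Matches the XML declaration assumed to start every report.
_XML_DECL = re.compile(r"<\?xml[^>]*\?>")


def split_xml_documents(text):
    """Split text holding several concatenated XML documents.

    Hypothetical helper: anything before the first declaration is dropped.
    """
    starts = [m.start() for m in _XML_DECL.finditer(text)]
    if not starts:
        # No declaration found: treat the whole text as one document.
        return [text] if text.strip() else []
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]


# Two stand-in reports glued together with no separator, mimicking
# repeated outf.write(stdout) calls into a single file.
doc1 = '<?xml version="1.0"?>\n<report><name>run1</name></report>'
doc2 = '<?xml version="1.0"?>\n<report><name>run2</name></report>'
combined = doc1 + doc2

docs = split_xml_documents(combined)
names = [ET.fromstring(doc).findtext("name") for doc in docs]
print(names)
```

Parsing each run's output inside the loop, as Peter suggests, avoids the
need for any splitting; this sketch only applies when the combined file
already exists.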
From alfonso.esposito1983 at hotmail.it Thu Mar 29 14:12:53 2012
From: alfonso.esposito1983 at hotmail.it (fonz esposito)
Date: Thu, 29 Mar 2012 16:12:53 +0200
Subject: [Biopython] Blast sequences and SNPs detection
Message-ID:

Dear All,

I am Alfonso Esposito, I am a PhD student in environmental microbiology and I am
quite new to the Python community. I am trying to figure out how to write a script
but I am going mad. I need a script that takes as input a FASTA file with N
sequences, BLASTs them against the nucleotide collection at NCBI and delivers an
output file containing each SNP or gap with the corresponding nucleotide position
(for example position 123 A->G or gap between 145 and 146)... Thanks everybody,
and I hope to receive your answer.

Regards
Alfonso

From p.j.a.cock at googlemail.com Thu Mar 29 14:27:24 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 29 Mar 2012 15:27:24 +0100
Subject: [Biopython] Blast sequences and SNPs detection
In-Reply-To:
References:
Message-ID:

On Thu, Mar 29, 2012 at 3:12 PM, fonz esposito wrote:
>
> Dear All,
>
> I am Alfonso Esposito, I am a PhD student in environmental microbiology and I am
> quite new to the Python community. I am trying to figure out how to write a script
> but I am going mad. I need a script that takes as input a FASTA file with N
> sequences, BLASTs them against the nucleotide collection at NCBI and delivers an
> output file containing each SNP or gap with the corresponding nucleotide position
> (for example position 123 A->G or gap between 145 and 146)... Thanks everybody,
> and I hope to receive your answer.

Hello Alfonso,

I am confused about your aim here. Surely a dedicated SNP
detection tool would be more appropriate than BLAST?

BLAST finds similar sequences, it doesn't find SNPs. Are you
hoping to take the matched sequences and look up their
annotation for SNPs?
Or are you wanting to treat BLAST pairwise sequence alignments as if there were alternative strains/alleles and interpret the differences as SNPs? Perhaps you plan to restrict your BLAST search to a known accession/reference genome? Also if your FASTA file with N sequence in it is actually high throughput sequencing reads (e.g. Illumina reads), you probably want to start with a mapping tool like BWA to do the alignment, not BLAST. Peter From zhigang.wu at email.ucr.edu Thu Mar 29 16:51:41 2012 From: zhigang.wu at email.ucr.edu (Zhigang Wu) Date: Thu, 29 Mar 2012 09:51:41 -0700 Subject: [Biopython] Biopython GSoC Proposal Message-ID: Hi Biopython community, Here I am posting my draft of proposal, in which I have proposed to implement the SearchIO module. Please follow the link to access it https://docs.google.com/document/d/15fkPAZfN2Ln8nMJr4Ad7lMscaGbKOiTaXcGpxxvIe3A/edit Any comments and remarks are welcome. Zhigang From p.j.a.cock at googlemail.com Thu Mar 29 18:31:21 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 29 Mar 2012 19:31:21 +0100 Subject: [Biopython] Blast sequences and SNPs detection In-Reply-To: References: Message-ID: I'm assuming Fonz meant to send this to the list, my reply is below. On Thu, Mar 29, 2012 at 7:21 PM, fonz esposito wrote: > Dear Peter and dear all, > > first of all thanks for answering me so quickly, then I will try to explain > better my problem: I have sequences from DGGE bands, they have some > mistakes, mainly invalid basecall so I need to blast every single sequence > (after trimming the first and last bases from the AB1) on NCBI, and then > compare it to the best hit, checking out every mismatch. This could be > automated, I did with biopython the blast and I can process the output but I > did not manage to indicate the exact nucleotide number and what the mismatch > is, and when there is a gap I don't exactly know how to tell the program to > output the gap location in the original sequence I blasted. 
> > I hope that I was clearer now, let me know if you can help me > > Alfonso So these are 'Sanger' capillary reads, and while you may have lots I'm guessing this is under 100 in all? In that case using BLAST is probably going to be OK - although depending on how many sequences you have you might want to run that locally rather than at the NCBI. Which database are you intending to search against? i.e. Do you know what organism your bands should be from (or even what kind of organism)? What are you trying to do with any suspect bases where your sequences differ from those in the database? I personally (if the number of sequences was quite small) might think about working directly from BLAST pairwise alignment to go back to the chromatogram in Chromas (or an equivalent tool) to see if the base call can be manually corrected, or is the difference appears to be real. Peter P.S. You can read the (trimmed) sequences from ABI/AB1 files directly within Biopython 1.58 or later. From chapmanb at 50mail.com Fri Mar 30 01:13:46 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 29 Mar 2012 21:13:46 -0400 Subject: [Biopython] GSoC project application: Representation and manipulation of genomic variants: Some questions In-Reply-To: References: <87398vmxhj.fsf@fastmail.fm> <874nt9k11m.fsf@fastmail.fm> Message-ID: <87wr626gc5.fsf@fastmail.fm> Chaitanya; Thanks for making this available. It's a great start and you need to work from here on being much more detailed in your project plan. I left specific comments in-line in the proposal. Let us know when you have a revised version and we can work more. Thanks again, Brad > Here's the google doc link, I have made it editable too. > > https://docs.google.com/document/d/12N1aEzagMZ8akc1mrfP4MxHdILT2wapjENJOoxZBIh0/edit > > On Wed, Mar 28, 2012 at 6:13 AM, Brad Chapman wrote: > > > > Chaitanya; > > The easiest way to work on your proposal is to write it in a > > public Google Doc and then share with the list. 
I don't yet have access > > to all of the Melange GSoC project and I'd imagine others who might > > have thoughts are in the same boat. As a side benefit it's also much > > easier to collaborate on editing and notes. > > > > Brad > > > >> Hi, > >> I have uploaded the first draft of my project proposal. I will add > >> more sections to the project plan in a day or two. Just wanted to have > >> the initial draft up. I hope to write a better proposal with your > >> feedback. > >> > >> Regards, > >> Chaitanya > >> > >> On Mon, Mar 26, 2012 at 4:37 PM, Brad Chapman wrote: > >> > > >> > Chaitanya; > >> > Thanks for the interest and specific questions. > >> > > >> >> 1. For the implementation of variants what would be better, to create > >> >> a new SeqVariant class from scratch or to extend the SeqFeature class > >> >> to accomodate variants? I guess a separate class would be better. > >> > > >> > My preference would be to see how far the SeqFeature class can take you > >> > before implementing a new class. It should be general enough to handle > >> > variant data, but the bigger challenge might be designing a lightweight > >> > representation that is compatible with existing SeqFeatures. > >> > > >> >> 2. While looking at the Biopython wiki I came across an implementation > >> >> of GFF at > >> >> https://github.com/chapmanb/bcbb/tree/master/gff > >> >> As GVF is an extension of GFF3, this module could be used for reading > >> >> GVF's too. Is this module a good start to modify it to support GVFs? > >> > > >> > That would be perfect. We're hoping to merge this into the Biopython > >> > code base before the next release. There is also an existing VCF parser > >> > we'd love to use here: > >> > > >> > https://github.com/jamescasbon/PyVCF > >> > > >> >> 3. 
I've been going through the VCF documentation and SNPs, insertions > >> >> and deletions can be represented just like it is done in VCF, the > >> >> object would have a start position, length of reference sequence(no > >> >> need to store this sequence) and a list of alternate sequence objects. > >> >> I have to still look into the SV(Structural variants), rearrangements > >> >> and imprecise variant information, so this representation is only for > >> >> SNPs and small indels. The GVF has a very similar format for small > >> >> indels and SNPs, just that it provides an extra end position column > >> >> which is not required if we have the reference sequence. > >> > > >> > This sounds good. My general suggestion is to start writing your > >> > proposal as soon as possible. A concrete first draft will help with more > >> > detailed comments. The wiki has good information on the project plan: > >> > > >> > http://open-bio.org/wiki/Google_Summer_of_Code#When_you_apply > >> > > >> > and the NESCent wiki has some examples of well-written proposals from > >> > previous years: > >> > > >> > http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2012#Writing_your_application > >> > > >> > One of the key aspects is having a detailed week-by-week outline of your > >> > plans for the summer. > >> > > >> > Thanks again for the interest, > >> > Brad From chapmanb at 50mail.com Fri Mar 30 01:15:27 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 29 Mar 2012 21:15:27 -0400 Subject: [Biopython] Biopython GSoC Proposal In-Reply-To: References: Message-ID: <87ty166g9c.fsf@fastmail.fm> Zhigang; > Here I am posting my draft of proposal, in which I have proposed to > implement the SearchIO module. Please follow the link to access it > https://docs.google.com/document/d/15fkPAZfN2Ln8nMJr4Ad7lMscaGbKOiTaXcGpxxvIe3A/edit Thanks for putting this together. You've got an excellent start. I added comments in the document on specific areas. 
Let us know if you have any questions or need followup on any points. Thanks again, Brad From anna.kostikova at gmail.com Fri Mar 30 14:49:22 2012 From: anna.kostikova at gmail.com (Anna Kostikova) Date: Fri, 30 Mar 2012 16:49:22 +0200 Subject: [Biopython] SLEN analogue in entrez.efetch/entrez.esearch Message-ID: Dear list members, Is there a parameter in entrez.efetch/entrez.esearch which would allow to only look for and download records with the maximum sequence length of ? e.g. an analogue to SLEN parameter of the web interface of the NCBI website. Thanks a lot in advance, Anna From p.j.a.cock at googlemail.com Fri Mar 30 15:10:45 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 30 Mar 2012 16:10:45 +0100 Subject: [Biopython] SLEN analogue in entrez.efetch/entrez.esearch In-Reply-To: References: Message-ID: On Fri, Mar 30, 2012 at 3:49 PM, Anna Kostikova wrote: > Dear list members, > > Is there a parameter in entrez.efetch/entrez.esearch which would allow > to only look for and download records with the maximum sequence length > of ? e.g. an analogue to SLEN parameter of the web > interface of the NCBI website. > > Thanks a lot in advance, > Anna For esearch, have you checked the available search fields using einfo - shown in the Biopython Tutorial and also here: http://news.open-bio.org/news/2009/06/ncbi-einfo-biopython/ Both the nucleotide and protein databases do include SLEN as a search field for sequence length. Have you tried including something like 123[SLEN] in your Entrez search term? For efetch with a sequence database you can use seq_start and seq_stop to retrieve just part of the sequence. 
But that would just crop it:
http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchseq_help.html

Peter

From anna.kostikova at gmail.com Fri Mar 30 15:28:17 2012
From: anna.kostikova at gmail.com (Anna Kostikova)
Date: Fri, 30 Mar 2012 17:28:17 +0200
Subject: [Biopython] SLEN analogue in entrez.efetch/entrez.esearch
In-Reply-To:
References:
Message-ID:

Thanks a lot Peter for the advice and the link - super useful trick!
A very straightforward solution, indeed :)

handle = Entrez.esearch(db="nucleotide",term=organism +'[ORGN] AND 100:1000[SLEN]')

Thanks a lot again,
Anna

2012/3/30 Peter Cock :
> On Fri, Mar 30, 2012 at 3:49 PM, Anna Kostikova
> wrote:
>> Dear list members,
>>
>> Is there a parameter in entrez.efetch/entrez.esearch which would allow
>> to only look for and download records with the maximum sequence length
>> of ? e.g. an analogue to SLEN parameter of the web
>> interface of the NCBI website.
>>
>> Thanks a lot in advance,
>> Anna
>
> For esearch, have you checked the available search fields using
> einfo - shown in the Biopython Tutorial and also here:
> http://news.open-bio.org/news/2009/06/ncbi-einfo-biopython/
>
> Both the nucleotide and protein databases do include SLEN as a
> search field for sequence length. Have you tried including something
> like 123[SLEN] in your Entrez search term?
>
> For efetch with a sequence database you can use seq_start and seq_stop
> to retrieve just part of the sequence. But that would just crop it:
> http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchseq_help.html
>
> Peter

From p.j.a.cock at googlemail.com Fri Mar 30 15:41:55 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 30 Mar 2012 16:41:55 +0100
Subject: [Biopython] SLEN analogue in entrez.efetch/entrez.esearch
In-Reply-To:
References:
Message-ID:

On Fri, Mar 30, 2012 at 4:28 PM, Anna Kostikova wrote:
> Thanks a lot Peter for the advice and the link - super useful trick!
> a very straightaway solution, indeed:) > > handle = Entrez.esearch(db="nucleotide",term=organism +'[ORGN] AND > 100:1000[SLEN]') > > Thanks a lot again, > Anna Thanks for letting us know it worked :) How did you find the range trick you're using for the length search? Peter From anna.kostikova at gmail.com Fri Mar 30 16:55:39 2012 From: anna.kostikova at gmail.com (Anna Kostikova) Date: Fri, 30 Mar 2012 18:55:39 +0200 Subject: [Biopython] SLEN analogue in entrez.efetch/entrez.esearch In-Reply-To: References: Message-ID: > How did you find the range trick you're using for the length search? the idea came thanks to you. Essentially, the presence of SLEN term in your blog post on 'NCBI Entrez EInfo' pushed me to try the same syntax I'd use in perl or NCBI web interface. And it worked :) Anna 2012/3/30 Peter Cock : > On Fri, Mar 30, 2012 at 4:28 PM, Anna Kostikova > wrote: >> Thanks a lot Peter for the advice and the link - super useful trick! >> a very straightaway solution, indeed:) >> >> handle = Entrez.esearch(db="nucleotide",term=organism +'[ORGN] AND >> 100:1000[SLEN]') >> >> Thanks a lot again, >> Anna > > Thanks for letting us know it worked :) > > How did you find the range trick you're using for the length search? > > Peter From chris.mit7 at gmail.com Sat Mar 31 04:41:32 2012 From: chris.mit7 at gmail.com (Chris Mitchell) Date: Sat, 31 Mar 2012 00:41:32 -0400 Subject: [Biopython] GSOC Genome Variants proposal Message-ID: Hey everyone, Here's a draft of my proposal: https://docs.google.com/document/d/1DNm8NQmnP4fH8KvF9v107mo4__V-FqWtx8Vgsz5yNns/edit I've allowed comments to be put in. Please tear it to shreds :). 
Thanks, Chris From reece at harts.net Sat Mar 31 20:26:05 2012 From: reece at harts.net (Reece Hart) Date: Sat, 31 Mar 2012 13:26:05 -0700 Subject: [Biopython] GSOC Genome Variants proposal In-Reply-To: References: Message-ID: On Fri, Mar 30, 2012 at 9:41 PM, Chris Mitchell wrote: > Here's a draft of my proposal: > > > https://docs.google.com/document/d/1DNm8NQmnP4fH8KvF9v107mo4__V-FqWtx8Vgsz5yNns/edit > Thanks Chris. I'm reading this proposal and others this weekend. Thanks for submitting! -Reece
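
Editorial sketch for the SLEN thread earlier in this digest (not from
the original posters): the search term Anna ended up with can be built
by a small helper, which keeps the range syntax in one place. The
function name length_range_term and the example organism are
assumptions made for illustration; the [ORGN] and min:max[SLEN] syntax
is taken from her message. The resulting string would be passed as the
term argument to Entrez.esearch(db="nucleotide", ...), which is left out
here so the sketch runs without network access.

```python
def length_range_term(organism, min_len, max_len):
    """Build an Entrez search term restricted to a sequence-length range.

    Hypothetical helper; mirrors the term used in the thread:
    organism + '[ORGN] AND 100:1000[SLEN]'
    """
    if min_len > max_len:
        raise ValueError("min_len must not exceed max_len")
    return "%s[ORGN] AND %d:%d[SLEN]" % (organism, min_len, max_len)


# Example organism chosen for illustration only.
term = length_range_term("Arabidopsis thaliana", 100, 1000)
print(term)
```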