From francesco.chiani at gmail.com Wed Oct 3 10:18:22 2012
From: francesco.chiani at gmail.com (francesco chiani)
Date: Wed, 3 Oct 2012 14:18:22 +0000 (UTC)
Subject: [Biopython] error in parseing Gene bank
Message-ID:

Hi Everyone,

Does anyone have an idea why Biopython for Python 2.7 gives me this error
while parsing a GenBank file?

Traceback (most recent call last):
  in
    for seq_record in SeqIO.parse(handle, "genbank"):
  File "C:\Python27\lib\site-packages\Bio\SeqIO\__init__.py", line 537, in parse
    for r in i:
  File "C:\Python27\lib\site-packages\Bio\GenBank\Scanner.py", line 445, in parse_records
    record = self.parse(handle, do_features)
  File "C:\Python27\lib\site-packages\Bio\GenBank\Scanner.py", line 428, in parse
    if self.feed(handle, consumer, do_features):
  File "C:\Python27\lib\site-packages\Bio\GenBank\Scanner.py", line 410, in feed
    consumer.record_end("//")
  File "C:\Python27\lib\site-packages\Bio\GenBank\__init__.py", line 1184, in record_end
    % self._seq_type)
ValueError: Could not determine alphabet for seq_type dna

P.S. I suppose this doesn't happen in Biopython for Python 2.6, because
with the same GenBank file my script works in that version.

Thanks for your help,
F.

From semenko at alum.mit.edu Wed Oct 3 10:28:20 2012
From: semenko at alum.mit.edu (Nick Semenkovich)
Date: Wed, 3 Oct 2012 09:28:20 -0500
Subject: [Biopython] error in parseing Gene bank
In-Reply-To:
References:
Message-ID:

I've had this problem too, and have considered patching it. I *think*
(without seeing your file) that your seqtype must be in capital letters
(e.g. DNA).

Does that work?

- Nick

On Wed, Oct 3, 2012 at 9:18 AM, francesco chiani wrote:
> Hi Everyone,
> Does anyone have an idea why Biopython for Python 2.7 gives me this error
> while parsing a GenBank file?
>
> Traceback (most recent call last):
>   in
>     for seq_record in SeqIO.parse(handle, "genbank"):
>   File "C:\Python27\lib\site-packages\Bio\SeqIO\__init__.py", line 537, in parse
>     for r in i:
>   File "C:\Python27\lib\site-packages\Bio\GenBank\Scanner.py", line 445, in parse_records
>     record = self.parse(handle, do_features)
>   File "C:\Python27\lib\site-packages\Bio\GenBank\Scanner.py", line 428, in parse
>     if self.feed(handle, consumer, do_features):
>   File "C:\Python27\lib\site-packages\Bio\GenBank\Scanner.py", line 410, in feed
>     consumer.record_end("//")
>   File "C:\Python27\lib\site-packages\Bio\GenBank\__init__.py", line 1184, in record_end
>     % self._seq_type)
> ValueError: Could not determine alphabet for seq_type dna
>
> P.S. I suppose this doesn't happen in Biopython for Python 2.6, because
> with the same GenBank file my script works in that version.
>
> Thanks for your help,
> F.
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

--
Nick Semenkovich
Laboratory of Dr. Jeffrey I. Gordon
Medical Scientist Training Program
School of Medicine
Washington University in St. Louis
314.362.3963 (Lab)
314.374.4434 (Cell)
http://web.mit.edu/semenko/

From p.j.a.cock at googlemail.com Wed Oct 3 10:35:46 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 3 Oct 2012 15:35:46 +0100
Subject: [Biopython] error in parseing Gene bank
In-Reply-To:
References:
Message-ID:

On Wed, Oct 3, 2012 at 3:18 PM, francesco chiani wrote:
> Hi Everyone,
> Does anyone have an idea why Biopython for Python 2.7 gives me this error
> while parsing a GenBank file?
>
> Traceback (most recent call last):
>   in
>     for seq_record in SeqIO.parse(handle, "genbank"):
>   File "C:\Python27\lib\site-packages\Bio\SeqIO\__init__.py", line 537, in parse
>     for r in i:
>   File "C:\Python27\lib\site-packages\Bio\GenBank\Scanner.py", line 445, in parse_records
>     record = self.parse(handle, do_features)
>   File "C:\Python27\lib\site-packages\Bio\GenBank\Scanner.py", line 428, in parse
>     if self.feed(handle, consumer, do_features):
>   File "C:\Python27\lib\site-packages\Bio\GenBank\Scanner.py", line 410, in feed
>     consumer.record_end("//")
>   File "C:\Python27\lib\site-packages\Bio\GenBank\__init__.py", line 1184, in record_end
>     % self._seq_type)
> ValueError: Could not determine alphabet for seq_type dna
>
> P.S. I suppose this doesn't happen in Biopython for Python 2.6, because
> with the same GenBank file my script works in that version.
>
> Thanks for your help,
> F.

If you've switched from Python 2.6 to Python 2.7, it is likely you've
also got a more recent Biopython on the Python 2.7 installation. Could
you check the two Biopython versions?

The problem is most likely an invalid LOCUS line (which should indicate
whether the sequence is DNA, protein, etc.). Could you show us the first
few lines of the GenBank file?

Thanks,

Peter

From p.j.a.cock at googlemail.com Wed Oct 3 10:42:58 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 3 Oct 2012 15:42:58 +0100
Subject: [Biopython] error in parseing Gene bank
In-Reply-To:
References:
Message-ID:

On Wed, Oct 3, 2012 at 3:39 PM, francesco chiani wrote:
> Thanks for the replies.
>
> Here is the gbk file:
>
> LOCUS allele_48659_OTTMUSE00000300743_L1L2_Bact_P 37935 bp
> dna linear UNK
> ACCESSION unknown
> DBSOURCE accession design_id=48659
> COMMENT cassette : L1L2_Bact_P
> COMMENT design_id : 48659
> FEATURES Location/Qualifiers
> ...

As Nick guessed, I think the problem is you have 'dna' (lower case)
in the LOCUS line, rather than 'DNA' (upper case).

Where did this file come from?
e.g. What software tool or database made it?

http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord#MoleculeTypeB

In any case, perhaps Biopython could check for 'dna' as well (as some
tools don't seem to obey this bit of the standard)?

Thanks

Peter

From p.j.a.cock at googlemail.com Wed Oct 3 11:00:10 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 3 Oct 2012 16:00:10 +0100
Subject: [Biopython] error in parseing Gene bank
In-Reply-To:
References:
Message-ID:

On Wed, Oct 3, 2012 at 3:56 PM, francesco chiani wrote:
> Perfect: after replacing "dna" with "DNA" in the gbk file, the script
> works! Fantastic.
>
> The GenBank file is from the IKMC portal
> http://www.i-dcc.org/martsearch/
> I have no idea which software was used to make it, sorry. I just use
> the files.
>

Could you email them, and include the link to the GenBank standard:
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord#MoleculeTypeB

> Does my script have to replace "dna" in every GenBank file, or is there
> a quicker solution I can't see?

Try using the optional alphabet argument to SeqIO.parse, e.g.

    from Bio import SeqIO
    from Bio.Alphabet import generic_dna
    for seq_record in SeqIO.parse(handle, "genbank", alphabet=generic_dna):
        print(seq_record.id)

Regards,

Peter

P.S. Please CC the mailing list in your replies.

From devaniranjan at gmail.com Mon Oct 8 11:29:37 2012
From: devaniranjan at gmail.com (George Devaniranjan)
Date: Mon, 8 Oct 2012 11:29:37 -0400
Subject: [Biopython] not technically a biopython question
Message-ID:

Hi guys,

I am working on some FASTA-like sequences (not FASTA, but something
similar that I have defined for some culled PDB entries from the PISCES
server).

I have a question:

I have a small number of sequences called nCatSeq, for each of which
there are MULTIPLE nBasinSeq. I go through a large PDB file and I want
to extract, for each nCatSeq, the corresponding nBasinSeq without
redundancies in a dictionary. The code snippet that does this is given
below.

    nCatSeq=item[1][n]+item[1][n+1]+item[1][n+2]+item[1][n+3]
    nBasinSeq=item[2][n]+item[2][n+1]+item[2][n+2]+item[2][n+3]

    if nCatSeq not in potBasin:
        potBasin[nCatSeq]=nBasinSeq
    else:
        if nBasinSeq not in potBasin[nCatSeq]:
            potBasin[nCatSeq]=potBasin[nCatSeq],nBasinSeq
        else:
            pass

I get the following as the answer for one nCatSeq:

    '4241': ((('VUVV', 'DDRV'), 'DDVG'), 'VUVV')

What I want, however, is:

    '4241': ('VUVV', 'DDRV', 'DDVG', 'VUVV')

I don't want all the extra brackets, which are due to the following
statement (see the code snippet above):

    potBasin[nCatSeq]=potBasin[nCatSeq],nBasinSeq

Is there a way to do this?

Thank you,
George

From devaniranjan at gmail.com Mon Oct 8 12:06:52 2012
From: devaniranjan at gmail.com (George Devaniranjan)
Date: Mon, 8 Oct 2012 12:06:52 -0400
Subject: [Biopython] not technically a biopython question
In-Reply-To: <5072F914.3010009@inaf.cnrs-gif.fr>
References: <5072F914.3010009@inaf.cnrs-gif.fr>
Message-ID:

Thank you Frederic, but I have tried that and what it gives is:

    VUVVDDRVDDVGVUVV

It basically joins all the tetramers together, and I want them kept
separate.

On Mon, Oct 8, 2012 at 12:02 PM, Frédéric Sohm wrote:
> Hi,
>
>     if nBasinSeq not in potBasin[nCatSeq] :
>         potBasin[nCatSeq] = potBasin[nCatSeq] + (nBasinSeq,)
>
> or shorter
>
>     if nBasinSeq not in potBasin[nCatSeq] :
>         potBasin[nCatSeq] += (nBasinSeq,)
>
> Regards,
>
> Fred
>
> On 08/10/12 17:29, George Devaniranjan wrote:
>
>> Hi guys,
>>
>> I am working on some FASTA-like sequences (not FASTA, but something
>> similar that I have defined for some culled PDB entries from the
>> PISCES server).
>>
>> I have a question:
>>
>> I have a small number of sequences called nCatSeq, for each of which
>> there are MULTIPLE nBasinSeq. I go through a large PDB file and I
>> want to extract, for each nCatSeq, the corresponding nBasinSeq
>> without redundancies in a dictionary. The code snippet that does
>> this is given below.
>>
>>     nCatSeq=item[1][n]+item[1][n+1]+item[1][n+2]+item[1][n+3]
>>     nBasinSeq=item[2][n]+item[2][n+1]+item[2][n+2]+item[2][n+3]
>>
>>     if nCatSeq not in potBasin:
>>         potBasin[nCatSeq]=nBasinSeq
>>     else:
>>         if nBasinSeq not in potBasin[nCatSeq]:
>>             potBasin[nCatSeq]=potBasin[nCatSeq],nBasinSeq
>>         else:
>>             pass
>>
>> I get the following as the answer for one nCatSeq:
>>
>>     '4241': ((('VUVV', 'DDRV'), 'DDVG'), 'VUVV')
>>
>> What I want, however, is:
>>
>>     '4241': ('VUVV', 'DDRV', 'DDVG', 'VUVV')
>>
>> I don't want all the extra brackets, which are due to the following
>> statement (see the code snippet above):
>>
>>     potBasin[nCatSeq]=potBasin[nCatSeq],nBasinSeq
>>
>> Is there a way to do this?
>>
>> Thank you,
>> George
>>
>
> --
> Frédéric Sohm
> GIS AMAGEN CNRS INRA
> Equipe INRA U1126 "Morphogenèse du système nerveux des Chordés"
> UPR 3294 NED, CNRS
> Institut de Neurobiologie A. Fessard
> 1 Avenue de la Terrasse
> 91 198 GIF-SUR-YVETTE
> FRANCE
> Phone: 33 1 69 82 34 12
> Fax: 33 1 69 82 41 67
> email: sohm at inaf.cnrs-gif.fr
>

From blind.watchmaker at yahoo.com Mon Oct 8 15:59:04 2012
From: blind.watchmaker at yahoo.com (John Ladasky)
Date: Mon, 08 Oct 2012 12:59:04 -0700
Subject: [Biopython] not technically a biopython question
In-Reply-To:
References:
Message-ID: <50733088.4020505@yahoo.com>

Date: Mon, 8 Oct 2012 11:29:37 -0400
From: George Devaniranjan
Subject: [Biopython] not technically a biopython question
To: Biopython Mailing List
Message-ID:

> I get the following as the answer for one nCatSeq:
>
>     '4241': ((('VUVV', 'DDRV'), 'DDVG'), 'VUVV')
>
> What I want, however, is:
>
>     '4241': ('VUVV', 'DDRV', 'DDVG', 'VUVV')
>

What you want to do is to "flatten" a nested sequence object. There are
many Python recipes to do this. Most of the solutions involve recursive
function calls.
Here's a page that discusses several ways to get it done:

http://stackoverflow.com/questions/2158395/flatten-an-irregular-list-of-lists-in-python

If your list or tuple is VERY deeply nested (say, 256 parentheses), you
may hit Python's recursion limit. There are solutions on that page which
don't require recursion, but they are frequently more difficult to
understand. I tend to prefer code that I can read at a glance, myself.

I don't know why flattening lists and tuples isn't a standard library
function in Python yet; it seems like everyone needs to do this at some
time or another.

From devaniranjan at gmail.com Mon Oct 8 16:11:06 2012
From: devaniranjan at gmail.com (George Devaniranjan)
Date: Mon, 8 Oct 2012 16:11:06 -0400
Subject: [Biopython] not technically a biopython question
In-Reply-To: <50733088.4020505@yahoo.com>
References: <50733088.4020505@yahoo.com>
Message-ID:

Thank you very much. This is what I did, and it seems to work. I found
the answer on Stack Overflow.

    if nCatSeq not in potBasin:
        potBasin[nCatSeq] = (nBasinSeq,)
    else:
        if nBasinSeq not in potBasin[nCatSeq]:
            potBasin[nCatSeq] = potBasin[nCatSeq] + (nBasinSeq,)

On Mon, Oct 8, 2012 at 3:59 PM, John Ladasky wrote:
> Date: Mon, 8 Oct 2012 11:29:37 -0400
> From: George Devaniranjan <devaniranjan at gmail.com>
> Subject: [Biopython] not technically a biopython question
> To: Biopython Mailing List
> Message-ID: gmail.com
>
>> I get the following as the answer for one nCatSeq:
>>
>>     '4241': ((('VUVV', 'DDRV'), 'DDVG'), 'VUVV')
>>
>> What I want, however, is:
>>
>>     '4241': ('VUVV', 'DDRV', 'DDVG', 'VUVV')
>>
>
> What you want to do is to "flatten" a nested sequence object. There are
> many Python recipes to do this. Most of the solutions involve recursive
> function calls.
> Here's a page that discusses several ways to get it done:
>
> http://stackoverflow.com/questions/2158395/flatten-an-irregular-list-of-lists-in-python
>
> If your list or tuple is VERY deeply nested (say, 256 parentheses), you
> may hit Python's recursion limit. There are solutions on that page which
> don't require recursion, but they are frequently more difficult to
> understand. I tend to prefer code that I can read at a glance, myself.
>
> I don't know why flattening lists and tuples isn't a standard library
> function in Python yet; it seems like everyone needs to do this at some
> time or another.
>

From ivaylo.stoimenov at gmail.com Wed Oct 10 07:00:55 2012
From: ivaylo.stoimenov at gmail.com (Ivaylo Stoimenov)
Date: Wed, 10 Oct 2012 13:00:55 +0200
Subject: [Biopython] Read and Parse EMBOSS primer3-eprimer32
Message-ID:

Hi,

I have a problem using the read and parse functions with EMBOSS Primer3
(or the eprimer32 wrapper). I would like to skip writing files and
instead capture the output of Primer3 in a variable (object). Here is
part of the code, which does not work:

    from Bio.Emboss import Primer3
    from Bio.Emboss.Applications import Primer3Commandline
    import sys
    ...
    cline = Primer3Commandline(sequence=combined_frame, auto=True, task =1)
    cline.explainflag = True
    cline.prange="100-150"
    cline.outfile = "stdout"
    ggg = cline()
    primer_record = Primer3.read(sys.stdout or ggg[0] or ...)
    print primer_record
    ...

What am I doing wrong? The output of cline() is a tuple, and I would
like to read or parse its first element. The problem is that
Primer3.read and Primer3.parse expect file handles. Any advice would be
highly appreciated.

Kind regards,
Ivaylo

From hlapp at drycafe.net Wed Oct 10 16:31:04 2012
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Wed, 10 Oct 2012 16:31:04 -0400
Subject: [Biopython] Fwd: [Announce] Call for Proposals for Doc Sprint Summit v2.0
References:
Message-ID:

If you've been a Google Summer of Code mentor this year or last year,
you will have already seen this. I wanted to make sure everybody is
aware, and this may provide the opportunity for the kind of concerted
effort that could finally get a BioPerl, Biopython, Bioruby, or Biojava
(or a combined??) off the ground.

-hilmar

Begin forwarded message:

From: Carol Smith
Subject: [GSoC Mentors] [Announce] Call for Proposals for Doc Sprint Summit v2.0
Date: October 10, 2012 2:44:50 PM EDT
To: Google Summer of Code Mentors List
Cc: adam at flossmanuals.net

Dear GSoC mentors and org admins,

Google Summer of Code in collaboration with Aspiration and FLOSS Manuals
is hosting a "Doc Sprint Camp" at Google's Mountain View headquarters
(California) Dec 3 - 7, 2012.

The 2012 Doc Camp will feature:

1) An unconference on free software documentation topics - facilitated
   by Aspiration
2) 2-5 Book Sprints to produce books on free software - facilitated by
   FLOSS Manuals

Building on the success of the 2011 GSoC Doc Camp we are proud to bring
you the 2012 GSoC Doc Camp. Like the previous event the 2012 GSoC Doc
Camp is a place for free software communities to meet, create a book for
their project, attract new people to their efforts, and share their
documentation experiences. The camp aims to improve free documentation
materials and skills in free software projects and individuals and help
form the identity of the emergent free documentation sector.

Individuals and projects can apply. Food and accommodation for all
individuals will be provided and travel support (full or partial) can
also be applied for.

Be a part of this exciting event -
propose a Book Sprint on your favorite free software or come and help
others write a book on their favorite project. Guaranteed to be a lot of
fun, productive, and a fantastic place to advance your documentation
efforts and experiences.

For more information or to register to take part, please see
https://sites.google.com/site/docsprintsummitv2/. Please note proposals
are due by October 26, so get yours in ASAP!

Cheers,
Carol Smith, Allen Gunn, Adam Hyde

--
You received this message because you are subscribed to the Google
Groups "Google Summer of Code Mentors List" group.
To post to this group, send email to
google-summer-of-code-mentors-list at googlegroups.com.
To unsubscribe from this group, send email to
google-summer-of-code-mentors-list+unsubscribe at googlegroups.com.
For more options, visit this group at
http://groups.google.com/group/google-summer-of-code-mentors-list?hl=en.

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- informatics.nescent.org :
===========================================================

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================

From nuin at genedrift.org Wed Oct 10 21:55:53 2012
From: nuin at genedrift.org (Paulo Nuin)
Date: Wed, 10 Oct 2012 21:55:53 -0400
Subject: [Biopython] Affy CEL files
Message-ID:

Hi

I found some old discussions on parsing CEL files generated from
Affymetrix microarrays (and such). What is the current status of this in
BioPython? I was able to find some classes but there is not a lot of
documentation about them.

Thanks in advance

Paulo

From p.j.a.cock at googlemail.com Thu Oct 11 07:05:51 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 11 Oct 2012 12:05:51 +0100
Subject: [Biopython] Read and Parse EMBOSS primer3-eprimer32
In-Reply-To:
References:
Message-ID:

Hello Ivaylo,

On Wed, Oct 10, 2012 at 12:00 PM, Ivaylo Stoimenov wrote:
> Hi,
>
> I have a problem using the read and parse functions with EMBOSS Primer3
> (or the eprimer32 wrapper). I would like to skip writing files and
> instead capture the output of Primer3 in a variable (object). Here is
> part of the code, which does not work:
>
>     from Bio.Emboss import Primer3
>     from Bio.Emboss.Applications import Primer3Commandline
>     import sys
>     ...
>     cline = Primer3Commandline(sequence=combined_frame, auto=True, task =1)
>     cline.explainflag = True
>     cline.prange="100-150"
>     cline.outfile = "stdout"

I think that will just write to a file called stdout (unless
Primer3 does something special). I think you should use

    cline.stdout = True

instead, which will add the switch -stdout to the EMBOSS tool's
command line.

Once you have told EMBOSS to write the output to stdout
instead of to a file, you must get Python to parse this. Either
use subprocess and the child process's stdout handle (see
examples of this in the Biopython Tutorial), or, if you capture
stdout as a string, turn the string into a pretend handle using
StringIO, e.g.

    cline = Primer3Commandline(...)
    stdout, stderr = cline()  # Runs it, captures output as strings
    from StringIO import StringIO
    handle = StringIO(stdout)  # Turns string into pretend handle

Peter

From mjldehoon at yahoo.com Sat Oct 13 07:41:44 2012
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sat, 13 Oct 2012 04:41:44 -0700 (PDT)
Subject: [Biopython] Affy CEL files
In-Reply-To:
Message-ID: <1350128504.35696.YahooMailClassic@web164002.mail.gq1.yahoo.com>

As far as I know nobody is actively working on this. I would really like
to see better support for Affy and other microarrays.
I have been thinking of implementing something myself but I haven't
found the time to do so. Would you (or anybody else) be willing to
contribute some code or documentation for Affy microarrays in Biopython?

Thanks,
-Michiel.

--- On Wed, 10/10/12, Paulo Nuin wrote:

> From: Paulo Nuin
> Subject: [Biopython] Affy CEL files
> To: "BioPython Mailing List"
> Date: Wednesday, October 10, 2012, 9:55 PM
> Hi
>
> I found some old discussions on parsing CEL files generated
> from Affymetrix microarrays (and such). What is the current
> status of this in BioPython? I was able to find some classes
> but there is not a lot of documentation about them.
>
> Thanks in advance
>
> Paulo
>

From nuin at genedrift.org Tue Oct 16 20:22:00 2012
From: nuin at genedrift.org (Paulo Nuin)
Date: Tue, 16 Oct 2012 20:22:00 -0400
Subject: [Biopython] Affy CEL files
In-Reply-To: <1350128504.35696.YahooMailClassic@web164002.mail.gq1.yahoo.com>
References: <1350128504.35696.YahooMailClassic@web164002.mail.gq1.yahoo.com>
Message-ID: <950ED12B-784E-4DEC-A121-D27CA3AAE46F@genedrift.org>

Hi

That's an idea. I will see what I can do on my end. We are analysing a
lot of microarrays and it would be helpful to have something in Python
instead of R only.

Cheers
Paulo

On 2012-10-13, at 7:41 AM, Michiel de Hoon wrote:

> As far as I know nobody is actively working on this. I would really
> like to see better support for Affy and other microarrays. I have been
> thinking of implementing something myself but I haven't found the time
> to do so. Would you (or anybody else) be willing to contribute some
> code or documentation for Affy microarrays in Biopython?
>
> Thanks,
> -Michiel.
>
> --- On Wed, 10/10/12, Paulo Nuin wrote:
>
>> From: Paulo Nuin
>> Subject: [Biopython] Affy CEL files
>> To: "BioPython Mailing List"
>> Date: Wednesday, October 10, 2012, 9:55 PM
>> Hi
>>
>> I found some old discussions on parsing CEL files generated
>> from Affymetrix microarrays (and such). What is the current
>> status of this in BioPython? I was able to find some classes
>> but there is not a lot of documentation about them.
>>
>> Thanks in advance
>>
>> Paulo
>>

From p.j.a.cock at googlemail.com Thu Oct 18 14:33:04 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 18 Oct 2012 19:33:04 +0100
Subject: [Biopython] PyPy 1.8 support?
Message-ID:

Hello all,

We currently run the test suite against both PyPy 1.8 and
1.9 on Linux via the TravisCI.org continuous integration
testing service.

Is anyone actually using Biopython under PyPy 1.8?

If not, I intend to drop automated testing under PyPy 1.8
and focus just on PyPy 1.9 instead.

(Automated testing under C Python 2.5, 2.6, 2.7, 3.1 and
3.2 etc will continue - I'm hoping to add Python 3.3 as well)

Thanks,

Peter

From p.j.a.cock at googlemail.com Fri Oct 19 03:52:19 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 19 Oct 2012 08:52:19 +0100
Subject: [Biopython] PyPy 1.8 support?
In-Reply-To:
References:
Message-ID:

On Thu, Oct 18, 2012 at 7:33 PM, Peter Cock wrote:
> Hello all,
>
> We currently run the test suite against both PyPy 1.8 and
> 1.9 on Linux via the TravisCI.org continuous integration
> testing service.
>
> Is anyone actually using Biopython under PyPy 1.8?
>
> If not, I intend to drop automated testing under PyPy 1.8
> and focus just on PyPy 1.9 instead.

Done on TravisCI, but easy to revert:
https://github.com/biopython/biopython/commit/126c944812730df4677c8fa2f63abc29ddd084bb

One reason was that the previous build failed due to a timeout
fetching PyPy for a custom install. Now we use the PyPy provided
by TravisCI, which should avoid that issue. (It still happens
for Jython sometimes.)

Peter

From p.j.a.cock at googlemail.com Mon Oct 22 13:17:34 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 22 Oct 2012 18:17:34 +0100
Subject: [Biopython] Dropping Python 2.5 and Jython 2.5 support?
Message-ID:

Dear Biopythoneers,

Would anyone object to us preparing to drop support for Python 2.5
and Jython 2.5, perhaps after the next Biopython release?

To reassure those of you using Jython, we'd wait until Jython 2.7
is out first. Jython 2.7 is already in alpha, and brings support
for C Python 2.7 language features.

Thanks,

Peter

From csaba.kiss at lanl.gov Tue Oct 23 12:01:17 2012
From: csaba.kiss at lanl.gov (Kiss, Csaba)
Date: Tue, 23 Oct 2012 16:01:17 +0000
Subject: [Biopython] sff into fasta and qual -> trim
Message-ID: <8C93404AC678DC44905F571FD327A6CC1E93341A@ECS-EXG-P-MB03.win.lanl.gov>

I am new to bio-python. I am trying to replace mothur with BioPython.
I hope that biopython is faster than mothur. All I want to do is this:

    sffinfo(sff=sd11.fasta)
    trim.seqs(fasta=sd11.fasta, qfile=sd11.qual, minlength = 50, maxhomop=8, qwindowsize=50, qwindowaverage =22)

Can someone help me to translate the two mothur statements above to
biopython, please?
It would be greatly appreciated.
thanks

--
Best Regards:
Csaba Kiss PhD, MSc, BSc
TA-43, HRL-1, MS888
Los Alamos National Laboratory
Work: 1-505-667-9898
Cell: 1-505-920-5774

From csaba.kiss at lanl.gov Tue Oct 23 12:04:21 2012
From: csaba.kiss at lanl.gov (Kiss, Csaba)
Date: Tue, 23 Oct 2012 16:04:21 +0000
Subject: [Biopython] sff inot fasta and qual then trim
Message-ID: <8C93404AC678DC44905F571FD327A6CC1E93343F@ECS-EXG-P-MB03.win.lanl.gov>

I am new to bio-python.
I am trying to replace mothur with BioPython.
I hope that biopython is faster than mothur. All I want to do is this:

    sffinfo(sff=sd11.fasta)
    trim.seqs(fasta=sd11.fasta, qfile=sd11.qual, minlength = 50, maxhomop=8, qwindowsize=50, qwindowaverage =22)

Can someone help me to translate the two mothur statements above to
biopython, please?
It would be greatly appreciated.
thanks

--
Best Regards:
Csaba Kiss PhD, MSc, BSc
TA-43, HRL-1, MS888
Los Alamos National Laboratory
Work: 1-505-667-9898
Cell: 1-505-920-5774

From p.j.a.cock at googlemail.com Tue Oct 23 12:14:00 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 23 Oct 2012 17:14:00 +0100
Subject: [Biopython] sff inot fasta and qual then trim
In-Reply-To: <8C93404AC678DC44905F571FD327A6CC1E93343F@ECS-EXG-P-MB03.win.lanl.gov>
References: <8C93404AC678DC44905F571FD327A6CC1E93343F@ECS-EXG-P-MB03.win.lanl.gov>
Message-ID:

On Tue, Oct 23, 2012 at 5:04 PM, Kiss, Csaba wrote:
> I am new to bio-python. I am trying to replace mothur with BioPython.
> I hope that biopython is faster than mothur. All I want to do is this:
>
>     sffinfo(sff=sd11.fasta)
>     trim.seqs(fasta=sd11.fasta, qfile=sd11.qual, minlength = 50, maxhomop=8, qwindowsize=50, qwindowaverage =22)
>
> Can someone help me to translate the two mothur statements
> above to biopython, please?
> It would be greatly appreciated.
> thanks

I don't know enough about mothur to give you an informed answer.

I would guess the first line is just SFF to FASTA and QUAL,
based partly on the title of your email.
That at least is trivial in Biopython:

    from Bio import SeqIO
    SeqIO.convert("example.sff", "sff", "example.fasta", "fasta")
    SeqIO.convert("example.sff", "sff", "example.qual", "qual")

Or, if you want the trimming in the SFF file applied, which is
generally sensible:

    from Bio import SeqIO
    SeqIO.convert("example.sff", "sff-trim", "example.fasta", "fasta")
    SeqIO.convert("example.sff", "sff-trim", "example.qual", "qual")

Personally I prefer to work with a single FASTQ file rather than
a paired FASTA+QUAL (it is smaller on disc for one thing), so
maybe:

    from Bio import SeqIO
    SeqIO.convert("example.sff", "sff-trim", "example.fastq", "fastq")

Regards,

Peter

From p.j.a.cock at googlemail.com Tue Oct 23 12:16:30 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 23 Oct 2012 17:16:30 +0100
Subject: [Biopython] sff into fasta and qual -> trim
In-Reply-To: <8C93404AC678DC44905F571FD327A6CC1E93341A@ECS-EXG-P-MB03.win.lanl.gov>
References: <8C93404AC678DC44905F571FD327A6CC1E93341A@ECS-EXG-P-MB03.win.lanl.gov>
Message-ID:

On Tue, Oct 23, 2012 at 5:01 PM, Kiss, Csaba wrote:
> I am new to bio-python. I am trying to replace mothur with BioPython.
> I hope that biopython is faster than mothur. All I want to do is this:
>
>     sffinfo(sff=sd11.fasta)
>     trim.seqs(fasta=sd11.fasta, qfile=sd11.qual, minlength = 50, maxhomop=8, qwindowsize=50, qwindowaverage =22)
>
> Can someone help me to translate the two mothur statements above to
> biopython, please?
> It would be greatly appreciated.
> thanks

I saw your other email first, and replied to that:
http://lists.open-bio.org/pipermail/biopython/2012-October/008217.html

(On the bright side, it looks like your subscription to the
mailing list has worked this time :) - welcome!)

Peter

From p.j.a.cock at googlemail.com Tue Oct 23 12:37:27 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 23 Oct 2012 17:37:27 +0100
Subject: [Biopython] sff into fasta and qual -> trim
In-Reply-To: <8C93404AC678DC44905F571FD327A6CC1E93347D@ECS-EXG-P-MB03.win.lanl.gov>
References: <8C93404AC678DC44905F571FD327A6CC1E93341A@ECS-EXG-P-MB03.win.lanl.gov> <8C93404AC678DC44905F571FD327A6CC1E93347D@ECS-EXG-P-MB03.win.lanl.gov>
Message-ID:

On Tue, Oct 23, 2012 at 5:22 PM, Kiss, Csaba wrote:
> Thanks, Peter.
> I understand the fastq sequence extraction. That's very neat. However,
> I am not sure how to do quality trimming of the sequences.
> In mothur, we tested that qwindowsize=50, qwindowaverage=22 is a very
> nice way to get high quality sequences out.
> I assume it works in a way that a 50 bp sliding window checks the
> average quality, and if it's below a certain number (i.e. 22) then it
> rejects the sequence; if it's above, it keeps it.
> Is there something similar in biopython?
>
> C

Hi Csaba,

No, there isn't a 'ready to use' sliding window read cleaning
tool/function in Biopython, although you could write one using
Biopython if you wished, with the advantage that you can
implement exactly what you need.

There are many (dozens?) of dedicated tools for this kind of
thing which might be simpler or more appropriate. Have a
browse here: http://seqanswers.com/wiki/Software/list

Regards,

Peter

P.S. Please CC the mailing list in your replies.
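
[Illustration, not part of the original message.] The sliding-window
check described in the exchange above could be sketched roughly as
follows. This is a minimal sketch under stated assumptions: it follows
the "reject the whole read if any window average is too low" reading
given in the quoted question, which may not match what mothur's
qwindowsize/qwindowaverage options actually do (mothur may trim rather
than reject). The function name passes_window_filter and its parameters
are made up for this example; nothing like it ships with Biopython.

```python
def passes_window_filter(qualities, window_size=50, min_average=22):
    """Return True if every full window of PHRED scores averages >= min_average.

    Hypothetical helper for illustration; not a Biopython or mothur function.
    Reads no longer than one window are judged on their overall mean,
    and empty reads are rejected.
    """
    n = len(qualities)
    if n == 0:
        return False
    if n <= window_size:
        return sum(qualities) / float(n) >= min_average

    # Compare window sums against a scaled threshold to avoid repeated division.
    threshold = min_average * window_size
    window_sum = sum(qualities[:window_size])
    if window_sum < threshold:
        return False

    # Slide the window one base at a time, updating the running sum in O(1).
    for i in range(window_size, n):
        window_sum += qualities[i] - qualities[i - window_size]
        if window_sum < threshold:
            return False
    return True


# Possible use with Biopython (needs Biopython installed and a real input
# file; the file names here are placeholders):
#
#     from Bio import SeqIO
#     good = (rec for rec in SeqIO.parse("example.fastq", "fastq")
#             if passes_window_filter(rec.letter_annotations["phred_quality"]))
#     SeqIO.write(good, "filtered.fastq", "fastq")
```

The filter itself is plain Python working on a list of integers, so it
can be tried without any sequence files; only the commented usage lines
depend on Biopython's per-letter "phred_quality" annotations.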

From cfriedline at vcu.edu Tue Oct 23 12:39:07 2012
From: cfriedline at vcu.edu (Chris Friedline)
Date: Tue, 23 Oct 2012 12:39:07 -0400
Subject: [Biopython] sff inot fasta and qual then trim
In-Reply-To: <8C93404AC678DC44905F571FD327A6CC1E93343F@ECS-EXG-P-MB03.win.lanl.gov>
References: <8C93404AC678DC44905F571FD327A6CC1E93343F@ECS-EXG-P-MB03.win.lanl.gov>
Message-ID: <8D42AA3D-54D8-4639-8A98-2FE28FD58BD3@vcu.edu>

Are you trying to replace an entire analysis pipeline, which mothur
provides, or simply take control of the read trimming routines? Mothur
has been excellent for us (though I do supplement with my own code
frequently), and I have a hard time believing that BioPython (or Python,
in general) would be faster for these types of things. If you are
married to Python, you may want to join in with the QIIME people, though
they back their stuff with PyCogent rather than BioPython. Both are
excellent packages for automating some parts of the analysis in
microbial community studies. We can leave the philosophy of pipelining
scientific research for another thread. ;-)

I wonder if the effort of reimplementing common trimming/filtering tasks
is worth your time, given the current maturity of both mothur and QIIME.

On Oct 23, 2012, at 12:04 PM, "Kiss, Csaba" wrote:

> I am new to bio-python. I am trying to replace mothur with BioPython.
> I hope that biopython is faster than mothur. All I want to do is this:
>
>     sffinfo(sff=sd11.fasta)
>     trim.seqs(fasta=sd11.fasta, qfile=sd11.qual, minlength = 50, maxhomop=8, qwindowsize=50, qwindowaverage =22)
>
> Can someone help me to translate the two mothur statements above to
> biopython, please?
> It would be greatly appreciated.
> thanks > > > -- > Best Regards: > Csaba Kiss PhD, MSc, BSc > TA-43, HRL-1, MS888 > Los Alamos National Laboratory > Work: 1-505-667-9898 > Cell: 1-505-920-5774 > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython PhD Candidate, Integrative Life Sciences Virginia Commonwealth University Richmond, VA From csaba.kiss at lanl.gov Tue Oct 23 12:39:11 2012 From: csaba.kiss at lanl.gov (Kiss, Csaba) Date: Tue, 23 Oct 2012 16:39:11 +0000 Subject: [Biopython] sff into fasta and qual -> trim In-Reply-To: References: <8C93404AC678DC44905F571FD327A6CC1E93341A@ECS-EXG-P-MB03.win.lanl.gov> <8C93404AC678DC44905F571FD327A6CC1E93347D@ECS-EXG-P-MB03.win.lanl.gov> Message-ID: <8C93404AC678DC44905F571FD327A6CC1E9334A1@ECS-EXG-P-MB03.win.lanl.gov> Thanks Peter for the info. I will look at the Software list. Csaba -----Original Message----- From: Peter Cock [mailto:p.j.a.cock at googlemail.com] Sent: Tuesday, October 23, 2012 10:37 AM To: Kiss, Csaba Cc: Biopython Mailing List Subject: Re: [Biopython] sff into fasta and qual -> trim On Tue, Oct 23, 2012 at 5:22 PM, Kiss, Csaba wrote: > Thanks, Peter. > I understand the fastq sequence extraction. That's very neat. However, I am not sure how to do quality trimming of the sequences. > In mothur, we tested that qwindowsize=50, qwindowaverage=22 is a very nice way to get high quality sequences out. > I assume it works in a way that a 50 bp sliding window checks the average quality and if it's below a certain number (i.e. 22) then it rejects the sequence if it's above it keeps it. > Is there something similar in biopython. > > C Hi Csaba, No, there isn't a 'ready to use' sliding window read cleaning tool/function in Biopython, although you could write one using Biopython is you wished, with the advantage that you can implement exactly what you need. There are many (dozens?) 
of dedicated tools for this kind of thing which might be simpler or more appropriate. Have a browse here: http://seqanswers.com/wiki/Software/list Regards, Peter P.S. Please CC the mailing list in your replies. From csaba.kiss at lanl.gov Tue Oct 23 12:47:23 2012 From: csaba.kiss at lanl.gov (Kiss, Csaba) Date: Tue, 23 Oct 2012 16:47:23 +0000 Subject: [Biopython] sff inot fasta and qual then trim In-Reply-To: <8D42AA3D-54D8-4639-8A98-2FE28FD58BD3@vcu.edu> References: <8C93404AC678DC44905F571FD327A6CC1E93343F@ECS-EXG-P-MB03.win.lanl.gov> <8D42AA3D-54D8-4639-8A98-2FE28FD58BD3@vcu.edu> Message-ID: <8C93404AC678DC44905F571FD327A6CC1E9334BC@ECS-EXG-P-MB03.win.lanl.gov> Hi Christopher! I am writing a python script to analyze antibody sequences. I have been using mothur to convert the sff files to fasta and then trim the sequences for quality. For the end-users' sake, it would be easier if all they needed to install was python and can go around mothur. I have been happy with mothur until now when I tried to use it in my desktop computer and it took 3 hours to convert 3 million read from sff to fasta. I hoped that pure python would be faster. I will look at Pycogent and QIIME. Thanks Csaba -----Original Message----- From: Christopher Friedline [mailto:cfriedline at mymail.vcu.edu] On Behalf Of Chris Friedline Sent: Tuesday, October 23, 2012 10:39 AM To: Kiss, Csaba Cc: biopython at lists.open-bio.org Subject: Re: [Biopython] sff inot fasta and qual then trim Are you trying to replace an entire analysis pipeline, which mothur provides, or simply take control of the read trimming routines? Mothur has been excellent for us (though I do supplement with my own code frequently), and I have a hard time believing that BioPython (or Python, in general) would be faster for these types of things. If you are married to Python, you may want to join in with the QIIME people, though they back their stuff with PyCogent rather than BioPython. 
Both are excellent packages for automating some parts of the analysis in microbial community studies. We can leave the philosophy of pipelining scientific research for another thread. ;-) I wonder if the reimplementation effort of common trimming/filtering tasks are worth your time, given the current maturity of both mothur and QIIME. On Oct 23, 2012, at 12:04 PM, "Kiss, Csaba" wrote: > I am new to bio-python. I am trying to replace mothur with BioPython. > I hope that biopython is faster than mothur. All I want to do is this: > > sffinfo(sff=sd11.fasta) > trim.seqs(fasta=sd11.fasta, qfile=sd11.qual, minlength = 50, > maxhomop=8, qwindowsize=50, qwindowaverage =22) > > Can someone help me to translate the two mothur statements above to biopython, please? > It would be greatly appreciated. > thanks > > > -- > Best Regards: > Csaba Kiss PhD, MSc, BSc > TA-43, HRL-1, MS888 > Los Alamos National Laboratory > Work: 1-505-667-9898 > Cell: 1-505-920-5774 > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython PhD Candidate, Integrative Life Sciences Virginia Commonwealth University Richmond, VA From cfriedline at vcu.edu Tue Oct 23 12:58:31 2012 From: cfriedline at vcu.edu (Chris Friedline) Date: Tue, 23 Oct 2012 12:58:31 -0400 Subject: [Biopython] sff inot fasta and qual then trim In-Reply-To: <8C93404AC678DC44905F571FD327A6CC1E9334BC@ECS-EXG-P-MB03.win.lanl.gov> References: <8C93404AC678DC44905F571FD327A6CC1E93343F@ECS-EXG-P-MB03.win.lanl.gov> <8D42AA3D-54D8-4639-8A98-2FE28FD58BD3@vcu.edu> <8C93404AC678DC44905F571FD327A6CC1E9334BC@ECS-EXG-P-MB03.win.lanl.gov> Message-ID: <8EB2122B-C468-4DA8-BBCA-15F304E9F8E0@vcu.edu> Csaba, As Peter said, there are many packages which will convert sff to fastq/fasta. I wonder if you're running into disk performance issues, rather than algorithm ones, though. 
Using the BioPython SeqIO convert should tell you that much, though that does seem slow (at least for the systems that I work on). Chris On Oct 23, 2012, at 12:47 PM, "Kiss, Csaba" wrote: > Hi Christopher! > I am writing a python script to analyze antibody sequences. I have been using mothur to convert the sff files to fasta and then trim the sequences for quality. > For the end-users' sake, it would be easier if all they needed to install was python and can go around mothur. I have been happy with mothur until now when I tried to use it in my desktop computer and it took 3 hours to convert 3 million read from sff to fasta. I hoped that pure python would be faster. > I will look at Pycogent and QIIME. > Thanks > Csaba > > -----Original Message----- > From: Christopher Friedline [mailto:cfriedline at mymail.vcu.edu] On Behalf Of Chris Friedline > Sent: Tuesday, October 23, 2012 10:39 AM > To: Kiss, Csaba > Cc: biopython at lists.open-bio.org > Subject: Re: [Biopython] sff inot fasta and qual then trim > > Are you trying to replace an entire analysis pipeline, which mothur provides, or simply take control of the read trimming routines? Mothur has been excellent for us (though I do supplement with my own code frequently), and I have a hard time believing that BioPython (or Python, in general) would be faster for these types of things. If you are married to Python, you may want to join in with the QIIME people, though they back their stuff with PyCogent rather than BioPython. Both are excellent packages for automating some parts of the analysis in microbial community studies. We can leave the philosophy of pipelining scientific research for another thread. ;-) > > I wonder if the reimplementation effort of common trimming/filtering tasks are worth your time, given the current maturity of both mothur and QIIME. > > On Oct 23, 2012, at 12:04 PM, "Kiss, Csaba" wrote: > >> I am new to bio-python. I am trying to replace mothur with BioPython. 
>> I hope that biopython is faster than mothur. All I want to do is this: >> >> sffinfo(sff=sd11.fasta) >> trim.seqs(fasta=sd11.fasta, qfile=sd11.qual, minlength = 50, >> maxhomop=8, qwindowsize=50, qwindowaverage =22) >> >> Can someone help me to translate the two mothur statements above to biopython, please? >> It would be greatly appreciated. >> thanks >> >> >> -- >> Best Regards: >> Csaba Kiss PhD, MSc, BSc >> TA-43, HRL-1, MS888 >> Los Alamos National Laboratory >> Work: 1-505-667-9898 >> Cell: 1-505-920-5774 >> >> >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > PhD Candidate, Integrative Life Sciences Virginia Commonwealth University Richmond, VA > From p.j.a.cock at googlemail.com Tue Oct 23 13:06:26 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 23 Oct 2012 18:06:26 +0100 Subject: [Biopython] sff inot fasta and qual then trim In-Reply-To: <8EB2122B-C468-4DA8-BBCA-15F304E9F8E0@vcu.edu> References: <8C93404AC678DC44905F571FD327A6CC1E93343F@ECS-EXG-P-MB03.win.lanl.gov> <8D42AA3D-54D8-4639-8A98-2FE28FD58BD3@vcu.edu> <8C93404AC678DC44905F571FD327A6CC1E9334BC@ECS-EXG-P-MB03.win.lanl.gov> <8EB2122B-C468-4DA8-BBCA-15F304E9F8E0@vcu.edu> Message-ID: On Tue, Oct 23, 2012 at 5:58 PM, Chris Friedline wrote: > Csaba, > > As Peter said, there are many packages which will convert sff to fastq/fasta. Actually I meant trimming packages, although there are several SFF converters as well e.g. Biopython, sff_extract, BioHaskell/Flower, and Roche's tools. > I wonder if you're running into disk performance issues, rather than algorithm > ones, though. Using the BioPython SeqIO convert should tell you that much, > though that does seem slow (at least for the systems that I work on). Could be - disk IO will be a factor, but I suspect the quality trimming to be the slow part rather than the format conversion. 
Peter From csaba.kiss at lanl.gov Tue Oct 23 13:13:32 2012 From: csaba.kiss at lanl.gov (Kiss, Csaba) Date: Tue, 23 Oct 2012 17:13:32 +0000 Subject: [Biopython] sff inot fasta and qual then trim In-Reply-To: References: <8C93404AC678DC44905F571FD327A6CC1E93343F@ECS-EXG-P-MB03.win.lanl.gov> <8D42AA3D-54D8-4639-8A98-2FE28FD58BD3@vcu.edu> <8C93404AC678DC44905F571FD327A6CC1E9334BC@ECS-EXG-P-MB03.win.lanl.gov> <8EB2122B-C468-4DA8-BBCA-15F304E9F8E0@vcu.edu> Message-ID: <8C93404AC678DC44905F571FD327A6CC1E934500@ECS-EXG-P-MB03.win.lanl.gov> >Could be - disk IO will be a factor, but I suspect the quality trimming to be the slow part rather than the format conversion. I don't think it's IO or the trimming. Mothur seems to take forever to do the sffinfo process on windows. Getting the 3 million sequences out was 3 hours. The trimming took 10 minutes. The rest of the python code to fish out my sequences 1 minute. You see now , why I would like to make it more efficient. Csaba From p.j.a.cock at googlemail.com Tue Oct 23 15:45:19 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 23 Oct 2012 20:45:19 +0100 Subject: [Biopython] sff inot fasta and qual then trim In-Reply-To: <8C93404AC678DC44905F571FD327A6CC1E934500@ECS-EXG-P-MB03.win.lanl.gov> References: <8C93404AC678DC44905F571FD327A6CC1E93343F@ECS-EXG-P-MB03.win.lanl.gov> <8D42AA3D-54D8-4639-8A98-2FE28FD58BD3@vcu.edu> <8C93404AC678DC44905F571FD327A6CC1E9334BC@ECS-EXG-P-MB03.win.lanl.gov> <8EB2122B-C468-4DA8-BBCA-15F304E9F8E0@vcu.edu> <8C93404AC678DC44905F571FD327A6CC1E934500@ECS-EXG-P-MB03.win.lanl.gov> Message-ID: On Tue, Oct 23, 2012 at 6:13 PM, Kiss, Csaba wrote: >>Could be - disk IO will be a factor, but I suspect the quality trimming to be the slow part rather than the format conversion. > > I don't think it's IO or the trimming. Mothur seems to take forever to do the sffinfo process on windows. > Getting the 3 million sequences out was 3 hours. 
That sounds a bit slow, can you compare this to the Biopython
SFF conversion time (or any of the other tools)?

> The trimming took 10 minutes.
> The rest of the python code to fish out my sequences 1 minute.
>
> You see now , why I would like to make it more efficient.
>
> Csaba

Is it possible to fish out your sequences and then do the
trimming? If possible that sounds like it would be more efficient.

Peter

From csaba.kiss at lanl.gov Tue Oct 23 16:32:11 2012
From: csaba.kiss at lanl.gov (Kiss, Csaba)
Date: Tue, 23 Oct 2012 20:32:11 +0000
Subject: [Biopython] sff inot fasta and qual then trim
In-Reply-To:
References: <8C93404AC678DC44905F571FD327A6CC1E93343F@ECS-EXG-P-MB03.win.lanl.gov> <8D42AA3D-54D8-4639-8A98-2FE28FD58BD3@vcu.edu> <8C93404AC678DC44905F571FD327A6CC1E9334BC@ECS-EXG-P-MB03.win.lanl.gov> <8EB2122B-C468-4DA8-BBCA-15F304E9F8E0@vcu.edu> <8C93404AC678DC44905F571FD327A6CC1E934500@ECS-EXG-P-MB03.win.lanl.gov>
Message-ID: <8C93404AC678DC44905F571FD327A6CC1E934574@ECS-EXG-P-MB03.win.lanl.gov>

>That sounds a bit slow, can you compare this to the Biopython SFF conversion time (or any of the other tools)?

I used

SeqIO.convert("sd6.sff", "sff-trim", "sd6_p.fasta", "fasta")
SeqIO.convert("sd6.sff", "sff-trim", "sd6_p.qual", "qual")

and it finished in 8 minutes. That's much better than 3 hours.
The problem is that if I use the mothur fasta/qual files and the python fasta/qual files and trim the sequences exactly the same way in mothur, I get a slightly different trimmed sequence dataset. I am investigating further whether that matters.

Thanks for your help, it is much appreciated.

Csaba

From csaba.kiss at lanl.gov Wed Oct 24 11:49:59 2012
From: csaba.kiss at lanl.gov (Kiss, Csaba)
Date: Wed, 24 Oct 2012 15:49:59 +0000
Subject: [Biopython] still more questions about NGS sequenbce trimming
Message-ID: <8C93404AC678DC44905F571FD327A6CC1E93464E@ECS-EXG-P-MB03.win.lanl.gov>

Hi All!
Thanks for all your help to extract DNA sequences from sff files.
Using biopython I managed to improve the sequence extraction from 3 hours to 10 minutes. Now that I am hooked, I would like to replace mothur with some simple python functions. Is there any function in biopython that would look for homopolymers on DNA sequences. Particularly I am looking to reject a sequence if it has more than 8 bp of stretches of any single nucleotide. Another function I am looking for is a sliding window function along the quality file. I could either use the fastq file or the fasta/qual file pair. I could write these functions myself but if they are available, then it would make my life easier. Thanks Csaba From nje5 at georgetown.edu Wed Oct 24 12:07:16 2012 From: nje5 at georgetown.edu (Nathan Edwards) Date: Wed, 24 Oct 2012 12:07:16 -0400 Subject: [Biopython] Dropping Python 2.5 and Jython 2.5 support? In-Reply-To: References: Message-ID: <50881234.8050007@georgetown.edu> On 10/22/2012 1:17 PM, Peter Cock wrote: > Dear Biopythoneers, > > Would anyone object to us preparing to drop support for Python 2.5 and > Jython 2.5, perhaps after the next Biopython release? I'm still in the dark ages, but I need the push to upgrade my infrastructure. I'm just reluctant to rebuilt all of my third-party libraries. Are there specific parts of the code known to be (or soon to be) problematic for Python 2.5? Thanks, - n -- Dr. Nathan Edwards nje5 at georgetown.edu Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center Room 1215, Harris Building Room 347, Basic Science 3300 Whitehaven St, NW 3900 Reservoir Road, NW Washington DC 20007 Washington DC 20007 Phone: 202-687-7042 Phone: 202-687-1618 Fax: 202-687-0057 Fax: 202-687-7186 From p.j.a.cock at googlemail.com Wed Oct 24 12:33:42 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 24 Oct 2012 17:33:42 +0100 Subject: [Biopython] Dropping Python 2.5 and Jython 2.5 support? 
In-Reply-To: <50881234.8050007@georgetown.edu> References: <50881234.8050007@georgetown.edu> Message-ID: On Wed, Oct 24, 2012 at 5:07 PM, Nathan Edwards wrote: > On 10/22/2012 1:17 PM, Peter Cock wrote: >> Dear Biopythoneers, >> >> Would anyone object to us preparing to drop support for Python 2.5 and >> Jython 2.5, perhaps after the next Biopython release? > > I'm still in the dark ages, but I need the push to upgrade my > infrastructure. I'm just reluctant to rebuilt all of my third-party > libraries. > > Are there specific parts of the code known to be (or soon to be) > problematic for Python 2.5? > > Thanks, I can't point at any one killer feature here: http://docs.python.org/whatsnew/2.6.html There are assorted little things now where we have had to add Python 2.5 specific code or fallbacks (e.g. OrderedDict). The main other benefit is 2.6 adds a number of new features from Python 3, which should make supporting Python 2 and 3 a little easier (e.g. byte literals). Peter From nje5 at georgetown.edu Wed Oct 24 12:56:23 2012 From: nje5 at georgetown.edu (Nathan Edwards) Date: Wed, 24 Oct 2012 12:56:23 -0400 Subject: [Biopython] Dropping Python 2.5 and Jython 2.5 support? In-Reply-To: References: <50881234.8050007@georgetown.edu> Message-ID: <50881DB7.7070602@georgetown.edu> On 10/24/2012 12:33 PM, Peter Cock wrote: > On Wed, Oct 24, 2012 at 5:07 PM, Nathan Edwards wrote: >> On 10/22/2012 1:17 PM, Peter Cock wrote: >>> Dear Biopythoneers, >>> >>> Would anyone object to us preparing to drop support for Python 2.5 and >>> Jython 2.5, perhaps after the next Biopython release? > > I can't point at any one killer feature here: > http://docs.python.org/whatsnew/2.6.html > > There are assorted little things now where we have had to > add Python 2.5 specific code or fallbacks (e.g. OrderedDict). > > The main other benefit is 2.6 adds a number of new features > from Python 3, which should make supporting Python 2 and 3 > a little easier (e.g. byte literals). 
I'm not opposed. I can always not upgrade BioPython until I'm ready.

- n

--
Dr. Nathan Edwards                      nje5 at georgetown.edu
Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center
Room 1215, Harris Building              Room 347, Basic Science
3300 Whitehaven St, NW                  3900 Reservoir Road, NW
Washington DC 20007                     Washington DC 20007
Phone: 202-687-7042                     Phone: 202-687-1618
Fax: 202-687-0057                       Fax: 202-687-7186

From s.schmeier at gmail.com Wed Oct 24 13:12:46 2012
From: s.schmeier at gmail.com (Sebastian Schmeier)
Date: Wed, 24 Oct 2012 19:12:46 +0200
Subject: [Biopython] still more questions about NGS sequenbce trimming
In-Reply-To: <8C93404AC678DC44905F571FD327A6CC1E93464E@ECS-EXG-P-MB03.win.lanl.gov>
References: <8C93404AC678DC44905F571FD327A6CC1E93464E@ECS-EXG-P-MB03.win.lanl.gov>
Message-ID:

A very quick and dirty approach for your reject function (I hope I
understood correctly) in script form:

#!/usr/bin/env python
import sys, re
from Bio import SeqIO

def main():
    for record in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") :
        if not discard(str(record.seq)):
            SeqIO.write(record, sys.stdout, 'fasta')

def discard(seq):
    oRes = re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', seq)
    if oRes: return 1
    else: return 0

if __name__ == '__main__':
    sys.exit(main())

Best,
Seb

On Wed, Oct 24, 2012 at 5:49 PM, Kiss, Csaba wrote:

> Hi All!
> Thanks for all your help to extract DNA sequences from sff files. Using
> biopython I managed to improve the sequence extraction from 3 hours to 10
> minutes.
> Now that I am hooked, I would like to replace mothur with some simple
> python functions.
> Is there any function in biopython that would look for homopolymers on DNA
> sequences. Particularly I am looking to reject a sequence if it has more
> than 8 bp of stretches of any single nucleotide.
>
> Another function I am looking for is a sliding window function along the
> quality file. I could either use the fastq file or the fasta/qual file pair.
>
> I could write these functions myself but if they are available, then it
> would make my life easier.
> Thanks
>
> Csaba
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From csaba.kiss at lanl.gov Wed Oct 24 13:20:23 2012
From: csaba.kiss at lanl.gov (Kiss, Csaba)
Date: Wed, 24 Oct 2012 17:20:23 +0000
Subject: [Biopython] still more questions about NGS sequenbce trimming
In-Reply-To:
References: <8C93404AC678DC44905F571FD327A6CC1E93464E@ECS-EXG-P-MB03.win.lanl.gov>
Message-ID: <8C93404AC678DC44905F571FD327A6CC1E9346C1@ECS-EXG-P-MB03.win.lanl.gov>

Thanks, Seb. That's a clever usage of regex.

csaba

From: Sebastian Schmeier [mailto:s.schmeier at gmail.com]
Sent: Wednesday, October 24, 2012 11:13 AM
To: Kiss, Csaba
Cc: biopython at lists.open-bio.org
Subject: Re: [Biopython] still more questions about NGS sequenbce trimming

A very quick and dirty approach for your reject function (I hope I understood correctly) in script form:

#!/usr/bin/env python
import sys, re
from Bio import SeqIO

def main():
    for record in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") :
        if not discard(str(record.seq)):
            SeqIO.write(record, sys.stdout, 'fasta')

def discard(seq):
    oRes = re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', seq)
    if oRes: return 1
    else: return 0

if __name__ == '__main__':
    sys.exit(main())

Best,
Seb

On Wed, Oct 24, 2012 at 5:49 PM, Kiss, Csaba wrote:
Hi All!
Thanks for all your help to extract DNA sequences from sff files. Using biopython I managed to improve the sequence extraction from 3 hours to 10 minutes.
Now that I am hooked, I would like to replace mothur with some simple python functions.
Is there any function in biopython that would look for homopolymers on DNA sequences. Particularly I am looking to reject a sequence if it has more than 8 bp of stretches of any single nucleotide.
Another function I am looking for is a sliding window function along the quality file. I could either use the fastq file or the fasta/qual file pair. I could write these functions myself but if they are available, then it would make my life easier. Thanks Csaba _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Wed Oct 24 13:22:57 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 24 Oct 2012 18:22:57 +0100 Subject: [Biopython] still more questions about NGS sequenbce trimming In-Reply-To: References: <8C93404AC678DC44905F571FD327A6CC1E93464E@ECS-EXG-P-MB03.win.lanl.gov> Message-ID: On Wed, Oct 24, 2012 at 6:12 PM, Sebastian Schmeier wrote: > A very quick and dirty approach for your reject function (I hope I > understood correctly) in script form: > > #!/usr/bin/env python > import sys, re > from Bio import SeqIO > > def main(): > for record in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") : > if not discard(str(record.seq)): > SeqIO.write(record, sys.stdout, 'fasta') > > def discard(seq): > oRes = re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', seq) > if oRes: return 1 > else: return 0 > > if __name__ == '__main__': > sys.exit(main()) Minor suggestions - if you are going to use a regular expression many times (here once per read), compile it once first. 
Also
Python defines "True" and "False" which are more natural
than 1 and 0, but in fact you could do:

def discard(seq):
    return bool(re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', seq))

At that point defining and using a function seems a bit of
an unnecessary overhead so:

def main():
    for record in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") :
        if not re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', str(record.seq)):
            SeqIO.write(record, sys.stdout, 'fasta')

Next a much more important point - try and make a single
call to SeqIO.write, with all the records (using an iterator
approach) rather than many calls to SeqIO.write (which
isn't supported for output in formats like SFF). This should
be faster:

    for record in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") :
        if not re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', str(record.seq)):
            SeqIO.write(record, sys.stdout, 'fasta')

From p.j.a.cock at googlemail.com Wed Oct 24 13:27:14 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 24 Oct 2012 18:27:14 +0100
Subject: [Biopython] still more questions about NGS sequenbce trimming
In-Reply-To:
References: <8C93404AC678DC44905F571FD327A6CC1E93464E@ECS-EXG-P-MB03.win.lanl.gov>
Message-ID:

On Wed, Oct 24, 2012 at 6:22 PM, Peter Cock wrote:
> On Wed, Oct 24, 2012 at 6:12 PM, Sebastian Schmeier
> wrote:
>> A very quick and dirty approach for your reject function (I hope I
>> understood correctly) in script form:
>>
>> #!/usr/bin/env python
>> import sys, re
>> from Bio import SeqIO
>>
>> def main():
>>     for record in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") :
>>         if not discard(str(record.seq)):
>>             SeqIO.write(record, sys.stdout, 'fasta')
>>
>> def discard(seq):
>>     oRes = re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', seq)
>>     if oRes: return 1
>>     else: return 0
>>
>> if __name__ == '__main__':
>>     sys.exit(main())
>
> Minor suggestions - if you are going to use a regular expression
> many times (here once per read), compile it once first. Also
> Python defines "True" and "False" which are more natural
> than 1 and 0, but in fact you could do:
>
> def discard(seq):
>     return bool(re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', seq))
>
> At that point defining and using a function seems a bit of
> an unnecessary overhead so:
>
> def main():
>     for record in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") :
>         if not re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', str(record.seq)):
>             SeqIO.write(record, sys.stdout, 'fasta')
>
> Next a much more important point - try and make a single
> call to SeqIO.write, with all the records (using an iterator
> approach) rather than many calls to SeqIO.write (which
> isn't supported for output in formats like SFF). This should
> be faster:

Sorry Sebastian - I had a hiccup with my mouse focus and
accidentally sent that email half finished. I meant something
like this:

def main():
    wanted = (rec for rec in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") \
              if not re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', str(rec.seq)))
    count = SeqIO.write(wanted, sys.stdout, 'fasta')

There are other examples of filtering sequence files in the Tutorial,
http://biopython.org/DIST/docs/tutorial/Tutorial.html

I hope that is useful,

Peter

From csaba.kiss at lanl.gov Wed Oct 24 14:01:23 2012
From: csaba.kiss at lanl.gov (Kiss, Csaba)
Date: Wed, 24 Oct 2012 18:01:23 +0000
Subject: [Biopython] still more questions about NGS sequence trimming
Message-ID: <8C93404AC678DC44905F571FD327A6CC1E934708@ECS-EXG-P-MB03.win.lanl.gov>

Thanks, Peter. I am looking at this example now:

from Bio import SeqIO
good_reads = (rec for rec in \
              SeqIO.parse("SRR020192.fastq", "fastq") \
              if min(rec.letter_annotations["phred_quality"]) >= 20)
count = SeqIO.write(good_reads, "good_quality.fastq", "fastq")
print "Saved %i reads" % count

That's a rather crude quality filtering. Is there any more sophisticated options already in biopython? Ie. quality_average

Or other options?
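
Going back to the homopolymer filter for a moment: Peter's advice to compile the regular expression once was never shown in code in this thread. A minimal sketch of how that might look is below. The names HOMOPOLYMER and has_long_homopolymer are invented for illustration, and the backreference pattern is a compact rewrite assumed to match the same reads as Seb's five-way alternation (a run of 9 or more of A, C, G, T or N, upper case only).

```python
import re

# Compiled once at import time and reused for every read.
# ([ACGTN]) captures one base; \1{8,} requires at least 8 more copies
# of that same base, i.e. a run of 9 or more identical letters.
HOMOPOLYMER = re.compile(r"([ACGTN])\1{8,}")

def has_long_homopolymer(seq):
    """Return True if seq (a plain string) contains a run of 9+ identical bases.

    Matching is case sensitive, as in the original regular expression, so
    lower case bases (e.g. from untrimmed SFF reads) are not counted.
    """
    return HOMOPOLYMER.search(seq) is not None

# Possible use with Biopython (not run here):
#   wanted = (rec for rec in SeqIO.parse(handle, "fasta")
#             if not has_long_homopolymer(str(rec.seq)))
#   SeqIO.write(wanted, sys.stdout, "fasta")
```

The module-level pattern avoids the per-call cache lookup that re.search does with a pattern string, which is a small saving but adds up over millions of reads.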
From lancas at uw.edu Wed Oct 24 21:11:14 2012 From: lancas at uw.edu (Samuel M. Lancaster) Date: Wed, 24 Oct 2012 18:11:14 -0700 Subject: [Biopython] Biopython for Max OS Lion 10.7.5 Message-ID: Hi, I am running into a problem installing Biopython on my computer. To use Biopython I need Numpy; however I can only find Biopython for Python 2.7 and Numpy for Python 2.6. Can you direct me to a place where I can find either Biopython for Python 2.6 or Numpy for Python 2.7 that I can use with Mac OS Lion 10.7.5? Thanks, Sam From nuin at genedrift.org Wed Oct 24 21:16:31 2012 From: nuin at genedrift.org (Paulo Nuin) Date: Wed, 24 Oct 2012 21:16:31 -0400 Subject: [Biopython] Biopython for Max OS Lion 10.7.5 In-Reply-To: References: Message-ID: Hi You can install from source a beta version http://sourceforge.net/projects/numpy/files/NumPy/1.7.0b2/ it should work fine if you follow the instructions. Cheers Paulo On 2012-10-24, at 9:11 PM, "Samuel M. Lancaster" wrote: > Hi, > I am running into a problem installing Biopython on my computer. To use > Biopython I need Numpy; however I can only find Biopython for Python 2.7 > and Numpy for Python 2.6. Can you direct me to a place where I can find > either Biopython for Python 2.6 or Numpy for Python 2.7 that I can use with > Mac OS Lion 10.7.5?
> > Thanks, > Sam > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From nuin at genedrift.org Wed Oct 24 21:26:24 2012 From: nuin at genedrift.org (Paulo Nuin) Date: Wed, 24 Oct 2012 21:26:24 -0400 Subject: [Biopython] Biopython for Max OS Lion 10.7.5 In-Reply-To: References: Message-ID: Alternatively, you can also try installing everything via pip http://pypi.python.org/pypi/pip Cheers Paulo On 2012-10-24, at 9:11 PM, "Samuel M. Lancaster" wrote: > Hi, > I am running into a problem installing Biopython on my computer. To use > Biopython I need Numpy; however I can only find Biopython for Python 2.7 > and Numpy for Python 2.6. Can you direct me to a place where I can find > either Biopython for Python 2.6 or Numpy for Python 2.7 that I can use with > Mac OS Lion 10.7.5? > > Thanks, > Sam > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Thu Oct 25 04:03:34 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 25 Oct 2012 09:03:34 +0100 Subject: [Biopython] Biopython for Max OS Lion 10.7.5 In-Reply-To: References: Message-ID: On Thu, Oct 25, 2012 at 2:11 AM, Samuel M. Lancaster wrote: > Hi, > I am running into a problem installing Biopython on my computer. To use > Biopython I need Numpy; however I can only find Biopython for Python 2.7 > and Numpy for Python 2.6. Can you direct me to a place where I can find > either Biopython for Python 2.6 or Numpy for Python 2.7 that I can use with > Mac OS Lion 10.7.5? > > Thanks, > Sam Note NumPy provides precompiled packages for Mac OS X for the official Python installs from Python.org - not the Apple provided Python installation. Which Python(s) are you trying to use? You can if you wish install NumPy from source on Mac OS X for any installed Python. 
You will need Apple's XCode from the App Store to do this (you'll need
it anyway to compile Biopython's C modules).

Peter

From p.j.a.cock at googlemail.com Thu Oct 25 04:14:50 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 25 Oct 2012 09:14:50 +0100
Subject: [Biopython] still more questions about NGS sequence trimming
In-Reply-To: <8C93404AC678DC44905F571FD327A6CC1E934708@ECS-EXG-P-MB03.win.lanl.gov>
References: <8C93404AC678DC44905F571FD327A6CC1E934708@ECS-EXG-P-MB03.win.lanl.gov>
Message-ID:

On Wed, Oct 24, 2012 at 7:01 PM, Kiss, Csaba wrote:
> Thanks, Peter. I am looking at this example now:
>
> from Bio import SeqIO
> good_reads = (rec for rec in \
>               SeqIO.parse("SRR020192.fastq", "fastq") \
>               if min(rec.letter_annotations["phred_quality"]) >= 20)
> count = SeqIO.write(good_reads, "good_quality.fastq", "fastq")
> print "Saved %i reads" % count
>
> That's a rather crude quality filtering. Is there any more sophisticated options
> already in biopython? Ie. quality_average
>
> Or other options?

Average (mean) quality is easy, take the sum and divide by the length
(or in this case, I've moved the divide to a multiply on the other side
of the inequality since generally multiplication is faster than division):

from Bio import SeqIO
good_reads = (rec for rec in \
              SeqIO.parse("SRR020192.fastq", "fastq") \
              if sum(rec.letter_annotations["phred_quality"]) >= 20*len(rec))
count = SeqIO.write(good_reads, "good_quality.fastq", "fastq")
print "Saved %i reads" % count

However, for most sequencing reads you'd want to use a trimming step
first as read quality tends to decline with length - the first half might
be good and the second half bad, meaning the average is poor. You could
write a little function to do that, and slice the SeqRecord to select the
good chunk. There are examples of that in the Tutorial for removing an
adapter/adaptor or PCR primer from FASTQ files.
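
As a rough sketch of the trim-then-filter idea (this is not code from the thread): the helper names good_prefix_length and passes_mean_quality below are made up for illustration, they work on a plain list of PHRED scores, and cutting at the first base under the threshold is only one simple trimming rule among several possible ones.

```python
def good_prefix_length(quals, min_q=20):
    """Return how many leading bases come before the first score below min_q."""
    for i, q in enumerate(quals):
        if q < min_q:
            return i
    return len(quals)


def passes_mean_quality(quals, min_mean=20):
    """Mean-quality test written as a multiplication, as in the email above.

    Empty reads are treated as failing (an assumption of this sketch).
    """
    return len(quals) > 0 and sum(quals) >= min_mean * len(quals)


# A read whose first part is good and whose tail is poor (invented numbers).
scores = [38] * 5 + [5] * 10

whole_read_ok = passes_mean_quality(scores)      # the poor tail drags the mean down
keep = good_prefix_length(scores)                # number of leading bases to keep
trimmed_ok = passes_mean_quality(scores[:keep])  # the trimmed read on its own

print(whole_read_ok, keep, trimmed_ok)
```

With Biopython the score list would come from rec.letter_annotations["phred_quality"], and slicing the record as rec[:keep] should keep the sequence and its per-letter qualities in step.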
Peter From csaba.kiss at lanl.gov Thu Oct 25 10:49:59 2012 From: csaba.kiss at lanl.gov (Kiss, Csaba) Date: Thu, 25 Oct 2012 14:49:59 +0000 Subject: [Biopython] still more questions about NGS sequence trimming In-Reply-To: References: <8C93404AC678DC44905F571FD327A6CC1E934708@ECS-EXG-P-MB03.win.lanl.gov> Message-ID: <8C93404AC678DC44905F571FD327A6CC1E935803@ECS-EXG-P-MB03.win.lanl.gov> Thanks, Peter. I am writing my quality functions. Another question about trimming. As you mentioned, the quality of the ends tend to be lower than in the middle. Could that be fixed just by using "sff-trim" when I create my FASTQ file? If I don't do that I get sequences with small and capital letters. Are you suggesting further trimming than just "sff-trim". Csaba -----Original Message----- From: Peter Cock [mailto:p.j.a.cock at googlemail.com] Sent: Thursday, October 25, 2012 2:15 AM To: Kiss, Csaba Cc: biopython at lists.open-bio.org Subject: Re: [Biopython] still more questions about NGS sequence trimming On Wed, Oct 24, 2012 at 7:01 PM, Kiss, Csaba wrote: > Thanks, Peter. I am looking at this example now: > > from Bio import SeqIO > good_reads = (rec for rec in \ > SeqIO.parse("SRR020192.fastq", "fastq") \ > if min(rec.letter_annotations["phred_quality"]) >= 20) > count = SeqIO.write(good_reads, "good_quality.fastq", "fastq") print > "Saved %i reads" % count > > That's a rather crude quality filtering. Is there any more > sophisticated options already in biopython? Ie. quality_average > > Or other options? 
Average (mean) quality is easy, take the sum and divide by the length (or in this case, I've moved the divide to a multiply on the other side of the inequality since generally multiplication is faster than division): from Bio import SeqIO good_reads = (rec for rec in \ SeqIO.parse("SRR020192.fastq", "fastq") \ if sum(rec.letter_annotations["phred_quality"]) >= 20*len(rec)) count = SeqIO.write(good_reads, "good_quality.fastq", "fastq") print "Saved %i reads" % count However, for most sequencing reads you'd want to use a trimming step first as read quality tends to decline with length - the first half might be good and the second half bad, meaning the average is poor. You could write a little function to do that, and slice the SeqRecord to select the good chunk. There are examples of that in the Tutorial for removing an adapter/adaptor or PCR primer from FASTQ files. Peter From p.j.a.cock at googlemail.com Thu Oct 25 11:29:57 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 25 Oct 2012 16:29:57 +0100 Subject: [Biopython] still more questions about NGS sequence trimming In-Reply-To: <8C93404AC678DC44905F571FD327A6CC1E935803@ECS-EXG-P-MB03.win.lanl.gov> References: <8C93404AC678DC44905F571FD327A6CC1E934708@ECS-EXG-P-MB03.win.lanl.gov> <8C93404AC678DC44905F571FD327A6CC1E935803@ECS-EXG-P-MB03.win.lanl.gov> Message-ID: On Thu, Oct 25, 2012 at 3:49 PM, Kiss, Csaba wrote: > Thanks, Peter. I am writing my quality functions. Another question about > trimming. As you mentioned, the quality of the ends tend to be lower than > in the middle. Could that be fixed just by using "sff-trim" when I create my > FASTQ file? If I don't do that I get sequences with small and capital letters. > Are you suggesting further trimming than just "sff-trim". In Bio.SeqIO, we use the file format names "sff" and "sff-trim" to mean the raw sequence data from the SFF file in full, or with the trimming values inside the SFF file applied. 
If you have used the Roche tools you'll see a similar option in their SFF extraction tool. This default trimming is decided by the Roche 454 instrument and does quite a good job at removing the adapters, barcodes and poor quality bits. I assume you were using Mothur to do further trimming based on a more stringent sliding window of quality scores? Peter From csaba.kiss at lanl.gov Thu Oct 25 11:34:46 2012 From: csaba.kiss at lanl.gov (Kiss, Csaba) Date: Thu, 25 Oct 2012 15:34:46 +0000 Subject: [Biopython] still more questions about NGS sequence trimming In-Reply-To: References: <8C93404AC678DC44905F571FD327A6CC1E934708@ECS-EXG-P-MB03.win.lanl.gov> <8C93404AC678DC44905F571FD327A6CC1E935803@ECS-EXG-P-MB03.win.lanl.gov> Message-ID: <8C93404AC678DC44905F571FD327A6CC1E935829@ECS-EXG-P-MB03.win.lanl.gov> I believe mothur does check the moving average quality of a sequence with a sliding window of 50 bp. If the quality falls below the given value then it tosses the sequence out. I don't think it does end trimming beside removing the small letters from the ends. Of course, it can remove adapter and primer sequences but that's not based on quality values. C -----Original Message----- From: Peter Cock [mailto:p.j.a.cock at googlemail.com] Sent: Thursday, October 25, 2012 9:30 AM To: Kiss, Csaba Cc: biopython at lists.open-bio.org Subject: Re: [Biopython] still more questions about NGS sequence trimming On Thu, Oct 25, 2012 at 3:49 PM, Kiss, Csaba wrote: > Thanks, Peter. I am writing my quality functions. Another question > about trimming. As you mentioned, the quality of the ends tend to be > lower than in the middle. Could that be fixed just by using "sff-trim" > when I create my FASTQ file? If I don't do that I get sequences with small and capital letters. > Are you suggesting further trimming than just "sff-trim". 
In Bio.SeqIO, we use the file format names "sff" and "sff-trim" to mean the raw sequence data from the SFF file in full, or with the trimming values inside the SFF file applied. If you have used the Roche tools you'll see a similar option in their SFF extraction tool. This default trimming is decided by the Roche 454 instrument and does quite a good job at removing the adapters, barcodes and poor quality bits. I assume you were using Mothur to do further trimming based on a more stringent sliding window of quality scores? Peter From p.j.a.cock at googlemail.com Thu Oct 25 11:58:04 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 25 Oct 2012 16:58:04 +0100 Subject: [Biopython] still more questions about NGS sequence trimming In-Reply-To: <8C93404AC678DC44905F571FD327A6CC1E935829@ECS-EXG-P-MB03.win.lanl.gov> References: <8C93404AC678DC44905F571FD327A6CC1E934708@ECS-EXG-P-MB03.win.lanl.gov> <8C93404AC678DC44905F571FD327A6CC1E935803@ECS-EXG-P-MB03.win.lanl.gov> <8C93404AC678DC44905F571FD327A6CC1E935829@ECS-EXG-P-MB03.win.lanl.gov> Message-ID: On Thu, Oct 25, 2012 at 4:34 PM, Kiss, Csaba wrote: > I believe mothur does check the moving average quality of a sequence with > a sliding window of 50 bp. If the quality falls below the given value then it > tosses the sequence out. I don't think it does end trimming beside removing > the small letters from the ends. Of course, it can remove adapter and primer > sequences but that's not based on quality values. Fine - the point is doing SeqIO.parse("example.sff", "sff-trim") does NOT do any of that. All it does is apply the trimming information already recorded in the SFF file by the provider (e.g. the Roche 454 instrument). So back to your earlier question: > On Thu, Oct 25, 2012 at 3:49 PM, Kiss, Csaba wrote: >> Thanks, Peter. I am writing my quality functions. Another question >> about trimming. As you mentioned, the quality of the ends tend to be >> lower than in the middle. 
Could that be fixed just by using "sff-trim"
>> when I create my FASTQ file?

Using "sff-trim" would be sensible as a starting point, but you'll
still probably notice a drop off in quality along the read length.
This is normal.

>> If I don't do that I get sequences with small and capital letters.

The lower case bits are what Roche labelled as low quality or adapter.
The upper case bit is what Roche labelled as worth keeping after
its trimming, and it is this you'd get via SeqIO.parse("example.sff",
"sff-trim"). You'll probably notice all the untrimmed sequences start
with the same four letters (in lower case).

>> Are you suggesting further trimming than just "sff-trim".

Yes, if you want to mimic what Mothur was doing for you.

Peter

From afernandez at ceab.csic.es Thu Oct 25 12:32:42 2012
From: afernandez at ceab.csic.es (Antonio Fernandez-Guerra)
Date: Thu, 25 Oct 2012 18:32:42 +0200
Subject: [Biopython] still more questions about NGS sequence trimming
Message-ID:

--
Antonio Fernández-Guerra
Center for Advanced Studies of Blanes (CEAB-CSIC)
Acces Cala St Francesc, 14
17300 Blanes, SPAIN
Tel +34 972 33 6101
Fax +34 972 33 7806
http://nodens.ceab.csic.es/ecogenomics/members/antoni-fernandez-guerra.html
e-mail: afernandez at ceab.csic.es

Peter Cock wrote:

>On Thu, Oct 25, 2012 at 3:49 PM, Kiss, Csaba wrote:
>> Thanks, Peter. I am writing my quality functions. Another question about
>> trimming. As you mentioned, the quality of the ends tend to be lower than
>> in the middle. Could that be fixed just by using "sff-trim" when I create my
>> FASTQ file? If I don't do that I get sequences with small and capital letters.
>> Are you suggesting further trimming than just "sff-trim".
>
>In Bio.SeqIO, we use the file format names "sff" and "sff-trim" to mean
>the raw sequence data from the SFF file in full, or with the trimming
>values inside the SFF file applied. If you have used the Roche tools
>you'll see a similar option in their SFF extraction tool.
This default >trimming is decided by the Roche 454 instrument and does quite a >good job at removing the adapters, barcodes and poor quality bits. > >I assume you were using Mothur to do further trimming based on a >more stringent sliding window of quality scores? > >Peter >_______________________________________________ >Biopython mailing list - Biopython at lists.open-bio.org >http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Wed Oct 31 17:13:19 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 31 Oct 2012 21:13:19 +0000 Subject: [Biopython] Fwd: [mira_talk] new sff_extract In-Reply-To: References: <5090F1B5.9070000@upv.es> Message-ID: Apologies if this arrives twice, the OBF mail server was down earlier today - more news about that tomorrow I hope, along with information about the website. (This email is partly to confirm the list are alive again.) Peter ---------- Forwarded message ---------- From: *Peter Cock* Date: Wednesday, October 31, 2012 Subject: [mira_talk] new sff_extract To: Biopython Mailing List Hi all, For those working with SFF files (from Roche or IonTorrent), Jose's sff_extract tool has often been a popular alternative to the Roche (Linux only) off instrument applications - and Biopython's SFF support was based on sff_extract (thanks again Jose!). Jose has just announced (on the MIRA assembler mailing list) a new version of sff_extract which now calls the Biopython SFF code for the low level binary file access, and comes with some additional related tools. See below for details, original thread archive here: http://www.freelists.org/post/mira_talk/new-sff-extract Peter ---------- Forwarded message ---------- From: Jose Blanca > Date: Wed, Oct 31, 2012 at 9:39 AM Subject: [mira_talk] new sff_extract To: "mira_talk at freelists.org " > Hi: Sometime ago we discussed in this list the future of sff_extract. We started working on it and we have a version that we think is working. 
The sff_extract functionality has been split in two sff_extract and split_matepairs that can be linked together with a pipe. We haven't done extensive testing so if you use them, please let us know. These utilities are bundled with some other little tools that we have developed for our day to day work. They are all written in python and they use biopython. You can take a look at the development site: https://github.com/JoseBlanca/seq_crumbs Or our site: http://bioinf.comav.upv.es/seq_crumbs/ Of course we'd love to have some feedback. Best regards, -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) -- You have received this mail because you are subscribed to the mira_talk mailing list. For information on how to subscribe or unsubscribe, please visit http://www.chevreux.org/mira_mailinglists.html From francesco.chiani at gmail.com Wed Oct 3 14:18:22 2012 From: francesco.chiani at gmail.com (francesco chiani) Date: Wed, 3 Oct 2012 14:18:22 +0000 (UTC) Subject: [Biopython] error in parseing Gene bank Message-ID: Hi Everyone, Someone have an idea of why in biopython for python 2.7 give me this error while parsing a gene bank file? 
Traceback (most recent call last): in for seq_record in SeqIO.parse(handle, "genbank"): File "C:\Python27\lib\site-packages\Bio\SeqIO\__init__.py", line 537, in parse for r in i: File "C:\Python27\lib\site-packages\Bio\GenBank\Scanner.py", line 445, in parse_records record = self.parse(handle, do_features) File "C:\Python27\lib\site-packages\Bio\GenBank\Scanner.py", line 428, in parse if self.feed(handle, consumer, do_features): File "C:\Python27\lib\site-packages\Bio\GenBank\Scanner.py", line 410, in feed consumer.record_end("//") File "C:\Python27\lib\site-packages\Bio\GenBank\__init__.py", line 1184, in record_end % self._seq_type) ValueError: Could not determine alphabet for seq_type dna p.s.-:This doesn't happen in biopython for python 2.6 version I suppose, 'cause with the same gene bank file in this version my script works... thanks x your help, F. From semenko at alum.mit.edu Wed Oct 3 14:28:20 2012 From: semenko at alum.mit.edu (Nick Semenkovich) Date: Wed, 3 Oct 2012 09:28:20 -0500 Subject: [Biopython] error in parseing Gene bank In-Reply-To: References: Message-ID: I've had this problem too, and have considered patching it: I * think * (without seeing your file), that your seqtype must be in capital letters (e.g. DNA). Does that work? - Nick On Wed, Oct 3, 2012 at 9:18 AM, francesco chiani wrote: > Hi Everyone, > Someone have an idea of why in biopython for python 2.7 give me this error > while parsing a gene bank file? 
> > > Traceback (most recent call last): > in > for seq_record in SeqIO.parse(handle, "genbank"): > File "C:\Python27\lib\site-packages\Bio\SeqIO\__init__.py", line 537, in > parse > for r in i: > File "C:\Python27\lib\site-packages\Bio\GenBank\Scanner.py", line 445, in > parse_records > record = self.parse(handle, do_features) > File "C:\Python27\lib\site-packages\Bio\GenBank\Scanner.py", line 428, > in parse > if self.feed(handle, consumer, do_features): > File "C:\Python27\lib\site-packages\Bio\GenBank\Scanner.py", line 410, > in feed > consumer.record_end("//") > File "C:\Python27\lib\site-packages\Bio\GenBank\__init__.py", line 1184, > in > record_end > % self._seq_type) > ValueError: Could not determine alphabet for seq_type dna > > p.s.-:This doesn't happen in biopython for python 2.6 version I suppose, > 'cause > with the same gene bank file in this version my script works... > > > thanks x your help, > F. > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Nick Semenkovich Laboratory of Dr. Jeffrey I. Gordon Medical Scientist Training Program School of Medicine Washington University in St. Louis 314.362.3963 (Lab) 314.374.4434 (Cell) http://web.mit.edu/semenko/ From p.j.a.cock at googlemail.com Wed Oct 3 14:35:46 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 3 Oct 2012 15:35:46 +0100 Subject: [Biopython] error in parseing Gene bank In-Reply-To: References: Message-ID: On Wed, Oct 3, 2012 at 3:18 PM, francesco chiani wrote: > Hi Everyone, > Someone have an idea of why in biopython for python 2.7 give me this error > while parsing a gene bank file? 
> > > Traceback (most recent call last): > in > for seq_record in SeqIO.parse(handle, "genbank"): > File "C:\Python27\lib\site-packages\Bio\SeqIO\__init__.py", line 537, in parse > for r in i: > File "C:\Python27\lib\site-packages\Bio\GenBank\Scanner.py", line 445, in > parse_records > record = self.parse(handle, do_features) > File "C:\Python27\lib\site-packages\Bio\GenBank\Scanner.py", line 428, in parse > if self.feed(handle, consumer, do_features): > File "C:\Python27\lib\site-packages\Bio\GenBank\Scanner.py", line 410, in feed > consumer.record_end("//") > File "C:\Python27\lib\site-packages\Bio\GenBank\__init__.py", line 1184, in > record_end > % self._seq_type) > ValueError: Could not determine alphabet for seq_type dna > > p.s.-:This doesn't happen in biopython for python 2.6 version I suppose, 'cause > with the same gene bank file in this version my script works... > > > thanks x your help, > F. If you've switched from Python 2.6 to Python 2.7, it is likely you've also got a more recent Biopython on the Python 2.7 installation. Could you check the two Biopython versions? The problem is most likely an invalid LOCUS line (which should indicate if the sequence is DNA/Protein etc). Could you show us the first few lines of the GenBank file? Thanks, Peter From p.j.a.cock at googlemail.com Wed Oct 3 14:42:58 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 3 Oct 2012 15:42:58 +0100 Subject: [Biopython] error in parseing Gene bank In-Reply-To: References: Message-ID: On Wed, Oct 3, 2012 at 3:39 PM, francesco chiani wrote: > Thanks for the replys: > > here the gbk file: > > LOCUS allele_48659_OTTMUSE00000300743_L1L2_Bact_P 37935 bp > dna linear UNK > ACCESSION unknown > DBSOURCE accession design_id=48659 > COMMENT cassette : L1L2_Bact_P > COMMENT design_id : 48659 > FEATURES Location/Qualifiers > ... As Nick guessed, I think the problem is you have 'dna' (lower case) in the LOCUS line, rather than 'DNA' (upper case). Where did this file come from? 
e.g. What software tool or database made it?

http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord#MoleculeTypeB

In any case, perhaps Biopython could check for 'dna' as well
(as some tools don't seem to obey this bit of the standard)?

Thanks

Peter

From p.j.a.cock at googlemail.com Wed Oct 3 15:00:10 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 3 Oct 2012 16:00:10 +0100
Subject: [Biopython] error in parseing Gene bank
In-Reply-To:
References:
Message-ID:

On Wed, Oct 3, 2012 at 3:56 PM, francesco chiani wrote:
> Xfect after replace "dna" with "DNA" in the gbk file , the script works!
> Fantastic.
>
> The gene bank file is from IKMC portal
> http://www.i-dcc.org/martsearch/
> I have no idea about the software used to made it sorry.. I just use them..
>

Could you email them, and include the link to the GenBank standard:
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord#MoleculeTypeB

> does my script have to replace "dna" every gene bank or there is a
> quicker solution I cant see?

Try using the optional argument to SeqIO.parse, e.g.

from Bio import SeqIO
from Bio.Alphabet import generic_dna
for seq_record in SeqIO.parse(handle, "genbank", alphabet=generic_dna):
    print seq_record.id

Regards,

Peter

P.S. Please CC the mailing list in your replies.

From devaniranjan at gmail.com Mon Oct 8 15:29:37 2012
From: devaniranjan at gmail.com (George Devaniranjan)
Date: Mon, 8 Oct 2012 11:29:37 -0400
Subject: [Biopython] not technically a biopython question
Message-ID:

Hi guys,

I am working on some FASTA-like sequences (not FASTA but something I have
defined that's similar for some culled PDB from the PISCES server)

I have a question:

I have a small number of sequences called nCatSeq, for which there are MULTIPLE
nBasinSeq. I go through a large PDB file and I want to extract for
each nCatSeq the corresponding nBasinSeq without redundancies in a
dictionary. The code snippet that does this is given below.

nCatSeq=item[1][n]+item[1][n+1]+item[1][n+2]+item[1][n+3]
nBasinSeq=item[2][n]+item[2][n+1]+item[2][n+2]+item[2][n+3]

if nCatSeq not in potBasin:
    potBasin[nCatSeq]=nBasinSeq
else:
    if nBasinSeq not in potBasin[nCatSeq]:
        potBasin[nCatSeq]=potBasin[nCatSeq],nBasinSeq
    else:
        pass

I get the following as the answer for one nCatSeq,
'4241': ((('VUVV', 'DDRV'), 'DDVG'), 'VUVV')

what I want however is :

'4241': ('VUVV', 'DDRV', 'DDVG', 'VUVV')

I don't want all the extra brackets due to the following command
potBasin[nCatSeq]=potBasin[nCatSeq],nBasinSeq
(see above code snippet)

Is there a way to do this ?

Thank you,
George

From devaniranjan at gmail.com Mon Oct 8 16:06:52 2012
From: devaniranjan at gmail.com (George Devaniranjan)
Date: Mon, 8 Oct 2012 12:06:52 -0400
Subject: [Biopython] not technically a biopython question
In-Reply-To: <5072F914.3010009@inaf.cnrs-gif.fr>
References: <5072F914.3010009@inaf.cnrs-gif.fr>
Message-ID:

Thank you Frederic but I have tried that and what it gives is :

VUVVDDRVDDVGVUVV

It basically joins all the tetramers together and I want them separately.

On Mon, Oct 8, 2012 at 12:02 PM, Frédéric Sohm wrote:

> Hi,
>
>
> if nBasinSeq not in potBasin[nCatSeq] :
>     potBasin[nCatSeq] = potBasin[nCatSeq] + (nBasinSeq,)
>
> or shorter
>
> if nBasinSeq not in potBasin[nCatSeq] :
>     potBasin[nCatSeq] += (nBasinSeq,)
>
> Regards,
>
> Fred
>
>
> On 08/10/12 17:29, George Devaniranjan wrote:
>
>> Hi guys,
>>
>> I am working on some FASTA like sequences (not FASTA but something I have
>> defined thats similar for some culled PDB from the PISCES server)
>>
>> I have a question:
>>
>> I have a small no of sequences called nCatSeq, for which there are
>> MULTIPLE
>> nBasinSeq, I go through a a large PDB file and I want to extract for for
>> each nCatSeq the corresponding nBasinSeq without redundancies in a
>> dictionary. The code snippet that does this is given below.
>> >> nCatSeq=item[1][n]+item[1][n+**1]+item[1][n+2]+item[1][n+3] >> nBasinSeq=item[2][n]+item[2][**n+1]+item[2][n+2]+item[2][n+3] >> >> >> if nCatSeq not in potBasin: >> potBasin[nCatSeq]=nBasinSeq >> else: >> if nBasinSeq not in potBasin[nCatSeq]: >> potBasin[nCatSeq]=potBasin[**nCatSeq],nBasinSeq >> else: >> >> pass >> >> >> >> >> I get the following as the answer for one nCatSeq, >> '4241': ((('VUVV', 'DDRV'), 'DDVG'), 'VUVV') >> >> >> what I want however is : >> >> '4241': ('VUVV', 'DDRV', 'DDVG', 'VUVV') >> >> I don't want all the extra brackets due to the following command >> potBasin[nCatSeq]=potBasin[**nCatSeq],nBasinSeq >> (see above code snippet) >> >> Is there a way to do this ? >> >> Thank you, >> George >> ______________________________**_________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/**mailman/listinfo/biopython >> >> > -- > Fr?d?ric Sohm > GIS AMAGEN CNRS INRA > Equipe INRA U1126 "Morphogen?se du syst?me nerveux des Chord?s" > UPR 3294 NED, CNRS > Institut de Neurobiologie A. Fessard > 1 Avenue de la Terrasse > 91 198 GIF-SUR -YVETTE > FRANCE > Phone: 33 1 69 82 34 12 > Fax: 33 1 69 82 41 67 > email: sohm at inaf.cnrs-gif.fr > From blind.watchmaker at yahoo.com Mon Oct 8 19:59:04 2012 From: blind.watchmaker at yahoo.com (John Ladasky) Date: Mon, 08 Oct 2012 12:59:04 -0700 Subject: [Biopython] not technically a biopython question In-Reply-To: References: Message-ID: <50733088.4020505@yahoo.com> Date: Mon, 8 Oct 2012 11:29:37 -0400 From: George Devaniranjan Subject: [Biopython] not technically a biopython question To: Biopython Mailing List Message-ID: > I get the following as the answer for one nCatSeq, > '4241': ((('VUVV', 'DDRV'), 'DDVG'), 'VUVV') > > > what I want however is : > > '4241': ('VUVV', 'DDRV', 'DDVG', 'VUVV') > What you want to do is to "flatten" a nested sequence object. There are many Python recipes to do this. Most of the solutions involve recursive function calls. 
Here's a page that discusses several ways to get it done: http://stackoverflow.com/questions/2158395/flatten-an-irregular-list-of-lists-in-python If your list or tuple is VERY deeply nested (say, 256 parentheses), you may hit Python's recursion limit. There are solutions on that page which don't require recursion, but they are frequently more difficult to understand. I tend to prefer code that I can read at a glance, myself. I don't know why flattening lists and tuples isn't a standard library function in Python yet, it seems like everyone needs to do this at some time or another. From devaniranjan at gmail.com Mon Oct 8 20:11:06 2012 From: devaniranjan at gmail.com (George Devaniranjan) Date: Mon, 8 Oct 2012 16:11:06 -0400 Subject: [Biopython] not technically a biopython question In-Reply-To: <50733088.4020505@yahoo.com> References: <50733088.4020505@yahoo.com> Message-ID: Thank you very much, this is what I did and it seems to work. I found the answer on stack overflow. if nCatSeq not in potBasin: potBasin[nCatSeq] = (nBasinSeq,) else: if nBasinSeq not in potBasin[nCatSeq]: potBasin[nCatSeq] = potBasin[nCatSeq] + (nBasinSeq,) On Mon, Oct 8, 2012 at 3:59 PM, John Ladasky wrote: > Date: Mon, 8 Oct 2012 11:29:37 -0400 From: George Devaniranjan < > devaniranjan at gmail.com> Subject: [Biopython] not technically a biopython > question To: Biopython Mailing List > Message-ID: gmail.com > > > > I get the following as the answer for one nCatSeq, >> '4241': ((('VUVV', 'DDRV'), 'DDVG'), 'VUVV') >> >> >> what I want however is : >> >> '4241': ('VUVV', 'DDRV', 'DDVG', 'VUVV') >> >> > What you want to do is to "flatten" a nested sequence object. There are > many Python recipes to do this. Most of the solutions involve recursive > function calls. 
Here's a page that discusses several ways to get it done:
>
> http://stackoverflow.com/questions/2158395/flatten-an-irregular-list-of-lists-in-python
>
> If your list or tuple is VERY deeply nested (say, 256 parentheses), you
> may hit Python's recursion limit. There are solutions on that page which
> don't require recursion, but they are frequently more difficult to
> understand. I tend to prefer code that I can read at a glance, myself.
>
> I don't know why flattening lists and tuples isn't a standard library
> function in Python yet, it seems like everyone needs to do this at some
> time or another.
>
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From ivaylo.stoimenov at gmail.com Wed Oct 10 11:00:55 2012
From: ivaylo.stoimenov at gmail.com (Ivaylo Stoimenov)
Date: Wed, 10 Oct 2012 13:00:55 +0200
Subject: [Biopython] Read and Parse EMBOSS primer3-eprimer32
Message-ID:

Hi,

I have a problem using the Read and Parse functions when it comes to EMBOSS
Primer3 (or the eprimer32 wrapper). I would like to skip writing files and
instead capture the output of Primer3 in a variable (object). Here is part
of the code, which does not work:

from Bio.Emboss import Primer3
from Bio.Emboss.Applications import Primer3Commandline
import sys
...
cline = Primer3Commandline(sequence=combined_frame, auto=True, task =1)
cline.explainflag = True
cline.prange="100-150"
cline.outfile = "stdout"
ggg = cline()
primer_record = Primer3.read(sys.stdout or ggg[0] or ...)
print primer_record
...

What am I doing wrong? The output of cline() is a tuple, but I would like to
read or parse the first element. The problem is that Primer3.read and
Primer3.parse expect file handles. Any advice would be highly appreciated.
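
One way to avoid a temporary file, assuming the program's output has already been captured as a string, is to wrap that string in an in-memory handle. In the sketch below count_records is a made-up stand-in for any reader that expects a file handle (it is not Primer3.read), and the sample text is invented. On the Python 2 installations discussed in this thread the import would be "from StringIO import StringIO" instead.

```python
from io import StringIO


def count_records(handle):
    """Hypothetical reader: counts the non-blank lines of a file-like object."""
    return sum(1 for line in handle if line.strip())


# Invented stand-in for text captured from a command line tool's stdout.
captured = "first record\n\nsecond record\n"

handle = StringIO(captured)  # behaves like an open text file
n_records = count_records(handle)
print(n_records)
```

The same handle object could be passed to any function that only needs to iterate over or read from a file.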
Kind regards, Ivaylo From hlapp at drycafe.net Wed Oct 10 20:31:04 2012 From: hlapp at drycafe.net (Hilmar Lapp) Date: Wed, 10 Oct 2012 16:31:04 -0400 Subject: [Biopython] Fwd: [Announce] Call for Proposals for Doc Sprint Summit v2.0 References: Message-ID: If you've been a Google Summer of Code mentor this year or last year, you will have already seen this. I wanted to make sure everybody is aware, and this may provide the opportunity for the kind of concerted effort that could finally get a BioPerl, Biopython, Bioruby, or Biojava (or a combined??) off the ground. -hilmar Begin forwarded message: From: Carol Smith Subject: [GSoC Mentors] [Announce] Call for Proposals for Doc Sprint Summit v2.0 Date: October 10, 2012 2:44:50 PM EDT To: Google Summer of Code Mentors List Cc: adam at flossmanuals.net Dear GSoC mentors and org admins, Google Summer of Code in collaboration with Aspiration and FLOSS Manuals is hosting a "Doc Sprint Camp" at Google's Mountain View headquarters (California) Dec 3 - 7, 2012. The 2012 Doc Camp will feature: 1) An unconference on free software documentation topics - facilitated by Aspiration 2) 2-5 Book Sprints to produce books on free softwares - facilitated by FLOSS Manuals Building on the success of the 2011 GSoC Doc Camp we are proud to bring you the 2012 GSoC Doc Camp. Like the previous event the 2012 GSoC Doc Camp is a place for free software communities to meet, create a book for their project, attract new people to their efforts, and share their documentation experiences. The camp aims to improve free documentation materials and skills in free software projects and individuals and help form the identity of the emergent free documentation sector. Individuals and projects can apply. Food and accommodation for all individuals will be provided and travel support (full or partial) can also be applied for. Be a part of this exciting event ? 
propose a Book Sprint on your favorite free software or come and help others write a book on their favorite project. Guaranteed to be a lot of fun, productive, and a fantastic place to advance your documentation efforts and experiences. For more information or to register to take part, please see https://sites.google.com/site/docsprintsummitv2/. Please note proposals are due by October 26, so get yours in ASAP! Cheers, Carol Smith, Allen Gunn, Adam Hyde -- You received this message because you are subscribed to the Google Groups "Google Summer of Code Mentors List" group. To post to this group, send email to google-summer-of-code-mentors-list at googlegroups.com. To unsubscribe from this group, send email to google-summer-of-code-mentors-list+unsubscribe at googlegroups.com. For more options, visit this group at http://groups.google.com/group/google-summer-of-code-mentors-list?hl=en. -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- informatics.nescent.org : =========================================================== -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net : =========================================================== From nuin at genedrift.org Thu Oct 11 01:55:53 2012 From: nuin at genedrift.org (Paulo Nuin) Date: Wed, 10 Oct 2012 21:55:53 -0400 Subject: [Biopython] Affy CEL files Message-ID: Hi I found some old discussions on parsing CEL files generated from Affymetrix microarrays (and such). What is the current status of this in BioPython? I was able to find some classes but there is not a lot of documentation about them. 
Thanks in advance Paulo From p.j.a.cock at googlemail.com Thu Oct 11 11:05:51 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 11 Oct 2012 12:05:51 +0100 Subject: [Biopython] Read and Parse EMBOSS primer3-eprimer32 In-Reply-To: References: Message-ID: Hello Ivaylo, On Wed, Oct 10, 2012 at 12:00 PM, Ivaylo Stoimenov wrote: > Hi, > > I have a problem of using Read and Parse functions when it comes to > EMBOSS Primer3 (or eprimer32 wrapper). I would like to skip writing files but > hijacking the output of Primer3 to a variable (object). Here is some part > of the code, which does not work: > > from Bio.Emboss import Primer3 > from Bio.Emboss.Applications import Primer3Commandline > import sys > ... > cline = Primer3Commandline(sequence=combined_frame, auto=True, task =1) > cline.explainflag = True > cline.prange="100-150" > cline.outfile = "stdout" I think that will just write to a file called stdout (unless Primer3 does something special). I think you should use cline.stdout = True instead, which will add the switch -stdout to the EMBOSS tool's command line. Once you have told EMBOSS to write the output to stdout instead of to a file, you must get Python to parse this. Either use subprocess and the child process's stdout handle (see examples of this in the Biopython Tutorial), or if you capture stdout as a string turn the string into a pretend handle using StringIO. e.g. cline = Primer3Commandline(...) stdout, stderr = cline() #Runs it, captures output as strings from StringIO import StringIO handle = StringIO(stdout) #Turns string into pretend handle Peter From mjldehoon at yahoo.com Sat Oct 13 11:41:44 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 13 Oct 2012 04:41:44 -0700 (PDT) Subject: [Biopython] Affy CEL files In-Reply-To: Message-ID: <1350128504.35696.YahooMailClassic@web164002.mail.gq1.yahoo.com> As far as I know nobody is actively working on this. I would really like to see better support for Affy and other microarrays.
I have been thinking to implement something myself but I haven't found the time to do so. Would you (or anybody else) be willing to contribute some code or documentation for Affy microarrays in Biopython? Thanks, -Michiel. --- On Wed, 10/10/12, Paulo Nuin wrote: > From: Paulo Nuin > Subject: [Biopython] Affy CEL files > To: "BioPython Mailing List" > Date: Wednesday, October 10, 2012, 9:55 PM > Hi > > I found some old discussions on parsing CEL files generated > from Affymetrix microarrays (and such). What is the current > status of this in BioPython? I was able to find some classes > but there is not a lot of documentation about them. > > Thanks in advance > > Paulo > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From nuin at genedrift.org Wed Oct 17 00:22:00 2012 From: nuin at genedrift.org (Paulo Nuin) Date: Tue, 16 Oct 2012 20:22:00 -0400 Subject: [Biopython] Affy CEL files In-Reply-To: <1350128504.35696.YahooMailClassic@web164002.mail.gq1.yahoo.com> References: <1350128504.35696.YahooMailClassic@web164002.mail.gq1.yahoo.com> Message-ID: <950ED12B-784E-4DEC-A121-D27CA3AAE46F@genedrift.org> Hi That's an idea. I will see what I can do on my end. We are analysing a lot of microarrays and it would be helpful to have something in Python instead of R only. Cheers Paulo On 2012-10-13, at 7:41 AM, Michiel de Hoon wrote: > As far as I know nobody is actively working on this. I would really like to see better support for Affy and other microarrays. I have been thinking to implement something myself but I haven't found the time to do so. Would you (or anybody else) be willing to contribute some code or documentation for Affy microarrays in Biopython? > > Thanks, > -Michiel.
> > --- On Wed, 10/10/12, Paulo Nuin wrote: > >> From: Paulo Nuin >> Subject: [Biopython] Affy CEL files >> To: "BioPython Mailing List" >> Date: Wednesday, October 10, 2012, 9:55 PM >> Hi >> >> I found some old discussions on parsing CEL files generated >> from Affymetrix microarrays (and such). What is the current >> status of this in BioPython? I was able to find some classes >> but there is not a lot of documentation about them. >> >> Thanks in advance >> >> Paulo >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> From p.j.a.cock at googlemail.com Thu Oct 18 18:33:04 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 18 Oct 2012 19:33:04 +0100 Subject: [Biopython] PyPy 1.8 support? Message-ID: Hello all, We currently run the test suite against both PyPy 1.8 and 1.9 on Linux via the TravisCI.org continuous integration testing service. Is anyone actually using Biopython under PyPy 1.8? If not, I intend to drop automated testing under PyPy 1.8 and focus just on PyPy 1.9 instead. (Automated testing under C Python 2.5, 2.6, 2.7, 3.1 and 3.2 etc will continue - I'm hoping to add Python 3.3 as well) Thanks, Peter From p.j.a.cock at googlemail.com Fri Oct 19 07:52:19 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 19 Oct 2012 08:52:19 +0100 Subject: [Biopython] PyPy 1.8 support? In-Reply-To: References: Message-ID: On Thu, Oct 18, 2012 at 7:33 PM, Peter Cock wrote: > Hello all, > > We currently run the test suite against both PyPy 1.8 and > 1.9 on Linux via the TravisCI.org continuous integration > testing service. > > Is anyone actually using Biopython under PyPy 1.8? > > If not, I intend to drop automated testing under PyPy 1.8 > and focus just on PyPy 1.9 instead. 
Done on TravisCI, but easy to revert: https://github.com/biopython/biopython/commit/126c944812730df4677c8fa2f63abc29ddd084bb One reason was the previous build failed due to a timeout fetching PyPy for a custom install. Now we use the TravisCI provided PyPy which should avoid that issue. (It still happens for Jython sometimes). Peter From p.j.a.cock at googlemail.com Mon Oct 22 17:17:34 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 22 Oct 2012 18:17:34 +0100 Subject: [Biopython] Dropping Python 2.5 and Jython 2.5 support? Message-ID: Dear Biopythoneers, Would anyone object to us preparing to drop support for Python 2.5 and Jython 2.5, perhaps after the next Biopython release? To reassure those of you using Jython, we'd wait until Jython 2.7 is out first. Jython 2.7 is already in alpha, and brings support for C Python 2.7 language features. Thanks, Peter From csaba.kiss at lanl.gov Tue Oct 23 16:01:17 2012 From: csaba.kiss at lanl.gov (Kiss, Csaba) Date: Tue, 23 Oct 2012 16:01:17 +0000 Subject: [Biopython] sff into fasta and qual -> trim Message-ID: <8C93404AC678DC44905F571FD327A6CC1E93341A@ECS-EXG-P-MB03.win.lanl.gov> I am new to bio-python. I am trying to replace mothur with BioPython. I hope that biopython is faster than mothur. All I want to do is this: sffinfo(sff=sd11.fasta) trim.seqs(fasta=sd11.fasta, qfile=sd11.qual, minlength = 50, maxhomop=8, qwindowsize=50, qwindowaverage =22) Can someone help me to translate the two mothur statements above to biopython, please? It would be greatly appreciated. thanks -- Best Regards: Csaba Kiss PhD, MSc, BSc TA-43, HRL-1, MS888 Los Alamos National Laboratory Work: 1-505-667-9898 Cell: 1-505-920-5774 From csaba.kiss at lanl.gov Tue Oct 23 16:04:21 2012 From: csaba.kiss at lanl.gov (Kiss, Csaba) Date: Tue, 23 Oct 2012 16:04:21 +0000 Subject: [Biopython] sff inot fasta and qual then trim Message-ID: <8C93404AC678DC44905F571FD327A6CC1E93343F@ECS-EXG-P-MB03.win.lanl.gov> I am new to bio-python. 
I am trying to replace mothur with BioPython. I hope that biopython is faster than mothur. All I want to do is this: sffinfo(sff=sd11.fasta) trim.seqs(fasta=sd11.fasta, qfile=sd11.qual, minlength = 50, maxhomop=8, qwindowsize=50, qwindowaverage =22) Can someone help me to translate the two mothur statements above to biopython, please? It would be greatly appreciated. thanks -- Best Regards: Csaba Kiss PhD, MSc, BSc TA-43, HRL-1, MS888 Los Alamos National Laboratory Work: 1-505-667-9898 Cell: 1-505-920-5774 From p.j.a.cock at googlemail.com Tue Oct 23 16:14:00 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 23 Oct 2012 17:14:00 +0100 Subject: [Biopython] sff inot fasta and qual then trim In-Reply-To: <8C93404AC678DC44905F571FD327A6CC1E93343F@ECS-EXG-P-MB03.win.lanl.gov> References: <8C93404AC678DC44905F571FD327A6CC1E93343F@ECS-EXG-P-MB03.win.lanl.gov> Message-ID: On Tue, Oct 23, 2012 at 5:04 PM, Kiss, Csaba wrote: > I am new to bio-python. I am trying to replace mothur with BioPython. > I hope that biopython is faster than mothur. All I want to do is this: > > sffinfo(sff=sd11.fasta) > trim.seqs(fasta=sd11.fasta, qfile=sd11.qual, minlength = 50, maxhomop=8, qwindowsize=50, qwindowaverage =22) > > Can someone help me to translate the two mothur statements > above to biopython, please? > It would be greatly appreciated. > thanks I don't know enough about mothur to give you an informed answer. I would guess the first line is just SFF to FASTA and QUAL, based partly on the title to your email. 
That at least is trivial in Biopython: from Bio import SeqIO SeqIO.convert("example.sff", "sff", "example.fasta", "fasta") SeqIO.convert("example.sff", "sff", "example.qual", "qual") Or, if you want the trimming in the SFF file applied, which is generally sensible: from Bio import SeqIO SeqIO.convert("example.sff", "sff-trim", "example.fasta", "fasta") SeqIO.convert("example.sff", "sff-trim", "example.qual", "qual") Personally I prefer to work with a single FASTQ file rather than a paired FASTA+QUAL (it is smaller on disc for one thing), so maybe: from Bio import SeqIO SeqIO.convert("example.sff", "sff-trim", "example.fastq", "fastq") Regards, Peter From p.j.a.cock at googlemail.com Tue Oct 23 16:16:30 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 23 Oct 2012 17:16:30 +0100 Subject: [Biopython] sff into fasta and qual -> trim In-Reply-To: <8C93404AC678DC44905F571FD327A6CC1E93341A@ECS-EXG-P-MB03.win.lanl.gov> References: <8C93404AC678DC44905F571FD327A6CC1E93341A@ECS-EXG-P-MB03.win.lanl.gov> Message-ID: On Tue, Oct 23, 2012 at 5:01 PM, Kiss, Csaba wrote: > I am new to bio-python. I am trying to replace mothur with BioPython. > I hope that biopython is faster than mothur. All I want to do is this: > > sffinfo(sff=sd11.fasta) > trim.seqs(fasta=sd11.fasta, qfile=sd11.qual, minlength = 50, maxhomop=8, qwindowsize=50, qwindowaverage =22) > > Can someone help me to translate the two mothur statements above to biopython, please? > It would be greatly appreciated. > thanks I saw your other email first, and replied to that: http://lists.open-bio.org/pipermail/biopython/2012-October/008217.html (On the bright side, it looks like your subscription to the mailing list has worked this time :) - welcome!) 
Peter From p.j.a.cock at googlemail.com Tue Oct 23 16:37:27 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 23 Oct 2012 17:37:27 +0100 Subject: [Biopython] sff into fasta and qual -> trim In-Reply-To: <8C93404AC678DC44905F571FD327A6CC1E93347D@ECS-EXG-P-MB03.win.lanl.gov> References: <8C93404AC678DC44905F571FD327A6CC1E93341A@ECS-EXG-P-MB03.win.lanl.gov> <8C93404AC678DC44905F571FD327A6CC1E93347D@ECS-EXG-P-MB03.win.lanl.gov> Message-ID: On Tue, Oct 23, 2012 at 5:22 PM, Kiss, Csaba wrote: > Thanks, Peter. > I understand the fastq sequence extraction. That's very neat. However, I am not sure how to do quality trimming of the sequences. > In mothur, we tested that qwindowsize=50, qwindowaverage=22 is a very nice way to get high quality sequences out. > I assume it works in a way that a 50 bp sliding window checks the average quality and if it's below a certain number (i.e. 22) then it rejects the sequence if it's above it keeps it. > Is there something similar in biopython. > > C Hi Csaba, No, there isn't a 'ready to use' sliding window read cleaning tool/function in Biopython, although you could write one using Biopython if you wished, with the advantage that you can implement exactly what you need. There are many (dozens?) of dedicated tools for this kind of thing which might be simpler or more appropriate. Have a browse here: http://seqanswers.com/wiki/Software/list Regards, Peter P.S. Please CC the mailing list in your replies.
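[Editorial sketch] The reply above says a sliding-window filter could be written by hand on top of Biopython. The following is one rough way that might look, not a Biopython function and not mothur's actual algorithm. It assumes the behaviour guessed at in the quoted question: reject the whole read if any 50-base window has a mean Phred score below 22. The names passes_window_quality and filter_fastq, and the choice to judge reads shorter than one window on their overall mean, are assumptions made for illustration only.

```python
def passes_window_quality(quals, window_size=50, min_average=22):
    """Return True if every full window of Phred scores averages >= min_average.

    Assumption: a read shorter than one window is judged on its overall mean,
    and an empty read is rejected. mothur may behave differently.
    """
    if not quals:
        return False
    if len(quals) <= window_size:
        return sum(quals) / float(len(quals)) >= min_average
    # Compare window sums against window_size * min_average to avoid
    # recomputing a mean at every position.
    threshold = window_size * min_average
    window_sum = sum(quals[:window_size])
    if window_sum < threshold:
        return False
    for i in range(window_size, len(quals)):
        # Slide the window one base to the right: add the new score,
        # drop the score that just left the window.
        window_sum += quals[i] - quals[i - window_size]
        if window_sum < threshold:
            return False
    return True


def filter_fastq(in_path, out_path, window_size=50, min_average=22):
    """Write reads that pass the window check to out_path; return the count."""
    # Imported here so the pure-Python check above works without Biopython.
    from Bio import SeqIO
    good = (
        rec
        for rec in SeqIO.parse(in_path, "fastq")
        if passes_window_quality(
            rec.letter_annotations["phred_quality"], window_size, min_average
        )
    )
    return SeqIO.write(good, out_path, "fastq")
```

With a FASTQ file produced by the SeqIO.convert call shown earlier in the thread, something like filter_fastq("example.fastq", "example_filtered.fastq") would be the intended use; both file names are placeholders.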
From cfriedline at vcu.edu Tue Oct 23 16:39:07 2012 From: cfriedline at vcu.edu (Chris Friedline) Date: Tue, 23 Oct 2012 12:39:07 -0400 Subject: [Biopython] sff inot fasta and qual then trim In-Reply-To: <8C93404AC678DC44905F571FD327A6CC1E93343F@ECS-EXG-P-MB03.win.lanl.gov> References: <8C93404AC678DC44905F571FD327A6CC1E93343F@ECS-EXG-P-MB03.win.lanl.gov> Message-ID: <8D42AA3D-54D8-4639-8A98-2FE28FD58BD3@vcu.edu> Are you trying to replace an entire analysis pipeline, which mothur provides, or simply take control of the read trimming routines? Mothur has been excellent for us (though I do supplement with my own code frequently), and I have a hard time believing that BioPython (or Python, in general) would be faster for these types of things. If you are married to Python, you may want to join in with the QIIME people, though they back their stuff with PyCogent rather than BioPython. Both are excellent packages for automating some parts of the analysis in microbial community studies. We can leave the philosophy of pipelining scientific research for another thread. ;-) I wonder if the reimplementation effort of common trimming/filtering tasks are worth your time, given the current maturity of both mothur and QIIME. On Oct 23, 2012, at 12:04 PM, "Kiss, Csaba" wrote: > I am new to bio-python. I am trying to replace mothur with BioPython. > I hope that biopython is faster than mothur. All I want to do is this: > > sffinfo(sff=sd11.fasta) > trim.seqs(fasta=sd11.fasta, qfile=sd11.qual, minlength = 50, maxhomop=8, qwindowsize=50, qwindowaverage =22) > > Can someone help me to translate the two mothur statements above to biopython, please? > It would be greatly appreciated. 
> thanks > > > -- > Best Regards: > Csaba Kiss PhD, MSc, BSc > TA-43, HRL-1, MS888 > Los Alamos National Laboratory > Work: 1-505-667-9898 > Cell: 1-505-920-5774 > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython PhD Candidate, Integrative Life Sciences Virginia Commonwealth University Richmond, VA From csaba.kiss at lanl.gov Tue Oct 23 16:39:11 2012 From: csaba.kiss at lanl.gov (Kiss, Csaba) Date: Tue, 23 Oct 2012 16:39:11 +0000 Subject: [Biopython] sff into fasta and qual -> trim In-Reply-To: References: <8C93404AC678DC44905F571FD327A6CC1E93341A@ECS-EXG-P-MB03.win.lanl.gov> <8C93404AC678DC44905F571FD327A6CC1E93347D@ECS-EXG-P-MB03.win.lanl.gov> Message-ID: <8C93404AC678DC44905F571FD327A6CC1E9334A1@ECS-EXG-P-MB03.win.lanl.gov> Thanks Peter for the info. I will look at the Software list. Csaba -----Original Message----- From: Peter Cock [mailto:p.j.a.cock at googlemail.com] Sent: Tuesday, October 23, 2012 10:37 AM To: Kiss, Csaba Cc: Biopython Mailing List Subject: Re: [Biopython] sff into fasta and qual -> trim On Tue, Oct 23, 2012 at 5:22 PM, Kiss, Csaba wrote: > Thanks, Peter. > I understand the fastq sequence extraction. That's very neat. However, I am not sure how to do quality trimming of the sequences. > In mothur, we tested that qwindowsize=50, qwindowaverage=22 is a very nice way to get high quality sequences out. > I assume it works in a way that a 50 bp sliding window checks the average quality and if it's below a certain number (i.e. 22) then it rejects the sequence if it's above it keeps it. > Is there something similar in biopython. > > C Hi Csaba, No, there isn't a 'ready to use' sliding window read cleaning tool/function in Biopython, although you could write one using Biopython is you wished, with the advantage that you can implement exactly what you need. There are many (dozens?) 
of dedicated tools for this kind of thing which might be simpler or more appropriate. Have a browse here: http://seqanswers.com/wiki/Software/list Regards, Peter P.S. Please CC the mailing list in your replies. From csaba.kiss at lanl.gov Tue Oct 23 16:47:23 2012 From: csaba.kiss at lanl.gov (Kiss, Csaba) Date: Tue, 23 Oct 2012 16:47:23 +0000 Subject: [Biopython] sff inot fasta and qual then trim In-Reply-To: <8D42AA3D-54D8-4639-8A98-2FE28FD58BD3@vcu.edu> References: <8C93404AC678DC44905F571FD327A6CC1E93343F@ECS-EXG-P-MB03.win.lanl.gov> <8D42AA3D-54D8-4639-8A98-2FE28FD58BD3@vcu.edu> Message-ID: <8C93404AC678DC44905F571FD327A6CC1E9334BC@ECS-EXG-P-MB03.win.lanl.gov> Hi Christopher! I am writing a python script to analyze antibody sequences. I have been using mothur to convert the sff files to fasta and then trim the sequences for quality. For the end-users' sake, it would be easier if all they needed to install was python and can go around mothur. I have been happy with mothur until now when I tried to use it in my desktop computer and it took 3 hours to convert 3 million read from sff to fasta. I hoped that pure python would be faster. I will look at Pycogent and QIIME. Thanks Csaba -----Original Message----- From: Christopher Friedline [mailto:cfriedline at mymail.vcu.edu] On Behalf Of Chris Friedline Sent: Tuesday, October 23, 2012 10:39 AM To: Kiss, Csaba Cc: biopython at lists.open-bio.org Subject: Re: [Biopython] sff inot fasta and qual then trim Are you trying to replace an entire analysis pipeline, which mothur provides, or simply take control of the read trimming routines? Mothur has been excellent for us (though I do supplement with my own code frequently), and I have a hard time believing that BioPython (or Python, in general) would be faster for these types of things. If you are married to Python, you may want to join in with the QIIME people, though they back their stuff with PyCogent rather than BioPython. 
Both are excellent packages for automating some parts of the analysis in microbial community studies. We can leave the philosophy of pipelining scientific research for another thread. ;-) I wonder if the reimplementation effort of common trimming/filtering tasks are worth your time, given the current maturity of both mothur and QIIME. On Oct 23, 2012, at 12:04 PM, "Kiss, Csaba" wrote: > I am new to bio-python. I am trying to replace mothur with BioPython. > I hope that biopython is faster than mothur. All I want to do is this: > > sffinfo(sff=sd11.fasta) > trim.seqs(fasta=sd11.fasta, qfile=sd11.qual, minlength = 50, > maxhomop=8, qwindowsize=50, qwindowaverage =22) > > Can someone help me to translate the two mothur statements above to biopython, please? > It would be greatly appreciated. > thanks > > > -- > Best Regards: > Csaba Kiss PhD, MSc, BSc > TA-43, HRL-1, MS888 > Los Alamos National Laboratory > Work: 1-505-667-9898 > Cell: 1-505-920-5774 > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython PhD Candidate, Integrative Life Sciences Virginia Commonwealth University Richmond, VA From cfriedline at vcu.edu Tue Oct 23 16:58:31 2012 From: cfriedline at vcu.edu (Chris Friedline) Date: Tue, 23 Oct 2012 12:58:31 -0400 Subject: [Biopython] sff inot fasta and qual then trim In-Reply-To: <8C93404AC678DC44905F571FD327A6CC1E9334BC@ECS-EXG-P-MB03.win.lanl.gov> References: <8C93404AC678DC44905F571FD327A6CC1E93343F@ECS-EXG-P-MB03.win.lanl.gov> <8D42AA3D-54D8-4639-8A98-2FE28FD58BD3@vcu.edu> <8C93404AC678DC44905F571FD327A6CC1E9334BC@ECS-EXG-P-MB03.win.lanl.gov> Message-ID: <8EB2122B-C468-4DA8-BBCA-15F304E9F8E0@vcu.edu> Csaba, As Peter said, there are many packages which will convert sff to fastq/fasta. I wonder if you're running into disk performance issues, rather than algorithm ones, though. 
Using the BioPython SeqIO convert should tell you that much, though that does seem slow (at least for the systems that I work on). Chris On Oct 23, 2012, at 12:47 PM, "Kiss, Csaba" wrote: > Hi Christopher! > I am writing a python script to analyze antibody sequences. I have been using mothur to convert the sff files to fasta and then trim the sequences for quality. > For the end-users' sake, it would be easier if all they needed to install was python and can go around mothur. I have been happy with mothur until now when I tried to use it in my desktop computer and it took 3 hours to convert 3 million read from sff to fasta. I hoped that pure python would be faster. > I will look at Pycogent and QIIME. > Thanks > Csaba > > -----Original Message----- > From: Christopher Friedline [mailto:cfriedline at mymail.vcu.edu] On Behalf Of Chris Friedline > Sent: Tuesday, October 23, 2012 10:39 AM > To: Kiss, Csaba > Cc: biopython at lists.open-bio.org > Subject: Re: [Biopython] sff inot fasta and qual then trim > > Are you trying to replace an entire analysis pipeline, which mothur provides, or simply take control of the read trimming routines? Mothur has been excellent for us (though I do supplement with my own code frequently), and I have a hard time believing that BioPython (or Python, in general) would be faster for these types of things. If you are married to Python, you may want to join in with the QIIME people, though they back their stuff with PyCogent rather than BioPython. Both are excellent packages for automating some parts of the analysis in microbial community studies. We can leave the philosophy of pipelining scientific research for another thread. ;-) > > I wonder if the reimplementation effort of common trimming/filtering tasks are worth your time, given the current maturity of both mothur and QIIME. > > On Oct 23, 2012, at 12:04 PM, "Kiss, Csaba" wrote: > >> I am new to bio-python. I am trying to replace mothur with BioPython. 
>> I hope that biopython is faster than mothur. All I want to do is this: >> >> sffinfo(sff=sd11.fasta) >> trim.seqs(fasta=sd11.fasta, qfile=sd11.qual, minlength = 50, >> maxhomop=8, qwindowsize=50, qwindowaverage =22) >> >> Can someone help me to translate the two mothur statements above to biopython, please? >> It would be greatly appreciated. >> thanks >> >> >> -- >> Best Regards: >> Csaba Kiss PhD, MSc, BSc >> TA-43, HRL-1, MS888 >> Los Alamos National Laboratory >> Work: 1-505-667-9898 >> Cell: 1-505-920-5774 >> >> >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > PhD Candidate, Integrative Life Sciences Virginia Commonwealth University Richmond, VA > From p.j.a.cock at googlemail.com Tue Oct 23 17:06:26 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 23 Oct 2012 18:06:26 +0100 Subject: [Biopython] sff inot fasta and qual then trim In-Reply-To: <8EB2122B-C468-4DA8-BBCA-15F304E9F8E0@vcu.edu> References: <8C93404AC678DC44905F571FD327A6CC1E93343F@ECS-EXG-P-MB03.win.lanl.gov> <8D42AA3D-54D8-4639-8A98-2FE28FD58BD3@vcu.edu> <8C93404AC678DC44905F571FD327A6CC1E9334BC@ECS-EXG-P-MB03.win.lanl.gov> <8EB2122B-C468-4DA8-BBCA-15F304E9F8E0@vcu.edu> Message-ID: On Tue, Oct 23, 2012 at 5:58 PM, Chris Friedline wrote: > Csaba, > > As Peter said, there are many packages which will convert sff to fastq/fasta. Actually I meant trimming packages, although there are several SFF converters as well e.g. Biopython, sff_extract, BioHaskell/Flower, and Roche's tools. > I wonder if you're running into disk performance issues, rather than algorithm > ones, though. Using the BioPython SeqIO convert should tell you that much, > though that does seem slow (at least for the systems that I work on). Could be - disk IO will be a factor, but I suspect the quality trimming to be the slow part rather than the format conversion. 
Peter From csaba.kiss at lanl.gov Tue Oct 23 17:13:32 2012 From: csaba.kiss at lanl.gov (Kiss, Csaba) Date: Tue, 23 Oct 2012 17:13:32 +0000 Subject: [Biopython] sff inot fasta and qual then trim In-Reply-To: References: <8C93404AC678DC44905F571FD327A6CC1E93343F@ECS-EXG-P-MB03.win.lanl.gov> <8D42AA3D-54D8-4639-8A98-2FE28FD58BD3@vcu.edu> <8C93404AC678DC44905F571FD327A6CC1E9334BC@ECS-EXG-P-MB03.win.lanl.gov> <8EB2122B-C468-4DA8-BBCA-15F304E9F8E0@vcu.edu> Message-ID: <8C93404AC678DC44905F571FD327A6CC1E934500@ECS-EXG-P-MB03.win.lanl.gov> >Could be - disk IO will be a factor, but I suspect the quality trimming to be the slow part rather than the format conversion. I don't think it's IO or the trimming. Mothur seems to take forever to do the sffinfo process on windows. Getting the 3 million sequences out was 3 hours. The trimming took 10 minutes. The rest of the python code to fish out my sequences 1 minute. You see now , why I would like to make it more efficient. Csaba From p.j.a.cock at googlemail.com Tue Oct 23 19:45:19 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 23 Oct 2012 20:45:19 +0100 Subject: [Biopython] sff inot fasta and qual then trim In-Reply-To: <8C93404AC678DC44905F571FD327A6CC1E934500@ECS-EXG-P-MB03.win.lanl.gov> References: <8C93404AC678DC44905F571FD327A6CC1E93343F@ECS-EXG-P-MB03.win.lanl.gov> <8D42AA3D-54D8-4639-8A98-2FE28FD58BD3@vcu.edu> <8C93404AC678DC44905F571FD327A6CC1E9334BC@ECS-EXG-P-MB03.win.lanl.gov> <8EB2122B-C468-4DA8-BBCA-15F304E9F8E0@vcu.edu> <8C93404AC678DC44905F571FD327A6CC1E934500@ECS-EXG-P-MB03.win.lanl.gov> Message-ID: On Tue, Oct 23, 2012 at 6:13 PM, Kiss, Csaba wrote: >>Could be - disk IO will be a factor, but I suspect the quality trimming to be the slow part rather than the format conversion. > > I don't think it's IO or the trimming. Mothur seems to take forever to do the sffinfo process on windows. > Getting the 3 million sequences out was 3 hours. 
That sounds a bit slow, can you compare this to the Biopython SFF conversion time (or any of the other tools)? > The trimming took 10 minutes. > The rest of the python code to fish out my sequences 1 minute. > > You see now , why I would like to make it more efficient. > > Csaba Is it possible to fish out your sequences and then do the trimming? If possible that sounds like it would be more efficient. Peter From csaba.kiss at lanl.gov Tue Oct 23 20:32:11 2012 From: csaba.kiss at lanl.gov (Kiss, Csaba) Date: Tue, 23 Oct 2012 20:32:11 +0000 Subject: [Biopython] sff inot fasta and qual then trim In-Reply-To: References: <8C93404AC678DC44905F571FD327A6CC1E93343F@ECS-EXG-P-MB03.win.lanl.gov> <8D42AA3D-54D8-4639-8A98-2FE28FD58BD3@vcu.edu> <8C93404AC678DC44905F571FD327A6CC1E9334BC@ECS-EXG-P-MB03.win.lanl.gov> <8EB2122B-C468-4DA8-BBCA-15F304E9F8E0@vcu.edu> <8C93404AC678DC44905F571FD327A6CC1E934500@ECS-EXG-P-MB03.win.lanl.gov> Message-ID: <8C93404AC678DC44905F571FD327A6CC1E934574@ECS-EXG-P-MB03.win.lanl.gov> >That sounds a bit slow, can you compare this to the Biopython SFF conversion time (or any of the other tools)? I used SeqIO.convert("sd6.sff", "sff-trim", "sd6_p.fasta", "fasta") SeqIO.convert("sd6.sff", "sff-trim", "sd6_p.qual", "qual") and it finished in 8 minutes. That's much better than 3 hours. The problem is that if I use the mothur fasta/qual files and the python fasta/qual files and trim the sequences exactly the same way in mothur, I get slightly different trimmed sequence dataset. I am investigating further, if it would matter. Thanks for your helps, it is much appreciated Csaba From csaba.kiss at lanl.gov Wed Oct 24 15:49:59 2012 From: csaba.kiss at lanl.gov (Kiss, Csaba) Date: Wed, 24 Oct 2012 15:49:59 +0000 Subject: [Biopython] still more questions about NGS sequenbce trimming Message-ID: <8C93404AC678DC44905F571FD327A6CC1E93464E@ECS-EXG-P-MB03.win.lanl.gov> Hi All! Thanks for all your help to extract DNA sequences from sff files. 
Using biopython I managed to improve the sequence extraction from 3 hours to 10 minutes. Now that I am hooked, I would like to replace mothur with some simple python functions. Is there any function in biopython that would look for homopolymers on DNA sequences. Particularly I am looking to reject a sequence if it has more than 8 bp of stretches of any single nucleotide. Another function I am looking for is a sliding window function along the quality file. I could either use the fastq file or the fasta/qual file pair. I could write these functions myself but if they are available, then it would make my life easier. Thanks Csaba From nje5 at georgetown.edu Wed Oct 24 16:07:16 2012 From: nje5 at georgetown.edu (Nathan Edwards) Date: Wed, 24 Oct 2012 12:07:16 -0400 Subject: [Biopython] Dropping Python 2.5 and Jython 2.5 support? In-Reply-To: References: Message-ID: <50881234.8050007@georgetown.edu> On 10/22/2012 1:17 PM, Peter Cock wrote: > Dear Biopythoneers, > > Would anyone object to us preparing to drop support for Python 2.5 and > Jython 2.5, perhaps after the next Biopython release? I'm still in the dark ages, but I need the push to upgrade my infrastructure. I'm just reluctant to rebuilt all of my third-party libraries. Are there specific parts of the code known to be (or soon to be) problematic for Python 2.5? Thanks, - n -- Dr. Nathan Edwards nje5 at georgetown.edu Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center Room 1215, Harris Building Room 347, Basic Science 3300 Whitehaven St, NW 3900 Reservoir Road, NW Washington DC 20007 Washington DC 20007 Phone: 202-687-7042 Phone: 202-687-1618 Fax: 202-687-0057 Fax: 202-687-7186 From p.j.a.cock at googlemail.com Wed Oct 24 16:33:42 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 24 Oct 2012 17:33:42 +0100 Subject: [Biopython] Dropping Python 2.5 and Jython 2.5 support? 
In-Reply-To: <50881234.8050007@georgetown.edu> References: <50881234.8050007@georgetown.edu> Message-ID: On Wed, Oct 24, 2012 at 5:07 PM, Nathan Edwards wrote: > On 10/22/2012 1:17 PM, Peter Cock wrote: >> Dear Biopythoneers, >> >> Would anyone object to us preparing to drop support for Python 2.5 and >> Jython 2.5, perhaps after the next Biopython release? > > I'm still in the dark ages, but I need the push to upgrade my > infrastructure. I'm just reluctant to rebuilt all of my third-party > libraries. > > Are there specific parts of the code known to be (or soon to be) > problematic for Python 2.5? > > Thanks, I can't point at any one killer feature here: http://docs.python.org/whatsnew/2.6.html There are assorted little things now where we have had to add Python 2.5 specific code or fallbacks (e.g. OrderedDict). The main other benefit is 2.6 adds a number of new features from Python 3, which should make supporting Python 2 and 3 a little easier (e.g. byte literals). Peter From nje5 at georgetown.edu Wed Oct 24 16:56:23 2012 From: nje5 at georgetown.edu (Nathan Edwards) Date: Wed, 24 Oct 2012 12:56:23 -0400 Subject: [Biopython] Dropping Python 2.5 and Jython 2.5 support? In-Reply-To: References: <50881234.8050007@georgetown.edu> Message-ID: <50881DB7.7070602@georgetown.edu> On 10/24/2012 12:33 PM, Peter Cock wrote: > On Wed, Oct 24, 2012 at 5:07 PM, Nathan Edwards wrote: >> On 10/22/2012 1:17 PM, Peter Cock wrote: >>> Dear Biopythoneers, >>> >>> Would anyone object to us preparing to drop support for Python 2.5 and >>> Jython 2.5, perhaps after the next Biopython release? > > I can't point at any one killer feature here: > http://docs.python.org/whatsnew/2.6.html > > There are assorted little things now where we have had to > add Python 2.5 specific code or fallbacks (e.g. OrderedDict). > > The main other benefit is 2.6 adds a number of new features > from Python 3, which should make supporting Python 2 and 3 > a little easier (e.g. byte literals). 
I'm not opposed. I can always not upgrade BioPython until I'm ready. - n -- Dr. Nathan Edwards nje5 at georgetown.edu Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center Room 1215, Harris Building Room 347, Basic Science 3300 Whitehaven St, NW 3900 Reservoir Road, NW Washington DC 20007 Washington DC 20007 Phone: 202-687-7042 Phone: 202-687-1618 Fax: 202-687-0057 Fax: 202-687-7186 From s.schmeier at gmail.com Wed Oct 24 17:12:46 2012 From: s.schmeier at gmail.com (Sebastian Schmeier) Date: Wed, 24 Oct 2012 19:12:46 +0200 Subject: [Biopython] still more questions about NGS sequenbce trimming In-Reply-To: <8C93404AC678DC44905F571FD327A6CC1E93464E@ECS-EXG-P-MB03.win.lanl.gov> References: <8C93404AC678DC44905F571FD327A6CC1E93464E@ECS-EXG-P-MB03.win.lanl.gov> Message-ID: A very quick and dirty approach for your reject function (I hope I understood correctly) in script form: #!/usr/bin/env python import sys, re from Bio import SeqIO def main(): for record in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") : if not discard(str(record.seq)): SeqIO.write(record, sys.stdout, 'fasta') def discard(seq): oRes = re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', seq) if oRes: return 1 else: return 0 if __name__ == '__main__': sys.exit(main()) Best, Seb On Wed, Oct 24, 2012 at 5:49 PM, Kiss, Csaba wrote: > Hi All! > Thanks for all your help to extract DNA sequences from sff files. Using > biopython I managed to improve the sequence extraction from 3 hours to 10 > minutes. > Now that I am hooked, I would like to replace mothur with some simple > python functions. > Is there any function in biopython that would look for homopolymers on DNA > sequences. Particularly I am looking to reject a sequence if it has more > than 8 bp of stretches of any single nucleotide. > > Another function I am looking for is a sliding window function along the > quality file. I could either use the fastq file or the fasta/qual file pair. 
>
> I could write these functions myself but if they are available, then it
> would make my life easier.
> Thanks
>
> Csaba
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From csaba.kiss at lanl.gov Wed Oct 24 17:20:23 2012
From: csaba.kiss at lanl.gov (Kiss, Csaba)
Date: Wed, 24 Oct 2012 17:20:23 +0000
Subject: [Biopython] still more questions about NGS sequenbce trimming
In-Reply-To: 
References: <8C93404AC678DC44905F571FD327A6CC1E93464E@ECS-EXG-P-MB03.win.lanl.gov>
Message-ID: <8C93404AC678DC44905F571FD327A6CC1E9346C1@ECS-EXG-P-MB03.win.lanl.gov>

Thanks, Seb. That's a clever usage of regex.

csaba

From: Sebastian Schmeier [mailto:s.schmeier at gmail.com]
Sent: Wednesday, October 24, 2012 11:13 AM
To: Kiss, Csaba
Cc: biopython at lists.open-bio.org
Subject: Re: [Biopython] still more questions about NGS sequenbce trimming

A very quick and dirty approach for your reject function (I hope I
understood correctly) in script form:

#!/usr/bin/env python
import sys, re
from Bio import SeqIO

def main():
    for record in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") :
        if not discard(str(record.seq)):
            SeqIO.write(record, sys.stdout, 'fasta')

def discard(seq):
    oRes = re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', seq)
    if oRes: return 1
    else: return 0

if __name__ == '__main__':
    sys.exit(main())

Best,
Seb

On Wed, Oct 24, 2012 at 5:49 PM, Kiss, Csaba wrote:
Hi All!
Thanks for all your help to extract DNA sequences from sff files. Using
biopython I managed to improve the sequence extraction from 3 hours to 10
minutes.
Now that I am hooked, I would like to replace mothur with some simple
python functions.
Is there any function in biopython that would look for homopolymers on DNA
sequences. Particularly I am looking to reject a sequence if it has more
than 8 bp of stretches of any single nucleotide.
Another function I am looking for is a sliding window function along the quality file. I could either use the fastq file or the fasta/qual file pair. I could write these functions myself but if they are available, then it would make my life easier. Thanks Csaba _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Wed Oct 24 17:22:57 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 24 Oct 2012 18:22:57 +0100 Subject: [Biopython] still more questions about NGS sequenbce trimming In-Reply-To: References: <8C93404AC678DC44905F571FD327A6CC1E93464E@ECS-EXG-P-MB03.win.lanl.gov> Message-ID: On Wed, Oct 24, 2012 at 6:12 PM, Sebastian Schmeier wrote: > A very quick and dirty approach for your reject function (I hope I > understood correctly) in script form: > > #!/usr/bin/env python > import sys, re > from Bio import SeqIO > > def main(): > for record in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") : > if not discard(str(record.seq)): > SeqIO.write(record, sys.stdout, 'fasta') > > def discard(seq): > oRes = re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', seq) > if oRes: return 1 > else: return 0 > > if __name__ == '__main__': > sys.exit(main()) Minor suggestions - if you are going to use a regular expression many times (here once per read), compile it once first. 
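A minimal sketch of the "compile it once" suggestion, using only the standard library. The names MAX_RUN, HOMOPOLYMER and has_long_homopolymer are made up for illustration and are not part of Biopython; the limit of 8 comes from the original question (reject runs longer than 8 bp). Unlike the script quoted above, this version upper-cases the sequence first, which is an assumption about what is wanted for mixed-case reads.

```python
import re

# Assumption: a run of more than MAX_RUN identical bases rejects the read.
MAX_RUN = 8

# Compiled once at import time. The back-reference \1 repeats whichever
# base was captured, so this matches 9 or more of the same letter.
HOMOPOLYMER = re.compile(r"([ACGTN])\1{%d,}" % MAX_RUN)


def has_long_homopolymer(seq):
    """Return True if seq (a plain string) has a run longer than MAX_RUN."""
    return HOMOPOLYMER.search(seq.upper()) is not None
```

With Biopython this would be called as has_long_homopolymer(str(record.seq)) inside the parsing loop.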
Also Python defines "True" and "False" which are more natural
than 1 and 0, but in fact you could do:

def discard(seq):
    return bool(re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', seq))

At that point defining and using a function seems a bit of
an unnecessary overhead so:

def main():
    for record in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") :
        if not re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', str(record.seq)):
            SeqIO.write(record, sys.stdout, 'fasta')

Next a much more important point - try and make a single
call to SeqIO.write, with all the records (using an iterator
approach) rather than many calls to SeqIO.write (which
isn't supported for output in formats like SFF). This should
be faster:

for record in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") :
    if not re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', str(record.seq)):
        SeqIO.write(record, sys.stdout, 'fasta')

From p.j.a.cock at googlemail.com Wed Oct 24 17:27:14 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 24 Oct 2012 18:27:14 +0100
Subject: [Biopython] still more questions about NGS sequenbce trimming
In-Reply-To: 
References: <8C93404AC678DC44905F571FD327A6CC1E93464E@ECS-EXG-P-MB03.win.lanl.gov>
Message-ID: 

On Wed, Oct 24, 2012 at 6:22 PM, Peter Cock wrote:
> On Wed, Oct 24, 2012 at 6:12 PM, Sebastian Schmeier
> wrote:
>> A very quick and dirty approach for your reject function (I hope I
>> understood correctly) in script form:
>>
>> #!/usr/bin/env python
>> import sys, re
>> from Bio import SeqIO
>>
>> def main():
>>     for record in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") :
>>         if not discard(str(record.seq)):
>>             SeqIO.write(record, sys.stdout, 'fasta')
>>
>> def discard(seq):
>>     oRes = re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', seq)
>>     if oRes: return 1
>>     else: return 0
>>
>> if __name__ == '__main__':
>>     sys.exit(main())
>
> Minor suggestions - if you are going to use a regular expression
> many times (here once per read), compile it once first.
> Also Python defines "True" and "False" which are more natural
> than 1 and 0, but in fact you could do:
>
> def discard(seq):
>     return bool(re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', seq))
>
> At that point defining and using a function seems a bit of
> an unnecessary overhead so:
>
> def main():
>     for record in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") :
>         if not re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', str(record.seq)):
>             SeqIO.write(record, sys.stdout, 'fasta')
>
> Next a much more important point - try and make a single
> call to SeqIO.write, with all the records (using an iterator
> approach) rather than many calls to SeqIO.write (which
> isn't supported for output in formats like SFF). This should
> be faster:

Sorry Sebastian - I had a hiccup with my mouse focus and
accidentally sent that email half finished. I meant something
like this:

def main():
    wanted = (rec for rec in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") \
              if not re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', str(rec.seq)))
    count = SeqIO.write(wanted, sys.stdout, 'fasta')

There are other examples of filtering sequence files in the
Tutorial, http://biopython.org/DIST/docs/tutorial/Tutorial.html

I hope that is useful,

Peter

From csaba.kiss at lanl.gov Wed Oct 24 18:01:23 2012
From: csaba.kiss at lanl.gov (Kiss, Csaba)
Date: Wed, 24 Oct 2012 18:01:23 +0000
Subject: [Biopython] still more questions about NGS sequence trimming
Message-ID: <8C93404AC678DC44905F571FD327A6CC1E934708@ECS-EXG-P-MB03.win.lanl.gov>

Thanks, Peter. I am looking at this example now:

from Bio import SeqIO
good_reads = (rec for rec in \
              SeqIO.parse("SRR020192.fastq", "fastq") \
              if min(rec.letter_annotations["phred_quality"]) >= 20)
count = SeqIO.write(good_reads, "good_quality.fastq", "fastq")
print "Saved %i reads" % count

That's a rather crude quality filtering. Is there any more sophisticated options already in biopython? Ie. quality_average

Or other options?
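The original question in this thread also asked for a sliding window over the quality scores. A minimal sketch of that idea in plain Python follows: a read passes only if every window has a mean quality at or above a threshold. The function names, the window width of 50 and the threshold of 20 are illustrative assumptions, not an existing Biopython API and not a statement of what mothur does.

```python
def min_window_mean(quals, window=50):
    """Return the lowest mean quality over all windows of the given width.

    quals is a list of integer PHRED scores. A read shorter than the
    window is treated as one single window.
    """
    if not quals:
        raise ValueError("empty quality list")
    width = min(window, len(quals))
    total = sum(quals[:width])
    lowest = total
    # Slide one base at a time, updating the running sum in constant time.
    for i in range(width, len(quals)):
        total += quals[i] - quals[i - width]
        if total < lowest:
            lowest = total
    return lowest / float(width)


def passes_window_filter(quals, window=50, threshold=20):
    """Return True if every window has mean quality >= threshold."""
    return min_window_mean(quals, window) >= threshold
```

In the generator expression above, the min(...) >= 20 test could be swapped for passes_window_filter(rec.letter_annotations["phred_quality"]).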
-----Original Message----- From: Peter Cock [mailto:p.j.a.cock at googlemail.com] Sent: Wednesday, October 24, 2012 11:27 AM To: Sebastian Schmeier Cc: Kiss, Csaba; biopython at lists.open-bio.org Subject: Re: [Biopython] still more questions about NGS sequenbce trimming On Wed, Oct 24, 2012 at 6:22 PM, Peter Cock wrote: > On Wed, Oct 24, 2012 at 6:12 PM, Sebastian Schmeier > wrote: >> A very quick and dirty approach for your reject function (I hope I >> understood correctly) in script form: >> >> #!/usr/bin/env python >> import sys, re >> from Bio import SeqIO >> >> def main(): >> for record in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") : >> if not discard(str(record.seq)): >> SeqIO.write(record, sys.stdout, 'fasta') >> >> def discard(seq): >> oRes = re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', seq) >> if oRes: return 1 >> else: return 0 >> >> if __name__ == '__main__': >> sys.exit(main()) > > Minor suggestions - if you are going to use a regular expression many > times (here once per read), compile it once first. Also Python defines > "True" and "False" which are more natural than 1 and 0, but in fact > you could do: > > def discard(seq): > return bool(re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', seq)) > > At that point defining and using a function seems a bit of an > unnecessary overhead so: > > def main(): > for record in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") : > if not re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', str(record.seq)): > SeqIO.write(record, sys.stdout, 'fasta') > > Next a much more important point - try and make a single call to > SeqIO.write, with all the records (using an iterator > approach) rather than many calls to SeqIO.write (which isn't supported > for output in formats like SFF). This should be faster: Sorry Sebastian - I had a hiccup with my mouse focus and accidentally sent that email half finished. 
I meant something like this: def main(): wanted = (rec for rec in SeqIO.parse(open(sys.argv[1], "rU"), "fasta") \ if not re.search('(A{9,}|C{9,}|G{9,}|T{9,}|N{9,})', str(rec.seq))) count = SeqIO.write(wanted, sys.stdout, 'fasta') There are other examples of filtering sequence files in the Tutorial, http://biopython.org/DIST/docs/tutorial/Tutorial.html I hope that is useful, Peter From lancas at uw.edu Thu Oct 25 01:11:14 2012 From: lancas at uw.edu (Samuel M. Lancaster) Date: Wed, 24 Oct 2012 18:11:14 -0700 Subject: [Biopython] Biopython for Max OS Lion 10.7.5 Message-ID: Hi, I am running into a problem installing Biopython on my computer. To use Biopython I need Numpy; however I can only find Biopython for Python 2.7 and Numpy for Python 2.6. Can you direct me to a place where I can find either Biopython for Python 2.6 or Numpy for Python 2.7 that I can use with Mac OS Lion 10.7.5? Thanks, Sam From nuin at genedrift.org Thu Oct 25 01:16:31 2012 From: nuin at genedrift.org (Paulo Nuin) Date: Wed, 24 Oct 2012 21:16:31 -0400 Subject: [Biopython] Biopython for Max OS Lion 10.7.5 In-Reply-To: References: Message-ID: Hi You can install from source a beta version http://sourceforge.net/projects/numpy/files/NumPy/1.7.0b2/ it should work fine if you follow the instructions. Cheers Paulo On 2012-10-24, at 9:11 PM, "Samuel M. Lancaster" wrote: > Hi, > I am running into a problem installing Biopython on my computer. To use > Biopython I need Numpy; however I can only find Biopython for Python 2.7 > and Numpy for Python 2.6. Can you direct me to a place where I can find > either Biopython for Python 2.6 or Numpy for Python 2.7 that I can use with > Mac OS Lion 10.7.5? 
> > Thanks, > Sam > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From nuin at genedrift.org Thu Oct 25 01:26:24 2012 From: nuin at genedrift.org (Paulo Nuin) Date: Wed, 24 Oct 2012 21:26:24 -0400 Subject: [Biopython] Biopython for Max OS Lion 10.7.5 In-Reply-To: References: Message-ID: Alternatively, you can also try installing everything via pip http://pypi.python.org/pypi/pip Cheers Paulo On 2012-10-24, at 9:11 PM, "Samuel M. Lancaster" wrote: > Hi, > I am running into a problem installing Biopython on my computer. To use > Biopython I need Numpy; however I can only find Biopython for Python 2.7 > and Numpy for Python 2.6. Can you direct me to a place where I can find > either Biopython for Python 2.6 or Numpy for Python 2.7 that I can use with > Mac OS Lion 10.7.5? > > Thanks, > Sam > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Thu Oct 25 08:03:34 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 25 Oct 2012 09:03:34 +0100 Subject: [Biopython] Biopython for Max OS Lion 10.7.5 In-Reply-To: References: Message-ID: On Thu, Oct 25, 2012 at 2:11 AM, Samuel M. Lancaster wrote: > Hi, > I am running into a problem installing Biopython on my computer. To use > Biopython I need Numpy; however I can only find Biopython for Python 2.7 > and Numpy for Python 2.6. Can you direct me to a place where I can find > either Biopython for Python 2.6 or Numpy for Python 2.7 that I can use with > Mac OS Lion 10.7.5? > > Thanks, > Sam Note NumPy provides precompiled packages for Mac OS X for the official Python installs from Python.org - not the Apple provided Python installation. Which Python(s) are you trying to use? You can if you wish install NumPy from source on Mac OS X for any installed Python. 
You will need Apple's XCode from the App Store to do this
(you'll need it anyway to compile Biopython's C modules).

Peter

From p.j.a.cock at googlemail.com Thu Oct 25 08:14:50 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 25 Oct 2012 09:14:50 +0100
Subject: [Biopython] still more questions about NGS sequence trimming
In-Reply-To: <8C93404AC678DC44905F571FD327A6CC1E934708@ECS-EXG-P-MB03.win.lanl.gov>
References: <8C93404AC678DC44905F571FD327A6CC1E934708@ECS-EXG-P-MB03.win.lanl.gov>
Message-ID: 

On Wed, Oct 24, 2012 at 7:01 PM, Kiss, Csaba wrote:
> Thanks, Peter. I am looking at this example now:
>
> from Bio import SeqIO
> good_reads = (rec for rec in \
>               SeqIO.parse("SRR020192.fastq", "fastq") \
>               if min(rec.letter_annotations["phred_quality"]) >= 20)
> count = SeqIO.write(good_reads, "good_quality.fastq", "fastq")
> print "Saved %i reads" % count
>
> That's a rather crude quality filtering. Is there any more sophisticated options
> already in biopython? Ie. quality_average
>
> Or other options?

Average (mean) quality is easy, take the sum and divide by the
length (or in this case, I've moved the divide to a multiply on the
other side of the inequality since generally multiplication is faster
than division):

from Bio import SeqIO
good_reads = (rec for rec in \
              SeqIO.parse("SRR020192.fastq", "fastq") \
              if sum(rec.letter_annotations["phred_quality"]) >= 20*len(rec))
count = SeqIO.write(good_reads, "good_quality.fastq", "fastq")
print "Saved %i reads" % count

However, for most sequencing reads you'd want to use a trimming
step first as read quality tends to decline with length - the first
half might be good and the second half bad, meaning the average
is poor. You could write a little function to do that, and slice the
SeqRecord to select the good chunk. There are examples of that
in the Tutorial for removing an adapter/adaptor or PCR primer
from FASTQ files.
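One possible shape for the "little function" mentioned above, as a hedged sketch in plain Python. It scans fixed-width windows from the start of the read and cuts at the first window whose mean quality drops below a threshold. The name good_prefix_length, the window of 10 and the threshold of 20 are assumptions chosen for illustration; this is one simple trimming rule among many, not the method used by any particular tool.

```python
def good_prefix_length(quals, window=10, threshold=20):
    """Return how many leading bases to keep.

    Scans windows left to right and stops at the first window whose mean
    quality is below threshold; everything from the start of that window
    onwards is dropped. A read shorter than one window is judged whole.
    """
    n = len(quals)
    if n == 0:
        return 0
    width = min(window, n)
    total = sum(quals[:width])
    start = 0
    while True:
        # Compare sums rather than means, so no division is needed.
        if total < threshold * width:
            return start
        if start + width >= n:
            return n
        total += quals[start + width] - quals[start]
        start += 1
```

With Biopython the result can be used to slice the record, for example rec[:good_prefix_length(rec.letter_annotations["phred_quality"])], since slicing a SeqRecord keeps the per-letter annotations of the retained part.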
Peter From csaba.kiss at lanl.gov Thu Oct 25 14:49:59 2012 From: csaba.kiss at lanl.gov (Kiss, Csaba) Date: Thu, 25 Oct 2012 14:49:59 +0000 Subject: [Biopython] still more questions about NGS sequence trimming In-Reply-To: References: <8C93404AC678DC44905F571FD327A6CC1E934708@ECS-EXG-P-MB03.win.lanl.gov> Message-ID: <8C93404AC678DC44905F571FD327A6CC1E935803@ECS-EXG-P-MB03.win.lanl.gov> Thanks, Peter. I am writing my quality functions. Another question about trimming. As you mentioned, the quality of the ends tend to be lower than in the middle. Could that be fixed just by using "sff-trim" when I create my FASTQ file? If I don't do that I get sequences with small and capital letters. Are you suggesting further trimming than just "sff-trim". Csaba -----Original Message----- From: Peter Cock [mailto:p.j.a.cock at googlemail.com] Sent: Thursday, October 25, 2012 2:15 AM To: Kiss, Csaba Cc: biopython at lists.open-bio.org Subject: Re: [Biopython] still more questions about NGS sequence trimming On Wed, Oct 24, 2012 at 7:01 PM, Kiss, Csaba wrote: > Thanks, Peter. I am looking at this example now: > > from Bio import SeqIO > good_reads = (rec for rec in \ > SeqIO.parse("SRR020192.fastq", "fastq") \ > if min(rec.letter_annotations["phred_quality"]) >= 20) > count = SeqIO.write(good_reads, "good_quality.fastq", "fastq") print > "Saved %i reads" % count > > That's a rather crude quality filtering. Is there any more > sophisticated options already in biopython? Ie. quality_average > > Or other options? 
Average (mean) quality is easy, take the sum and divide by the length (or in this case, I've moved the divide to a multiply on the other side of the inequality since generally multiplication is faster than division): from Bio import SeqIO good_reads = (rec for rec in \ SeqIO.parse("SRR020192.fastq", "fastq") \ if sum(rec.letter_annotations["phred_quality"]) >= 20*len(rec)) count = SeqIO.write(good_reads, "good_quality.fastq", "fastq") print "Saved %i reads" % count However, for most sequencing reads you'd want to use a trimming step first as read quality tends to decline with length - the first half might be good and the second half bad, meaning the average is poor. You could write a little function to do that, and slice the SeqRecord to select the good chunk. There are examples of that in the Tutorial for removing an adapter/adaptor or PCR primer from FASTQ files. Peter From p.j.a.cock at googlemail.com Thu Oct 25 15:29:57 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 25 Oct 2012 16:29:57 +0100 Subject: [Biopython] still more questions about NGS sequence trimming In-Reply-To: <8C93404AC678DC44905F571FD327A6CC1E935803@ECS-EXG-P-MB03.win.lanl.gov> References: <8C93404AC678DC44905F571FD327A6CC1E934708@ECS-EXG-P-MB03.win.lanl.gov> <8C93404AC678DC44905F571FD327A6CC1E935803@ECS-EXG-P-MB03.win.lanl.gov> Message-ID: On Thu, Oct 25, 2012 at 3:49 PM, Kiss, Csaba wrote: > Thanks, Peter. I am writing my quality functions. Another question about > trimming. As you mentioned, the quality of the ends tend to be lower than > in the middle. Could that be fixed just by using "sff-trim" when I create my > FASTQ file? If I don't do that I get sequences with small and capital letters. > Are you suggesting further trimming than just "sff-trim". In Bio.SeqIO, we use the file format names "sff" and "sff-trim" to mean the raw sequence data from the SFF file in full, or with the trimming values inside the SFF file applied. 
If you have used the Roche tools you'll see a similar option in their SFF extraction tool. This default trimming is decided by the Roche 454 instrument and does quite a good job at removing the adapters, barcodes and poor quality bits. I assume you were using Mothur to do further trimming based on a more stringent sliding window of quality scores? Peter From csaba.kiss at lanl.gov Thu Oct 25 15:34:46 2012 From: csaba.kiss at lanl.gov (Kiss, Csaba) Date: Thu, 25 Oct 2012 15:34:46 +0000 Subject: [Biopython] still more questions about NGS sequence trimming In-Reply-To: References: <8C93404AC678DC44905F571FD327A6CC1E934708@ECS-EXG-P-MB03.win.lanl.gov> <8C93404AC678DC44905F571FD327A6CC1E935803@ECS-EXG-P-MB03.win.lanl.gov> Message-ID: <8C93404AC678DC44905F571FD327A6CC1E935829@ECS-EXG-P-MB03.win.lanl.gov> I believe mothur does check the moving average quality of a sequence with a sliding window of 50 bp. If the quality falls below the given value then it tosses the sequence out. I don't think it does end trimming beside removing the small letters from the ends. Of course, it can remove adapter and primer sequences but that's not based on quality values. C -----Original Message----- From: Peter Cock [mailto:p.j.a.cock at googlemail.com] Sent: Thursday, October 25, 2012 9:30 AM To: Kiss, Csaba Cc: biopython at lists.open-bio.org Subject: Re: [Biopython] still more questions about NGS sequence trimming On Thu, Oct 25, 2012 at 3:49 PM, Kiss, Csaba wrote: > Thanks, Peter. I am writing my quality functions. Another question > about trimming. As you mentioned, the quality of the ends tend to be > lower than in the middle. Could that be fixed just by using "sff-trim" > when I create my FASTQ file? If I don't do that I get sequences with small and capital letters. > Are you suggesting further trimming than just "sff-trim". 
In Bio.SeqIO, we use the file format names "sff" and "sff-trim" to mean the raw sequence data from the SFF file in full, or with the trimming values inside the SFF file applied. If you have used the Roche tools you'll see a similar option in their SFF extraction tool. This default trimming is decided by the Roche 454 instrument and does quite a good job at removing the adapters, barcodes and poor quality bits. I assume you were using Mothur to do further trimming based on a more stringent sliding window of quality scores? Peter From p.j.a.cock at googlemail.com Thu Oct 25 15:58:04 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 25 Oct 2012 16:58:04 +0100 Subject: [Biopython] still more questions about NGS sequence trimming In-Reply-To: <8C93404AC678DC44905F571FD327A6CC1E935829@ECS-EXG-P-MB03.win.lanl.gov> References: <8C93404AC678DC44905F571FD327A6CC1E934708@ECS-EXG-P-MB03.win.lanl.gov> <8C93404AC678DC44905F571FD327A6CC1E935803@ECS-EXG-P-MB03.win.lanl.gov> <8C93404AC678DC44905F571FD327A6CC1E935829@ECS-EXG-P-MB03.win.lanl.gov> Message-ID: On Thu, Oct 25, 2012 at 4:34 PM, Kiss, Csaba wrote: > I believe mothur does check the moving average quality of a sequence with > a sliding window of 50 bp. If the quality falls below the given value then it > tosses the sequence out. I don't think it does end trimming beside removing > the small letters from the ends. Of course, it can remove adapter and primer > sequences but that's not based on quality values. Fine - the point is doing SeqIO.parse("example.sff", "sff-trim") does NOT do any of that. All it does is apply the trimming information already recorded in the SFF file by the provider (e.g. the Roche 454 instrument). So back to your earlier question: > On Thu, Oct 25, 2012 at 3:49 PM, Kiss, Csaba wrote: >> Thanks, Peter. I am writing my quality functions. Another question >> about trimming. As you mentioned, the quality of the ends tend to be >> lower than in the middle. 
>> Could that be fixed just by using "sff-trim"
>> when I create my FASTQ file?

Using "sff-trim" would be sensible as a starting point, but
you'll still probably notice a drop off in quality along the
read length. This is normal.

>> If I don't do that I get sequences with small and capital letters.

The lower case bits are what Roche labelled as low quality or
adapter. The upper case bit is what Roche labelled as worth
keeping after its trimming, and it is this you'd get via
SeqIO.parse("example.sff", "sff-trim"). You'll probably notice
all the untrimmed sequences start with the same four letters
(in lower case).

>> Are you suggesting further trimming than just "sff-trim".

Yes, if you want to mimic what Mothur was doing for you.

Peter

From afernandez at ceab.csic.es Thu Oct 25 16:32:42 2012
From: afernandez at ceab.csic.es (Antonio Fernandez-Guerra)
Date: Thu, 25 Oct 2012 18:32:42 +0200
Subject: [Biopython] still more questions about NGS sequence trimming
Message-ID: 

-- 
Antonio Fernández-Guerra
Center for Advanced Studies of Blanes (CEAB-CSIC)
Acces Cala St Francesc, 14
17300 Blanes, SPAIN
Tel +34 972 33 6101
Fax +34 972 33 7806
http://nodens.ceab.csic.es/ecogenomics/members/antoni-fernandez-guerra.html
e-mail: afernandez at ceab.csic.es

Peter Cock wrote:

>On Thu, Oct 25, 2012 at 3:49 PM, Kiss, Csaba wrote:
>> Thanks, Peter. I am writing my quality functions. Another question about
>> trimming. As you mentioned, the quality of the ends tend to be lower than
>> in the middle. Could that be fixed just by using "sff-trim" when I create my
>> FASTQ file? If I don't do that I get sequences with small and capital letters.
>> Are you suggesting further trimming than just "sff-trim".
>
>In Bio.SeqIO, we use the file format names "sff" and "sff-trim" to mean
>the raw sequence data from the SFF file in full, or with the trimming
>values inside the SFF file applied. If you have used the Roche tools
>you'll see a similar option in their SFF extraction tool.
This default >trimming is decided by the Roche 454 instrument and does quite a >good job at removing the adapters, barcodes and poor quality bits. > >I assume you were using Mothur to do further trimming based on a >more stringent sliding window of quality scores? > >Peter >_______________________________________________ >Biopython mailing list - Biopython at lists.open-bio.org >http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Wed Oct 31 21:13:19 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 31 Oct 2012 21:13:19 +0000 Subject: [Biopython] Fwd: [mira_talk] new sff_extract In-Reply-To: References: <5090F1B5.9070000@upv.es> Message-ID: Apologies if this arrives twice, the OBF mail server was down earlier today - more news about that tomorrow I hope, along with information about the website. (This email is partly to confirm the list are alive again.) Peter ---------- Forwarded message ---------- From: *Peter Cock* Date: Wednesday, October 31, 2012 Subject: [mira_talk] new sff_extract To: Biopython Mailing List Hi all, For those working with SFF files (from Roche or IonTorrent), Jose's sff_extract tool has often been a popular alternative to the Roche (Linux only) off instrument applications - and Biopython's SFF support was based on sff_extract (thanks again Jose!). Jose has just announced (on the MIRA assembler mailing list) a new version of sff_extract which now calls the Biopython SFF code for the low level binary file access, and comes with some additional related tools. See below for details, original thread archive here: http://www.freelists.org/post/mira_talk/new-sff-extract Peter ---------- Forwarded message ---------- From: Jose Blanca > Date: Wed, Oct 31, 2012 at 9:39 AM Subject: [mira_talk] new sff_extract To: "mira_talk at freelists.org " > Hi: Sometime ago we discussed in this list the future of sff_extract. We started working on it and we have a version that we think is working. 
The sff_extract functionality has been split in two sff_extract and split_matepairs that can be linked together with a pipe. We haven't done extensive testing so if you use them, please let us know. These utilities are bundled with some other little tools that we have developed for our day to day work. They are all written in python and they use biopython. You can take a look at the development site: https://github.com/JoseBlanca/seq_crumbs Or our site: http://bioinf.comav.upv.es/seq_crumbs/ Of course we'd love to have some feedback. Best regards, -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) -- You have received this mail because you are subscribed to the mira_talk mailing list. For information on how to subscribe or unsubscribe, please visit http://www.chevreux.org/mira_mailinglists.html