From p.j.a.cock at googlemail.com Sun Apr 1 04:51:12 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 1 Apr 2012 09:51:12 +0100
Subject: [Biopython] Back translation support in Biopython
In-Reply-To:
References:
Message-ID:

On Sun, Apr 1, 2012 at 4:04 AM, Igor Rodrigues da Costa wrote:
>
> Hi,
> I am interested in participating in GSoC this summer. I would
> like to know if there is community support for a new project:
> Extending Seq class to add support to back translation of
> proteins (something like this: http://www.bork.embl.de/pal2nal/ ).
> If this project isn't strong enough on its own, it could be added
> to any existing project, or it could be complemented with other
> suggestions from the community.
> Thanks for your attention, Igor

Hi Igor,

I don't think back translation in itself is nearly enough to be a
GSoC project. It is also problematic - we had a good long
discussion about back translation, and what it might be useful
for, back in 2008. In particular, assuming back translation to
a simple nucleotide sequence (as a string or Seq object),
what would it actually be useful for?

See http://bugzilla.open-bio.org/show_bug.cgi?id=2618 which
is now using https://redmine.open-bio.org/issues/2618 and
the quite long and at times confusing thread:
http://lists.open-bio.org/pipermail/biopython/2008-October/004588.html

Did you have any other ideas or topics that interested you?

Regards,

Peter

From chaitanya.talnikar at iitb.ac.in Sun Apr 1 05:42:43 2012
From: chaitanya.talnikar at iitb.ac.in (Chaitanya Talnikar)
Date: Sun, 1 Apr 2012 15:12:43 +0530
Subject: [Biopython] GSoC project application: Representation and manipulation of genomic variants: Some questions
In-Reply-To: <87wr626gc5.fsf@fastmail.fm>
References: <87398vmxhj.fsf@fastmail.fm> <874nt9k11m.fsf@fastmail.fm> <87wr626gc5.fsf@fastmail.fm>
Message-ID:

I have uploaded a second draft incorporating the changes. Please
provide comments on my proposal.

Thanks,
Chaitanya

On Fri, Mar 30, 2012 at 6:43 AM, Brad Chapman wrote:
>
> Chaitanya;
> Thanks for making this available. It's a great start and you need to
> work from here on being much more detailed in your project plan. I left
> specific comments in-line in the proposal. Let us know when you have a
> revised version and we can work more. Thanks again,
> Brad
>
>> Here's the Google Doc link, I have made it editable too.
>>
>> https://docs.google.com/document/d/12N1aEzagMZ8akc1mrfP4MxHdILT2wapjENJOoxZBIh0/edit
>>
>> On Wed, Mar 28, 2012 at 6:13 AM, Brad Chapman wrote:
>> >
>> > Chaitanya;
>> > The easiest way to work on your proposal is to write it in a
>> > public Google Doc and then share with the list. I don't yet have access
>> > to all of the Melange GSoC project and I'd imagine others who might
>> > have thoughts are in the same boat. As a side benefit it's also much
>> > easier to collaborate on editing and notes.
>> >
>> > Brad
>> >
>> >> Hi,
>> >> I have uploaded the first draft of my project proposal. I will add
>> >> more sections to the project plan in a day or two. Just wanted to have
>> >> the initial draft up. I hope to write a better proposal with your
>> >> feedback.
>> >>
>> >> Regards,
>> >> Chaitanya
>> >>
>> >> On Mon, Mar 26, 2012 at 4:37 PM, Brad Chapman wrote:
>> >> >
>> >> > Chaitanya;
>> >> > Thanks for the interest and specific questions.
>> >> >
>> >> >> 1. For the implementation of variants what would be better, to create
>> >> >> a new SeqVariant class from scratch or to extend the SeqFeature class
>> >> >> to accommodate variants?
I guess a separate class would be better. >> >> > >> >> > My preference would be to see how far the SeqFeature class can take you >> >> > before implementing a new class. It should be general enough to handle >> >> > variant data, but the bigger challenge might be designing a lightweight >> >> > representation that is compatible with existing SeqFeatures. >> >> > >> >> >> 2. While looking at the Biopython wiki I came across an implementation >> >> >> of GFF at >> >> >> https://github.com/chapmanb/bcbb/tree/master/gff >> >> >> As GVF is an extension of GFF3, this module could be used for reading >> >> >> GVF's too. Is this module a good start to modify it to support GVFs? >> >> > >> >> > That would be perfect. We're hoping to merge this into the Biopython >> >> > code base before the next release. There is also an existing VCF parser >> >> > we'd love to use here: >> >> > >> >> > https://github.com/jamescasbon/PyVCF >> >> > >> >> >> 3. I've been going through the VCF documentation and SNPs, insertions >> >> >> and deletions can be represented just like it is done in VCF, the >> >> >> object would have a start position, length of reference sequence(no >> >> >> need to store this sequence) and a list of alternate sequence objects. >> >> >> I have to still look into the SV(Structural variants), rearrangements >> >> >> and imprecise variant information, so this representation is only for >> >> >> SNPs and small indels. The GVF has a very similar format for small >> >> >> indels and SNPs, just that it provides an extra end position column >> >> >> which is not required if we have the reference sequence. >> >> > >> >> > This sounds good. My general suggestion is to start writing your >> >> > proposal as soon as possible. A concrete first draft will help with more >> >> > detailed comments. The wiki has good information on the project plan: >> >> > >> >> > http://open-bio.org/wiki/Google_Summer_of_Code#When_you_apply >> >> > >> >> > and the NESCent wiki has some examples of well-written proposals from >> >> > previous years: >> >> > >> >> > http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2012#Writing_your_application >> >> > >> >> > One of the key aspects is having a detailed week-by-week outline of your >> >> > plans for the summer. >> >> > >> >> > Thanks again for the interest, >> >> > Brad From chapmanb at 50mail.com Sun Apr 1 16:03:21 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Sun, 01 Apr 2012 16:03:21 -0400 Subject: [Biopython] GSOC Genome Variants proposal In-Reply-To: References: Message-ID: <87ty13te2e.fsf@fastmail.fm> Chris; Thanks for putting this together: that's a great start. I left specific suggestions as comments in the document but in general the next step is to expand your timeline to be increasingly specific about your plans. It sounds like you have a good handle on the overview, now the full nitty gritty details start. Brad > Hey everyone, > > Here's a draft of my proposal: > > https://docs.google.com/document/d/1DNm8NQmnP4fH8KvF9v107mo4__V-FqWtx8Vgsz5yNns/edit > > I've allowed comments to be put in. Please tear it to shreds :). 
> > Thanks, > Chris > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From chapmanb at 50mail.com Sun Apr 1 16:07:31 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Sun, 01 Apr 2012 16:07:31 -0400 Subject: [Biopython] GSoC project application: Representation and manipulation of genomic variants: Some questions In-Reply-To: References: <87398vmxhj.fsf@fastmail.fm> <874nt9k11m.fsf@fastmail.fm> <87wr626gc5.fsf@fastmail.fm> Message-ID: <87r4w7tdvg.fsf@fastmail.fm> Chaitanya; Thanks for the additional work on this, that's great work. I left specific comments in-line but my general suggestion is to keep expanding and clarifying the timeline. Up front work building a detailed timeline makes the summer work so much easier, as well as building a stronger proposal. Thanks again, Brad > I have uploaded a second draft incorporating the changes. Please > provide comments on my proposal. > Thanks, > Chaitanya > > On Fri, Mar 30, 2012 at 6:43 AM, Brad Chapman wrote: > > > > Chaitanya; > > Thanks for making this available. It's a great start and you need to > > work from here on being much more detailed in your project plan. I left > > specific comments in-line in the proposal. Let us know when you have a > > revised version and we can work more. Thanks again, > > Brad > > > >> Here's the google doc link, I have made it editable too. > >> > >> https://docs.google.com/document/d/12N1aEzagMZ8akc1mrfP4MxHdILT2wapjENJOoxZBIh0/edit > >> > >> On Wed, Mar 28, 2012 at 6:13 AM, Brad Chapman wrote: > >> > > >> > Chaitanya; > >> > The easiest way to work on your proposal is to write it in a > >> > public Google Doc and then share with the list. I don't yet have access > >> > to all of the Melange GSoC project and I'd imagine others who might > >> > have thoughts are in the same boat. As a side benefit it's also much > >> > easier to collaborate on editing and notes. > >> > > >> > Brad > >> > > >> >> Hi, > >> >> I have uploaded the first draft of my project proposal. I will add > >> >> more sections to the project plan in a day or two. Just wanted to have > >> >> the initial draft up. I hope to write a better proposal with your > >> >> feedback. > >> >> > >> >> Regards, > >> >> Chaitanya > >> >> > >> >> On Mon, Mar 26, 2012 at 4:37 PM, Brad Chapman wrote: > >> >> > > >> >> > Chaitanya; > >> >> > Thanks for the interest and specific questions. > >> >> > > >> >> >> 1. For the implementation of variants what would be better, to create > >> >> >> a new SeqVariant class from scratch or to extend the SeqFeature class > >> >> >> to accomodate variants? I guess a separate class would be better. > >> >> > > >> >> > My preference would be to see how far the SeqFeature class can take you > >> >> > before implementing a new class. It should be general enough to handle > >> >> > variant data, but the bigger challenge might be designing a lightweight > >> >> > representation that is compatible with existing SeqFeatures. > >> >> > > >> >> >> 2. While looking at the Biopython wiki I came across an implementation > >> >> >> of GFF at > >> >> >> https://github.com/chapmanb/bcbb/tree/master/gff > >> >> >> As GVF is an extension of GFF3, this module could be used for reading > >> >> >> GVF's too. Is this module a good start to modify it to support GVFs? > >> >> > > >> >> > That would be perfect. We're hoping to merge this into the Biopython > >> >> > code base before the next release. 
There is also an existing VCF parser
> >> >> > we'd love to use here:
> >> >> >
> >> >> > https://github.com/jamescasbon/PyVCF
> >> >> >
> >> >> >> 3. I've been going through the VCF documentation and SNPs, insertions
> >> >> >> and deletions can be represented just like it is done in VCF, the
> >> >> >> object would have a start position, length of reference sequence (no
> >> >> >> need to store this sequence) and a list of alternate sequence objects.
> >> >> >> I still have to look into the SV (Structural variants), rearrangements
> >> >> >> and imprecise variant information, so this representation is only for
> >> >> >> SNPs and small indels. The GVF has a very similar format for small
> >> >> >> indels and SNPs, just that it provides an extra end position column
> >> >> >> which is not required if we have the reference sequence.
> >> >> >
> >> >> > This sounds good. My general suggestion is to start writing your
> >> >> > proposal as soon as possible. A concrete first draft will help with more
> >> >> > detailed comments. The wiki has good information on the project plan:
> >> >> >
> >> >> > http://open-bio.org/wiki/Google_Summer_of_Code#When_you_apply
> >> >> >
> >> >> > and the NESCent wiki has some examples of well-written proposals from
> >> >> > previous years:
> >> >> >
> >> >> > http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2012#Writing_your_application
> >> >> >
> >> >> > One of the key aspects is having a detailed week-by-week outline of your
> >> >> > plans for the summer.
> >> >> >
> >> >> > Thanks again for the interest,
> >> >> > Brad

From sudeep495 at gmail.com Mon Apr 2 16:37:28 2012
From: sudeep495 at gmail.com (Sudeep Singh)
Date: Tue, 3 Apr 2012 02:07:28 +0530
Subject: [Biopython] Gsoc 2012, SearchIO
In-Reply-To:
References:
Message-ID:

Dear Peter Cock,

I am a fifth-year dual degree student in computer science at the Indian
Institute of Technology, Kharagpur. I have an interest in Bioinformatics
and have done a course and a couple of projects in this area. I am
interested in the project SearchIO listed on the Ideas Page.
Kindly let me know how I should proceed?

Thanks
Sudeep

From p.j.a.cock at googlemail.com Tue Apr 3 04:51:59 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 3 Apr 2012 09:51:59 +0100
Subject: [Biopython] Gsoc 2012, SearchIO
In-Reply-To:
References:
Message-ID:

On Mon, Apr 2, 2012 at 9:37 PM, Sudeep Singh wrote:
> Dear Peter Cock,
>
> I am a fifth-year dual degree student in computer science at the Indian
> Institute of Technology, Kharagpur. I have an interest in Bioinformatics
> and have done a course and a couple of projects in this area. I am
> interested in the project SearchIO listed on the Ideas Page.
> Kindly let me know how I should proceed?
>
> Thanks
> Sudeep

Hello Sudeep,

Welcome to the Biopython mailing list :)

Since you are interested in applying for Google Summer of Code
(GSoC), you should also subscribe to the biopython-dev mailing list,
which is where discussion about code for Biopython mostly happens.

I wrote a more detailed email about my thoughts for SearchIO on
the biopython-dev list last month:
http://lists.open-bio.org/pipermail/biopython-dev/2012-March/009468.html

You are welcome to write a GSoC proposal - but you will have to
hurry as the deadline is this Friday 6 April. Please see:
http://www.open-bio.org/wiki/Google_Summer_of_Code
http://code.google.com/soc/

You should have a look at some of the previous projects online,
including their project schedule, which is an important part of the
proposal.
http://biopython.org/wiki/Google_Summer_of_Code

It is also important for us to gauge your programming ability and
experience. If you can link to previous open source project
contributions, that would be a good sign. I have suggested to other
applicants that finding and reporting a bug in Biopython (even a
mistake in the documentation) is a good start. Contributing a bug
fix is even better ;)

In the case of the SearchIO project idea, we'd also be looking for
some evidence of familiarity with the tools whose output you would
be working with (BLAST, FASTA, HMMER, etc). Perhaps you've used
some in your studies? If so, you can write that in the proposal.

You can send a draft proposal to me for comment and feedback, but I
would encourage you to share it on the biopython-dev list for wider
review - for example as a Google Doc with commenting enabled.
Several of the other students have already done this.

Don't leave this too late - I will be traveling Thursday 5 and Friday
6 April, so won't be giving anyone any last minute comments ;)

Remember being selected is a competition - all the OBF GSoC project
proposals will be reviewed and ranked, and projects then allocated
based on how many students Google allocates to us. The SearchIO
topic seems very popular, but only one student would be picked to
work on this.

Good luck,

Peter

From ivaylo.stoimenov at gmail.com Tue Apr 3 05:37:01 2012
From: ivaylo.stoimenov at gmail.com (Ivaylo Stoimenov)
Date: Tue, 3 Apr 2012 11:37:01 +0200
Subject: [Biopython] Installation of BCBio pack
Message-ID:

Hi,

I would like to use the nice GFF parser tool from the BCBio pack, but I am
not sure how to install the pack on my machine. If someone could help me
to install BCBio on Ubuntu 11.10 I would be so grateful. Thank you in
advance.

Best regards,
Ivaylo Stoimenov

From chapmanb at 50mail.com Tue Apr 3 09:11:13 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 03 Apr 2012 09:11:13 -0400
Subject: [Biopython] Installation of BCBio pack
In-Reply-To:
References:
Message-ID: <87ehs5hsem.fsf@fastmail.fm>

Ivaylo;

> I would like to use the nice GFF parser tool from the BCBio pack, but I am not
> sure how to install the pack on my machine. If someone could help me to
> install BCBio on Ubuntu 11.10 I would be so grateful. Thank you in
> advance.

We're hoping to include this in the next release of Biopython. In the
meantime it's a manual install:

git clone git://github.com/chapmanb/bcbb.git
cd bcbb/gff
python setup.py build
sudo python setup.py install

Hope this helps,
Brad

From chapmanb at 50mail.com Tue Apr 3 10:24:43 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 03 Apr 2012 10:24:43 -0400
Subject: [Biopython] Installation of BCBio pack
In-Reply-To:
References: <87ehs5hsem.fsf@fastmail.fm>
Message-ID: <87398kj3kk.fsf@fastmail.fm>

Ivaylo;

> Thank you so much for the help (and for writing the tools in the first
> place). However, I got a problem after trying to execute the commands.
> After "python setup.py build", I am getting an error message saying
> "ImportError: No module named setuptools". I wonder where the problem is.
> Do I need to install setuptools from somewhere first?

setuptools is a standard install framework for Python and provides the
'easy_install' command.
Instructions for installing it on different platforms are here:

http://pypi.python.org/pypi/setuptools

Or on Ubuntu you can do:

sudo apt-get install python-setuptools

Hope this helps,
Brad

From igorrcosta at hotmail.com Tue Apr 3 17:24:02 2012
From: igorrcosta at hotmail.com (Igor Rodrigues da Costa)
Date: Tue, 3 Apr 2012 21:24:02 +0000
Subject: [Biopython] Back translation support in Biopython
In-Reply-To:
References:
Message-ID:

Thanks for your response!

I think back translation has an obvious solution that avoids all those
problems mentioned in the discussion you cited, which is to pass the
nucleotide sequence as a parameter. It has plenty of uses: I have used it
in my own research for comparing the evolutionary profile (ks/ka) of a
list of aligned proteins in a multifasta (I made a script that fetched
the CDS from NCBI using the Entrez module to get the nucleotide
sequence). It aligns the codons of nucleotide sequences (a hard problem
if the protein sequence is not available) and can also check for data
integrity.

Another topic of interest, also used in my projects, is the calculation
of the Dn/Ds rate (non-synonymous / synonymous mutations *
non-synonymous / synonymous loci) using the most popular models
(Nei-Gojobori, Li, etc). It is very useful, as can be seen from its
widespread use in papers
(http://code.google.com/p/kaks-calculator/wiki/Citations)

Similar projects:
https://github.com/tanghaibao/bio-pipeline/tree/master/synonymous_calculation/
http://www.bork.embl.de/pal2nal/
http://cran.r-project.org/web/packages/seqinr/index.html
http://services.cbu.uib.no/tools/kaks
http://code.google.com/p/kaks-calculator/

Thanks for your input, Igor

> Date: Sun, 1 Apr 2012 09:51:12 +0100
> Subject: Re: [Biopython] Back translation support in Biopython
> From: p.j.a.cock at googlemail.com
> To: igorrcosta at hotmail.com
> CC: biopython at lists.open-bio.org
>
> On Sun, Apr 1, 2012 at 4:04 AM, Igor Rodrigues da Costa
> wrote:
> >
> > Hi,
> > I am interested in participating in GSoC this summer. I would
> > like to know if there is community support for a new project:
> > Extending Seq class to add support to back translation of
> > proteins (something like this: http://www.bork.embl.de/pal2nal/ ).
> > If this project isn't strong enough on its own, it could be added
> > to any existing project, or it could be complemented with other
> > suggestions from the community.
> > Thanks for your attention, Igor
>
> Hi Igor,
>
> I don't think back translation in itself is nearly enough to be a
> GSoC project. It is also problematic - we had a good long
> discussion about back translation, and what it might be useful
> for, back in 2008. In particular, assuming back translation to
> a simple nucleotide sequence (as a string or Seq object),
> what would it actually be useful for?
>
> See http://bugzilla.open-bio.org/show_bug.cgi?id=2618 which
> is now using https://redmine.open-bio.org/issues/2618 and
> the quite long and at times confusing thread:
> http://lists.open-bio.org/pipermail/biopython/2008-October/004588.html
>
> Did you have any other ideas or topics that interested you?
> > Regards, > > Peter From reece at harts.net Tue Apr 3 20:33:28 2012 From: reece at harts.net (Reece Hart) Date: Tue, 3 Apr 2012 17:33:28 -0700 Subject: [Biopython] GSoC project application: Representation and manipulation of genomic variants: Some questions In-Reply-To: References: <87398vmxhj.fsf@fastmail.fm> <874nt9k11m.fsf@fastmail.fm> <87wr626gc5.fsf@fastmail.fm> Message-ID: On Sun, Apr 1, 2012 at 2:42 AM, Chaitanya Talnikar < chaitanya.talnikar at iitb.ac.in> wrote: > I have uploaded a second draft incorporating the changes. Please > provide comments on my proposal. > Hi Chaitanya- I also read your proposal last night. My comments mostly echo Brad's, although there are a couple of new ones I think. I'll be happy to reread or answer questions as needed. -Reece From eric.talevich at gmail.com Tue Apr 3 21:49:17 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 3 Apr 2012 21:49:17 -0400 Subject: [Biopython] Back translation support in Biopython In-Reply-To: References: Message-ID: Hi Igor, It sounds like you're referring to aligning amino acid sequences to codon sequences, as PAL2NAL does. This is different from what most people mean by back translation, but as you point out, certainly useful. If you write a function that can match a protein sequence alignment to a set of raw CDS sequences, returning a nucleotide alignment based on the codon-to-amino-acid mapping, that would be useful. However, PAL2NAL does exactly that, plus a bit more, and is a fairly well-known and easily obtained program. Personally, I would prefer to write a wrapper for PAL2NAL under Bio.Align.Applications, using the existing Bio.Applications framework. Once the user has a codon alignment, dn/ds and many other calculations based on evolutionary models can be performed with our PAML wrappers, under Bio.Phylo.PAML. I agree there is room in Biopython to make this workflow easier to perform. (Although I wouldn't be able to mentor such a project under GSoC this year.) Best, Eric On Tue, Apr 3, 2012 at 5:24 PM, Igor Rodrigues da Costa < igorrcosta at hotmail.com> wrote: > > Thanks for your response! > I think back translation has an obvious solution that avoids all those > problems mentioned in that discussion you cited, that is to pass the > nucleotide sequence as a parameter. It has plenty of utilities, I have used > it in my own research for comparing the evolutionary profile (ks/ka) of a > list of aligned proteins in a multifasta (I made a script that fetched the > CDS from ncbi using Entrez module to get the nucleotide sequence), it > aligns the codons of nucleotide sequences (a hard problem if the protein > sequence is not available) and can also check for data integrity. > Another topic of interest, also used in my projects, is the calculation of > the Dn/Ds rate (non-synonymous / synonymous mutations * non-synonymous / > synonymous loci) using the most popular models (Nei-Gojobori, Li, etc). 
It > is very usefull as can be seen for it's widespread use in papers ( > http://code.google.com/p/kaks-calculator/wiki/Citations) > Similar projects: > https://github.com/tanghaibao/bio-pipeline/tree/master/synonymous_calculation/ > http://www.bork.embl.de/pal2nal/ > http://cran.r-project.org/web/packages/seqinr/index.html > http://services.cbu.uib.no/tools/kaks > http://code.google.com/p/kaks-calculator/ > Thanks for your input,Igor > > Date: Sun, 1 Apr 2012 09:51:12 +0100 > > Subject: Re: [Biopython] Back translation support in Biopython > > From: p.j.a.cock at googlemail.com > > To: igorrcosta at hotmail.com > > CC: biopython at lists.open-bio.org > > > > On Sun, Apr 1, 2012 at 4:04 AM, Igor Rodrigues da Costa > > wrote: > > > > > > Hi, > > > I am interested in participating in GSoC this summer. I would > > > like to know if there is community support for a new project: > > > Extending Seq class to add support to back translation of > > > proteins (something like this: http://www.bork.embl.de/pal2nal/ ). > > > If this project isn't strong enough at its own, it could be added > > > to any existing project, or it could be complemented with others > > > suggestions from the community. > > > Thanks for your attention,Igor > > > > Hi Igor, > > > > I don't think back translation in itself is nearly enough to be a > > GSoC project. Is it also problematic - we had a good long > > discussion about back translation, and what it might be useful > > for, back in 2008. In particular, assuming back translation to > > a simple nucleotide sequence (as a string or Seq object), > > what would it actually be useful for? > > > > See http://bugzilla.open-bio.org/show_bug.cgi?id=2618 which > > is now using https://redmine.open-bio.org/issues/2618 and > > the quite long and at times confusing thread: > > http://lists.open-bio.org/pipermail/biopython/2008-October/004588.html > > > > Did you have any other ideas or topics that interested you? > > > > Regards, > > > > Peter > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From p.j.a.cock at googlemail.com Wed Apr 4 11:02:41 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 4 Apr 2012 16:02:41 +0100 Subject: [Biopython] Back translation support in Biopython In-Reply-To: References: Message-ID: On Wed, Apr 4, 2012 at 2:49 AM, Eric Talevich wrote: > Hi Igor, > > It sounds like you're referring to aligning amino acid sequences to codon > sequences, as PAL2NAL does. This is different from what most people mean by > back translation, but as you point out, certainly useful. > > If you write a function that can match a protein sequence alignment to a set > of raw CDS sequences, returning a nucleotide alignment based on the > codon-to-amino-acid mapping, that would be useful. However, PAL2NAL does > exactly that, plus a bit more, and is a fairly well-known and easily > obtained program. Personally, I would prefer to write a wrapper for PAL2NAL > under Bio.Align.Applications, using the existing Bio.Applications framework. As per the old thread, a simple function in Python taking the gapped protein sequence, original nucleotide coding sequence, and the translation table does sound useful. Then using that, you could go from a protein alignment plus the original nucleotide coding sequences to a codon alignment, or other tasks. 
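For example, something like this untested sketch (the function name is
just illustrative - there is no such function in Biopython today), which
threads the original codons back onto the gapped protein:

from Bio.Seq import Seq

def gapped_back_translate(aligned_protein, cds, table=1):
    # Sketch only - assumes gaps are "-", and that the ungapped protein
    # matches the CDS translation exactly (no terminal stop codon, no
    # frameshifts, CDS length a multiple of three).
    cds = str(cds)
    if str(Seq(cds).translate(table=table)) != str(aligned_protein).replace("-", ""):
        raise ValueError("CDS does not translate to the given protein")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    nuc = []
    for aa in str(aligned_protein):
        nuc.append("---" if aa == "-" else codons.pop(0))
    return Seq("".join(nuc))

Apply that to each row of a protein alignment with its matching
unaligned CDS and you get a codon alignment.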
Given this is all relatively straightforward string manipulation and we
already have the required genetic code tables in Biopython, I'm not
convinced that wrapping PAL2NAL would be the best solution (for this
sub-task).

> Once the user has a codon alignment, dn/ds and many other calculations based
> on evolutionary models can be performed with our PAML wrappers, under
> Bio.Phylo.PAML. I agree there is room in Biopython to make this workflow
> easier to perform. (Although I wouldn't be able to mentor such a project
> under GSoC this year.)

Doing some of the calculations directly within Biopython could be
interesting and useful - although calling PAML is a very pragmatic
solution too. I'm not sure you have enough work here to justify a GSoC
project, but the timing is also rather tight to find a suitable mentor.
Maybe next year? However, you can still start contributing to Biopython
now - and such involvement would be viewed positively on a future GSoC
application (not just with us, but for other participating projects -
being able to show past contributions to open source projects is good).

Regards,

Peter

From alfonso.esposito1983 at hotmail.it Wed Apr 4 11:27:57 2012
From: alfonso.esposito1983 at hotmail.it (fonz esposito)
Date: Wed, 4 Apr 2012 17:27:57 +0200
Subject: [Biopython] Blast defaults
Message-ID:

Hello everybody,

I guess I am not the first one coming out with this question, but: I
have problems because the NCBIWWW.qblast function does not give exactly
the same results as the online BLAST. I use it as the tutorial says, but
I don't know how to change the parameters to the defaults that the
online web BLAST uses... Does someone know which parameters I should
change? Thanks in advance

From p.j.a.cock at googlemail.com Wed Apr 4 11:39:25 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 4 Apr 2012 16:39:25 +0100
Subject: [Biopython] Blast defaults
In-Reply-To:
References:
Message-ID:

On Wed, Apr 4, 2012 at 4:27 PM, fonz esposito wrote:
>
> Hello everybody,
>
> I guess I am not the first one coming out with this question, but: I
> have problems because the NCBIWWW.qblast function does not
> give exactly the same results as the online BLAST. I use it as the tutorial
> says, but I don't know how to change the parameters to the defaults
> that the online web BLAST uses... Does someone know
> which parameters I should change?

Check the gap parameters first, but you're going to have to compare
them all to be sure - the NCBI website does some quite clever
auto-selection these days.

Peter

From tturne18 at jhmi.edu Wed Apr 4 12:55:20 2012
From: tturne18 at jhmi.edu (Tychele Turner)
Date: Wed, 4 Apr 2012 16:55:20 +0000
Subject: [Biopython] biopython question
Message-ID: <22450DD328862542A918A3BC491F263B561ACC@SN2PRD0102MB141.prod.exchangelabs.com>

Hi,

I have a question regarding one of the Biopython capabilities. I would
like to trim primers off the end of reads in a fastq file and I found
wonderful documentation of how to do this on your website as follows:

from Bio import SeqIO

def trim_primers(records, primer):
    """Removes perfect primer sequences at start of reads.

    This is a generator function, the records argument should
    be a list or iterator returning SeqRecord objects.
    """
    len_primer = len(primer) #cache this for later
    for record in records:
        if record.seq.startswith(primer):
            yield record[len_primer:]
        else:
            yield record

original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
trimmed_reads = trim_primers(original_reads, "GATGACGGTGT")
count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
print "Saved %i reads" % count

My question is: is there a way to loop through a primer file so that,
instead of looking only for 'GATGACGGTGT', every primer would be checked
and subsequently removed from the start of its respective read?

Primer file structured as:
GATGACGGTGT
GATGACGGTGA
GATGACGGCCT

If you have any suggestions it would be greatly appreciated. Thanks.

Tychele

From w.arindrarto at gmail.com Wed Apr 4 14:05:54 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Wed, 4 Apr 2012 20:05:54 +0200
Subject: [Biopython] biopython question
In-Reply-To: <22450DD328862542A918A3BC491F263B561ACC@SN2PRD0102MB141.prod.exchangelabs.com>
References: <22450DD328862542A918A3BC491F263B561ACC@SN2PRD0102MB141.prod.exchangelabs.com>
Message-ID:

Hi Tychele,

If I understood correctly, you have a list of primers stored in a file
and you want to trim those primer sequences off your fastq sequences,
correct? One way I could think of is to first store the primers in a
list (since they will be used repeatedly to check every single fastq
sequence).

Here's the code:

from Bio import SeqIO

def trim_primers(records, primer_file_name):
    # read the primers
    primer_list = []
    with open(primer_file_name, 'r') as source:
        for line in source:
            primer_list.append(line.strip())

    for record in records:
        # list to check if the sequence begins with any of the primers
        check = [record.seq.startswith(x) for x in primer_list]
        # if any of the primers is present at the beginning of the
        # sequence, then we trim it off
        if any(check):
            # get index of primer that matches the beginning
            idx = check.index(True)
            len_primer = len(primer_list[idx])
            yield record[len_primer:]
        # otherwise just return the whole record
        else:
            yield record

and then, you can use the function like so:

original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
trimmed_reads = trim_primers(original_reads, 'primer_file_name')
count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
print "Saved %i reads" % count

I haven't tested the function, but I suppose that should do the trick.

Hope that helps :),
Bow

On Wed, Apr 4, 2012 at 18:55, Tychele Turner wrote:
> Hi,
>
> I have a question regarding one of the Biopython capabilities. I would
> like to trim primers off the end of reads in a fastq file and I found
> wonderful documentation of how to do this on your website as follows:
>
> from Bio import SeqIO
>
> def trim_primers(records, primer):
>     """Removes perfect primer sequences at start of reads.
>
>     This is a generator function, the records argument should
>     be a list or iterator returning SeqRecord objects.
>     """
>     len_primer = len(primer) #cache this for later
>     for record in records:
>         if record.seq.startswith(primer):
>             yield record[len_primer:]
>         else:
>             yield record
>
> original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
> trimmed_reads = trim_primers(original_reads, "GATGACGGTGT")
> count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
> print "Saved %i reads" % count
>
> My question is: is there a way to loop through a primer file so that,
> instead of looking only for 'GATGACGGTGT', every primer would be checked
> and subsequently removed from the start of its respective read?
>
> Primer file structured as:
> GATGACGGTGT
> GATGACGGTGA
> GATGACGGCCT
>
> If you have any suggestions it would be greatly appreciated. Thanks.
>
> Tychele
>
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From ferreirafm at usp.br Wed Apr 4 14:56:08 2012
From: ferreirafm at usp.br (ferreirafm at usp.br)
Date: Wed, 04 Apr 2012 15:56:08 -0300
Subject: [Biopython] random peptide sequences
Message-ID: <20120404155608.14619h2pl6oj4y88@webmail.usp.br>

Dear BioPython List,

I want to write a python script to generate random peptide sequences. I
have a sketch in my mind; however, I'm not sure how to deal with the data
itself (like: use Seq or MutableSeq?). The problem is as follows: I have
a list of 20 string peptides which I join to produce a sequence. I want
to generate 1000+ sequences keeping peptide1 (pep1) in a fixed position
(p1) and randomly permuting (without repetition) the remaining 19
peptides in the remaining 19 positions. Repeat the first step keeping
pep2 in a fixed position p2 to generate 1000 more peptide sequences. And
repeat this step again and again for all of the peptides & positions. At
the end, I'm going to run a function with each one of the peptide
sequences, getting a binary result like "positive" or "negative". What I
have in mind is to randomly generate 1000 peptide sequences of 19
peptides, insert pep1 at position p1 in all of them; generate 1000 more
peptide sequences of 19 peptides again and insert pep2 at position p2 in
all of them; and so on... At the end, I'm going to run the function for
each of the sequences and store the results in a dict where the value is
the binary result. Well, where does Biopython come in? I'm completely new
to Biopython and would like to use it to solve the problem described. So,
I'm writing to ask you guys for some tips and advice on using Biopython
resources as much as possible. Any help is appreciated.

All the Best,
Fred

From p.j.a.cock at googlemail.com Wed Apr 4 15:34:53 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 4 Apr 2012 20:34:53 +0100
Subject: [Biopython] random peptide sequences
In-Reply-To: <20120404155608.14619h2pl6oj4y88@webmail.usp.br>
References: <20120404155608.14619h2pl6oj4y88@webmail.usp.br>
Message-ID:

On Wed, Apr 4, 2012 at 7:56 PM, wrote:
> Dear BioPython List,
> I want to write a python script to generate random peptide sequences. I have
> a sketch in my mind; however, I'm not sure how to deal with the data itself
> (like: use Seq or MutableSeq?).

I would use a Seq object - once generated your random sequence won't
change, so there is no need for the MutableSeq object.

> ... At the end, I'm going to run the
> function for each of the sequences and store the results in a dict where
> the value is the binary result.

It sounds like a large dataset of 1000s of random sequences will be
created - you probably don't want to do that all in memory. I would
generate the random records one by one and write them to a FASTA file.
Then loop over the FASTA file and apply your binary test. An advantage
of this split is you have broken the task in two - you can get the
random sequence generator working and checked separately from writing
and testing the classifier.

[I am assuming you want to get out of this a table of some kind
linking random sequences to binary classifier results]

Peter

From chris.mit7 at gmail.com Wed Apr 4 16:41:53 2012
From: chris.mit7 at gmail.com (Chris Mitchell)
Date: Wed, 4 Apr 2012 16:41:53 -0400
Subject: [Biopython] GSOC Genome Variants proposal
In-Reply-To: <87ty13te2e.fsf@fastmail.fm>
References: <87ty13te2e.fsf@fastmail.fm>
Message-ID:

I put some more updates on it. I'll have it finished by the end of
Thursday, but any comments on my changes are appreciated. I expanded on
my timeline and just need to fill in Weeks 6-11.

On Sun, Apr 1, 2012 at 4:03 PM, Brad Chapman wrote:
>
> Chris;
> Thanks for putting this together: that's a great start. I left specific
> suggestions as comments in the document but in general the next step is
> to expand your timeline to be increasingly specific about your plans. It
> sounds like you have a good handle on the overview, now the full nitty
> gritty details start.
>
> Brad
>
> > Hey everyone,
> >
> > Here's a draft of my proposal:
> >
> > https://docs.google.com/document/d/1DNm8NQmnP4fH8KvF9v107mo4__V-FqWtx8Vgsz5yNns/edit
> >
> > I've allowed comments to be put in. Please tear it to shreds :).
> >
> > Thanks,
> > Chris
> > _______________________________________________
> > Biopython mailing list - Biopython at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython

From w.arindrarto at gmail.com Wed Apr 4 18:26:29 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Thu, 5 Apr 2012 00:26:29 +0200
Subject: [Biopython] biopython question
In-Reply-To: <22450DD328862542A918A3BC491F263B561B7B@SN2PRD0102MB141.prod.exchangelabs.com>
References: <22450DD328862542A918A3BC491F263B561ACC@SN2PRD0102MB141.prod.exchangelabs.com> <22450DD328862542A918A3BC491F263B561B7B@SN2PRD0102MB141.prod.exchangelabs.com>
Message-ID:

Hi Tychele,

Glad to hear that, and thanks for attaching the code as well :).

Just one more heads-up on the code: the trimming function assumes that
for any record sequence, there is at most one matching primer sequence.
If by any random chance a sequence begins with two or more primer
sequences, then it will only trim the first primer sequence. So if you
still see some primer sequences left in the trimmed sequences, this
could be the case and you'll need to modify the code.

However, that seems unlikely ~ the current code should suffice.

cheers,
Bow

On Thu, Apr 5, 2012 at 00:12, Tychele Turner wrote:
> Hi Bow,
>
> Thank you! This works great. I have attached the final code to the email
> in case it may benefit others.
>
> Tychele
>
>
> ________________________________________
> From: Wibowo Arindrarto [w.arindrarto at gmail.com]
> Sent: Wednesday, April 04, 2012 2:05 PM
> To: Tychele Turner
> Cc: biopython at biopython.org
> Subject: Re: [Biopython] biopython question
>
> Hi Tychele,
>
> If I understood correctly, you have a list of primers stored in a file
> and you want to trim those primer sequences off your fastq sequences,
> correct? One way I could think of is to first store the primers in a
> list (since they will be used repeatedly to check every single fastq
> sequence).
>
> Here's the code:
>
> from Bio import SeqIO
>
> def trim_primers(records, primer_file_name):
>     # read the primers
>     primer_list = []
>     with open(primer_file_name, 'r') as source:
>         for line in source:
>             primer_list.append(line.strip())
>
>     for record in records:
>         # list to check if the sequence begins with any of the primers
>         check = [record.seq.startswith(x) for x in primer_list]
>         # if any of the primers is present at the beginning of the
>         # sequence, then we trim it off
>         if any(check):
>             # get index of primer that matches the beginning
>             idx = check.index(True)
>             len_primer = len(primer_list[idx])
>             yield record[len_primer:]
>         # otherwise just return the whole record
>         else:
>             yield record
>
> and then, you can use the function like so:
>
> original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
> trimmed_reads = trim_primers(original_reads, 'primer_file_name')
> count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
> print "Saved %i reads" % count
>
> I haven't tested the function, but I suppose that should do the trick.
>
> Hope that helps :),
> Bow
>
>
> On Wed, Apr 4, 2012 at 18:55, Tychele Turner wrote:
>> Hi,
>>
>> I have a question regarding one of the Biopython capabilities. I would
>> like to trim primers off the end of reads in a fastq file and I found
>> wonderful documentation of how to do this on your website as follows:
>>
>> from Bio import SeqIO
>>
>> def trim_primers(records, primer):
>>     """Removes perfect primer sequences at start of reads.
>>
>>     This is a generator function, the records argument should
>>     be a list or iterator returning SeqRecord objects.
>>     """
>>     len_primer = len(primer) #cache this for later
>>     for record in records:
>>         if record.seq.startswith(primer):
>>             yield record[len_primer:]
>>         else:
>>             yield record
>>
>> original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
>> trimmed_reads = trim_primers(original_reads, "GATGACGGTGT")
>> count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
>> print "Saved %i reads" % count
>>
>> My question is: is there a way to loop through a primer file so that,
>> instead of looking only for 'GATGACGGTGT', every primer would be checked
>> and subsequently removed from the start of its respective read?
>>
>> Primer file structured as:
>> GATGACGGTGT
>> GATGACGGTGA
>> GATGACGGCCT
>>
>> If you have any suggestions it would be greatly appreciated. Thanks.
>>
>> Tychele
>>
>>
>> _______________________________________________
>> Biopython mailing list - Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython

From ferreirafm at usp.br Wed Apr 4 20:01:42 2012
From: ferreirafm at usp.br (ferreirafm at usp.br)
Date: Wed, 04 Apr 2012 21:01:42 -0300
Subject: [Biopython] random peptide sequences
In-Reply-To:
References: <20120404155608.14619h2pl6oj4y88@webmail.usp.br> <20120404165644.15233099ve7c2pvg@webmail.usp.br>
Message-ID: <20120404210142.95126nwmeotjeg9y@webmail.usp.br>

Hi Peter,

It seems I'm getting there, but I can't write the records to a file using
SeqIO.write as usual.
Fred

code:

def random_seq(fastafile):
    records = [ ]
    query = SeqIO.read(fastafile, "fasta")
    peplist = str(query.seq).split('GPGPG')
    peptup = tuple(str(query.seq).split('GPGPG'))
    for pep in peptup:
        outf = open("test.fasta", "w")
        peplist.remove(pep)
        for k in range(10):
            random.shuffle(peplist, random.random)
            peplist.insert(0, pep)
            rec = SeqRecord('GPGPG'.join(peplist), id="pep%s" % k)
            records.append(rec)
            print 'id: %s\nSeq: %s\n' % (rec.id, rec.seq)
            peplist.remove(pep)
        print records
        SeqIO.write(records, outf, "fasta")
        outf.close()
        sys.exit(1)

output:

$ random_pep.py --run br18.fasta

id: pep0
Seq:
EELRSLYNTVATLYCVHGPGPGRDLLLIVTRIVELLGRGPGPGKRWIILGLNKIVRMYSPTSIGPGPGNTSYRLISCNTSVIGPGPGGKIILVAVHVASGYIGPGPGALFYKLDVVPIDGPGPGQRPLVTIKIGGQLKEGPGPGQQLLFIHFRIGCRHSRIGGPGPGELLKTVRLIKFLYQSNPGPGPGTPVNIIGRNLLTQIGGPGPGSPEVIPMFSALSEGPGPGSELYLYKVVKIEPLGVAPGPGPGSLQYLALVALVAPKKGPGPGVLAIVALVVATIIAIGPGPGTMLLGMLMICSAAGPGPGVLEWRFDSRLAFHHVGPGPGDKELYPLASLRSLFGGPGPGEAIIRILQQLLFIHF

id: pep1
Seq:
EELRSLYNTVATLYCVHGPGPGRDLLLIVTRIVELLGRGPGPGALFYKLDVVPIDGPGPGSELYLYKVVKIEPLGVAPGPGPGKRWIILGLNKIVRMYSPTSIGPGPGVLAIVALVVATIIAIGPGPGQRPLVTIKIGGQLKEGPGPGELLKTVRLIKFLYQSNPGPGPGSLQYLALVALVAPKKGPGPGEAIIRILQQLLFIHFGPGPGVLEWRFDSRLAFHHVGPGPGTMLLGMLMICSAAGPGPGQQLLFIHFRIGCRHSRIGGPGPGNTSYRLISCNTSVIGPGPGSPEVIPMFSALSEGPGPGDKELYPLASLRSLFGGPGPGGKIILVAVHVASGYIGPGPGTPVNIIGRNLLTQIG

id: pep2
Seq:
EELRSLYNTVATLYCVHGPGPGVLAIVALVVATIIAIGPGPGSLQYLALVALVAPKKGPGPGQRPLVTIKIGGQLKEGPGPGDKELYPLASLRSLFGGPGPGSPEVIPMFSALSEGPGPGALFYKLDVVPIDGPGPGEAIIRILQQLLFIHFGPGPGKRWIILGLNKIVRMYSPTSIGPGPGGKIILVAVHVASGYIGPGPGELLKTVRLIKFLYQSNPGPGPGSELYLYKVVKIEPLGVAPGPGPGTPVNIIGRNLLTQIGGPGPGTMLLGMLMICSAAGPGPGRDLLLIVTRIVELLGRGPGPGVLEWRFDSRLAFHHVGPGPGQQLLFIHFRIGCRHSRIGGPGPGNTSYRLISCNTSVI

id: pep3
Seq:
EELRSLYNTVATLYCVHGPGPGGKIILVAVHVASGYIGPGPGTPVNIIGRNLLTQIGGPGPGTMLLGMLMICSAAGPGPGEAIIRILQQLLFIHFGPGPGQQLLFIHFRIGCRHSRIGGPGPGKRWIILGLNKIVRMYSPTSIGPGPGSELYLYKVVKIEPLGVAPGPGPGSLQYLALVALVAPKKGPGPGQRPLVTIKIGGQLKEGPGPGRDLLLIVTRIVELLGRGPGPGELLKTVRLIKFLYQSNPGPGPGSPEVIPMFSALSEGPGPGALFYKLDVVPIDGPGPGVLAIVALVVATIIAIGPGPGVLEWRFDSRLAFHHVGPGPGDKELYPLASLRSLFGGPGPGNTSYRLISCNTSVI

id: pep4
Seq:
EELRSLYNTVATLYCVHGPGPGVLAIVALVVATIIAIGPGPGVLEWRFDSRLAFHHVGPGPGKRWIILGLNKIVRMYSPTSIGPGPGQRPLVTIKIGGQLKEGPGPGELLKTVRLIKFLYQSNPGPGPGSELYLYKVVKIEPLGVAPGPGPGALFYKLDVVPIDGPGPGSPEVIPMFSALSEGPGPGNTSYRLISCNTSVIGPGPGSLQYLALVALVAPKKGPGPGTPVNIIGRNLLTQIGGPGPGEAIIRILQQLLFIHFGPGPGQQLLFIHFRIGCRHSRIGGPGPGTMLLGMLMICSAAGPGPGRDLLLIVTRIVELLGRGPGPGDKELYPLASLRSLFGGPGPGGKIILVAVHVASGYI

id: pep5
Seq:
EELRSLYNTVATLYCVHGPGPGSPEVIPMFSALSEGPGPGEAIIRILQQLLFIHFGPGPGTMLLGMLMICSAAGPGPGQQLLFIHFRIGCRHSRIGGPGPGELLKTVRLIKFLYQSNPGPGPGVLEWRFDSRLAFHHVGPGPGQRPLVTIKIGGQLKEGPGPGKRWIILGLNKIVRMYSPTSIGPGPGGKIILVAVHVASGYIGPGPGDKELYPLASLRSLFGGPGPGSLQYLALVALVAPKKGPGPGNTSYRLISCNTSVIGPGPGTPVNIIGRNLLTQIGGPGPGVLAIVALVVATIIAIGPGPGSELYLYKVVKIEPLGVAPGPGPGALFYKLDVVPIDGPGPGRDLLLIVTRIVELLGR

id: pep6
Seq:
EELRSLYNTVATLYCVHGPGPGTPVNIIGRNLLTQIGGPGPGSPEVIPMFSALSEGPGPGVLEWRFDSRLAFHHVGPGPGTMLLGMLMICSAAGPGPGALFYKLDVVPIDGPGPGKRWIILGLNKIVRMYSPTSIGPGPGDKELYPLASLRSLFGGPGPGSELYLYKVVKIEPLGVAPGPGPGRDLLLIVTRIVELLGRGPGPGELLKTVRLIKFLYQSNPGPGPGNTSYRLISCNTSVIGPGPGVLAIVALVVATIIAIGPGPGEAIIRILQQLLFIHFGPGPGGKIILVAVHVASGYIGPGPGQQLLFIHFRIGCRHSRIGGPGPGQRPLVTIKIGGQLKEGPGPGSLQYLALVALVAPKK

id: pep7
Seq:
EELRSLYNTVATLYCVHGPGPGTPVNIIGRNLLTQIGGPGPGTMLLGMLMICSAAGPGPGSLQYLALVALVAPKKGPGPGEAIIRILQQLLFIHFGPGPGSELYLYKVVKIEPLGVAPGPGPGVLAIVALVVATIIAIGPGPGELLKTVRLIKFLYQSNPGPGPGVLEWRFDSRLAFHHVGPGPGQQLLFIHFRIGCRHSRIGGPGPGSPEVIPMFSALSEGPGPGKRWIILGLNKIVRMYSPTSIGPGPGNTSYRLISCNTSVIGPGPGRDLLLIVTRIVELLGRGPGPGALFYKLDVVPIDGPGPGQRPLVTIKIGGQLKEGPGPGDKELYPLASLRSLFGGPGPGGKIILVAVHVASGYI id: pep8 Seq: EELRSLYNTVATLYCVHGPGPGTPVNIIGRNLLTQIGGPGPGRDLLLIVTRIVELLGRGPGPGNTSYRLISCNTSVIGPGPGELLKTVRLIKFLYQSNPGPGPGSPEVIPMFSALSEGPGPGVLEWRFDSRLAFHHVGPGPGEAIIRILQQLLFIHFGPGPGDKELYPLASLRSLFGGPGPGALFYKLDVVPIDGPGPGQQLLFIHFRIGCRHSRIGGPGPGQRPLVTIKIGGQLKEGPGPGSLQYLALVALVAPKKGPGPGGKIILVAVHVASGYIGPGPGTMLLGMLMICSAAGPGPGSELYLYKVVKIEPLGVAPGPGPGVLAIVALVVATIIAIGPGPGKRWIILGLNKIVRMYSPTSI id: pep9 Seq: EELRSLYNTVATLYCVHGPGPGVLEWRFDSRLAFHHVGPGPGSPEVIPMFSALSEGPGPGVLAIVALVVATIIAIGPGPGTPVNIIGRNLLTQIGGPGPGDKELYPLASLRSLFGGPGPGSELYLYKVVKIEPLGVAPGPGPGNTSYRLISCNTSVIGPGPGTMLLGMLMICSAAGPGPGGKIILVAVHVASGYIGPGPGALFYKLDVVPIDGPGPGQRPLVTIKIGGQLKEGPGPGKRWIILGLNKIVRMYSPTSIGPGPGRDLLLIVTRIVELLGRGPGPGEAIIRILQQLLFIHFGPGPGELLKTVRLIKFLYQSNPGPGPGSLQYLALVALVAPKKGPGPGQQLLFIHFRIGCRHSRIG [SeqRecord(seq='EELRSLYNTVATLYCVHGPGPGRDLLLIVTRIVELLGRGPGPGKRWIILGLNKIVRMYSPTSIGPGPGNTSYRLISCNTSVIGPGPGGKIILVAVHVASGYIGPGPGALFYKLDVVPIDGPGPGQRPLVTIKIGGQLKEGPGPGQQLLFIHFRIGCRHSRIGGPGPGELLKTVRLIKFLYQSNPGPGPGTPVNIIGRNLLTQIGGPGPGSPEVIPMFSALSEGPGPGSELYLYKVVKIEPLGVAPGPGPGSLQYLALVALVAPKKGPGPGVLAIVALVVATIIAIGPGPGTMLLGMLMICSAAGPGPGVLEWRFDSRLAFHHVGPGPGDKELYPLASLRSLFGGPGPGEAIIRILQQLLFIHF', id='pep0', name='', description='', dbxrefs=[]), SeqRecord(seq='EELRSLYNTVATLYCVHGPGPGRDLLLIVTRIVELLGRGPGPGALFYKLDVVPIDGPGPGSELYLYKVVKIEPLGVAPGPGPGKRWIILGLNKIVRMYSPTSIGPGPGVLAIVALVVATIIAIGPGPGQRPLVTIKIGGQLKEGPGPGELLKTVRLIKFLYQSNPGPGPGSLQYLALVALVAPKKGPGPGEAIIRILQQLLFIHFGPGPGVLEWRFDSRLAFHHVGPGPGTMLLGMLMICSAAGPGPGQQLLFIHFRIGCRHSRIGGPGPGNTSYRLISCNTSVIGPGPGSPEVIPMFSALSEGPGPGDKELYPLASLRSLFGGPGPGGKIILVAVHVASGYIGPGPGTPVNIIGRNLLTQIG', id='pep1', name='', description='', dbxrefs=[]), SeqRecord(seq='EELRSLYNTVATLYCVHGPGPGVLAIVALVVATIIAIGPGPGSLQYLALVALVAPKKGPGPGQRPLVTIKIGGQLKEGPGPGDKELYPLASLRSLFGGPGPGSPEVIPMFSALSEGPGPGALFYKLDVVPIDGPGPGEAIIRILQQLLFIHFGPGPGKRWIILGLNKIVRMYSPTSIGPGPGGKIILVAVHVASGYIGPGPGELLKTVRLIKFLYQSNPGPGPGSELYLYKVVKIEPLGVAPGPGPGTPVNIIGRNLLTQIGGPGPGTMLLGMLMICSAAGPGPGRDLLLIVTRIVELLGRGPGPGVLEWRFDSRLAFHHVGPGPGQQLLFIHFRIGCRHSRIGGPGPGNTSYRLISCNTSVI', id='pep2', name='', description='', dbxrefs=[]), SeqRecord(seq='EELRSLYNTVATLYCVHGPGPGGKIILVAVHVASGYIGPGPGTPVNIIGRNLLTQIGGPGPGTMLLGMLMICSAAGPGPGEAIIRILQQLLFIHFGPGPGQQLLFIHFRIGCRHSRIGGPGPGKRWIILGLNKIVRMYSPTSIGPGPGSELYLYKVVKIEPLGVAPGPGPGSLQYLALVALVAPKKGPGPGQRPLVTIKIGGQLKEGPGPGRDLLLIVTRIVELLGRGPGPGELLKTVRLIKFLYQSNPGPGPGSPEVIPMFSALSEGPGPGALFYKLDVVPIDGPGPGVLAIVALVVATIIAIGPGPGVLEWRFDSRLAFHHVGPGPGDKELYPLASLRSLFGGPGPGNTSYRLISCNTSVI', id='pep3', name='', description='', dbxrefs=[]), SeqRecord(seq='EELRSLYNTVATLYCVHGPGPGVLAIVALVVATIIAIGPGPGVLEWRFDSRLAFHHVGPGPGKRWIILGLNKIVRMYSPTSIGPGPGQRPLVTIKIGGQLKEGPGPGELLKTVRLIKFLYQSNPGPGPGSELYLYKVVKIEPLGVAPGPGPGALFYKLDVVPIDGPGPGSPEVIPMFSALSEGPGPGNTSYRLISCNTSVIGPGPGSLQYLALVALVAPKKGPGPGTPVNIIGRNLLTQIGGPGPGEAIIRILQQLLFIHFGPGPGQQLLFIHFRIGCRHSRIGGPGPGTMLLGMLMICSAAGPGPGRDLLLIVTRIVELLGRGPGPGDKELYPLASLRSLFGGPGPGGKIILVAVHVASGYI', id='pep4', name='', description='', dbxrefs=[]), 
SeqRecord(seq='EELRSLYNTVATLYCVHGPGPGSPEVIPMFSALSEGPGPGEAIIRILQQLLFIHFGPGPGTMLLGMLMICSAAGPGPGQQLLFIHFRIGCRHSRIGGPGPGELLKTVRLIKFLYQSNPGPGPGVLEWRFDSRLAFHHVGPGPGQRPLVTIKIGGQLKEGPGPGKRWIILGLNKIVRMYSPTSIGPGPGGKIILVAVHVASGYIGPGPGDKELYPLASLRSLFGGPGPGSLQYLALVALVAPKKGPGPGNTSYRLISCNTSVIGPGPGTPVNIIGRNLLTQIGGPGPGVLAIVALVVATIIAIGPGPGSELYLYKVVKIEPLGVAPGPGPGALFYKLDVVPIDGPGPGRDLLLIVTRIVELLGR', id='pep5', name='', description='', dbxrefs=[]), SeqRecord(seq='EELRSLYNTVATLYCVHGPGPGTPVNIIGRNLLTQIGGPGPGSPEVIPMFSALSEGPGPGVLEWRFDSRLAFHHVGPGPGTMLLGMLMICSAAGPGPGALFYKLDVVPIDGPGPGKRWIILGLNKIVRMYSPTSIGPGPGDKELYPLASLRSLFGGPGPGSELYLYKVVKIEPLGVAPGPGPGRDLLLIVTRIVELLGRGPGPGELLKTVRLIKFLYQSNPGPGPGNTSYRLISCNTSVIGPGPGVLAIVALVVATIIAIGPGPGEAIIRILQQLLFIHFGPGPGGKIILVAVHVASGYIGPGPGQQLLFIHFRIGCRHSRIGGPGPGQRPLVTIKIGGQLKEGPGPGSLQYLALVALVAPKK', id='pep6', name='', description='', dbxrefs=[]), SeqRecord(seq='EELRSLYNTVATLYCVHGPGPGTPVNIIGRNLLTQIGGPGPGTMLLGMLMICSAAGPGPGSLQYLALVALVAPKKGPGPGEAIIRILQQLLFIHFGPGPGSELYLYKVVKIEPLGVAPGPGPGVLAIVALVVATIIAIGPGPGELLKTVRLIKFLYQSNPGPGPGVLEWRFDSRLAFHHVGPGPGQQLLFIHFRIGCRHSRIGGPGPGSPEVIPMFSALSEGPGPGKRWIILGLNKIVRMYSPTSIGPGPGNTSYRLISCNTSVIGPGPGRDLLLIVTRIVELLGRGPGPGALFYKLDVVPIDGPGPGQRPLVTIKIGGQLKEGPGPGDKELYPLASLRSLFGGPGPGGKIILVAVHVASGYI', id='pep7', name='', description='', dbxrefs=[]), SeqRecord(seq='EELRSLYNTVATLYCVHGPGPGTPVNIIGRNLLTQIGGPGPGRDLLLIVTRIVELLGRGPGPGNTSYRLISCNTSVIGPGPGELLKTVRLIKFLYQSNPGPGPGSPEVIPMFSALSEGPGPGVLEWRFDSRLAFHHVGPGPGEAIIRILQQLLFIHFGPGPGDKELYPLASLRSLFGGPGPGALFYKLDVVPIDGPGPGQQLLFIHFRIGCRHSRIGGPGPGQRPLVTIKIGGQLKEGPGPGSLQYLALVALVAPKKGPGPGGKIILVAVHVASGYIGPGPGTMLLGMLMICSAAGPGPGSELYLYKVVKIEPLGVAPGPGPGVLAIVALVVATIIAIGPGPGKRWIILGLNKIVRMYSPTSI', id='pep8', name='', description='', dbxrefs=[]), SeqRecord(seq='EELRSLYNTVATLYCVHGPGPGVLEWRFDSRLAFHHVGPGPGSPEVIPMFSALSEGPGPGVLAIVALVVATIIAIGPGPGTPVNIIGRNLLTQIGGPGPGDKELYPLASLRSLFGGPGPGSELYLYKVVKIEPLGVAPGPGPGNTSYRLISCNTSVIGPGPGTMLLGMLMICSAAGPGPGGKIILVAVHVASGYIGPGPGALFYKLDVVPIDGPGPGQRPLVTIKIGGQLKEGPGPGKRWIILGLNKIVRMYSPTSIGPGPGRDLLLIVTRIVELLGRGPGPGEAIIRILQQLLFIHFGPGPGELLKTVRLIKFLYQSNPGPGPGSLQYLALVALVAPKKGPGPGQQLLFIHFRIGCRHSRIG', id='pep9', name='', description='', dbxrefs=[])] Traceback (most recent call last): File "/home/ferreirafm/bin/random_pep.py", line 173, in main() File "/home/ferreirafm/bin/random_pep.py", line 156, in main random_seq(fastafile) File "/home/ferreirafm/bin/random_pep.py", line 39, in random_seq SeqIO.write(records, outf, "fasta") File "/usr/lib64/python2.7/site-packages/Bio/SeqIO/__init__.py", line 412, in write count = writer_class(handle).write_file(sequences) File "/usr/lib64/python2.7/site-packages/Bio/SeqIO/Interfaces.py", line 271, in write_file count = self.write_records(records) File "/usr/lib64/python2.7/site-packages/Bio/SeqIO/Interfaces.py", line 256, in write_records self.write_record(record) File "/usr/lib64/python2.7/site-packages/Bio/SeqIO/FastaIO.py", line 136, in write_record data = self._get_seq_string(record) #Catches sequence being None File "/usr/lib64/python2.7/site-packages/Bio/SeqIO/Interfaces.py", line 164, in _get_seq_string % record.id) TypeError: SeqRecord (id=pep0) has an invalid sequence. Citando Peter Cock : > On Wed, Apr 4, 2012 at 8:56 PM, wrote: >> >> Hi Peter, >> Thanks for helping. I'll try something like that and let you know the >> results. 
>> Fred > > Good luck - and please reply on the list to let us know how you get on :) > > Peter > From tturne18 at jhmi.edu Wed Apr 4 18:12:14 2012 From: tturne18 at jhmi.edu (Tychele Turner) Date: Wed, 4 Apr 2012 22:12:14 +0000 Subject: [Biopython] biopython question In-Reply-To: References: <22450DD328862542A918A3BC491F263B561ACC@SN2PRD0102MB141.prod.exchangelabs.com>, Message-ID: <22450DD328862542A918A3BC491F263B561B7B@SN2PRD0102MB141.prod.exchangelabs.com> Hi Bow, Thank you! This works great. I have attached the final code to the email in case it may benefit others. Tychele ________________________________________ From: Wibowo Arindrarto [w.arindrarto at gmail.com] Sent: Wednesday, April 04, 2012 2:05 PM To: Tychele Turner Cc: biopython at biopython.org Subject: Re: [Biopython] biopython question Hi Tychele, If I understood correctly, you have a list of primers stored in a file and you want to trim those primer sequences off your fastq sequences, correct? One way I could think of is to first store the primers in a list (since they will be used repeatedly to check every single fastq sequence). Here's the code: from Bio import SeqIO def trim_primers(records, 'primer_file_name'): # read the primers primer_list = [] with open('primer_file_name', 'r') as source: for line in source: primer_list.append(line.strip()) for record in records: # list to check if the sequence begins with any of the primers check = [record.seq.startswith(x) for x in primer_list] # if any of the primer is present in the beginning of the sequence, then we trim it off if any(check): # get index of primer that matches the beginning idx = check.index(True) len_primer = len(primer_list[idx]) yield record[len_primer:] # otherwise just return the whole record else: yield record and then, you can use the function like so: original_reads = SeqIO.parse("SRR020192.fastq", "fastq") trimmed_reads = trim_primers(original_reads, 'primer_file_name') count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") print "Saved %i reads" % count I haven't tested the function, but I suppose that should do the trick. Hope that helps :), Bow On Wed, Apr 4, 2012 at 18:55, Tychele Turner wrote: > Hi, > > I have a question regarding one of the biopython capabilities. I would like to trim primers off the end of reads in a fastq file and I found wonderful documentation of how to do this on your website as follows: > > from Bio import SeqIO > def trim_primers(records, primer): > """Removes perfect primer sequences at start of reads. > > This is a generator function, the records argument should > be a list or iterator returning SeqRecord objects. > """ > len_primer = len(primer) #cache this for later > for record in records: > if record.seq.startswith(primer): > yield record[len_primer:] > else: > yield record > > original_reads = SeqIO.parse("SRR020192.fastq", "fastq") > trimmed_reads = trim_primers(original_reads, "GATGACGGTGT") > count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") > print "Saved %i reads" % count > > > > > My question is: Is there a way to loop through a primer file for instance instead of looking for only > > 'GATGACGGTGT' every primer would be checked and subsequently removed from the start of its respective read. > > Primer file structured as: > GATGACGGTGT > GATGACGGTGA > GATGACGGCCT > > If you have any suggestions it would be greatly appreciated. Thanks. 
> > Tychele
> >
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

-------------- next part --------------
A non-text attachment was scrubbed...
Name: testTrimPrimers.py
Type: text/x-python-script
Size: 1181 bytes
Desc: testTrimPrimers.py
URL:

From chapmanb at 50mail.com Thu Apr 5 07:22:00 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 05 Apr 2012 07:22:00 -0400
Subject: [Biopython] GSOC Genome Variants proposal
In-Reply-To:
References: <87ty13te2e.fsf@fastmail.fm>
Message-ID: <87y5qabezr.fsf@fastmail.fm>

Chris;
Thanks for the updates, you're putting together a solid proposal. I
added a couple of additional comments and pointers which should
hopefully be helpful.

My other practical suggestion is to include a link to your Google Doc
from the official proposal in GSoC Melange. This will allow you to
update your proposal in response to any reviewer comments, since
Melange doesn't allow edits after Friday.

Best of luck with the review process and thanks again for all of the
work on the proposal,
Brad

> I put some more updates on it. I'll have it finished by the end of
> Thursday, but any comments on my changes are appreciated. I expanded on my
> timeline and just need to fill in Weeks 6-11.
>
> On Sun, Apr 1, 2012 at 4:03 PM, Brad Chapman wrote:
>
> >
> > Chris;
> > Thanks for putting this together: that's a great start. I left specific
> > suggestions as comments in the document but in general the next step is
> > to expand your timeline to be increasingly specific about your plans. It
> > sounds like you have a good handle on the overview, now the full nitty
> > gritty details start.
> >
> > Brad
> >
> > > Hey everyone,
> > >
> > > Here's a draft of my proposal:
> > >
> > > https://docs.google.com/document/d/1DNm8NQmnP4fH8KvF9v107mo4__V-FqWtx8Vgsz5yNns/edit
> > >
> > > I've allowed comments to be put in. Please tear it to shreds :).
> > >
> > > Thanks,
> > > Chris
> > > _______________________________________________
> > > Biopython mailing list - Biopython at lists.open-bio.org
> > > http://lists.open-bio.org/mailman/listinfo/biopython

Non-text part: text/html

From chaitanya.talnikar at iitb.ac.in Thu Apr 5 17:16:13 2012
From: chaitanya.talnikar at iitb.ac.in (Chaitanya Talnikar)
Date: Fri, 6 Apr 2012 02:46:13 +0530
Subject: [Biopython] GSoC project application: Representation and manipulation of genomic variants: Some questions
In-Reply-To:
References: <87398vmxhj.fsf@fastmail.fm> <874nt9k11m.fsf@fastmail.fm> <87wr626gc5.fsf@fastmail.fm>
Message-ID:

Hi all,

I have modified my proposal based on the comments. I have also updated
the proposal on the GSoC website, along with a link to the Google Doc.

Regards,
Chaitanya

On Wed, Apr 4, 2012 at 6:03 AM, Reece Hart wrote:
> On Sun, Apr 1, 2012 at 2:42 AM, Chaitanya Talnikar
> wrote:
>>
>> I have uploaded a second draft incorporating the changes. Please
>> provide comments on my proposal.
>
>
> Hi Chaitanya-
>
> I also read your proposal last night. My comments mostly echo Brad's,
> although there are a couple of new ones I think.
>
> I'll be happy to reread or answer questions as needed.
>
> -Reece
>

From zhigang.wu at email.ucr.edu Thu Apr 5 17:49:05 2012
From: zhigang.wu at email.ucr.edu (Zhigang Wu)
Date: Thu, 5 Apr 2012 14:49:05 -0700
Subject: [Biopython] Biopython GSoC Proposal
In-Reply-To: <87ty166g9c.fsf@fastmail.fm>
References: <87ty166g9c.fsf@fastmail.fm>
Message-ID:

Hi Brad,

Thanks for your comments.
I have made substantial modifications to my proposal, which I think is close to submission. You and all others in the community are welcome to make any further comments and suggestions. Regards, Zhigang On Thu, Mar 29, 2012 at 6:15 PM, Brad Chapman wrote: > > Zhigang; > > > Here I am posting my draft of proposal, in which I have proposed to > > implement the SearchIO module. Please follow the link to access it > > > https://docs.google.com/document/d/15fkPAZfN2Ln8nMJr4Ad7lMscaGbKOiTaXcGpxxvIe3A/edit > > Thanks for putting this together. You've got an excellent start. I added > comments in the document on specific areas. Let us know if you have any > questions or need followup on any points. Thanks again, > Brad > From mictadlo at gmail.com Thu Apr 5 22:06:54 2012 From: mictadlo at gmail.com (Mic) Date: Fri, 6 Apr 2012 12:06:54 +1000 Subject: [Biopython] biopython question In-Reply-To: References: <22450DD328862542A918A3BC491F263B561ACC@SN2PRD0102MB141.prod.exchangelabs.com> <22450DD328862542A918A3BC491F263B561B7B@SN2PRD0102MB141.prod.exchangelabs.com> Message-ID: What is the difference between removing primers from the fastq file and using MarkDuplicates http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates on an alignment? Would both ways deliver the same results? Thank you in advance. On Thu, Apr 5, 2012 at 8:26 AM, Wibowo Arindrarto wrote: > Hi Tychele, > > Glad to hear that and thanks for attaching the code as well :). > > Just one more heads up on the code, the trimming function assumes that > for any record sequence, there is only one matching primer sequence at > most. If by any random chance a sequence begins with two or more > primer sequences, then it will only trim the first primer sequence. So > if you still see some primer sequences left in the trimmed sequences, > this could be the case and you'll need to modify the code. > > However, that seems unlikely ~ the current code should suffice. > > cheers, > Bow > > > On Thu, Apr 5, 2012 at 00:12, Tychele Turner wrote: > > Hi Bow, > > > > Thank you! This works great. I have attached the final code to the email > in case it may benefit others. > > > > Tychele > > > > > > ________________________________________ > > From: Wibowo Arindrarto [w.arindrarto at gmail.com] > > Sent: Wednesday, April 04, 2012 2:05 PM > > To: Tychele Turner > > Cc: biopython at biopython.org > > Subject: Re: [Biopython] biopython question > > > > Hi Tychele, > > > > If I understood correctly, you have a list of primers stored in a file > > and you want to trim those primer sequences off your fastq sequences, > > correct? One way I could think of is to first store the primers in a > > list (since they will be used repeatedly to check every single fastq > > sequence).
> > > > Here's the code: > > > > from Bio import SeqIO > > > > def trim_primers(records, 'primer_file_name'): > > > > # read the primers > > primer_list = [] > > with open('primer_file_name', 'r') as source: > > for line in source: > > primer_list.append(line.strip()) > > > > for record in records: > > # list to check if the sequence begins with any of the primers > > check = [record.seq.startswith(x) for x in primer_list] > > # if any of the primer is present in the beginning of the > > sequence, then we trim it off > > if any(check): > > # get index of primer that matches the beginning > > idx = check.index(True) > > len_primer = len(primer_list[idx]) > > yield record[len_primer:] > > # otherwise just return the whole record > > else: > > yield record > > > > and then, you can use the function like so: > > > > original_reads = SeqIO.parse("SRR020192.fastq", "fastq") > > trimmed_reads = trim_primers(original_reads, 'primer_file_name') > > count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") > > print "Saved %i reads" % count > > > > I haven't tested the function, but I suppose that should do the trick. > > > > Hope that helps :), > > Bow > > > > > > On Wed, Apr 4, 2012 at 18:55, Tychele Turner wrote: > >> Hi, > >> > >> I have a question regarding one of the biopython capabilities. I would > like to trim primers off the end of reads in a fastq file and I found > wonderful documentation of how to do this on your website as follows: > >> > >> from Bio import SeqIO > >> def trim_primers(records, primer): > >> """Removes perfect primer sequences at start of reads. > >> > >> This is a generator function, the records argument should > >> be a list or iterator returning SeqRecord objects. > >> """ > >> len_primer = len(primer) #cache this for later > >> for record in records: > >> if record.seq.startswith(primer): > >> yield record[len_primer:] > >> else: > >> yield record > >> > >> original_reads = SeqIO.parse("SRR020192.fastq", "fastq") > >> trimmed_reads = trim_primers(original_reads, "GATGACGGTGT") > >> count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") > >> print "Saved %i reads" % count > >> > >> > >> > >> > >> My question is: Is there a way to loop through a primer file for > instance instead of looking for only > >> > >> 'GATGACGGTGT' every primer would be checked and subsequently removed > from the start of its respective read. > >> > >> Primer file structured as: > >> GATGACGGTGT > >> GATGACGGTGA > >> GATGACGGCCT > >> > >> If you have any suggestions it would be greatly appreciated. Thanks. > >> > >> Tychele > >> > >> > >> _______________________________________________ > >> Biopython mailing list - Biopython at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/biopython > > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From w.arindrarto at gmail.com Thu Apr 5 22:20:09 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Fri, 6 Apr 2012 04:20:09 +0200 Subject: [Biopython] biopython question In-Reply-To: References: <22450DD328862542A918A3BC491F263B561ACC@SN2PRD0102MB141.prod.exchangelabs.com> <22450DD328862542A918A3BC491F263B561B7B@SN2PRD0102MB141.prod.exchangelabs.com> Message-ID: Hi Mic, I'm not familiar with picard, but it seems that this program detects whole duplicate molecules instead of detecting whether a primer is present in sequences (which may or may not be duplicates). 
Plus, it doesn't do any removal ~ it only flags them. So I don't think the two are comparable. cheers, Bow On Fri, Apr 6, 2012 at 04:06, Mic wrote: > What is the difference to remove primer from the fastq file rather to use > markDuplicates?http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates?on > an?alignment? > > Would both ways deliver the same results? > > Thank you in advance. > > > On Thu, Apr 5, 2012 at 8:26 AM, Wibowo Arindrarto > wrote: >> >> Hi Tychele, >> >> Glad to hear that and thanks for attaching the code as well :). >> >> Just one more heads up on the code, the trimming function assumes that >> for any record sequence, there is only one matching primer sequence at >> most. If by any random chance a sequence begins with two or more >> primer sequences, then it will only trim the first primer sequence. So >> if you still see some primer sequences left in the trimmed sequences, >> this could be the case and you'll need to modify the code. >> >> However, that seems unlikely ~ the current code should suffice. >> >> cheers, >> Bow >> >> >> On Thu, Apr 5, 2012 at 00:12, Tychele Turner wrote: >> > Hi Bow, >> > >> > Thank you! This works great. I have attached the final code to the email >> > in case it may benefit others. >> > >> > Tychele >> > >> > >> > ________________________________________ >> > From: Wibowo Arindrarto [w.arindrarto at gmail.com] >> > Sent: Wednesday, April 04, 2012 2:05 PM >> > To: Tychele Turner >> > Cc: biopython at biopython.org >> > Subject: Re: [Biopython] biopython question >> > >> > Hi Tychele, >> > >> > If I understood correctly, you have a list of primers stored in a file >> > and you want to trim those primer sequences off your fastq sequences, >> > correct? One way I could think of is to first store the primers in a >> > list (since they will be used repeatedly to check every single fastq >> > sequence). >> > >> > Here's the code: >> > >> > from Bio import SeqIO >> > >> > def trim_primers(records, 'primer_file_name'): >> > >> > ? ?# read the primers >> > ? ?primer_list = [] >> > ? ?with open('primer_file_name', 'r') as source: >> > ? ? ?for line in source: >> > ? ? ? ?primer_list.append(line.strip()) >> > >> > ? ?for record in records: >> > ? ? ? ?# list to check if the sequence begins with any of the primers >> > ? ? ? ?check = [record.seq.startswith(x) for x in primer_list] >> > ? ? ? ?# if any of the primer is present in the beginning of the >> > sequence, then we trim it off >> > ? ? ? ?if any(check): >> > ? ? ? ? ? ?# get index of primer that matches the beginning >> > ? ? ? ? ? ?idx = check.index(True) >> > ? ? ? ? ? ?len_primer = len(primer_list[idx]) >> > ? ? ? ? ? ?yield record[len_primer:] >> > ? ? ? ?# otherwise just return the whole record >> > ? ? ? ?else: >> > ? ? ? ? ? ?yield record >> > >> > and then, you can use the function like so: >> > >> > original_reads = SeqIO.parse("SRR020192.fastq", "fastq") >> > trimmed_reads = trim_primers(original_reads, 'primer_file_name') >> > count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") >> > print "Saved %i reads" % count >> > >> > I haven't tested the function, but I suppose that should do the trick. >> > >> > Hope that helps :), >> > Bow >> > >> > >> > On Wed, Apr 4, 2012 at 18:55, Tychele Turner wrote: >> >> Hi, >> >> >> >> I have a question regarding one of the biopython capabilities. 
I would >> >> like to trim primers off the end of reads in a fastq file and I found >> >> wonderful documentation of how to do this on your website as follows: >> >> >> >> from Bio import SeqIO >> >> def trim_primers(records, primer): >> >> ? ?"""Removes perfect primer sequences at start of reads. >> >> >> >> ? ?This is a generator function, the records argument should >> >> ? ?be a list or iterator returning SeqRecord objects. >> >> ? ?""" >> >> ? ?len_primer = len(primer) #cache this for later >> >> ? ?for record in records: >> >> ? ? ? ?if record.seq.startswith(primer): >> >> ? ? ? ? ? ?yield record[len_primer:] >> >> ? ? ? ?else: >> >> ? ? ? ? ? ?yield record >> >> >> >> original_reads = SeqIO.parse("SRR020192.fastq", "fastq") >> >> trimmed_reads = trim_primers(original_reads, "GATGACGGTGT") >> >> count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") >> >> print "Saved %i reads" % count >> >> >> >> >> >> >> >> >> >> My question is: Is there a way to loop through a primer file for >> >> instance instead of looking for only >> >> >> >> 'GATGACGGTGT' every primer would be checked and subsequently removed >> >> from the start of its respective read. >> >> >> >> Primer file structured as: >> >> GATGACGGTGT >> >> GATGACGGTGA >> >> GATGACGGCCT >> >> >> >> If you have any suggestions it would be greatly appreciated. Thanks. >> >> >> >> Tychele >> >> >> >> >> >> _______________________________________________ >> >> Biopython mailing list ?- ?Biopython at lists.open-bio.org >> >> http://lists.open-bio.org/mailman/listinfo/biopython >> > >> >> _______________________________________________ >> Biopython mailing list ?- ?Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > From mictadlo at gmail.com Fri Apr 6 05:59:36 2012 From: mictadlo at gmail.com (Mic) Date: Fri, 6 Apr 2012 19:59:36 +1000 Subject: [Biopython] biopython question In-Reply-To: References: <22450DD328862542A918A3BC491F263B561ACC@SN2PRD0102MB141.prod.exchangelabs.com> <22450DD328862542A918A3BC491F263B561B7B@SN2PRD0102MB141.prod.exchangelabs.com> Message-ID: Hi Bow, You can remove duplicates in the input file or create a new output file. With the following commands you create an output file with no duplicates: $ samtools fixmate t.paired.sorted.bam t.paired.sorted.SamFix.bam $ java -Xmx8g -jar MarkDuplicates.jar REMOVE_DUPLICATES=true ASSUME_SORTED=true MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1024 INPUT=t.paired.sorted.bam OUTPUT=t.paired.sorted.rmdulp.bam METRICS_FILE=t.paired.sorted.bam.metrics Are adapters and fragments the same? I found the following software for adapters: * TagDust - eliminate artifactual sequence from NGS data http://www.biomedcentral.com/1471-2164/12/382 http://bioinformatics.oxfordjournals.org/content/25/21/2839.full * FAR: http://sourceforge.net/apps/mediawiki/theflexibleadap/index.php?title=Main_Page * Trimmomatic: http://www.usadellab.org/cms/index.php?page=trimmomatic * http://code.google.com/p/cutadapt/ * https://github.com/vsbuffalo/scythe * http://code.google.com/p/biopieces/wiki/find_adaptor Thank you in advance. Cheers, On Fri, Apr 6, 2012 at 12:20 PM, Wibowo Arindrarto wrote: > Hi Mic, > > I'm not familiar with picard, but it seems that this program detects > whole duplicate molecules instead of detecting whether a primer is > present in sequences (which may or may not be duplicates). Plus, it > doesn't do any removal ~ it only flags them. So I don't think the two > are comparable.
> > cheers, > Bow > > On Fri, Apr 6, 2012 at 04:06, Mic wrote: > > What is the difference to remove primer from the fastq file rather to use > > markDuplicates > http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates > on > > an alignment? > > > > Would both ways deliver the same results? > > > > Thank you in advance. > > > > > > On Thu, Apr 5, 2012 at 8:26 AM, Wibowo Arindrarto < > w.arindrarto at gmail.com> > > wrote: > >> > >> Hi Tychele, > >> > >> Glad to hear that and thanks for attaching the code as well :). > >> > >> Just one more heads up on the code, the trimming function assumes that > >> for any record sequence, there is only one matching primer sequence at > >> most. If by any random chance a sequence begins with two or more > >> primer sequences, then it will only trim the first primer sequence. So > >> if you still see some primer sequences left in the trimmed sequences, > >> this could be the case and you'll need to modify the code. > >> > >> However, that seems unlikely ~ the current code should suffice. > >> > >> cheers, > >> Bow > >> > >> > >> On Thu, Apr 5, 2012 at 00:12, Tychele Turner wrote: > >> > Hi Bow, > >> > > >> > Thank you! This works great. I have attached the final code to the > email > >> > in case it may benefit others. > >> > > >> > Tychele > >> > > >> > > >> > ________________________________________ > >> > From: Wibowo Arindrarto [w.arindrarto at gmail.com] > >> > Sent: Wednesday, April 04, 2012 2:05 PM > >> > To: Tychele Turner > >> > Cc: biopython at biopython.org > >> > Subject: Re: [Biopython] biopython question > >> > > >> > Hi Tychele, > >> > > >> > If I understood correctly, you have a list of primers stored in a file > >> > and you want to trim those primer sequences off your fastq sequences, > >> > correct? One way I could think of is to first store the primers in a > >> > list (since they will be used repeatedly to check every single fastq > >> > sequence). > >> > > >> > Here's the code: > >> > > >> > from Bio import SeqIO > >> > > >> > def trim_primers(records, 'primer_file_name'): > >> > > >> > # read the primers > >> > primer_list = [] > >> > with open('primer_file_name', 'r') as source: > >> > for line in source: > >> > primer_list.append(line.strip()) > >> > > >> > for record in records: > >> > # list to check if the sequence begins with any of the primers > >> > check = [record.seq.startswith(x) for x in primer_list] > >> > # if any of the primer is present in the beginning of the > >> > sequence, then we trim it off > >> > if any(check): > >> > # get index of primer that matches the beginning > >> > idx = check.index(True) > >> > len_primer = len(primer_list[idx]) > >> > yield record[len_primer:] > >> > # otherwise just return the whole record > >> > else: > >> > yield record > >> > > >> > and then, you can use the function like so: > >> > > >> > original_reads = SeqIO.parse("SRR020192.fastq", "fastq") > >> > trimmed_reads = trim_primers(original_reads, 'primer_file_name') > >> > count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") > >> > print "Saved %i reads" % count > >> > > >> > I haven't tested the function, but I suppose that should do the trick. > >> > > >> > Hope that helps :), > >> > Bow > >> > > >> > > >> > On Wed, Apr 4, 2012 at 18:55, Tychele Turner > wrote: > >> >> Hi, > >> >> > >> >> I have a question regarding one of the biopython capabilities. 
I > would > >> >> like to trim primers off the end of reads in a fastq file and I found > >> >> wonderful documentation of how to do this on your website as follows: > >> >> > >> >> from Bio import SeqIO > >> >> def trim_primers(records, primer): > >> >> """Removes perfect primer sequences at start of reads. > >> >> > >> >> This is a generator function, the records argument should > >> >> be a list or iterator returning SeqRecord objects. > >> >> """ > >> >> len_primer = len(primer) #cache this for later > >> >> for record in records: > >> >> if record.seq.startswith(primer): > >> >> yield record[len_primer:] > >> >> else: > >> >> yield record > >> >> > >> >> original_reads = SeqIO.parse("SRR020192.fastq", "fastq") > >> >> trimmed_reads = trim_primers(original_reads, "GATGACGGTGT") > >> >> count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") > >> >> print "Saved %i reads" % count > >> >> > >> >> > >> >> > >> >> > >> >> My question is: Is there a way to loop through a primer file for > >> >> instance instead of looking for only > >> >> > >> >> 'GATGACGGTGT' every primer would be checked and subsequently removed > >> >> from the start of its respective read. > >> >> > >> >> Primer file structured as: > >> >> GATGACGGTGT > >> >> GATGACGGTGA > >> >> GATGACGGCCT > >> >> > >> >> If you have any suggestions it would be greatly appreciated. Thanks. > >> >> > >> >> Tychele > >> >> > >> >> > >> >> _______________________________________________ > >> >> Biopython mailing list - Biopython at lists.open-bio.org > >> >> http://lists.open-bio.org/mailman/listinfo/biopython > >> > > >> > >> _______________________________________________ > >> Biopython mailing list - Biopython at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/biopython > > > > > From ferreirafm at usp.br Fri Apr 6 06:44:04 2012 From: ferreirafm at usp.br (ferreirafm at usp.br) Date: Fri, 06 Apr 2012 07:44:04 -0300 Subject: [Biopython] random peptide sequences In-Reply-To: References: <20120404155608.14619h2pl6oj4y88@webmail.usp.br> <20120404165644.15233099ve7c2pvg@webmail.usp.br> <20120404210142.95126nwmeotjeg9y@webmail.usp.br> Message-ID: <20120406074404.12906v4jd37l4s8k@webmail.usp.br> Hi Peter, quick replaying also: thanks. Citando Peter Cock : > Just a quick reply - try changing this: > > rec = SeqRecord('GPGPG'.join(peplist), id="pep%s" % k) > > to > > rec = SeqRecord(Seq('GPGPG'.join(peplist)), id="pep%s" % k) > > You'll need to add this import line at the start as well, > from Bio.Seq import Seq > From 88whacko at gmail.com Fri Apr 6 09:01:49 2012 From: 88whacko at gmail.com (Andrea Rizzi) Date: Fri, 6 Apr 2012 15:01:49 +0200 Subject: [Biopython] GSoC - variants proposal - Andrea Rizzi Message-ID: Hi everybody, I'm a master student at Royal School of Technology in Stockholm. My program is Computational and System Biology and I'm interested in the representation and manipulation of variants project. I hope I'll have the chance to work with you. Here is the link to my proposal on google docs: https://docs.google.com/document/d/1iAjuOT1MzfMYDPr7ghDCB8pkWRdJqT_Pr5cvUEKekgg/ Reece, Brad: thank you very much for finding the time to answer me. The proposal on google docs is now publicly available with comments enabled and I've added a short summary in my google application. Cheers! 
-- -- Andrea From tturne18 at jhmi.edu Sat Apr 7 11:05:30 2012 From: tturne18 at jhmi.edu (Tychele Turner) Date: Sat, 7 Apr 2012 15:05:30 +0000 Subject: [Biopython] [Samtools-help] biopython question In-Reply-To: References: <22450DD328862542A918A3BC491F263B561ACC@SN2PRD0102MB141.prod.exchangelabs.com> <22450DD328862542A918A3BC491F263B561B7B@SN2PRD0102MB141.prod.exchangelabs.com> , Message-ID: <22450DD328862542A918A3BC491F263B561E92@SN2PRD0102MB141.prod.exchangelabs.com> Hi Mic, I just saw your message regarding Mark Duplicates and the script Bow and I were discussing which recognizes and cleaves primers. First off, I'm familiar with Mark Duplicates from Picard and I do use it for exome data. However, in this instance I was looking at sequences coming from short amplicon sequencing. Here, marking duplicates is not appropriate because most of the reads will be duplicates due to the nature of the bench experiment (in contrast to shotgun sequencing, where you're looking at random fragments in which PCR artifacts arise in the PCR steps post-shearing). In my short amplicon sequence data, the read will start with the primer sequence and then extend to be a total length of 100 nucleotides. For this reason, I wanted to use a script which could recognize the primer and ultimately cleave that primer from the read, so it would not go into the rest of the pipeline, which would ultimately go to a variant calling program. As for your last point about other software which cuts adapters: that's fine, but I'm not cutting adapters; I'm looking for primer sequences and cleaving those.
cheers, Bow On Fri, Apr 6, 2012 at 04:06, Mic > wrote: > What is the difference to remove primer from the fastq file rather to use > markDuplicates http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates on > an alignment? > > Would both ways deliver the same results? > > Thank you in advance. > > > On Thu, Apr 5, 2012 at 8:26 AM, Wibowo Arindrarto > > wrote: >> >> Hi Tychele, >> >> Glad to hear that and thanks for attaching the code as well :). >> >> Just one more heads up on the code, the trimming function assumes that >> for any record sequence, there is only one matching primer sequence at >> most. If by any random chance a sequence begins with two or more >> primer sequences, then it will only trim the first primer sequence. So >> if you still see some primer sequences left in the trimmed sequences, >> this could be the case and you'll need to modify the code. >> >> However, that seems unlikely ~ the current code should suffice. >> >> cheers, >> Bow >> >> >> On Thu, Apr 5, 2012 at 00:12, Tychele Turner > wrote: >> > Hi Bow, >> > >> > Thank you! This works great. I have attached the final code to the email >> > in case it may benefit others. >> > >> > Tychele >> > >> > >> > ________________________________________ >> > From: Wibowo Arindrarto [w.arindrarto at gmail.com] >> > Sent: Wednesday, April 04, 2012 2:05 PM >> > To: Tychele Turner >> > Cc: biopython at biopython.org >> > Subject: Re: [Biopython] biopython question >> > >> > Hi Tychele, >> > >> > If I understood correctly, you have a list of primers stored in a file >> > and you want to trim those primer sequences off your fastq sequences, >> > correct? One way I could think of is to first store the primers in a >> > list (since they will be used repeatedly to check every single fastq >> > sequence). >> > >> > Here's the code: >> > >> > from Bio import SeqIO >> > >> > def trim_primers(records, 'primer_file_name'): >> > >> > # read the primers >> > primer_list = [] >> > with open('primer_file_name', 'r') as source: >> > for line in source: >> > primer_list.append(line.strip()) >> > >> > for record in records: >> > # list to check if the sequence begins with any of the primers >> > check = [record.seq.startswith(x) for x in primer_list] >> > # if any of the primer is present in the beginning of the >> > sequence, then we trim it off >> > if any(check): >> > # get index of primer that matches the beginning >> > idx = check.index(True) >> > len_primer = len(primer_list[idx]) >> > yield record[len_primer:] >> > # otherwise just return the whole record >> > else: >> > yield record >> > >> > and then, you can use the function like so: >> > >> > original_reads = SeqIO.parse("SRR020192.fastq", "fastq") >> > trimmed_reads = trim_primers(original_reads, 'primer_file_name') >> > count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") >> > print "Saved %i reads" % count >> > >> > I haven't tested the function, but I suppose that should do the trick. >> > >> > Hope that helps :), >> > Bow >> > >> > >> > On Wed, Apr 4, 2012 at 18:55, Tychele Turner > wrote: >> >> Hi, >> >> >> >> I have a question regarding one of the biopython capabilities. I would >> >> like to trim primers off the end of reads in a fastq file and I found >> >> wonderful documentation of how to do this on your website as follows: >> >> >> >> from Bio import SeqIO >> >> def trim_primers(records, primer): >> >> """Removes perfect primer sequences at start of reads. 
>> >> >> >> This is a generator function, the records argument should >> >> be a list or iterator returning SeqRecord objects. >> >> """ >> >> len_primer = len(primer) #cache this for later >> >> for record in records: >> >> if record.seq.startswith(primer): >> >> yield record[len_primer:] >> >> else: >> >> yield record >> >> >> >> original_reads = SeqIO.parse("SRR020192.fastq", "fastq") >> >> trimmed_reads = trim_primers(original_reads, "GATGACGGTGT") >> >> count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") >> >> print "Saved %i reads" % count >> >> >> >> >> >> >> >> >> >> My question is: Is there a way to loop through a primer file for >> >> instance instead of looking for only >> >> >> >> 'GATGACGGTGT' every primer would be checked and subsequently removed >> >> from the start of its respective read. >> >> >> >> Primer file structured as: >> >> GATGACGGTGT >> >> GATGACGGTGA >> >> GATGACGGCCT >> >> >> >> If you have any suggestions it would be greatly appreciated. Thanks. >> >> >> >> Tychele >> >> >> >> >> >> _______________________________________________ >> >> Biopython mailing list - Biopython at lists.open-bio.org >> >> http://lists.open-bio.org/mailman/listinfo/biopython >> > >> >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > From rbuels at gmail.com Sun Apr 8 12:34:33 2012 From: rbuels at gmail.com (Robert Buels) Date: Sun, 08 Apr 2012 12:34:33 -0400 Subject: [Biopython] Google Summer of Code mentors Message-ID: <4F81BE19.2050605@gmail.com> Hi all, Reminder: if you want to help mentor Google Summer of Code students to work on your Bio* project, you have to do three things: 1. Make sure you have enough time to actually help a student over the summer 2. Sign up as a mentor for the Open Bioinformatics Foundation at http://www.google-melange.com/gsoc/homepage/google/gsoc2012 3. Join the OBF Google Summer of Code mailing lists at: http://lists.open-bio.org/mailman/listinfo/gsoc and http://lists.open-bio.org/mailman/listinfo/gsoc-mentors Robert Buels 2012 OBF GSoC Org. Admin. From alfonso.esposito1983 at hotmail.it Mon Apr 9 06:40:57 2012 From: alfonso.esposito1983 at hotmail.it (fonz esposito) Date: Mon, 9 Apr 2012 12:40:57 +0200 Subject: [Biopython] biopython script and py2exe Message-ID: Dear all, I wrote a biopython script for sequence blast and report. It is quite a simple one, but it only runs on Linux and all my colleagues complain. I found a way to convert it into a .exe that could run on Windows, called py2exe. I tried it but it does not work: it gives me a lot of error messages, so I read some more and I found that it requires a .dll file to run properly, and furthermore there are some more things to do to make it work with biopython packages... Did anyone of you get the same problem? Could someone of you help me to solve it? thanks in advance. Alfonso From rbuels at gmail.com Mon Apr 9 10:57:41 2012 From: rbuels at gmail.com (Robert Buels) Date: Mon, 09 Apr 2012 10:57:41 -0400 Subject: [Biopython] Google Summer of Code mentors Message-ID: <4F82F8E5.2040709@gmail.com> Hi all, Reminder: if you want to help mentor Google Summer of Code students to work on your Bio* project, you have to do four things: 1. Make sure you have enough time to actually help a student over the summer 2. Sign up as a mentor for the Open Bioinformatics Foundation at http://www.google-melange.com/gsoc/homepage/google/gsoc2012 3.
Join the OBF Google Summer of Code mailing lists at: http://lists.open-bio.org/mailman/listinfo/gsoc and http://lists.open-bio.org/mailman/listinfo/gsoc-mentors 4. After your request to be a mentor is accepted by me, log into the GSoC web interface at http://www.google-melange.com (the same web application you used to sign up) and help look at and evaluate this year's student proposals. Robert Buels 2012 OBF GSoC Org. Admin. From jgrant at smith.edu Mon Apr 9 16:29:16 2012 From: jgrant at smith.edu (Jessica Grant) Date: Mon, 9 Apr 2012 16:29:16 -0400 Subject: [Biopython] search ncbi automatically Message-ID: <4531D4A6-5312-4DC4-9418-6D6939D5BD46@smith.edu> Hello, I am working on a phylogenomic pipeline and want to keep my database as up-to-date as possible. I was wondering if there is a way to automatically search genbank on occasion (every month or so, or however often they release new data) to see if any new sequences have been added for the taxa we are working with. Is there a way to run a script in the background that will just go out and do that for me, and let me know if it finds anything? Thanks for your help! Jessica From tturne18 at jhmi.edu Mon Apr 9 16:44:55 2012 From: tturne18 at jhmi.edu (Tychele Turner) Date: Mon, 9 Apr 2012 20:44:55 +0000 Subject: [Biopython] [Samtools-help] biopython question In-Reply-To: References: <22450DD328862542A918A3BC491F263B561ACC@SN2PRD0102MB141.prod.exchangelabs.com> <22450DD328862542A918A3BC491F263B561B7B@SN2PRD0102MB141.prod.exchangelabs.com> <22450DD328862542A918A3BC491F263B561E92@SN2PRD0102MB141.prod.exchangelabs.com>, Message-ID: <22450DD328862542A918A3BC491F263B562034@SN2PRD0102MB141.prod.exchangelabs.com> Thanks Monica! I will look into your program. Tychele ________________________________ From: Monica Britton [mtbritton at ucdavis.edu] Sent: Saturday, April 07, 2012 5:55 PM To: Tychele Turner Cc: Mic; Wibowo Arindrarto; samtools-help Subject: Re: [Samtools-help] [Biopython] biopython question Hi Tychele: If your primer is always at the beginning of each sequence, you could treat it as a barcode. We have a program to cleave barcodes from fastq sequences that would fit your purpose (see https://github.com/ucdavis-bioinformatics/sabre). Monica Britton On Sat, Apr 7, 2012 at 8:05 AM, Tychele Turner > wrote: Hi Mic, I just saw your message regarding Mark Duplicates and the script Bow and I were discussing which recognizes and cleaves primers. First off, I'm familiar with Mark Duplicates from Picard and I do use it for exome data. However, in this instance I was looking at sequences coming from short amplicon sequencing. In this instance, marking duplicates is not appropriate because most of the reads will be duplicates due to the nature of the bench experiment (in contrast to shotgun sequencing where your looking at random fragments in which PCR artifacts arise in the PCR steps post-shearing). In my short amplicon sequence data, the read will start with the primer sequence and then extend to be a total length of 100 nucleotides. For this reason, I wanted to use a script which could recognize the primer and ultimately cleave that primer from the read so it would not go into the rest of the pipeline which would ultimately go to a variant calling program. As for your last point of sending other software which cut adapters that's fine but I'm not cutting adapters I'm looking for primer sequences and cleaving those. 
Also, I thought that if Biopython already has such a nice setup to do this I would use that especially since python is quite efficient at this task. Hope this helps. Tychele From: Mic [mictadlo at gmail.com] Sent: Friday, April 06, 2012 5:59 AM To: Wibowo Arindrarto Cc: samtools-help; biopython at biopython.org Subject: Re: [Samtools-help] [Biopython] biopython question Hi Bow, You can remove duplicates in the input file or create a new output file. With the following commands you create an output file with no duplicates: $ samtools fixmate t.paired.sorted.bam t.paired.sorted.SamFix.bam $ java -Xmx8g -jar MarkDuplicates.jar REMOVE_DUPLICATES=true ASSUME_SORTED=true MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1024 INPUT=t.paired.sorted.bam OUTPUT=t.paired.sorted.rmdulp.bam METRICS_FILE=t.paired.sorted.bam.metrics Are adapters and fragments the same? I found the following software for adapter: * TagDust - eliminate artifactual sequence from NGS data http://www.biomedcentral.com/1471-2164/12/382 http://bioinformatics.oxfordjournals.org/content/25/21/2839.full * FAR: http://sourceforge.net/apps/mediawiki/theflexibleadap/index.php? title=Main_Page * Trimmomatic: http://www.usadellab.org/cms/index.php?page=trimmomatic * http://code.google.com/p/cutadapt/ * https://github.com/vsbuffalo/scythe * http://code.google.com/p/biopieces/wiki/find_adaptor Thank you in advance. Cheers, On Fri, Apr 6, 2012 at 12:20 PM, Wibowo Arindrarto > wrote: Hi Mic, I'm not familiar with picard, but it seems that this program detects whole duplicate molecules instead of detecting whether a primer is present in sequences (which may or may not be duplicates). Plus, it doesn't do any removal ~ it only flags them. So I don't think the two are comparable. cheers, Bow On Fri, Apr 6, 2012 at 04:06, Mic > wrote: > What is the difference to remove primer from the fastq file rather to use > markDuplicates http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates on > an alignment? > > Would both ways deliver the same results? > > Thank you in advance. > > > On Thu, Apr 5, 2012 at 8:26 AM, Wibowo Arindrarto > > wrote: >> >> Hi Tychele, >> >> Glad to hear that and thanks for attaching the code as well :). >> >> Just one more heads up on the code, the trimming function assumes that >> for any record sequence, there is only one matching primer sequence at >> most. If by any random chance a sequence begins with two or more >> primer sequences, then it will only trim the first primer sequence. So >> if you still see some primer sequences left in the trimmed sequences, >> this could be the case and you'll need to modify the code. >> >> However, that seems unlikely ~ the current code should suffice. >> >> cheers, >> Bow >> >> >> On Thu, Apr 5, 2012 at 00:12, Tychele Turner > wrote: >> > Hi Bow, >> > >> > Thank you! This works great. I have attached the final code to the email >> > in case it may benefit others. >> > >> > Tychele >> > >> > >> > ________________________________________ >> > From: Wibowo Arindrarto [w.arindrarto at gmail.com] >> > Sent: Wednesday, April 04, 2012 2:05 PM >> > To: Tychele Turner >> > Cc: biopython at biopython.org >> > Subject: Re: [Biopython] biopython question >> > >> > Hi Tychele, >> > >> > If I understood correctly, you have a list of primers stored in a file >> > and you want to trim those primer sequences off your fastq sequences, >> > correct? One way I could think of is to first store the primers in a >> > list (since they will be used repeatedly to check every single fastq >> > sequence). 
>> > >> > Here's the code: >> > >> > from Bio import SeqIO >> > >> > def trim_primers(records, 'primer_file_name'): >> > >> > # read the primers >> > primer_list = [] >> > with open('primer_file_name', 'r') as source: >> > for line in source: >> > primer_list.append(line.strip()) >> > >> > for record in records: >> > # list to check if the sequence begins with any of the primers >> > check = [record.seq.startswith(x) for x in primer_list] >> > # if any of the primer is present in the beginning of the >> > sequence, then we trim it off >> > if any(check): >> > # get index of primer that matches the beginning >> > idx = check.index(True) >> > len_primer = len(primer_list[idx]) >> > yield record[len_primer:] >> > # otherwise just return the whole record >> > else: >> > yield record >> > >> > and then, you can use the function like so: >> > >> > original_reads = SeqIO.parse("SRR020192.fastq", "fastq") >> > trimmed_reads = trim_primers(original_reads, 'primer_file_name') >> > count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") >> > print "Saved %i reads" % count >> > >> > I haven't tested the function, but I suppose that should do the trick. >> > >> > Hope that helps :), >> > Bow >> > >> > >> > On Wed, Apr 4, 2012 at 18:55, Tychele Turner > wrote: >> >> Hi, >> >> >> >> I have a question regarding one of the biopython capabilities. I would >> >> like to trim primers off the end of reads in a fastq file and I found >> >> wonderful documentation of how to do this on your website as follows: >> >> >> >> from Bio import SeqIO >> >> def trim_primers(records, primer): >> >> """Removes perfect primer sequences at start of reads. >> >> >> >> This is a generator function, the records argument should >> >> be a list or iterator returning SeqRecord objects. >> >> """ >> >> len_primer = len(primer) #cache this for later >> >> for record in records: >> >> if record.seq.startswith(primer): >> >> yield record[len_primer:] >> >> else: >> >> yield record >> >> >> >> original_reads = SeqIO.parse("SRR020192.fastq", "fastq") >> >> trimmed_reads = trim_primers(original_reads, "GATGACGGTGT") >> >> count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") >> >> print "Saved %i reads" % count >> >> >> >> >> >> >> >> >> >> My question is: Is there a way to loop through a primer file for >> >> instance instead of looking for only >> >> >> >> 'GATGACGGTGT' every primer would be checked and subsequently removed >> >> from the start of its respective read. >> >> >> >> Primer file structured as: >> >> GATGACGGTGT >> >> GATGACGGTGA >> >> GATGACGGCCT >> >> >> >> If you have any suggestions it would be greatly appreciated. Thanks. >> >> >> >> Tychele >> >> >> >> >> >> _______________________________________________ >> >> Biopython mailing list - Biopython at lists.open-bio.org >> >> http://lists.open-bio.org/mailman/listinfo/biopython >> > >> >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > ------------------------------------------------------------------------------ For Developers, A Lot Can Happen In A Second. Boundary is the first to Know...and Tell You. Monitor Your Applications in Ultra-Fine Resolution. Try it FREE! 
http://p.sf.net/sfu/Boundary-d2dvs2 _______________________________________________ Samtools-help mailing list Samtools-help at lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/samtools-help -- Monica Britton Bioinformatics Analyst Genome Center and Bioinformatics Core Facility University of California, Davis mtbritton at ucdavis.edu From David.Lapointe at umassmed.edu Mon Apr 9 20:09:57 2012 From: David.Lapointe at umassmed.edu (Lapointe, David) Date: Tue, 10 Apr 2012 00:09:57 +0000 Subject: [Biopython] search ncbi automatically In-Reply-To: <4531D4A6-5312-4DC4-9418-6D6939D5BD46@smith.edu> References: <4531D4A6-5312-4DC4-9418-6D6939D5BD46@smith.edu> Message-ID: <86BFEB1DFA6CB3448DB8AB1FC52F4059069C60@ummscsmbx06.ad.umassmed.edu> Hi Jessica, David Hibbett, at Clark Univ, had a program with a similar purpose. See http://www.clarku.edu/faculty/dhibbett/. Genbank publishes daily updates which can be scanned for taxa with some biopython scripts. That would involve some downloading every day or so. Each file ranges from 10-100 Mb compressed, though some days there might be a 900 Mb file. A new version of Genbank happens every 2 months, so if you have a division (VRL, PRI, etc.) that interests you, you can download all of the pieces of that division and rsync when a new Genbank version comes around. David ________________________________________ From: biopython-bounces at lists.open-bio.org [biopython-bounces at lists.open-bio.org] on behalf of Jessica Grant [jgrant at smith.edu] Sent: Monday, April 09, 2012 4:29 PM To: biopython at lists.open-bio.org Subject: [Biopython] search ncbi automatically Hello, I am working on a phylogenomic pipeline and want to keep my database as up-to-date as possible. I was wondering if there is a way to automatically search genbank on occasion (every month or so, or however often they release new data) to see if any new sequences have been added for the taxa we are working with. Is there a way to run a script in the background that will just go out and do that for me, and let me know if it finds anything? Thanks for your help! Jessica _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From livingstonemark at gmail.com Wed Apr 11 21:34:16 2012 From: livingstonemark at gmail.com (Mark Livingstone) Date: Thu, 12 Apr 2012 11:34:16 +1000 Subject: [Biopython] Key Error Message-ID: Hi Guys, I am about 3 days into learning BioPython using the current EPD 32 bit Mac OS X academic distribution. When I run the included code, it works fine if I do num_atoms_to_do = 36 but if I try to do any more, I get a key error. I am using the 1fat.pdb since that is what your tutes seem to use. The only thing that I note is that when comparing residues in model A & C, it is at residue 37 that the letters are no longer the same. However, since I only look at Model A in the first part of the code, I can't see that this should be a factor?
Program output: >From residue 0 0.000000 3.795453 5.663135 9.420295 12.296334 15.957790 19.201048 22.622383 25.621159 27.803837 27.483303 28.365652 28.663099 27.070955 23.441793 24.625151 26.016047 29.257225 31.299105 34.907970 36.100464 32.837784 33.310841 32.332653 35.280716 36.668713 35.319839 31.738102 31.607626 30.115669 33.220890 32.487370 36.199894 39.507755 42.668190 46.333019 Traceback (most recent call last): File "./inter_atom_distance.py", line 47, in residue2 = chain[y+1] File "/Library/Frameworks/Python.framework/Versions/7.2/lib/python2.7/site-packages/Bio/PDB/Chain.py", line 67, in __getitem__ return Entity.__getitem__(self, id) File "/Library/Frameworks/Python.framework/Versions/7.2/lib/python2.7/site-packages/Bio/PDB/Entity.py", line 38, in __getitem__ return self.child_dict[id] KeyError: (' ', 37, ' ') #! /usr/bin/env python # Initial idea for this code from http://stackoverflow.com/questions/6437391/how-to-get-distance-between-two-atoms-using-for-loop print "\n\n\033[95mMark's PDB Geometry experimentation code\033[0m\n\n" print "This code prints a set of c-alpha to c-alpha distances which are colourised so that distances < 5.0 are green\n" print "otherwise are colourised red. If you are using Microsoft Windows, you may need to load an ansi.sys driver in your config.sys\n\n" print "Any errors below in yellow are due to the .pdb file not being properly well formed.\n\n\033[93m" from Bio.PDB.PDBParser import PDBParser pdb_filename ='./1fat.pdb' parser = PDBParser(PERMISSIVE=1) structure = parser.get_structure("1fat", pdb_filename) model = structure[0] chain = model["A"] residue1 = chain[1] print "\033[0m" print residue1.get_resname() # SER residue2 = chain[2] print residue2.get_resname() # ASN atom1 = residue1['CA'] print atom1.get_name() print atom1.get_coord() atom2 = residue2['CA'] print atom2.get_name() print atom2.get_coord() distance = atom1-atom2 # minus is overloaded to do 3D vector math - how clever!! 
print"%s to %s euclidean distance = %f Angstroms" % (residue1.get_resname(), residue2.get_resname(), distance) print "%d models in pdb file named %s" % (len(model), pdb_filename) # 4 models in pdb print "%d residues in model 1" % len(chain) # 239 residues print "%s has %d atoms" % (residue1.get_resname(), len(residue1)) # SER has 6 atoms print "Length of Model 'A' is %d and Model 'C' is %d" % (len(model['A']), len(model['C'])) print ("\n\033[93mDistances between C-alpha atoms of residues in model 1 \n") num_atoms_to_do = 37 for x in range(num_atoms_to_do): print "\033[0mFrom residue %d" % x residue1 = chain[x+1] atom1 = residue1['CA'] for y in range(num_atoms_to_do): residue2 = chain[y+1] atom2 = residue2['CA'] distance = (atom1 - atom2) if distance < 5.0: print("\033[92m%f" % distance), else: print("\033[91m%f" % distance), print "\n" print "\n\033[93mDistances between C-alpha atoms of residues in model 1 to model 3 \n" print "NB: These have NOT been superimposed - thus the large distances between matched atoms\033[0m\n" num_atoms_to_do = 37 for x in range(num_atoms_to_do): print "\033[0mFrom residue %d" % x model = structure[0] chain = model["A"] residue1 = chain[x+1] atom1 = residue1['CA'] for y in range(num_atoms_to_do): model = structure[0] chain = model["C"] residue2 = chain[y+1] atom2 = residue2['CA'] distance = (atom1 - atom2) if distance < 5.0: print("\033[92m%f" % distance), else: print("\033[91m%f" % distance), print "\n" print "\033[0m" Thanks in advance, MarkL From livingstonemark at gmail.com Wed Apr 11 21:57:15 2012 From: livingstonemark at gmail.com (Mark Livingstone) Date: Thu, 12 Apr 2012 11:57:15 +1000 Subject: [Biopython] Key Error In-Reply-To: References: Message-ID: Hi Guys, I am about 3 days into learning BioPython using the current EPD 32 bit Mac OS X academic distribution . When I run the included code, it works fine if I do num_atoms_to_do = 36 but if I try to do any more, I get a key error. I am using the 1fat.pdb since that is what your tutes seem to use. The only thing that I note is that when comparing residues in model A & C, it is at residue 37 that the letters are no longer the same. However, since I only look at Model A in the first part of the code, I can't see that this should be a factor? Program output: >From residue 0 0.000000 3.795453 5.663135 9.420295 12.296334 15.957790 19.201048 22.622383 25.621159 27.803837 27.483303 28.365652 28.663099 27.070955 23.441793 24.625151 26.016047 29.257225 31.299105 34.907970 36.100464 32.837784 33.310841 32.332653 35.280716 36.668713 35.319839 31.738102 31.607626 30.115669 33.220890 32.487370 36.199894 39.507755 42.668190 46.333019 Traceback (most recent call last): ?File "./inter_atom_distance.py", line 47, in ? ?residue2 = chain[y+1] ?File "/Library/Frameworks/Python.framework/Versions/7.2/lib/python2.7/site-packages/Bio/PDB/Chain.py", line 67, in __getitem__ ? ?return Entity.__getitem__(self, id) ?File "/Library/Frameworks/Python.framework/Versions/7.2/lib/python2.7/site-packages/Bio/PDB/Entity.py", line 38, in __getitem__ ? ?return self.child_dict[id] KeyError: (' ', 37, ' ') #! /usr/bin/env python # Initial idea for this code from http://stackoverflow.com/questions/6437391/how-to-get-distance-between-two-atoms-using-for-loop print "\n\n\033[95mMark's PDB Geometry experimentation code\033[0m\n\n" print "This code prints a set of c-alpha to c-alpha distances which are colourised so that distances < 5.0 are green\n" print "otherwise are colourised red. 
If you are using Microsoft Windows, you may need to load an ansi.sys driver in your config.sys\n\n" print "Any errors below in yellow are due to the .pdb file not being properly well formed.\n\n\033[93m" from Bio.PDB.PDBParser import PDBParser pdb_filename ='./1fat.pdb' parser = PDBParser(PERMISSIVE=1) structure = parser.get_structure("1fat", pdb_filename) model = structure[0] chain = model["A"] residue1 = chain[1] print "\033[0m" print residue1.get_resname() # SER residue2 = chain[2] print residue2.get_resname() # ASN atom1 = residue1['CA'] print atom1.get_name() print atom1.get_coord() atom2 = residue2['CA'] print atom2.get_name() print atom2.get_coord() distance = atom1-atom2 # minus is overloaded to do 3D vector math - how clever!! print"%s to %s euclidean distance = %f Angstroms" ?% (residue1.get_resname(), residue2.get_resname(), distance) print "%d models in pdb file named %s" % (len(model), pdb_filename) # 4 models in pdb print "%d residues in model 1" % len(chain) # 239 residues print "%s has %d atoms" % (residue1.get_resname(), len(residue1)) # SER has 6 atoms print "Length of Model 'A' is %d and Model 'C' is %d" % (len(model['A']), len(model['C'])) print ("\n\033[93mDistances between C-alpha atoms of residues in model 1 \n") num_atoms_to_do = 37 for x in range(num_atoms_to_do): ? ?print "\033[0mFrom residue %d" % x ? ?residue1 = chain[x+1] ? ?atom1 = residue1['CA'] ? ?for y in range(num_atoms_to_do): ? ? ? ?residue2 = chain[y+1] ? ? ? ?atom2 = residue2['CA'] ? ? ? ?distance = (atom1 - atom2) ? ? ? ?if distance < 5.0: ? ? ? ? ? ?print("\033[92m%f" % distance), ? ? ? ?else: ? ? ? ? ? ?print("\033[91m%f" % distance), ? ?print "\n" print "\n\033[93mDistances between C-alpha atoms of residues in model 1 to model 3 \n" print "NB: These have NOT been superimposed - thus the large distances between matched atoms\033[0m\n" num_atoms_to_do = 37 for x in range(num_atoms_to_do): ? ?print "\033[0mFrom residue %d" % x ? ?model = structure[0] ? ?chain = model["A"] ? ?residue1 = chain[x+1] ? ?atom1 = residue1['CA'] ? ?for y in range(num_atoms_to_do): ? ? ? ?model = structure[0] ? ? ? ?chain = model["C"] ? ? ? ?residue2 = chain[y+1] ? ? ? ?atom2 = residue2['CA'] ? ? ? ?distance = (atom1 - atom2) ? ? ? ?if distance < 5.0: ? ? ? ? ? ?print("\033[92m%f" % distance), ? ? ? ?else: ? ? ? ? ? ?print("\033[91m%f" % distance), ? ?print "\n" print "\033[0m" Thanks in advance, MarkL From ajperry at pansapiens.com Wed Apr 11 23:07:57 2012 From: ajperry at pansapiens.com (Andrew Perry) Date: Thu, 12 Apr 2012 13:07:57 +1000 Subject: [Biopython] Key Error In-Reply-To: References: Message-ID: Hi Mark, The problem is arising since 1FAT is missing coordinates for residue 37 in chain A,B,C and D. This is very common for protein structures in the PDB, and can be for many reasons - it's often the case that the structural biologist who determined the structure left out this residue since their data didn't allow them to determine it's position with confidence. By using range(num_atoms_to_do), you are assuming that there will be no numbers missing in the sequence ... not the case ! [also, I think you really mean range(num_of_residues_to_do) ]. 
The solution would be to do something like this: from Bio.PDB.PDBParser import PDBParser pdb_filename ='./1fat.pdb' parser = PDBParser(PERMISSIVE=1) structure = parser.get_structure("1fat", pdb_filename) model = structure[0] chain = model["A"] for residue1 in chain: resnum = residue1.get_id()[1] atom1 = residue1['CA'] This loop will return residue objects in the Chain object, without caring if there is a residue missing in the sequence. (I can see how this could be confusing, since without looking at the source, it seems the Bio.PDB.Chain.Chain object mostly behaves like a Python sequence object (eg a list), but behaves like a dictionary when __getitem__ is called on it via chain[some_key] . I'm sure there's some good reason for that :) ) The next thing you may find is that you hit a non-amino acid ligand "NAG" without a 'CA' atom. Use something like: if not "CA" in residue1: continue to catch that. Also, just a pedantic note on terminology that may help in reading the docs and further questions - "A", "B", "C" and "D" are chains in PDB terminology. A "model" is something different (usually only found in NMR structures with multiple models per PDB file). Hope this helps, Andrew Perry Postdoctoral Fellow Whisstock Lab Department of Biochemistry and Molecular Biology Monash University, Clayton Campus, PO Box 13d, VIC, 3800, Australia. Mobile: +61 409 808 529 From matthiasschade.de at googlemail.com Sat Apr 14 07:41:33 2012 From: matthiasschade.de at googlemail.com (Matthias Schade) Date: Sat, 14 Apr 2012 13:41:33 +0200 Subject: [Biopython] (no subject) Message-ID: <4F89626D.9060807@googlemail.com> Hello everyone, I would like to run a blastn-query of a small nucleotide-sequence against a genome. The code works already, but my queries are still slow and mostly ineffective, so I would like to ask: Is there a way to tell the blastn-algorithm that once a 'perfect match' has been found it can stop and send back the results? Background: I am interested in only the first full match because I would like to design a nucleotide-probe which -if possible- has no(!) known match in a host-genome, neither in RNA nor DNA. Actually, I would reject all perfect-matches and all single-mismatches but allow every sequence with two or more mismatches.
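As far as I know qblast offers no stop-at-first-perfect-match switch, but its optional parameters can make a short-probe screen faster and cap how much comes back. A hedged sketch (the probe sequence and parameter values below are illustrative, not from this thread; the crude identity test should also check hsp.align_length in earnest use):

    from Bio.Blast import NCBIWWW, NCBIXML

    seq_now = "GATGACGGTGTGATGACGGT"  # hypothetical 20 nt probe candidate
    result_handle = NCBIWWW.qblast("blastn", "nt", seq_now,
                                   entrez_query="Canis familiaris[orgn]",
                                   megablast=True,   # faster for highly similar hits
                                   hitlist_size=10,  # cap the number of hits returned
                                   expect=1000)      # short queries need a generous E-value cutoff
    record = NCBIXML.read(result_handle)
    # reject the probe if any HSP matches with zero or one mismatch
    bad = any(hsp.identities >= len(seq_now) - 1
              for alignment in record.alignments for hsp in alignment.hsps)
    print "Probe rejected:", bad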
Currently, I use this line of code with seq_now being about 15-30 nt long: result_handle = NCBIWWW.qblast("blastn", "nt", seq_now, entrez_query="Canis familiaris[orgn]") I am still new to this. Thank you for your help and input, Matt From mrrizkalla at gmail.com Sat Apr 14 08:35:52 2012 From: mrrizkalla at gmail.com (Mariam Reyad Rizkallah) Date: Sat, 14 Apr 2012 14:35:52 +0200 Subject: [Biopython] History, Efetch, and returned records limits Message-ID: Dear community, I aim to get sequences by a list of gi (using efetch and history variables), for a certain taxid (using esearch). I always get the first 10,000 records. For example, I need 10,300 gi_ids, I split them into lists of 10,000 and submit them consecutively and still get the first 10,000 records. I tried the batch approach in the Biopython tutorial, didn't even reach 10,000 sequences. Is there a limit for NCBI's returned sequences? Thank you. Mariam From p.j.a.cock at googlemail.com Sat Apr 14 08:54:55 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 14 Apr 2012 13:54:55 +0100 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: Message-ID: On Sat, Apr 14, 2012 at 1:35 PM, Mariam Reyad Rizkallah wrote: > Dear community, > > I aim to get sequences by a list of gi (using efetch and history > variables), for a certain taxid (using esearch). I always get the first > 10,000 records. For example, I need 10,300 gi_ids, I split them into lists of > 10,000 and submit them consecutively and still get the first 10,000 > records. I tried the batch approach in the Biopython tutorial, didn't even reach > 10,000 sequences. > > Is there a limit for NCBI's returned sequences? > > Thank you. > > Mariam It does sound like you've found some sort of Entrez limit, it might be worth emailing the NCBI to clarify this. Have you considered downloading the GI/taxid mapping table from their FTP site instead? e.g. http://lists.open-bio.org/pipermail/biopython/2009-June/005295.html Peter From cjfields at illinois.edu Sat Apr 14 09:22:55 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Sat, 14 Apr 2012 13:22:55 +0000 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: Message-ID: <3146B62A-6B02-4A22-862D-68223F6A13E0@illinois.edu> On Apr 14, 2012, at 7:54 AM, Peter Cock wrote: > On Sat, Apr 14, 2012 at 1:35 PM, Mariam Reyad Rizkallah > wrote: >> Dear community, >> >> I aim to get sequences by a list of gi (using efetch and history >> variables), for a certain taxid (using esearch). I always get the first >> 10,000 records. For example, I need 10,300 gi_ids, I split them into lists of >> 10,000 and submit them consecutively and still get the first 10,000 >> records. I tried the batch approach in the Biopython tutorial, didn't even reach >> 10,000 sequences. >> >> Is there a limit for NCBI's returned sequences? >> >> Thank you. >> >> Mariam > > It does sound like you've found some sort of Entrez limit, > it might be worth emailing the NCBI to clarify this. > > Have you considered downloading the GI/taxid mapping > table from their FTP site instead? e.g. > http://lists.open-bio.org/pipermail/biopython/2009-June/005295.html > > Peter This wouldn't surprise me, they have long suggested breaking up record retrieval into batches of a few thousand or more, using retstart/retmax.
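For example, a rough and untested sketch of that kind of retstart/retmax batching with Bio.Entrez - the database, search term, output file name and email address here are just made-up placeholders:

from Bio import Entrez
Entrez.email = "you at example.com"  # placeholder - use your own address
search = Entrez.read(Entrez.esearch(db="nucest",
                                    term="txid543769[Organism:exp]",
                                    usehistory="y"))
count = int(search["Count"])
batch_size = 1000
out_handle = open("batched.fasta", "w")
for start in range(0, count, batch_size):
    # fetch one slice of the result set at a time via retstart/retmax
    fetch_handle = Entrez.efetch(db="nucest", rettype="fasta",
                                 retmode="text", retstart=start,
                                 retmax=batch_size,
                                 webenv=search["WebEnv"],
                                 query_key=search["QueryKey"])
    out_handle.write(fetch_handle.read())
    fetch_handle.close()
out_handle.close()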
chris From mrrizkalla at gmail.com Sat Apr 14 09:36:29 2012 From: mrrizkalla at gmail.com (Mariam Reyad Rizkallah) Date: Sat, 14 Apr 2012 15:36:29 +0200 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: Message-ID: Hi Peter, I am concerned with EST sequences, I will check if they are in the gi_taxid_nucl mappers. Please find my script, with 3 approaches and their results. #!/usr/bin/python > import sys > from Bio import Entrez > Entrez.email = "mariam.rizkallah at gmail.com" > txid = int(sys.argv[1]) > > #get count > prim_handle = Entrez.esearch(db="nucest",term="txid%i[Organism:exp]" > %(txid), retmax=20) > prim_record = Entrez.read(prim_handle) > prim_count = prim_record['Count'] > > #get max using history (Biopython tutorial > http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc119) > search_handle = Entrez.esearch(db="nucest",term="txid%s[Organism:exp]" > %(txid), retmax=prim_count, usehistory="y") > search_results = Entrez.read(search_handle) > search_handle.close() > gi_list = search_results["IdList"] count = int(search_results["Count"]) > assert count == len(gi_list) > webenv = search_results["WebEnv"] > query_key = search_results["QueryKey"] > out_fasta = "%s_txid%i_ct%i.fasta" %(sys.argv[2], txid, count) > out_handle = open(out_fasta, "a") > > ## Approach1: gets tags within the fasta file Unable to > obtain query #1 batch_size = 1000 > for start in range(0,count,batch_size): > end = min(count, start+batch_size) > print "Going to download record %i to %i" % (start+1, end) > fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", > retmode="text", retstart=start, retmax=batch_size, webenv=webenv, > query_key=query_key) > data = fetch_handle.read() > fetch_handle.close() > out_handle.write(data) > > ## Approach2: split list def SplitList( list, chunk_size ) : return [list[offs:offs+chunk_size] for offs in range(0, len(list), > chunk_size)] > z = SplitList(gi_list, 10000) for i in range(0, len(z)): > print len(z[i]) > post_handle = Entrez.epost("nucest", rettype="fasta", retmode="text", > id=",".join(z[1])) > webenv = search_results["WebEnv"] > query_key = search_results["QueryKey"] > fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", > retmode="text", webenv=webenv, query_key=query_key) > data = fetch_handle.read() > fetch_handle.close() > out_handle.write(data) > > ## Approach3: with most consistent retrieval but limited to 10000 fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", retmode="text", > webenv=webenv, query_key=query_key) > data = fetch_handle.read() > fetch_handle.close() > out_handle.write(data) out_handle.close() On Sat, Apr 14, 2012 at 2:54 PM, Peter Cock wrote: > On Sat, Apr 14, 2012 at 1:35 PM, Mariam Reyad Rizkallah > wrote: > > Dear community, > > > > I aim to get sequences by a list of gi (using efetch and history > > variables), for a certain taxid (using esearch). I always get the first > > 10,000 records. For example, I need10,300 gi_ids, I split them into list > of > > 10,000 and submit them consecutively and still getting the first 10,000 > > records. I tried batch approach in Biopython tutorial, didn't even reach > > 10,000 sequences. > > > > Is there a limit for NCBI's returned sequences? > > > > Thank you. > > > > Mariam > > It does sound like you've found some sort of Entrez limit, > it might be worth emailing the NCBI to clarify this. > > Have you considered downloading the GI/taxid mapping > table from their FTP site instead? e.g. 
> http://lists.open-bio.org/pipermail/biopython/2009-June/005295.html > > Peter > From p.j.a.cock at googlemail.com Sat Apr 14 09:52:43 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 14 Apr 2012 14:52:43 +0100 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: Message-ID: On Sat, Apr 14, 2012 at 2:36 PM, Mariam Reyad Rizkallah wrote: > Hi Peter, > > I am concerned with EST sequences, I will check if they are in the > gi_taxid_nucl mappers. > > Please find my script, with 3 approaches and their results. > >> #!/usr/bin/python >> import sys >> from Bio import Entrez >> Entrez.email = "mariam.rizkallah at gmail.com" >> txid = int(sys.argv[1]) >> ... Can you give an example taxid where this breaks? I guess any with just over 10,000 results would be fine but it would be simpler to use the same as you for comparing results. Peter From mrrizkalla at gmail.com Sat Apr 14 09:56:22 2012 From: mrrizkalla at gmail.com (Mariam Reyad Rizkallah) Date: Sat, 14 Apr 2012 15:56:22 +0200 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: Message-ID: Definitely! I am stuck with Rhizaria ( http://www.ncbi.nlm.nih.gov/nucest/?term=txid543769[Organism:exp])! Hope to move on through the tree of life :) ./get_est_by_txid.py "543769" "Rhizaria" Thank you so much. On Sat, Apr 14, 2012 at 3:52 PM, Peter Cock wrote: > On Sat, Apr 14, 2012 at 2:36 PM, Mariam Reyad Rizkallah > wrote: > > Hi Peter, > > > > I am concerned with EST sequences, I will check if they are in the > > gi_taxid_nucl mappers. > > > > Please find my script, with 3 approaches and their results. > > > >> #!/usr/bin/python > >> import sys > >> from Bio import Entrez > >> Entrez.email = "mariam.rizkallah at gmail.com" > >> txid = int(sys.argv[1]) > >> ... > > Can you give an example taxid where this breaks? I guess any > with just over 10,000 results would be fine but it would be simpler > to use the same as you for comparing results. 
> > Peter > From p.j.a.cock at googlemail.com Sat Apr 14 13:39:22 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 14 Apr 2012 18:39:22 +0100 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: Message-ID: Hi again, I get a similar problem with this code - the first couple of tries it got the first 5000 and then failed, but that doesn't always happen: $ python mariam.py 10313 Going to download record 1 to 1000 Going to download record 1001 to 2000 Going to download record 2001 to 3000 Traceback (most recent call last): File "mariam.py", line 28, in assert data.startswith(">"), data AssertionError: Unable to obtain query #1 Sometimes it gets further: $ python mariam.py 10313 Going to download record 1 to 1000 Going to download record 1001 to 2000 Going to download record 2001 to 3000 Going to download record 3001 to 4000 Going to download record 4001 to 5000 Going to download record 5001 to 6000 Going to download record 6001 to 7000 Going to download record 7001 to 8000 Going to download record 8001 to 9000 Going to download record 9001 to 10000 Going to download record 10001 to 10313 Traceback (most recent call last): File "mariam.py", line 28, in assert data.startswith(">"), data AssertionError: Unable to obtain query #1 Notice that this demonstrates one of the major flaws with the current NCBI Entrez setup - rather than setting an HTTP error code (which would trigger a clear exception), Entrez returns HTTP OK but puts an error in XML format (essentially a silent error). This is most unhelpful IMO. (This is something TogoWS handles much more nicely). #!/usr/bin/python import sys from Bio import Entrez Entrez.email = "mariam.rizkallah at gmail.com" #Entrez.email = "p.j.a.cock at googlemail.com" txid = 543769 name = "Rhizaria" #using history search_handle = Entrez.esearch(db="nucest",term="txid%s[Organism:exp]" %(txid), usehistory="y") search_results = Entrez.read(search_handle) search_handle.close() gi_list = search_results["IdList"] count = int(search_results["Count"]) print count webenv = search_results["WebEnv"] query_key = search_results["QueryKey"] out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count) out_handle = open(out_fasta, "a") #Sometimes get XML error not FASTA batch_size = 1000 for start in range(0,count,batch_size): end = min(count, start+batch_size) print "Going to download record %i to %i" % (start+1, end) fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", retmode="text", retstart=start, retmax=batch_size, webenv=webenv, query_key=query_key) data = fetch_handle.read() assert data.startswith(">"), data fetch_handle.close() out_handle.write(data) print "Done" out_handle.close() This is how I believe the NCBI expect this task to be done. In this specific case it seems to be an NCBI failure. Perhaps a loop to retry the efetch two or three times might work? It could be the whole history session breaks at the NCBI end though... A somewhat brute force approach would be to do the search (don't bother with the history) and get the 10313 GI numbers. Then use epost+efetch to grab the records in batches of say 1000.
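Untested, but the retry loop I have in mind would be something like this in place of the single efetch call in the script above, re-using its webenv, query_key, start and batch_size variables:

import time
for attempt in range(3):
    fetch_handle = Entrez.efetch(db="nucest", rettype="fasta",
                                 retmode="text", retstart=start,
                                 retmax=batch_size, webenv=webenv,
                                 query_key=query_key)
    data = fetch_handle.read()
    fetch_handle.close()
    if data.startswith(">"):
        break  # looks like FASTA rather than an XML error - keep this batch
    time.sleep(5)  # short pause before retrying, to be kind to the NCBI
else:
    raise RuntimeError("Batch starting at record %i failed three times" % (start + 1))
out_handle.write(data)

No guarantees though - if the whole history session has broken at the NCBI end, retrying the fetch alone won't rescue it.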
Peter From p.j.a.cock at googlemail.com Sat Apr 14 15:32:03 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 14 Apr 2012 20:32:03 +0100 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: Message-ID: On Sat, Apr 14, 2012 at 6:39 PM, Peter Cock wrote: > > A somewhat brute force approach would be to do the > search (don't bother with the history) and get the 10313 > GI numbers. Then use epost+efetch to grab the records > in batches of say 1000. > That does work (see below), but not all the time. A potential advantage of this way is that each fetch batch is a separate session, so retrying it should be straightforward. Peter #!/usr/bin/python import sys from Bio import Entrez Entrez.email = "mariam.rizkallah at gmail.com" Entrez.email = "p.j.a.cock at googlemail.com" txid = 543769 name = "Rhizaria" search_handle = Entrez.esearch(db="nucest",term="txid%s[Organism:exp]" %(txid), retmax="20000") search_results = Entrez.read(search_handle) search_handle.close() gi_list = search_results["IdList"] count = int(search_results["Count"]) print count assert count == len(gi_list), len(gi_list) out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count) out_handle = open(out_fasta, "a") out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count) out_handle = open(out_fasta, "a") ## Approach1: gets tags within the fasta file Unable to obtain query #1 batch_size = 1000 for start in range(0,count,batch_size): end = min(count, start+batch_size) batch = gi_list[start:end] print "Going to download record %i to %i using epost+efetch" % (start+1, end) post_results = Entrez.read(Entrez.epost("nucest", id=",".join(batch))) webenv = post_results["WebEnv"] query_key = post_results["QueryKey"] fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", retmode="text", webenv=webenv, query_key=query_key) data = fetch_handle.read() assert data.startswith(">"), data fetch_handle.close() out_handle.write(data) print "Done" out_handle.close() From David.Lapointe at umassmed.edu Sat Apr 14 16:10:12 2012 From: David.Lapointe at umassmed.edu (Lapointe, David) Date: Sat, 14 Apr 2012 20:10:12 +0000 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: , Message-ID: <86BFEB1DFA6CB3448DB8AB1FC52F40590706B2@ummscsmbx06.ad.umassmed.edu> Just curious. Is there a delay in the code? E.g 3 or 4 secs between requests. ________________________________________ From: biopython-bounces at lists.open-bio.org [biopython-bounces at lists.open-bio.org] on behalf of Peter Cock [p.j.a.cock at googlemail.com] Sent: Saturday, April 14, 2012 3:32 PM To: Mariam Reyad Rizkallah Cc: biopython at lists.open-bio.org Subject: Re: [Biopython] History, Efetch, and returned records limits On Sat, Apr 14, 2012 at 6:39 PM, Peter Cock wrote: > > A somewhat brute force approach would be to do the > search (don't bother with the history) and get the 10313 > GI numbers. Then use epost+efetch to grab the records > in batches of say 1000. > That does work (see below), but not all the time. A potential advantage of this way is that each fetch batch is a separate session, so retrying it should be straightforward. 
Peter #!/usr/bin/python import sys from Bio import Entrez Entrez.email = "mariam.rizkallah at gmail.com" Entrez.email = "p.j.a.cock at googlemail.com" txid = 543769 name = "Rhizaria" search_handle = Entrez.esearch(db="nucest",term="txid%s[Organism:exp]" %(txid), retmax="20000") search_results = Entrez.read(search_handle) search_handle.close() gi_list = search_results["IdList"] count = int(search_results["Count"]) print count assert count == len(gi_list), len(gi_list) out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count) out_handle = open(out_fasta, "a") out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count) out_handle = open(out_fasta, "a") ## Approach1: gets tags within the fasta file Unable to obtain query #1 batch_size = 1000 for start in range(0,count,batch_size): end = min(count, start+batch_size) batch = gi_list[start:end] print "Going to download record %i to %i using epost+efetch" % (start+1, end) post_results = Entrez.read(Entrez.epost("nucest", id=",".join(batch))) webenv = post_results["WebEnv"] query_key = post_results["QueryKey"] fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", retmode="text", webenv=webenv, query_key=query_key) data = fetch_handle.read() assert data.startswith(">"), data fetch_handle.close() out_handle.write(data) print "Done" out_handle.close() _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Sat Apr 14 16:24:55 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 14 Apr 2012 21:24:55 +0100 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: <86BFEB1DFA6CB3448DB8AB1FC52F40590706B2@ummscsmbx06.ad.umassmed.edu> References: <86BFEB1DFA6CB3448DB8AB1FC52F40590706B2@ummscsmbx06.ad.umassmed.edu> Message-ID: On Sat, Apr 14, 2012 at 9:10 PM, Lapointe, David wrote: > Just curious. Is there a delay in the code? E.g 3 or 4 secs between requests. The Bio.Entrez code obeys the current NCBI limit of at most 3 queries per second by limiting the query gap to at least 0.333333334s. Back in 2009 this was relaxed from the NCBI's original limit of 3s between queries. Peter From flitrfli at gmail.com Sun Apr 15 03:55:14 2012 From: flitrfli at gmail.com (Laura Scearce) Date: Sun, 15 Apr 2012 02:55:14 -0500 Subject: [Biopython] Blast Two sequences from a python script Message-ID: I have a list of pairs of proteins and I want to compare speed and accuracy of "BLAST Two Sequences" to a Smith-Waterman program for alignment. I know there is a "Blast Two Sequences" option on NCBI website, but I would like to run it from a python script. Perhaps Biopython has this capability? If I cannot use Blast Two Sequences, I will compare different versions of Smith-Waterman, but this would not be nearly as exciting :) OR, if anyone has another idea for a great senior year project in Bioinformatics involving comparing pairs of proteins, please don't hesitate to let me know? Thank you in advance. From eric.talevich at gmail.com Sun Apr 15 10:41:08 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Sun, 15 Apr 2012 10:41:08 -0400 Subject: [Biopython] (no subject) In-Reply-To: <4F89626D.9060807@googlemail.com> References: <4F89626D.9060807@googlemail.com> Message-ID: On Sat, Apr 14, 2012 at 7:41 AM, Matthias Schade wrote: > Hello everyone, > > > I would like to run a blastn-query of a small nucleotide-sequence against a > genome. 
The code works already, but my queries are still slow and mostly > ineffective, so I would like to ask: > > Is there a way to tell the blastn-algorithm that once a 'perfect match' has > been found it can stop and send back the results? > > Background: I am interested in only the first full match because I would > like to design a nucleotide-probe which -if possible- has no(!) known match > in a host-genome, neither in RNA nor DNA. Actually, I would reject all > perfect-matches and all single-mismatches but allow every sequence with two > or more mismatches. > > Currently, I use this line of code with seq_now being about 15-30 nt long: > result_handle = NCBIWWW.qblast("blastn", "nt", seq_now, entrez_query="Canis > familiaris[orgn]") > > > I am still new to this. Thank you for your help and input, > > Matt > Hi Matt, Since you're already setting the target database as one genome, this should already be reasonably fast, right? You can play with the BLAST sensitivity cutoffs and reporting thresholds, but I don't think it's possible to do exactly this, except by using an algorithm other than BLAST. If speed is crucial, you might be interested in USEARCH, which does have the feature you're looking for, but isn't wrapped in Biopython yet: http://www.drive5.com/usearch/ Cheers, Eric From golubchi at stats.ox.ac.uk Mon Apr 16 07:05:38 2012 From: golubchi at stats.ox.ac.uk (Tanya Golubchik) Date: Mon, 16 Apr 2012 12:05:38 +0100 Subject: [Biopython] (no subject) In-Reply-To: <4F89626D.9060807@googlemail.com> References: <4F89626D.9060807@googlemail.com> Message-ID: <4F8BFD02.5050209@stats.ox.ac.uk> Wouldn't it be faster to pre-check for a perfect match using a python string function?

if primer_seq in genome_seq:
    return MatchFound
else:
    pass  # no exact hit - only then fall back to a BLAST search

Cheers, Tanya On 14/04/12 12:41, Matthias Schade wrote: > Hello everyone, > > > I would like to run a blastn-query of a small nucleotide-sequence > against a genome. The code works already, but my queries are still slow > and mostly ineffective, so I would like to ask: > > Is there a way to tell the blastn-algorithm that once a 'perfect match' > has been found it can stop and send back the results? > > Background: I am interested in only the first full match because I would > like to design a nucleotide-probe which -if possible- has no(!) known > match in a host-genome, neither in RNA nor DNA. Actually, I would reject > all perfect-matches and all single-mismatches but allow every sequence > with two or more mismatches. > > Currently, I use this line of code with seq_now being about 15-30 nt long: > result_handle = NCBIWWW.qblast("blastn", "nt", seq_now, > entrez_query="Canis familiaris[orgn]") > > > I am still new to this. Thank you for your help and input, > > Matt > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From mrrizkalla at gmail.com Mon Apr 16 10:04:08 2012 From: mrrizkalla at gmail.com (Mariam Reyad Rizkallah) Date: Mon, 16 Apr 2012 16:04:08 +0200 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: Message-ID: OH WOW! It works like charm! Peter, thank you very much for insight and for taking the time to fix my script. I do appreciate. Thank you.
Mariam Blog post here: http://opensourcepharmacist.wordpress.com/2012/04/16/the-community-biopython-overcomes-limitations/ On Sat, Apr 14, 2012 at 9:32 PM, Peter Cock wrote: > On Sat, Apr 14, 2012 at 6:39 PM, Peter Cock > wrote: > > > > A somewhat brute force approach would be to do the > > search (don't bother with the history) and get the 10313 > > GI numbers. Then use epost+efetch to grab the records > > in batches of say 1000. > > > > That does work (see below), but not all the time. A potential > advantage of this way is that each fetch batch is a separate > session, so retrying it should be straightforward. > > Peter > > #!/usr/bin/python > import sys > from Bio import Entrez > Entrez.email = "mariam.rizkallah at gmail.com" > Entrez.email = "p.j.a.cock at googlemail.com" > txid = 543769 > name = "Rhizaria" > > search_handle = Entrez.esearch(db="nucest",term="txid%s[Organism:exp]" > %(txid), retmax="20000") > search_results = Entrez.read(search_handle) > search_handle.close() > gi_list = search_results["IdList"] > count = int(search_results["Count"]) > print count > assert count == len(gi_list), len(gi_list) > > out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count) > out_handle = open(out_fasta, "a") > > out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count) > out_handle = open(out_fasta, "a") > > ## Approach1: gets tags within the fasta file Unable to > obtain query #1 > batch_size = 1000 > for start in range(0,count,batch_size): > end = min(count, start+batch_size) > batch = gi_list[start:end] > print "Going to download record %i to %i using epost+efetch" % > (start+1, end) > post_results = Entrez.read(Entrez.epost("nucest", id=",".join(batch))) > webenv = post_results["WebEnv"] > query_key = post_results["QueryKey"] > fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", > retmode="text", webenv=webenv, query_key=query_key) > data = fetch_handle.read() > assert data.startswith(">"), data > fetch_handle.close() > out_handle.write(data) > print "Done" > out_handle.close() > From p.j.a.cock at googlemail.com Mon Apr 16 10:09:15 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 16 Apr 2012 15:09:15 +0100 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: Message-ID: On Mon, Apr 16, 2012 at 3:04 PM, Mariam Reyad Rizkallah wrote: > OH WOW! > > It works like charm!?Peter, thank you very much for insight and for taking > the time to fix my script. > > I do appreciate.?Thank you. > > Mariam > Blog post > here:?http://opensourcepharmacist.wordpress.com/2012/04/16/the-community-biopython-overcomes-limitations/ Did you contact the NCBI to see where that 10,000 limit was coming from? Peter From cjfields at illinois.edu Mon Apr 16 10:15:50 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 16 Apr 2012 14:15:50 +0000 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: Message-ID: <51BB20B7-BB11-4AD2-BDC6-2AE99682B56A@illinois.edu> On Apr 16, 2012, at 9:09 AM, Peter Cock wrote: > On Mon, Apr 16, 2012 at 3:04 PM, Mariam Reyad Rizkallah > wrote: >> OH WOW! >> >> It works like charm! Peter, thank you very much for insight and for taking >> the time to fix my script. >> >> I do appreciate. Thank you. >> >> Mariam >> Blog post >> here: http://opensourcepharmacist.wordpress.com/2012/04/16/the-community-biopython-overcomes-limitations/ > > Did you contact the NCBI to see where that 10,000 limit was coming from? > > Peter +1, I'm curious about that as well. OTOH, I've never tried it. 
chris From cjfields at illinois.edu Mon Apr 16 10:12:56 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 16 Apr 2012 14:12:56 +0000 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: Message-ID: Yeah, if you run a retrieval in batches you need a step to rerun the request in case it fails, particularly if the request is occurring at a busy time. We do the same with bioperl's interface, very similar to what Peter suggests. chris On Apr 16, 2012, at 9:04 AM, Mariam Reyad Rizkallah wrote: > OH WOW! > > It works like charm! Peter, thank you very much for insight and for taking > the time to fix my script. > > I do appreciate. Thank you. > > Mariam > Blog post here: > http://opensourcepharmacist.wordpress.com/2012/04/16/the-community-biopython-overcomes-limitations/ > > > On Sat, Apr 14, 2012 at 9:32 PM, Peter Cock wrote: > >> On Sat, Apr 14, 2012 at 6:39 PM, Peter Cock >> wrote: >>> >>> A somewhat brute force approach would be to do the >>> search (don't bother with the history) and get the 10313 >>> GI numbers. Then use epost+efetch to grab the records >>> in batches of say 1000. >>> >> >> That does work (see below), but not all the time. A potential >> advantage of this way is that each fetch batch is a separate >> session, so retrying it should be straightforward. >> >> Peter >> >> #!/usr/bin/python >> import sys >> from Bio import Entrez >> Entrez.email = "mariam.rizkallah at gmail.com" >> Entrez.email = "p.j.a.cock at googlemail.com" >> txid = 543769 >> name = "Rhizaria" >> >> search_handle = Entrez.esearch(db="nucest",term="txid%s[Organism:exp]" >> %(txid), retmax="20000") >> search_results = Entrez.read(search_handle) >> search_handle.close() >> gi_list = search_results["IdList"] >> count = int(search_results["Count"]) >> print count >> assert count == len(gi_list), len(gi_list) >> >> out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count) >> out_handle = open(out_fasta, "a") >> >> out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count) >> out_handle = open(out_fasta, "a") >> >> ## Approach1: gets tags within the fasta file Unable to >> obtain query #1 >> batch_size = 1000 >> for start in range(0,count,batch_size): >> end = min(count, start+batch_size) >> batch = gi_list[start:end] >> print "Going to download record %i to %i using epost+efetch" % >> (start+1, end) >> post_results = Entrez.read(Entrez.epost("nucest", id=",".join(batch))) >> webenv = post_results["WebEnv"] >> query_key = post_results["QueryKey"] >> fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", >> retmode="text", webenv=webenv, query_key=query_key) >> data = fetch_handle.read() >> assert data.startswith(">"), data >> fetch_handle.close() >> out_handle.write(data) >> print "Done" >> out_handle.close() >> > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From mrrizkalla at gmail.com Mon Apr 16 10:49:59 2012 From: mrrizkalla at gmail.com (Mariam Reyad Rizkallah) Date: Mon, 16 Apr 2012 16:49:59 +0200 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: <51BB20B7-BB11-4AD2-BDC6-2AE99682B56A@illinois.edu> References: <51BB20B7-BB11-4AD2-BDC6-2AE99682B56A@illinois.edu> Message-ID: I never considered asking NCBI. "Hey there! I need to get10 million records from different taxa from you, is it really limited to 10,000!? How about a workaround!?" I will ask them though! 
On Mon, Apr 16, 2012 at 4:15 PM, Fields, Christopher J < cjfields at illinois.edu> wrote: > On Apr 16, 2012, at 9:09 AM, Peter Cock wrote: > > > On Mon, Apr 16, 2012 at 3:04 PM, Mariam Reyad Rizkallah > > wrote: > >> OH WOW! > >> > >> It works like charm! Peter, thank you very much for insight and for > taking > >> the time to fix my script. > >> > >> I do appreciate. Thank you. > >> > >> Mariam > >> Blog post > >> here: > http://opensourcepharmacist.wordpress.com/2012/04/16/the-community-biopython-overcomes-limitations/ > > > > Did you contact the NCBI to see where that 10,000 limit was coming from? > > > > Peter > > +1, I'm curious about that as well. OTOH, I've never tried it. > > chris > > From cjfields at illinois.edu Mon Apr 16 11:51:19 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 16 Apr 2012 15:51:19 +0000 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: Message-ID: Peter, Mariam, Turns out they do document this: http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.retmax_3 I can also confirm this, just ran a quick test locally with a simple script to retrieve a set of protein samples. The esearch count was 27382, but the retrieved set maxed out at 10K exactly. [cjfields at pyrimidine-laptop eutils]$ perl limit_test.pl 27382 [cjfields at pyrimidine-laptop eutils]$ grep -c '^>' seqs.aa 10000 Not sure if there are similar constraints using NCBI's SOAP interface, but I wouldn't be surprised. chris On Apr 16, 2012, at 9:12 AM, Fields, Christopher J wrote: > Yeah, if you run a retrieval in batches you need a step to rerun the request in case it fails, particularly if the request is occurring at a busy time. We do the same with bioperl's interface, very similar to what Peter suggests. > > chris > > On Apr 16, 2012, at 9:04 AM, Mariam Reyad Rizkallah wrote: > >> OH WOW! >> >> It works like charm! Peter, thank you very much for insight and for taking >> the time to fix my script. >> >> I do appreciate. Thank you. >> >> Mariam >> Blog post here: >> http://opensourcepharmacist.wordpress.com/2012/04/16/the-community-biopython-overcomes-limitations/ >> >> >> On Sat, Apr 14, 2012 at 9:32 PM, Peter Cock wrote: >> >>> On Sat, Apr 14, 2012 at 6:39 PM, Peter Cock >>> wrote: >>>> >>>> A somewhat brute force approach would be to do the >>>> search (don't bother with the history) and get the 10313 >>>> GI numbers. Then use epost+efetch to grab the records >>>> in batches of say 1000. >>>> >>> >>> That does work (see below), but not all the time. A potential >>> advantage of this way is that each fetch batch is a separate >>> session, so retrying it should be straightforward. 
>>> >>> Peter >>> >>> #!/usr/bin/python >>> import sys >>> from Bio import Entrez >>> Entrez.email = "mariam.rizkallah at gmail.com" >>> Entrez.email = "p.j.a.cock at googlemail.com" >>> txid = 543769 >>> name = "Rhizaria" >>> >>> search_handle = Entrez.esearch(db="nucest",term="txid%s[Organism:exp]" >>> %(txid), retmax="20000") >>> search_results = Entrez.read(search_handle) >>> search_handle.close() >>> gi_list = search_results["IdList"] >>> count = int(search_results["Count"]) >>> print count >>> assert count == len(gi_list), len(gi_list) >>> >>> out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count) >>> out_handle = open(out_fasta, "a") >>> >>> out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count) >>> out_handle = open(out_fasta, "a") >>> >>> ## Approach1: gets tags within the fasta file Unable to >>> obtain query #1 >>> batch_size = 1000 >>> for start in range(0,count,batch_size): >>> end = min(count, start+batch_size) >>> batch = gi_list[start:end] >>> print "Going to download record %i to %i using epost+efetch" % >>> (start+1, end) >>> post_results = Entrez.read(Entrez.epost("nucest", id=",".join(batch))) >>> webenv = post_results["WebEnv"] >>> query_key = post_results["QueryKey"] >>> fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", >>> retmode="text", webenv=webenv, query_key=query_key) >>> data = fetch_handle.read() >>> assert data.startswith(">"), data >>> fetch_handle.close() >>> out_handle.write(data) >>> print "Done" >>> out_handle.close() >>> >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Mon Apr 16 12:15:24 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 16 Apr 2012 17:15:24 +0100 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: Message-ID: On Mon, Apr 16, 2012 at 4:51 PM, Fields, Christopher J wrote: > Peter, Mariam, > > Turns out they do document this: > > ? http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.retmax_3 > > I can also confirm this, just ran a quick test locally with a simple script > to retrieve a set of protein samples. ?The esearch count was 27382, > but the retrieved set maxed out at 10K exactly. Thanks Chris, well spotted! It would have been nice to have it on the main page too: http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetch_help.html Peter From mrrizkalla at gmail.com Mon Apr 16 12:19:31 2012 From: mrrizkalla at gmail.com (Mariam Reyad Rizkallah) Date: Mon, 16 Apr 2012 18:19:31 +0200 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: Message-ID: Oh! Thank you, Chris. So, there IS a limit! I emailed them asking whether there is a limit for records retrieval. They replied that The appropriate way is to do batch retrieval, with no emphasis on limits. Thank you. Mariam On Apr 16, 2012 5:51 PM, "Fields, Christopher J" wrote: > Peter, Mariam, > > Turns out they do document this: > > http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.retmax_3 > > I can also confirm this, just ran a quick test locally with a simple > script to retrieve a set of protein samples. The esearch count was 27382, > but the retrieved set maxed out at 10K exactly. 
> > [cjfields at pyrimidine-laptop eutils]$ perl limit_test.pl > 27382 > [cjfields at pyrimidine-laptop eutils]$ grep -c '^>' seqs.aa > 10000 > > Not sure if there are similar constraints using NCBI's SOAP interface, but > I wouldn't be surprised. > > chris > > On Apr 16, 2012, at 9:12 AM, Fields, Christopher J wrote: > > > Yeah, if you run a retrieval in batches you need a step to rerun the > request in case it fails, particularly if the request is occurring at a > busy time. We do the same with bioperl's interface, very similar to what > Peter suggests. > > > > chris > > > > On Apr 16, 2012, at 9:04 AM, Mariam Reyad Rizkallah wrote: > > > >> OH WOW! > >> > >> It works like charm! Peter, thank you very much for insight and for > taking > >> the time to fix my script. > >> > >> I do appreciate. Thank you. > >> > >> Mariam > >> Blog post here: > >> > http://opensourcepharmacist.wordpress.com/2012/04/16/the-community-biopython-overcomes-limitations/ > >> > >> > >> On Sat, Apr 14, 2012 at 9:32 PM, Peter Cock >wrote: > >> > >>> On Sat, Apr 14, 2012 at 6:39 PM, Peter Cock > > >>> wrote: > >>>> > >>>> A somewhat brute force approach would be to do the > >>>> search (don't bother with the history) and get the 10313 > >>>> GI numbers. Then use epost+efetch to grab the records > >>>> in batches of say 1000. > >>>> > >>> > >>> That does work (see below), but not all the time. A potential > >>> advantage of this way is that each fetch batch is a separate > >>> session, so retrying it should be straightforward. > >>> > >>> Peter > >>> > >>> #!/usr/bin/python > >>> import sys > >>> from Bio import Entrez > >>> Entrez.email = "mariam.rizkallah at gmail.com" > >>> Entrez.email = "p.j.a.cock at googlemail.com" > >>> txid = 543769 > >>> name = "Rhizaria" > >>> > >>> search_handle = Entrez.esearch(db="nucest",term="txid%s[Organism:exp]" > >>> %(txid), retmax="20000") > >>> search_results = Entrez.read(search_handle) > >>> search_handle.close() > >>> gi_list = search_results["IdList"] > >>> count = int(search_results["Count"]) > >>> print count > >>> assert count == len(gi_list), len(gi_list) > >>> > >>> out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count) > >>> out_handle = open(out_fasta, "a") > >>> > >>> out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count) > >>> out_handle = open(out_fasta, "a") > >>> > >>> ## Approach1: gets tags within the fasta file Unable to > >>> obtain query #1 > >>> batch_size = 1000 > >>> for start in range(0,count,batch_size): > >>> end = min(count, start+batch_size) > >>> batch = gi_list[start:end] > >>> print "Going to download record %i to %i using epost+efetch" % > >>> (start+1, end) > >>> post_results = Entrez.read(Entrez.epost("nucest", id=",".join(batch))) > >>> webenv = post_results["WebEnv"] > >>> query_key = post_results["QueryKey"] > >>> fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", > >>> retmode="text", webenv=webenv, query_key=query_key) > >>> data = fetch_handle.read() > >>> assert data.startswith(">"), data > >>> fetch_handle.close() > >>> out_handle.write(data) > >>> print "Done" > >>> out_handle.close() > >>> > >> _______________________________________________ > >> Biopython mailing list - Biopython at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/biopython > > > > > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > From cjfields at illinois.edu Mon Apr 16 12:26:16 2012 From: cjfields at illinois.edu 
(Fields, Christopher J) Date: Mon, 16 Apr 2012 16:26:16 +0000 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: Message-ID: <8F3A4E70-C1E5-4661-86D2-09811A82A086@illinois.edu> On Apr 16, 2012, at 11:15 AM, Peter Cock wrote: > On Mon, Apr 16, 2012 at 4:51 PM, Fields, Christopher J > wrote: >> Peter, Mariam, >> >> Turns out they do document this: >> >> http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.retmax_3 >> >> I can also confirm this, just ran a quick test locally with a simple script >> to retrieve a set of protein samples. The esearch count was 27382, >> but the retrieved set maxed out at 10K exactly. > > Thanks Chris, well spotted! > > It would have been nice to have it on the main page too: > http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetch_help.html > > Peter That URL is a redirect to the new documentation for me, the link I sent is just a few sections down, under optional parameters. chris From p.j.a.cock at googlemail.com Mon Apr 16 12:53:53 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 16 Apr 2012 17:53:53 +0100 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: <8F3A4E70-C1E5-4661-86D2-09811A82A086@illinois.edu> References: <8F3A4E70-C1E5-4661-86D2-09811A82A086@illinois.edu> Message-ID: On Mon, Apr 16, 2012 at 5:26 PM, Fields, Christopher J wrote: > On Apr 16, 2012, at 11:15 AM, Peter Cock wrote: >> Thanks Chris, well spotted! >> >> It would have been nice to have it on the main page too: >> http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetch_help.html >> >> Peter > > That URL is a redirect to the new documentation for me, the link > I sent is just a few sections down, under optional parameters. > > chris Same here - after a hard refresh. I wonder when that changed? Peter From jp.verta at gmail.com Mon Apr 16 15:23:25 2012 From: jp.verta at gmail.com (Jukka-Pekka Verta) Date: Mon, 16 Apr 2012 15:23:25 -0400 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: Message-ID: Hello fellow BioPythoneers, I stumbled upon the same problem as Mariam (without reading your previous correspondence) while I was trying to fetch all Picea sitchensis nucleotide records. Following Peter's code (epost+efetch), I still had the problem of fetch breakup (after 7000 sequences). The problem was fixed following Peter's idea of simply retrying the failed fetch using try/except. A collective thank you!
JP def fetchFasta(species,out_file): # script by Peter Cock with enhancement from Bio import Entrez from Bio import SeqIO Entrez.email = "jp.verta at gmail.com" search_handle = Entrez.esearch(db="nuccore",term=species+"[orgn]", retmax="20000") search_results = Entrez.read(search_handle) search_handle.close() gi_list = search_results["IdList"] count = int(search_results["Count"]) print count assert count == len(gi_list), len(gi_list) out_handle = open(out_file, "a") batch_size = 1000 for start in range(0,count,batch_size): end = min(count, start+batch_size) batch = gi_list[start:end] print "Going to download record %i to %i using epost+efetch" %(start+1, end) post_results = Entrez.read(Entrez.epost("nuccore", id=",".join(batch))) webenv = post_results["WebEnv"] query_key = post_results["QueryKey"] fetch_handle = Entrez.efetch(db="nuccore", rettype="fasta",retmode="text", webenv=webenv, query_key=query_key) data = fetch_handle.read() try: assert data.startswith(">"), data fetch_handle.close() out_handle.write(data) except: fetch_handle = Entrez.efetch(db="nuccore", rettype="fasta",retmode="text", webenv=webenv, query_key=query_key) data = fetch_handle.read() assert data.startswith(">"), data fetch_handle.close() out_handle.write(data) print "Done" out_handle.close() $ ./FetchFastaWithSpeciesName.py "Picea sitchensis" sitkaSequences.fa 19997 Going to download record 1 to 1000 using epost+efetch Going to download record 1001 to 2000 using epost+efetch Going to download record 2001 to 3000 using epost+efetch Going to download record 3001 to 4000 using epost+efetch Going to download record 4001 to 5000 using epost+efetch Going to download record 5001 to 6000 using epost+efetch Going to download record 6001 to 7000 using epost+efetch Going to download record 7001 to 8000 using epost+efetch Going to download record 8001 to 9000 using epost+efetch Going to download record 9001 to 10000 using epost+efetch Going to download record 10001 to 11000 using epost+efetch Going to download record 11001 to 12000 using epost+efetch Going to download record 12001 to 13000 using epost+efetch Going to download record 13001 to 14000 using epost+efetch Going to download record 14001 to 15000 using epost+efetch Going to download record 15001 to 16000 using epost+efetch Going to download record 16001 to 17000 using epost+efetch Going to download record 17001 to 18000 using epost+efetch Going to download record 18001 to 19000 using epost+efetch Going to download record 19001 to 19997 using epost+efetch Done On 2012-04-14, at 1:39 PM, Peter Cock wrote: > > This is how I believe the NCBI expect this task to be done. > In this specific case it seems to be an NCBI failure. > Perhaps a loop to retry the efetch two or three times might > work? It could be the whole history session breaks at the > NCBI end though... > > A somewhat brute force approach would be to do the > search (don't bother with the history) and get the 10313 > GI numbers. Then use epost+efetch to grab the records > in batches of say 1000. > > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From bjorn_johansson at bio.uminho.pt Tue Apr 17 02:44:51 2012 From: bjorn_johansson at bio.uminho.pt (=?ISO-8859-1?Q?Bj=F6rn_Johansson?=) Date: Tue, 17 Apr 2012 07:44:51 +0100 Subject: [Biopython] interactive shell to use with biopython? 
Message-ID: Hi all, I would like to know what interactive shell might be a good alternative to use with biopython. Ideally, it should be possible to save interactive commands and save them to run later. Could you give me some examples of what you are using? Thanks, Bjorn -- ______O_________oO________oO______o_______oO__ Björn Johansson Assistant Professor Department of Biology University of Minho Campus de Gualtar 4710-057 Braga PORTUGAL www.bio.uminho.pt Google profile metabolicengineeringgroup Work (direct) +351-253 601517 | mob. +351-967 147 704 Dept of Biology (secr) +351-253 60 4310 | fax +351-253 678980 From ajperry at pansapiens.com Tue Apr 17 03:42:36 2012 From: ajperry at pansapiens.com (Andrew Perry) Date: Tue, 17 Apr 2012 17:42:36 +1000 Subject: [Biopython] interactive shell to use with biopython? In-Reply-To: References: Message-ID: On Tue, Apr 17, 2012 at 4:44 PM, Björn Johansson < bjorn_johansson at bio.uminho.pt> wrote: > Hi all, > > I would like to know what interactive shell might be a good > alternative to use with biopython. > Ideally, it should be possible to save interactive commands and save > them to run later. > > Could you give me some examples of what you are using? > > Thanks, > Bjorn > > You might want to check out iPython in notebook mode. I've only played with it briefly, but it looks promising for interactive analysis, and cases where you'd like to present the transcript to others. 'Regular' commandline iPython will also allow you to save the history with the %save command. See: http://ipython.org/ipython-doc/stable/interactive/htmlnotebook.html To get an idea of how it works, see Titus Brown's demonstration: http://www.youtube.com/watch?feature=player_detailpage&v=HaS4NXxL5Qc#t=132s Andrew Perry Postdoctoral Fellow Whisstock Lab Department of Biochemistry and Molecular Biology Monash University, Clayton Campus, PO Box 13d, VIC, 3800, Australia. Mobile: +61 409 808 529 From eric.talevich at gmail.com Tue Apr 17 21:37:50 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 17 Apr 2012 21:37:50 -0400 Subject: [Biopython] Bio.Phylo: writing newick trees with internal node names In-Reply-To: References: <4F632E69.8010906@stats.ox.ac.uk> <4F69A98B.3040504@stats.ox.ac.uk> Message-ID: On Thu, Mar 22, 2012 at 7:29 PM, Eric Talevich wrote: > On Wed, Mar 21, 2012 at 6:12 AM, Tanya Golubchik > wrote: >> Also, the 'is_aligned' sequence property disappears when a tree is saved >> in phyloxml format and then read back using Phylo.read: >> >>>>> print tree >> Phylogeny(rooted=True, branch_length_unit='SNV') >> Clade(branch_length=0.0, name='N1') >> Clade(branch_length=0.0, name='C00000761') >> BranchColor(blue=0, green=128, red=0) >> Sequence(type='dna') >> MolSeq(value='CCTTTCTATGTTCTGGACTGACGTTAAACGA', >> is_aligned=True) >> Clade(branch_length=0.0, name='C00000763') >> BranchColor(blue=0, green=0, red=255) >> Sequence(type='dna') >> MolSeq(value='CCTTTcTATGTtCTGGACTGACGTTAAACGA', >> is_aligned=True) >> >>>>> Phylo.write(tree, myfile, 'phyloxml') >> 1 >>>>> tree2 = Phylo.read(myfile, 'phyloxml') >>>>> print tree2 >> Phylogeny(rooted=True, branch_length_unit='SNV') >> Clade(branch_length=0.0, name='N1') >> Clade(branch_length=0.0, name='C00000761') >> BranchColor(blue=0, green=128, red=0) >> Sequence(type='dna') >> MolSeq(value='CCTTTCTATGTTCTGGACTGACGTTAAACGA') >> Clade(branch_length=0.0, name='C00000763') >>
BranchColor(blue=0, green=0, red=255) >> Sequence(type='dna') >> MolSeq(value='CCTTTcTATGTtCTGGACTGACGTTAAACGA') >> > > This looks like a bug, too. (Thanks for finding these!) I don't > immediately see the cause of the problem, I'll try to take a crack at > it soon. I finally had a chance to look at this again. It's fixed in the trunk, so if you're working off the development build of Biopython from GitHub, the is_aligned property should be written properly now. From marc.saric at gmx.de Wed Apr 18 16:58:18 2012 From: marc.saric at gmx.de (Marc Saric) Date: Wed, 18 Apr 2012 22:58:18 +0200 Subject: [Biopython] Is this a valid Genbank feature description or a Biopython bug? In-Reply-To: References: Message-ID: <4F8F2AEA.8060700@gmx.de> Hi all, sorry for crossposting (this has also been published on stackoverflow): I stumbled upon a Genbank-formatted file (shown here as a minimal dummy example), which contains a nested feature like this: FEATURES Location/Qualifiers xxxx_domain complement(complement(1..145)) Such a feature crashes the current Biopython Genbank parser (1.59 release), but it apparently did not in former releases (e.g. 1.55). Apparently the behaviour was already present in 1.57. From the Biopython bugtracker, it seems that the old locationparser code got removed in 1.56: From what I could deduce from the format description on ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt and http://www.insdc.org/documents/feature_table.html#3.4.2 this is most likely invalid. Can someone comment on this, i.e. is this a glitch in Biopython or in the format of the Genbank file? A full demo file: LOCUS XXXXXXXXXXXXXX 240 bp DNA circular 17-JAN-2012 DEFINITION xxxxxx. KEYWORDS xx. SOURCE ORGANISM FEATURES Location/Qualifiers xxxx_domain complement(complement(1..145)) /vntifkey="1" /label=A label /note="A note" BASE COUNT 75 a 57 c 42 g 66 t ORIGIN 1 tttacaaaac gcattttcaa accttgggta ctaccccctt ttaaatatcc gaatacacta 61 ataaacgctc tttcctttta ggtaaacccg ccaatatata ctgatacaca ctgatagttt 121 aaactagatg cagtggccga ccatcagatc tagtaggaaa cagctatgac catgattacg 181 cattacttat ttaagatcaa ccgtaccagt ataccctgcc agcatgatgg aaacctccct // A minimum demo program to show the error (assumes Biopython 1.59 and Python 2.7 are installed and the above mentioned file is available as "test.gb"): #!/usr/bin/env python from Bio import SeqIO s = SeqIO.read(open("test.gb", "r"), "genbank") This crashes with raise LocationParserError(location_line) Bio.GenBank.LocationParserError: complement(1..145) -- Bye, Marc Saric http://www.marcsaric.de From p.j.a.cock at googlemail.com Wed Apr 18 17:31:30 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 18 Apr 2012 22:31:30 +0100 Subject: [Biopython] Is this a valid Genbank feature description or a Biopython bug? In-Reply-To: <4F8F2AEA.8060700@gmx.de> References: <4F8F2AEA.8060700@gmx.de> Message-ID: On Wed, Apr 18, 2012 at 9:58 PM, Marc Saric wrote: > > Hi all, > > sorry for crossposting (this has also been published on stackoverflow): > > I stumbled upon a Genbank-formatted file (shown here as a minimal > dummy example), which contains a nested feature like this: > > FEATURES Location/Qualifiers > xxxx_domain
complement(complement(1..145)) > I believe that is an invalid location. Was this from an NCBI file, or elsewhere? Note that for Biopython 1.60 (next release) we plan to treat bad locations as a warning rather than an error that stops parsing. Peter From p.j.a.cock at googlemail.com Thu Apr 19 11:46:18 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 19 Apr 2012 16:46:18 +0100 Subject: [Biopython] Blast Two sequences from a python script In-Reply-To: References: Message-ID: On Sun, Apr 15, 2012 at 8:55 AM, Laura Scearce wrote: > I have a list of pairs of proteins and I want to compare speed and accuracy > of "BLAST Two Sequences" to a Smith-Waterman program for alignment. I know > there is a "Blast Two Sequences" option on NCBI website, but I would like > to run it from a python script. Perhaps Biopython has this capability? If I > cannot use Blast Two Sequences, I will compare different versions of > Smith-Waterman, but this would not be nearly as exciting :) OR, if anyone > has another idea for a great senior year project in Bioinformatics > involving comparing pairs of proteins, please don't hesitate to let me > know? Thank you in advance. I would suggest looking at the EMBOSS tool water for Smith-Waterman alignments. http://emboss.open-bio.org/wiki/Appdoc:Water See also: http://emboss.open-bio.org/wiki/Appdoc:Needle and http://emboss.open-bio.org/wiki/Appdoc:Matcher For BLAST, the simplest option might be to generate temporary input FASTA files, then use the BLAST+ command line tools with the -query and -subject options. This way you don't have to make temporary BLAST databases (although it isn't quite as fast). Peter From legendre17 at hotmail.com Thu Apr 19 17:27:32 2012 From: legendre17 at hotmail.com (Tiberiu Tesileanu) Date: Thu, 19 Apr 2012 21:27:32 +0000 Subject: [Biopython] Bio.pairwise2 alignment slow Message-ID: Hi, I've noticed that Bio.pairwise2 alignments tend to be very slow; e.g., Bio.pairwise2.align.globalds is about 100 times slower than Matlab's swalign... (this is on a Macbook Air running Mac OS X Lion). Is this expected, or am I doing something wrong? Is there a way to make sure that the C version of the code is used? Is there an alternative that is similarly easy to use, but faster? Thanks! Tibi From bjorn_johansson at bio.uminho.pt Sun Apr 22 04:05:35 2012 From: bjorn_johansson at bio.uminho.pt (=?ISO-8859-1?Q?Bj=F6rn_Johansson?=) Date: Sun, 22 Apr 2012 09:05:35 +0100 Subject: [Biopython] interactive shell to use with biopython? In-Reply-To: References: Message-ID: Thank you for the tip, iPython seems very useful as it works the same way as the normal python interpreter. The notebook looks impressive, I will give it a try when iPython 0.13 comes out. /bjorn On Tue, Apr 17, 2012 at 08:42, Andrew Perry wrote: > On Tue, Apr 17, 2012 at 4:44 PM, Björn Johansson < > bjorn_johansson at bio.uminho.pt> wrote: > >> Hi all, >> >> I would like to know what interactive shell might be a good >> alternative to use with biopython. >> Ideally, it should be possible to save interactive commands and save >> them to run later. >> >> Could you give me some examples of what you are using? >> >> Thanks, >> Bjorn >> >> > You might want to check out iPython in notebook mode. I've only played > with it briefly, but it looks promising for interactive analysis, and cases > where you'd like to present the transcript to others. > > 'Regular' commandline iPython will also allow you to save the history with > the %save command.
> > See: http://ipython.org/ipython-doc/stable/interactive/htmlnotebook.html > > To get an idea of how it works, see Titus Brown's demonstration: > http://www.youtube.com/watch?feature=player_detailpage&v=HaS4NXxL5Qc#t=132s > > Andrew Perry > > Postdoctoral Fellow > Whisstock Lab > Department of Biochemistry and Molecular Biology > Monash University, Clayton Campus, PO Box 13d, VIC, 3800, Australia. > Mobile: +61 409 808 529 > > -- ______O_________oO________oO______o_______oO__ Björn Johansson Assistant Professor Department of Biology University of Minho Campus de Gualtar 4710-057 Braga PORTUGAL www.bio.uminho.pt Google profile metabolicengineeringgroup Work (direct) +351-253 601517 | mob. +351-967 147 704 Dept of Biology (secr) +351-253 60 4310 | fax +351-253 678980 From bjorn_johansson at bio.uminho.pt Sun Apr 22 04:23:28 2012 From: bjorn_johansson at bio.uminho.pt (=?ISO-8859-1?Q?Bj=F6rn_Johansson?=) Date: Sun, 22 Apr 2012 09:23:28 +0100 Subject: [Biopython] double stranded sequence object / cloning simulation Message-ID: Hi, I am looking for a way to simulate cloning using Python. I was thinking of a script where you could combine two sequence objects into a recombinant molecule. Does anybody know if this has been done? I think a good way to do that is to specify a double stranded sequence object where the topology and properties of the ends of the DNA molecule are preserved in a property of the object itself. Have there been any attempts at this in Biopython or elsewhere? I wouldn't want to reinvent the wheel here. PyPI and Google do not seem to give me anything on this. I was thinking something along these lines: >>> stuffer1, dsSeqobj1, stuffer2 = dsSeqobj1.digest(BamHI) which creates a linear dsseq object with staggered ends. >>> clone_a, clone_b = ligate( dsSeqobj1, dsSeqobj2 ) would create two circular dsseq objects if the ends are compatible. Any ideas along these lines? cheers, bjorn -- ______O_________oO________oO______o_______oO__ Björn Johansson Assistant Professor Department of Biology University of Minho Campus de Gualtar 4710-057 Braga PORTUGAL www.bio.uminho.pt Google profile metabolicengineeringgroup Work (direct) +351-253 601517 | mob. +351-967 147 704 Dept of Biology (secr) +351-253 60 4310 | fax +351-253 678980 From rbuels at gmail.com Mon Apr 23 19:49:10 2012 From: rbuels at gmail.com (Robert Buels) Date: Mon, 23 Apr 2012 19:49:10 -0400 Subject: [Biopython] Announcing OBF Google Summer of Code Accepted Students Message-ID: <4F95EA76.4030004@gmail.com> Hello all, I'm very pleased and excited to announce that the Open Bioinformatics Foundation has selected 5 very capable students to work on OBF projects this summer as part of the Google Summer of Code program. The accepted students, their projects, and their mentors (in alphabetical order): Wibowo Arindrarto SearchIO Implementation in Biopython mentored by Peter Cock Lenna Peterson Diff My DNA: Development of a Genomic Variant Toolkit for Biopython mentored by Brad Chapman Marjan Povolni The world's fastest parallelized GFF3/GTF parser in D, and an interfacing biogem plugin for Ruby mentored by Pjotr Prins, Francesco Strozzi, Raoul Bonnal Artem Tarasov Fast parallelized GFF3/GTF parser in C++, with Ruby FFI bindings mentored by Pjotr Prins, Francesco Strozzi, Raoul Bonnal Clayton Wheeler Multiple Alignment Format parser for BioRuby mentored by Francesco Strozzi and Raoul Bonnal As in every year, we received many great applications and ideas.
However, funding and mentor resources are limited, and we were not able to accept as many as we would have liked. Our deepest thanks to all the students who applied: we sincerely appreciate the time and effort you put into your applications, and hope you will still consider being a part of the OBF's open source projects, even without Google funding. I speak for myself and all of the mentors who read and scored applications when I say that we were truly honored by the number and quality of the applications we received. For the accepted students: congratulations! You have risen to the top of a very competitive application process. Now it's time to "put your money where your mouth is", as the saying goes. Let's get out there and write some great code this summer! Best regards, Rob ---- Robert Buels OBF GSoC 2012 Administrator From erikclarke at gmail.com Mon Apr 23 19:54:20 2012 From: erikclarke at gmail.com (Erik C) Date: Mon, 23 Apr 2012 16:54:20 -0700 Subject: [Biopython] Bug in Geo.parser when reading some GDS files Message-ID: Hi all, When parsing an NCBI GEO dataset (GDS) file such as this: ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/GDS_full/GDS1962_full.soft.gz the Bio.Geo.parse(handle) method fails with an assertion error. Example code:

>> for record in Geo.parse(open('GDS1962_full.soft')): print record
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Geo/__init__.py", line 54, in parse
    assert key not in record.col_defs
AssertionError

It appears that this is due to the failed assumption that each column header exists only once, when it seems that a common trend in GDS files is to have two columns each titled GO:Function, GO:Process, and GO:Component. The first of these duplicate columns is the Gene Ontology terms for the probe at that row, and the second column is the GO ids for those terms. From GDS3646_full.soft:

#GO:Function = Gene Ontology Function term
#GO:Process = Gene Ontology Process term
#GO:Component = Gene Ontology Component term
#GO:Function = Gene Ontology Function identifier
#GO:Process = Gene Ontology Process identifier
#GO:Component = Gene Ontology Component identifier

While the duplicate header names are not ideal for tabular data, these GO columns do seem to appear regularly in GDS files (see GDS1962, GDS3646, and others) and they consistently break the parser. This assertion should either be disabled for this particular case, or replaced with a more flexible column header check. I suggest using the assertion only for the sample columns (those prefixed with GSM). I'm using Biopython 1.59 (the issue exists also in the Git repository) with Python 2.7.1 on Mac OS 10.7.3. Cheers, Erik From p.j.a.cock at googlemail.com Tue Apr 24 07:24:20 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 24 Apr 2012 12:24:20 +0100 Subject: [Biopython] OBF GSoC students weekly progress reports Message-ID: Hello all, First, to echo Rob, congratulations to our selected students: http://news.open-bio.org/news/2012/04/students-selected-for-gsoc/ http://lists.open-bio.org/pipermail/gsoc/2012/000049.html Weekly Progress Reports: To encourage community bonding and awareness of what the GSoC 2012 students are doing, this year the OBF is being much clearer about our progress report expectations. We would like every student to set up a blog for the GSoC project (or a category/tag on your existing blog) which you will use to summarize your progress every week, as well as longer posts at the half way evaluation, and at the end of the summer.
In addition, after publishing each blog post, we expect you to email the URL and the text of the blog (or if important images or formatting would be lost, at least a short summary) to the host project's mailing list(s) (check with your mentors if the project has more than one) AND the gsoc at open-bio.org mailing list. You will be writing under your own name, but with a clear association with your mentors, the OBF and its projects, so please take this seriously and be professional. Remember this will become part of your online presence, and potentially be looked at by future employers and colleagues. Please talk to your mentors about this during the "community bonding" stage of GSoC (i.e. the next few weeks before you actually start). Thank you, Peter (On behalf of the OBF GSoC mentors and projects) Note: As per Rob's earlier email, could both students and mentors please ensure you have subscribed to the public OBF GSoC email list at http://lists.open-bio.org/mailman/listinfo/gsoc (I have BCC'd you on this email just in case you haven't done this yet). Thanks! From p.j.a.cock at googlemail.com Tue Apr 24 08:46:32 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 24 Apr 2012 13:46:32 +0100 Subject: [Biopython] Biopython GSoC 2012 Message-ID: Dear all, As you will have read in Rob's email [1], of the five Google Summer of Code (GSoC) students accepted by the OBF this year, two are going to be working on Biopython projects (in alphabetical order): Wibowo Arindrarto SearchIO Implementation in Biopython mentored by Peter Cock Lenna Peterson Diff My DNA: Development of a Genomic Variant Toolkit for Biopython mentored by Brad Chapman with Reece Hart and James Casbon Congratulations to you both, and the other accepted students. Sadly we had excellent proposals from other students worthy of being chosen, but not enough mentors to go round. If you are still eligible next year, we hope you will apply again. We are also hoping you will continue to stay involved and contribute to the Biopython community. Thank you all for your hard work, students and mentors. We're looking forward to another productive summer of code! Peter, on behalf of the mentors and Biopython. [1] http://news.open-bio.org/news/2012/04/students-selected-for-gsoc/ http://lists.open-bio.org/pipermail/biopython/2012-April/007976.html From carolinechang810 at gmail.com Tue Apr 24 13:51:40 2012 From: carolinechang810 at gmail.com (Caroline Chang) Date: Tue, 24 Apr 2012 10:51:40 -0700 Subject: [Biopython] NCBIWWW qblast Times Out? Message-ID: Hi, I'm not sure I'm using the NCBIWWW module correctly. I've followed the example code given in the tutorial ( http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc82) for my own single sequence file. However, it seems to hang on a line of code where it sleeps, and then my request times out. Is anyone else having this error, or am I using this code incorrectly? Thanks! Caroline From p.j.a.cock at googlemail.com Tue Apr 24 13:55:51 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 24 Apr 2012 18:55:51 +0100 Subject: [Biopython] [GSoC] OBF GSoC students weekly progress reports In-Reply-To: References: Message-ID: On Tue, Apr 24, 2012 at 6:21 PM, Lenna Peterson wrote: > Hi all, > > > I'm very excited to be participating in GSoC '12 with Biopython! > > My development blog is on tumblr, which I chose primarily because it > supports markdown syntax, which I'm used to from GitHub.
> > Here's my gsoc12 tag: http://arklenna.tumblr.com/tagged/gsoc2012 > > However, Tumblr doesn't allow post comments. Will I need to switch to a > blog platform that allows comments? > > Cheers, > > Lenna Hi Lenna, Great - you've got a blog already, and you're also the first student to reply :) Blog comments could be nice, but personally in your shoes I'd direct any discussion to the biopython(-dev) mailing list. e.g. 1. Post weekly update blog, get blog post URL 2. Send email with summary, including blog post URL 3. Go to the mailing list archive, get archived email URL 4. Update blog post to link to email (and thus any thread from it, at least for that month). A little cumbersome, but it would save you moving your blog? I'd actually be happier with most discussion on the biopython-dev list rather than blog comments, or even github (which will still be useful for things like code reviews). This may be different for the other projects - I know BioRuby uses IRC much more for example, but even there they've tried to post archives of important IRC discussions to their mailing list too. Thank you! Peter From w.arindrarto at gmail.com Tue Apr 24 15:01:23 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Tue, 24 Apr 2012 21:01:23 +0200 Subject: [Biopython] [GSoC] OBF GSoC students weekly progress reports In-Reply-To: References: Message-ID: On Tue, Apr 24, 2012 at 19:55, Peter Cock wrote: > On Tue, Apr 24, 2012 at 6:21 PM, Lenna Peterson wrote: >> Hi all, >> >> >> I'm very excited to be participating in GSoC '12 with Biopython! >> >> My development blog is on tumblr, which I chose primarily because it >> supports markdown syntax, which I'm used to from GitHub. >> >> Here's my gsoc12 tag: http://arklenna.tumblr.com/tagged/gsoc2012 >> >> However, Tumblr doesn't allow post comments. Will I need to switch to a >> blog platform that allows comments? >> >> Cheers, >> >> Lenna > > Hi Lenna, > > Great - you've got a blog already, and you're also the first student to reply :) > > Blog comments could be nice, but personally in your shoes I'd > direct any discussion to the biopython(-dev) mailing list. e.g. > > 1. Post weekly update blog, get blog post URL > 2. Send email with summary, including blog post URL > 3. Go to the mailing list archive, get archived email URL > 4. Update blog post to link to email (and thus any thread from it, > at least for that month). > > A little cumbersome, but it would save you moving your blog? > > I'd actually be happier with most discussion on the biopython-dev > list rather than blog comments, or even github (which will still be > useful for things like code reviews). > > This may be different for the other projects - I know BioRuby > uses IRC much more for example, but even there they've tried > to post archives of important IRC discussions to their mailing > list too. > > Thank you! > > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython Hi everyone, Wibowo Arindrarto here, but you can just call me Bow for short :). I'm very excited to be accepted into GSoC with OBF as well! I will be blogging on my site: http://bow.web.id/blog, and I've actually made my inaugural GSoC post just a few hours after I heard the news, here: http://bow.web.id/blog/2012/04/google-summer-of-code-is-on/. I'll be posting all GSoC related posts under the `gsoc` tag, accessible through this URL: http://bow.web.id/blog/tag/gsoc/.
To follow Peter's suggestion, I'll post my weekly progress on this mailing list for everyone to see, too. cheers, Bow From rbuels at gmail.com Tue Apr 24 15:13:48 2012 From: rbuels at gmail.com (Robert Buels) Date: Tue, 24 Apr 2012 15:13:48 -0400 Subject: [Biopython] [GSoC] OBF GSoC students weekly progress reports In-Reply-To: References: Message-ID: <4F96FB6C.3010805@gmail.com> Bow, make sure you subscribe to the OBF GSoC mailing list. http://lists.open-bio.org/mailman/listinfo/gsoc Rob On 04/24/2012 03:01 PM, Wibowo Arindrarto wrote: > On Tue, Apr 24, 2012 at 19:55, Peter Cock wrote: >> On Tue, Apr 24, 2012 at 6:21 PM, Lenna Peterson wrote: >>> Hi all, >>> >>> >>> I'm very excited to be participating in GSoC '12 with Biopython! >>> >>> My development blog is on tumblr, which I chose primarily because it >>> supports markdown syntax, which I'm used to from GitHub. >>> >>> Here's my gsoc12 tag: http://arklenna.tumblr.com/tagged/gsoc2012 >>> >>> However, Tumblr doesn't allow post comments. Will I need to switch to a >>> blog platform that allows comments? >>> >>> Cheers, >>> >>> Lenna >> >> Hi Lenna, >> >> Great - you've got a blog already, and you're also the first student to reply :) >> >> Blog comments could be nice, but personally in your shoes I'd >> direct any discussion to the biopython(-dev) mailing list. e.g. >> >> 1. Post weekly update blog, get blog post URL >> 2. Send email with summary, including blog post URL >> 3. Go to the mailing list archive, get archived email URL >> 4. Update blog post to link to email (and thus any thread from it, >> at least for that month). >> >> A little cumbersome, but it would save you moving your blog? >> >> I'd actually be happier with most discussion on the biopython-dev >> list rather than blog comments, or even github (which will still be >> useful for things like code reviews). >> >> This may be different for the other projects - I know BioRuby >> uses IRC much more for example, but even there they've tried >> to post archives of important IRC discussions to their mailing >> list too. >> >> Thank you! >> >> Peter >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > Hi everyone, > > Wibowo Arindrarto here, but you can just call me Bow for short :). I'm > very excited to be accepted into GSoC with OBF as well! > > I will be blogging on my site: http://bow.web.id/blog, and I've > actually made my inaugural GSoC post just a few hours after I heard > the news, here: > http://bow.web.id/blog/2012/04/google-summer-of-code-is-on/. I'll be > posting all GSoC related posts under the `gsoc` tag, accessible through > this URL: http://bow.web.id/blog/tag/gsoc/. To follow Peter's > suggestion, I'll post my weekly progress on this mailing list for > everyone to see, too. > > cheers, > Bow > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Wed Apr 25 05:17:16 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 25 Apr 2012 10:17:16 +0100 Subject: [Biopython] NCBIWWW qblast Times Out? In-Reply-To: References: Message-ID: On Tue, Apr 24, 2012 at 6:51 PM, Caroline Chang wrote: > Hi, > > I'm not sure I'm using the NCBIWWW module correctly. I've followed the example > code given in the tutorial ( > http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc82) for my own single > sequence file.
However, it seems to hang on a line of code where it > sleeps, and then my request times out. > > Is anyone else having this error, or am I using this code incorrectly? > > Thanks! > Caroline Hi Caroline, The most likely cause was the NCBI BLAST service being under heavy load (especially likely during USA working hours). Did this problem persist, and has it ever worked for you? If it has never worked for you it could be a network problem at your institute (e.g. some proxy settings). Another useful check would be to try running the unit test in the Tests folder of the Biopython source, test_NCBI_qblast.py - and see what that says. Peter From w.arindrarto at gmail.com Sat Apr 28 08:08:35 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Sat, 28 Apr 2012 14:08:35 +0200 Subject: [Biopython] Google Summer of Code Project: SearchIO in Biopython Message-ID: Hello everyone, This is Wibowo Arindrarto (or Bow, for short), one of the Google Summer of Code students who will work on Biopython over this summer. I will be working with Peter to add support for parsing search outputs from programs like BLAST and HMMER to Biopython, so that it's easier to extract information from their outputs. Having used some of these programs quite a lot myself, I'm really looking forward to implementing the feature. However, I do understand that it won't be just me who will use the module, but also many other Biopython users. So for everyone who is interested in giving input or critiques along the way, feel free to do so :). The official coding period starts in about a month from now. Until then, I will be doing all the preparatory work required so that coding will proceed as smoothly as possible. This will include preparing the test cases and the SearchIO attribute / object naming convention, as well as discussing anything related to its proposed implementation. Finally, here are some links related to the project that might interest you. 1. My main biopython branch for development: https://github.com/bow/biopython/tree/searchio. Since I will be building on top of Peter's SearchIO branch ( https://github.com/peterjc/biopython/tree/search-io-test), right now it only contains Peter's branch rebased against the latest master. 2. My GSoC proposal, which outlines my plans and timeline for the project: http://bit.ly/searchio-proposal 3. The proposed SearchIO naming convention (not 100% complete as of now, but will be filled in along the way): http://bit.ly/searchio-terms. One of the main goals of the project is to implement a common interface for BLAST et al, which requires SearchIO to have common attribute names that refer to different search output attributes. The link contains my proposed naming convention, which is still very open to change and discussion. Feel free to comment on the document and add your own ideas. 4. My blog, in which I will write weekly posts about the project's progress: http://bow.web.id/blog 5. An extra repo for all other auxiliary files and scripts that don't go into Biopython's code: https://github.com/bow/gsoc. That's it for now. Thanks for taking the time to read it :). I'm looking forward to a productive summer with Biopython. Have a nice weekend, Bow From igorrcosta at hotmail.com Sun Apr 1 03:04:28 2012 From: igorrcosta at hotmail.com (Igor Rodrigues da Costa) Date: Sun, 1 Apr 2012 03:04:28 +0000 Subject: [Biopython] Back translation support in Biopython Message-ID: Hi, I am interested in participating in GSoC this summer. I would
I would like to know if there is community support for a new project: Extending Seq class to add support to back translation of proteins (something like this: http://www.bork.embl.de/pal2nal/ ). If this project isn't strong enough at its own, it could be added to any existing project, or it could be complemented with others suggestions from the community. Thanks for your attention,Igor From p.j.a.cock at googlemail.com Sun Apr 1 08:51:12 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 1 Apr 2012 09:51:12 +0100 Subject: [Biopython] Back translation support in Biopython In-Reply-To: References: Message-ID: On Sun, Apr 1, 2012 at 4:04 AM, Igor Rodrigues da Costa wrote: > > Hi, > I am interested in participating in GSoC this summer. I would > like to know if there is community support for a new project: > Extending Seq class to add support to back translation of > proteins (something like this: http://www.bork.embl.de/pal2nal/ ). > If this project isn't strong enough at its own, it could be added > to any existing project, or it could be complemented with others > suggestions from the community. > Thanks for your attention,Igor Hi Igor, I don't think back translation in itself is nearly enough to be a GSoC project. Is it also problematic - we had a good long discussion about back translation, and what it might be useful for, back in 2008. In particular, assuming back translation to a simple nucleotide sequence (as a string or Seq object), what would it actually be useful for? See http://bugzilla.open-bio.org/show_bug.cgi?id=2618 which is now using https://redmine.open-bio.org/issues/2618 and the quite long and at times confusing thread: http://lists.open-bio.org/pipermail/biopython/2008-October/004588.html Did you have any other ideas or topics that interested you? Regards, Peter From chaitanya.talnikar at iitb.ac.in Sun Apr 1 09:42:43 2012 From: chaitanya.talnikar at iitb.ac.in (Chaitanya Talnikar) Date: Sun, 1 Apr 2012 15:12:43 +0530 Subject: [Biopython] GSoC project application: Representation and manipulation of genomic variants: Some questions In-Reply-To: <87wr626gc5.fsf@fastmail.fm> References: <87398vmxhj.fsf@fastmail.fm> <874nt9k11m.fsf@fastmail.fm> <87wr626gc5.fsf@fastmail.fm> Message-ID: I have uploaded a second draft incorporating the changes. Please provide comments on my proposal. Thanks, Chaitanya On Fri, Mar 30, 2012 at 6:43 AM, Brad Chapman wrote: > > Chaitanya; > Thanks for making this available. It's a great start and you need to > work from here on being much more detailed in your project plan. I left > specific comments in-line in the proposal. Let us know when you have a > revised version and we can work more. Thanks again, > Brad > >> Here's the google doc link, I have made it editable too. >> >> https://docs.google.com/document/d/12N1aEzagMZ8akc1mrfP4MxHdILT2wapjENJOoxZBIh0/edit >> >> On Wed, Mar 28, 2012 at 6:13 AM, Brad Chapman wrote: >> > >> > Chaitanya; >> > The easiest way to work on your proposal is to write it in a >> > public Google Doc and then share with the list. I don't yet have access >> > to all of the Melange GSoC project and I'd imagine others who might >> > have thoughts are in the same boat. As a side benefit it's also much >> > easier to collaborate on editing and notes. >> > >> > Brad >> > >> >> Hi, >> >> I have uploaded the first draft of my project proposal. I will add >> >> more sections to the project plan in a day or two. Just wanted to have >> >> the initial draft up. I hope to write a better proposal with your >> >> feedback. 
From chapmanb at 50mail.com Sun Apr 1 20:07:31 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Sun, 01 Apr 2012 16:07:31 -0400 Subject: [Biopython] GSoC project application: Representation and manipulation of genomic variants: Some questions In-Reply-To: References: <87398vmxhj.fsf@fastmail.fm> <874nt9k11m.fsf@fastmail.fm> <87wr626gc5.fsf@fastmail.fm> Message-ID: <87r4w7tdvg.fsf@fastmail.fm> Chaitanya; Thanks for the additional work on this, that's great work. I left specific comments in-line but my general suggestion is to keep expanding and clarifying the timeline. Up front work building a detailed timeline makes the summer work so much easier, as well as building a stronger proposal. Thanks again, Brad > I have uploaded a second draft incorporating the changes. Please > provide comments on my proposal. > Thanks, > Chaitanya > > On Fri, Mar 30, 2012 at 6:43 AM, Brad Chapman wrote: > > > > Chaitanya; > > Thanks for making this available. It's a great start and you need to > > work from here on being much more detailed in your project plan. I left > > specific comments in-line in the proposal. Let us know when you have a > > revised version and we can work more. Thanks again, > > Brad > > > >> Here's the google doc link, I have made it editable too. > >> > >> https://docs.google.com/document/d/12N1aEzagMZ8akc1mrfP4MxHdILT2wapjENJOoxZBIh0/edit > >> > >> On Wed, Mar 28, 2012 at 6:13 AM, Brad Chapman wrote: > >> > > >> > Chaitanya; > >> > The easiest way to work on your proposal is to write it in a > >> > public Google Doc and then share with the list. I don't yet have access > >> > to all of the Melange GSoC project and I'd imagine others who might > >> > have thoughts are in the same boat. As a side benefit it's also much > >> > easier to collaborate on editing and notes. > >> > > >> > Brad > >> > > >> >> Hi, > >> >> I have uploaded the first draft of my project proposal. I will add > >> >> more sections to the project plan in a day or two. Just wanted to have > >> >> the initial draft up. I hope to write a better proposal with your > >> >> feedback. > >> >> > >> >> Regards, > >> >> Chaitanya > >> >> > >> >> On Mon, Mar 26, 2012 at 4:37 PM, Brad Chapman wrote: > >> >> > > >> >> > Chaitanya; > >> >> > Thanks for the interest and specific questions. > >> >> > > >> >> >> 1. For the implementation of variants what would be better, to create > >> >> >> a new SeqVariant class from scratch or to extend the SeqFeature class > >> >> >> to accommodate variants? I guess a separate class would be better. > >> >> > > >> >> > My preference would be to see how far the SeqFeature class can take you > >> >> > before implementing a new class. It should be general enough to handle > >> >> > variant data, but the bigger challenge might be designing a lightweight > >> >> > representation that is compatible with existing SeqFeatures. > >> >> > > >> >> >> 2. While looking at the Biopython wiki I came across an implementation > >> >> >> of GFF at > >> >> >> https://github.com/chapmanb/bcbb/tree/master/gff > >> >> >> As GVF is an extension of GFF3, this module could be used for reading > >> >> >> GVF's too. Is this module a good start to modify it to support GVFs? > >> >> > > >> >> > That would be perfect. We're hoping to merge this into the Biopython > >> >> > code base before the next release.
There is also an existing VCF parser > >> >> > we'd love to use here: > >> >> > > >> >> > https://github.com/jamescasbon/PyVCF > >> >> > > >> >> >> 3. I've been going through the VCF documentation and SNPs, insertions > >> >> >> and deletions can be represented just like it is done in VCF, the > >> >> >> object would have a start position, length of reference sequence(no > >> >> >> need to store this sequence) and a list of alternate sequence objects. > >> >> >> I have to still look into the SV(Structural variants), rearrangements > >> >> >> and imprecise variant information, so this representation is only for > >> >> >> SNPs and small indels. The GVF has a very similar format for small > >> >> >> indels and SNPs, just that it provides an extra end position column > >> >> >> which is not required if we have the reference sequence. > >> >> > > >> >> > This sounds good. My general suggestion is to start writing your > >> >> > proposal as soon as possible. A concrete first draft will help with more > >> >> > detailed comments. The wiki has good information on the project plan: > >> >> > > >> >> > http://open-bio.org/wiki/Google_Summer_of_Code#When_you_apply > >> >> > > >> >> > and the NESCent wiki has some examples of well-written proposals from > >> >> > previous years: > >> >> > > >> >> > http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2012#Writing_your_application > >> >> > > >> >> > One of the key aspects is having a detailed week-by-week outline of your > >> >> > plans for the summer. > >> >> > > >> >> > Thanks again for the interest, > >> >> > Brad From sudeep495 at gmail.com Mon Apr 2 20:37:28 2012 From: sudeep495 at gmail.com (Sudeep Singh) Date: Tue, 3 Apr 2012 02:07:28 +0530 Subject: [Biopython] Gsoc 2012, SearchIO In-Reply-To: References: Message-ID: Dear Peter Cock, I am a fifth year dual degree student in computer science at the Indian Institute of Technology, Kharagpur. I have an interest in bioinformatics and have done a course and a couple of projects in this area. I am interested in the project SearchIO listed on the Ideas Page. Kindly let me know how I should proceed. Thanks Sudeep From p.j.a.cock at googlemail.com Tue Apr 3 08:51:59 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 3 Apr 2012 09:51:59 +0100 Subject: [Biopython] Gsoc 2012, SearchIO In-Reply-To: References: Message-ID: On Mon, Apr 2, 2012 at 9:37 PM, Sudeep Singh wrote: > Dear Peter Cock, > > I am a fifth year dual degree student in computer science at the Indian Institute > of Technology, Kharagpur. I have an interest in bioinformatics and have done > a course and a couple of projects in this area. I am interested in the > project SearchIO listed on the Ideas Page. > Kindly let me know how I should proceed. > > Thanks > Sudeep Hello Sudeep, Welcome to the Biopython mailing list :) Since you are interested in applying for Google Summer of Code (GSoC), you should also subscribe to the biopython-dev mailing list, which is where discussion about code for Biopython mostly happens. I wrote a more detailed email about my thoughts for SearchIO on the biopython-dev list last month: http://lists.open-bio.org/pipermail/biopython-dev/2012-March/009468.html You are welcome to write a GSoC proposal - but you will have to hurry as the deadline is this Friday 6 April. Please see: http://www.open-bio.org/wiki/Google_Summer_of_Code http://code.google.com/soc/ You should have a look at some of the previous projects online, including their project schedule which is an important part of the proposal.
http://biopython.org/wiki/Google_Summer_of_Code It is also important for us to gauge your programming ability and experience. If you can link to previous open source project contributions, that would be a good sign. I have suggested to other applicants that finding and reporting a bug in Biopython (even a mistake in the documentation) is a good start. Contributing a bug fix is even better ;) In the case of the SearchIO project idea, we'd also be looking for some evidence of familiarity with the tools whose output you would be working with (BLAST, FASTA, HMMER, etc). Perhaps you've used some in your studies? If so, you can write that in the proposal. You can send a draft proposal to me for comment and feedback, but I would encourage you to share it on the biopython-dev list for wider review - for example as a Google Doc with commenting enabled. Several of the other students have already done this. Don't leave this too late - I will be traveling Thursday 5 and Friday 6 April, so won't be giving anyone any last minute comments ;) Remember being selected is a competition - all the OBF GSoC project proposals will be reviewed and ranked, and projects then allocated based on how many students Google allocates to us. The SearchIO topic seems very popular, but only one student would be picked to work on this. Good luck, Peter From ivaylo.stoimenov at gmail.com Tue Apr 3 09:37:01 2012 From: ivaylo.stoimenov at gmail.com (Ivaylo Stoimenov) Date: Tue, 3 Apr 2012 11:37:01 +0200 Subject: [Biopython] Installation of BCBio pack Message-ID: Hi, I would like to use the nice GFF parser tool from the BCBio pack, but I am not sure how to install the pack on my machine. If someone could help me to install BCBio on Ubuntu 11.10 I would be so grateful. Thank you in advance. Best regards, Ivaylo Stoimenov From chapmanb at 50mail.com Tue Apr 3 13:11:13 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 03 Apr 2012 09:11:13 -0400 Subject: [Biopython] Installation of BCBio pack In-Reply-To: References: Message-ID: <87ehs5hsem.fsf@fastmail.fm> Ivaylo; > I would like to use the nice GFF parser tool from the BCBio pack, but I am not > sure how to install the pack on my machine. If someone could help me to > install BCBio on Ubuntu 11.10 I would be so grateful. Thank you in > advance. We're hoping to include this in the next release of Biopython. In the meantime it's a manual install:

git clone git://github.com/chapmanb/bcbb.git
cd bcbb/gff
python setup.py build
sudo python setup.py install

Hope this helps, Brad From chapmanb at 50mail.com Tue Apr 3 14:24:43 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 03 Apr 2012 10:24:43 -0400 Subject: [Biopython] Installation of BCBio pack In-Reply-To: References: <87ehs5hsem.fsf@fastmail.fm> Message-ID: <87398kj3kk.fsf@fastmail.fm> Ivaylo; > Thank you so much for the help (and for writing the tools in the first > place). However, I got a problem after trying to execute the commands. > After "python setup.py build", I am getting an error message saying > "ImportError: No module named setuptools". I wonder where is the problem. > Do I need to install setuptools from somewhere first? setuptools is a standard install framework for Python and provides the 'easy_install' command.
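[For example, once setuptools itself is in place, easy_install can fetch and install packages straight from PyPI with a one-liner; the package name below is just an illustration:

sudo easy_install biopython

]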
Instructions for installing it on different platforms are here: http://pypi.python.org/pypi/setuptools Or on Ubuntu you can do:

sudo apt-get install python-setuptools

Hope this helps, Brad From igorrcosta at hotmail.com Tue Apr 3 21:24:02 2012 From: igorrcosta at hotmail.com (Igor Rodrigues da Costa) Date: Tue, 3 Apr 2012 21:24:02 +0000 Subject: [Biopython] Back translation support in Biopython In-Reply-To: References: Message-ID: Thanks for your response! I think back translation has an obvious solution that avoids all those problems mentioned in the discussion you cited, which is to pass the nucleotide sequence as a parameter. It has plenty of uses: I have used it in my own research for comparing the evolutionary profile (ks/ka) of a list of aligned proteins in a multifasta (I made a script that fetched the CDS from NCBI using the Entrez module to get the nucleotide sequence); it aligns the codons of nucleotide sequences (a hard problem if the protein sequence is not available) and can also check for data integrity. Another topic of interest, also used in my projects, is the calculation of the Dn/Ds rate (non-synonymous / synonymous mutations * non-synonymous / synonymous loci) using the most popular models (Nei-Gojobori, Li, etc). It is very useful, as can be seen from its widespread use in papers (http://code.google.com/p/kaks-calculator/wiki/Citations) Similar projects:

https://github.com/tanghaibao/bio-pipeline/tree/master/synonymous_calculation/
http://www.bork.embl.de/pal2nal/
http://cran.r-project.org/web/packages/seqinr/index.html
http://services.cbu.uib.no/tools/kaks
http://code.google.com/p/kaks-calculator/

Thanks for your input, Igor > Date: Sun, 1 Apr 2012 09:51:12 +0100 > Subject: Re: [Biopython] Back translation support in Biopython > From: p.j.a.cock at googlemail.com > To: igorrcosta at hotmail.com > CC: biopython at lists.open-bio.org > > On Sun, Apr 1, 2012 at 4:04 AM, Igor Rodrigues da Costa > wrote: > > > > Hi, > > I am interested in participating in GSoC this summer. I would > > like to know if there is community support for a new project: > > Extending the Seq class to add support for back translation of > > proteins (something like this: http://www.bork.embl.de/pal2nal/ ). > > If this project isn't strong enough on its own, it could be added > > to any existing project, or it could be complemented with other > > suggestions from the community. > > Thanks for your attention, Igor > > Hi Igor, > > I don't think back translation in itself is nearly enough to be a > GSoC project. It is also problematic - we had a good long > discussion about back translation, and what it might be useful > for, back in 2008. In particular, assuming back translation to > a simple nucleotide sequence (as a string or Seq object), > what would it actually be useful for? > > See http://bugzilla.open-bio.org/show_bug.cgi?id=2618 which > is now using https://redmine.open-bio.org/issues/2618 and > the quite long and at times confusing thread: > http://lists.open-bio.org/pipermail/biopython/2008-October/004588.html > > Did you have any other ideas or topics that interested you?
> > Regards, > > Peter From reece at harts.net Wed Apr 4 00:33:28 2012 From: reece at harts.net (Reece Hart) Date: Tue, 3 Apr 2012 17:33:28 -0700 Subject: [Biopython] GSoC project application: Representation and manipulation of genomic variants: Some questions In-Reply-To: References: <87398vmxhj.fsf@fastmail.fm> <874nt9k11m.fsf@fastmail.fm> <87wr626gc5.fsf@fastmail.fm> Message-ID: On Sun, Apr 1, 2012 at 2:42 AM, Chaitanya Talnikar < chaitanya.talnikar at iitb.ac.in> wrote: > I have uploaded a second draft incorporating the changes. Please > provide comments on my proposal. > Hi Chaitanya- I also read your proposal last night. My comments mostly echo Brad's, although there are a couple of new ones I think. I'll be happy to reread or answer questions as needed. -Reece From eric.talevich at gmail.com Wed Apr 4 01:49:17 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 3 Apr 2012 21:49:17 -0400 Subject: [Biopython] Back translation support in Biopython In-Reply-To: References: Message-ID: Hi Igor, It sounds like you're referring to aligning amino acid sequences to codon sequences, as PAL2NAL does. This is different from what most people mean by back translation, but as you point out, certainly useful. If you write a function that can match a protein sequence alignment to a set of raw CDS sequences, returning a nucleotide alignment based on the codon-to-amino-acid mapping, that would be useful. However, PAL2NAL does exactly that, plus a bit more, and is a fairly well-known and easily obtained program. Personally, I would prefer to write a wrapper for PAL2NAL under Bio.Align.Applications, using the existing Bio.Applications framework. Once the user has a codon alignment, dn/ds and many other calculations based on evolutionary models can be performed with our PAML wrappers, under Bio.Phylo.PAML. I agree there is room in Biopython to make this workflow easier to perform. (Although I wouldn't be able to mentor such a project under GSoC this year.) Best, Eric On Tue, Apr 3, 2012 at 5:24 PM, Igor Rodrigues da Costa < igorrcosta at hotmail.com> wrote: > > Thanks for your response! > I think back translation has an obvious solution that avoids all those > problems mentioned in the discussion you cited, which is to pass the > nucleotide sequence as a parameter. It has plenty of uses: I have used > it in my own research for comparing the evolutionary profile (ks/ka) of a > list of aligned proteins in a multifasta (I made a script that fetched the > CDS from NCBI using the Entrez module to get the nucleotide sequence); it > aligns the codons of nucleotide sequences (a hard problem if the protein > sequence is not available) and can also check for data integrity. > Another topic of interest, also used in my projects, is the calculation of > the Dn/Ds rate (non-synonymous / synonymous mutations * non-synonymous / > synonymous loci) using the most popular models (Nei-Gojobori, Li, etc). It
is very useful, as can be seen from its widespread use in papers ( > http://code.google.com/p/kaks-calculator/wiki/Citations) > Similar projects: > https://github.com/tanghaibao/bio-pipeline/tree/master/synonymous_calculation/ > http://www.bork.embl.de/pal2nal/ > http://cran.r-project.org/web/packages/seqinr/index.html > http://services.cbu.uib.no/tools/kaks > http://code.google.com/p/kaks-calculator/ > Thanks for your input, Igor > > Date: Sun, 1 Apr 2012 09:51:12 +0100 > > Subject: Re: [Biopython] Back translation support in Biopython > > From: p.j.a.cock at googlemail.com > > To: igorrcosta at hotmail.com > > CC: biopython at lists.open-bio.org > > > > On Sun, Apr 1, 2012 at 4:04 AM, Igor Rodrigues da Costa > > wrote: > > > > > > Hi, > > > I am interested in participating in GSoC this summer. I would > > > like to know if there is community support for a new project: > > > Extending the Seq class to add support for back translation of > > > proteins (something like this: http://www.bork.embl.de/pal2nal/ ). > > > If this project isn't strong enough on its own, it could be added > > > to any existing project, or it could be complemented with other > > > suggestions from the community. > > > Thanks for your attention, Igor > > > > Hi Igor, > > > > I don't think back translation in itself is nearly enough to be a > > GSoC project. It is also problematic - we had a good long > > discussion about back translation, and what it might be useful > > for, back in 2008. In particular, assuming back translation to > > a simple nucleotide sequence (as a string or Seq object), > > what would it actually be useful for? > > > > See http://bugzilla.open-bio.org/show_bug.cgi?id=2618 which > > is now using https://redmine.open-bio.org/issues/2618 and > > the quite long and at times confusing thread: > > http://lists.open-bio.org/pipermail/biopython/2008-October/004588.html > > > > Did you have any other ideas or topics that interested you? > > > > Regards, > > > > Peter > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Wed Apr 4 15:02:41 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 4 Apr 2012 16:02:41 +0100 Subject: [Biopython] Back translation support in Biopython In-Reply-To: References: Message-ID: On Wed, Apr 4, 2012 at 2:49 AM, Eric Talevich wrote: > Hi Igor, > > It sounds like you're referring to aligning amino acid sequences to codon > sequences, as PAL2NAL does. This is different from what most people mean by > back translation, but as you point out, certainly useful. > > If you write a function that can match a protein sequence alignment to a set > of raw CDS sequences, returning a nucleotide alignment based on the > codon-to-amino-acid mapping, that would be useful. However, PAL2NAL does > exactly that, plus a bit more, and is a fairly well-known and easily > obtained program. Personally, I would prefer to write a wrapper for PAL2NAL > under Bio.Align.Applications, using the existing Bio.Applications framework. As per the old thread, a simple function in Python taking the gapped protein sequence, original nucleotide coding sequence, and the translation table does sound useful. Then using that, you could go from a protein alignment plus the original nucleotide coding sequences to a codon alignment, or other tasks.
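[For illustration, a minimal sketch of the kind of helper being described - an editorial example only, not an existing Biopython function; gapped_back_translate is a made-up name, and it assumes the ungapped protein really is the translation of the given CDS under the chosen table:

from Bio.Seq import Seq

def gapped_back_translate(gapped_protein, cds, gap_char="-", table=1):
    """Thread the original coding sequence through a gapped protein.

    One amino acid column becomes one codon; one gap column becomes
    three gap characters, giving one row of a codon-level alignment.
    """
    codons = []
    pos = 0
    for aa in str(gapped_protein):
        if aa == gap_char:
            # a gap column in the protein = three gap columns in the DNA
            codons.append(gap_char * 3)
        else:
            codon = str(cds)[pos:pos + 3]
            # sanity check that the protein and CDS really agree here
            assert str(Seq(codon).translate(table=table)) == aa, \
                "Mismatch at CDS position %i" % pos
            codons.append(codon)
            pos += 3
    return "".join(codons)

Applying something like this to each row of a protein alignment, each with its own CDS, would give the codon alignment Eric mentions, ready for dN/dS tools.]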
Given this is all relatively straightforward string manipulation and we already have the required genetic code tables in Biopython, I'm not convinced that wrapping PAL2NAL would be the best solution (for this sub task). > Once the user has a codon alignment, dn/ds and many other calculations based > on evolutionary models can be performed with our PAML wrappers, under > Bio.Phylo.PAML. I agree there is room in Biopython to make this workflow > easier to perform. (Although I wouldn't be able to mentor such a project > under GSoC this year.) Doing some of the calculations directly within Biopython could be interesting and useful - although calling PAML is a very pragmatic solution too. I'm not sure you have enough work here to justify a GSoC project, but the timing is also rather tight to find a suitable mentor. Maybe next year? However, you can still start contributing to Biopython now - and such involvement would be viewed positively on a future GSoC application (not just with us, but for other participating projects - being able to show past contributions to open source projects is good). Regards, Peter From alfonso.esposito1983 at hotmail.it Wed Apr 4 15:27:57 2012 From: alfonso.esposito1983 at hotmail.it (fonz esposito) Date: Wed, 4 Apr 2012 17:27:57 +0200 Subject: [Biopython] Blast defaults Message-ID: Hello everybody, I guess I am not the first one coming out with this question, but: I have problems because the NCBIWWW.qblast function does not give the exact same result as the online BLAST. I use it as the tutorial says, but I don't know how to change the parameters to the ones that the online web BLAST has as defaults... Does someone know which parameters I should change? Thanks in advance From p.j.a.cock at googlemail.com Wed Apr 4 15:39:25 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 4 Apr 2012 16:39:25 +0100 Subject: [Biopython] Blast defaults In-Reply-To: References: Message-ID: On Wed, Apr 4, 2012 at 4:27 PM, fonz esposito wrote: > > Hello everybody, > > I guess I am not the first one coming out with this question, but: I > have problems because the NCBIWWW.qblast function does not > give the exact same result as the online BLAST. I use it as the tutorial > says, but I don't know how to change the parameters to the ones > that the online web BLAST has as defaults... Does someone know > which parameters I should change? Check the gap parameters first, but you're going to have to compare them all to be sure - the NCBI website does some quite clever auto-selection these days. Peter From tturne18 at jhmi.edu Wed Apr 4 16:55:20 2012 From: tturne18 at jhmi.edu (Tychele Turner) Date: Wed, 4 Apr 2012 16:55:20 +0000 Subject: [Biopython] biopython question Message-ID: <22450DD328862542A918A3BC491F263B561ACC@SN2PRD0102MB141.prod.exchangelabs.com> Hi, I have a question regarding one of the biopython capabilities. I would like to trim primers off the end of reads in a fastq file and I found wonderful documentation of how to do this on your website as follows:

from Bio import SeqIO
def trim_primers(records, primer):
    """Removes perfect primer sequences at start of reads.

    This is a generator function, the records argument should
    be a list or iterator returning SeqRecord objects.
""" len_primer = len(primer) #cache this for later for record in records: if record.seq.startswith(primer): yield record[len_primer:] else: yield record original_reads = SeqIO.parse("SRR020192.fastq", "fastq") trimmed_reads = trim_primers(original_reads, "GATGACGGTGT") count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") print "Saved %i reads" % count My question is: Is there a way to loop through a primer file for instance instead of looking for only 'GATGACGGTGT' every primer would be checked and subsequently removed from the start of its respective read. Primer file structured as: GATGACGGTGT GATGACGGTGA GATGACGGCCT If you have any suggestions it would be greatly appreciated. Thanks. Tychele From w.arindrarto at gmail.com Wed Apr 4 18:05:54 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 4 Apr 2012 20:05:54 +0200 Subject: [Biopython] biopython question In-Reply-To: <22450DD328862542A918A3BC491F263B561ACC@SN2PRD0102MB141.prod.exchangelabs.com> References: <22450DD328862542A918A3BC491F263B561ACC@SN2PRD0102MB141.prod.exchangelabs.com> Message-ID: Hi Tychele, If I understood correctly, you have a list of primers stored in a file and you want to trim those primer sequences off your fastq sequences, correct? One way I could think of is to first store the primers in a list (since they will be used repeatedly to check every single fastq sequence). Here's the code: from Bio import SeqIO def trim_primers(records, 'primer_file_name'): # read the primers primer_list = [] with open('primer_file_name', 'r') as source: for line in source: primer_list.append(line.strip()) ? ?for record in records: # list to check if the sequence begins with any of the primers check = [record.seq.startswith(x) for x in primer_list] # if any of the primer is present in the beginning of the sequence, then we trim it off if any(check): # get index of primer that matches the beginning idx = check.index(True) len_primer = len(primer_list[idx]) yield record[len_primer:] # otherwise just return the whole record ? ? ? ?else: ? ? ? ? ? yield record and then, you can use the function like so: original_reads = SeqIO.parse("SRR020192.fastq", "fastq") trimmed_reads = trim_primers(original_reads, 'primer_file_name') count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") print "Saved %i reads" % count I haven't tested the function, but I suppose that should do the trick. Hope that helps :), Bow On Wed, Apr 4, 2012 at 18:55, Tychele Turner wrote: > Hi, > > I have a question regarding one of the biopython capabilities. I would like to trim primers off the end of reads in a fastq file and I found wonderful documentation of how to do this on your website as follows: > > from Bio import SeqIO > def trim_primers(records, primer): > ? ?"""Removes perfect primer sequences at start of reads. > > ? ?This is a generator function, the records argument should > ? ?be a list or iterator returning SeqRecord objects. > ? ?""" > ? ?len_primer = len(primer) #cache this for later > ? ?for record in records: > ? ? ? ?if record.seq.startswith(primer): > ? ? ? ? ? ?yield record[len_primer:] > ? ? ? ?else: > ? ? ? ? ? 
>             yield record
>
> original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
> trimmed_reads = trim_primers(original_reads, "GATGACGGTGT")
> count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
> print "Saved %i reads" % count
>
> My question is: Is there a way to loop through a primer file, for instance, so that instead of looking for only 'GATGACGGTGT', every primer would be checked and subsequently removed from the start of its respective read.
>
> Primer file structured as:
> GATGACGGTGT
> GATGACGGTGA
> GATGACGGCCT
>
> If you have any suggestions it would be greatly appreciated. Thanks.
>
> Tychele
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From ferreirafm at usp.br Wed Apr 4 18:56:08 2012 From: ferreirafm at usp.br (ferreirafm at usp.br) Date: Wed, 04 Apr 2012 15:56:08 -0300 Subject: [Biopython] random peptide sequences Message-ID: <20120404155608.14619h2pl6oj4y88@webmail.usp.br> Dear BioPython List, I want to write a python script to generate random peptide sequences. I have a scratch in my mind, however, I'm not sure how to deal with the data itself (like: use Seq or MutableSeq?). The problem is as follows: I have a list of 20 string peptides which I join to produce a sequence. I want to generate 1000+ sequences keeping peptide1 (pep1) in a fixed position (p1) and randomly permuting (without repetition) the remaining 19 peptides in the remaining 19 positions. Repeat the first step keeping pep2 in a fixed position p2 to generate 1000 more peptide sequences. And repeat this step again and again for all of the peptides & positions. At the end, I'm going to run a function with each one of the peptide sequences, getting a binary result like "positive" or "negative". What I have in mind is to randomly generate 1000 peptide sequences of 19 peptides, insert pep1 at position p1 in all of them; generate 1000 more peptide sequences of 19 peptides again and insert pep2 at position p2 in all of them; and so on... At the end, I'm going to run the function for each of the sequences and store results in a dict where the value is the binary result. Well, where does Biopython come in? I'm completely new to Biopython and would like to use it to solve the problem described. So, I'm writing to ask you guys for some tips and advice on using Biopython resources as much as possible. Any help is appreciated. All the Best, Fred From p.j.a.cock at googlemail.com Wed Apr 4 19:34:53 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 4 Apr 2012 20:34:53 +0100 Subject: [Biopython] random peptide sequences In-Reply-To: <20120404155608.14619h2pl6oj4y88@webmail.usp.br> References: <20120404155608.14619h2pl6oj4y88@webmail.usp.br> Message-ID: On Wed, Apr 4, 2012 at 7:56 PM, wrote: > Dear BioPython List, > I want to write a python script to generate random peptide sequences. I have > a scratch in my mind, however, I'm not sure how to deal with the data itself > (like: use Seq or MutableSeq?). I would use a Seq object - once generated your random sequence won't change, so there is no need for the MutableSeq object. > ... At the end, I'm going to run the > function for each of the sequences and store results in a dict where the value > is the binary result. It sounds like a large dataset of 1000s of random sequences will be created - you probably don't want to do that all in memory. I would generate the random records one by one and write them to a FASTA file.
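[A minimal sketch of that approach - an illustration only: random_records is a made-up helper, and the GPGPG spacer and peptide strings are just stand-ins taken from this thread:

import random
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

def random_records(peptides, count, spacer="GPGPG"):
    """Yield one shuffled peptide concatenation at a time."""
    for i in range(count):
        shuffled = list(peptides)  # work on a copy, keep the input order intact
        random.shuffle(shuffled)
        yield SeqRecord(Seq(spacer.join(shuffled)),
                        id="random%i" % i, description="")

peptides = ["EELRSLYNTVATLYCVH", "RDLLLIVTRIVELLGR", "ALFYKLDVVPID"]
count = SeqIO.write(random_records(peptides, 1000), "random.fasta", "fasta")
print "Saved %i sequences" % count

Because SeqIO.write accepts an iterator, only one record is held in memory at a time.]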
Then loop over the FASTA file and apply your binary test. An advantage of this split is you have broken the task in two - you can get the random sequence generator working and checked separately from writing and testing the classifier. [I am assuming you want to get out of this a table of some kind linking random sequences to binary classifier results] Peter From chris.mit7 at gmail.com Wed Apr 4 20:41:53 2012 From: chris.mit7 at gmail.com (Chris Mitchell) Date: Wed, 4 Apr 2012 16:41:53 -0400 Subject: [Biopython] GSOC Genome Variants proposal In-Reply-To: <87ty13te2e.fsf@fastmail.fm> References: <87ty13te2e.fsf@fastmail.fm> Message-ID: I put some more updates on it. I'll have it finished by the end of Thursday, but any comments on my changes are appreciated. I expanded on my timeline and just need to fill in Weeks 6-11. On Sun, Apr 1, 2012 at 4:03 PM, Brad Chapman wrote: > > Chris; > Thanks for putting this together: that's a great start. I left specific > suggestions as comments in the document but in general the next step is > to expand your timeline to be increasingly specific about your plans. It > sounds like you have a good handle on the overview, now the full nitty > gritty details start. > > Brad > > > Hey everyone, > > > > Here's a draft of my proposal: > > > > > https://docs.google.com/document/d/1DNm8NQmnP4fH8KvF9v107mo4__V-FqWtx8Vgsz5yNns/edit > > > > I've allowed comments to be put in. Please tear it to shreds :). > > > > Thanks, > > Chris > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython From w.arindrarto at gmail.com Wed Apr 4 22:26:29 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Thu, 5 Apr 2012 00:26:29 +0200 Subject: [Biopython] biopython question In-Reply-To: <22450DD328862542A918A3BC491F263B561B7B@SN2PRD0102MB141.prod.exchangelabs.com> References: <22450DD328862542A918A3BC491F263B561ACC@SN2PRD0102MB141.prod.exchangelabs.com> <22450DD328862542A918A3BC491F263B561B7B@SN2PRD0102MB141.prod.exchangelabs.com> Message-ID: Hi Tychele, Glad to hear that and thanks for attaching the code as well :). Just one more heads up on the code: the trimming function assumes that any record sequence has at most one matching primer sequence. If by any random chance a sequence begins with two or more primer sequences, then it will only trim the first primer sequence. So if you still see some primer sequences left in the trimmed sequences, this could be the case and you'll need to modify the code. However, that seems unlikely ~ the current code should suffice. cheers, Bow On Thu, Apr 5, 2012 at 00:12, Tychele Turner wrote: > Hi Bow, > > Thank you! This works great. I have attached the final code to the email in case it may benefit others. > > Tychele > > > ________________________________________ > From: Wibowo Arindrarto [w.arindrarto at gmail.com] > Sent: Wednesday, April 04, 2012 2:05 PM > To: Tychele Turner > Cc: biopython at biopython.org > Subject: Re: [Biopython] biopython question > > Hi Tychele, > > If I understood correctly, you have a list of primers stored in a file > and you want to trim those primer sequences off your fastq sequences, > correct? One way I could think of is to first store the primers in a > list (since they will be used repeatedly to check every single fastq > sequence). > > Here's the code: > > from Bio import SeqIO > > def trim_primers(records, primer_file_name):
>
>     # read the primers
> primer_list = [] > with open(primer_file, 'r') as source: > for line in source: > primer_list.append(line.strip()) > > for record in records: > # list to check if the sequence begins with any of the primers > check = [record.seq.startswith(x) for x in primer_list] > # if any of the primer is present in the beginning of the sequence, then we trim it off > if any(check): > # get index of primer that matches the beginning > idx = check.index(True) > len_primer = len(primer_list[idx]) > yield record[len_primer:] > # otherwise just return the whole record > else: > yield record > > and then, you can use the function like so: > > original_reads = SeqIO.parse("SRR020192.fastq", "fastq") > trimmed_reads = trim_primers(original_reads, 'primer_file_name') > count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") > print "Saved %i reads" % count > > I haven't tested the function, but I suppose that should do the trick. > > Hope that helps :), > Bow > > > On Wed, Apr 4, 2012 at 18:55, Tychele Turner wrote: >> Hi, >> >> I have a question regarding one of the biopython capabilities. I would like to trim primers off the end of reads in a fastq file and I found wonderful documentation of how to do this on your website as follows: >> >> from Bio import SeqIO >> def trim_primers(records, primer): >> """Removes perfect primer sequences at start of reads. >> >> This is a generator function, the records argument should >> be a list or iterator returning SeqRecord objects. >> """ >> len_primer = len(primer) #cache this for later >> for record in records: >> if record.seq.startswith(primer): >> yield record[len_primer:] >> else: >> yield record >> >> original_reads = SeqIO.parse("SRR020192.fastq", "fastq") >> trimmed_reads = trim_primers(original_reads, "GATGACGGTGT") >> count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") >> print "Saved %i reads" % count >> >> >> >> >> My question is: Is there a way to loop through a primer file for instance instead of looking for only >> >> 'GATGACGGTGT' every primer would be checked and subsequently removed from the start of its respective read. >> >> Primer file structured as: >> GATGACGGTGT >> GATGACGGTGA >> GATGACGGCCT >> >> If you have any suggestions it would be greatly appreciated. Thanks. >> >> Tychele >> >> >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > From ferreirafm at usp.br Thu Apr 5 00:01:42 2012 From: ferreirafm at usp.br (ferreirafm at usp.br) Date: Wed, 04 Apr 2012 21:01:42 -0300 Subject: [Biopython] random peptide sequences In-Reply-To: References: <20120404155608.14619h2pl6oj4y88@webmail.usp.br> <20120404165644.15233099ve7c2pvg@webmail.usp.br> Message-ID: <20120404210142.95126nwmeotjeg9y@webmail.usp.br> Hi Peter, It seems I'm getting there, but I can't write the records to file using SeqIO.write as usual.
Fred

code:

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
import random
import sys

def random_seq(fastafile):
    records = []
    query = SeqIO.read(fastafile, "fasta")
    peplist = str(query.seq).split('GPGPG')
    peptup = tuple(str(query.seq).split('GPGPG'))
    for pep in peptup:
        outf = open("test.fasta", "w")
        peplist.remove(pep)
        for k in range(10):
            random.shuffle(peplist, random.random)
            peplist.insert(0, pep)
            rec = SeqRecord('GPGPG'.join(peplist), id="pep%s" % k)
            records.append(rec)
            print 'id: %s\nSeq: %s\n' % (rec.id, rec.seq)
            peplist.remove(pep)
        print records
        SeqIO.write(records, outf, "fasta")
        outf.close()
        sys.exit(1)

output:

$ random_pep.py --run br18.fasta
id: pep0
Seq: EELRSLYNTVATLYCVHGPGPGRDLLLIVTRIVELLGRGPGPGKRWIILGLNKIVRMYSPTSIGPGPGNTSYRLISCNTSVIGPGPGGKIILVAVHVASGYIGPGPGALFYKLDVVPIDGPGPGQRPLVTIKIGGQLKEGPGPGQQLLFIHFRIGCRHSRIGGPGPGELLKTVRLIKFLYQSNPGPGPGTPVNIIGRNLLTQIGGPGPGSPEVIPMFSALSEGPGPGSELYLYKVVKIEPLGVAPGPGPGSLQYLALVALVAPKKGPGPGVLAIVALVVATIIAIGPGPGTMLLGMLMICSAAGPGPGVLEWRFDSRLAFHHVGPGPGDKELYPLASLRSLFGGPGPGEAIIRILQQLLFIHF

id: pep1
Seq: EELRSLYNTVATLYCVHGPGPGRDLLLIVTRIVELLGRGPGPGALFYKLDVVPIDGPGPGSELYLYKVVKIEPLGVAPGPGPGKRWIILGLNKIVRMYSPTSIGPGPGVLAIVALVVATIIAIGPGPGQRPLVTIKIGGQLKEGPGPGELLKTVRLIKFLYQSNPGPGPGSLQYLALVALVAPKKGPGPGEAIIRILQQLLFIHFGPGPGVLEWRFDSRLAFHHVGPGPGTMLLGMLMICSAAGPGPGQQLLFIHFRIGCRHSRIGGPGPGNTSYRLISCNTSVIGPGPGSPEVIPMFSALSEGPGPGDKELYPLASLRSLFGGPGPGGKIILVAVHVASGYIGPGPGTPVNIIGRNLLTQIG

id: pep2
Seq: EELRSLYNTVATLYCVHGPGPGVLAIVALVVATIIAIGPGPGSLQYLALVALVAPKKGPGPGQRPLVTIKIGGQLKEGPGPGDKELYPLASLRSLFGGPGPGSPEVIPMFSALSEGPGPGALFYKLDVVPIDGPGPGEAIIRILQQLLFIHFGPGPGKRWIILGLNKIVRMYSPTSIGPGPGGKIILVAVHVASGYIGPGPGELLKTVRLIKFLYQSNPGPGPGSELYLYKVVKIEPLGVAPGPGPGTPVNIIGRNLLTQIGGPGPGTMLLGMLMICSAAGPGPGRDLLLIVTRIVELLGRGPGPGVLEWRFDSRLAFHHVGPGPGQQLLFIHFRIGCRHSRIGGPGPGNTSYRLISCNTSVI

id: pep3
Seq: EELRSLYNTVATLYCVHGPGPGGKIILVAVHVASGYIGPGPGTPVNIIGRNLLTQIGGPGPGTMLLGMLMICSAAGPGPGEAIIRILQQLLFIHFGPGPGQQLLFIHFRIGCRHSRIGGPGPGKRWIILGLNKIVRMYSPTSIGPGPGSELYLYKVVKIEPLGVAPGPGPGSLQYLALVALVAPKKGPGPGQRPLVTIKIGGQLKEGPGPGRDLLLIVTRIVELLGRGPGPGELLKTVRLIKFLYQSNPGPGPGSPEVIPMFSALSEGPGPGALFYKLDVVPIDGPGPGVLAIVALVVATIIAIGPGPGVLEWRFDSRLAFHHVGPGPGDKELYPLASLRSLFGGPGPGNTSYRLISCNTSVI

id: pep4
Seq: EELRSLYNTVATLYCVHGPGPGVLAIVALVVATIIAIGPGPGVLEWRFDSRLAFHHVGPGPGKRWIILGLNKIVRMYSPTSIGPGPGQRPLVTIKIGGQLKEGPGPGELLKTVRLIKFLYQSNPGPGPGSELYLYKVVKIEPLGVAPGPGPGALFYKLDVVPIDGPGPGSPEVIPMFSALSEGPGPGNTSYRLISCNTSVIGPGPGSLQYLALVALVAPKKGPGPGTPVNIIGRNLLTQIGGPGPGEAIIRILQQLLFIHFGPGPGQQLLFIHFRIGCRHSRIGGPGPGTMLLGMLMICSAAGPGPGRDLLLIVTRIVELLGRGPGPGDKELYPLASLRSLFGGPGPGGKIILVAVHVASGYI

id: pep5
Seq: EELRSLYNTVATLYCVHGPGPGSPEVIPMFSALSEGPGPGEAIIRILQQLLFIHFGPGPGTMLLGMLMICSAAGPGPGQQLLFIHFRIGCRHSRIGGPGPGELLKTVRLIKFLYQSNPGPGPGVLEWRFDSRLAFHHVGPGPGQRPLVTIKIGGQLKEGPGPGKRWIILGLNKIVRMYSPTSIGPGPGGKIILVAVHVASGYIGPGPGDKELYPLASLRSLFGGPGPGSLQYLALVALVAPKKGPGPGNTSYRLISCNTSVIGPGPGTPVNIIGRNLLTQIGGPGPGVLAIVALVVATIIAIGPGPGSELYLYKVVKIEPLGVAPGPGPGALFYKLDVVPIDGPGPGRDLLLIVTRIVELLGR

id: pep6
Seq: EELRSLYNTVATLYCVHGPGPGTPVNIIGRNLLTQIGGPGPGSPEVIPMFSALSEGPGPGVLEWRFDSRLAFHHVGPGPGTMLLGMLMICSAAGPGPGALFYKLDVVPIDGPGPGKRWIILGLNKIVRMYSPTSIGPGPGDKELYPLASLRSLFGGPGPGSELYLYKVVKIEPLGVAPGPGPGRDLLLIVTRIVELLGRGPGPGELLKTVRLIKFLYQSNPGPGPGNTSYRLISCNTSVIGPGPGVLAIVALVVATIIAIGPGPGEAIIRILQQLLFIHFGPGPGGKIILVAVHVASGYIGPGPGQQLLFIHFRIGCRHSRIGGPGPGQRPLVTIKIGGQLKEGPGPGSLQYLALVALVAPKK

id: pep7
Seq:
EELRSLYNTVATLYCVHGPGPGTPVNIIGRNLLTQIGGPGPGTMLLGMLMICSAAGPGPGSLQYLALVALVAPKKGPGPGEAIIRILQQLLFIHFGPGPGSELYLYKVVKIEPLGVAPGPGPGVLAIVALVVATIIAIGPGPGELLKTVRLIKFLYQSNPGPGPGVLEWRFDSRLAFHHVGPGPGQQLLFIHFRIGCRHSRIGGPGPGSPEVIPMFSALSEGPGPGKRWIILGLNKIVRMYSPTSIGPGPGNTSYRLISCNTSVIGPGPGRDLLLIVTRIVELLGRGPGPGALFYKLDVVPIDGPGPGQRPLVTIKIGGQLKEGPGPGDKELYPLASLRSLFGGPGPGGKIILVAVHVASGYI id: pep8 Seq: EELRSLYNTVATLYCVHGPGPGTPVNIIGRNLLTQIGGPGPGRDLLLIVTRIVELLGRGPGPGNTSYRLISCNTSVIGPGPGELLKTVRLIKFLYQSNPGPGPGSPEVIPMFSALSEGPGPGVLEWRFDSRLAFHHVGPGPGEAIIRILQQLLFIHFGPGPGDKELYPLASLRSLFGGPGPGALFYKLDVVPIDGPGPGQQLLFIHFRIGCRHSRIGGPGPGQRPLVTIKIGGQLKEGPGPGSLQYLALVALVAPKKGPGPGGKIILVAVHVASGYIGPGPGTMLLGMLMICSAAGPGPGSELYLYKVVKIEPLGVAPGPGPGVLAIVALVVATIIAIGPGPGKRWIILGLNKIVRMYSPTSI id: pep9 Seq: EELRSLYNTVATLYCVHGPGPGVLEWRFDSRLAFHHVGPGPGSPEVIPMFSALSEGPGPGVLAIVALVVATIIAIGPGPGTPVNIIGRNLLTQIGGPGPGDKELYPLASLRSLFGGPGPGSELYLYKVVKIEPLGVAPGPGPGNTSYRLISCNTSVIGPGPGTMLLGMLMICSAAGPGPGGKIILVAVHVASGYIGPGPGALFYKLDVVPIDGPGPGQRPLVTIKIGGQLKEGPGPGKRWIILGLNKIVRMYSPTSIGPGPGRDLLLIVTRIVELLGRGPGPGEAIIRILQQLLFIHFGPGPGELLKTVRLIKFLYQSNPGPGPGSLQYLALVALVAPKKGPGPGQQLLFIHFRIGCRHSRIG [SeqRecord(seq='EELRSLYNTVATLYCVHGPGPGRDLLLIVTRIVELLGRGPGPGKRWIILGLNKIVRMYSPTSIGPGPGNTSYRLISCNTSVIGPGPGGKIILVAVHVASGYIGPGPGALFYKLDVVPIDGPGPGQRPLVTIKIGGQLKEGPGPGQQLLFIHFRIGCRHSRIGGPGPGELLKTVRLIKFLYQSNPGPGPGTPVNIIGRNLLTQIGGPGPGSPEVIPMFSALSEGPGPGSELYLYKVVKIEPLGVAPGPGPGSLQYLALVALVAPKKGPGPGVLAIVALVVATIIAIGPGPGTMLLGMLMICSAAGPGPGVLEWRFDSRLAFHHVGPGPGDKELYPLASLRSLFGGPGPGEAIIRILQQLLFIHF', id='pep0', name='', description='', dbxrefs=[]), SeqRecord(seq='EELRSLYNTVATLYCVHGPGPGRDLLLIVTRIVELLGRGPGPGALFYKLDVVPIDGPGPGSELYLYKVVKIEPLGVAPGPGPGKRWIILGLNKIVRMYSPTSIGPGPGVLAIVALVVATIIAIGPGPGQRPLVTIKIGGQLKEGPGPGELLKTVRLIKFLYQSNPGPGPGSLQYLALVALVAPKKGPGPGEAIIRILQQLLFIHFGPGPGVLEWRFDSRLAFHHVGPGPGTMLLGMLMICSAAGPGPGQQLLFIHFRIGCRHSRIGGPGPGNTSYRLISCNTSVIGPGPGSPEVIPMFSALSEGPGPGDKELYPLASLRSLFGGPGPGGKIILVAVHVASGYIGPGPGTPVNIIGRNLLTQIG', id='pep1', name='', description='', dbxrefs=[]), SeqRecord(seq='EELRSLYNTVATLYCVHGPGPGVLAIVALVVATIIAIGPGPGSLQYLALVALVAPKKGPGPGQRPLVTIKIGGQLKEGPGPGDKELYPLASLRSLFGGPGPGSPEVIPMFSALSEGPGPGALFYKLDVVPIDGPGPGEAIIRILQQLLFIHFGPGPGKRWIILGLNKIVRMYSPTSIGPGPGGKIILVAVHVASGYIGPGPGELLKTVRLIKFLYQSNPGPGPGSELYLYKVVKIEPLGVAPGPGPGTPVNIIGRNLLTQIGGPGPGTMLLGMLMICSAAGPGPGRDLLLIVTRIVELLGRGPGPGVLEWRFDSRLAFHHVGPGPGQQLLFIHFRIGCRHSRIGGPGPGNTSYRLISCNTSVI', id='pep2', name='', description='', dbxrefs=[]), SeqRecord(seq='EELRSLYNTVATLYCVHGPGPGGKIILVAVHVASGYIGPGPGTPVNIIGRNLLTQIGGPGPGTMLLGMLMICSAAGPGPGEAIIRILQQLLFIHFGPGPGQQLLFIHFRIGCRHSRIGGPGPGKRWIILGLNKIVRMYSPTSIGPGPGSELYLYKVVKIEPLGVAPGPGPGSLQYLALVALVAPKKGPGPGQRPLVTIKIGGQLKEGPGPGRDLLLIVTRIVELLGRGPGPGELLKTVRLIKFLYQSNPGPGPGSPEVIPMFSALSEGPGPGALFYKLDVVPIDGPGPGVLAIVALVVATIIAIGPGPGVLEWRFDSRLAFHHVGPGPGDKELYPLASLRSLFGGPGPGNTSYRLISCNTSVI', id='pep3', name='', description='', dbxrefs=[]), SeqRecord(seq='EELRSLYNTVATLYCVHGPGPGVLAIVALVVATIIAIGPGPGVLEWRFDSRLAFHHVGPGPGKRWIILGLNKIVRMYSPTSIGPGPGQRPLVTIKIGGQLKEGPGPGELLKTVRLIKFLYQSNPGPGPGSELYLYKVVKIEPLGVAPGPGPGALFYKLDVVPIDGPGPGSPEVIPMFSALSEGPGPGNTSYRLISCNTSVIGPGPGSLQYLALVALVAPKKGPGPGTPVNIIGRNLLTQIGGPGPGEAIIRILQQLLFIHFGPGPGQQLLFIHFRIGCRHSRIGGPGPGTMLLGMLMICSAAGPGPGRDLLLIVTRIVELLGRGPGPGDKELYPLASLRSLFGGPGPGGKIILVAVHVASGYI', id='pep4', name='', description='', dbxrefs=[]), 
SeqRecord(seq='EELRSLYNTVATLYCVHGPGPGSPEVIPMFSALSEGPGPGEAIIRILQQLLFIHFGPGPGTMLLGMLMICSAAGPGPGQQLLFIHFRIGCRHSRIGGPGPGELLKTVRLIKFLYQSNPGPGPGVLEWRFDSRLAFHHVGPGPGQRPLVTIKIGGQLKEGPGPGKRWIILGLNKIVRMYSPTSIGPGPGGKIILVAVHVASGYIGPGPGDKELYPLASLRSLFGGPGPGSLQYLALVALVAPKKGPGPGNTSYRLISCNTSVIGPGPGTPVNIIGRNLLTQIGGPGPGVLAIVALVVATIIAIGPGPGSELYLYKVVKIEPLGVAPGPGPGALFYKLDVVPIDGPGPGRDLLLIVTRIVELLGR', id='pep5', name='', description='', dbxrefs=[]), SeqRecord(seq='EELRSLYNTVATLYCVHGPGPGTPVNIIGRNLLTQIGGPGPGSPEVIPMFSALSEGPGPGVLEWRFDSRLAFHHVGPGPGTMLLGMLMICSAAGPGPGALFYKLDVVPIDGPGPGKRWIILGLNKIVRMYSPTSIGPGPGDKELYPLASLRSLFGGPGPGSELYLYKVVKIEPLGVAPGPGPGRDLLLIVTRIVELLGRGPGPGELLKTVRLIKFLYQSNPGPGPGNTSYRLISCNTSVIGPGPGVLAIVALVVATIIAIGPGPGEAIIRILQQLLFIHFGPGPGGKIILVAVHVASGYIGPGPGQQLLFIHFRIGCRHSRIGGPGPGQRPLVTIKIGGQLKEGPGPGSLQYLALVALVAPKK', id='pep6', name='', description='', dbxrefs=[]), SeqRecord(seq='EELRSLYNTVATLYCVHGPGPGTPVNIIGRNLLTQIGGPGPGTMLLGMLMICSAAGPGPGSLQYLALVALVAPKKGPGPGEAIIRILQQLLFIHFGPGPGSELYLYKVVKIEPLGVAPGPGPGVLAIVALVVATIIAIGPGPGELLKTVRLIKFLYQSNPGPGPGVLEWRFDSRLAFHHVGPGPGQQLLFIHFRIGCRHSRIGGPGPGSPEVIPMFSALSEGPGPGKRWIILGLNKIVRMYSPTSIGPGPGNTSYRLISCNTSVIGPGPGRDLLLIVTRIVELLGRGPGPGALFYKLDVVPIDGPGPGQRPLVTIKIGGQLKEGPGPGDKELYPLASLRSLFGGPGPGGKIILVAVHVASGYI', id='pep7', name='', description='', dbxrefs=[]), SeqRecord(seq='EELRSLYNTVATLYCVHGPGPGTPVNIIGRNLLTQIGGPGPGRDLLLIVTRIVELLGRGPGPGNTSYRLISCNTSVIGPGPGELLKTVRLIKFLYQSNPGPGPGSPEVIPMFSALSEGPGPGVLEWRFDSRLAFHHVGPGPGEAIIRILQQLLFIHFGPGPGDKELYPLASLRSLFGGPGPGALFYKLDVVPIDGPGPGQQLLFIHFRIGCRHSRIGGPGPGQRPLVTIKIGGQLKEGPGPGSLQYLALVALVAPKKGPGPGGKIILVAVHVASGYIGPGPGTMLLGMLMICSAAGPGPGSELYLYKVVKIEPLGVAPGPGPGVLAIVALVVATIIAIGPGPGKRWIILGLNKIVRMYSPTSI', id='pep8', name='', description='', dbxrefs=[]), SeqRecord(seq='EELRSLYNTVATLYCVHGPGPGVLEWRFDSRLAFHHVGPGPGSPEVIPMFSALSEGPGPGVLAIVALVVATIIAIGPGPGTPVNIIGRNLLTQIGGPGPGDKELYPLASLRSLFGGPGPGSELYLYKVVKIEPLGVAPGPGPGNTSYRLISCNTSVIGPGPGTMLLGMLMICSAAGPGPGGKIILVAVHVASGYIGPGPGALFYKLDVVPIDGPGPGQRPLVTIKIGGQLKEGPGPGKRWIILGLNKIVRMYSPTSIGPGPGRDLLLIVTRIVELLGRGPGPGEAIIRILQQLLFIHFGPGPGELLKTVRLIKFLYQSNPGPGPGSLQYLALVALVAPKKGPGPGQQLLFIHFRIGCRHSRIG', id='pep9', name='', description='', dbxrefs=[])]

Traceback (most recent call last):
  File "/home/ferreirafm/bin/random_pep.py", line 173, in <module>
    main()
  File "/home/ferreirafm/bin/random_pep.py", line 156, in main
    random_seq(fastafile)
  File "/home/ferreirafm/bin/random_pep.py", line 39, in random_seq
    SeqIO.write(records, outf, "fasta")
  File "/usr/lib64/python2.7/site-packages/Bio/SeqIO/__init__.py", line 412, in write
    count = writer_class(handle).write_file(sequences)
  File "/usr/lib64/python2.7/site-packages/Bio/SeqIO/Interfaces.py", line 271, in write_file
    count = self.write_records(records)
  File "/usr/lib64/python2.7/site-packages/Bio/SeqIO/Interfaces.py", line 256, in write_records
    self.write_record(record)
  File "/usr/lib64/python2.7/site-packages/Bio/SeqIO/FastaIO.py", line 136, in write_record
    data = self._get_seq_string(record) #Catches sequence being None
  File "/usr/lib64/python2.7/site-packages/Bio/SeqIO/Interfaces.py", line 164, in _get_seq_string
    % record.id)
TypeError: SeqRecord (id=pep0) has an invalid sequence.

Quoting Peter Cock: > On Wed, Apr 4, 2012 at 8:56 PM, wrote: >> >> Hi Peter, >> Thanks for helping. I'll try something like that and let you know the >> results.
>> Fred > > Good luck - and please reply on the list to let us know how you get on :) > > Peter > From tturne18 at jhmi.edu Wed Apr 4 22:12:14 2012 From: tturne18 at jhmi.edu (Tychele Turner) Date: Wed, 4 Apr 2012 22:12:14 +0000 Subject: [Biopython] biopython question In-Reply-To: References: <22450DD328862542A918A3BC491F263B561ACC@SN2PRD0102MB141.prod.exchangelabs.com>, Message-ID: <22450DD328862542A918A3BC491F263B561B7B@SN2PRD0102MB141.prod.exchangelabs.com> Hi Bow, Thank you! This works great. I have attached the final code to the email in case it may benefit others. Tychele ________________________________________ From: Wibowo Arindrarto [w.arindrarto at gmail.com] Sent: Wednesday, April 04, 2012 2:05 PM To: Tychele Turner Cc: biopython at biopython.org Subject: Re: [Biopython] biopython question Hi Tychele, If I understood correctly, you have a list of primers stored in a file and you want to trim those primer sequences off your fastq sequences, correct? One way I could think of is to first store the primers in a list (since they will be used repeatedly to check every single fastq sequence). Here's the code: from Bio import SeqIO def trim_primers(records, 'primer_file_name'): # read the primers primer_list = [] with open('primer_file_name', 'r') as source: for line in source: primer_list.append(line.strip()) for record in records: # list to check if the sequence begins with any of the primers check = [record.seq.startswith(x) for x in primer_list] # if any of the primer is present in the beginning of the sequence, then we trim it off if any(check): # get index of primer that matches the beginning idx = check.index(True) len_primer = len(primer_list[idx]) yield record[len_primer:] # otherwise just return the whole record else: yield record and then, you can use the function like so: original_reads = SeqIO.parse("SRR020192.fastq", "fastq") trimmed_reads = trim_primers(original_reads, 'primer_file_name') count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") print "Saved %i reads" % count I haven't tested the function, but I suppose that should do the trick. Hope that helps :), Bow On Wed, Apr 4, 2012 at 18:55, Tychele Turner wrote: > Hi, > > I have a question regarding one of the biopython capabilities. I would like to trim primers off the end of reads in a fastq file and I found wonderful documentation of how to do this on your website as follows: > > from Bio import SeqIO > def trim_primers(records, primer): > """Removes perfect primer sequences at start of reads. > > This is a generator function, the records argument should > be a list or iterator returning SeqRecord objects. > """ > len_primer = len(primer) #cache this for later > for record in records: > if record.seq.startswith(primer): > yield record[len_primer:] > else: > yield record > > original_reads = SeqIO.parse("SRR020192.fastq", "fastq") > trimmed_reads = trim_primers(original_reads, "GATGACGGTGT") > count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") > print "Saved %i reads" % count > > > > > My question is: Is there a way to loop through a primer file for instance instead of looking for only > > 'GATGACGGTGT' every primer would be checked and subsequently removed from the start of its respective read. > > Primer file structured as: > GATGACGGTGT > GATGACGGTGA > GATGACGGCCT > > If you have any suggestions it would be greatly appreciated. Thanks. 
> > Tychele > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython -------------- next part -------------- A non-text attachment was scrubbed... Name: testTrimPrimers.py Type: text/x-python-script Size: 1181 bytes Desc: testTrimPrimers.py URL: From chapmanb at 50mail.com Thu Apr 5 11:22:00 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 05 Apr 2012 07:22:00 -0400 Subject: [Biopython] GSOC Genome Variants proposal In-Reply-To: References: <87ty13te2e.fsf@fastmail.fm> Message-ID: <87y5qabezr.fsf@fastmail.fm> Chris; Thanks for the updates, you're putting together a solid proposal. I added a couple of additional comments and pointers which should hopefully be helpful. My other practical suggestion is to include a link to your Google Doc from the official proposal in GSoC Melange. This will allow you to update your proposal in response to any reviewer comments, since Melange doesn't allow edits after Friday. Best of luck with the review process and thanks again for all of the work on the proposal, Brad > I put some more updates on it. I'll have it finished by the end of > Thursday, but any comments on my changes are appreciated. I expanded on my > timeline and just need to fill in Weeks 6-11. > > On Sun, Apr 1, 2012 at 4:03 PM, Brad Chapman wrote: > > > > > Chris; > > Thanks for putting this together: that's a great start. I left specific > > suggestions as comments in the document but in general the next step is > > to expand your timeline to be increasingly specific about your plans. It > > sounds like you have a good handle on the overview, now the full nitty > > gritty details start. > > > > Brad > > > > > Hey everyone, > > > > > > Here's a draft of my proposal: > > > > > > > > https://docs.google.com/document/d/1DNm8NQmnP4fH8KvF9v107mo4__V-FqWtx8Vgsz5yNns/edit > > > > > > I've allowed comments to be put in. Please tear it to shreds :). > > > > > > Thanks, > > > Chris > > > _______________________________________________ > > > Biopython mailing list - Biopython at lists.open-bio.org > > > http://lists.open-bio.org/mailman/listinfo/biopython > > From chaitanya.talnikar at iitb.ac.in Thu Apr 5 21:16:13 2012 From: chaitanya.talnikar at iitb.ac.in (Chaitanya Talnikar) Date: Fri, 6 Apr 2012 02:46:13 +0530 Subject: [Biopython] GSoC project application: Representation and manipulation of genomic variants: Some questions In-Reply-To: References: <87398vmxhj.fsf@fastmail.fm> <874nt9k11m.fsf@fastmail.fm> <87wr626gc5.fsf@fastmail.fm> Message-ID: Hi all, I have modified my proposal based on the comments. I have also updated the proposal on the GSoC website, along with a link to the google doc. Regards, Chaitanya On Wed, Apr 4, 2012 at 6:03 AM, Reece Hart wrote: > On Sun, Apr 1, 2012 at 2:42 AM, Chaitanya Talnikar > wrote: >> >> I have uploaded a second draft incorporating the changes. Please >> provide comments on my proposal. > > > Hi Chaitanya- > > I also read your proposal last night. My comments mostly echo Brad's, > although there are a couple of new ones I think. > > I'll be happy to reread or answer questions as needed. > > -Reece > From zhigang.wu at email.ucr.edu Thu Apr 5 21:49:05 2012 From: zhigang.wu at email.ucr.edu (Zhigang Wu) Date: Thu, 5 Apr 2012 14:49:05 -0700 Subject: [Biopython] Biopython GSoC Proposal In-Reply-To: <87ty166g9c.fsf@fastmail.fm> References: <87ty166g9c.fsf@fastmail.fm> Message-ID: Hi Brad, Thanks for your comments.
I have made substantial modifications to my proposal, which I think is close to submission. You and all others in the community are welcome to make any further comments and suggestions. Regards, Zhigang On Thu, Mar 29, 2012 at 6:15 PM, Brad Chapman wrote: > > Zhigang; > > > Here I am posting my draft of proposal, in which I have proposed to > > implement the SearchIO module. Please follow the link to access it > > > https://docs.google.com/document/d/15fkPAZfN2Ln8nMJr4Ad7lMscaGbKOiTaXcGpxxvIe3A/edit > > Thanks for putting this together. You've got an excellent start. I added > comments in the document on specific areas. Let us know if you have any > questions or need followup on any points. Thanks again, > Brad > From mictadlo at gmail.com Fri Apr 6 02:06:54 2012 From: mictadlo at gmail.com (Mic) Date: Fri, 6 Apr 2012 12:06:54 +1000 Subject: [Biopython] biopython question In-Reply-To: References: <22450DD328862542A918A3BC491F263B561ACC@SN2PRD0102MB141.prod.exchangelabs.com> <22450DD328862542A918A3BC491F263B561B7B@SN2PRD0102MB141.prod.exchangelabs.com> Message-ID: What is the difference to remove primer from the fastq file rather to use markDuplicates http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates on an alignment? Would both ways deliver the same results? Thank you in advance. On Thu, Apr 5, 2012 at 8:26 AM, Wibowo Arindrarto wrote: > Hi Tychele, > > Glad to hear that and thanks for attaching the code as well :). > > Just one more heads up on the code, the trimming function assumes that > for any record sequence, there is only one matching primer sequence at > most. If by any random chance a sequence begins with two or more > primer sequences, then it will only trim the first primer sequence. So > if you still see some primer sequences left in the trimmed sequences, > this could be the case and you'll need to modify the code. > > However, that seems unlikely ~ the current code should suffice. > > cheers, > Bow > > > On Thu, Apr 5, 2012 at 00:12, Tychele Turner wrote: > > Hi Bow, > > > > Thank you! This works great. I have attached the final code to the email > in case it may benefit others. > > > > Tychele > > > > > > ________________________________________ > > From: Wibowo Arindrarto [w.arindrarto at gmail.com] > > Sent: Wednesday, April 04, 2012 2:05 PM > > To: Tychele Turner > > Cc: biopython at biopython.org > > Subject: Re: [Biopython] biopython question > > > > Hi Tychele, > > > > If I understood correctly, you have a list of primers stored in a file > > and you want to trim those primer sequences off your fastq sequences, > > correct? One way I could think of is to first store the primers in a > > list (since they will be used repeatedly to check every single fastq > > sequence).
> > > > Here's the code: > > > > from Bio import SeqIO > > > > def trim_primers(records, 'primer_file_name'): > > > > # read the primers > > primer_list = [] > > with open('primer_file_name', 'r') as source: > > for line in source: > > primer_list.append(line.strip()) > > > > for record in records: > > # list to check if the sequence begins with any of the primers > > check = [record.seq.startswith(x) for x in primer_list] > > # if any of the primer is present in the beginning of the > > sequence, then we trim it off > > if any(check): > > # get index of primer that matches the beginning > > idx = check.index(True) > > len_primer = len(primer_list[idx]) > > yield record[len_primer:] > > # otherwise just return the whole record > > else: > > yield record > > > > and then, you can use the function like so: > > > > original_reads = SeqIO.parse("SRR020192.fastq", "fastq") > > trimmed_reads = trim_primers(original_reads, 'primer_file_name') > > count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") > > print "Saved %i reads" % count > > > > I haven't tested the function, but I suppose that should do the trick. > > > > Hope that helps :), > > Bow > > > > > > On Wed, Apr 4, 2012 at 18:55, Tychele Turner wrote: > >> Hi, > >> > >> I have a question regarding one of the biopython capabilities. I would > like to trim primers off the end of reads in a fastq file and I found > wonderful documentation of how to do this on your website as follows: > >> > >> from Bio import SeqIO > >> def trim_primers(records, primer): > >> """Removes perfect primer sequences at start of reads. > >> > >> This is a generator function, the records argument should > >> be a list or iterator returning SeqRecord objects. > >> """ > >> len_primer = len(primer) #cache this for later > >> for record in records: > >> if record.seq.startswith(primer): > >> yield record[len_primer:] > >> else: > >> yield record > >> > >> original_reads = SeqIO.parse("SRR020192.fastq", "fastq") > >> trimmed_reads = trim_primers(original_reads, "GATGACGGTGT") > >> count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") > >> print "Saved %i reads" % count > >> > >> > >> > >> > >> My question is: Is there a way to loop through a primer file for > instance instead of looking for only > >> > >> 'GATGACGGTGT' every primer would be checked and subsequently removed > from the start of its respective read. > >> > >> Primer file structured as: > >> GATGACGGTGT > >> GATGACGGTGA > >> GATGACGGCCT > >> > >> If you have any suggestions it would be greatly appreciated. Thanks. > >> > >> Tychele > >> > >> > >> _______________________________________________ > >> Biopython mailing list - Biopython at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/biopython > > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From w.arindrarto at gmail.com Fri Apr 6 02:20:09 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Fri, 6 Apr 2012 04:20:09 +0200 Subject: [Biopython] biopython question In-Reply-To: References: <22450DD328862542A918A3BC491F263B561ACC@SN2PRD0102MB141.prod.exchangelabs.com> <22450DD328862542A918A3BC491F263B561B7B@SN2PRD0102MB141.prod.exchangelabs.com> Message-ID: Hi Mic, I'm not familiar with picard, but it seems that this program detects whole duplicate molecules instead of detecting whether a primer is present in sequences (which may or may not be duplicates). 
Plus, it doesn't do any removal ~ it only flags them. So I don't think the two are comparable. cheers, Bow On Fri, Apr 6, 2012 at 04:06, Mic wrote: > What is the difference to remove primer from the fastq file rather to use > markDuplicates http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates on > an alignment? > > Would both ways deliver the same results? > > Thank you in advance. > > > On Thu, Apr 5, 2012 at 8:26 AM, Wibowo Arindrarto wrote: >> >> Hi Tychele, >> >> Glad to hear that and thanks for attaching the code as well :). >> >> Just one more heads up on the code, the trimming function assumes that >> for any record sequence, there is only one matching primer sequence at >> most. If by any random chance a sequence begins with two or more >> primer sequences, then it will only trim the first primer sequence. So >> if you still see some primer sequences left in the trimmed sequences, >> this could be the case and you'll need to modify the code. >> >> However, that seems unlikely ~ the current code should suffice. >> >> cheers, >> Bow >> >> >> On Thu, Apr 5, 2012 at 00:12, Tychele Turner wrote: >> > Hi Bow, >> > >> > Thank you! This works great. I have attached the final code to the email >> > in case it may benefit others. >> > >> > Tychele >> > >> > >> > ________________________________________ >> > From: Wibowo Arindrarto [w.arindrarto at gmail.com] >> > Sent: Wednesday, April 04, 2012 2:05 PM >> > To: Tychele Turner >> > Cc: biopython at biopython.org >> > Subject: Re: [Biopython] biopython question >> > >> > Hi Tychele, >> > >> > If I understood correctly, you have a list of primers stored in a file >> > and you want to trim those primer sequences off your fastq sequences, >> > correct? One way I could think of is to first store the primers in a >> > list (since they will be used repeatedly to check every single fastq >> > sequence). >> > >> > Here's the code: >> > >> > from Bio import SeqIO >> > >> > def trim_primers(records, primer_file): >> > >> > # read the primers >> > primer_list = [] >> > with open(primer_file, 'r') as source: >> > for line in source: >> > primer_list.append(line.strip()) >> > >> > for record in records: >> > # list to check if the sequence begins with any of the primers >> > check = [record.seq.startswith(x) for x in primer_list] >> > # if any of the primer is present in the beginning of the >> > sequence, then we trim it off >> > if any(check): >> > # get index of primer that matches the beginning >> > idx = check.index(True) >> > len_primer = len(primer_list[idx]) >> > yield record[len_primer:] >> > # otherwise just return the whole record >> > else: >> > yield record >> > >> > and then, you can use the function like so: >> > >> > original_reads = SeqIO.parse("SRR020192.fastq", "fastq") >> > trimmed_reads = trim_primers(original_reads, 'primer_file_name') >> > count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") >> > print "Saved %i reads" % count >> > >> > I haven't tested the function, but I suppose that should do the trick. >> > >> > Hope that helps :), >> > Bow >> > >> > >> > On Wed, Apr 4, 2012 at 18:55, Tychele Turner wrote: >> >> Hi, >> >> >> >> I have a question regarding one of the biopython capabilities.
I would >> >> like to trim primers off the end of reads in a fastq file and I found >> >> wonderful documentation of how to do this on your website as follows: >> >> >> >> from Bio import SeqIO >> >> def trim_primers(records, primer): >> >> ? ?"""Removes perfect primer sequences at start of reads. >> >> >> >> ? ?This is a generator function, the records argument should >> >> ? ?be a list or iterator returning SeqRecord objects. >> >> ? ?""" >> >> ? ?len_primer = len(primer) #cache this for later >> >> ? ?for record in records: >> >> ? ? ? ?if record.seq.startswith(primer): >> >> ? ? ? ? ? ?yield record[len_primer:] >> >> ? ? ? ?else: >> >> ? ? ? ? ? ?yield record >> >> >> >> original_reads = SeqIO.parse("SRR020192.fastq", "fastq") >> >> trimmed_reads = trim_primers(original_reads, "GATGACGGTGT") >> >> count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") >> >> print "Saved %i reads" % count >> >> >> >> >> >> >> >> >> >> My question is: Is there a way to loop through a primer file for >> >> instance instead of looking for only >> >> >> >> 'GATGACGGTGT' every primer would be checked and subsequently removed >> >> from the start of its respective read. >> >> >> >> Primer file structured as: >> >> GATGACGGTGT >> >> GATGACGGTGA >> >> GATGACGGCCT >> >> >> >> If you have any suggestions it would be greatly appreciated. Thanks. >> >> >> >> Tychele >> >> >> >> >> >> _______________________________________________ >> >> Biopython mailing list ?- ?Biopython at lists.open-bio.org >> >> http://lists.open-bio.org/mailman/listinfo/biopython >> > >> >> _______________________________________________ >> Biopython mailing list ?- ?Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > From mictadlo at gmail.com Fri Apr 6 09:59:36 2012 From: mictadlo at gmail.com (Mic) Date: Fri, 6 Apr 2012 19:59:36 +1000 Subject: [Biopython] biopython question In-Reply-To: References: <22450DD328862542A918A3BC491F263B561ACC@SN2PRD0102MB141.prod.exchangelabs.com> <22450DD328862542A918A3BC491F263B561B7B@SN2PRD0102MB141.prod.exchangelabs.com> Message-ID: Hi Bow, You can remove duplicates in the input file or create a new output file. With the following commands you create an output file with no duplicates: *$ samtools fixmate t.paired.sorted.bam t.paired.sorted.SamFix.bam * *$ java -Xmx8g -jar MarkDuplicates.jar REMOVE_DUPLICATES=true ASSUME_SORTED=true MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1024 INPUT=t.paired.sorted.bam OUTPUT=t.paired.sorted.rmdulp.bam METRICS_FILE=t.paired.sorted.bam.metrics* Are adapters and fragments the same? I found the following software for adapter: ** TagDust - eliminate artifactual sequence from NGS data* *http://www.biomedcentral.com/1471-2164/12/382* *http://bioinformatics.oxfordjournals.org/content/25/21/2839.full* ** FAR: http://sourceforge.net/apps/mediawiki/theflexibleadap/index.php?* *title=Main_Page* ** Trimmomatic: http://www.usadellab.org/cms/index.php?page=trimmomatic* ** http://code.google.com/p/cutadapt/ * ** https://github.com/vsbuffalo/scythe * ** http://code.google.com/p/biopieces/wiki/find_adaptor* Thank you in advance. Cheers, On Fri, Apr 6, 2012 at 12:20 PM, Wibowo Arindrarto wrote: > Hi Mic, > > I'm not familiar with picard, but it seems that this program detects > whole duplicate molecules instead of detecting whether a primer is > present in sequences (which may or may not be duplicates). Plus, it > doesn't do any removal ~ it only flags them. So I don't think the two > are comparable. 
> > cheers, > Bow > > On Fri, Apr 6, 2012 at 04:06, Mic wrote: > > What is the difference to remove primer from the fastq file rather to use > > markDuplicates > http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates > on > > an alignment? > > > > Would both ways deliver the same results? > > > > Thank you in advance. > > > > > > On Thu, Apr 5, 2012 at 8:26 AM, Wibowo Arindrarto < > w.arindrarto at gmail.com> > > wrote: > >> > >> Hi Tychele, > >> > >> Glad to hear that and thanks for attaching the code as well :). > >> > >> Just one more heads up on the code, the trimming function assumes that > >> for any record sequence, there is only one matching primer sequence at > >> most. If by any random chance a sequence begins with two or more > >> primer sequences, then it will only trim the first primer sequence. So > >> if you still see some primer sequences left in the trimmed sequences, > >> this could be the case and you'll need to modify the code. > >> > >> However, that seems unlikely ~ the current code should suffice. > >> > >> cheers, > >> Bow > >> > >> > >> On Thu, Apr 5, 2012 at 00:12, Tychele Turner wrote: > >> > Hi Bow, > >> > > >> > Thank you! This works great. I have attached the final code to the > email > >> > in case it may benefit others. > >> > > >> > Tychele > >> > > >> > > >> > ________________________________________ > >> > From: Wibowo Arindrarto [w.arindrarto at gmail.com] > >> > Sent: Wednesday, April 04, 2012 2:05 PM > >> > To: Tychele Turner > >> > Cc: biopython at biopython.org > >> > Subject: Re: [Biopython] biopython question > >> > > >> > Hi Tychele, > >> > > >> > If I understood correctly, you have a list of primers stored in a file > >> > and you want to trim those primer sequences off your fastq sequences, > >> > correct? One way I could think of is to first store the primers in a > >> > list (since they will be used repeatedly to check every single fastq > >> > sequence). > >> > > >> > Here's the code: > >> > > >> > from Bio import SeqIO > >> > > >> > def trim_primers(records, 'primer_file_name'): > >> > > >> > # read the primers > >> > primer_list = [] > >> > with open('primer_file_name', 'r') as source: > >> > for line in source: > >> > primer_list.append(line.strip()) > >> > > >> > for record in records: > >> > # list to check if the sequence begins with any of the primers > >> > check = [record.seq.startswith(x) for x in primer_list] > >> > # if any of the primer is present in the beginning of the > >> > sequence, then we trim it off > >> > if any(check): > >> > # get index of primer that matches the beginning > >> > idx = check.index(True) > >> > len_primer = len(primer_list[idx]) > >> > yield record[len_primer:] > >> > # otherwise just return the whole record > >> > else: > >> > yield record > >> > > >> > and then, you can use the function like so: > >> > > >> > original_reads = SeqIO.parse("SRR020192.fastq", "fastq") > >> > trimmed_reads = trim_primers(original_reads, 'primer_file_name') > >> > count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") > >> > print "Saved %i reads" % count > >> > > >> > I haven't tested the function, but I suppose that should do the trick. > >> > > >> > Hope that helps :), > >> > Bow > >> > > >> > > >> > On Wed, Apr 4, 2012 at 18:55, Tychele Turner > wrote: > >> >> Hi, > >> >> > >> >> I have a question regarding one of the biopython capabilities. 
I would >> >> like to trim primers off the end of reads in a fastq file and I found >> >> wonderful documentation of how to do this on your website as follows: >> >> >> >> from Bio import SeqIO >> >> def trim_primers(records, primer): >> >> """Removes perfect primer sequences at start of reads. >> >> >> >> This is a generator function, the records argument should >> >> be a list or iterator returning SeqRecord objects. >> >> """ >> >> len_primer = len(primer) #cache this for later >> >> for record in records: >> >> if record.seq.startswith(primer): >> >> yield record[len_primer:] >> >> else: >> >> yield record >> >> >> >> original_reads = SeqIO.parse("SRR020192.fastq", "fastq") >> >> trimmed_reads = trim_primers(original_reads, "GATGACGGTGT") >> >> count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") >> >> print "Saved %i reads" % count >> >> >> >> >> >> >> >> My question is: Is there a way to loop through a primer file for >> >> instance instead of looking for only >> >> >> >> 'GATGACGGTGT' every primer would be checked and subsequently removed >> >> from the start of its respective read. >> >> >> >> Primer file structured as: >> >> GATGACGGTGT >> >> GATGACGGTGA >> >> GATGACGGCCT >> >> >> >> If you have any suggestions it would be greatly appreciated. Thanks. >> >> >> >> Tychele >> >> >> >> >> >> _______________________________________________ >> >> Biopython mailing list - Biopython at lists.open-bio.org >> >> http://lists.open-bio.org/mailman/listinfo/biopython >> > >> >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > From ferreirafm at usp.br Fri Apr 6 10:44:04 2012 From: ferreirafm at usp.br (ferreirafm at usp.br) Date: Fri, 06 Apr 2012 07:44:04 -0300 Subject: [Biopython] random peptide sequences In-Reply-To: References: <20120404155608.14619h2pl6oj4y88@webmail.usp.br> <20120404165644.15233099ve7c2pvg@webmail.usp.br> <20120404210142.95126nwmeotjeg9y@webmail.usp.br> Message-ID: <20120406074404.12906v4jd37l4s8k@webmail.usp.br> Hi Peter, a quick reply also: thanks. Quoting Peter Cock: > Just a quick reply - try changing this: > > rec = SeqRecord('GPGPG'.join(peplist), id="pep%s" % k) > > to > > rec = SeqRecord(Seq('GPGPG'.join(peplist)), id="pep%s" % k) > > You'll need to add this import line at the start as well, > from Bio.Seq import Seq > From 88whacko at gmail.com Fri Apr 6 13:01:49 2012 From: 88whacko at gmail.com (Andrea Rizzi) Date: Fri, 6 Apr 2012 15:01:49 +0200 Subject: [Biopython] GSoC - variants proposal - Andrea Rizzi Message-ID: Hi everybody, I'm a master's student at the Royal School of Technology in Stockholm. My program is Computational and System Biology and I'm interested in the representation and manipulation of variants project. I hope I'll have the chance to work with you. Here is the link to my proposal on google docs: https://docs.google.com/document/d/1iAjuOT1MzfMYDPr7ghDCB8pkWRdJqT_Pr5cvUEKekgg/ Reece, Brad: thank you very much for finding the time to answer me. The proposal on google docs is now publicly available with comments enabled and I've added a short summary in my google application. Cheers!
-- -- Andrea From tturne18 at jhmi.edu Sat Apr 7 15:05:30 2012 From: tturne18 at jhmi.edu (Tychele Turner) Date: Sat, 7 Apr 2012 15:05:30 +0000 Subject: [Biopython] [Samtools-help] biopython question In-Reply-To: References: <22450DD328862542A918A3BC491F263B561ACC@SN2PRD0102MB141.prod.exchangelabs.com> <22450DD328862542A918A3BC491F263B561B7B@SN2PRD0102MB141.prod.exchangelabs.com> , Message-ID: <22450DD328862542A918A3BC491F263B561E92@SN2PRD0102MB141.prod.exchangelabs.com> Hi Mic, I just saw your message regarding Mark Duplicates and the script Bow and I were discussing which recognizes and cleaves primers. First off, I'm familiar with Mark Duplicates from Picard and I do use it for exome data. However, in this instance I was looking at sequences coming from short amplicon sequencing. In this instance, marking duplicates is not appropriate because most of the reads will be duplicates due to the nature of the bench experiment (in contrast to shotgun sequencing, where you're looking at random fragments in which PCR artifacts arise in the PCR steps post-shearing). In my short amplicon sequence data, the read will start with the primer sequence and then extend to be a total length of 100 nucleotides. For this reason, I wanted to use a script which could recognize the primer and ultimately cleave that primer from the read so it would not go into the rest of the pipeline, which would ultimately go to a variant calling program. As for your last point of sending other software which cuts adapters, that's fine, but I'm not cutting adapters; I'm looking for primer sequences and cleaving those. Also, I thought that if Biopython already has such a nice setup to do this I would use that, especially since python is quite efficient at this task. Hope this helps. Tychele From: Mic [mictadlo at gmail.com] Sent: Friday, April 06, 2012 5:59 AM To: Wibowo Arindrarto Cc: samtools-help; biopython at biopython.org Subject: Re: [Samtools-help] [Biopython] biopython question Hi Bow, You can remove duplicates in the input file or create a new output file. With the following commands you create an output file with no duplicates: $ samtools fixmate t.paired.sorted.bam t.paired.sorted.SamFix.bam $ java -Xmx8g -jar MarkDuplicates.jar REMOVE_DUPLICATES=true ASSUME_SORTED=true MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1024 INPUT=t.paired.sorted.bam OUTPUT=t.paired.sorted.rmdulp.bam METRICS_FILE=t.paired.sorted.bam.metrics Are adapters and fragments the same? I found the following software for adapter: * TagDust - eliminate artifactual sequence from NGS data http://www.biomedcentral.com/1471-2164/12/382 http://bioinformatics.oxfordjournals.org/content/25/21/2839.full * FAR: http://sourceforge.net/apps/mediawiki/theflexibleadap/index.php?title=Main_Page * Trimmomatic: http://www.usadellab.org/cms/index.php?page=trimmomatic * http://code.google.com/p/cutadapt/ * https://github.com/vsbuffalo/scythe * http://code.google.com/p/biopieces/wiki/find_adaptor Thank you in advance. Cheers, On Fri, Apr 6, 2012 at 12:20 PM, Wibowo Arindrarto wrote: Hi Mic, I'm not familiar with picard, but it seems that this program detects whole duplicate molecules instead of detecting whether a primer is present in sequences (which may or may not be duplicates). Plus, it doesn't do any removal ~ it only flags them. So I don't think the two are comparable.
cheers, Bow On Fri, Apr 6, 2012 at 04:06, Mic > wrote: > What is the difference to remove primer from the fastq file rather to use > markDuplicates http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates on > an alignment? > > Would both ways deliver the same results? > > Thank you in advance. > > > On Thu, Apr 5, 2012 at 8:26 AM, Wibowo Arindrarto > > wrote: >> >> Hi Tychele, >> >> Glad to hear that and thanks for attaching the code as well :). >> >> Just one more heads up on the code, the trimming function assumes that >> for any record sequence, there is only one matching primer sequence at >> most. If by any random chance a sequence begins with two or more >> primer sequences, then it will only trim the first primer sequence. So >> if you still see some primer sequences left in the trimmed sequences, >> this could be the case and you'll need to modify the code. >> >> However, that seems unlikely ~ the current code should suffice. >> >> cheers, >> Bow >> >> >> On Thu, Apr 5, 2012 at 00:12, Tychele Turner > wrote: >> > Hi Bow, >> > >> > Thank you! This works great. I have attached the final code to the email >> > in case it may benefit others. >> > >> > Tychele >> > >> > >> > ________________________________________ >> > From: Wibowo Arindrarto [w.arindrarto at gmail.com] >> > Sent: Wednesday, April 04, 2012 2:05 PM >> > To: Tychele Turner >> > Cc: biopython at biopython.org >> > Subject: Re: [Biopython] biopython question >> > >> > Hi Tychele, >> > >> > If I understood correctly, you have a list of primers stored in a file >> > and you want to trim those primer sequences off your fastq sequences, >> > correct? One way I could think of is to first store the primers in a >> > list (since they will be used repeatedly to check every single fastq >> > sequence). >> > >> > Here's the code: >> > >> > from Bio import SeqIO >> > >> > def trim_primers(records, 'primer_file_name'): >> > >> > # read the primers >> > primer_list = [] >> > with open('primer_file_name', 'r') as source: >> > for line in source: >> > primer_list.append(line.strip()) >> > >> > for record in records: >> > # list to check if the sequence begins with any of the primers >> > check = [record.seq.startswith(x) for x in primer_list] >> > # if any of the primer is present in the beginning of the >> > sequence, then we trim it off >> > if any(check): >> > # get index of primer that matches the beginning >> > idx = check.index(True) >> > len_primer = len(primer_list[idx]) >> > yield record[len_primer:] >> > # otherwise just return the whole record >> > else: >> > yield record >> > >> > and then, you can use the function like so: >> > >> > original_reads = SeqIO.parse("SRR020192.fastq", "fastq") >> > trimmed_reads = trim_primers(original_reads, 'primer_file_name') >> > count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") >> > print "Saved %i reads" % count >> > >> > I haven't tested the function, but I suppose that should do the trick. >> > >> > Hope that helps :), >> > Bow >> > >> > >> > On Wed, Apr 4, 2012 at 18:55, Tychele Turner > wrote: >> >> Hi, >> >> >> >> I have a question regarding one of the biopython capabilities. I would >> >> like to trim primers off the end of reads in a fastq file and I found >> >> wonderful documentation of how to do this on your website as follows: >> >> >> >> from Bio import SeqIO >> >> def trim_primers(records, primer): >> >> """Removes perfect primer sequences at start of reads. 
>> >> This is a generator function, the records argument should >> >> be a list or iterator returning SeqRecord objects. >> >> """ >> >> len_primer = len(primer) #cache this for later >> >> for record in records: >> >> if record.seq.startswith(primer): >> >> yield record[len_primer:] >> >> else: >> >> yield record >> >> >> >> original_reads = SeqIO.parse("SRR020192.fastq", "fastq") >> >> trimmed_reads = trim_primers(original_reads, "GATGACGGTGT") >> >> count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") >> >> print "Saved %i reads" % count >> >> >> >> >> >> >> >> My question is: Is there a way to loop through a primer file for >> >> instance instead of looking for only >> >> >> >> 'GATGACGGTGT' every primer would be checked and subsequently removed >> >> from the start of its respective read. >> >> >> >> Primer file structured as: >> >> GATGACGGTGT >> >> GATGACGGTGA >> >> GATGACGGCCT >> >> >> >> If you have any suggestions it would be greatly appreciated. Thanks. >> >> >> >> Tychele >> >> >> >> >> >> _______________________________________________ >> >> Biopython mailing list - Biopython at lists.open-bio.org >> >> http://lists.open-bio.org/mailman/listinfo/biopython >> > >> >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > From rbuels at gmail.com Sun Apr 8 16:34:33 2012 From: rbuels at gmail.com (Robert Buels) Date: Sun, 08 Apr 2012 12:34:33 -0400 Subject: [Biopython] Google Summer of Code mentors Message-ID: <4F81BE19.2050605@gmail.com> Hi all, Reminder: if you want to help mentor Google Summer of Code students to work on your Bio* project, you have to do three things: 1. Make sure you have enough time to actually help a student over the summer 2. Sign up as a mentor for the Open Bioinformatics Foundation at http://www.google-melange.com/gsoc/homepage/google/gsoc2012 3. Join the OBF Google Summer of Code mailing lists at: http://lists.open-bio.org/mailman/listinfo/gsoc and http://lists.open-bio.org/mailman/listinfo/gsoc-mentors Robert Buels 2012 OBF GSoC Org. Admin. From alfonso.esposito1983 at hotmail.it Mon Apr 9 10:40:57 2012 From: alfonso.esposito1983 at hotmail.it (fonz esposito) Date: Mon, 9 Apr 2012 12:40:57 +0200 Subject: [Biopython] biopython script and py2exe Message-ID: Dear all, I wrote a biopython script for sequence blast and report. It is quite a simple one, but it only works on Linux, and all my colleagues complain. I found a way to convert it into a .exe that could run on Windows; it is called py2exe. I tried it but it does not work: it gives me a lot of error messages, so I read some more and I found that it requires a .dll file to run properly, and furthermore there are some more things to do to make it work with biopython packages... Did any of you have the same problem? Could someone help me to solve it? thanks in advance. Alfonso From rbuels at gmail.com Mon Apr 9 14:57:41 2012 From: rbuels at gmail.com (Robert Buels) Date: Mon, 09 Apr 2012 10:57:41 -0400 Subject: [Biopython] Google Summer of Code mentors Message-ID: <4F82F8E5.2040709@gmail.com> Hi all, Reminder: if you want to help mentor Google Summer of Code students to work on your Bio* project, you have to do four things: 1. Make sure you have enough time to actually help a student over the summer 2. Sign up as a mentor for the Open Bioinformatics Foundation at http://www.google-melange.com/gsoc/homepage/google/gsoc2012 3.
Join the OBF Google Summer of Code mailing lists at: http://lists.open-bio.org/mailman/listinfo/gsoc and http://lists.open-bio.org/mailman/listinfo/gsoc-mentors 4. After your request to be a mentor is accepted by me, log into the GSoC web interface at http://www.google-melange.com (the same web application you used to sign up) and help look at and evaluate this year's student proposals. Robert Buels 2012 OBF GSoC Org. Admin. From jgrant at smith.edu Mon Apr 9 20:29:16 2012 From: jgrant at smith.edu (Jessica Grant) Date: Mon, 9 Apr 2012 16:29:16 -0400 Subject: [Biopython] search ncbi automatically Message-ID: <4531D4A6-5312-4DC4-9418-6D6939D5BD46@smith.edu> Hello, I am working on a phylogenomic pipeline and want to keep my database as up-to-date as possible. I was wondering if there is a way to automatically search genbank on occasion (every month or so, or however often they release new data) to see if any new sequences have been added for the taxa we are working with. Is there a way to run a script in the background that will just go out and do that for me, and let me know if it finds anything? Thanks for your help! Jessica From tturne18 at jhmi.edu Mon Apr 9 20:44:55 2012 From: tturne18 at jhmi.edu (Tychele Turner) Date: Mon, 9 Apr 2012 20:44:55 +0000 Subject: [Biopython] [Samtools-help] biopython question In-Reply-To: References: <22450DD328862542A918A3BC491F263B561ACC@SN2PRD0102MB141.prod.exchangelabs.com> <22450DD328862542A918A3BC491F263B561B7B@SN2PRD0102MB141.prod.exchangelabs.com> <22450DD328862542A918A3BC491F263B561E92@SN2PRD0102MB141.prod.exchangelabs.com>, Message-ID: <22450DD328862542A918A3BC491F263B562034@SN2PRD0102MB141.prod.exchangelabs.com> Thanks Monica! I will look into your program. Tychele ________________________________ From: Monica Britton [mtbritton at ucdavis.edu] Sent: Saturday, April 07, 2012 5:55 PM To: Tychele Turner Cc: Mic; Wibowo Arindrarto; samtools-help Subject: Re: [Samtools-help] [Biopython] biopython question Hi Tychele: If your primer is always at the beginning of each sequence, you could treat it as a barcode. We have a program to cleave barcodes from fastq sequences that would fit your purpose (see https://github.com/ucdavis-bioinformatics/sabre). Monica Britton On Sat, Apr 7, 2012 at 8:05 AM, Tychele Turner > wrote: Hi Mic, I just saw your message regarding Mark Duplicates and the script Bow and I were discussing which recognizes and cleaves primers. First off, I'm familiar with Mark Duplicates from Picard and I do use it for exome data. However, in this instance I was looking at sequences coming from short amplicon sequencing. In this instance, marking duplicates is not appropriate because most of the reads will be duplicates due to the nature of the bench experiment (in contrast to shotgun sequencing where your looking at random fragments in which PCR artifacts arise in the PCR steps post-shearing). In my short amplicon sequence data, the read will start with the primer sequence and then extend to be a total length of 100 nucleotides. For this reason, I wanted to use a script which could recognize the primer and ultimately cleave that primer from the read so it would not go into the rest of the pipeline which would ultimately go to a variant calling program. As for your last point of sending other software which cut adapters that's fine but I'm not cutting adapters I'm looking for primer sequences and cleaving those. 
Also, I thought that if Biopython already has such a nice setup to do this I would use that especially since python is quite efficient at this task. Hope this helps. Tychele From: Mic [mictadlo at gmail.com] Sent: Friday, April 06, 2012 5:59 AM To: Wibowo Arindrarto Cc: samtools-help; biopython at biopython.org Subject: Re: [Samtools-help] [Biopython] biopython question Hi Bow, You can remove duplicates in the input file or create a new output file. With the following commands you create an output file with no duplicates: $ samtools fixmate t.paired.sorted.bam t.paired.sorted.SamFix.bam $ java -Xmx8g -jar MarkDuplicates.jar REMOVE_DUPLICATES=true ASSUME_SORTED=true MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1024 INPUT=t.paired.sorted.bam OUTPUT=t.paired.sorted.rmdulp.bam METRICS_FILE=t.paired.sorted.bam.metrics Are adapters and fragments the same? I found the following software for adapter: * TagDust - eliminate artifactual sequence from NGS data http://www.biomedcentral.com/1471-2164/12/382 http://bioinformatics.oxfordjournals.org/content/25/21/2839.full * FAR: http://sourceforge.net/apps/mediawiki/theflexibleadap/index.php? title=Main_Page * Trimmomatic: http://www.usadellab.org/cms/index.php?page=trimmomatic * http://code.google.com/p/cutadapt/ * https://github.com/vsbuffalo/scythe * http://code.google.com/p/biopieces/wiki/find_adaptor Thank you in advance. Cheers, On Fri, Apr 6, 2012 at 12:20 PM, Wibowo Arindrarto > wrote: Hi Mic, I'm not familiar with picard, but it seems that this program detects whole duplicate molecules instead of detecting whether a primer is present in sequences (which may or may not be duplicates). Plus, it doesn't do any removal ~ it only flags them. So I don't think the two are comparable. cheers, Bow On Fri, Apr 6, 2012 at 04:06, Mic > wrote: > What is the difference to remove primer from the fastq file rather to use > markDuplicates http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates on > an alignment? > > Would both ways deliver the same results? > > Thank you in advance. > > > On Thu, Apr 5, 2012 at 8:26 AM, Wibowo Arindrarto > > wrote: >> >> Hi Tychele, >> >> Glad to hear that and thanks for attaching the code as well :). >> >> Just one more heads up on the code, the trimming function assumes that >> for any record sequence, there is only one matching primer sequence at >> most. If by any random chance a sequence begins with two or more >> primer sequences, then it will only trim the first primer sequence. So >> if you still see some primer sequences left in the trimmed sequences, >> this could be the case and you'll need to modify the code. >> >> However, that seems unlikely ~ the current code should suffice. >> >> cheers, >> Bow >> >> >> On Thu, Apr 5, 2012 at 00:12, Tychele Turner > wrote: >> > Hi Bow, >> > >> > Thank you! This works great. I have attached the final code to the email >> > in case it may benefit others. >> > >> > Tychele >> > >> > >> > ________________________________________ >> > From: Wibowo Arindrarto [w.arindrarto at gmail.com] >> > Sent: Wednesday, April 04, 2012 2:05 PM >> > To: Tychele Turner >> > Cc: biopython at biopython.org >> > Subject: Re: [Biopython] biopython question >> > >> > Hi Tychele, >> > >> > If I understood correctly, you have a list of primers stored in a file >> > and you want to trim those primer sequences off your fastq sequences, >> > correct? One way I could think of is to first store the primers in a >> > list (since they will be used repeatedly to check every single fastq >> > sequence). 
>> > >> > Here's the code: >> > >> > from Bio import SeqIO >> > >> > def trim_primers(records, primer_file_name): >> > >> > # read the primers >> > primer_list = [] >> > with open(primer_file_name, 'r') as source: >> > for line in source: >> > primer_list.append(line.strip()) >> > >> > for record in records: >> > # list to check if the sequence begins with any of the primers >> > check = [record.seq.startswith(x) for x in primer_list] >> > # if any of the primers is present at the beginning of the >> > sequence, then we trim it off >> > if any(check): >> > # get index of primer that matches the beginning >> > idx = check.index(True) >> > len_primer = len(primer_list[idx]) >> > yield record[len_primer:] >> > # otherwise just return the whole record >> > else: >> > yield record >> > >> > and then, you can use the function like so: >> > >> > original_reads = SeqIO.parse("SRR020192.fastq", "fastq") >> > trimmed_reads = trim_primers(original_reads, 'primer_file_name') >> > count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") >> > print "Saved %i reads" % count >> > >> > I haven't tested the function, but I suppose that should do the trick. >> > >> > Hope that helps :), >> > Bow >> > >> > >> > On Wed, Apr 4, 2012 at 18:55, Tychele Turner > wrote: >> >> Hi, >> >> >> >> I have a question regarding one of the biopython capabilities. I would >> >> like to trim primers off the end of reads in a fastq file and I found >> >> wonderful documentation of how to do this on your website as follows: >> >> >> >> from Bio import SeqIO >> >> def trim_primers(records, primer): >> >> """Removes perfect primer sequences at start of reads. >> >> >> >> This is a generator function, the records argument should >> >> be a list or iterator returning SeqRecord objects. >> >> """ >> >> len_primer = len(primer) #cache this for later >> >> for record in records: >> >> if record.seq.startswith(primer): >> >> yield record[len_primer:] >> >> else: >> >> yield record >> >> >> >> original_reads = SeqIO.parse("SRR020192.fastq", "fastq") >> >> trimmed_reads = trim_primers(original_reads, "GATGACGGTGT") >> >> count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") >> >> print "Saved %i reads" % count >> >> >> >> >> >> >> >> >> >> My question is: Is there a way to loop through a primer file, for >> >> instance, instead of looking for only >> >> >> >> 'GATGACGGTGT', so that every primer would be checked and subsequently removed >> >> from the start of its respective read. >> >> >> >> Primer file structured as: >> >> GATGACGGTGT >> >> GATGACGGTGA >> >> GATGACGGCCT >> >> >> >> If you have any suggestions it would be greatly appreciated. Thanks. >> >> >> >> Tychele >> >> >> >> >> >> _______________________________________________ >> >> Biopython mailing list - Biopython at lists.open-bio.org >> >> http://lists.open-bio.org/mailman/listinfo/biopython >> > >> >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython 
_______________________________________________ Samtools-help mailing list Samtools-help at lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/samtools-help -- Monica Britton Bioinformatics Analyst Genome Center and Bioinformatics Core Facility University of California, Davis mtbritton at ucdavis.edu From David.Lapointe at umassmed.edu Tue Apr 10 00:09:57 2012 From: David.Lapointe at umassmed.edu (Lapointe, David) Date: Tue, 10 Apr 2012 00:09:57 +0000 Subject: [Biopython] search ncbi automatically In-Reply-To: <4531D4A6-5312-4DC4-9418-6D6939D5BD46@smith.edu> References: <4531D4A6-5312-4DC4-9418-6D6939D5BD46@smith.edu> Message-ID: <86BFEB1DFA6CB3448DB8AB1FC52F4059069C60@ummscsmbx06.ad.umassmed.edu> Hi Jessica, David Hibbett, at Clark Univ, had a program with a similar purpose. See http://www.clarku.edu/faculty/dhibbett/. Genbank publishes daily updates which can be scanned for taxa with some biopython scripts. That would involve some downloading every day or so. Each file ranges from 10-100 Mb compressed, though some days there might be a 900 Mb file. A new version of Genbank happens every 2 months, so if you have a division (VRL, PRI, etc.) that interests you, you can download all of the pieces of that division and rsync when a new Genbank version comes around. David ________________________________________ From: biopython-bounces at lists.open-bio.org [biopython-bounces at lists.open-bio.org] on behalf of Jessica Grant [jgrant at smith.edu] Sent: Monday, April 09, 2012 4:29 PM To: biopython at lists.open-bio.org Subject: [Biopython] search ncbi automatically Hello, I am working on a phylogenomic pipeline and want to keep my database as up-to-date as possible. I was wondering if there is a way to automatically search genbank on occasion (every month or so, or however often they release new data) to see if any new sequences have been added for the taxa we are working with. Is there a way to run a script in the background that will just go out and do that for me, and let me know if it finds anything? Thanks for your help! Jessica _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython 
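As a rough illustration of the kind of scheduled check Jessica asks about, something along these lines could be run from cron every month. This is an untested sketch, not code from the thread: the taxon names and email address are placeholders, and the reldate/datetype options are standard ESearch parameters that Bio.Entrez passes through to NCBI:

from Bio import Entrez

Entrez.email = "your.name@example.org"  # placeholder - NCBI wants a real address

def count_recent_records(taxon, days=30):
    # datetype="pdat" plus reldate restricts the search to records
    # released in the last `days` days
    handle = Entrez.esearch(db="nucleotide",
                            term="%s[Organism]" % taxon,
                            datetype="pdat", reldate=days)
    results = Entrez.read(handle)
    handle.close()
    return int(results["Count"])

for taxon in ["Rhizaria", "Heterolobosea"]:  # placeholder taxa of interest
    hits = count_recent_records(taxon)
    if hits:
        print "%s: %i new records in the last month" % (taxon, hits)

A wrapper script could simply mail this output whenever any count is non-zero, which avoids downloading the daily update files unless something new has actually appeared.
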
From livingstonemark at gmail.com Thu Apr 12 01:34:16 2012 From: livingstonemark at gmail.com (Mark Livingstone) Date: Thu, 12 Apr 2012 11:34:16 +1000 Subject: [Biopython] Key Error Message-ID: Hi Guys, I am about 3 days into learning BioPython using the current EPD 32 bit Mac OS X academic distribution. When I run the included code, it works fine if I do num_atoms_to_do = 36 but if I try to do any more, I get a key error. I am using the 1fat.pdb since that is what your tutes seem to use. The only thing that I note is that when comparing residues in model A & C, it is at residue 37 that the letters are no longer the same. However, since I only look at Model A in the first part of the code, I can't see that this should be a factor? Program output: From residue 0 0.000000 3.795453 5.663135 9.420295 12.296334 15.957790 19.201048 22.622383 25.621159 27.803837 27.483303 28.365652 28.663099 27.070955 23.441793 24.625151 26.016047 29.257225 31.299105 34.907970 36.100464 32.837784 33.310841 32.332653 35.280716 36.668713 35.319839 31.738102 31.607626 30.115669 33.220890 32.487370 36.199894 39.507755 42.668190 46.333019 Traceback (most recent call last): File "./inter_atom_distance.py", line 47, in <module> residue2 = chain[y+1] File "/Library/Frameworks/Python.framework/Versions/7.2/lib/python2.7/site-packages/Bio/PDB/Chain.py", line 67, in __getitem__ return Entity.__getitem__(self, id) File "/Library/Frameworks/Python.framework/Versions/7.2/lib/python2.7/site-packages/Bio/PDB/Entity.py", line 38, in __getitem__ return self.child_dict[id] KeyError: (' ', 37, ' ') #! /usr/bin/env python # Initial idea for this code from http://stackoverflow.com/questions/6437391/how-to-get-distance-between-two-atoms-using-for-loop print "\n\n\033[95mMark's PDB Geometry experimentation code\033[0m\n\n" print "This code prints a set of c-alpha to c-alpha distances which are colourised so that distances < 5.0 are green\n" print "otherwise are colourised red. If you are using Microsoft Windows, you may need to load an ansi.sys driver in your config.sys\n\n" print "Any errors below in yellow are due to the .pdb file not being properly well formed.\n\n\033[93m" from Bio.PDB.PDBParser import PDBParser pdb_filename ='./1fat.pdb' parser = PDBParser(PERMISSIVE=1) structure = parser.get_structure("1fat", pdb_filename) model = structure[0] chain = model["A"] residue1 = chain[1] print "\033[0m" print residue1.get_resname() # SER residue2 = chain[2] print residue2.get_resname() # ASN atom1 = residue1['CA'] print atom1.get_name() print atom1.get_coord() atom2 = residue2['CA'] print atom2.get_name() print atom2.get_coord() distance = atom1-atom2 # minus is overloaded to do 3D vector math - how clever!! 
print"%s to %s euclidean distance = %f Angstroms" % (residue1.get_resname(), residue2.get_resname(), distance) print "%d models in pdb file named %s" % (len(model), pdb_filename) # 4 models in pdb print "%d residues in model 1" % len(chain) # 239 residues print "%s has %d atoms" % (residue1.get_resname(), len(residue1)) # SER has 6 atoms print "Length of Model 'A' is %d and Model 'C' is %d" % (len(model['A']), len(model['C'])) print ("\n\033[93mDistances between C-alpha atoms of residues in model 1 \n") num_atoms_to_do = 37 for x in range(num_atoms_to_do): print "\033[0mFrom residue %d" % x residue1 = chain[x+1] atom1 = residue1['CA'] for y in range(num_atoms_to_do): residue2 = chain[y+1] atom2 = residue2['CA'] distance = (atom1 - atom2) if distance < 5.0: print("\033[92m%f" % distance), else: print("\033[91m%f" % distance), print "\n" print "\n\033[93mDistances between C-alpha atoms of residues in model 1 to model 3 \n" print "NB: These have NOT been superimposed - thus the large distances between matched atoms\033[0m\n" num_atoms_to_do = 37 for x in range(num_atoms_to_do): print "\033[0mFrom residue %d" % x model = structure[0] chain = model["A"] residue1 = chain[x+1] atom1 = residue1['CA'] for y in range(num_atoms_to_do): model = structure[0] chain = model["C"] residue2 = chain[y+1] atom2 = residue2['CA'] distance = (atom1 - atom2) if distance < 5.0: print("\033[92m%f" % distance), else: print("\033[91m%f" % distance), print "\n" print "\033[0m" Thanks in advance, MarkL From livingstonemark at gmail.com Thu Apr 12 01:57:15 2012 From: livingstonemark at gmail.com (Mark Livingstone) Date: Thu, 12 Apr 2012 11:57:15 +1000 Subject: [Biopython] Key Error In-Reply-To: References: Message-ID: Hi Guys, I am about 3 days into learning BioPython using the current EPD 32 bit Mac OS X academic distribution . When I run the included code, it works fine if I do num_atoms_to_do = 36 but if I try to do any more, I get a key error. I am using the 1fat.pdb since that is what your tutes seem to use. The only thing that I note is that when comparing residues in model A & C, it is at residue 37 that the letters are no longer the same. However, since I only look at Model A in the first part of the code, I can't see that this should be a factor? Program output: >From residue 0 0.000000 3.795453 5.663135 9.420295 12.296334 15.957790 19.201048 22.622383 25.621159 27.803837 27.483303 28.365652 28.663099 27.070955 23.441793 24.625151 26.016047 29.257225 31.299105 34.907970 36.100464 32.837784 33.310841 32.332653 35.280716 36.668713 35.319839 31.738102 31.607626 30.115669 33.220890 32.487370 36.199894 39.507755 42.668190 46.333019 Traceback (most recent call last): ?File "./inter_atom_distance.py", line 47, in ? ?residue2 = chain[y+1] ?File "/Library/Frameworks/Python.framework/Versions/7.2/lib/python2.7/site-packages/Bio/PDB/Chain.py", line 67, in __getitem__ ? ?return Entity.__getitem__(self, id) ?File "/Library/Frameworks/Python.framework/Versions/7.2/lib/python2.7/site-packages/Bio/PDB/Entity.py", line 38, in __getitem__ ? ?return self.child_dict[id] KeyError: (' ', 37, ' ') #! /usr/bin/env python # Initial idea for this code from http://stackoverflow.com/questions/6437391/how-to-get-distance-between-two-atoms-using-for-loop print "\n\n\033[95mMark's PDB Geometry experimentation code\033[0m\n\n" print "This code prints a set of c-alpha to c-alpha distances which are colourised so that distances < 5.0 are green\n" print "otherwise are colourised red. 
From ajperry at pansapiens.com Thu Apr 12 03:07:57 2012 From: ajperry at pansapiens.com (Andrew Perry) Date: Thu, 12 Apr 2012 13:07:57 +1000 Subject: [Biopython] Key Error In-Reply-To: References: Message-ID: Hi Mark, The problem is arising since 1FAT is missing coordinates for residue 37 in chains A, B, C and D. This is very common for protein structures in the PDB, and can be for many reasons - it's often the case that the structural biologist who determined the structure left out this residue since their data didn't allow them to determine its position with confidence. By using range(num_atoms_to_do), you are assuming that there will be no numbers missing in the sequence ... not the case! [also, I think you really mean range(num_of_residues_to_do) ]. 
The solution would be to do something like this: from Bio.PDB.PDBParser import PDBParser pdb_filename ='./1fat.pdb' parser = PDBParser(PERMISSIVE=1) structure = parser.get_structure("1fat", pdb_filename) model = structure[0] chain = model["A"] for residue1 in chain: resnum = residue1.get_id()[1] atom1 = residue1['CA'] This loop will return residue objects in the Chain object, without caring if there is a residue missing in the sequence. (I can see how this could be confusing, since without looking at the source, it seems the Bio.PDB.Chain.Chain object mostly behaves like a Python sequence object (e.g. a list), but behaves like a dictionary when __getitem__ is called on it via chain[some_key]. I'm sure there's some good reason for that :) ) The next thing you may find is that you hit a non-amino acid ligand "NAG" without a 'CA' atom. Use something like: if not "CA" in residue1: continue to catch that. Also, just a pedantic note on terminology that may help in reading the docs and further questions - "A", "B", "C" and "D" are chains in PDB terminology. A "model" is something different (usually only found in NMR structures with multiple models per PDB file). Hope this helps, Andrew Perry Postdoctoral Fellow Whisstock Lab Department of Biochemistry and Molecular Biology Monash University, Clayton Campus, PO Box 13d, VIC, 3800, Australia. Mobile: +61 409 808 529 On Thu, Apr 12, 2012 at 11:57 AM, Mark Livingstone < livingstonemark at gmail.com> wrote: > [Mark's message and code, quoted in full above - snipped] 
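Taken together, Andrew's two fixes suggest a distance loop along these lines - a short, untested sketch (not from the thread itself), assuming the same 1fat.pdb file:

from Bio.PDB.PDBParser import PDBParser

parser = PDBParser(PERMISSIVE=1)
structure = parser.get_structure("1fat", "./1fat.pdb")
chain = structure[0]["A"]

# keep only residues that actually have a C-alpha atom; this skips
# both gaps in the residue numbering (e.g. the missing residue 37)
# and non-amino-acid ligands such as NAG
ca_atoms = [residue["CA"] for residue in chain if "CA" in residue]

for i, atom1 in enumerate(ca_atoms):
    for atom2 in ca_atoms[i + 1:]:
        distance = atom1 - atom2  # overloaded subtraction gives Angstroms
        if distance < 5.0:
            print "%s-%s: %.2f" % (atom1.get_parent().get_id()[1],
                                   atom2.get_parent().get_id()[1],
                                   distance)

Because the loop never indexes the chain by residue number, no KeyError can arise for residues that are absent from the coordinates.
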
From matthiasschade.de at googlemail.com Sat Apr 14 11:41:33 2012 From: matthiasschade.de at googlemail.com (Matthias Schade) Date: Sat, 14 Apr 2012 13:41:33 +0200 Subject: [Biopython] (no subject) Message-ID: <4F89626D.9060807@googlemail.com> Hello everyone, I would like to run a blastn-query of a small nucleotide-sequence against a genome. The code works already, but my queries are still slow and mostly ineffective, so I would like to ask: Is there a way to tell the blastn-algorithm that once a 'perfect match' has been found it can stop and send back the results? Background: I am interested in only the first full match because I would like to design a nucleotide-probe which -if possible- has no(!) known match in a host-genome, neither in RNA nor DNA. Actually, I would reject all perfect-matches and all single-mismatches but allow every sequence with two or more mismatches. 
Currently, I use this line of code with seq_now being about 15-30 nt long: result_handle = NCBIWWW.qblast("blastn", "nt", seq_now, entrez_query="Canis familiaris[orgn]") I am still new to this. Thank you for your help and input, Matt From mrrizkalla at gmail.com Sat Apr 14 12:35:52 2012 From: mrrizkalla at gmail.com (Mariam Reyad Rizkallah) Date: Sat, 14 Apr 2012 14:35:52 +0200 Subject: [Biopython] History, Efetch, and returned records limits Message-ID: Dear community, I aim to get sequences by a list of gi (using efetch and history variables), for a certain taxid (using esearch). I always get the first 10,000 records. For example, I need 10,300 gi_ids, I split them into lists of 10,000 and submit them consecutively and still get the first 10,000 records. I tried the batch approach in the Biopython tutorial, and didn't even reach 10,000 sequences. Is there a limit for NCBI's returned sequences? Thank you. Mariam From p.j.a.cock at googlemail.com Sat Apr 14 12:54:55 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 14 Apr 2012 13:54:55 +0100 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: Message-ID: On Sat, Apr 14, 2012 at 1:35 PM, Mariam Reyad Rizkallah wrote: > Dear community, > > I aim to get sequences by a list of gi (using efetch and history > variables), for a certain taxid (using esearch). I always get the first > 10,000 records. For example, I need 10,300 gi_ids, I split them into lists of > 10,000 and submit them consecutively and still get the first 10,000 > records. I tried the batch approach in the Biopython tutorial, and didn't even reach > 10,000 sequences. > > Is there a limit for NCBI's returned sequences? > > Thank you. > > Mariam It does sound like you've found some sort of Entrez limit, it might be worth emailing the NCBI to clarify this. Have you considered downloading the GI/taxid mapping table from their FTP site instead? e.g. http://lists.open-bio.org/pipermail/biopython/2009-June/005295.html Peter From cjfields at illinois.edu Sat Apr 14 13:22:55 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Sat, 14 Apr 2012 13:22:55 +0000 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: Message-ID: <3146B62A-6B02-4A22-862D-68223F6A13E0@illinois.edu> On Apr 14, 2012, at 7:54 AM, Peter Cock wrote: > On Sat, Apr 14, 2012 at 1:35 PM, Mariam Reyad Rizkallah > wrote: >> Dear community, >> >> I aim to get sequences by a list of gi (using efetch and history >> variables), for a certain taxid (using esearch). I always get the first >> 10,000 records. For example, I need 10,300 gi_ids, I split them into lists of >> 10,000 and submit them consecutively and still get the first 10,000 >> records. I tried the batch approach in the Biopython tutorial, and didn't even reach >> 10,000 sequences. >> >> Is there a limit for NCBI's returned sequences? >> >> Thank you. >> >> Mariam > > It does sound like you've found some sort of Entrez limit, > it might be worth emailing the NCBI to clarify this. > > Have you considered downloading the GI/taxid mapping > table from their FTP site instead? e.g. > http://lists.open-bio.org/pipermail/biopython/2009-June/005295.html > > Peter This wouldn't surprise me, they have long suggested breaking up record retrieval into batches of a few thousand or more, using retstart/retmax. 
chris From mrrizkalla at gmail.com Sat Apr 14 13:36:29 2012 From: mrrizkalla at gmail.com (Mariam Reyad Rizkallah) Date: Sat, 14 Apr 2012 15:36:29 +0200 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: Message-ID: Hi Peter, I am concerned with EST sequences, I will check if they are in the gi_taxid_nucl mappers. Please find my script, with 3 approaches and their results. #!/usr/bin/python > import sys > from Bio import Entrez > Entrez.email = "mariam.rizkallah at gmail.com" > txid = int(sys.argv[1]) > > #get count > prim_handle = Entrez.esearch(db="nucest",term="txid%i[Organism:exp]" > %(txid), retmax=20) > prim_record = Entrez.read(prim_handle) > prim_count = prim_record['Count'] > > #get max using history (Biopython tutorial > http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc119) > search_handle = Entrez.esearch(db="nucest",term="txid%s[Organism:exp]" > %(txid), retmax=prim_count, usehistory="y") > search_results = Entrez.read(search_handle) > search_handle.close() > gi_list = search_results["IdList"] count = int(search_results["Count"]) > assert count == len(gi_list) > webenv = search_results["WebEnv"] > query_key = search_results["QueryKey"] > out_fasta = "%s_txid%i_ct%i.fasta" %(sys.argv[2], txid, count) > out_handle = open(out_fasta, "a") > > ## Approach1: gets ERROR tags within the fasta file (Unable to > obtain query #1) batch_size = 1000 > for start in range(0,count,batch_size): > end = min(count, start+batch_size) > print "Going to download record %i to %i" % (start+1, end) > fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", > retmode="text", retstart=start, retmax=batch_size, webenv=webenv, > query_key=query_key) > data = fetch_handle.read() > fetch_handle.close() > out_handle.write(data) > > ## Approach2: split list def SplitList( list, chunk_size ) : return [list[offs:offs+chunk_size] for offs in range(0, len(list), > chunk_size)] > z = SplitList(gi_list, 10000) for i in range(0, len(z)): > print len(z[i]) > post_handle = Entrez.epost("nucest", rettype="fasta", retmode="text", > id=",".join(z[1])) > webenv = search_results["WebEnv"] > query_key = search_results["QueryKey"] > fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", > retmode="text", webenv=webenv, query_key=query_key) > data = fetch_handle.read() > fetch_handle.close() > out_handle.write(data) > > ## Approach3: with most consistent retrieval but limited to 10000 fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", retmode="text", > webenv=webenv, query_key=query_key) > data = fetch_handle.read() > fetch_handle.close() > out_handle.write(data) out_handle.close() On Sat, Apr 14, 2012 at 2:54 PM, Peter Cock wrote: > On Sat, Apr 14, 2012 at 1:35 PM, Mariam Reyad Rizkallah > wrote: > > Dear community, > > > > I aim to get sequences by a list of gi (using efetch and history > > variables), for a certain taxid (using esearch). I always get the first > > 10,000 records. For example, I need 10,300 gi_ids, I split them into lists > of > > 10,000 and submit them consecutively and still get the first 10,000 > > records. I tried the batch approach in the Biopython tutorial, and didn't even reach > > 10,000 sequences. > > > > Is there a limit for NCBI's returned sequences? > > > > Thank you. > > > > Mariam > > It does sound like you've found some sort of Entrez limit, > it might be worth emailing the NCBI to clarify this. > > Have you considered downloading the GI/taxid mapping > table from their FTP site instead? e.g. 
> http://lists.open-bio.org/pipermail/biopython/2009-June/005295.html > > Peter > From p.j.a.cock at googlemail.com Sat Apr 14 13:52:43 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 14 Apr 2012 14:52:43 +0100 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: Message-ID: On Sat, Apr 14, 2012 at 2:36 PM, Mariam Reyad Rizkallah wrote: > Hi Peter, > > I am concerned with EST sequences, I will check if they are in the > gi_taxid_nucl mappers. > > Please find my script, with 3 approaches and their results. > >> #!/usr/bin/python >> import sys >> from Bio import Entrez >> Entrez.email = "mariam.rizkallah at gmail.com" >> txid = int(sys.argv[1]) >> ... Can you give an example taxid where this breaks? I guess any with just over 10,000 results would be fine but it would be simpler to use the same as you for comparing results. Peter From mrrizkalla at gmail.com Sat Apr 14 13:56:22 2012 From: mrrizkalla at gmail.com (Mariam Reyad Rizkallah) Date: Sat, 14 Apr 2012 15:56:22 +0200 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: Message-ID: Definitely! I am stuck with Rhizaria ( http://www.ncbi.nlm.nih.gov/nucest/?term=txid543769[Organism:exp])! Hope to move on through the tree of life :) ./get_est_by_txid.py "543769" "Rhizaria" Thank you so much. On Sat, Apr 14, 2012 at 3:52 PM, Peter Cock wrote: > On Sat, Apr 14, 2012 at 2:36 PM, Mariam Reyad Rizkallah > wrote: > > Hi Peter, > > > > I am concerned with EST sequences, I will check if they are in the > > gi_taxid_nucl mappers. > > > > Please find my script, with 3 approaches and their results. > > > >> #!/usr/bin/python > >> import sys > >> from Bio import Entrez > >> Entrez.email = "mariam.rizkallah at gmail.com" > >> txid = int(sys.argv[1]) > >> ... > > Can you give an example taxid where this breaks? I guess any > with just over 10,000 results would be fine but it would be simpler > to use the same as you for comparing results. 
> > Peter > From p.j.a.cock at googlemail.com Sat Apr 14 17:39:22 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 14 Apr 2012 18:39:22 +0100 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: Message-ID: Hi again, I get a similar problem with this code - the first couple of tries it got the first 5000 and then failed, but that doesn't always happen: $ python mariam.py 10313 Going to download record 1 to 1000 Going to download record 1001 to 2000 Going to download record 2001 to 3000 Traceback (most recent call last): File "mariam.py", line 28, in <module> assert data.startswith(">"), data AssertionError: Unable to obtain query #1 Sometimes it gets further: $ python mariam.py 10313 Going to download record 1 to 1000 Going to download record 1001 to 2000 Going to download record 2001 to 3000 Going to download record 3001 to 4000 Going to download record 4001 to 5000 Going to download record 5001 to 6000 Going to download record 6001 to 7000 Going to download record 7001 to 8000 Going to download record 8001 to 9000 Going to download record 9001 to 10000 Going to download record 10001 to 10313 Traceback (most recent call last): File "mariam.py", line 28, in <module> assert data.startswith(">"), data AssertionError: Unable to obtain query #1 Notice that this demonstrates one of the major flaws with the current NCBI Entrez setup - rather than setting an HTTP error code (which would trigger a clear exception), Entrez returns HTTP OK but puts an error in XML format (essentially a silent error). This is most unhelpful IMO. (This is something TogoWS handles much more nicely). #!/usr/bin/python import sys from Bio import Entrez Entrez.email = "mariam.rizkallah at gmail.com" #Entrez.email = "p.j.a.cock at googlemail.com" txid = 543769 name = "Rhizaria" #using history search_handle = Entrez.esearch(db="nucest",term="txid%s[Organism:exp]" %(txid), usehistory="y") search_results = Entrez.read(search_handle) search_handle.close() gi_list = search_results["IdList"] count = int(search_results["Count"]) print count webenv = search_results["WebEnv"] query_key = search_results["QueryKey"] out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count) out_handle = open(out_fasta, "a") #Sometimes get XML error not FASTA batch_size = 1000 for start in range(0,count,batch_size): end = min(count, start+batch_size) print "Going to download record %i to %i" % (start+1, end) fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", retmode="text", retstart=start, retmax=batch_size, webenv=webenv, query_key=query_key) data = fetch_handle.read() assert data.startswith(">"), data fetch_handle.close() out_handle.write(data) print "Done" out_handle.close() This is how I believe the NCBI expect this task to be done. In this specific case it seems to be an NCBI failure. Perhaps a loop to retry the efetch two or three times might work? It could be the whole history session breaks at the NCBI end though... A somewhat brute force approach would be to do the search (don't bother with the history) and get the 10313 GI numbers. Then use epost+efetch to grab the records in batches of say 1000. 
Peter From p.j.a.cock at googlemail.com Sat Apr 14 19:32:03 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 14 Apr 2012 20:32:03 +0100 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: Message-ID: On Sat, Apr 14, 2012 at 6:39 PM, Peter Cock wrote: > > A somewhat brute force approach would be to do the > search (don't bother with the history) and get the 10313 > GI numbers. Then use epost+efetch to grab the records > in batches of say 1000. > That does work (see below), but not all the time. A potential advantage of this way is that each fetch batch is a separate session, so retrying it should be straightforward. Peter #!/usr/bin/python import sys from Bio import Entrez Entrez.email = "mariam.rizkallah at gmail.com" Entrez.email = "p.j.a.cock at googlemail.com" txid = 543769 name = "Rhizaria" search_handle = Entrez.esearch(db="nucest",term="txid%s[Organism:exp]" %(txid), retmax="20000") search_results = Entrez.read(search_handle) search_handle.close() gi_list = search_results["IdList"] count = int(search_results["Count"]) print count assert count == len(gi_list), len(gi_list) out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count) out_handle = open(out_fasta, "a") out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count) out_handle = open(out_fasta, "a") ## Approach1: gets ERROR tags within the fasta file (Unable to obtain query #1) batch_size = 1000 for start in range(0,count,batch_size): end = min(count, start+batch_size) batch = gi_list[start:end] print "Going to download record %i to %i using epost+efetch" % (start+1, end) post_results = Entrez.read(Entrez.epost("nucest", id=",".join(batch))) webenv = post_results["WebEnv"] query_key = post_results["QueryKey"] fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", retmode="text", webenv=webenv, query_key=query_key) data = fetch_handle.read() assert data.startswith(">"), data fetch_handle.close() out_handle.write(data) print "Done" out_handle.close() From David.Lapointe at umassmed.edu Sat Apr 14 20:10:12 2012 From: David.Lapointe at umassmed.edu (Lapointe, David) Date: Sat, 14 Apr 2012 20:10:12 +0000 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: , Message-ID: <86BFEB1DFA6CB3448DB8AB1FC52F40590706B2@ummscsmbx06.ad.umassmed.edu> Just curious. Is there a delay in the code? e.g. 3 or 4 secs between requests. ________________________________________ From: biopython-bounces at lists.open-bio.org [biopython-bounces at lists.open-bio.org] on behalf of Peter Cock [p.j.a.cock at googlemail.com] Sent: Saturday, April 14, 2012 3:32 PM To: Mariam Reyad Rizkallah Cc: biopython at lists.open-bio.org Subject: Re: [Biopython] History, Efetch, and returned records limits On Sat, Apr 14, 2012 at 6:39 PM, Peter Cock wrote: > > A somewhat brute force approach would be to do the > search (don't bother with the history) and get the 10313 > GI numbers. Then use epost+efetch to grab the records > in batches of say 1000. > That does work (see below), but not all the time. A potential advantage of this way is that each fetch batch is a separate session, so retrying it should be straightforward. 
Peter #!/usr/bin/python import sys from Bio import Entrez Entrez.email = "mariam.rizkallah at gmail.com" Entrez.email = "p.j.a.cock at googlemail.com" txid = 543769 name = "Rhizaria" search_handle = Entrez.esearch(db="nucest",term="txid%s[Organism:exp]" %(txid), retmax="20000") search_results = Entrez.read(search_handle) search_handle.close() gi_list = search_results["IdList"] count = int(search_results["Count"]) print count assert count == len(gi_list), len(gi_list) out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count) out_handle = open(out_fasta, "a") out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count) out_handle = open(out_fasta, "a") ## Approach1: gets ERROR tags within the fasta file (Unable to obtain query #1) batch_size = 1000 for start in range(0,count,batch_size): end = min(count, start+batch_size) batch = gi_list[start:end] print "Going to download record %i to %i using epost+efetch" % (start+1, end) post_results = Entrez.read(Entrez.epost("nucest", id=",".join(batch))) webenv = post_results["WebEnv"] query_key = post_results["QueryKey"] fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", retmode="text", webenv=webenv, query_key=query_key) data = fetch_handle.read() assert data.startswith(">"), data fetch_handle.close() out_handle.write(data) print "Done" out_handle.close() _______________________________________________ Samtools-help... Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Sat Apr 14 20:24:55 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 14 Apr 2012 21:24:55 +0100 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: <86BFEB1DFA6CB3448DB8AB1FC52F40590706B2@ummscsmbx06.ad.umassmed.edu> References: <86BFEB1DFA6CB3448DB8AB1FC52F40590706B2@ummscsmbx06.ad.umassmed.edu> Message-ID: On Sat, Apr 14, 2012 at 9:10 PM, Lapointe, David wrote: > Just curious. Is there a delay in the code? e.g. 3 or 4 secs between requests. The Bio.Entrez code obeys the current NCBI limit of at most 3 queries per second by limiting the query gap to at least 0.333333334s. Back in 2009 this was relaxed from the NCBI's original limit of 3s between queries. Peter From flitrfli at gmail.com Sun Apr 15 07:55:14 2012 From: flitrfli at gmail.com (Laura Scearce) Date: Sun, 15 Apr 2012 02:55:14 -0500 Subject: [Biopython] Blast Two sequences from a python script Message-ID: I have a list of pairs of proteins and I want to compare speed and accuracy of "BLAST Two Sequences" to a Smith-Waterman program for alignment. I know there is a "Blast Two Sequences" option on the NCBI website, but I would like to run it from a python script. Perhaps Biopython has this capability? If I cannot use Blast Two Sequences, I will compare different versions of Smith-Waterman, but this would not be nearly as exciting :) OR, if anyone has another idea for a great senior year project in Bioinformatics involving comparing pairs of proteins, please don't hesitate to let me know. Thank you in advance. 
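Not from the thread itself, but one way to sketch Laura's pairwise comparison with a local BLAST+ installation: Biopython's command line wrappers expose blastp's -subject option, which is the command-line counterpart of "BLAST Two Sequences". An untested sketch, with hypothetical file names:

from Bio.Blast.Applications import NcbiblastpCommandline
from Bio.Blast import NCBIXML

# align one protein FASTA file against another, writing XML output
cline = NcbiblastpCommandline(query="proteinA.fasta",
                              subject="proteinB.fasta",
                              outfmt=5, out="pair.xml")
stdout, stderr = cline()

record = NCBIXML.read(open("pair.xml"))
for alignment in record.alignments:
    for hsp in alignment.hsps:
        print "score %s, e-value %s" % (hsp.score, hsp.expect)

For the Smith-Waterman side of the comparison, Bio.pairwise2 (e.g. pairwise2.align.localds with a BLOSUM substitution matrix) could serve as a pure-Python baseline, though it is slow on long sequences.
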
From eric.talevich at gmail.com Sun Apr 15 14:41:08 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Sun, 15 Apr 2012 10:41:08 -0400 Subject: [Biopython] (no subject) In-Reply-To: <4F89626D.9060807@googlemail.com> References: <4F89626D.9060807@googlemail.com> Message-ID: On Sat, Apr 14, 2012 at 7:41 AM, Matthias Schade wrote: > Hello everyone, > > I would like to run a blastn-query of a small nucleotide-sequence against a > genome. The code works already, but my queries are still slow > and mostly ineffective, so I would like to ask: > > Is there a way to tell the blastn-algorithm that once a 'perfect match' has > been found it can stop and send back the results? > > Background: I am interested in only the first full match because I would > like to design a nucleotide-probe which -if possible- has no(!) known match > in a host-genome, neither in RNA nor DNA. Actually, I would reject all > perfect-matches and all single-mismatches but allow every sequence with two > or more mismatches. > > Currently, I use this line of code with seq_now being about 15-30 nt long: > result_handle = NCBIWWW.qblast("blastn", "nt", seq_now, entrez_query="Canis > familiaris[orgn]") > > > I am still new to this. Thank you for your help and input, > > Matt > Hi Matt, Since you're already setting the target database as one genome, this should already be reasonably fast, right? You can play with the BLAST sensitivity cutoffs and reporting thresholds, but I don't think it's possible to do exactly this, except by using an algorithm other than BLAST. If speed is crucial, you might be interested in USEARCH, which does have the feature you're looking for, but isn't wrapped in Biopython yet: http://www.drive5.com/usearch/ Cheers, Eric From golubchi at stats.ox.ac.uk Mon Apr 16 11:05:38 2012 From: golubchi at stats.ox.ac.uk (Tanya Golubchik) Date: Mon, 16 Apr 2012 12:05:38 +0100 Subject: [Biopython] (no subject) In-Reply-To: <4F89626D.9060807@googlemail.com> References: <4F89626D.9060807@googlemail.com> Message-ID: <4F8BFD02.5050209@stats.ox.ac.uk> Wouldn't it be faster to pre-check for a perfect match using a python string function? if primer_seq in genome_seq: return MatchFound else: # fall through to the full blast search Cheers, Tanya On 14/04/12 12:41, Matthias Schade wrote: > Hello everyone, > > > I would like to run a blastn-query of a small nucleotide-sequence > against a genome. The code works already, but my queries are still slow > and mostly ineffective, so I would like to ask: > > Is there a way to tell the blastn-algorithm that once a 'perfect match' > has been found it can stop and send back the results? > > Background: I am interested in only the first full match because I would > like to design a nucleotide-probe which -if possible- has no(!) known > match in a host-genome, neither in RNA nor DNA. Actually, I would reject > all perfect-matches and all single-mismatches but allow every sequence > with two or more mismatches. > > Currently, I use this line of code with seq_now being about 15-30 nt long: > result_handle = NCBIWWW.qblast("blastn", "nt", seq_now, > entrez_query="Canis familiaris[orgn]") > > > I am still new to this. Thank you for your help and input, > > Matt > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython 
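Tanya's pre-check is easy to make concrete. Checking the reverse complement as well is an assumption on my part (Matthias presumably wants to avoid matches on either strand), and the names below are illustrative:

from Bio.Seq import Seq

def has_perfect_match(probe, genome_seq):
    # reject a candidate probe if it, or its reverse complement,
    # occurs verbatim anywhere in the genome string
    probe_rc = str(Seq(probe).reverse_complement())
    return probe in genome_seq or probe_rc in genome_seq

genome_seq = open("canis_genome.txt").read()  # hypothetical plain-sequence file
if not has_perfect_match("GATGACGGTGT", genome_seq):
    pass  # only now spend time on the slower BLAST single-mismatch screen

Python's substring search is fast enough that this cheap filter can discard exact-match probes before any BLAST query is sent at all.
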
From mrrizkalla at gmail.com Mon Apr 16 14:04:08 2012 From: mrrizkalla at gmail.com (Mariam Reyad Rizkallah) Date: Mon, 16 Apr 2012 16:04:08 +0200 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: Message-ID: OH WOW! It works like charm! Peter, thank you very much for insight and for taking the time to fix my script. I do appreciate. Thank you. Mariam Blog post here: http://opensourcepharmacist.wordpress.com/2012/04/16/the-community-biopython-overcomes-limitations/ On Sat, Apr 14, 2012 at 9:32 PM, Peter Cock wrote: > On Sat, Apr 14, 2012 at 6:39 PM, Peter Cock > wrote: > > > > A somewhat brute force approach would be to do the > > search (don't bother with the history) and get the 10313 > > GI numbers. Then use epost+efetch to grab the records > > in batches of say 1000. > > > > That does work (see below), but not all the time. A potential > advantage of this way is that each fetch batch is a separate > session, so retrying it should be straightforward. > > Peter > > #!/usr/bin/python > import sys > from Bio import Entrez > Entrez.email = "mariam.rizkallah at gmail.com" > Entrez.email = "p.j.a.cock at googlemail.com" > txid = 543769 > name = "Rhizaria" > > search_handle = Entrez.esearch(db="nucest",term="txid%s[Organism:exp]" > %(txid), retmax="20000") > search_results = Entrez.read(search_handle) > search_handle.close() > gi_list = search_results["IdList"] > count = int(search_results["Count"]) > print count > assert count == len(gi_list), len(gi_list) > > out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count) > out_handle = open(out_fasta, "a") > > out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count) > out_handle = open(out_fasta, "a") > > ## Approach1: gets ERROR tags within the fasta file (Unable to > obtain query #1) > batch_size = 1000 > for start in range(0,count,batch_size): > end = min(count, start+batch_size) > batch = gi_list[start:end] > print "Going to download record %i to %i using epost+efetch" % > (start+1, end) > post_results = Entrez.read(Entrez.epost("nucest", id=",".join(batch))) > webenv = post_results["WebEnv"] > query_key = post_results["QueryKey"] > fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", > retmode="text", webenv=webenv, query_key=query_key) > data = fetch_handle.read() > assert data.startswith(">"), data > fetch_handle.close() > out_handle.write(data) > print "Done" > out_handle.close() > From p.j.a.cock at googlemail.com Mon Apr 16 14:09:15 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 16 Apr 2012 15:09:15 +0100 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: Message-ID: On Mon, Apr 16, 2012 at 3:04 PM, Mariam Reyad Rizkallah wrote: > OH WOW! > > It works like charm! Peter, thank you very much for insight and for taking > the time to fix my script. > > I do appreciate. Thank you. > > Mariam > Blog post > here: http://opensourcepharmacist.wordpress.com/2012/04/16/the-community-biopython-overcomes-limitations/ Did you contact the NCBI to see where that 10,000 limit was coming from? Peter From cjfields at illinois.edu Mon Apr 16 14:15:50 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 16 Apr 2012 14:15:50 +0000 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: Message-ID: <51BB20B7-BB11-4AD2-BDC6-2AE99682B56A@illinois.edu> On Apr 16, 2012, at 9:09 AM, Peter Cock wrote: > On Mon, Apr 16, 2012 at 3:04 PM, Mariam Reyad Rizkallah > wrote: >> OH WOW! >> >> It works like charm! Peter, thank you very much for insight and for taking >> the time to fix my script. >> >> I do appreciate. Thank you. >> >> Mariam >> Blog post >> here: http://opensourcepharmacist.wordpress.com/2012/04/16/the-community-biopython-overcomes-limitations/ > > Did you contact the NCBI to see where that 10,000 limit was coming from? > > Peter +1, I'm curious about that as well. OTOH, I've never tried it. 
chris From cjfields at illinois.edu Mon Apr 16 14:12:56 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 16 Apr 2012 14:12:56 +0000 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: Message-ID: Yeah, if you run a retrieval in batches you need a step to rerun the request in case it fails, particularly if the request is occurring at a busy time. We do the same with bioperl's interface, very similar to what Peter suggests. chris On Apr 16, 2012, at 9:04 AM, Mariam Reyad Rizkallah wrote: >> OH WOW! >> >> It works like charm! Peter, thank you very much for insight and for taking >> the time to fix my script. >> >> I do appreciate. Thank you. >> >> Mariam >> Blog post here: >> http://opensourcepharmacist.wordpress.com/2012/04/16/the-community-biopython-overcomes-limitations/ >> >> >> On Sat, Apr 14, 2012 at 9:32 PM, Peter Cock wrote: >> >>> On Sat, Apr 14, 2012 at 6:39 PM, Peter Cock >>> wrote: >>>> >>>> A somewhat brute force approach would be to do the >>>> search (don't bother with the history) and get the 10313 >>>> GI numbers. Then use epost+efetch to grab the records >>>> in batches of say 1000. >>>> >>> >>> That does work (see below), but not all the time. A potential >>> advantage of this way is that each fetch batch is a separate >>> session, so retrying it should be straightforward. >>> >>> Peter >>> >>> #!/usr/bin/python >>> import sys >>> from Bio import Entrez >>> Entrez.email = "mariam.rizkallah at gmail.com" >>> Entrez.email = "p.j.a.cock at googlemail.com" >>> txid = 543769 >>> name = "Rhizaria" >>> >>> search_handle = Entrez.esearch(db="nucest",term="txid%s[Organism:exp]" >>> %(txid), retmax="20000") >>> search_results = Entrez.read(search_handle) >>> search_handle.close() >>> gi_list = search_results["IdList"] >>> count = int(search_results["Count"]) >>> print count >>> assert count == len(gi_list), len(gi_list) >>> >>> out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count) >>> out_handle = open(out_fasta, "a") >>> >>> out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count) >>> out_handle = open(out_fasta, "a") >>> >>> ## Approach1: gets ERROR tags within the fasta file (Unable to >>> obtain query #1) >>> batch_size = 1000 >>> for start in range(0,count,batch_size): >>> end = min(count, start+batch_size) >>> batch = gi_list[start:end] >>> print "Going to download record %i to %i using epost+efetch" % >>> (start+1, end) >>> post_results = Entrez.read(Entrez.epost("nucest", id=",".join(batch))) >>> webenv = post_results["WebEnv"] >>> query_key = post_results["QueryKey"] >>> fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", >>> retmode="text", webenv=webenv, query_key=query_key) >>> data = fetch_handle.read() >>> assert data.startswith(">"), data >>> fetch_handle.close() >>> out_handle.write(data) >>> print "Done" >>> out_handle.close() >>> >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython From mrrizkalla at gmail.com Mon Apr 16 14:49:59 2012 From: mrrizkalla at gmail.com (Mariam Reyad Rizkallah) Date: Mon, 16 Apr 2012 16:49:59 +0200 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: <51BB20B7-BB11-4AD2-BDC6-2AE99682B56A@illinois.edu> References: <51BB20B7-BB11-4AD2-BDC6-2AE99682B56A@illinois.edu> Message-ID: I never considered asking NCBI. "Hey there! I need to get 10 million records from different taxa from you, is it really limited to 10,000!? How about a workaround!?" I will ask them though! 
On Mon, Apr 16, 2012 at 4:15 PM, Fields, Christopher J < cjfields at illinois.edu> wrote: > On Apr 16, 2012, at 9:09 AM, Peter Cock wrote: > > > On Mon, Apr 16, 2012 at 3:04 PM, Mariam Reyad Rizkallah > > wrote: > >> OH WOW! > >> > >> It works like charm! Peter, thank you very much for insight and for > taking > >> the time to fix my script. > >> > >> I do appreciate. Thank you. > >> > >> Mariam > >> Blog post > >> here: > http://opensourcepharmacist.wordpress.com/2012/04/16/the-community-biopython-overcomes-limitations/ > > > > Did you contact the NCBI to see where that 10,000 limit was coming from? > > > > Peter > > +1, I'm curious about that as well. OTOH, I've never tried it. > > chris > > From cjfields at illinois.edu Mon Apr 16 15:51:19 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 16 Apr 2012 15:51:19 +0000 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: Message-ID: Peter, Mariam, Turns out they do document this: http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.retmax_3 I can also confirm this, just ran a quick test locally with a simple script to retrieve a set of protein samples. The esearch count was 27382, but the retrieved set maxed out at 10K exactly. [cjfields at pyrimidine-laptop eutils]$ perl limit_test.pl 27382 [cjfields at pyrimidine-laptop eutils]$ grep -c '^>' seqs.aa 10000 Not sure if there are similar constraints using NCBI's SOAP interface, but I wouldn't be surprised. chris On Apr 16, 2012, at 9:12 AM, Fields, Christopher J wrote: > Yeah, if you run a retrieval in batches you need a step to rerun the request in case it fails, particularly if the request is occurring at a busy time. We do the same with bioperl's interface, very similar to what Peter suggests. > > chris > > On Apr 16, 2012, at 9:04 AM, Mariam Reyad Rizkallah wrote: > >> OH WOW! >> >> It works like charm! Peter, thank you very much for insight and for taking >> the time to fix my script. >> >> I do appreciate. Thank you. >> >> Mariam >> Blog post here: >> http://opensourcepharmacist.wordpress.com/2012/04/16/the-community-biopython-overcomes-limitations/ >> >> >> On Sat, Apr 14, 2012 at 9:32 PM, Peter Cock wrote: >> >>> On Sat, Apr 14, 2012 at 6:39 PM, Peter Cock >>> wrote: >>>> >>>> A somewhat brute force approach would be to do the >>>> search (don't bother with the history) and get the 10313 >>>> GI numbers. Then use epost+efetch to grab the records >>>> in batches of say 1000. >>>> >>> >>> That does work (see below), but not all the time. A potential >>> advantage of this way is that each fetch batch is a separate >>> session, so retrying it should be straightforward. 
>>> >>> Peter >>> >>> #!/usr/bin/python >>> import sys >>> from Bio import Entrez >>> Entrez.email = "mariam.rizkallah at gmail.com" >>> Entrez.email = "p.j.a.cock at googlemail.com" >>> txid = 543769 >>> name = "Rhizaria" >>> >>> search_handle = Entrez.esearch(db="nucest",term="txid%s[Organism:exp]" >>> %(txid), retmax="20000") >>> search_results = Entrez.read(search_handle) >>> search_handle.close() >>> gi_list = search_results["IdList"] >>> count = int(search_results["Count"]) >>> print count >>> assert count == len(gi_list), len(gi_list) >>> >>> out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count) >>> out_handle = open(out_fasta, "a") >>> >>> out_fasta = "%s_txid%i_ct%i.fasta" %(name, txid, count) >>> out_handle = open(out_fasta, "a") >>> >>> ## Approach1: gets tags within the fasta file Unable to >>> obtain query #1 >>> batch_size = 1000 >>> for start in range(0,count,batch_size): >>> end = min(count, start+batch_size) >>> batch = gi_list[start:end] >>> print "Going to download record %i to %i using epost+efetch" % >>> (start+1, end) >>> post_results = Entrez.read(Entrez.epost("nucest", id=",".join(batch))) >>> webenv = post_results["WebEnv"] >>> query_key = post_results["QueryKey"] >>> fetch_handle = Entrez.efetch(db="nucest", rettype="fasta", >>> retmode="text", webenv=webenv, query_key=query_key) >>> data = fetch_handle.read() >>> assert data.startswith(">"), data >>> fetch_handle.close() >>> out_handle.write(data) >>> print "Done" >>> out_handle.close() >>> >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Mon Apr 16 16:15:24 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 16 Apr 2012 17:15:24 +0100 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: Message-ID: On Mon, Apr 16, 2012 at 4:51 PM, Fields, Christopher J wrote: > Peter, Mariam, > > Turns out they do document this: > > ? http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.retmax_3 > > I can also confirm this, just ran a quick test locally with a simple script > to retrieve a set of protein samples. ?The esearch count was 27382, > but the retrieved set maxed out at 10K exactly. Thanks Chris, well spotted! It would have been nice to have it on the main page too: http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetch_help.html Peter From mrrizkalla at gmail.com Mon Apr 16 16:19:31 2012 From: mrrizkalla at gmail.com (Mariam Reyad Rizkallah) Date: Mon, 16 Apr 2012 18:19:31 +0200 Subject: [Biopython] History, Efetch, and returned records limits In-Reply-To: References: Message-ID: Oh! Thank you, Chris. So, there IS a limit! I emailed them asking whether there is a limit for records retrieval. They replied that The appropriate way is to do batch retrieval, with no emphasis on limits. Thank you. Mariam On Apr 16, 2012 5:51 PM, "Fields, Christopher J" wrote: > Peter, Mariam, > > Turns out they do document this: > > http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.retmax_3 > > I can also confirm this, just ran a quick test locally with a simple > script to retrieve a set of protein samples. The esearch count was 27382, > but the retrieved set maxed out at 10K exactly. 
> [The rest of the quoted message, shown in full earlier in this thread, has been snipped.]
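[Editorial aside: the pattern this thread converges on - esearch for the full ID list, then epost+efetch in batches, retrying any batch that fails - can be wrapped in a small helper. This sketch is the editor's, not code from the thread; the function name, batch size, retry count and back-off are all illustrative.]

import time
from Bio import Entrez

Entrez.email = "you@example.org"  # placeholder; NCBI asks for a real contact address

def fetch_in_batches(db, id_list, rettype="fasta", batch_size=1000, retries=3):
    """Yield raw efetch payloads for id_list in batches, retrying failed batches."""
    for start in range(0, len(id_list), batch_size):
        batch = id_list[start:start + batch_size]
        for attempt in range(1, retries + 1):
            try:
                # Each batch gets its own epost session, so a retry is a clean restart
                posted = Entrez.read(Entrez.epost(db, id=",".join(batch)))
                handle = Entrez.efetch(db=db, rettype=rettype, retmode="text",
                                       webenv=posted["WebEnv"],
                                       query_key=posted["QueryKey"])
                data = handle.read()
                handle.close()
                if not data.startswith(">"):
                    raise ValueError("Unexpected reply: %r" % data[:40])
                yield data
                break
            except Exception:
                if attempt == retries:
                    raise
                time.sleep(5 * attempt)  # simple back-off before retrying

# Usage: for chunk in fetch_in_batches("nucest", gi_list): out_handle.write(chunk)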
From cjfields at illinois.edu Mon Apr 16 16:26:16 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Mon, 16 Apr 2012 16:26:16 +0000
Subject: [Biopython] History, Efetch, and returned records limits
In-Reply-To: References: Message-ID: <8F3A4E70-C1E5-4661-86D2-09811A82A086@illinois.edu>

On Apr 16, 2012, at 11:15 AM, Peter Cock wrote:
> On Mon, Apr 16, 2012 at 4:51 PM, Fields, Christopher J wrote:
>> [snipped]
>
> Thanks Chris, well spotted!
>
> It would have been nice to have it on the main page too:
> http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetch_help.html
>
> Peter

That URL is a redirect to the new documentation for me; the link I sent is just a few sections down, under optional parameters.

chris

From p.j.a.cock at googlemail.com Mon Apr 16 16:53:53 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 16 Apr 2012 17:53:53 +0100
Subject: [Biopython] History, Efetch, and returned records limits
In-Reply-To: <8F3A4E70-C1E5-4661-86D2-09811A82A086@illinois.edu> References: <8F3A4E70-C1E5-4661-86D2-09811A82A086@illinois.edu> Message-ID:

On Mon, Apr 16, 2012 at 5:26 PM, Fields, Christopher J wrote:
> On Apr 16, 2012, at 11:15 AM, Peter Cock wrote:
>> Thanks Chris, well spotted!
>>
>> It would have been nice to have it on the main page too:
>> http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetch_help.html
>>
>> Peter
>
> That URL is a redirect to the new documentation for me; the link I sent is just a few sections down, under optional parameters.
>
> chris

Same here - after a hard refresh. I wonder when that changed?

Peter

From jp.verta at gmail.com Mon Apr 16 19:23:25 2012
From: jp.verta at gmail.com (Jukka-Pekka Verta)
Date: Mon, 16 Apr 2012 15:23:25 -0400
Subject: [Biopython] History, Efetch, and returned records limits
In-Reply-To: References: Message-ID:

Hello fellow BioPythoneers,

I stumbled upon the same problem as Mariam (without reading your previous correspondence) while I was trying to fetch all Picea sitchensis nucleotide records. Following Peter's code (epost+efetch), I still had the problem of fetch breakup (after 7000 sequences). The problem was fixed following Peter's idea of simply retrying the failed search using try/except.

A collective thank you!
JP

def fetchFasta(species, out_file):
    # script by Peter Cock with enhancement
    from Bio import Entrez
    Entrez.email = "jp.verta at gmail.com"
    search_handle = Entrez.esearch(db="nuccore", term=species+"[orgn]", retmax="20000")
    search_results = Entrez.read(search_handle)
    search_handle.close()
    gi_list = search_results["IdList"]
    count = int(search_results["Count"])
    print count
    assert count == len(gi_list), len(gi_list)
    out_handle = open(out_file, "a")
    batch_size = 1000
    for start in range(0, count, batch_size):
        end = min(count, start+batch_size)
        batch = gi_list[start:end]
        print "Going to download record %i to %i using epost+efetch" % (start+1, end)
        post_results = Entrez.read(Entrez.epost("nuccore", id=",".join(batch)))
        webenv = post_results["WebEnv"]
        query_key = post_results["QueryKey"]
        fetch_handle = Entrez.efetch(db="nuccore", rettype="fasta", retmode="text",
                                     webenv=webenv, query_key=query_key)
        data = fetch_handle.read()
        try:
            assert data.startswith(">"), data
            fetch_handle.close()
            out_handle.write(data)
        except AssertionError:
            # retry the failed batch once
            fetch_handle = Entrez.efetch(db="nuccore", rettype="fasta", retmode="text",
                                         webenv=webenv, query_key=query_key)
            data = fetch_handle.read()
            assert data.startswith(">"), data
            fetch_handle.close()
            out_handle.write(data)
    print "Done"
    out_handle.close()

$ ./FetchFastaWithSpeciesName.py "Picea sitchensis" sitkaSequences.fa
19997
Going to download record 1 to 1000 using epost+efetch
Going to download record 1001 to 2000 using epost+efetch
Going to download record 2001 to 3000 using epost+efetch
Going to download record 3001 to 4000 using epost+efetch
Going to download record 4001 to 5000 using epost+efetch
Going to download record 5001 to 6000 using epost+efetch
Going to download record 6001 to 7000 using epost+efetch
Going to download record 7001 to 8000 using epost+efetch
Going to download record 8001 to 9000 using epost+efetch
Going to download record 9001 to 10000 using epost+efetch
Going to download record 10001 to 11000 using epost+efetch
Going to download record 11001 to 12000 using epost+efetch
Going to download record 12001 to 13000 using epost+efetch
Going to download record 13001 to 14000 using epost+efetch
Going to download record 14001 to 15000 using epost+efetch
Going to download record 15001 to 16000 using epost+efetch
Going to download record 16001 to 17000 using epost+efetch
Going to download record 17001 to 18000 using epost+efetch
Going to download record 18001 to 19000 using epost+efetch
Going to download record 19001 to 19997 using epost+efetch
Done

On 2012-04-14, at 1:39 PM, Peter Cock wrote:
>
> This is how I believe the NCBI expect this task to be done. In this specific case it seems to be an NCBI failure. Perhaps a loop to retry the efetch two or three times might work? It could be the whole history session breaks at the NCBI end though...
>
> A somewhat brute force approach would be to do the search (don't bother with the history) and get the 10313 GI numbers. Then use epost+efetch to grab the records in batches of say 1000.
>
> Peter

From bjorn_johansson at bio.uminho.pt Tue Apr 17 06:44:51 2012
From: bjorn_johansson at bio.uminho.pt (Björn Johansson)
Date: Tue, 17 Apr 2012 07:44:51 +0100
Subject: [Biopython] interactive shell to use with biopython?
Message-ID:

Hi all,

I would like to know what interactive shell might be a good alternative to use with biopython. Ideally, it should be possible to record interactive commands and save them to run later.

Could you give me some examples of what you are using?

Thanks,
Bjorn

--
______O_________oO________oO______o_______oO__
Björn Johansson
Assistant Professor
Department of Biology
University of Minho
Campus de Gualtar
4710-057 Braga
PORTUGAL
www.bio.uminho.pt
Google profile metabolicengineeringgroup
Work (direct) +351-253 601517 | mob. +351-967 147 704
Dept of Biology (secr) +351-253 60 4310 | fax +351-253 678980

From ajperry at pansapiens.com Tue Apr 17 07:42:36 2012
From: ajperry at pansapiens.com (Andrew Perry)
Date: Tue, 17 Apr 2012 17:42:36 +1000
Subject: [Biopython] interactive shell to use with biopython?
In-Reply-To: References: Message-ID:

On Tue, Apr 17, 2012 at 4:44 PM, Björn Johansson <bjorn_johansson at bio.uminho.pt> wrote:

> Hi all,
>
> I would like to know what interactive shell might be a good alternative to use with biopython. Ideally, it should be possible to record interactive commands and save them to run later.
>
> Could you give me some examples of what you are using?
>
> Thanks,
> Bjorn
>

You might want to check out IPython in notebook mode. I've only played with it briefly, but it looks promising for interactive analysis, and cases where you'd like to present the transcript to others.

'Regular' commandline IPython will also allow you to save the history with the %save command.

See: http://ipython.org/ipython-doc/stable/interactive/htmlnotebook.html

To get an idea of how it works, see Titus Brown's demonstration:
http://www.youtube.com/watch?feature=player_detailpage&v=HaS4NXxL5Qc#t=132s

Andrew Perry

Postdoctoral Fellow
Whisstock Lab
Department of Biochemistry and Molecular Biology
Monash University, Clayton Campus, PO Box 13d, VIC, 3800, Australia.
Mobile: +61 409 808 529

From eric.talevich at gmail.com Wed Apr 18 01:37:50 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 17 Apr 2012 21:37:50 -0400
Subject: [Biopython] Bio.Phylo: writing newick trees with internal node names
In-Reply-To: References: <4F632E69.8010906@stats.ox.ac.uk> <4F69A98B.3040504@stats.ox.ac.uk> Message-ID:

On Thu, Mar 22, 2012 at 7:29 PM, Eric Talevich wrote:
> On Wed, Mar 21, 2012 at 6:12 AM, Tanya Golubchik wrote:
>> Also, the 'is_aligned' sequence property disappears when a tree is saved in phyloxml format and then read back using Phylo.read:
>>
>>>>> print tree
>> Phylogeny(rooted=True, branch_length_unit='SNV')
>>     Clade(branch_length=0.0, name='N1')
>>         Clade(branch_length=0.0, name='C00000761')
>>             BranchColor(blue=0, green=128, red=0)
>>             Sequence(type='dna')
>>                 MolSeq(value='CCTTTCTATGTTCTGGACTGACGTTAAACGA', is_aligned=True)
>>         Clade(branch_length=0.0, name='C00000763')
>>             BranchColor(blue=0, green=0, red=255)
>>             Sequence(type='dna')
>>                 MolSeq(value='CCTTTcTATGTtCTGGACTGACGTTAAACGA', is_aligned=True)
>>
>>>>> Phylo.write(tree, myfile, 'phyloxml')
>> 1
>>>>> tree2 = Phylo.read(myfile, 'phyloxml')
>>>>> print tree2
>> Phylogeny(rooted=True, branch_length_unit='SNV')
>>     Clade(branch_length=0.0, name='N1')
>>         Clade(branch_length=0.0, name='C00000761')
>>             BranchColor(blue=0, green=128, red=0)
>>             Sequence(type='dna')
>>                 MolSeq(value='CCTTTCTATGTTCTGGACTGACGTTAAACGA')
>>         Clade(branch_length=0.0, name='C00000763')
>>             BranchColor(blue=0, green=0, red=255)
>>             Sequence(type='dna')
>>                 MolSeq(value='CCTTTcTATGTtCTGGACTGACGTTAAACGA')
>>
>
> This looks like a bug, too. (Thanks for finding these!) I don't immediately see the cause of the problem, I'll try to take a crack at it soon.

I finally had a chance to look at this again. It's fixed in the trunk, so if you're working off the development build of Biopython from GitHub, the is_aligned property should be written properly now.

From marc.saric at gmx.de Wed Apr 18 20:58:18 2012
From: marc.saric at gmx.de (Marc Saric)
Date: Wed, 18 Apr 2012 22:58:18 +0200
Subject: [Biopython] Is this a valid Genbank feature description or a Biopython bug?
In-Reply-To: References: Message-ID: <4F8F2AEA.8060700@gmx.de>

Hi all,

sorry for crossposting (this has also been published on stackoverflow):

I stumbled upon a Genbank-formatted file (shown here as a minimal dummy example), which contains a nested feature like this:

FEATURES             Location/Qualifiers
     xxxx_domain     complement(complement(1..145))

Such a feature crashes the current Biopython Genbank parser (1.59 release), but it apparently did not in former releases (e.g. 1.55). Apparently this behaviour was already present in 1.57.

From the Biopython bugtracker, it seems that the old locationparser code got removed in 1.56:

From what I could deduce from the format description on ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt and http://www.insdc.org/documents/feature_table.html#3.4.2 this is most likely invalid. Can someone comment on this? I.e., is this a glitch in Biopython or in the format of the Genbank file?

A full demo file:

LOCUS       XXXXXXXXXXXXXX           240 bp    DNA     circular      17-JAN-2012
DEFINITION  xxxxxx.
KEYWORDS    xx.
SOURCE
  ORGANISM
FEATURES             Location/Qualifiers
     xxxx_domain     complement(complement(1..145))
                     /vntifkey="1"
                     /label=A label
                     /note="A note"
BASE COUNT     75 a   57 c   42 g   66 t
ORIGIN
        1 tttacaaaac gcattttcaa accttgggta ctaccccctt ttaaatatcc gaatacacta
       61 ataaacgctc tttcctttta ggtaaacccg ccaatatata ctgatacaca ctgatagttt
      121 aaactagatg cagtggccga ccatcagatc tagtaggaaa cagctatgac catgattacg
      181 cattacttat ttaagatcaa ccgtaccagt ataccctgcc agcatgatgg aaacctccct
//

A minimum demo program to show the error (assumes Biopython 1.59 and Python 2.7 are installed and the above mentioned file is available as "test.gb"):

#!/usr/bin/env python
from Bio import SeqIO
s = SeqIO.read(open("test.gb", "r"), "genbank")

This crashes with:

    raise LocationParserError(location_line)
Bio.GenBank.LocationParserError: complement(1..145)

--
Bye, Marc Saric
http://www.marcsaric.de

From p.j.a.cock at googlemail.com Wed Apr 18 21:31:30 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 18 Apr 2012 22:31:30 +0100
Subject: [Biopython] Is this a valid Genbank feature description or a Biopython bug?
In-Reply-To: <4F8F2AEA.8060700@gmx.de> References: <4F8F2AEA.8060700@gmx.de> Message-ID:

On Wed, Apr 18, 2012 at 9:58 PM, Marc Saric wrote:
>
> Hi all,
>
> sorry for crossposting (this has also been published on stackoverflow):
>
> I stumbled upon a Genbank-formatted file (shown here as a minimal dummy example), which contains a nested feature like this:
>
> FEATURES             Location/Qualifiers
>      xxxx_domain     complement(complement(1..145))
>
I believe that is an invalid location. Was this from an NCBI file, or elsewhere?

Note that for Biopython 1.60 (next release) we plan to treat bad locations as a warning rather than an error that stops parsing.

Peter

From p.j.a.cock at googlemail.com Thu Apr 19 15:46:18 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 19 Apr 2012 16:46:18 +0100
Subject: [Biopython] Blast Two sequences from a python script
In-Reply-To: References: Message-ID:

On Sun, Apr 15, 2012 at 8:55 AM, Laura Scearce wrote:
> I have a list of pairs of proteins and I want to compare speed and accuracy of "BLAST Two Sequences" to a Smith-Waterman program for alignment. I know there is a "Blast Two Sequences" option on the NCBI website, but I would like to run it from a python script. Perhaps Biopython has this capability? If I cannot use Blast Two Sequences, I will compare different versions of Smith-Waterman, but this would not be nearly as exciting :) OR, if anyone has another idea for a great senior year project in Bioinformatics involving comparing pairs of proteins, please don't hesitate to let me know. Thank you in advance.

I would suggest looking at the EMBOSS tool water for Smith-Waterman alignments:
http://emboss.open-bio.org/wiki/Appdoc:Water

See also:
http://emboss.open-bio.org/wiki/Appdoc:Needle
and
http://emboss.open-bio.org/wiki/Appdoc:Matcher

For BLAST, the simplest option might be to generate temporary input FASTA files, then use the BLAST+ command line tools with the -query and -subject options. This way you don't have to make temporary BLAST databases (although it isn't quite as fast).

Peter

From legendre17 at hotmail.com Thu Apr 19 21:27:32 2012
From: legendre17 at hotmail.com (Tiberiu Tesileanu)
Date: Thu, 19 Apr 2012 21:27:32 +0000
Subject: [Biopython] Bio.pairwise2 alignment slow
Message-ID:

Hi,

I've noticed that Bio.pairwise2 alignments tend to be very slow; e.g., Bio.pairwise2.align.globalds is about 100 times slower than Matlab's swalign... (this is on a Macbook Air running Mac OS X Lion). Is this expected, or am I doing something wrong? Is there a way to make sure that the C version of the code is used? Is there an alternative that is similarly easy to use, but faster?

Thanks!
Tibi
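[Editorial aside: no answer to Tibi's question appears in this stretch of the archive. Two things are worth checking, sketched below. Part of pairwise2's work is done in C (the cpairwise2 extension) when that extension built correctly, but by default the module still constructs and returns every optimal alignment, which tends to dominate the run time. The sequences and gap penalties here are made up, and the actual speed-up will vary.]

from Bio import pairwise2
from Bio.SubsMat import MatrixInfo

blosum62 = MatrixInfo.blosum62
seq1 = "KEVLAQGGW"   # toy sequences
seq2 = "EVLAGW"

# Default call: computes and returns *all* optimal alignments (expensive).
alignments = pairwise2.align.globalds(seq1, seq2, blosum62, -10, -0.5)

# Usually much cheaper: ask only for the score, or for a single alignment.
score = pairwise2.align.globalds(seq1, seq2, blosum62, -10, -0.5,
                                 score_only=True)
best = pairwise2.align.globalds(seq1, seq2, blosum62, -10, -0.5,
                                one_alignment_only=True)

print score
print best[0]

For raw Smith-Waterman speed, the EMBOSS water route that Peter suggests earlier in this digest is also a reasonable alternative.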
From bjorn_johansson at bio.uminho.pt Sun Apr 22 08:05:35 2012
From: bjorn_johansson at bio.uminho.pt (Björn Johansson)
Date: Sun, 22 Apr 2012 09:05:35 +0100
Subject: [Biopython] interactive shell to use with biopython?
In-Reply-To: References: Message-ID:

Thank you for the tip, IPython seems very useful as it works the same way as the normal python interpreter. The notebook looks impressive, I will give it a try when IPython 0.13 comes out.

/bjorn

On Tue, Apr 17, 2012 at 08:42, Andrew Perry wrote:
> [quoted message snipped]

--
______O_________oO________oO______o_______oO__
Björn Johansson
Assistant Professor
Department of Biology
University of Minho
Campus de Gualtar
4710-057 Braga
PORTUGAL
www.bio.uminho.pt
Google profile metabolicengineeringgroup
Work (direct) +351-253 601517 | mob. +351-967 147 704
Dept of Biology (secr) +351-253 60 4310 | fax +351-253 678980

From bjorn_johansson at bio.uminho.pt Sun Apr 22 08:23:28 2012
From: bjorn_johansson at bio.uminho.pt (Björn Johansson)
Date: Sun, 22 Apr 2012 09:23:28 +0100
Subject: [Biopython] double stranded sequence object / cloning simulation
Message-ID:

Hi,

I am looking for a way to simulate cloning using python. I was thinking of a script where you could combine two sequence objects into a recombinant molecule. Does anybody know if this has been done?

I think a good way to do this is to specify a double-stranded sequence object where the topology and the properties of the ends of the DNA molecule are preserved as properties of the object itself. Have there been any attempts at this in Biopython or elsewhere? I wouldn't want to reinvent the wheel here. PyPI and Google do not seem to give me anything on this.

I was thinking something along these lines:

>>> stuffer1, dsSeqobj1, stuffer2 = dsSeqobj1.digest(BamHI)

which creates linear dsseq objects with staggered ends.

>>> clone_a, clone_b = ligate( dsSeqobj1, dsSeqobj2 )

would create two circular dsseq objects if the ends are compatible. Any ideas along these lines?

cheers,
bjorn

--
______O_________oO________oO______o_______oO__
Björn Johansson
Assistant Professor
Department of Biology
University of Minho
Campus de Gualtar
4710-057 Braga
PORTUGAL
www.bio.uminho.pt
Google profile metabolicengineeringgroup
Work (direct) +351-253 601517 | mob. +351-967 147 704
Dept of Biology (secr) +351-253 60 4310 | fax +351-253 678980
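[Editorial aside: no reply appears in this part of the archive. Purely as an illustration of the data model Björn describes - nothing here is a real Biopython API; the class, functions and sticky-end bookkeeping are the editor's invention - one minimal way to prototype the idea:]

from Bio.Seq import reverse_complement

class DsFragment(object):
    """A linear double-stranded DNA fragment.

    `top` is the top strand, 5'->3'; `left`/`right` are the 5' single-stranded
    overhangs at each end, written 5'->3' ("" means a blunt end). Sequence
    bookkeeping is deliberately oversimplified in this sketch.
    """
    def __init__(self, top, left="", right=""):
        self.top = str(top)
        self.left = left
        self.right = right

def ends_anneal(end_a, end_b):
    """Two 5' overhangs can anneal when one is the reverse complement of the other."""
    return bool(end_a) and end_a == reverse_complement(end_b)

def ligate(frag_a, frag_b):
    """Return the top strand of a circular product if both pairs of ends match, else None."""
    if ends_anneal(frag_a.right, frag_b.left) and ends_anneal(frag_b.right, frag_a.left):
        return frag_a.top + frag_b.top  # a real implementation would also track topology
    return None

# BamHI (G^GATCC) leaves 5' GATC overhangs, which are self-complementary:
insert = DsFragment("GATCAAAATTTT", left="GATC", right="GATC")
vector = DsFragment("GATCCCCCGGGG", left="GATC", right="GATC")
print ligate(insert, vector)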
From rbuels at gmail.com Mon Apr 23 23:49:10 2012
From: rbuels at gmail.com (Robert Buels)
Date: Mon, 23 Apr 2012 19:49:10 -0400
Subject: [Biopython] Announcing OBF Google Summer of Code Accepted Students
Message-ID: <4F95EA76.4030004@gmail.com>

Hello all,

I'm very pleased and excited to announce that the Open Bioinformatics Foundation has selected 5 very capable students to work on OBF projects this summer as part of the Google Summer of Code program.

The accepted students, their projects, and their mentors (in alphabetical order):

Wibowo Arindrarto
SearchIO Implementation in Biopython
mentored by Peter Cock

Lenna Peterson
Diff My DNA: Development of a Genomic Variant Toolkit for Biopython
mentored by Brad Chapman

Marjan Povolni
The world's fastest parallelized GFF3/GTF parser in D, and an interfacing biogem plugin for Ruby
mentored by Pjotr Prins, Francesco Strozzi, Raoul Bonnal

Artem Tarasov
Fast parallelized GFF3/GTF parser in C++, with Ruby FFI bindings
mentored by Pjotr Prins, Francesco Strozzi, Raoul Bonnal

Clayton Wheeler
Multiple Alignment Format parser for BioRuby
mentored by Francesco Strozzi and Raoul Bonnal

As in every year, we received many great applications and ideas. However, funding and mentor resources are limited, and we were not able to accept as many as we would have liked. Our deepest thanks to all the students who applied: we sincerely appreciate the time and effort you put into your applications, and hope you will still consider being a part of the OBF's open source projects, even without Google funding. I speak for myself and all of the mentors who read and scored applications when I say that we were truly honored by the number and quality of the applications we received.

For the accepted students: congratulations! You have risen to the top of a very competitive application process. Now it's time to "put your money where your mouth is", as the saying goes. Let's get out there and write some great code this summer!

Best regards,
Rob

----
Robert Buels
OBF GSoC 2012 Administrator

From erikclarke at gmail.com Mon Apr 23 23:54:20 2012
From: erikclarke at gmail.com (Erik C)
Date: Mon, 23 Apr 2012 16:54:20 -0700
Subject: [Biopython] Bug in Geo.parser when reading some GDS files
Message-ID:

Hi all,

When parsing an NCBI GEO dataset (GDS) file such as this:
ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/GDS_full/GDS1962_full.soft.gz
the Bio.Geo.parse(handle) method fails with an assertion error.

Example code:

>>> from Bio import Geo
>>> for record in Geo.parse(open('GDS1962_full.soft')):
...     print record
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Geo/__init__.py", line 54, in parse
    assert key not in record.col_defs
AssertionError

It appears that this is due to the failed assumption that each column header exists only once, when it seems that a common trend in GDS files is to have two columns each titled GO:Function, GO:Process, and GO:Component. The first of these duplicate columns is the Gene Ontology terms for the probe at that row, and the second column is the GO ids for those terms.

From GDS3646_full.soft:

#GO:Function = Gene Ontology Function term
#GO:Process = Gene Ontology Process term
#GO:Component = Gene Ontology Component term
#GO:Function = Gene Ontology Function identifier
#GO:Process = Gene Ontology Process identifier
#GO:Component = Gene Ontology Component identifier

While the duplicate header names are not ideal for tabular data, these GO columns do seem to appear regularly in GDS files (see GDS1962, GDS3646, and others) and they consistently break the parser. Either this assertion should be disabled for this particular case, or there should be a more flexible column header check. I suggest using the assertion only for the sample columns (those prefixed with GSM).

I'm using BioPython 1.59 (the issue also exists in the Git repository) with Python 2.7.1 on Mac OS 10.7.3.

Cheers,
Erik
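[Editorial aside: a sketch of the guard Erik suggests - keeping the uniqueness assertion only for the GSM sample columns. The surrounding names are paraphrased from his traceback; the exact structure of Bio/Geo/__init__.py may differ, so treat this as the editor's guess at a patch, not merged code.]

# Hypothetical change inside Bio/Geo/__init__.py, in parse(), at the point
# where column definitions are registered; `key` is the column name and
# `description` its text:
if key.startswith("GSM"):
    # Sample columns really must be unique per record...
    assert key not in record.col_defs, key
# ...while GO:Function / GO:Process / GO:Component legitimately appear twice
# (term columns and identifier columns), so they are allowed to collide.
record.col_defs[key] = description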
From p.j.a.cock at googlemail.com Tue Apr 24 11:24:20 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 24 Apr 2012 12:24:20 +0100
Subject: [Biopython] OBF GSoC students weekly progress reports
Message-ID:

Hello all,

First, to echo Rob, congratulations to our selected students:
http://news.open-bio.org/news/2012/04/students-selected-for-gsoc/
http://lists.open-bio.org/pipermail/gsoc/2012/000049.html

Weekly Progress Reports:

To encourage community bonding and awareness of what the GSoC 2012 students are doing, this year the OBF is being much clearer about our progress report expectations. We would like every student to set up a blog for the GSoC project (or a category/tag on your existing blog) which you will use to summarize your progress every week, as well as longer posts at the half way evaluation, and at the end of the summer.

In addition, after publishing each blog post, we expect you to email the URL and the text of the blog (or, if important images or formatting would be lost, at least a short summary) to the host project's mailing list(s) (check with your mentors if the project has more than one) AND the gsoc at open-bio.org mailing list.

You will be writing under your own name, but with a clear association with your mentors, the OBF and its projects, so please take this seriously and be professional. Remember this will become part of your online presence, and potentially be looked at by future employers and colleagues.

Please talk to your mentors about this during the "community bonding" stage of GSoC (i.e. the next few weeks before you actually start coding).

Thank you,

Peter (On behalf of the OBF GSoC mentors and projects)

Note: As per Rob's earlier email, could both students and mentors please ensure you have subscribed to the public OBF GSoC email list at http://lists.open-bio.org/mailman/listinfo/gsoc (I have BCC'd you on this email just in case you haven't done this yet). Thanks!

From p.j.a.cock at googlemail.com Tue Apr 24 12:46:32 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 24 Apr 2012 13:46:32 +0100
Subject: [Biopython] Biopython GSoC 2012
Message-ID:

Dear all,

As you will have read in Rob's email [1], of the five Google Summer of Code (GSoC) students accepted by the OBF this year, two are going to be working on Biopython projects (in alphabetical order):

Wibowo Arindrarto
SearchIO Implementation in Biopython
mentored by Peter Cock

Lenna Peterson
Diff My DNA: Development of a Genomic Variant Toolkit for Biopython
mentored by Brad Chapman with Reece Hart and James Casbon

Congratulations to you both, and the other accepted students. Sadly we had excellent proposals from other students worthy of being chosen, but not enough mentors to go round. If you are still eligible next year, we hope you will apply again. We are also hoping you will continue to stay involved and contribute to the Biopython community.

Thank you all for your hard work, students and mentors. We're looking forward to another productive summer of code!

Peter, on behalf of the mentors and Biopython.

[1] http://news.open-bio.org/news/2012/04/students-selected-for-gsoc/
http://lists.open-bio.org/pipermail/biopython/2012-April/007976.html

From carolinechang810 at gmail.com Tue Apr 24 17:51:40 2012
From: carolinechang810 at gmail.com (Caroline Chang)
Date: Tue, 24 Apr 2012 10:51:40 -0700
Subject: [Biopython] NCBIWWW qblast Times Out?
Message-ID:

Hi,

I'm not sure I'm using the NCBIWWW module correctly. I've followed the example code given in the tutorial (http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc82) for my own single sequence file. However, it seems to hang on a line of code where it sleeps, and then my request times out.

Is anyone else having this error, or am I using this code incorrectly?

Thanks!
Caroline

From p.j.a.cock at googlemail.com Tue Apr 24 17:55:51 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 24 Apr 2012 18:55:51 +0100
Subject: [Biopython] [GSoC] OBF GSoC students weekly progress reports
In-Reply-To: References: Message-ID:

On Tue, Apr 24, 2012 at 6:21 PM, Lenna Peterson wrote:
> Hi all,
>
> I'm very excited to be participating in GSoC '12 with Biopython!
>
> My development blog is on tumblr, which I chose primarily because it supports markdown syntax, which I'm used to from GitHub.
>
> Here's my gsoc12 tag: http://arklenna.tumblr.com/tagged/gsoc2012
>
> However, Tumblr doesn't allow post comments. Will I need to switch to a blog platform that allows comments?
>
> Cheers,
>
> Lenna

Hi Lenna,

Great - you've got a blog already, and you're also the first student to reply :)

Blog comments could be nice, but personally in your shoes I'd direct any discussion to the biopython(-dev) mailing list, e.g.:

1. Post weekly update blog, get blog post URL
2. Send email with summary, including blog post URL
3. Go to mailing list archive, get archived email URL
4. Update blog post to link to email (and thus any thread from it, at least for that month).

A little cumbersome, but it would save you moving your blog?

I'd actually be happier with most discussion on the biopython-dev list rather than blog comments, or even github (which will still be useful for things like code reviews).

This may be different for the other projects - I know BioRuby uses IRC much more for example, but even there they've tried to post archives of important IRC discussions to their mailing list too.

Thank you!

Peter

From w.arindrarto at gmail.com Tue Apr 24 19:01:23 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Tue, 24 Apr 2012 21:01:23 +0200
Subject: [Biopython] [GSoC] OBF GSoC students weekly progress reports
In-Reply-To: References: Message-ID:

On Tue, Apr 24, 2012 at 19:55, Peter Cock wrote:
> [quoted message snipped]

Hi everyone,

Wibowo Arindrarto here, but you can just call me Bow for short :). I'm very excited to be accepted into GSoC with OBF as well!

I will be blogging on my site: http://bow.web.id/blog, and I've actually made my inaugural GSoC post just a few hours after I heard the news, here: http://bow.web.id/blog/2012/04/google-summer-of-code-is-on/. I'll be posting all GSoC-related posts under the `gsoc` tag, accessible through this URL: http://bow.web.id/blog/tag/gsoc/.
To follow Peter's suggestion, I'll post my weekly progress in this mailing list for everyone to see, too.

cheers,
Bow

From rbuels at gmail.com Tue Apr 24 19:13:48 2012
From: rbuels at gmail.com (Robert Buels)
Date: Tue, 24 Apr 2012 15:13:48 -0400
Subject: [Biopython] [GSoC] OBF GSoC students weekly progress reports
In-Reply-To: References: Message-ID: <4F96FB6C.3010805@gmail.com>

Bow, make sure you subscribe to the OBF GSoC mailing list.

http://lists.open-bio.org/mailman/listinfo/gsoc

Rob

On 04/24/2012 03:01 PM, Wibowo Arindrarto wrote:
> [quoted thread snipped]

From p.j.a.cock at googlemail.com Wed Apr 25 09:17:16 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 25 Apr 2012 10:17:16 +0100
Subject: [Biopython] NCBIWWW qblast Times Out?
In-Reply-To: References: Message-ID:

On Tue, Apr 24, 2012 at 6:51 PM, Caroline Chang wrote:
> Hi,
>
> I'm not sure I'm using the NCBIWWW module correctly. I've followed the example code given in the tutorial (http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc82) for my own single sequence file. However, it seems to hang on a line of code where it sleeps, and then my request times out.
>
> Is anyone else having this error, or am I using this code incorrectly?
>
> Thanks!
> Caroline

Hi Caroline,

The most likely cause was the NCBI BLAST service being under heavy load (especially likely during USA working hours). Did this problem persist, and has it ever worked for you?

If it has never worked for you it could be a network problem at your institute (e.g. some proxy settings). Another useful check would be to try running the unit test in the Tests folder of the Biopython source, test_NCBI_qblast.py - and see what that says.

Peter

From w.arindrarto at gmail.com Sat Apr 28 12:08:35 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Sat, 28 Apr 2012 14:08:35 +0200
Subject: [Biopython] Google Summer of Code Project: SearchIO in Biopython
Message-ID:

Hello everyone,

This is Wibowo Arindrarto (or Bow, for short), one of the Google Summer of Code students who will work on Biopython over this summer. I will be working with Peter to add support for parsing search outputs from programs like BLAST and HMMER to Biopython, so that it's easier to extract information from their outputs.

Having used some of these programs quite a lot myself, I'm really looking forward to implementing the feature. However, I do understand that it won't be just me who will use the module, but also many other Biopython users. So for everyone who is interested in giving a say, input, or critiques along the way, feel free to do so :).

The official coding period starts in about a month from now. Until then, I will be doing all the preparatory work required so that coding will proceed as smoothly as possible. This will include preparing the test cases and preparing the SearchIO attribute / object naming convention, as well as discussing anything related to its proposed implementation.

Finally, here are some links related to the project that might interest you.

1. My main biopython branch for development: https://github.com/bow/biopython/tree/searchio. Since I will be building on top of Peter's SearchIO branch (https://github.com/peterjc/biopython/tree/search-io-test), right now it only contains Peter's branch rebased against the latest master.

2. My GSoC proposal, which outlines my plans and timeline for the project: http://bit.ly/searchio-proposal

3. The proposed SearchIO naming convention (not 100% complete as of now, but will be filled in along the way): http://bit.ly/searchio-terms. One of the main goals of the project is to implement a common interface for BLAST et al., which requires SearchIO to have common attribute names that refer to different search output attributes. The link contains my proposed naming convention, which is still very open to change and discussion. Feel free to comment on the document and add your own ideas.

4. My blog, in which I will write weekly posts about the project's progress: http://bow.web.id/blog

5. An extra repo for all other auxiliary files and scripts that don't go into Biopython's code: https://github.com/bow/gsoc

That's it for now. Thanks for taking the time to read it :). I'm looking forward to a productive summer with Biopython.

Have a nice weekend,
Bow
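[Editorial aside: to make the proposal concrete - and purely as an illustration, since no Bio.SearchIO module existed in any Biopython release at the time of this thread - the SeqIO-style interface being discussed might look roughly like the following. The module, format name and attribute names follow the draft naming convention linked above, so treat them all as hypothetical.]

from Bio import SearchIO  # hypothetical at the time of writing

# One parse call, regardless of which search program produced the file:
for qresult in SearchIO.parse("blast_run.xml", "blast-xml"):
    print "Query %s: %i hits" % (qresult.id, len(qresult))
    for hit in qresult:
        for hsp in hit:
            print "  hit %s, e-value %g" % (hit.id, hsp.evalue)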
However, it seems to hang on the a line of code where it > sleeps, and then my request times out. > > Is anyone else having this error, or am I using this code incorrectly? > > Thanks! > Caroline Hi Caroline, The most likely cause was the NCBI BLAST service being under heavy load (especially likely during USA working hours). Did this problem persist, and has it every worked for you? If it has never worked for you it could be network problem at your institute (e.g. some proxy settings). Another useful check would be to try running the unit test in the Tests folder of the Biopython source, test_NCBI_qblast.py - and see what that says. Peter From w.arindrarto at gmail.com Sat Apr 28 12:08:35 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Sat, 28 Apr 2012 14:08:35 +0200 Subject: [Biopython] Google Summer of Code Project: SearchIO in Biopython Message-ID: Hello everyone, This is Wibowo Arindrarto (or Bow, for short), one of the Google Summer of Code students who will work on Biopython over this summer. I will be working with Peter to add support for parsing search outputs from programs like BLAST and HMMER to Biopython, so that it's easier to extract information from their outputs. Having used some of these programs quite a lot myself, I'm really looking forward to implementing the feature. However, I do understand that it won't be just me who will use the module, but also many other Biopython user. So for everyone who is interested in giving a say, input, or critiques along the way, feel free to do so :). The official coding period starts in about a month from now. Until then, I will be doing all the preparatory work required so that coding will proceed as smooth as possible. These will include preparing the test cases and preparing the SearchIO attribute / object naming convention as well as discussing anything related to its proposed implementation. Finally, here are some links related to the project that might interest you. 1. My main biopython branch for development: https://github.com/bow/biopython/tree/searchio. Since I will be building on top of Peter's SearchIO branch ( https://github.com/peterjc/biopython/tree/search-io-test), right now it only contains Peter's branch rebased against the latest master. 2. My GSoC proposal, which outlines my plans and timeline for the project: http://bit.ly/searchio-proposal 3. The proposed SearchIO naming convention (not 100% complete as of now, but will be filled along the way): http://bit.ly/searchio-terms. One of the main goals of the project is to implement a common interface for BLAST et al, which requires SearchIO to have common attribute names that refers to different search output attributes. The link contains my proposed naming convention, which is still very open to change and discussion. Feel free to comment on the document and add your own ideas. 4. My blog, in which I will write weekly posts about the project's progress: http://bow.web.id/blog 5. An extra repo for all other auxiliary files and scripts that doesn't go into Biopython's code: https://github.com/bow/gsoc. That's it for now. Thanks for taking time to read it :). I'm looking forward to a productive summer with Biopython. Have a nice weekend, Bow