From p.j.a.cock at googlemail.com Fri Mar 1 04:23:23 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 1 Mar 2013 09:23:23 +0000
Subject: [Biopython] Filter Blast results
In-Reply-To:
References:
Message-ID:

On Thu, Feb 28, 2013 at 9:23 PM, Justin Gibbons wrote:
> The example is here:
> http://biopython.org/wiki/Retrieve_nonmatching_blast_queries
>
> I think the example would be improved if it pointed out you could use the
> Bio.SeqIO.index() function.
>
> Thank you
>
> Justin

Excellent suggestion, I've made that change. For future ideas you can create a wiki account and edit things directly (but feel free to raise discussions like this on the mailing list too).

Thank you,

Peter

From p.j.a.cock at googlemail.com Fri Mar 1 04:28:25 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 1 Mar 2013 09:28:25 +0000
Subject: [Biopython] Filter Blast results
In-Reply-To:
References:
Message-ID:

On Fri, Mar 1, 2013 at 9:23 AM, Peter Cock wrote:
> On Thu, Feb 28, 2013 at 9:23 PM, Justin Gibbons wrote:
>> The example is here:
>> http://biopython.org/wiki/Retrieve_nonmatching_blast_queries
>>
>> I think the example would be improved if it pointed out you could use the
>> Bio.SeqIO.index() function.
>>
>> Thank you
>>
>> Justin
>
> Excellent suggestion, I've made that change. For future ideas
> you can create a wiki account and edit things directly (but
> feel free to raise discussions like this on the mailing list too).
>
> Thank you,
>
> Peter

Looking back over the history, when David Winter added the example in early 2009, Bio.SeqIO.index didn't exist yet (that was new in Biopython 1.52, released in September 2009), so using the in-memory dictionary like that was reasonable.
Thanks again for the feedback,

Regards,

Peter

From p.j.a.cock at googlemail.com Sun Mar 3 07:00:25 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 3 Mar 2013 12:00:25 +0000
Subject: [Biopython] Fwd: GSoC 2013 is ON
In-Reply-To: <20130303112326.GA5638@thebird.nl>
References: <20130303112326.GA5638@thebird.nl>
Message-ID:

Time to start preparations for Google Summer of Code 2013 :)

---------- Forwarded message ----------
From: *Pjotr Prins*
Date: Sunday, March 3, 2013
Subject: GSoC 2013 is ON

Game on! GSoC 2013 is ON.

I am running the OBF project administration this year for the Google Summer of Code (GSoC). First and foremost I want to thank Robert Buels and others for making OBF/GSoC a success in the previous three years! This year, Robert, Chris Fields and Hilmar Lapp will act as backup administrators.

The deadline for the OBF application for GSoC 2013 as a mentoring organisation is Friday March 29! See http://www.google-melange.com/gsoc/events/google/gsoc2013

Similar to previous years, each Bio* project needs to update and add project ideas on the project's individual OBF wiki page and create links from the main OBF page at http://www.open-bio.org/wiki/Google_Summer_of_Code (we will update the main information on that page soon).

So, for each of the OBF projects that wants to do GSoC again this year:

1. Update the list of project ideas on your project's GSoC page (BioPython, BioPerl, BioRuby, etc). Add new ones, remove ones that have already been done or are no longer relevant, etc. For an example see http://bioruby.open-bio.org/wiki/Google_Summer_of_Code
2. Update the final list of project ideas on the main OBF GSoC page to match. http://www.open-bio.org/wiki/Google_Summer_of_Code
3. Register with gsoc at lists.open-bio.org
4. Announce it on that list when you are ready :)

Anyone can submit a project idea! Former GSoC students are especially encouraged to contribute ideas to the mailing lists.

Please have the updates done by Friday March 22nd.
The number and quality of the project ideas are part of the evaluation process for whether OBF is accepted as a Summer of Code organisation again this year, so let's come up with some good ones!

Pj. (Pjotr Prins)

Important dates:

* March 22nd: Finalise project ideas
* March 29th: Deadline OBF mentoring organisation submission to Google

http://www.open-bio.org/wiki/Google_Summer_of_Code

From harijay at gmail.com Sun Mar 3 13:34:26 2013
From: harijay at gmail.com (hari jayaram)
Date: Sun, 3 Mar 2013 13:34:26 -0500
Subject: [Biopython] Sequence object "find" is still case specific?
Message-ID:

I am relatively new to biopython, having not used it for a while. I have the "bad" habit of storing sequences in an internal database as mixed-case strings, e.g. "atgCTCGAGcatcatcat", where the upper-case stretch is a restriction site I normally use for cloning purposes.

I am interested in using biopython to write a PDF-based (using reportlab) plasmid vector map drawing utility for all the sequences in my database.

I am just getting started and was wondering why the Sequence object "find" still behaves like an ordinary Python string find, e.g.

>>> from Bio.Seq import Seq
>>> raw_seq_mixed_case = "atgCTCGAGcatcatcatcatcat"
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq(raw_seq_mixed_case, IUPAC.unambiguous_dna)
>>> my_seq.find("ctcgag")
-1
>>> my_seq.find("CTCGAG")
3

Along these lines, this does not work either.

>>> search_sequence = Seq("ctcgag", IUPAC.unambiguous_dna)
>>> my_seq.find(search_sequence)
-1
>>> my_seq.find(search_sequence.tostring())
-1
>>> my_seq.find(search_sequence.tostring().upper())
3

I wonder if I am doing something wrong.

It seems strange that the Seq object would behave like a Python string after going through the process of telling it that it is "unambiguous_dna". I didn't want to roll my own solution for handling sequences etc. and would prefer playing along with biopython conventions.
Thanks for your help

Hari

From harijay at gmail.com Sun Mar 3 14:13:34 2013
From: harijay at gmail.com (hari jayaram)
Date: Sun, 3 Mar 2013 14:13:34 -0500
Subject: [Biopython] Sequence object "find" is still case specific?
In-Reply-To:
References:
Message-ID:

Thanks for your replies on Google Plus, Iddo Friedberg and Chris Lasher. Reproducing them here:

On Google Plus Iddo wrote:
Good question. There is no default strict checking; you may also want to see the manual: Section 3.6 http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc173.6.

On Google Plus Chris Lasher wrote:
Hmm, well, lower case nucleotides have often represented "masked regions" of sequences. It seems that Biopython sequences were meant to be case-sensitive (e.g., http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc22). From the documentation there, it seems like you've discovered a bug in the API; I feel that Seq should raise a ValueError when instantiating with lower-case nucleotides and unambiguous_dna. I suppose my suggestion would be to always normalize to upper-case if you're not dealing with masked regions.

So I understand that in most cases I am better off

* just treating my Sequence objects as strings, or
* imposing strict checking while creating them, or
* force-converting to upper case during instantiation.

Would it not make sense to have either of the following behaviours?

seq = Seq("atgCTCGAGcatcatcat", IUPAC.unambiguous_dna) throws an error, since mixed case is used, which is not allowed

or

it just silently converts it all to the case of the Unambiguous_DNA specification, and then all "find" and "search" calls work regardless of case on this internal representation, which is just "DNA".
*But for now I will just force case to upper when instantiating*

Thanks for your help

Hari

So in the examples:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> dna_seq = Seq("acgtACGT", generic_dna)
>>> dna_seq
Seq('acgtACGT', DNAAlphabet())
>>> dna_seq.upper()
Seq('ACGTACGT', DNAAlphabet())
>>> dna_seq.lower()
Seq('acgtacgt', DNAAlphabet())
>>> "GTAC" in dna_seq
False
>>> "GTAC" in dna_seq.upper()
True

However, the find still fails, which is counter-intuitive.

>>> dna_seq.find("GTAC")
-1

On Sun, Mar 3, 2013 at 1:34 PM, hari jayaram wrote:
>
> I am relatively new to biopython, having not used it for a while. I have the "bad" habit of storing sequences in an internal database as mixed-case strings, e.g. "atgCTCGAGcatcatcat", where the upper-case stretch is a restriction site I normally use for cloning purposes.
>
> I am interested in using biopython to write a PDF-based (using reportlab) plasmid vector map drawing utility for all the sequences in my database.
>
> I am just getting started and was wondering why the Sequence object "find" still behaves like an ordinary Python string find, e.g.
>
> >>> from Bio.Seq import Seq
> >>> raw_seq_mixed_case = "atgCTCGAGcatcatcatcatcat"
> >>> from Bio.Alphabet import IUPAC
> >>> my_seq = Seq(raw_seq_mixed_case, IUPAC.unambiguous_dna)
> >>> my_seq.find("ctcgag")
> -1
> >>> my_seq.find("CTCGAG")
> 3
>
> Along these lines, this does not work either.
>
> >>> search_sequence = Seq("ctcgag", IUPAC.unambiguous_dna)
> >>> my_seq.find(search_sequence)
> -1
> >>> my_seq.find(search_sequence.tostring())
> -1
> >>> my_seq.find(search_sequence.tostring().upper())
> 3
>
> I wonder if I am doing something wrong.
>
> It seems strange that the Seq object would behave like a Python string after going through the process of telling it that it is "unambiguous_dna". I didn't want to roll my own solution for handling sequences etc. and would prefer playing along with biopython conventions.
>
> Thanks for your help
> Hari
>

From p.j.a.cock at googlemail.com Sun Mar 3 15:39:22 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 3 Mar 2013 20:39:22 +0000
Subject: [Biopython] Sequence object "find" is still case specific?
In-Reply-To:
References:
Message-ID:

On Sun, Mar 3, 2013 at 7:13 PM, hari jayaram wrote:
>
> On Google Plus Chris Lasher wrote:
>> Hmm, well, lower case nucleotides have often represented "masked regions"
>> of sequences. It seems that Biopython sequences were meant to be
>> case-sensitive (e.g.,
>> http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc22). From the
>> documentation there, it seems like you've discovered a bug in the API; I
>> feel that Seq should raise a ValueError when instantiating with lower-case
>> nucleotides and unambiguous_dna.

Yes, it has some appeal - the trouble is if we suddenly start enforcing this it will likely break many existing scripts: https://redmine.open-bio.org/issues/2597

> Would it not make sense to have either of the following behavior
>
> seq = Seq("atgCTCGAGcatcatcat",IUPAC.unambiguous_dna) throws an error since
> mixed case is used which is not allowed

Yes, if we keep IUPAC.unambiguous_dna as upper case only, then an error makes sense. https://redmine.open-bio.org/issues/2597

Or we could make IUPAC.unambiguous_dna mixed case, and add new more specific upper only and lower only alphabets? Sadly that would also probably break some existing usage.

(Whatever change is made will require a transition period with deprecation warnings in order to move to a strict by default mode)

> or
>
> It just silently converts it all to the case of the Unambiguous_DNA
> specification and then all "find" and "search" works regardless of case on
> this internal representation which is just "DNA".

You mean if an all upper case alphabet is used, silently switch the sequence to upper case? And vice versa for lower case? That seems too magic/implicit, so I'd not support that.
> *But for now I will just force case to upper when instantiating*
>

Yes, that is the pragmatic solution for the current and recent versions of Biopython.

Peter

From idoerg at gmail.com Sun Mar 3 16:10:43 2013
From: idoerg at gmail.com (Iddo Friedberg)
Date: Sun, 3 Mar 2013 16:10:43 -0500
Subject: [Biopython] Sequence object "find" is still case specific?
In-Reply-To:
References:
Message-ID:

The thing is, I am a bit unsure of the utility of alphabets associated with a Seq object in general. (And I was one of the original crafters of the Seq object.) It seems like *any* letter is acceptable - there is no strict alphabet checking. I inserted "Z"s into an unambiguous-dna Seq object. So I am not sure when this happened, but aren't alphabets supposed to provide some constraints?

On Sun, Mar 3, 2013 at 3:39 PM, Peter Cock wrote:
> On Sun, Mar 3, 2013 at 7:13 PM, hari jayaram wrote:
> >
> > On Google Plus Chris Lasher wrote:
> >> Hmm, well, lower case nucleotides have often represented "masked regions"
> >> of sequences. It seems that Biopython sequences were meant to be
> >> case-sensitive (e.g.,
> >> http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc22). From the
> >> documentation there, it seems like you've discovered a bug in the API; I
> >> feel that Seq should raise a ValueError when instantiating with lower-case
> >> nucleotides and unambiguous_dna.
>
> Yes, it has some appeal - the trouble is if we suddenly start
> enforcing this it will likely break many existing scripts:
> https://redmine.open-bio.org/issues/2597
>
> > Would it not make sense to have either of the following behavior
> >
> > seq = Seq("atgCTCGAGcatcatcat",IUPAC.unambiguous_dna) throws an error since
> > mixed case is used which is not allowed
>
> Yes, if we keep IUPAC.unambiguous_dna as upper case only,
> then an error makes sense.
> https://redmine.open-bio.org/issues/2597
>
> Or we could make IUPAC.unambiguous_dna mixed case, and add
> new more specific upper only and lower only alphabets? Sadly that
> would also probably break some existing usage.
>
> (Whatever change is made will require a transition period with
> deprecation warnings in order to move to a strict by default mode)
>
> > or
> >
> > It just silently converts it all to the case of the Unambiguous_DNA
> > specification and then all "find" and "search" works regardless of case on
> > this internal representation which is just "DNA".
>
> You mean if an all upper case alphabet is used, silently switch
> the sequence to upper case? And vice versa for lower case?
> That seems too magic/implicit, so I'd not support that.
>
> > *But for now I will just force case to upper when instantiating*
> >
>
> Yes, that is the pragmatic solution for the current and recent
> versions of Biopython.
>
> Peter
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

--
Iddo Friedberg
http://iddo-friedberg.net/contact.html
++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.>
++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----.
.>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>>
>>----.<--.>++++++.<<<<------------------------------------.

From p.j.a.cock at googlemail.com Sun Mar 3 17:03:29 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 3 Mar 2013 22:03:29 +0000
Subject: [Biopython] Sequence object "find" is still case specific?
In-Reply-To:
References:
Message-ID:

We're going off topic here, but for the record I think that 'find' should continue to be case sensitive (like Python strings).

On Sun, Mar 3, 2013 at 9:10 PM, Iddo Friedberg wrote:
> The thing is, I am a bit unsure of the utility of alphabets associated with
> a Seq object in general.
> (And I was the one who was one of the original
> crafters of the Seq object). It seems like *any* letter is acceptable -
> there is no strict alphabet checking. I inserted "Z"s into an
> unambiguous-dna Seq object. So I am not sure when this happened, but aren't
> alphabets supposed to provide some constraints?

Not so far no, and I personally find this annoying. The current alphabet system is quite heavy and not really used to its full potential - checking the letters at __init__ time seems a good idea (when requested), likewise for the MutableSeq object on edit.

Right now (kind of like duck-type-checking) Biopython looks at the alphabet on demand, e.g. if trying to do a translation or transcription. But for the most part, they are ignored.

My idea on https://redmine.open-bio.org/issues/2597 is to continue with the current relaxed approach UNLESS the alphabet selected has a letters attribute which would be treated as a white list of allowed letters.

What I would like in the long run is to typically use the existing generic DNA, RNA, nucleotide, protein alphabets where all you care about is the type. Where you do care about the exact letters used, then the strict IUPAC alphabets would apply (or subclasses for special cases).

Perhaps we should actually finally do this in the next release (or do a beta release with this enabled to see how many complaints we get?).

Longer term, memory efficient bit-encoded Seq classes (like BioJava has) would be interesting, and would fit nicely with the strict letter checking approach.

Regards,

Peter

From cjfields at illinois.edu Sun Mar 3 17:19:22 2013
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Sun, 3 Mar 2013 22:19:22 +0000
Subject: [Biopython] Sequence object "find" is still case specific?
In-Reply-To:
References:
Message-ID: <118F034CF4C3EF48A96F86CE585B94BF72F73E19@CHIMBX5.ad.uillinois.edu>

On Mar 3, 2013, at 2:39 PM, Peter Cock wrote:

> On Sun, Mar 3, 2013 at 7:13 PM, hari jayaram wrote:
> ...
>> or
>>
>> It just silently converts it all to the case of the Unambiguous_DNA
>> specification and then all "find" and "search" works regardless of case on
>> this internal representation which is just "DNA".
>
> You mean if an all upper case alphabet is used, silently switch
> the sequence to upper case? And vice versa for lower case?
> That seems too magic/implicit, so I'd not support that.

A note/warning: Peter's right, I've been bitten by such magic (on the Bioperl end :) a number of times. We're intending to remove such magic in v2. It's better to throw an exception and let someone know why it doesn't work than to try 'helping' and end up misunderstanding the user's intent.

>
> Peter

chris

From anaryin at gmail.com Sun Mar 3 18:07:47 2013
From: anaryin at gmail.com (João Rodrigues)
Date: Mon, 4 Mar 2013 00:07:47 +0100
Subject: [Biopython] Updating GSOC page?
Message-ID:

Hello all,

Does anyone oppose a refresh of our GSoC page, based on the BioRuby page? It could use a facelift before the new round of projects/students comes in.

Best,

João

From harijay at gmail.com Sun Mar 3 18:50:38 2013
From: harijay at gmail.com (hari jayaram)
Date: Sun, 3 Mar 2013 18:50:38 -0500
Subject: [Biopython] Sequence object "find" is still case specific?
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF72F73E19@CHIMBX5.ad.uillinois.edu>
References: <118F034CF4C3EF48A96F86CE585B94BF72F73E19@CHIMBX5.ad.uillinois.edu>
Message-ID:

Hi All,

I agree that magic is bad, and I understand why the API is the way it is for compatibility reasons. I agree with what everyone says: the best behavior is to throw an error during instantiation and minimize surprising results downstream.

The good thing about having a repo on git is that it is always possible to make branched tweaks like this and see how things go.
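To make the "throw an error rather than silently convert" idea from this thread concrete, here is a minimal sketch of a check that could be run on a raw string before it is handed to Seq. It is only an illustration under stated assumptions: the helper name `check_letters` and the `UNAMBIGUOUS_DNA` letter set are hypothetical and are not part of Biopython.

```python
# Hypothetical helper (not a Biopython function): validate the letters of a
# raw string before building a Seq, raising an error instead of silently
# changing the case.
UNAMBIGUOUS_DNA = frozenset("GATC")  # assumed whitelist: upper-case letters only


def check_letters(sequence, allowed=UNAMBIGUOUS_DNA):
    """Return sequence unchanged, or raise ValueError naming the bad letters."""
    bad = sorted(set(sequence) - allowed)
    if bad:
        raise ValueError("letters not in alphabet: %s" % "".join(bad))
    return sequence


# The caller normalises the case explicitly and visibly, then runs the check.
raw = "atgCTCGAGcatcatcat"
clean = check_letters(raw.upper())
```

The case conversion stays in the caller's hands, so nothing is changed behind the user's back; the mixed-case original can still be kept alongside if the annotation matters.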
Since I am new to Biopython, I will read more on the development process and other things, but for now I am forcing all my sequences to upper case and checking for "alphabet" compliance prior to instantiation.

Thanks for your help

Hari

On Sun, Mar 3, 2013 at 5:19 PM, Fields, Christopher J wrote:
> On Mar 3, 2013, at 2:39 PM, Peter Cock wrote:
>
>> On Sun, Mar 3, 2013 at 7:13 PM, hari jayaram wrote:
>> ...
>>> or
>>>
>>> It just silently converts it all to the case of the Unambiguous_DNA
>>> specification and then all "find" and "search" works regardless of case on
>>> this internal representation which is just "DNA".
>>
>> You mean if an all upper case alphabet is used, silently switch
>> the sequence to upper case? And vice versa for lower case?
>> That seems too magic/implicit, so I'd not support that.
>
> A note/warning: Peter's right, I've been bitten by such magic (on the Bioperl end :) a number of times. We're intending to remove such magic in v2. It's better to throw an exception and let someone know why it doesn't work than to try 'helping' and end up misunderstanding the user's intent.
>
>>
>> Peter
>
>
> chris
>

From mjldehoon at yahoo.com Sun Mar 3 21:47:13 2013
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sun, 3 Mar 2013 18:47:13 -0800 (PST)
Subject: [Biopython] Sequence object "find" is still case specific?
In-Reply-To:
Message-ID: <1362365233.41145.YahooMailClassic@web164001.mail.gq1.yahoo.com>

Hi everybody,

--- On Sun, 3/3/13, Peter Cock wrote:
> On Sun, Mar 3, 2013 at 9:10 PM, Iddo Friedberg wrote:
> > The thing is, I am a bit unsure of the utility of
> > alphabets associated with a Seq object in general.
> > (And I was the one who was one of the original
> > crafters of the Seq object). It seems like *any* letter
> > is acceptable - there is no strict alphabet checking. I
> > inserted "Z"s into an unambiguous-dna Seq object. So I
> > am not sure when this happened, but aren't
> > alphabets supposed to provide some constraints?
>
> Not so far no, and I personally find this annoying. The
> current alphabet system is quite heavy and not really
> used to its full potential - checking the letters at
> __init__ time seems a good idea (when requested), likewise for
> the MutableSeq object on edit.

There was a discussion on this topic a long time ago, starting at:
http://lists.open-bio.org/pipermail/biopython/2005-May/002633.html

Best,
-Michiel.

From mjldehoon at yahoo.com Sun Mar 3 21:55:46 2013
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sun, 3 Mar 2013 18:55:46 -0800 (PST)
Subject: [Biopython] Sequence object "find" is still case specific?
In-Reply-To:
Message-ID: <1362365746.45468.YahooMailClassic@web164001.mail.gq1.yahoo.com>

--- On Sun, 3/3/13, Peter Cock wrote:
> We're going off topic here, but for the record I think
> that 'find' should continue to be case sensitive (like
> Python strings).

I would prefer find to be case-insensitive. Biochemically there is no difference between upper case and lower case nucleotides; lower case is just used for annotation purposes. I find it quite counter-intuitive that

>>> s = Seq("ACGTttt")
>>> s.find("ACGTT")

returns -1.

While it is possible to change the sequences to upper case before executing .find, it has the disadvantage that then we won't be able to tell what the original case was (and therefore whether we are hitting a repeat region or not).

Best,
-Michiel.

From mjldehoon at yahoo.com Sun Mar 3 22:01:43 2013
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sun, 3 Mar 2013 19:01:43 -0800 (PST)
Subject: [Biopython] Sequence object "find" is still case specific?
In-Reply-To:
Message-ID: <1362366103.55733.YahooMailClassic@web164001.mail.gq1.yahoo.com>

> Hmm, well, lower case nucleotides have often represented
> "masked regions" of sequences. It seems that Biopython
> sequences were meant to be case-sensitive (e.g.,
> http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc22).
> From the documentation there, it seems like you've discovered
> a bug in the API; I feel that Seq should raise a ValueError
> when instantiating with lower-case nucleotides and unambiguous_dna.
>

I don't think that this is a bug. The difference between unambiguous and ambiguous DNA refers to the difference between ACGT and ACGTMRWSYKVHDBXN, where the nucleotides other than ACGT are ambiguous (for example, R = purine = either A or G).

Best,
-Michiel.

From p.j.a.cock at googlemail.com Mon Mar 4 05:35:12 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 4 Mar 2013 10:35:12 +0000
Subject: [Biopython] Sequence object "find" is still case specific?
In-Reply-To: <1362366103.55733.YahooMailClassic@web164001.mail.gq1.yahoo.com>
References: <1362366103.55733.YahooMailClassic@web164001.mail.gq1.yahoo.com>
Message-ID:

On Mon, Mar 4, 2013 at 3:01 AM, Michiel de Hoon wrote:
>> Hmm, well, lower case nucleotides have often represented
>> "masked regions" of sequences. It seems that Biopython
>> sequences were meant to be case-sensitive (e.g.,
>> http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc22).
>> From the documentation there, it seems like you've discovered
>> a bug in the API; I feel that Seq should raise a ValueError
>> when instantiating with lower-case nucleotides and unambiguous_dna.
>>
> I don't think that this is a bug. The difference between unambiguous
> and ambiguous DNA refers to the difference between ACGT and
> ACGTMRWSYKVHDBXN, where the nucleotides other than
> ACGT are ambiguous (for example, R = purine = either A or G).

That's part of the issue with the sequence objects not checking the letters against the list specified in the alphabet object - and arguably much more important than the case aspect.

Peter

From p.j.a.cock at googlemail.com Mon Mar 4 05:50:46 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 4 Mar 2013 10:50:46 +0000
Subject: [Biopython] Updating GSOC page?
In-Reply-To:
References:
Message-ID:

On Sun, Mar 3, 2013 at 11:07 PM, João Rodrigues wrote:
> Hello all,
>
> Does anyone oppose a refresh of our GSoC page, based on the
> BioRuby page? It could use a facelift before the new round of
> projects/students comes in.
>
> Best,
>
> João

A good idea - see also the GSoC discussions on the biopython-dev list about potential project ideas.

Thanks,

Peter

From p.j.a.cock at googlemail.com Mon Mar 4 08:19:39 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 4 Mar 2013 13:19:39 +0000
Subject: [Biopython] Making TranslationError subclass ValueError
Message-ID:

Hello all,

Kai and I were talking on Twitter, and we agreed that when writing generic error handling, it might be nice if TranslationError (defined in Bio.Data.CodonTable and raised during sequence translation) was a subclass of ValueError, rather than of the generic Exception.

The idea is that something like this:

    from Bio.Data.CodonTable import TranslationError
    try:
        some_biopython_code(arguments)
    except (TranslationError, ValueError):
        pass  # Do XXX

could be just:

    try:
        some_biopython_code(arguments)
    except ValueError:
        # Includes TranslationError
        pass  # Do XXX

Any thoughts? Good idea? Bad idea?

Peter

From mmokrejs at fold.natur.cuni.cz Mon Mar 4 08:43:09 2013
From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs)
Date: Mon, 04 Mar 2013 14:43:09 +0100
Subject: [Biopython] Sequence object "find" is still case specific?
In-Reply-To: <1362365746.45468.YahooMailClassic@web164001.mail.gq1.yahoo.com>
References: <1362365746.45468.YahooMailClassic@web164001.mail.gq1.yahoo.com>
Message-ID: <5134A4ED.3070507@fold.natur.cuni.cz>

Michiel de Hoon wrote:
> --- On Sun, 3/3/13, Peter Cock wrote:
>> We're going off topic here, but for the record I think
>> that 'find' should continue to be case sensitive (like
>> Python strings).
>
> I would prefer find to be case-insensitive.
> Biochemically there is no difference between upper case and lower case nucleotides; lower case is just used for annotation purposes. I find it quite counter-intuitive that
> >>> s = Seq("ACGTttt")
> >>> s.find("ACGTT")
> returns -1.
>
> While it is possible to change the sequences to upper case before executing .find, it has the disadvantage that then we won't be able to tell what the original case was (and therefore whether we are hitting a repeat region or not).

I agree that it would be bad if biopython converted my sequences into all-uppercase. And it is not only me: a typical use case nowadays is importing raw sequencing reads that include low-quality/masked regions in lower case. I use mixed casing quite often, and I think it is acceptable to ask the user to do the .find like:

s.tostring().upper().find('ACGTT')

and leave it to the user to slice the mixed-case match out of the original sequence object afterwards.

I don't think I want anything to be changed in biopython, except maybe more runtime control over the checks of the alphabet during data import.

Supporting searches through a possibly mixed-case sequence object would require the use of a regular expression engine and would possibly be slower.

Hope I got your discussion right.
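The approach described above (search on an upper-cased copy, then slice the hit out of the untouched original) can be sketched as follows. This is a rough illustration on plain Python strings, so it runs without Biopython; the helper name `find_ignore_case` is made up for this example, and with a Seq object one would pass its string form instead.

```python
def find_ignore_case(sequence, motif):
    """Case-insensitive find on a plain string; returns -1 if absent."""
    return sequence.upper().find(motif.upper())


# With a Seq object, pass its string form rather than a literal.
original = "ACGTttt"
motif = "ACGTT"
start = find_ignore_case(original, motif)

# Slice the *original* string so the upper/lower-case annotation is kept.
match = original[start:start + len(motif)] if start != -1 else None
```

Because only a temporary copy is upper-cased, the original casing (and hence any masking or annotation it encodes) is still available at the matched position.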
;-)

Martin

From jttkim at googlemail.com Mon Mar 4 10:40:07 2013
From: jttkim at googlemail.com (Jan T Kim)
Date: Mon, 4 Mar 2013 15:40:07 +0000
Subject: [Biopython] Problems with reading Swiss format records (swissprot specific date fields)
Message-ID: <20130304154006.GA4227@paxarchia.galaxy.uni>

Dear All,

trying to parse the attached Swissprot record gives me a stack trace:

Traceback (most recent call last):
  File "./swisstest", line 7, in <module>
    e = Bio.SeqIO.read(sys.argv[1], 'swiss')
  File "/usr/lib/pymodules/python2.7/Bio/SeqIO/__init__.py", line 599, in read
    first = iterator.next()
  File "/usr/lib/pymodules/python2.7/Bio/SeqIO/__init__.py", line 537, in parse
    for r in i:
  File "/usr/lib/pymodules/python2.7/Bio/SeqIO/SwissIO.py", line 97, in SwissIterator
    annotations['date'] = swiss_record.created[0]
TypeError: 'NoneType' object has no attribute '__getitem__'

The problem is at line 99 (rather than 97) of https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SwissIO.py :

    annotations['date'] = swiss_record.created[0]

which runs without an "if swiss_record.created is not None" test or something similar. The parse function of Bio.SwissProt initialises the created instance variable to None, and only if a "DT" record containing the string "INTEGRATED" (case insensitive) is found is created set to that date.

The same kind of problem occurs with the sequence_update variable in the next statement:

    annotations['date_last_sequence_update'] = swiss_record.sequence_update[0]

Would it be sensible to set the 'date' and 'date_last_sequence_update' entries of the annotations dictionary only if the values are actually found in the swiss_record? I understand that with a genuine SwissProt record they should always be there, but this happened to me when working on files generated from the refseq protein database using the EMBOSS seqret program with -osformat=swiss, which doesn't seem like an entirely exotic use case to me.

Best regards, Jan

--
+- Jan T. Kim -------------------------------------------------------+
| email: jttkim at gmail.com |
| WWW: http://www.jtkim.dreamhosters.com/ |
*-----=< hierarchical systems are for files, not for humans >=-----*

-------------- next part --------------
ID ZP_10312765 Reviewed; 498 AA.
AC ZP_10312765;
DT 27-JUN-2012, entry version 1.
DE hypothetical protein FraQA3DRAFT_6339 [Frankia sp. QA3].
OS Frankia sp. QA3.
RN [1]
RP 1-498
RN [2]
RP 1-498
KW .
FT REGION 1 498 Frankia sp. QA3. QA3. taxon:710111.
FT REGION 1 498 hypothetical protein. 53620.
FT REGION 1 498 FraQA3DRAFT_6339.
FT complement(NZ_CM001489.1:7362098..7363594
FT ). 11.
SQ SEQUENCE 498 AA; 53751 MW; 39E328894991F8AC CRC64;
 mhphrvhpsr vhpspehpsp ehlsrehqsr prhataaara arsrpprphr agrrarrddr
 crqrsqraac lpggcpttcr dgrraptdrg hgshapgrgp taavpdlavp agcagpgrgg
 vgarhrrpaa artapgsqpt aaarrstags rvprgpgrrr sattrrgrrr prdalaarpa
 pvrvsvhgps grgpgrarrr pcrirgrchh dapggratap avggaprlvh rcggrrwqra
 rpgrggrdgp amptprssvp epgppgprhp rgpsrrpahp hwnptlggrr wpgvhrrdgr
 hgahrrrtip rpagrptrgr sgphrpapvr paagrhagng rcrpdhgrir rqppdagpas
 rsahthrgsr rlrrpggrps grrsdartgl arrsaagadq twpaprrwrh rrtnhrgrgs
 apgrhrsaap ptvpvphpar srpphdhgsg hprthrpgpt ghhaggrrpa rapghaagag
 rrrtapmrra rslclpsp
//

From ivangreg at gmail.com Mon Mar 4 11:14:16 2013
From: ivangreg at gmail.com (Ivan Gregoretti)
Date: Mon, 4 Mar 2013 11:14:16 -0500
Subject: [Biopython] SeqIO.write and user-specified wrapping
Message-ID:

Hi Everybody,

Could a kind soul show how to produce a FASTA file without wrapping its sequences?

I am looking for something like this:

    from Bio import SeqIO
    SeqIO.write(my_records, "my_example.fa", "fasta", wrap=0)

as opposed to the default wrapping after 60 characters.

Thanks for Biopython.
Ivan

Ivan Gregoretti, PhD
Bioinformatics

From p.j.a.cock at googlemail.com Mon Mar 4 11:54:11 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 4 Mar 2013 16:54:11 +0000
Subject: [Biopython] SeqIO.write and user-specified wrapping
In-Reply-To:
References:
Message-ID:

On Mon, Mar 4, 2013 at 4:14 PM, Ivan Gregoretti wrote:
> Hi Everybody,
>
> Could a kind soul show how to produce a FASTA file without wrapping
> its sequences?
>
> I am looking for something like this:
>
> from Bio import SeqIO
> SeqIO.write(my_records, "my_example.fa", "fasta", wrap=0)
>
> as opposed to the default wrapping after 60 characters.
>
> Thanks for Biopython.
>
> Ivan

Currently SeqIO doesn't allow extra file format specific arguments like that (although it could in principle). Here you must currently use the underlying writer directly, something like this:

    from Bio.SeqIO.FastaIO import FastaWriter
    handle = open("my_example.fa", "w")
    writer = FastaWriter(handle, wrap=0)
    writer.write_file(my_records)
    handle.close()

Or, since you want no line wrapping you could do something like this instead:

    handle = open("my_example.fa", "w")
    for record in my_records:
        handle.write(">%s %s\n%s\n" % (record.id, record.description, record.seq))
    handle.close()

Peter

From ivangreg at gmail.com Mon Mar 4 12:00:25 2013
From: ivangreg at gmail.com (Ivan Gregoretti)
Date: Mon, 4 Mar 2013 12:00:25 -0500
Subject: [Biopython] SeqIO.write and user-specified wrapping
In-Reply-To:
References:
Message-ID:

Thank you Peter. I'll try the FastaWriter.

Ivan Gregoretti, PhD
Bioinformatics

On Mon, Mar 4, 2013 at 11:54 AM, Peter Cock wrote:
> On Mon, Mar 4, 2013 at 4:14 PM, Ivan Gregoretti wrote:
>> Hi Everybody,
>>
>> Could a kind soul show how to produce a FASTA file without wrapping
>> its sequences?
>>
>> I am looking for something like this:
>>
>> from Bio import SeqIO
>> SeqIO.write(my_records, "my_example.fa", "fasta", wrap=0)
>>
>> as opposed to the default wrapping after 60 characters.
>> >> Thanks for Biopython. >> >> Ivan > > Currently SeqIO doesn't allow extra file format specific > argument like that (although it could in principle). Here > you must currently use the underlying writer directly, > something like this: > > from Bio.SeqIO.FastaIO import FastaWriter > handle = open("my_example.fa", "w") > writer = FastaWriter(handle, wrap=0) > writer.write_file(my_records) > handle.close() > > Or, since you want no line wrapping you could do > something like this instead: > > handle = open("my_example.fa", "w") > for record in my_records: > handle.write(">%s %s\n%s\n" % (record.id, record.description, record.seq)) > handle.close() > > Peter From ivangreg at gmail.com Mon Mar 4 12:20:45 2013 From: ivangreg at gmail.com (Ivan Gregoretti) Date: Mon, 4 Mar 2013 12:20:45 -0500 Subject: [Biopython] SeqIO.write and user-specified wrapping In-Reply-To: References: Message-ID: I am trying this from Bio import SeqIO SeqIO.FastaIO.FastaWriter( handle, wrap=0 ).write_file( record ) but I get the following error: File "/usr/lib64/python2.7/site-packages/Bio/SeqIO/Interfaces.py", line 254, in write_file count = self.write_records(records) File "/usr/lib64/python2.7/site-packages/Bio/SeqIO/Interfaces.py", line 239, in write_records self.write_record(record) File "/usr/lib64/python2.7/site-packages/Bio/SeqIO/FastaIO.py", line 174, in write_record id = self.clean(record.id) AttributeError: 'str' object has no attribute 'id' Any ideas? Thank you, Ivan Ivan Gregoretti, PhD Bioinformatics On Mon, Mar 4, 2013 at 12:00 PM, Ivan Gregoretti wrote: > Thank you Peter. I'll try the FastaWriter. > > > > Ivan Gregoretti, PhD > Bioinformatics > > > > On Mon, Mar 4, 2013 at 11:54 AM, Peter Cock wrote: >> On Mon, Mar 4, 2013 at 4:14 PM, Ivan Gregoretti wrote: >>> Hi Everybody, >>> >>> Could a kind soul show how to produce a FASTA file without wrapping >>> its sequences? 
>>> >>> I am looking for something like this: >>> >>> from Bio import SeqIO >>> SeqIO.write(my_records, "my_example.fa", "fasta", wrap=0) >>> >>> as opposed to the default wrapping after 60 characters. >>> >>> Thanks for Biopython. >>> >>> Ivan >> >> Currently SeqIO doesn't allow extra file format specific >> argument like that (although it could in principle). Here >> you must currently use the underlying writer directly, >> something like this: >> >> from Bio.SeqIO.FastaIO import FastaWriter >> handle = open("my_example.fa", "w") >> writer = FastaWriter(handle, wrap=0) >> writer.write_file(my_records) >> handle.close() >> >> Or, since you want no line wrapping you could do >> something like this instead: >> >> handle = open("my_example.fa", "w") >> for record in my_records: >> handle.write(">%s %s\n%s\n" % (record.id, record.description, record.seq)) >> handle.close() >> >> Peter From p.j.a.cock at googlemail.com Mon Mar 4 12:26:04 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 4 Mar 2013 17:26:04 +0000 Subject: [Biopython] SeqIO.write and user-specified wrapping In-Reply-To: References: Message-ID: On Mon, Mar 4, 2013 at 5:20 PM, Ivan Gregoretti wrote: > I am trying this > > from Bio import SeqIO > SeqIO.FastaIO.FastaWriter( handle, wrap=0 ).write_file( record ) > That should be a list of SeqRecord objects, not just one SeqRecord. If you have just one, try [record] instead. 
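The two points in this thread (pass a list of records, and use wrap=0 for no wrapping) can be illustrated with a minimal stdlib-only sketch. Everything here is an assumption made for illustration: the Record tuple and the write_fasta helper are hypothetical stand-ins written for Python 3, not Biopython's SeqRecord or FastaWriter, and they only mimic the behaviour described above.

```python
import io
from collections import namedtuple

# Hypothetical stand-in for a SeqRecord: just the three fields used below.
Record = namedtuple("Record", ["id", "description", "seq"])


def write_fasta(records, handle, wrap=60):
    """Write records as FASTA; wrap=0 (or None) disables line wrapping.

    Illustrative helper only, not part of Biopython.
    """
    count = 0
    for record in records:
        handle.write(">%s %s\n" % (record.id, record.description))
        seq = str(record.seq)
        if wrap:
            # Emit the sequence in chunks of at most `wrap` characters.
            for start in range(0, len(seq), wrap):
                handle.write(seq[start:start + wrap] + "\n")
        else:
            # No wrapping: the whole sequence goes on one line.
            handle.write(seq + "\n")
        count += 1
    return count


record = Record("read1", "example read", "ACGT" * 5)

out = io.StringIO()
write_fasta([record], out, wrap=0)      # note the list around the record

wrapped = io.StringIO()
write_fasta([record], wrapped, wrap=8)  # wrap after 8 characters

print(out.getvalue(), end="")
print(wrapped.getvalue(), end="")
```

In this stand-in, calling write_fasta(record, ...) with a bare record instead of [record] fails with an AttributeError much like the one in the traceback, because iterating over the record yields plain strings that have no id attribute.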
Peter From ivangreg at gmail.com Mon Mar 4 13:28:11 2013 From: ivangreg at gmail.com (Ivan Gregoretti) Date: Mon, 4 Mar 2013 13:28:11 -0500 Subject: [Biopython] SeqIO.write and user-specified wrapping In-Reply-To: References: Message-ID: Sorry to pester the list's subscribers but there seems to be bug in FastaWriter It appears that FastaWriter does not respect the wrap argument: # No wrapping SeqIO.FastaIO.FastaWriter( handle, wrap=0 ).write_file( [record] ) or # wrap after 10 characters SeqIO.FastaIO.FastaWriter( handle, wrap=10 ).write_file( [record] ) both produce the same output: >M01483:6:000000000-A2WVJ:1:1101:12685:1873 SVRFQTALASIKLIQASAVLDLTEDDFDFLTSNKVWIATDRSRARRCVEACVYGTLDFVG YPRFPAPVEFIAAVIAYYVHPVNIQTACLIMEGAEFTENIINGVERPVKAAELFAFTLRV RAGNTDV >M01483:6:000000000-A2WVJ:1:1101:19629:2231 LTLADDRLEAFYDNPNALRDYFRDIGRMVLAAEGRKANDSHADCYQYFCVPEYGTANGRL HFHAVHFMRTLPTGSVDPNFGRRVRNRRQLNSLQNTWPYGYSMP >M01483:6:000000000-A2WVJ:1:1101:12952:2294 GFENQKELTKMQLDNQKEIAEMQNETQKEIAGIQSATSRQNTKDQVYAQNEMLAYQQKES TARVASIMENTNLSKQQQVSEIMRQMLTQAQTAGQYFTNDQIKEMTRKVS FastaWriter does produce an error when I pass wrap="hello world", so, the argument is interpreted initially. Also, as expected, when I comment out the SeqIO.FastaIO.FastaWriter line, there is no output. I checked and double checked. I am using Python 2.7.3 on a linux 64bit (Fedora 18) and my Biopython is 1.6.1. Thank you, Ivan Ivan Gregoretti, PhD Bioinformatics On Mon, Mar 4, 2013 at 12:26 PM, Peter Cock wrote: > On Mon, Mar 4, 2013 at 5:20 PM, Ivan Gregoretti wrote: >> I am trying this >> >> from Bio import SeqIO >> SeqIO.FastaIO.FastaWriter( handle, wrap=0 ).write_file( record ) >> > > That should be a list of SeqRecord objects, not just one SeqRecord. > If you have just one, try [record] instead. 
> > Peter From arklenna at gmail.com Mon Mar 4 14:26:38 2013 From: arklenna at gmail.com (Lenna Peterson) Date: Mon, 4 Mar 2013 14:26:38 -0500 Subject: [Biopython] SeqIO.write and user-specified wrapping In-Reply-To: References: Message-ID: For me, this works as expected. I will paste my exact code below. I am creating the SeqRecord from your string, so perhaps it's some difference there. Cheers, Lenna from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord from Bio.SeqIO.FastaIO import FastaWriter mystr = """SVRFQTALASIKLIQASAVLDLTEDDFDFLTSNKVWIATDRSRARRCVEACVYGTLDFVG YPRFPAPVEFIAAVIAYYVHPVNIQTACLIMEGAEFTENIINGVERPVKAAELFAFTLRV RAGNTDV""" myseq = Seq(mystr.replace("\n", "")) myrec = SeqRecord(myseq) with open("nowrap.fa", "wb") as fh: FastaWriter(fh, wrap=0).write_file([myrec]) with open("wrap10.fa", "wb") as fh: FastaWriter(fh, wrap=10).write_file([myrec]) On Mon, Mar 4, 2013 at 1:28 PM, Ivan Gregoretti wrote: > Sorry to pester the list's subscribers but there seems to be bug in > FastaWriter > > It appears that FastaWriter does not respect the wrap argument: > > # No wrapping > SeqIO.FastaIO.FastaWriter( handle, wrap=0 ).write_file( [record] ) > > or > > # wrap after 10 characters > SeqIO.FastaIO.FastaWriter( handle, wrap=10 ).write_file( [record] ) > > both produce the same output: > > >M01483:6:000000000-A2WVJ:1:1101:12685:1873 > SVRFQTALASIKLIQASAVLDLTEDDFDFLTSNKVWIATDRSRARRCVEACVYGTLDFVG > YPRFPAPVEFIAAVIAYYVHPVNIQTACLIMEGAEFTENIINGVERPVKAAELFAFTLRV > RAGNTDV > >M01483:6:000000000-A2WVJ:1:1101:19629:2231 > LTLADDRLEAFYDNPNALRDYFRDIGRMVLAAEGRKANDSHADCYQYFCVPEYGTANGRL > HFHAVHFMRTLPTGSVDPNFGRRVRNRRQLNSLQNTWPYGYSMP > >M01483:6:000000000-A2WVJ:1:1101:12952:2294 > GFENQKELTKMQLDNQKEIAEMQNETQKEIAGIQSATSRQNTKDQVYAQNEMLAYQQKES > TARVASIMENTNLSKQQQVSEIMRQMLTQAQTAGQYFTNDQIKEMTRKVS > > FastaWriter does produce an error when I pass wrap="hello world", so, > the argument is interpreted initially. 
Also, as expected, when I > comment out the SeqIO.FastaIO.FastaWriter line, there is no output. > > I checked and double checked. > > I am using Python 2.7.3 on a linux 64bit (Fedora 18) and my Biopython is > 1.6.1. > > Thank you, > > Ivan > > > > > Ivan Gregoretti, PhD > Bioinformatics > > > > On Mon, Mar 4, 2013 at 12:26 PM, Peter Cock > wrote: > > On Mon, Mar 4, 2013 at 5:20 PM, Ivan Gregoretti > wrote: > >> I am trying this > >> > >> from Bio import SeqIO > >> SeqIO.FastaIO.FastaWriter( handle, wrap=0 ).write_file( record ) > >> > > > > That should be a list of SeqRecord objects, not just one SeqRecord. > > If you have just one, try [record] instead. > > > > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From ivangreg at gmail.com Mon Mar 4 15:34:06 2013 From: ivangreg at gmail.com (Ivan Gregoretti) Date: Mon, 4 Mar 2013 15:34:06 -0500 Subject: [Biopython] SeqIO.write and user-specified wrapping In-Reply-To: References: Message-ID: I have to rectify myself. Both your code and Peter's suggestions work. I had a bug in my code that prevented me from overriding the default 60 character length. Thank you, Ivan Ivan Gregoretti, PhD Bioinformatics On Mon, Mar 4, 2013 at 2:26 PM, Lenna Peterson wrote: > For me, this works as expected. I will paste my exact code below. > > I am creating the SeqRecord from your string, so perhaps it's some > difference there. 
> > Cheers, > > Lenna > > from Bio.Seq import Seq > from Bio.SeqRecord import SeqRecord > from Bio.SeqIO.FastaIO import FastaWriter > > mystr = """SVRFQTALASIKLIQASAVLDLTEDDFDFLTSNKVWIATDRSRARRCVEACVYGTLDFVG > YPRFPAPVEFIAAVIAYYVHPVNIQTACLIMEGAEFTENIINGVERPVKAAELFAFTLRV > RAGNTDV""" > > myseq = Seq(mystr.replace("\n", "")) > myrec = SeqRecord(myseq) > > with open("nowrap.fa", "wb") as fh: > FastaWriter(fh, wrap=0).write_file([myrec]) > > with open("wrap10.fa", "wb") as fh: > FastaWriter(fh, wrap=10).write_file([myrec]) > > > > On Mon, Mar 4, 2013 at 1:28 PM, Ivan Gregoretti wrote: >> >> Sorry to pester the list's subscribers but there seems to be bug in >> FastaWriter >> >> It appears that FastaWriter does not respect the wrap argument: >> >> # No wrapping >> SeqIO.FastaIO.FastaWriter( handle, wrap=0 ).write_file( [record] ) >> >> or >> >> # wrap after 10 characters >> SeqIO.FastaIO.FastaWriter( handle, wrap=10 ).write_file( [record] ) >> >> both produce the same output: >> >> >M01483:6:000000000-A2WVJ:1:1101:12685:1873 >> SVRFQTALASIKLIQASAVLDLTEDDFDFLTSNKVWIATDRSRARRCVEACVYGTLDFVG >> YPRFPAPVEFIAAVIAYYVHPVNIQTACLIMEGAEFTENIINGVERPVKAAELFAFTLRV >> RAGNTDV >> >M01483:6:000000000-A2WVJ:1:1101:19629:2231 >> LTLADDRLEAFYDNPNALRDYFRDIGRMVLAAEGRKANDSHADCYQYFCVPEYGTANGRL >> HFHAVHFMRTLPTGSVDPNFGRRVRNRRQLNSLQNTWPYGYSMP >> >M01483:6:000000000-A2WVJ:1:1101:12952:2294 >> GFENQKELTKMQLDNQKEIAEMQNETQKEIAGIQSATSRQNTKDQVYAQNEMLAYQQKES >> TARVASIMENTNLSKQQQVSEIMRQMLTQAQTAGQYFTNDQIKEMTRKVS >> >> FastaWriter does produce an error when I pass wrap="hello world", so, >> the argument is interpreted initially. Also, as expected, when I >> comment out the SeqIO.FastaIO.FastaWriter line, there is no output. >> >> I checked and double checked. >> >> I am using Python 2.7.3 on a linux 64bit (Fedora 18) and my Biopython is >> 1.6.1. 
>> >> Thank you, >> >> Ivan >> >> >> >> >> Ivan Gregoretti, PhD >> Bioinformatics >> >> >> >> On Mon, Mar 4, 2013 at 12:26 PM, Peter Cock >> wrote: >> > On Mon, Mar 4, 2013 at 5:20 PM, Ivan Gregoretti >> > wrote: >> >> I am trying this >> >> >> >> from Bio import SeqIO >> >> SeqIO.FastaIO.FastaWriter( handle, wrap=0 ).write_file( record ) >> >> >> > >> > That should be a list of SeqRecord objects, not just one SeqRecord. >> > If you have just one, try [record] instead. >> > >> > Peter >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > From p.j.a.cock at googlemail.com Mon Mar 4 15:37:00 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 4 Mar 2013 20:37:00 +0000 Subject: [Biopython] SeqIO.write and user-specified wrapping In-Reply-To: References: Message-ID: On Mon, Mar 4, 2013 at 8:34 PM, Ivan Gregoretti wrote: > I have to rectify myself. Both your code and Peter's suggestions work. > I had a bug in my code that prevented me from overriding the default > 60 character length. > > Thank you, > > Ivan No problem - thanks for letting us know :) Peter From dtomso at agbiome.com Mon Mar 4 15:44:20 2013 From: dtomso at agbiome.com (Dan Tomso) Date: Mon, 4 Mar 2013 15:44:20 -0500 Subject: [Biopython] Issues with BioSQL and BioPython--can't access features and annotation Message-ID: Hi all-- New list subscriber here. I'm having an issue using DBSeqRecord objects pulled via BioSQL. Any insight welcome! Hello all-- Running on Ubuntu 12.10 with Python 2.7, latest BioPython and BioSQL. I have successfully established the MySQL-based BioSQL server, and I can load sequences into the system properly (or they seem to be proper--tables are populated correctly in MySQL and things are generally error-free). However--when I retrieve via 'lookup,' I can only access the id, name, and description for the DBSeqRecords. 
Annotations and features are supposed to be called on demand, but this crashes things. For example: File "/usr/lib/pymodules/python2.7/Bio/SeqRecord.py", line 595, in __str__ lines.append("Number of features: %i" % len(self.features)) File "/usr/lib/pymodules/python2.7/BioSQL/BioSeq.py", line 516, in __get_features self._primary_id) File "/usr/lib/pymodules/python2.7/BioSQL/BioSeq.py", line 280, in _retrieve_features feature.location = SeqFeature.FeatureLocation(start, end) File "/usr/lib/pymodules/python2.7/Bio/SeqFeature.py", line 561, in __init__ raise TypeError(start) TypeError: 0 Any idea what is happening here? Thanks! Dan From p.j.a.cock at googlemail.com Mon Mar 4 15:52:21 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 4 Mar 2013 20:52:21 +0000 Subject: [Biopython] Issues with BioSQL and BioPython--can't access features and annotation In-Reply-To: References: Message-ID: On Mon, Mar 4, 2013 at 8:44 PM, Dan Tomso wrote: > Hi all-- > New list subscriber here. I'm having an issue using DBSeqRecord objects > pulled via BioSQL. Any insight welcome! > > Hello all-- Running on Ubuntu 12.10 with Python 2.7, latest BioPython and > BioSQL. > > I have successfully established the MySQL-based BioSQL server, and I can > load sequences into the system properly (or they seem to be proper--tables > are populated correctly in MySQL and things are generally error-free). Excellent. That should have been the hardest part done. Any feedback on how to improve the docs would be good - presumably you used this?: http://biopython.org/wiki/BioSQL > However--when I retrieve via 'lookup,' I can only access the id, name, and > description for the DBSeqRecords. Annotations and features are supposed to > be called on demand, but this crashes things. 
For example: > > File "/usr/lib/pymodules/python2.7/Bio/SeqRecord.py", line 595, in __str__ > lines.append("Number of features: %i" % len(self.features)) > File "/usr/lib/pymodules/python2.7/BioSQL/BioSeq.py", line 516, in > __get_features > self._primary_id) > File "/usr/lib/pymodules/python2.7/BioSQL/BioSeq.py", line 280, in > _retrieve_features > feature.location = SeqFeature.FeatureLocation(start, end) > File "/usr/lib/pymodules/python2.7/Bio/SeqFeature.py", line 561, in > __init__ > raise TypeError(start) > TypeError: 0 > Any idea what is happening here? Thanks! Dan Hmm. Do you have the original sequence file loaded into the database? If so, we could try and reproduce the problem - that would be the easiest way forward. Otherwise you might need to look into the start value for the location for that feature (e.g. SQL queries on the database, or some debug print statements in the Biopython code). Thanks, Peter From p.j.a.cock at googlemail.com Mon Mar 4 15:59:57 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 4 Mar 2013 20:59:57 +0000 Subject: [Biopython] Issues with BioSQL and BioPython--can't access features and annotation In-Reply-To: References: Message-ID: On Mon, Mar 4, 2013 at 8:52 PM, Peter Cock wrote: > On Mon, Mar 4, 2013 at 8:44 PM, Dan Tomso wrote: >> Hi all-- >> New list subscriber here. I'm having an issue using DBSeqRecord objects >> pulled via BioSQL. Any insight welcome! >> >> Hello all-- Running on Ubuntu 12.10 with Python 2.7, latest BioPython and >> BioSQL. >> >> ... >> >> However--when I retrieve via 'lookup,' I can only access the id, name, and >> description for the DBSeqRecords. Annotations and features are supposed to >> be called on demand, but this crashes things. 
For example: >> >> File "/usr/lib/pymodules/python2.7/Bio/SeqRecord.py", line 595, in __str__ >> lines.append("Number of features: %i" % len(self.features)) >> File "/usr/lib/pymodules/python2.7/BioSQL/BioSeq.py", line 516, in >> __get_features >> self._primary_id) >> File "/usr/lib/pymodules/python2.7/BioSQL/BioSeq.py", line 280, in >> _retrieve_features >> feature.location = SeqFeature.FeatureLocation(start, end) >> File "/usr/lib/pymodules/python2.7/Bio/SeqFeature.py", line 561, in >> __init__ >> raise TypeError(start) >> TypeError: 0 >> Any idea what is happening here? Thanks! Dan > > Hmm. Do you have the original sequence file loaded into the database? > If so, we could try and reproduce the problem - that would be the > easiest way forward. Otherwise you might need to look into the > start value for the location for that feature (e.g. SQL queries on the > database, or some debug print statements in the Biopython code). Based on the trace above, you could be running Biopython 1.60, but if you were running the current release of Biopython 1.61 you'd get a slightly more explicit TypeError here when there is something odd about the start co-ordinate. Right now my hunch is a low level C int vs long issue, which could be dependent on the OS and MySQL version, and thus escaped our testing. If I'm right, this should be an easy fix. Could you update to Biopython 1.61 (or the latest code from GitHub) and retest? Thanks, Peter From dtomso at agbiome.com Mon Mar 4 16:02:28 2013 From: dtomso at agbiome.com (Dan Tomso) Date: Mon, 4 Mar 2013 16:02:28 -0500 Subject: [Biopython] Issues with BioSQL and BioPython--can't access features and annotation In-Reply-To: References: Message-ID: Thanks, Peter. I don't have the files, but I pulled them just the other day via BioPython Entrez methods. 
Here are several identifiers that I just verified (as having the problem): 384448934 121612099 384117894 229587578 Basically, I pulled these down, loaded them into the DB, and can (sort of) retrieve the records, but when the records try to load up the annotations or features, things fall apart. I'll try to work up some comments on install. There was at least one SQL term that I needed to search and replace to get things to fly with the latest MySQL. Dan T. On Mon, Mar 4, 2013 at 3:52 PM, Peter Cock wrote: > On Mon, Mar 4, 2013 at 8:44 PM, Dan Tomso wrote: > > Hi all-- > > New list subscriber here. I'm having an issue using DBSeqRecord objects > > pulled via BioSQL. Any insight welcome! > > > > Hello all-- Running on Ubuntu 12.10 with Python 2.7, latest BioPython and > > BioSQL. > > > > I have successfully established the MySQL-based BioSQL server, and I can > > load sequences into the system properly (or they seem to be > proper--tables > > are populated correctly in MySQL and things are generally error-free). > > Excellent. That should have been the hardest part done. Any feedback on > how to improve the docs would be good - presumably you used this?: > http://biopython.org/wiki/BioSQL > > > However--when I retrieve via 'lookup,' I can only access the id, name, > and > > description for the DBSeqRecords. Annotations and features are supposed > to > > be called on demand, but this crashes things. 
For example: > > > > File "/usr/lib/pymodules/python2.7/Bio/SeqRecord.py", line 595, in > __str__ > > lines.append("Number of features: %i" % len(self.features)) > > File "/usr/lib/pymodules/python2.7/BioSQL/BioSeq.py", line 516, in > > __get_features > > self._primary_id) > > File "/usr/lib/pymodules/python2.7/BioSQL/BioSeq.py", line 280, in > > _retrieve_features > > feature.location = SeqFeature.FeatureLocation(start, end) > > File "/usr/lib/pymodules/python2.7/Bio/SeqFeature.py", line 561, in > > __init__ > > raise TypeError(start) > > TypeError: 0 > > Any idea what is happening here? Thanks! Dan > > Hmm. Do you have the original sequence file loaded into the database? > If so, we could try and reproduce the problem - that would be the > easiest way forward. Otherwise you might need to look into the > start value for the location for that feature (e.g. SQL queries on the > database, or some debug print statements in the Biopython code). > > Thanks, > > Peter > From dtomso at agbiome.com Mon Mar 4 16:03:39 2013 From: dtomso at agbiome.com (Dan Tomso) Date: Mon, 4 Mar 2013 16:03:39 -0500 Subject: [Biopython] Issues with BioSQL and BioPython--can't access features and annotation In-Reply-To: References: Message-ID: <2C715595-006E-4EF1-B508-288C4885EA94@agbiome.com> Sure, will do. DT On Mar 4, 2013, at 3:59 PM, Peter Cock wrote: > On Mon, Mar 4, 2013 at 8:52 PM, Peter Cock wrote: >> On Mon, Mar 4, 2013 at 8:44 PM, Dan Tomso wrote: >>> Hi all-- >>> New list subscriber here. I'm having an issue using DBSeqRecord objects >>> pulled via BioSQL. Any insight welcome! >>> >>> Hello all-- Running on Ubuntu 12.10 with Python 2.7, latest BioPython and >>> BioSQL. >>> >>> ... >>> >>> However--when I retrieve via 'lookup,' I can only access the id, name, and >>> description for the DBSeqRecords. Annotations and features are supposed to >>> be called on demand, but this crashes things. 
For example: >>> >>> File "/usr/lib/pymodules/python2.7/Bio/SeqRecord.py", line 595, in __str__ >>> lines.append("Number of features: %i" % len(self.features)) >>> File "/usr/lib/pymodules/python2.7/BioSQL/BioSeq.py", line 516, in >>> __get_features >>> self._primary_id) >>> File "/usr/lib/pymodules/python2.7/BioSQL/BioSeq.py", line 280, in >>> _retrieve_features >>> feature.location = SeqFeature.FeatureLocation(start, end) >>> File "/usr/lib/pymodules/python2.7/Bio/SeqFeature.py", line 561, in >>> __init__ >>> raise TypeError(start) >>> TypeError: 0 >>> Any idea what is happening here? Thanks! Dan >> >> Hmm. Do you have the original sequence file loaded into the database? >> If so, we could try and reproduce the problem - that would be the >> easiest way forward. Otherwise you might need to look into the >> start value for the location for that feature (e.g. SQL queries on the >> database, or some debug print statements in the Biopython code). > > Based on the trace above, you could be running Biopython 1.60, but > if you were running the current release of Biopython 1.61 you'd get > a slightly more explicit TypeError here when there is something odd > about the start co-ordinate. > > Right now my hunch is a low level C int vs long issue, which could > be dependent on the OS and MySQL version, and thus escaped > our testing. If I'm right, this should be an easy fix. > > Could you update to Biopython 1.61 (or the latest code from GitHub) > and retest? > > Thanks, > > Peter From tiagoantao at gmail.com Mon Mar 4 16:13:40 2013 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 4 Mar 2013 21:13:40 +0000 Subject: [Biopython] Issues with BioSQL and BioPython--can't access features and annotation In-Reply-To: References: Message-ID: On Mon, Mar 4, 2013 at 8:59 PM, Peter Cock wrote: > Could you update to Biopython 1.61 (or the latest code from GitHub) > and retest? > > For good or for bad, there have been a few changes on BioSQL on github. 
It might be more efficient to test with the github version (not 1.61),
either that or the comparison of results and creation of unit tests might
be a bit confusing at this stage (with the changing code base)

From dtomso at agbiome.com Mon Mar 4 16:37:33 2013
From: dtomso at agbiome.com (Dan Tomso)
Date: Mon, 4 Mar 2013 16:37:33 -0500
Subject: [Biopython] Issues with BioSQL and BioPython--can't access features and annotation
In-Reply-To:
References:
Message-ID: <16F38D8A-FA95-4EB8-BC63-03EA5091E90A@agbiome.com>

Gents--
Updating to the github version of BioPython has solved the problem, although
I am not greatly enlightened. If you are in debug mode, let me know if
there is anything I can do to help, otherwise I will proceed w/ some crazy
science . . .

Many thanks!!

Dan T.

On Mar 4, 2013, at 4:13 PM, Tiago Antão wrote:

> On Mon, Mar 4, 2013 at 8:59 PM, Peter Cock wrote:
> Could you update to Biopython 1.61 (or the latest code from GitHub)
> and retest?
>
> For good or for bad, there have been a few changes on BioSQL on github.
> It might be more efficient to test with the github version (not 1.61),
> either that or the comparison of results and creation of unit tests might
> be a bit confusing at this stage (with the changing code base)
>

From p.j.a.cock at googlemail.com Tue Mar 5 04:58:33 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 5 Mar 2013 09:58:33 +0000
Subject: [Biopython] Issues with BioSQL and BioPython--can't access features and annotation
In-Reply-To: <16F38D8A-FA95-4EB8-BC63-03EA5091E90A@agbiome.com>
References: <16F38D8A-FA95-4EB8-BC63-03EA5091E90A@agbiome.com>
Message-ID:

On Mon, Mar 4, 2013 at 9:37 PM, Dan Tomso wrote:
> Gents--
> Updating to the github version of BioPython has solved the problem, although
> I am not greatly enlightened. If you are in debug mode, let me know if
> there is anything I can do to help, otherwise I will proceed w/ some crazy
> science . . .
>
> Many thanks!!
>
> Dan T.
Hi Dan, I'm pretty sure this was an int/long issue (Python 2 specific, this distinction goes away in Python 3), with your database binding giving you longs rather than ints - in which case this was the fix: https://github.com/biopython/biopython/commit/4a67d851d1eda0a138b604c8aeffc151d331a29b That was included in Biopython 1.61 onwards, so you were just a bit unlucky with Biopython 1.60 to run into this. If you want to make absolutely sure, you could test with Biopython 1.61 and/or edit those lines in Bio/SeqFeature.py just to check. Thanks for helping diagnose this, Peter From mjldehoon at yahoo.com Wed Mar 6 02:45:30 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 5 Mar 2013 23:45:30 -0800 (PST) Subject: [Biopython] Sequence object "find" is still case specific? In-Reply-To: <5134A4ED.3070507@fold.natur.cuni.cz> Message-ID: <1362555930.84799.YahooMailClassic@web164001.mail.gq1.yahoo.com> --- On Mon, 3/4/13, Martin Mokrejs wrote: > I do use mixed-casing quite often and I think it is > acceptable to ask user to do the > .find like: > > s.tostring().upper().find('ACGTT') > > and leave the user slice out the mixed-cased match > eventually from the original sequence object. The problem though is that the call to .upper() will be slow if s is a long sequence. Trying this for human chromosome 1 showed that the search will take 20,000 times longer, and is unacceptably slow if you want to execute this search often. Best, -Michiel From p.j.a.cock at googlemail.com Wed Mar 6 05:11:16 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 6 Mar 2013 10:11:16 +0000 Subject: [Biopython] Sequence object "find" is still case specific? 
In-Reply-To: <1362555930.84799.YahooMailClassic@web164001.mail.gq1.yahoo.com>
References: <5134A4ED.3070507@fold.natur.cuni.cz> <1362555930.84799.YahooMailClassic@web164001.mail.gq1.yahoo.com>
Message-ID:

On Wed, Mar 6, 2013 at 7:45 AM, Michiel de Hoon wrote:
> --- On Mon, 3/4/13, Martin Mokrejs wrote:
>> I do use mixed-casing quite often and I think it is
>> acceptable to ask user to do the
>> .find like:
>>
>> s.tostring().upper().find('ACGTT')
>>
>> and leave the user slice out the mixed-cased match
>> eventually from the original sequence object.
>
> The problem though is that the call to .upper() will be slow if s is a
> long sequence. Trying this for human chromosome 1 showed that
> the search will take 20,000 times longer, and is unacceptably slow
> if you want to execute this search often.

With the current code, the simple route is to standardise all your
query and search strings into one case (e.g. upper case).

Might an optional case-insensitive search be useful if we
can make it fast with some optional C code (and a pure Python
fallback for PyPy, Jython, etc)?

Peter

From mmokrejs at fold.natur.cuni.cz Wed Mar 6 05:48:18 2013
From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs)
Date: Wed, 06 Mar 2013 11:48:18 +0100
Subject: [Biopython] Sequence object "find" is still case specific?
In-Reply-To:
References: <5134A4ED.3070507@fold.natur.cuni.cz> <1362555930.84799.YahooMailClassic@web164001.mail.gq1.yahoo.com>
Message-ID: <51371EF2.3000402@fold.natur.cuni.cz>

Peter Cock wrote:
> On Wed, Mar 6, 2013 at 7:45 AM, Michiel de Hoon wrote:
>> --- On Mon, 3/4/13, Martin Mokrejs wrote:
>>> I do use mixed-casing quite often and I think it is
>>> acceptable to ask user to do the
>>> .find like:
>>>
>>> s.tostring().upper().find('ACGTT')
>>>
>>> and leave the user slice out the mixed-cased match
>>> eventually from the original sequence object.
>>
>> The problem though is that the call to .upper() will be slow if s is a
>> long sequence.
Trying this for human chromosome 1 showed that
>> the search will take 20,000 times longer, and is unacceptably slow
>> if you want to execute this search often.

I convert raw 454 read sequences of up to about 1200 nt in length with
.upper(), but haven't studied the performance. I just wanted to avoid
re.compile() for every unpredictable query. The SeqIO.parse() objects are
still in mixed-casing, which is what I am happy with.

>
> With the current code, the simple route is to standardise all your
> query and search strings into one case (e.g. upper case).
>
> Might an optional case-insensitive search be useful if we
> can make it fast with some optional C code (and a pure Python
> fallback for PyPy, Jython, etc)?

Yes, if you provide a case-insensitive search interface to SeqIO objects
I will gladly use it instead of making a temporary copy of
s.tostring().upper(). But I do it only once for some, maybe not even all
reads; I would have to dig more into my code. ;-) Just in case you would be
considering the penalty during initial *all data* import.

Martin

From stephane.teletchea at inserm.fr Thu Mar 7 16:37:07 2013
From: stephane.teletchea at inserm.fr (Téletchéa Stéphane)
Date: Thu, 07 Mar 2013 22:37:07 +0100
Subject: [Biopython] Some help to access "hidden" features :-)
Message-ID: <51390883.2020602@inserm.fr>

Dear biopythoners,

I am struggling to extract some information from a uniprot file.

a) Get the initial file, for instance
http://www.uniprot.org/uniprot/P02724.xml
b) parse it:

python
>>> from Bio import SeqIO
>>> record=list(SeqIO.parse("P02724.xml",'uniprot-xml'))
>>> print record[0].dbxrefs
...

>>> for i in record[0].dbxrefs:
...     if 'PDB:' in i:
...         print i
...
PDB:1AFO
PDB:1MSR
PDB:2KPE
PDB:2KPF

In the Uniprot file, there are annotations for the 1AFO model:
NMR method, starts at 81 and ends at 120.
The corresponding entry in the xml file is:

According to the module source code
(http://biopython.org/DIST/docs/api/Bio.SeqIO.UniprotIO-pysrc.html),
it should be possible to access these data, which are handled correctly:

def _parse_dbReference(element):
    self.ParsedSeqRecord.dbxrefs.append(element.attrib['type'] + ':' + element.attrib['id'])
    ...

However, I'm unable to go further than the "print i" above ...

How can I extract this information for the 'i' object above?
Do I have to use another approach?

Thanks a lot for your comments, links and remarks.

Stéphane

PS: sent in plain text this time...

--
Equipe DSIMB - Dynamique des Structures et des Interactions des Macromolécules Biologiques
INTS, INSERM-Paris-Diderot UMR-S665
6 rue Alexandre Cabanel - 75739 Paris cedex 15- France
Tél : +33 144 493 057
Fax : +33 147 347 431
http://www.dsimb.inserm.fr / http://steletch.free.fr

From stephane.teletchea at inserm.fr Thu Mar 7 16:26:37 2013
From: stephane.teletchea at inserm.fr (Téletchéa Stéphane)
Date: Thu, 07 Mar 2013 22:26:37 +0100
Subject: [Biopython] Some help to access "hidden" features :-)
Message-ID: <5139060D.5020806@inserm.fr>

Dear biopythoners,

I am struggling to extract some information from a uniprot file.

a) Get the initial file, for instance
http://www.uniprot.org/uniprot/P02724.xml
b) parse it:

python
>>> from Bio import SeqIO
>>> record=list(SeqIO.parse("P02724.xml",'uniprot-xml'))
>>> print record[0].dbxrefs
...

>>> for i in record[0].dbxrefs:
...     if 'PDB:' in i:
...         print i
...
PDB:1AFO
PDB:1MSR
PDB:2KPE
PDB:2KPF

In the Uniprot file, there are annotations for the 1AFO model:
NMR method, starts at 81 and ends at 120.
The corresponding entry in the xml file is: According to the module source code (http://biopython.org/DIST/docs/api/Bio.SeqIO.UniprotIO-pysrc.html), it is possible to access these datas, they are correctly handled: def _parse_dbReference(element): 299 self.ParsedSeqRecord.dbxrefs .append (element.attrib['type'] + ':' + element.attrib['id']) 300 #e.g. 301 # 302 # 303 # 304 # 305 # However, I'm unable to go futher the "print i" above ... How can I extract this information for the 'i' object above? Do I have to use another approach? Thanks a lot for your comments, links and remarks. St?phane -- Equipe DSIMB - Dynamique des Structures et des Interactions des Macromol?cules Biologiques INTS, INSERM-Paris-Diderot UMR-S665 6 rue Alexandre Cabanel - 75739 Paris cedex 15- France T?l : +33 144 493 057 Fax : +33 147 347 431 http://www.dsimb.inserm.fr / http://steletch.free.fr From p.j.a.cock at googlemail.com Thu Mar 7 17:19:15 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 7 Mar 2013 22:19:15 +0000 Subject: [Biopython] Some help to access "hidden" features :-) In-Reply-To: <5139060D.5020806@inserm.fr> References: <5139060D.5020806@inserm.fr> Message-ID: On Thu, Mar 7, 2013 at 9:26 PM, T?letch?a St?phane wrote: > Dear biopythoners, > > I am struggling in extracting some informations from a uniprot file. > > a) Get the inital file, for instance > http://www.uniprot.org/uniprot/P02724.xml > b) parse it: > > python >>>> from Bio import SeqIO >>>> record=list(SeqIO.parse("P02724.xml",'uniprot-xml')) >>>> print record[0].dbxrefs > ... > >>>> for i in record[0].dbxrefs: > ... if 'PDB:' in i: > ... print i > ... > PDB:1AFO > PDB:1MSR > PDB:2KPE > PDB:2KPF Excellent - a self contained example :) That makes it much easier for us to see what you're doing and how to help. Thank you. > In the Uniprot file, there are annotations for the 1AFO model: > NMR method, starts at 81 and ends at 120. 
> > The corresponding entry in the xml file is: > > > > > > > According to the module source code > (http://biopython.org/DIST/docs/api/Bio.SeqIO.UniprotIO-pysrc.html), > it is possible to access these datas, they are correctly handled: > > def _parse_dbReference(element): > self.ParsedSeqRecord.dbxrefs.append(element.attrib['type'] + ':' + element.attrib['id']) > ... As you will have seen, the SeqRecord's dbxrefs does get populated with the key information - but this is (based on usage in other file formats) a very simple list of strings. Right now the extra information *is not returned*, mainly as it doesn't naturally map onto the existing SeqRecord model. A little later in that method you'd have seen a comment: "TODO - How best to store these, do SeqFeatures make sense?" and the following lines created a SeqFeature object, but never add it to the returned SeqRecord. Elsewhere the UniProt file does have things we store as SeqFeature objects - so doing this for the database reference information is a bit odd. Perhaps we'd be better off following the approach used for references in GenBank files instead? I'm unclear what is best (partly since I don't use these bits of data). What do you think the parser should do with this data? [Note that in this situation you might be better off using one of the Python standard library modules to work with the XML directly (e.g. ElementTree or cElementTree) if you need all the details in the UniProt XML file which are not yet handled in the conversion to a SeqRecord object.] 
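
As a rough illustration of the ElementTree route mentioned in the note above, the sketch below pulls the extra details out of each PDB cross-reference. It is written for Python 3 and assumes that the UniProt XML namespace is http://uniprot.org/uniprot and that each dbReference element carries child property elements with "type" and "value" attributes. The inline SAMPLE string is a made-up, cut-down stand-in for P02724.xml (its values only mirror the NMR / 81-120 annotation described earlier), and pdb_cross_references is a hypothetical helper name, not part of Biopython.

```python
import io
import xml.etree.ElementTree as ET

# Assumed namespace of the UniProt XML schema.
NS = "{http://uniprot.org/uniprot}"

# Made-up, cut-down stand-in for P02724.xml (illustrative values only).
SAMPLE = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>P02724</accession>
    <dbReference type="PDB" id="1AFO">
      <property type="method" value="NMR"/>
      <property type="chains" value="A/B=81-120"/>
    </dbReference>
    <dbReference type="Other" id="EXAMPLE"/>
  </entry>
</uniprot>
"""


def pdb_cross_references(handle):
    """Return {PDB id: {property type: value}} for every PDB dbReference."""
    tree = ET.parse(handle)
    found = {}
    for ref in tree.iter(NS + "dbReference"):
        if ref.get("type") != "PDB":
            continue  # skip cross-references to other databases
        found[ref.get("id")] = {
            prop.get("type"): prop.get("value")
            for prop in ref.findall(NS + "property")
        }
    return found


# With a real download, pass open("P02724.xml", "rb") instead.
pdb_refs = pdb_cross_references(io.BytesIO(SAMPLE.encode("utf-8")))
for pdb_id, props in sorted(pdb_refs.items()):
    print(pdb_id, props.get("method"), props.get("chains"))
```

Keeping the result as a plain dictionary keyed on the PDB identifier makes it easy to line up with the "PDB:1AFO" style strings that SeqIO already puts in dbxrefs.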
Regards, Peter From stephane.teletchea at inserm.fr Fri Mar 8 09:46:02 2013 From: stephane.teletchea at inserm.fr (=?ISO-8859-1?Q?T=E9letch=E9a_St=E9phane?=) Date: Fri, 08 Mar 2013 15:46:02 +0100 Subject: [Biopython] Some help to access "hidden" features :-) In-Reply-To: References: <5139060D.5020806@inserm.fr> Message-ID: <5139F9AA.6040403@inserm.fr> Le 07/03/2013 23:19, Peter Cock a ?crit : > > Excellent - a self contained example :) :-) > That makes it much easier for us to see what you're > doing and how to help. Thank you. > >> In the Uniprot file, there are annotations for the 1AFO model: >> NMR method, starts at 81 and ends at 120. >> >> The corresponding entry in the xml file is: >> >> >> >> >> >> >> According to the module source code >> (http://biopython.org/DIST/docs/api/Bio.SeqIO.UniprotIO-pysrc.html), >> it is possible to access these datas, they are correctly handled: >> >> def _parse_dbReference(element): >> self.ParsedSeqRecord.dbxrefs.append(element.attrib['type'] + ':' + element.attrib['id']) >> ... > As you will have seen, the SeqRecord's dbxrefs does get > populated with the key information - but this is (based on > usage in other file formats) a very simple list of strings. > > Right now the extra information *is not returned*, mainly as > it doesn't naturally map onto the existing SeqRecord model. OK, I was suspected this, but this is confirmed, thank you. > A little later in that method you'd have seen a comment: > "TODO - How best to store these, do SeqFeatures make sense?" > and the following lines created a SeqFeature object, but never > add it to the returned SeqRecord. Elsewhere the UniProt > file does have things we store as SeqFeature objects - so > doing this for the database reference information is a bit > odd. Perhaps we'd be better off following the approach > used for references in GenBank files instead? I'm unclear > what is best (partly since I don't use these bits of data). 
> > What do you think the parser should do with this data? Bah, parse all :-) Seriously, this is very nice that a lot of fields are properly parsed, I'm not involved enough to give my opinion on this, I assume this was more a comment to say "we should may be post in elsewhere, but after a second reading I understand better. > [Note that in this situation you might be better off using one > of the Python standard library modules to work with the XML > directly (e.g. ElementTree or cElementTree) if you need > all the details in the UniProt XML file which are not yet > handled in the conversion to a SeqRecord object.] Yes, this is what I will do, these informations were only factual in reality, I was planning to retrieve this information by other means too. > > Regards, > > Peter Thanks a lot for the rapid answer, and keep up the good job on biopython, it is really appreciated. St?phane -- Equipe DSIMB - Dynamique des Structures et des Interactions des Macromol?cules Biologiques INTS, INSERM-Paris-Diderot UMR-S665 6 rue Alexandre Cabanel - 75739 Paris cedex 15- France T?l : +33 144 493 057 Fax : +33 147 347 431 http://www.dsimb.inserm.fr / http://steletch.free.fr From p.j.a.cock at googlemail.com Fri Mar 8 11:00:39 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 8 Mar 2013 16:00:39 +0000 Subject: [Biopython] Problems with reading Swiss format records (swissprot specific date fields) In-Reply-To: <20130304154006.GA4227@paxarchia.galaxy.uni> References: <20130304154006.GA4227@paxarchia.galaxy.uni> Message-ID: On Mon, Mar 4, 2013 at 3:40 PM, Jan T Kim wrote: > Dear All, > > trying to parse the attached Swissprot record gives me a stack trace: > > Traceback (most recent call last): > File "./swisstest", line 7, in > e = Bio.SeqIO.read(sys.argv[1], 'swiss') > File "/usr/lib/pymodules/python2.7/Bio/SeqIO/__init__.py", line 599, in read > first = iterator.next() > File "/usr/lib/pymodules/python2.7/Bio/SeqIO/__init__.py", line 537, in parse > for r in i: > 
File "/usr/lib/pymodules/python2.7/Bio/SeqIO/SwissIO.py", line 97, in SwissIterator > annotations['date'] = swiss_record.created[0] > TypeError: 'NoneType' object has no attribute '__getitem__' > > The problem is at line 99 (rather than 97)of > https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SwissIO.py : > > annotations['date'] = swiss_record.created[0] > > without an "if swiss_record.created is not None" test or something > similar. The parse function of Bio.SwissProt initialises the created > instance variable to None, and only if a "DT" record containing the > string "INTEGRATED" (case insensitive) is found, created is set to that > date. > > The same kind of problem occurs with the sequence_update variable in the > next statement: > > annotations['date_last_sequence_update'] = swiss_record.sequence_update[0] > > Would it be sensible to set the 'date' and 'date_last_sequence_update' > entries of the annotations dictionary only if the values are actually > found in the swiss_record? I understand that with a genuine SwissProt > record, they should always be there, but this happened to me when working > on files generated from the refseq protein database using the EMBOSS > seqret program with -osformat=swiss, which doesn't seem like an entirely > exotic use case to me. > > Best regards, Jan Good idea - this should now work in the next release: https://github.com/biopython/biopython/commit/6d4d3838920bbb92e4acacc94d76ab3312417ca8 Can we use your example file for a test case? Thanks, Peter From anaryin at gmail.com Fri Mar 8 17:59:53 2013 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 8 Mar 2013 23:59:53 +0100 Subject: [Biopython] Updating GSOC page? In-Reply-To: References: Message-ID: Small update: http://biopython.org/wiki/GSOC If ok, We can just link the normal one for this one. I kept it separate just in case. 
2013/3/4 Peter Cock

> On Sun, Mar 3, 2013 at 11:07 PM, João Rodrigues wrote:
> > Hello all,
> >
> > Does anyone oppose a refresh of our GSOC
> > page based on the
> > BioRuby
> > page? It could use
> > a facelift before the new round of projects/students come in.
> >
> > Best,
> >
> > João
>
> A good idea - see also the GSoC discussions on the biopython-dev
> list about potential project ideas.
>
> Thanks,
>
> Peter
>

From matthew.m.mccormick at gmail.com Fri Mar 8 21:24:03 2013
From: matthew.m.mccormick at gmail.com (Matthew McCormick)
Date: Sat, 9 Mar 2013 02:24:03 +0000
Subject: [Biopython] Scipy 2013 Conference Announcement
Message-ID:

SciPy 2013, the twelfth annual Scientific Computing with Python conference,
will be held this June 24th-29th in Austin, Texas. SciPy is a community
dedicated to the advancement of scientific computing through open source
Python software for mathematics, science, and engineering. The annual SciPy
Conference allows participants from academic, commercial, and governmental
organizations to showcase their latest projects, learn from skilled users
and developers, and collaborate on code development.

The conference consists of two days of tutorials followed by two days of
presentations, and concludes with two days of developer sprints on projects
of interest to attendees.

Specialized Tracks
------------------

This year we are happy to announce two specialized tracks that run in
parallel to the general conference:

*Machine Learning*

In recent years, Python's machine learning libraries have rapidly matured,
with a flurry of new libraries and cutting-edge algorithm implementation and
development occurring within Python. As Python makes these algorithms more
accessible, machine learning algorithm application has spread across
disciplines. Showcase your favorite machine learning library or how it has
been used as an effective tool in your work!
*Reproducible Science*

Over recent years, the Open Science movement has stoked a renewed
acknowledgement of the importance of reproducible research. The goals of
this movement include improving the dissemination of progress, preventing
fraud through transparency, and enabling deeper/wider development and
collaboration. This track is to discuss the tools and methods used to
achieve reproducible scientific computing.

Domain-specific Mini-symposia
-----------------------------

Introduced in 2012, mini-symposia are held to discuss scientific computing
applied to a specific scientific domain/industry during a half afternoon
after the general conference. Their goal is to promote industry specific
libraries and tools, and gather people with similar interests for
discussions. Mini-symposia on the following topics will take place this
year:

- Astronomy and astrophysics
- Bioinformatics
- Medical imaging
- Meteorology, climatology, and atmospheric and oceanic science

Tutorials
---------

Multiple interactive half-day tutorials will be taught by community experts.
The tutorials provide conceptual and practical coverage of tools that have
broad interest at either an introductory or an advanced level. This year, a
third track will be added, which specifically targets programmers with no
prior knowledge of scientific Python.

Developer Sprints
-----------------

A hackathon environment is set up for attendees to work on the core SciPy
packages or their own personal projects. The conference is an opportunity
for developers that are usually physically separated to come together and
engage in highly productive sessions. It is also an occasion for new
community members to introduce themselves and receive tips from community
experts. This year, some of the sprints will be scheduled and announced
ahead of the conference.

Birds-of-a-Feather (BOF) Sessions
---------------------------------

Birds-of-a-Feather sessions are self-organized discussions that run parallel
to the main conference.
The BOF sessions cover primary, tangential, or unrelated topics in an
interactive, discussion setting. This year, some of the BOF sessions will be
scheduled and announced ahead of the conference.

Important Dates
---------------

- March 20th: Presentation abstracts, poster, tutorial submission deadline.
  Application for sponsorship deadline.
- April 15th: Speakers selected
- April 22nd: Sponsorship acceptance deadline
- May 1st: Speaker schedule announced
- May 6th: Early-bird registration ends
- June 24th-29th: 2 days of tutorials, 2 days of conference, 2 days of sprints

We look forward to a very exciting conference and hope to see you all there.

The SciPy2013 organization team:

* Andy Terrel, Co-chair
* Jonathan Rocher, Co-chair
* Katy Huff, Program Committee co-chair
* Matt McCormick, Program Committee co-chair
* Dharhas Pothina, Tutorial co-chair
* Francesc Alted, Tutorial co-chair
* Corran Webster, Sprint co-chair
* Peter Wang, Sprint co-chair
* Matthew Turk, BoF co-chair
* Jarrod Millman, Proceeding co-chair
* Stéfan van der Walt, Proceeding co-chair
* Anthony Scopatz, Communications co-chair
* Majken Tranby, Communications co-chair
* Jeff Daily, Financial Aid co-chair
* John Wiggins, Financial Aid co-chair
* Leah Jones, Operations chair
* Brett Murphy, Sponsor chair
* Bill Cowan, Financial chair

From anaryin at gmail.com Wed Mar 13 07:09:29 2013
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Wed, 13 Mar 2013 12:09:29 +0100
Subject: [Biopython] Updating GSOC page?
In-Reply-To:
References:
Message-ID:

Hello all,

I updated the GSOC page on the wiki to be more organized:
http://biopython.org/wiki/GSOC

If no one opposes, I'll replace the current page (here) with it, just in
time for GSOC 2013.

Best,

João

PS. Sorry for the spamming, but I posted this 5 days ago in the non-dev list
and got no answers so..

2013/3/8 João Rodrigues

> Small update: http://biopython.org/wiki/GSOC
>
> If ok, We can just link the normal one for this one.
I kept it separate > just in case. > > > 2013/3/4 Peter Cock > >> On Sun, Mar 3, 2013 at 11:07 PM, Jo?o Rodrigues >> wrote: >> > Hello all, >> > >> > Does any oppose to a refreshment of our GSOC >> > pagebased on the >> > BioRuby >> > page ? It >> could use >> > a facelift before the new round of projects/students come in. >> > >> > Best, >> > >> > Jo?o >> >> A good idea - see also the GSoC discussions on the biopython-dev >> list about potential project ideas. >> >> Thanks, >> >> Peter >> > > From mikael.trellet at gmail.com Wed Mar 13 07:17:17 2013 From: mikael.trellet at gmail.com (Mikael Trellet) Date: Wed, 13 Mar 2013 12:17:17 +0100 Subject: [Biopython] Updating GSOC page? In-Reply-To: References: Message-ID: It's well-formated and looks nice for me, the improvement from the former one is signifcant so I would agree to update the page. Good work ;) Mikael On Wed, Mar 13, 2013 at 12:09 PM, Jo?o Rodrigues wrote: > Hello all, > > I updated the GSOC page on the wiki to be more organized: > http://biopython.org/wiki/GSOC > > If no one opposes, I'll replace the current page > (here) > with it, just in time for GSOC 2013. > > Best, > > Jo?o > > PS. sorry for the spamming but I posted this 5 days ago in the non dev list > and got no answers so.. > > > 2013/3/8 Jo?o Rodrigues > > > Small update: http://biopython.org/wiki/GSOC > > > > If ok, We can just link the normal one for this one. I kept it separate > > just in case. > > > > > > 2013/3/4 Peter Cock > > > >> On Sun, Mar 3, 2013 at 11:07 PM, Jo?o Rodrigues > >> wrote: > >> > Hello all, > >> > > >> > Does any oppose to a refreshment of our GSOC > >> > pagebased on the > >> > BioRuby > >> > page ? It > >> could use > >> > a facelift before the new round of projects/students come in. > >> > > >> > Best, > >> > > >> > Jo?o > >> > >> A good idea - see also the GSoC discussions on the biopython-dev > >> list about potential project ideas. 
> >> > >> Thanks, > >> > >> Peter > >> > > > > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- -------------------------------------------- Mikael TRELLET, - Groupe VENISE, CNRS LIMSI 91403 Orsay CEDEX - LBT/IBPC, 75005 Paris France +33650607172 From p.j.a.cock at googlemail.com Wed Mar 13 08:04:28 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 13 Mar 2013 12:04:28 +0000 Subject: [Biopython] [Biopython-dev] Updating GSOC page? In-Reply-To: References: Message-ID: On Wed, Mar 13, 2013 at 11:09 AM, Jo?o Rodrigues wrote: > Hello all, > > I updated the GSOC page on the wiki to be more organized: > http://biopython.org/wiki/GSOC > > If no one opposes, I'll replace the current page > (here) > with it, just in time for GSOC 2013. > > Best, > > Jo?o Sounds sensible, and you can set a direct on GSOC to Google_Summer_of_Code by replacing the content with: #REDIRECT [[link]] Peter From anaryin at gmail.com Wed Mar 13 09:22:23 2013 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 13 Mar 2013 14:22:23 +0100 Subject: [Biopython] [Biopython-dev] Updating GSOC page? In-Reply-To: References: Message-ID: Done, thanks. http://biopython.org/wiki/Google_Summer_of_Code http://biopython.org/wiki/GSOC 2013/3/13 Peter Cock > On Wed, Mar 13, 2013 at 11:09 AM, Jo?o Rodrigues > wrote: > > Hello all, > > > > I updated the GSOC page on the wiki to be more organized: > > http://biopython.org/wiki/GSOC > > > > If no one opposes, I'll replace the current page > > (here) > > with it, just in time for GSOC 2013. 
> >
> > Best,
> >
> > João
>
> Sounds sensible, and you can set a direct on GSOC to
> Google_Summer_of_Code by replacing the content with:
>
> #REDIRECT [[link]]
>
> Peter
>

From natassa_g_2000 at yahoo.com Sun Mar 17 16:04:08 2013
From: natassa_g_2000 at yahoo.com (natassa)
Date: Sun, 17 Mar 2013 13:04:08 -0700 (PDT)
Subject: [Biopython] fastq manipulations speed
Message-ID: <1363550648.1404.YahooMailNeo@web160803.mail.bf1.yahoo.com>

Hi biopython list,

I have a few fasta files that come from processing fastq Illumina reads for
quality, polyAs, adaptors etc, but I need to get their associated qualities
back. I wrote a simple script that calls the following 2 functions, which I
think are the fastest way to deal with fastq-fasta files in Biopython, but
the script is awfully slow:
For example, for one of my files, after 41h of run, only 28000 records out
of 28 million have been processed. My files contain between 28-40 million
reads, so I need to somehow make it faster if this is possible. Any ideas or
any things you might see in the code that make it so slow?

def makeDictofFasta_withLgth(fastafile):
    '''tested on Illumina IIx files after my cleaning routine, ie sequences used in velvet'''
    mydict={}
    info= SeqIO.index(fastafile, "fasta")
    for rec in info.keys():
        mydict[rec]=len(info[rec].seq)
    print 'finished making dictionary of fasta records'
    return mydict

def Addquals_inTrimmedFastA(fastq, newfastq, fasta):
    outfastq=open(newfastq, "w")
    fasta_dict=makeDictofFasta_withLgth(fasta)
    fq_dict = SeqIO.index(fastq,"fastq-illumina")
    for record in fasta_dict.keys():
        if record in fq_dict.keys():
            length= fasta_dict[record]
            sub_rec = fq_dict[record][0:length]
            outfastq.write(sub_rec.format('fastq-illumina'))
    outfastq.close()

Thanks in advance,

Natassa

From chris.mit7 at gmail.com Sun Mar 17 16:22:49 2013
From: chris.mit7 at gmail.com (Chris Mitchell)
Date: Sun, 17 Mar 2013 16:22:49 -0400
Subject: [Biopython] fastq manipulations speed
In-Reply-To: <1363550648.1404.YahooMailNeo@web160803.mail.bf1.yahoo.com>
References: <1363550648.1404.YahooMailNeo@web160803.mail.bf1.yahoo.com>
Message-ID:

Hi Natassa,

First, I wouldn't bother indexing. This seems a one-and-done operation and
indexing is thus a waste of time. Have the list of stuff you want to find
first, then iterate through the fasta file looking for what you want.

In general though, what are you hoping to accomplish with the qualities?
That would help immensely with any feedback and best practice suggestions.
Are you just doing QC? If so, fastQC might be a better option than rolling
your own solution.

One comment on the code that will speed it up:
don't use if record in fq_dict.keys(). That returns a list which is going
to have a lookup time proportional to the list size. Do:
fq_keys = set(fq_dict.keys()) and then if record in fq_keys, this will be
O(1) lookup time.

Chris

On Sun, Mar 17, 2013 at 4:04 PM, natassa wrote:

> Hi biopython list,
>
> I have a few fasta files that come from processing fastq Illumina reads
> for quality, polyAs, adaptors etc, but I need to get their associated
> qualities back. I wrote a simple script that calls the following 2
> functions, which I think are the fastest way to deal with fastq-fasta
> files in Biopython, but the script is awfully slow:
> For example, for one of my files, after 41h of run, only 28000 records out
> of 28 million have been processed. My files contain between 28-40 million
> reads, so I need to somehow make it faster if this is possible. Any ideas
> or any things you might see in the code that make it so slow?
>
> def makeDictofFasta_withLgth(fastafile):
>     '''tested on Illumina IIx files after my cleaning routine, ie
> sequences used in velvet'''
>     mydict={}
>     info= SeqIO.index(fastafile, "fasta")
>     for rec in info.keys():
>         mydict[rec]=len(info[rec].seq)
>     print 'finished making dictionary of fasta records'
>     return mydict
>
>
> def Addquals_inTrimmedFastA(fastq, newfastq, fasta):
>     outfastq=open(newfastq, "w")
>     fasta_dict=makeDictofFasta_withLgth(fasta)
>     fq_dict = SeqIO.index(fastq,"fastq-illumina")
>
>     for record in fasta_dict.keys():
>         if record in fq_dict.keys():
>             length= fasta_dict[record]
>             sub_rec = fq_dict[record][0:length]
>             outfastq.write(sub_rec.format('fastq-illumina'))
>     outfastq.close()
>
> Thanks in advance,
>
> Natassa
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From p.j.a.cock at googlemail.com Sun Mar 17 17:24:33 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 17 Mar 2013 21:24:33 +0000
Subject: [Biopython] fastq manipulations speed
In-Reply-To:
References: <1363550648.1404.YahooMailNeo@web160803.mail.bf1.yahoo.com>
Message-ID:

On Sun, Mar 17, 2013 at 8:22 PM, Chris Mitchell wrote:
> Hi Natassa,
>
> First, I wouldn't bother indexing. This seems a one-and-done operation and
> indexing is thus a waste of time. Have the list of stuff you want to find
> first, then iterate through the fasta file looking for what you want.

You might be able to do a paired iteration between the trimmed
FASTA file and the untrimmed quality file. I'll reply separately
with comments on the current code...

> One comment on the code that will speed it up:
> don't use if record in fq_dict.keys(). That returns a list which is going
> to have a lookup time proportional to the list size. Do:
> fq_keys = set(fq_dict.keys()) and then if record in fq_keys, this will be
> O(1) lookup time.
>
> Chris

That's an excellent point, but both dictionaries and sets use hash
based lookups for speed, and should be about the same, i.e. instead
of this:

if record in fq_dict.keys():
    #do stuff

Use this:

if record in fq_dict:
    #do stuff

That is also considered better style.

Another related point, rather than:

for record in fasta_dict.keys():
    #do stuff

this would typically be written as:

for record in fasta_dict:
    #do stuff

In this case it would be a little faster since there is no need to
run the keys method, but will do the same thing.

Peter

From natassa_g_2000 at yahoo.com Sun Mar 17 19:02:12 2013
From: natassa_g_2000 at yahoo.com (natassa)
Date: Sun, 17 Mar 2013 16:02:12 -0700 (PDT)
Subject: [Biopython] fastq manipulations speed
In-Reply-To:
References: <1363550648.1404.YahooMailNeo@web160803.mail.bf1.yahoo.com>
Message-ID: <1363561332.69787.YahooMailNeo@web160805.mail.bf1.yahoo.com>

Thanks Peter,

I am using Python quite often, but I was missing the fact that I don't need
the keys method (!), and I always used dictionaries instead of sets. I am
not very clear though about this point, I mean, is the use of a set faster
or not in general?
I have adapted the script according to your suggestions, see below. I am
also not sure about the format point you raised, I mean, is calling an
internal function used by SeqIO.write slower than SeqIO.write itself?

def Addquals_inTrimmedFastA(fastq, newfastq, fasta):
    outfastq=open(newfastq, "w")
    length_dict = dict((rec.id, len(rec)) for rec in
                       SeqIO.parse(fasta, "fasta"))
    for record in SeqIO.parse(fastq,"fastq-illumina"):
        if record.id in length_dict:
            #print 'found: '+record.id
            length=length_dict[record.id]
            #print 'will write the substring from 0 to: '+str(length-1)
            sub_rec = record[0:(length-1)]
            SeqIO.write(sub_rec, outfastq, 'fastq-illumina')
    outfastq.close()

Will let you know if this is faster. I appreciate the advice a lot, which I
really need to improve my programming skills! I have exactly the tendency of
making things more complicated than they are.

As for the trimming, yes it was done only on the end, based on thresholds I
obtained by plots using fastq files and the fastx toolkit. I then imposed
the thresholds in a hard-coded way in all my files, and proceeded with
further trimming of polyAs and NNs in the resulting fastas. I know it would
have been better to have kept the qualities, and that this trimming method
is not the most common, but this was done a long time ago, when I was not
sure about many things and this routine made sense to me (I mean, I still
think there is nothing wrong about it). I also knew that I would not need
these qualities, since the assembler I used does not take them into account.
So I simply thought that dragging info that is not necessary would be just a
pain, as files are also bigger.

Thanks,
Natassa

________________________________
From: Peter Cock
To: natassa
Sent: Sunday, March 17, 2013 2:38 PM
Subject: Re: [Biopython] fastq manipulations speed

On Sun, Mar 17, 2013 at 8:04 PM, natassa wrote:
> Hi biopython list,
>
> I have a few fasta files that come from processing fastq Illumina reads
> for quality, polyAs, adaptors etc, but I need to get their associated
> qualities back. I wrote a simple script that calls the following 2
> functions, which I think are the fastest way to deal with fastq-fasta files
> in Biopython, but the script is awfully slow:

You're using some relatively sophisticated bits of SeqIO, but
you've made it more complicated than it needs to be.

> For example, for one of my files, after 41h of run, only 28000 records out
> of 28 million have been processed. My files contain between 28-40 million
> reads, so I need to somehow make it faster if this is possible. Any ideas or
> any things you might see in the code that make it so slow?
>
> def makeDictofFasta_withLgth(fastafile):
>     '''tested on Illumina IIx files after my cleaning routine, ie
> sequences used in velvet'''
>     mydict={}
>     info= SeqIO.index(fastafile, "fasta")
>     for rec in info.keys():
>         mydict[rec]=len(info[rec].seq)
>     print 'finished making dictionary of fasta records'
>     return mydict

In this example you don't need the index - all you need to do
is one loop over the file while building up the dictionary of
lengths, e.g.

length_dict = {}
for rec in SeqIO.parse(fastafile, "fasta"):
    length_dict[rec.id] = len(rec)

Or more elegantly using a generator expression:

length_dict = dict((rec.id, len(rec)) for rec in SeqIO.parse(fastafile, "fasta"))

By using an index like this, and looping over the reads in
whatever order the dictionary uses (based on the hashing
algorithm), you are doing a lot of wasteful disk access jumping
back and forth in the FASTA file. This will I think explain the
main source of slowness in your script.

> def Addquals_inTrimmedFastA(fastq, newfastq, fasta):
>     outfastq=open(newfastq, "w")
>     fasta_dict=makeDictofFasta_withLgth(fasta)
>     fq_dict = SeqIO.index(fastq,"fastq-illumina")
>
>     for record in fasta_dict.keys():
>         if record in fq_dict.keys():
>             length= fasta_dict[record]
>             sub_rec = fq_dict[record][0:length]
>             outfastq.write(sub_rec.format('fastq-illumina'))
>     outfastq.close()

Again, you don't need the index. Doing this you'll process the
reads in the order Python uses for the dictionary (essentially
random order), meaning lots and lots of wasted and slow disk
access as the Biopython indexing jumps to the records in the
FASTQ file in this arbitrary order. Just make a single loop over
the file with SeqIO.parse.

Also don't use the SeqRecord's format method like that - the
help text tries to direct you to the SeqIO.write method which
will be faster (the format method calls this internally).
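
Put together, the two suggestions above (one pass over each file, and SeqIO.write rather than the format method) might look like the following minimal sketch. It is written for a current Biopython on Python 3 rather than the Python 2 of the original script. The function name add_quals_to_trimmed and the tiny in-memory demo data are illustrative assumptions, and it assumes the trimming only removed bases from the end of each read.

```python
from io import StringIO

from Bio import SeqIO


def add_quals_to_trimmed(fastq_handle, fasta_handle, out_handle):
    """Write FASTQ records cut down to the lengths found in the trimmed FASTA.

    Returns the number of records written.
    """
    # One pass over the FASTA file: read id -> trimmed length.
    length_by_id = {rec.id: len(rec) for rec in SeqIO.parse(fasta_handle, "fasta")}
    written = 0
    # One pass over the FASTQ file, in file order (no index, no seeking).
    for record in SeqIO.parse(fastq_handle, "fastq-illumina"):
        length = length_by_id.get(record.id)
        if length is None:
            continue  # read was discarded entirely by the cleaning step
        # Slicing a SeqRecord slices its per-letter qualities as well.
        written += SeqIO.write(record[:length], out_handle, "fastq-illumina")
    return written


# Tiny made-up demo: read1 was trimmed to 5 bases, read2 was discarded.
fastq = StringIO("@read1\nACGTACGT\n+\nhhhhhhhh\n@read2\nTTTTGGGG\n+\nhhhhhhhh\n")
fasta = StringIO(">read1\nACGTA\n")
out = StringIO()
count = add_quals_to_trimmed(fastq, fasta, out)
print(count)
print(out.getvalue())
```

With real data the three handles would be ordinary open files. Only the dictionary of lengths is held in memory; the FASTQ records are streamed one at a time.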

With those changes you should get far more sensible run times -
but there is still a lot of room for improvement. Do you want to
try out the suggestions so far, and then we can make a second
round of feedback? That would be my recommendation if you are
hoping to improve your Python skills.

As an aside, are you *sure* your trimming pipeline has only
trimmed the ends of the sequences? That seems to be the
assumption you've made here - but if you have any barcodes
they'll be trimmed from the start of the sequences.

In general it would be far better to do the trimming on the
FASTQ files, rather than on the FASTA files and then trying
to fix the qualities. As Chris pointed out, there are existing
well tested quality control/trimming libraries which might be
worth checking.

I hope that is useful,

Peter

From p.j.a.cock at googlemail.com Sun Mar 17 20:07:26 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 18 Mar 2013 00:07:26 +0000
Subject: [Biopython] fastq manipulations speed
In-Reply-To: <1363561332.69787.YahooMailNeo@web160805.mail.bf1.yahoo.com>
References: <1363550648.1404.YahooMailNeo@web160803.mail.bf1.yahoo.com> <1363561332.69787.YahooMailNeo@web160805.mail.bf1.yahoo.com>
Message-ID:

On Sun, Mar 17, 2013 at 11:02 PM, natassa wrote:
> Thanks Peter,
> I am using python quite often, but I was missing the fact that I don't need
> the keys method (!), and I always used dictionaries instead of sets. I am
> not very clear though about this point, I mean, is the use of set faster or
> not in general?

For checking membership, "key in my_dict" and "key in my_set"
should take about the same time - and both will be much faster
than "key in my_list" or "key in my_tuple" when you have a lot
of things to check.

If all you want the data structure for is checking membership,
then use a set. If you need to associate a value with the key,
then use a dictionary. Because they don't store a separate
value for each key, sets use less memory than dicts.
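
The membership point can be checked with a small standard-library-only sketch. The container names (ids_list, ids_set, lengths) and the size are arbitrary choices for illustration, and no timings are quoted because they vary from machine to machine.

```python
import timeit

n = 20000
ids_list = ["read%d" % i for i in range(n)]
ids_set = set(ids_list)                            # membership only
lengths = dict((name, 100) for name in ids_list)   # membership plus a value

probe = "read%d" % (n - 1)  # worst case for the list: the last element

# All three containers agree on the answer...
assert (probe in ids_list) and (probe in ids_set) and (probe in lengths)

# ...but the list is scanned item by item, while set and dict hash the key.
for label, container in [("list", ids_list), ("set", ids_set), ("dict", lengths)]:
    seconds = timeit.timeit(lambda: probe in container, number=200)
    print("%-4s %.6f s for 200 lookups" % (label, seconds))
```

On a typical run the list line should be the slowest by a wide margin, and the gap grows with n, which is why a list-based membership test inside a loop over millions of reads becomes so costly.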
Note that sets only became a built-in type in Python 2.4, so many
books and guides written before then will often use a dict instead.
Also note that neither sets nor dicts preserve the order of the
elements, which is sometimes an important reason to use a list or
tuple instead.

Hopefully the updated script is working better for you - I can think
of at least a few more suggestions worth trying. However, before
making it faster - is it doing what you wanted? Are you sure you
should be using length-1 when you trim the FASTQ records?

Regards,

Peter

From p.j.a.cock at googlemail.com Sun Mar 17 20:33:21 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 18 Mar 2013 00:33:21 +0000
Subject: [Biopython] fastq manipulations speed
In-Reply-To: <1363566077.39961.YahooMailNeo@web160805.mail.bf1.yahoo.com>
References: <1363550648.1404.YahooMailNeo@web160803.mail.bf1.yahoo.com>
 <1363561332.69787.YahooMailNeo@web160805.mail.bf1.yahoo.com>
 <1363566077.39961.YahooMailNeo@web160805.mail.bf1.yahoo.com>
Message-ID:

On Mon, Mar 18, 2013 at 12:21 AM, natassa wrote:
> Thanks, the length-1 was an error, it was supposed to be 0:length to get the
> qualities of the associated trimmed files. The script seems to be running
> much faster! But what would be your other suggestions?
> Natassa

You should be able to refactor the code to make a single call to
SeqIO.write by giving it a generator which constructs all the
trimmed records. That would require a bit of thought and
experience with iterators, generator functions and/or generator
expressions - but can be a really powerful way to think about
things. I'm expecting this to be faster, but the second idea
below will definitely be faster, perhaps five times as fast...

More straightforwardly, you don't need to use SeqRecord
objects for this task - they make slicing the sequence and
quality easier, but come with a performance cost.
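As a sketch of that second idea, assuming the low-level parser meant here is FastqGeneralIterator from Bio.SeqIO.QualityIO, which yields plain (title, sequence, quality) string tuples rather than SeqRecord objects. The function names and the `lengths` dictionary (read ID to trimmed length) are illustrative only.

```python
def trim_fastq_tuples(fastq_tuples, lengths):
    """Yield (title, seq, qual) cut to lengths[read_id], skipping other reads."""
    for title, seq, qual in fastq_tuples:
        read_id = title.split(None, 1)[0]  # ID is the title up to the first space
        length = lengths.get(read_id)
        if length is not None:
            yield title, seq[:length], qual[:length]


def write_trimmed_fastq(fastq_path, out_path, lengths):
    """Copy the wanted reads to out_path as trimmed FASTQ; return the count."""
    # Imported here so that trim_fastq_tuples() above has no dependencies.
    from Bio.SeqIO.QualityIO import FastqGeneralIterator

    count = 0
    with open(fastq_path) as in_handle, open(out_path, "w") as out_handle:
        reads = FastqGeneralIterator(in_handle)
        for title, seq, qual in trim_fastq_tuples(reads, lengths):
            # The quality string is copied as-is, so the encoding is unchanged.
            out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
            count += 1
    return count
```

Because the quality string is sliced as plain text, no quality scores are decoded or re-encoded, which is where most of the saving over SeqRecord objects would be expected to come from.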
See:
http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

In addition, consider doing the same for the FASTA file with:

from Bio.SeqIO.FastaIO import SimpleFastaParser

(requires Biopython 1.61 or later - it looks like that wasn't
highlighted in the release notes, which was an oversight).

Good night,

Peter

From sameer at blueplastic.com Sun Mar 17 21:15:57 2013
From: sameer at blueplastic.com (Sameer Farooqui)
Date: Sun, 17 Mar 2013 21:15:57 -0400
Subject: [Biopython] Which Alphabet type should I use with FASTA files in Biopython?
Message-ID:

If I'm using the FASTA files from the link below, what Alphabet type should
I use in Biopython? Would it be IUPAC.unambiguous_dna?

link to FASTA files:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/?C=S;O=A

- SFx

From p.j.a.cock at googlemail.com Mon Mar 18 06:37:52 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 18 Mar 2013 10:37:52 +0000
Subject: [Biopython] Which Alphabet type should I use with FASTA files in Biopython?
In-Reply-To:
References:
Message-ID:

On Mon, Mar 18, 2013 at 1:15 AM, Sameer Farooqui wrote:
> If I'm using the FASTA files from the link below, what Alphabet type should
> I use in Biopython? Would it be IUPAC.unambiguous_dna?
>
> link to FASTA files:
> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/?C=S;O=A
>

Right now I would suggest using generic_dna, as in:

from Bio.Alphabet import generic_dna

That doesn't give an explicit list of expected letters, unlike the
IUPAC alphabet which does (upper case only).

This is an area of Biopython likely to change in future releases to
try to enforce the white-list of an alphabet against the letters in
the sequence being used.

Peter

P.S.
Duplicate post here: http://www.biostars.org/p/66687/

From p.j.a.cock at googlemail.com Tue Mar 19 05:24:39 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 19 Mar 2013 09:24:39 +0000
Subject: [Biopython] fastq manipulations speed
In-Reply-To: <1363654515.68500.YahooMailNeo@web160802.mail.bf1.yahoo.com>
References: <1363550648.1404.YahooMailNeo@web160803.mail.bf1.yahoo.com>
 <1363561332.69787.YahooMailNeo@web160805.mail.bf1.yahoo.com>
 <1363566077.39961.YahooMailNeo@web160805.mail.bf1.yahoo.com>
 <1363654515.68500.YahooMailNeo@web160802.mail.bf1.yahoo.com>
Message-ID:

Excellent :)

P.S. Try to include the mailing list in your replies

On Tuesday, March 19, 2013, natassa wrote:

> Hello,
> Just to let you know that the script adapted with your suggestions
> completed very fast, i.e. within a few hours only. Thank you!
> Natassa
>
>
> ------------------------------
> *From:* Peter Cock (p.j.a.cock at googlemail.com)
> *To:* natassa (natassa_g_2000 at yahoo.com)
> *Cc:* Biopython Mailing List
> *Sent:* Sunday, March 17, 2013 5:33 PM
> *Subject:* Re: [Biopython] fastq manipulations speed
>
> On Mon, Mar 18, 2013 at 12:21 AM, natassa wrote:
> > Thanks, the length-1 was an error, it was supposed to be 0:length to get the
> > qualities of the associated trimmed files. The script seems to be running
> > much faster! But what would be your other suggestions?
> > Natassa
>
> You should be able to refactor the code to make a single call to
> SeqIO.write by giving it a generator which constructs all the
> trimmed records. That would require a bit of thought and
> experience with iterators, generator functions and/or generator
> expressions - but can be a really powerful way to think about
> things. I'm expecting this to be faster, but the second idea
> below will definitely be faster, perhaps five times as fast...
>
> More straightforwardly, you don't need to use SeqRecord
> objects for this task - they make slicing the sequence and
> quality easier, but come with a performance cost. See:
> http://news.open-bio.org/news/2009/09/biopython-fast-fastq/
>
> In addition, consider doing the same for the FASTA file with:
> from Bio.SeqIO.FastaIO import SimpleFastaParser
> (requires Biopython 1.61 or later - looks like that wasn't
> highlighted in the release notes which was an oversight).
>
> Good night,
>
> Peter
>
>
>

From debruinjj at gmail.com Wed Mar 20 07:39:15 2013
From: debruinjj at gmail.com (Jurgens de Bruin)
Date: Wed, 20 Mar 2013 13:39:15 +0200
Subject: [Biopython] Restriction - REBASE
Message-ID:

Hi,

I would like to know if it is possible to determine whether an enzyme
in the Restriction class is a nicking enzyme or not. If so, which of
the following contains the info:

'all_suppliers', 'buffers', 'catalyse', 'catalyze', 'charac',
'characteristic', 'compatible_end', 'compsite', 'cut_once', 'cut_twice',
'dna', 'elucidate', 'equischizomers', 'freq', 'frequency', 'fst3', 'fst5',
'inact_temp', 'is_3overhang', 'is_5overhang', 'is_ambiguous', 'is_blunt',
'is_comm', 'is_defined', 'is_equischizomer', 'is_isoschizomer',
'is_methylable', 'is_neoschizomer', 'is_palindromic', 'is_unknown',
'isoschizomers', 'mro', 'neoschizomers', 'on_minus', 'opt_temp',
'overhang', 'ovhg', 'ovhgseq', 'results', 'scd3', 'scd5', 'search',
'site', 'size', 'substrat', 'suppl', 'supplier_list', 'suppliers'

--
Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/
distinti saluti/siong
Jurgens de Bruin

From nicolas.joannin at gmail.com Thu Mar 21 04:30:46 2013
From: nicolas.joannin at gmail.com (Nicolas Joannin)
Date: Thu, 21 Mar 2013 17:30:46 +0900
Subject: [Biopython] Bio.Entrez.epost error with Python 3.2
In-Reply-To:
References: <1359259570.36220.YahooMailClassic@web164003.mail.gq1.yahoo.com>
 <1359261714.79007.YahooMailClassic@web164006.mail.gq1.yahoo.com>
Message-ID:

Hi everyone,

I am still in need of a fix for this problem...

I have made an attempt at fixing it myself by adding a line of code in
Bio.Entrez.__init__.py:

At line 451 there is:

    try:
        if post:
            #HTTP POST
            options = options.encode('utf-8')  #Line added by Nicolas
            handle = urllib.request.urlopen(cgi, data=options)
        else:
            #HTTP GET
            cgi += "?" + options
            handle = urllib.request.urlopen(cgi)
    except urllib.error.HTTPError as exception:
        raise exception

In my (limited) testing, it seems to work...
Would this be suitable?

Best regards,
Nicolas

"Because the world owes me nothing, and we owe each other the world"
Ani Difranco

On Mon, Jan 28, 2013 at 4:04 AM, Peter Cock wrote:

> On Sun, Jan 27, 2013 at 4:41 AM, Michiel de Hoon wrote:
> > Looking at this some more, I found this on the mailing list explaining
> > why we are using post=True:
> >
> > http://lists.open-bio.org/pipermail/biopython/2009-May/005152.html
>
> Yes, we use post (as the name epost suggests) to upload a long list
> of IDs without the long URL limitations faced if using an HTTP get.
>
> > This page provides some explanation on urllib.parse.urlencode in Python3:
> >
> > http://docs.python.org/3/library/urllib.request.html#urllib-examples
> >
> > Does this mean we have a subtle Python 2 vs 3 problem with epost?
>
> Time for another unit test in test_Entrez_online.py which
> currently only tests einfo and efetch - we should have
> esearch, epost, espell and esummary in there too I think.
>
> Peter
>

From p.j.a.cock at googlemail.com Thu Mar 21 06:44:33 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 21 Mar 2013 10:44:33 +0000
Subject: [Biopython] Bio.Entrez.epost error with Python 3.2
In-Reply-To:
References: <1359259570.36220.YahooMailClassic@web164003.mail.gq1.yahoo.com>
 <1359261714.79007.YahooMailClassic@web164006.mail.gq1.yahoo.com>
Message-ID:

On Thu, Mar 21, 2013 at 8:30 AM, Nicolas Joannin wrote:
> Hi everyone,
>
> I am still in need of a fix for this problem...

Is this a new problem which you have not previously reported?

> I have made an attempt at fixing it myself by adding a line of code in
> Bio.Entrez.__init__.py:
>
> At line 451 there is:
>
>     try:
>         if post:
>             #HTTP POST
>             options = options.encode('utf-8')  #Line added by Nicolas
>             handle = urllib.request.urlopen(cgi, data=options)
>         else:
>             #HTTP GET
>             cgi += "?" + options
>             handle = urllib.request.urlopen(cgi)
>     except urllib.error.HTTPError as exception:
>         raise exception
>
> In my (limited) testing, it seems to work...
> Would this be suitable?
>
> Best regards,
> Nicolas

Assuming you are running this on Python 3.2, it is likely a
bytes versus unicode issue. I'm guessing, but we'd probably
want something like this to handle both Python 2 and 3
(untested):

from Bio._py3k import _as_bytes
...
handle = urllib.request.urlopen(cgi, data=_as_bytes(options))

Could you share an example script (just a few lines) to reproduce
the problem please? And include the error message you get.

Thanks,

Peter

From p.j.a.cock at googlemail.com Thu Mar 21 11:58:22 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 21 Mar 2013 15:58:22 +0000
Subject: [Biopython] Translation of partial codons
Message-ID:

Hi all,

I was prompted by a recent BioPerl thread to check out how Biopython
handles translation of partial codons:

http://lists.open-bio.org/pipermail/bioperl-l/2013-March/037085.html

Here's a tiny example, a partial sequence ending "CC".
If we assume this is an incomplete codon, i.e. "CCN", we can translate
this into an amino acid - in this case with the standard table it is
translated unambiguously as proline, "P".

>>> from Bio.Seq import translate
>>> translate("AAACCC")
'KP'
>>> translate("AAACC")
'K'
>>> translate("AAAC")
'K'
>>> translate("AAA")
'K'
>>> translate("CCN")
'P'
>>> translate("CC")
''
>>> translate("C")
''
>>> translate("")
''

This behaviour surprised me, and as far as I recall this Biopython
behaviour is undocumented. Since I rewrote the current translation
code, I am partly to blame for not considering this corner case.
Whatever we agree should happen will need some unit tests.

Personally I think Biopython should be raising an exception on these
partial codons - or at least a warning, rather than as it does now
silently ignoring them. I don't think we need yet another option here.

If the user knows they are dealing with incomplete sequences (e.g.
partial CDS from an EST assembly or PCR product), then they can
explicitly check the length and add "N" or "NN" to round it up to a
whole number of codons (ensure the length is a multiple of three).

Any thoughts?

Thanks,

Peter

From mmokrejs at fold.natur.cuni.cz Thu Mar 21 12:24:45 2013
From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs)
Date: Thu, 21 Mar 2013 17:24:45 +0100
Subject: [Biopython] Translation of partial codons
In-Reply-To:
References:
Message-ID: <514B344D.30801@fold.natur.cuni.cz>

Peter Cock wrote:
> Hi all,
>
> I was prompted by a recent BioPerl thread to check out how Biopython
> handles translation of partial codons:
>
> http://lists.open-bio.org/pipermail/bioperl-l/2013-March/037085.html
>
> Here's a tiny example, a partial sequence ending "CC". If we assume
> this is an incomplete codon, i.e. "CCN", we can translate this into an
> amino acid - in this case with the standard table it is translated
> unambiguously as proline, "P".
>
> >>> from Bio.Seq import translate
> >>> translate("AAACCC")
> 'KP'
> >>> translate("AAACC")
> 'K'
> >>> translate("AAAC")
> 'K'
> >>> translate("AAA")
> 'K'
> >>> translate("CCN")
> 'P'
> >>> translate("CC")
> ''
> >>> translate("C")
> ''
> >>> translate("")
> ''
>
> This behaviour surprised me, and as far as I recall this Biopython
> behaviour is undocumented. Since I rewrote the current translation
> code, I am partly to blame for not considering this corner case.
> Whatever we agree should happen will need some unit tests.
>
> Personally I think Biopython should be raising an exception on these
> partial codons - or at least a warning, rather than as it does now
> silently ignoring them. I don't think we need yet another option here.
>
> If the user knows they are dealing with incomplete sequences (e.g.
> partial CDS from an EST assembly or PCR product), then they can
> explicitly check the length and add "N" or "NN" to round it up to a
> whole number of codons (ensure the length is a multiple of three).
>
> Any thoughts?

I agree that biopython should give an error as the length cannot be divided
by 3 without slack.

Martin

From p.j.a.cock at googlemail.com Thu Mar 21 12:33:29 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 21 Mar 2013 16:33:29 +0000
Subject: [Biopython] Translation of partial codons
In-Reply-To: <514B344D.30801@fold.natur.cuni.cz>
References: <514B344D.30801@fold.natur.cuni.cz>
Message-ID:

On Thu, Mar 21, 2013 at 4:24 PM, Martin Mokrejs wrote:
>
> I agree that biopython should give an error as the length cannot be divided
> by 3 without slack.
>
> Martin

So that's a +1 for an explicit error from Martin.

Similarly Pete Thorpe (off list) agreed an error would be useful, but
suggested a new option to specifically request the current behaviour.
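For reference, the explicit length handling suggested at the start of this thread (check the length, then add "N" or "NN") might look like the sketch below. The helper names are hypothetical and not part of Biopython; the cropping variant is included only to show how the current truncating behaviour could be requested explicitly.

```python
def pad_to_whole_codons(seq, pad="N"):
    """Return seq extended with pad characters to a multiple of three."""
    remainder = len(seq) % 3
    if remainder:
        seq = seq + pad * (3 - remainder)
    return seq


def crop_to_whole_codons(seq):
    """Return seq with any trailing partial codon removed."""
    return seq[:len(seq) - len(seq) % 3]
```

Passing crop_to_whole_codons(seq) to translate makes the silent truncation explicit, while pad_to_whole_codons(seq) asks for the final partial codon to be translated as an ambiguous one. Either way the caller has stated what should happen to the trailing bases.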
Thanks,

Peter

From idoerg at gmail.com Thu Mar 21 12:44:03 2013
From: idoerg at gmail.com (Iddo Friedberg)
Date: Thu, 21 Mar 2013 12:44:03 -0400
Subject: [Biopython] Translation of partial codons
In-Reply-To:
References: <514B344D.30801@fold.natur.cuni.cz>
Message-ID:

Perhaps raising an exception should be the default behavior. But I
suggest the user can pass an argument lengthcheck=False. In that case,
an exception will not be raised.

Iddo Friedberg
http://iddo-friedberg.net/contact.html
On Mar 21, 2013 12:38 PM, "Peter Cock" wrote:

On Thu, Mar 21, 2013 at 4:24 PM, Martin Mokrejs wrote:
>
> I agree tha...

So that's a +1 for an explicit error from Martin.

Similarly Pete Thorpe (off list) agreed an error would be useful, but
suggested a new option to specifically request the current behaviour.

Thanks,

Peter
_______________________________________________
Biopython mailing list - Biopython at lists.open-bio....

From mmokrejs at fold.natur.cuni.cz Thu Mar 21 12:56:20 2013
From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs)
Date: Thu, 21 Mar 2013 17:56:20 +0100
Subject: [Biopython] Translation of partial codons
In-Reply-To:
References: <514B344D.30801@fold.natur.cuni.cz>
Message-ID: <514B3BB4.70303@fold.natur.cuni.cz>

Iddo Friedberg wrote:
> Perhaps raising an exception should be the default behavior. But I
> suggest the user can pass an argument lengthcheck=False. In that case,
> an exception will not be raised.

But if somebody has to adjust existing python code and add the extra
argument, it is better to make the check in his/her code right away,
and either prepend/append N's to the string, or whatever else. I don't
see much sense in introducing the extra argument at all. Existing
code will break for good, and those affected will just fix their code. ;)

Normally I wouldn't be so sure but in this case I am really fond of
raising an exception.
If the character is missing in front of the string due to

s = myseq[start:stop]

while the user hasn't realized that for slicing one has to adjust to

s = myseq[start-1:stop]

, a typical error I bet, then these people will just be glad to hit the
exception, finally. ;-) Getting an answer in a wrong reading frame is
really bad.

Those who had one or two letters less on the right end won't care,
except maybe when they realize they have lost the trailing nucleotide
somewhere. Again, I bet a slicing issue will be the upstream problem.

Martin

>
> Iddo Friedberg
> http://iddo-friedberg.net/contact.html
>
>> On Mar 21, 2013 12:38 PM, "Peter Cock" wrote:
>>
>> On Thu, Mar 21, 2013 at 4:24 PM, Martin Mokrejs wrote:
>> >
>> > I agree tha...
>>
>> So that's a +1 for an explicit error from Martin.
>>
>> Similarly Pete Thorpe (off list) agreed an error would be useful, but
>> suggested a new option to specifically request the current behaviour.
>>
>> Thanks,
>>
>> Peter
>>
>> _______________________________________________
>> Biopython mailing list - Biopython at lists.open-bio....
>>

From idoerg at gmail.com Thu Mar 21 13:10:13 2013
From: idoerg at gmail.com (Iddo Friedberg)
Date: Thu, 21 Mar 2013 13:10:13 -0400
Subject: [Biopython] Translation of partial codons
In-Reply-To: <514B3BB4.70303@fold.natur.cuni.cz>
References: <514B344D.30801@fold.natur.cuni.cz>
 <514B3BB4.70303@fold.natur.cuni.cz>
Message-ID:

Suggestions so far:

1. Raise an exception. This may cause code running on existing data to
change behavior. I.e. it ran well before on bad length sequences, but as
of the new code installation, things will break.

2. Add a default length_check=True to the translate method. Again, this
may cause existing code to behave differently with the same data once
the user upgrades, unless the user explicitly changes the call to
myseq.translate(length_check=False)

3. My suggestion: use length_check=False as default. Code behaves the same
as before, so no data-induced breakages.
If the user wants to check length, they explicitly pass a True value.
So we give the option of checking length, while retaining legacy code
behavior.

length_check, being an argument, does not need to be passed explicitly.

On Thu, Mar 21, 2013 at 12:56 PM, Martin Mokrejs <mmokrejs at fold.natur.cuni.cz> wrote:

>
>
> Iddo Friedberg wrote:
> > Perhaps raising an exception should be the default behavior. But I
> > suggest the user can pass an argument lengthcheck=False. In that case, an
> > exception will not be raised.
>
>
> But if somebody has to adjust existing python code and add the extra
> argument, it is better to make the check in his/her code right away,
> and either prepend/append N's to the string, or whatever else. I don't
> see much sense in introducing the extra argument at all. Existing
> code will break for good, and those affected will just fix their code. ;)
>
> Normally I wouldn't be so sure but in this case I am really fond of
> raising an exception.
> If the character is missing in front of the string due to
>
> s = myseq[start:stop]
>
> while the user hasn't realized that for slicing one has to adjust to
>
> s = myseq[start-1:stop]
>
> , a typical error I bet, then these people will just be glad to hit the
> exception, finally. ;-) Getting an answer in a wrong reading frame is
> really bad.
>
>
> Those who had one or two letters less on the right end won't care, except
> maybe when they realize they have lost the trailing nucleotide somewhere.
> Again, I bet a slicing issue will be the upstream problem.
>
> Martin
>
> >
> > Iddo Friedberg
> > http://iddo-friedberg.net/contact.html
> >
> >> On Mar 21, 2013 12:38 PM, "Peter Cock" <p.j.a.cock at googlemail.com> wrote:
> >>
> >> On Thu, Mar 21, 2013 at 4:24 PM, Martin Mokrejs wrote:
> >> >
> >> > I agree tha...
> >>
> >> So that's a +1 for an explicit error from Martin.
> >>
> >> Similarly Pete Thorpe (off list) agreed an error would be useful, but
> >> suggested a new option to specifically request the current behaviour.
> >>
> >> Thanks,
> >>
> >> Peter
> >>
> >> _______________________________________________
> >> Biopython mailing list - Biopython at lists.open-bio....
> >>
>

--
Iddo Friedberg
http://iddo-friedberg.net/contact.html
++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.>
++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----.
.>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>>
>>----.<--.>++++++.<<<<------------------------------------.

From p.j.a.cock at googlemail.com Thu Mar 21 13:19:45 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 21 Mar 2013 17:19:45 +0000
Subject: [Biopython] Translation of partial codons
In-Reply-To:
References: <514B344D.30801@fold.natur.cuni.cz>
 <514B3BB4.70303@fold.natur.cuni.cz>
Message-ID:

On Thu, Mar 21, 2013 at 5:10 PM, Iddo Friedberg wrote:
> Suggestions so far:
> 1. Raise an exception. This may cause code running on existing data to
> change behavior. I.e. it ran well before on bad length sequences, but as of
> the new code installation, things will break.

Yes, but in most cases this will be a good thing. The minority of people
knowingly dealing with partial sequences can make this explicit by first
ensuring their sequence is a multiple of three in length (by padding or
cropping as most appropriate to their use case).

> 2. Add a default length_check=True to the translate method. Again, this may
> cause existing code to behave differently with the same data once the user
> upgrades, unless the user explicitly changes the call to
> myseq.translate(length_check=False)

A sensible approach to making likely errors explicit, with an easy
work-around for the old implicit truncation. The downside is yet another
argument to the translate functions/methods, which are already pretty
complicated. I prefer (1).

> 3.
My suggestion: use length_check=False as default. Code behaves the same > as before, so no data-induced breakages. If the user wants to check length, > the explicitly pass a True value. So we give the option of checking length, > and retaining code-behavior legacy. > > length_check, being an argument, does not need to be passed explicitly. I don't like this, even though it is backwards compatible for the corner case. I think the old behaviour is a bug. Regards, Peter From idoerg at gmail.com Thu Mar 21 13:51:47 2013 From: idoerg at gmail.com (Iddo Friedberg) Date: Thu, 21 Mar 2013 13:51:47 -0400 Subject: [Biopython] Translation of partial codons In-Reply-To: References: <514B344D.30801@fold.natur.cuni.cz> <514B3BB4.70303@fold.natur.cuni.cz> Message-ID: I tend to agree with Peter rather than with myself :) If the current behavior is a bug, we should squash it. On Thu, Mar 21, 2013 at 1:19 PM, Peter Cock wrote: > On Thu, Mar 21, 2013 at 5:10 PM, Iddo Friedberg wrote: > > Suggstions so far: > > 1. Raise an exception. This may cause code running on existing data to > > change behavior. I.e. it ran before well on bad length sequences, but as > of > > the new code installtion, things will break. > > Yes, but in most cases this will be a good thing. The minority of people > knowingly dealing with partial sequences can make this explicit by first > ensuring their sequence is a multiple of three in length (by padding or > cropping as most appropriate to their use case). > > > 2. Add a default length_check=True to the translate method. Again, this > may > > cause exiting code to behave differently wiht the same data once user > > upgrades. Unless the user explicitly changes the call to > > myseq.translate(length_check=False) > > A sensible approach to making likely errors explicit, with an easy > work-around > for the old implicit truncation. The downside is yet another argument to > the > translate functions/methods, which are already pretty complicated. I > prefer (1). 
> > > 3. My suggestion: use length_check=False as default. Code behaves the > same > > as before, so no data-induced breakages. If the user wants to check > length, > > the explicitly pass a True value. So we give the option of checking > length, > > and retaining code-behavior legacy. > > > > length_check, being an argument, does not need to be passed explicitly. > > I don't like this, even though it is backwards compatible for the corner > case. I think the old behaviour is a bug. > > Regards, > > Peter > -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. From cjfields at illinois.edu Thu Mar 21 13:56:59 2013 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 21 Mar 2013 17:56:59 +0000 Subject: [Biopython] Translation of partial codons In-Reply-To: References: <514B344D.30801@fold.natur.cuni.cz> <514B3BB4.70303@fold.natur.cuni.cz> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF74DB3336@CITESMBX5.ad.uillinois.edu> On Mar 21, 2013, at 12:19 PM, Peter Cock wrote: > On Thu, Mar 21, 2013 at 5:10 PM, Iddo Friedberg wrote: >> Suggstions so far: >> 1. Raise an exception. This may cause code running on existing data to >> change behavior. I.e. it ran before well on bad length sequences, but as of >> the new code installtion, things will break. > > Yes, but in most cases this will be a good thing. The minority of people > knowingly dealing with partial sequences can make this explicit by first > ensuring their sequence is a multiple of three in length (by padding or > cropping as most appropriate to their use case). > >> 2. Add a default length_check=True to the translate method. Again, this may >> cause exiting code to behave differently wiht the same data once user >> upgrades. 
Unless the user explicitly changes the call to
>> myseq.translate(length_check=False)
>
> A sensible approach to making likely errors explicit, with an easy work-around
> for the old implicit truncation. The downside is yet another argument to the
> translate functions/methods, which are already pretty complicated. I prefer (1).
>
>> 3. My suggestion: use length_check=False as default. Code behaves the same
>> as before, so no data-induced breakages. If the user wants to check length,
>> they explicitly pass a True value. So we give the option of checking length,
>> and retaining code-behavior legacy.
>>
>> length_check, being an argument, does not need to be passed explicitly.
>
> I don't like this, even though it is backwards compatible for the corner
> case. I think the old behaviour is a bug.
>
> Regards,
>
> Peter

That's basically the approach we took on the bioperl end, i.e. the old
behavior was an unintended bug (it was a little more complex than that in
reality, but in essence it boils down to that). It was too magic, and the
old behavior can be regained with a parameter setting. I don't think we
throw an exception, but maybe we should...

Anyway, I would think going with something that follows the tenet of
least surprise would be very python-esque :)

chris

From nicolas.joannin at gmail.com Thu Mar 21 21:50:15 2013
From: nicolas.joannin at gmail.com (Nicolas Joannin)
Date: Fri, 22 Mar 2013 10:50:15 +0900
Subject: [Biopython] Bio.Entrez.epost error with Python 3.2
In-Reply-To:
References: <1359259570.36220.YahooMailClassic@web164003.mail.gq1.yahoo.com>
 <1359261714.79007.YahooMailClassic@web164006.mail.gq1.yahoo.com>
Message-ID:

Hello Peter,

I might not have followed the correct procedure for reporting the problem.
If that is the case, please let me know the correct procedure.
What I had done was email the biopython mailing list with my question,
including an example and the output.
From that question, I received two replies from Michiel de Hoon (first and
second), to which I replied...
After that, you replied as well.
At that point, my wife gave birth and I was off the grid until recently...
As I mentioned, I am still in need of a solution, which is why I emailed
again (in reply to your last email).

Seeing your response, I am guessing that emailing the mailing list is not
quite the proper way, so, could you tell me the proper way to report
problems?

Regarding my and your suggested solutions, I have never worked with Python
2.X, so I don't know if my solution would break it when used with Python 2.
But I guess so.
I have tested your solution and that works as well in my testing.

(My testing, after modifying the Entrez.__init__.py file:

>>> from Bio import Entrez
>>> Entrez.email='my at email'
>>> post_h=Entrez.epost("nuccore",id="160418,160351")
>>>
)

Best regards,
Nicolas

"Because the world owes me nothing, and we owe each other the world"
Ani Difranco

On Thu, Mar 21, 2013 at 7:44 PM, Peter Cock wrote:

> On Thu, Mar 21, 2013 at 8:30 AM, Nicolas Joannin wrote:
> > Hi everyone,
> >
> > I am still in need of a fix for this problem...
>
> Is this a new problem which you have not previously reported?
>
> > I have made an attempt at fixing it myself by adding a line of code in
> > Bio.Entrez.__init__.py:
> >
> > At line 451 there is:
> >
> >     try:
> >         if post:
> >             #HTTP POST
> >             options = options.encode('utf-8')  #Line added by Nicolas
> >             handle = urllib.request.urlopen(cgi, data=options)
> >         else:
> >             #HTTP GET
> >             cgi += "?" + options
> >             handle = urllib.request.urlopen(cgi)
> >     except urllib.error.HTTPError as exception:
> >         raise exception
> >
> > In my (limited) testing, it seems to work...
> > Would this be suitable?
> >
> > Best regards,
> > Nicolas
>
> Assuming you are running this on Python 3.2, it is likely a
> bytes versus unicode issue.
I'm guessing, but we'd probably
> want something like this to handle both Python 2 and 3
> (untested):
>
> from Bio._py3k import _as_bytes
> ...
> handle = urllib.request.urlopen(cgi, data=_as_bytes(options))
>
> Could you share an example script (just a few lines) to reproduce
> the problem please? And include the error message you get.
>
> Thanks,
>
> Peter
>

From p.j.a.cock at googlemail.com Fri Mar 22 07:06:40 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 22 Mar 2013 11:06:40 +0000
Subject: [Biopython] Bio.Entrez.epost error with Python 3.2
In-Reply-To:
References: <1359259570.36220.YahooMailClassic@web164003.mail.gq1.yahoo.com>
 <1359261714.79007.YahooMailClassic@web164006.mail.gq1.yahoo.com>
Message-ID:

On Fri, Mar 22, 2013 at 1:50 AM, Nicolas Joannin wrote:
> Hello Peter,
>
> I might not have followed the correct procedure for reporting the problem.
> If that is the case, please let me know the correct procedure.
> What I had done was email the biopython mailing list with my question,
> including an example and the output.
> From that question, I received two replies from Michiel de Hoon (first and
> second), to which I replied...
> After that, you replied as well.

Sorry - I didn't reread the whole thread carefully enough - I saw the post
issue which was resolved, but missed the encode issue from before.

> Seeing your response, I am guessing that emailing the mailing list is not
> quite the proper way, so, could you tell me the proper way to report
> problems?

Email works to a point, but the formal bug tracker has advantages if
something isn't resolved quickly. We currently use RedMine (but have
discussed moving to GitHub's issue tracker):
http://redmine.open-bio.org/projects/biopython

> Regarding my and your suggested solutions, I have never worked with Python
> 2.X, so I don't know if my solution would break it when used with Python 2.
> But I guess so.
> I have tested your solution and that works as well in my testing.
> > (My testing, after modifying the Entrez.__init__.py file: > >>>>from Bio import Entrez >>>>Entrez.email='my at email' >>>>post_h=Entrez.epost("nuccore",id="160418,160351") >>>> > ) > We should be able to test this under a range of Python versions - thank you for the clarification. Peter From nicolas.joannin at gmail.com Fri Mar 22 10:05:37 2013 From: nicolas.joannin at gmail.com (Nicolas Joannin) Date: Fri, 22 Mar 2013 23:05:37 +0900 Subject: [Biopython] Bio.Entrez.epost error with Python 3.2 In-Reply-To: References: <1359259570.36220.YahooMailClassic@web164003.mail.gq1.yahoo.com> <1359261714.79007.YahooMailClassic@web164006.mail.gq1.yahoo.com> Message-ID: Hello Peter, Thank you for these explanations. And again thank you for your suggested fix. Best regards, Nicolas "Because the world owes me nothing, and we owe each other the world" Ani Difranco On Fri, Mar 22, 2013 at 8:06 PM, Peter Cock wrote: > On Fri, Mar 22, 2013 at 1:50 AM, Nicolas Joannin > wrote: > > Hello Peter, > > > > I might not have followed the correct procedure for reporting the > problem. > > If that is the case, please let me know the correct procedure. > > What I had done was email the biopython mailing list with my question, > > including an example and the output. > > From that question, I received two replies from Michiel de Hoon (first > and > > second), to which I replied to... > > After that, you replied as well. > > Sorry - I didn't reread the whole thread carefully enough - I saw the post > issue which was resoled, but missed the encode issue from before. > > > Seeing your response, I am guessing that emailing the mailing list is not > > quite the proper way, so, could you tell me the proper way to report > > problems? > > Email works to a point, but the formal bug tracker has advantages if > something isn't resolved quickly. 
We currently use RedMine (but have > discussed moving to GitHub's issue tracker): > http://redmine.open-bio.org/projects/biopython > > > Regarding my and your suggested solutions, I have never worked with > Python > > 2.X, so I don't know if my solution would break it when used with Python > 2. > > But I guess so. > > I have tested your solution and that works as well in my testing. > > > > (My testing, after modifying the Entrez.__init__.py file: > > > >>>>from Bio import Entrez > >>>>Entrez.email='my at email' > >>>>post_h=Entrez.epost("nuccore",id="160418,160351") > >>>> > > ) > > > > We should be able to test this under a range of Python versions - thank > you for the clarification. > > Peter > From p.j.a.cock at googlemail.com Fri Mar 22 10:11:12 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 22 Mar 2013 14:11:12 +0000 Subject: [Biopython] Bio.Entrez.epost error with Python 3.2 In-Reply-To: References: <1359259570.36220.YahooMailClassic@web164003.mail.gq1.yahoo.com> <1359261714.79007.YahooMailClassic@web164006.mail.gq1.yahoo.com> Message-ID: On Fri, Mar 22, 2013 at 2:05 PM, Nicolas Joannin wrote: > Hello Peter, > > Thank you for these explanations. > And again thank you for your suggested fix. > > Best regards, > Nicolas Hi Nicolas, Apologies for the confusion - the good news is that I think your fix works nicely, and I've applied the generalised version to the repository (and added a test for this): https://github.com/biopython/biopython/commit/f0f4536119947e7d4df838adf6283e545e0dee54 If you're OK with installing Biopython from git, having you double check this fix on your setup would be great. Are you happy to be thanked in the CONTRIB/NEWS file as a contributor to the next release? 
Thanks, Peter From nicolas.joannin at gmail.com Fri Mar 22 10:24:32 2013 From: nicolas.joannin at gmail.com (Nicolas Joannin) Date: Fri, 22 Mar 2013 23:24:32 +0900 Subject: [Biopython] Bio.Entrez.epost error with Python 3.2 In-Reply-To: References: <1359259570.36220.YahooMailClassic@web164003.mail.gq1.yahoo.com> <1359261714.79007.YahooMailClassic@web164006.mail.gq1.yahoo.com> Message-ID: Hi Peter, Thanks again for looking into this, it's greatly appreciated. I will look into installing from git, and testing it, on Monday when back at work (it's pas 11PM Friday evening, here in Japan ;)... As for the contrib/news... I do intend on participating as much as I can, as I improve my programming skills. But for the moment I am really just a beginner and would feel awkward being thanked for so little contribution. I do appreciate the offer, though! Best regards, Nicolas "Because the world owes me nothing, and we owe each other the world" Ani Difranco On Fri, Mar 22, 2013 at 11:11 PM, Peter Cock wrote: > On Fri, Mar 22, 2013 at 2:05 PM, Nicolas Joannin > wrote: > > Hello Peter, > > > > Thank you for these explanations. > > And again thank you for your suggested fix. > > > > Best regards, > > Nicolas > > Hi Nicolas, > > Apologies for the confusion - the good news is that I think > your fix works nicely, and I've applied the generalised > version to the repository (and added a test for this): > > https://github.com/biopython/biopython/commit/f0f4536119947e7d4df838adf6283e545e0dee54 > > If you're OK with installing Biopython from git, having you > double check this fix on your setup would be great. > > Are you happy to be thanked in the CONTRIB/NEWS file > as a contributor to the next release? 
> > Thanks, > > Peter > From p.j.a.cock at googlemail.com Fri Mar 22 10:30:20 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 22 Mar 2013 14:30:20 +0000 Subject: [Biopython] Bio.Entrez.epost error with Python 3.2 In-Reply-To: References: <1359259570.36220.YahooMailClassic@web164003.mail.gq1.yahoo.com> <1359261714.79007.YahooMailClassic@web164006.mail.gq1.yahoo.com> Message-ID: On Fri, Mar 22, 2013 at 2:24 PM, Nicolas Joannin wrote: > Hi Peter, > > Thanks again for looking into this, it's greatly appreciated. > I will look into installing from git, and testing it, on Monday when back at > work (it's pas 11PM Friday evening, here in Japan ;)... Thanks > As for the contrib/news... I do intend on participating as much as I can, as > I improve my programming skills. > But for the moment I am really just a beginner and would feel awkward being > thanked for so little contribution. > I do appreciate the offer, though! > > Best regards, > Nicolas I hope we'll see some more sizeable contributions from you in the future then -- although a one line bug fix is often valuable in itself ;) Good night, Peter From rz1991 at foxmail.com Fri Mar 22 16:02:37 2013 From: rz1991 at foxmail.com (=?gb18030?B?yO7vow==?=) Date: Sat, 23 Mar 2013 04:02:37 +0800 Subject: [Biopython] Tree Comparison in BioPython Message-ID: Hi, I was wondering if there is any tree comparison support in BioPython? I want to know if two trees are topologically the same. The idea I have now is to convert trees into an adjacency matrix and check their equality. This is not easy because internal node typically don't have a name and the structure of the adjacency matrix may be different. Or I may go through all internal nodes and check if they have the same terminals. This idea is also not straightforward as typically for an internal node in one tree I have to compare it against all the internal nodes in the other tree to potentially find a match. Is there any suggestions? Thanks!! 
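For illustration, the second idea above (checking that internal nodes have the same terminals) can be sketched without any all-against-all matching, by collecting each tree's clades into a set and comparing the sets. This is a plain-Python sketch on nested tuples, not Biopython's API; the helper names `clade_sets`, `same_topology` and `symmetric_difference_count` are hypothetical, and the trees are assumed to be rooted with unique leaf names:

```python
def clade_sets(tree):
    """Return the set of clades, each a frozenset of leaf names.

    `tree` is a nested tuple of leaf-name strings, e.g. (("A", "B"), "C").
    """
    clades = set()

    def collect(node):
        # A leaf contributes just its own name.
        if isinstance(node, str):
            return frozenset([node])
        # An internal node's leaf set is the union of its children's leaf sets.
        leaves = frozenset().union(*(collect(child) for child in node))
        clades.add(leaves)
        return leaves

    collect(tree)
    return clades


def same_topology(tree1, tree2):
    """True if two rooted trees contain exactly the same clades."""
    return clade_sets(tree1) == clade_sets(tree2)


def symmetric_difference_count(tree1, tree2):
    """Number of clades present in only one of the two trees."""
    return len(clade_sets(tree1) ^ clade_sets(tree2))


t1 = (("A", "B"), ("C", "D"))
t2 = (("D", "C"), ("B", "A"))  # same tree, children listed in another order
t3 = (("A", "C"), ("B", "D"))  # different grouping
print(same_topology(t1, t2), same_topology(t1, t3))
```

Because each clade is identified by its set of terminal names, unnamed internal nodes and child ordering do not matter in this sketch.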
Best, Zheng From eric.talevich at gmail.com Fri Mar 22 22:30:03 2013 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 22 Mar 2013 22:30:03 -0400 Subject: [Biopython] Tree Comparison in BioPython In-Reply-To: References: Message-ID: Hi Zheng, The usual way to do this is the Robinson-Foulds metric. It's not implemented directly in Biopython yet, but two programs that can do it are "treedist" in Phylip (a.k.a. ftreedist in Embassy, in the Debian package "embassy-phylip"), and RAxML with the "-f r" option. The Python library DendroPy also has a Robinson-Foulds distance function. See: https://groups.google.com/forum/?fromgroups=#!topic/raxml/JgvxgknTeqw http://emboss.bioinformatics.nl/cgi-bin/emboss/help/ftreedist http://pythonhosted.org/DendroPy/library/tree.html#dendropy.dataobject.tree.Tree.robinson_foulds_distance Hope that helps, Eric On Fri, Mar 22, 2013 at 4:02 PM, ?? wrote: > Hi, > > I was wondering if there is any tree comparison support in BioPython? I > want to know if two trees are topologically the same. > > > The idea I have now is to convert trees into an adjacency matrix and check > their equality. This is not easy because internal node typically don't have > a name and the structure of the adjacency matrix may be different. Or I may > go through all internal nodes and check if they have the same terminals. > This idea is also not straightforward as typically for an internal node in > one tree I have to compare it against all the internal nodes in the other > tree to potentially find a match. > > > Is there any suggestions? Thanks!! 
> > > Best, > Zheng > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > From rz1991 at foxmail.com Fri Mar 22 22:48:22 2013 From: rz1991 at foxmail.com (=?gb18030?B?yO7vow==?=) Date: Sat, 23 Mar 2013 10:48:22 +0800 Subject: [Biopython] Tree Comparison in BioPython Message-ID: Thanks Eric, I will check your suggestions to see if it can get my problem solved. Best, Zheng ------------------ Original ------------------ From: "Eric Talevich"; Date: Mar 23, 2013 To: "????"; Cc: "biopython"; Subject: Re: [Biopython] Tree Comparison in BioPython Hi Zheng, The usual way to do this is the Robinson-Foulds metric. It's not implemented directly in Biopython yet, but two programs that can do it are "treedist" in Phylip (a.k.a. ftreedist in Embassy, in the Debian package "embassy-phylip"), and RAxML with the "-f r" option. The Python library DendroPy also has a Robinson-Foulds distance function. See: https://groups.google.com/forum/?fromgroups=#!topic/raxml/JgvxgknTeqw http://emboss.bioinformatics.nl/cgi-bin/emboss/help/ftreedist http://pythonhosted.org/DendroPy/library/tree.html#dendropy.dataobject.tree.Tree.robinson_foulds_distance Hope that helps, Eric On Fri, Mar 22, 2013 at 4:02 PM, ???? wrote: Hi, I was wondering if there is any tree comparison support in BioPython? I want to know if two trees are topologically the same. The idea I have now is to convert trees into an adjacency matrix and check their equality. This is not easy because internal node typically don't have a name and the structure of the adjacency matrix may be different. Or I may go through all internal nodes and check if they have the same terminals. This idea is also not straightforward as typically for an internal node in one tree I have to compare it against all the internal nodes in the other tree to potentially find a match. Is there any suggestions? Thanks!! 
Best,
Zheng

_______________________________________________
Biopython mailing list  -  Biopython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython

From p.j.a.cock at googlemail.com  Sat Mar 23 10:55:19 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 23 Mar 2013 14:55:19 +0000
Subject: [Biopython] Translation of partial codons
In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF74DB3336@CITESMBX5.ad.uillinois.edu>
References: <514B344D.30801@fold.natur.cuni.cz> <514B3BB4.70303@fold.natur.cuni.cz> <118F034CF4C3EF48A96F86CE585B94BF74DB3336@CITESMBX5.ad.uillinois.edu>
Message-ID:

On Thu, Mar 21, 2013 at 5:56 PM, Fields, Christopher J wrote:
> On Mar 21, 2013, at 12:19 PM, Peter Cock
> wrote:
>
>> On Thu, Mar 21, 2013 at 5:10 PM, Iddo Friedberg wrote:
>>> Suggestions so far:
>>> 1. Raise an exception. This may cause code running on existing data to
>>> change behavior. I.e. it ran well before on bad length sequences, but as of
>>> the new code installation, things will break.
>>
>> Yes, but in most cases this will be a good thing. The minority of people
>> knowingly dealing with partial sequences can make this explicit by first
>> ensuring their sequence is a multiple of three in length (by padding or
>> cropping as most appropriate to their use case).
>>
>>> 2. Add a default length_check=True to the translate method. Again, this may
>>> cause existing code to behave differently with the same data once user
>>> upgrades. Unless the user explicitly changes the call to
>>> myseq.translate(length_check=False)
>>
>> A sensible approach to making likely errors explicit, with an easy work-around
>> for the old implicit truncation. The downside is yet another argument to the
>> translate functions/methods, which are already pretty complicated. I prefer (1).
>>
>>> 3. My suggestion: use length_check=False as default. Code behaves the same
>>> as before, so no data-induced breakages. If the user wants to check length,
>>> they explicitly pass a True value. So we give the option of checking length,
>>> and retaining code-behavior legacy.
>>>
>>> length_check, being an argument, does not need to be passed explicitly.
>>
>> I don't like this, even though it is backwards compatible for the corner
>> case. I think the old behaviour is a bug.
>>
>> Regards,
>>
>> Peter
>
> That's basically the approach we took on the bioperl end, e.g. the
> old behavior was an unintended bug (it was a little more complex
> than that in reality, but in essence it boils down to that). It was
> too magic, and the old behavior can be regained with a parameter
> setting. I don't think we throw an exception, but maybe we should...
>
> Anyway, I would think going with something that would be following
> the tenet of least surprise would be very python-esque :)
>
> chris

I started work on making this an exception, and from our test suite
realised that simple ORF finding is an example where this change
is likely to be noticed. I have therefore for now just added a new
warning if translating partial codons, which can be upgraded to
a full exception in future (or removed depending on how people
react).

https://github.com/biopython/biopython/commit/c0112a7b79a61eabe0adea78bb70d572f1950cde

Peter

From no-reply at dropboxmail.com  Mon Mar 25 18:56:15 2013
From: no-reply at dropboxmail.com (Dropbox)
Date: Mon, 25 Mar 2013 22:56:15 +0000
Subject: [Biopython] Christos Dimitrakopoulos invited you to check out Dropbox
Message-ID: <20130325225615.88744B01E3B@sjc-batch2.sjc.dropbox.com>

Hi there,

Christos Dimitrakopoulos wants you to try Dropbox! Dropbox lets you bring all your photos, docs and videos with you anywhere and share them easily.

Get started here.
https://www.dropbox.com/l/ozWEKh9SzhpW7n6kwFXQZ12

Thanks!
- The Dropbox Team

____________________________________________________
To stop receiving invites from Dropbox, please go to https://www.dropbox.com/l/WExn05nr6NbsaHEO77kUq12
Dropbox, Inc., PO Box 77767, San Francisco, CA 94107

From natassa_g_2000 at yahoo.com  Mon Mar 25 21:25:43 2013
From: natassa_g_2000 at yahoo.com (natassa)
Date: Mon, 25 Mar 2013 18:25:43 -0700 (PDT)
Subject: [Biopython] convert to interleaved nexus
Message-ID: <1364261143.15605.YahooMailNeo@web160806.mail.bf1.yahoo.com>

Hi,
I am trying to convert a phylip alignment to interleaved nexus and I cannot. Something like:

def Convert_alnformat(alnfile, out, informat, outformat):
    alignment=AlignIO.read(alnfile, informat, alphabet=Gapped(IUPAC.unambiguous_dna))
    SeqIO.write(alignment,out, outformat)

will give me a sequential nexus, while I can't seem to get hold of the write_nexus_data method of the Bio.Nexus.Nexus.Nexus class. As this is called internally by AlignIO or SeqIO, I am unsure of how to even call it within my function. I tried:

from Bio.Nexus import Nexus

then in the function

    Nexus.Nexus.write_nexus_data( alignment,interleave=True)

and I get

TypeError: unbound method write_nexus_data() must be called with Nexus instance as first argument (got MultipleSeqAlignment instance instead)

I understand that the alignment object is not a nexus instance, but I don't see what I should do. Can you please help?

thanks,
Natassa

From rz1991 at foxmail.com  Tue Mar 26 01:06:50 2013
From: rz1991 at foxmail.com (=?gb18030?B?yO7vow==?=)
Date: Tue, 26 Mar 2013 13:06:50 +0800
Subject: [Biopython] convert to interleaved nexus
Message-ID:

Hi Natassa,

I'm just reading the biopython source code. It seems NexusIO does not support interleaved nexus output. Internally, NexusIO.NexusWriter.write_alignment uses the Bio.Nexus module to write out nexus format. And it doesn't allow you to specify the interleave option.
You can try the following code (This is how write_alignment works):

from Bio.Nexus import Nexus
from Bio import AlignIO

alignment = AlignIO.read('...', format='...', alphabet=...)

minimal_record = "#NEXUS\nbegin data; dimensions ntax=0 nchar=0; format datatype=%s; end;" % "dna"
n = Nexus.Nexus(minimal_record)
n.alphabet = alignment._alphabet
for record in alignment:
    n.add_sequence(record.id, record.seq.tostring())
n.write_nexus_data('filename', interleave=True)

Alternatively, you may write the MultipleSeqAlignment instance to a StringIO instance and then give it to Nexus.Nexus.
Here is the code you may try:

from StringIO import StringIO
from Bio.Nexus import Nexus
from Bio import AlignIO

alignment = AlignIO.read('...', format='...', alphabet=...)

output = StringIO()
AlignIO.write(alignment, output, 'nexus')
p = Nexus.Nexus()
p.read(output.getvalue())
p.write_nexus_data('filename', interleave=True)

Hope this helps,
Zheng

------------------ Original ------------------
From: "natassa";
Date: Mar 26, 2013
To: "biopython at biopython.org";
Subject: [Biopython] convert to interleaved nexus

Hi,
I am trying to convert a phylip alignment to interleaved nexus and I cannot. Something like:

def Convert_alnformat(alnfile, out, informat, outformat):
    alignment=AlignIO.read(alnfile, informat, alphabet=Gapped(IUPAC.unambiguous_dna))
    SeqIO.write(alignment,out, outformat)

will give me a sequential nexus, while I can't seem to get hold of the write_nexus_data method of the Bio.Nexus.Nexus.Nexus class. As this is called internally by AlignIO or SeqIO, I am unsure of how to even call it within my function. I tried:

from Bio.Nexus import Nexus

then in the function

    Nexus.Nexus.write_nexus_data( alignment,interleave=True)

and I get

TypeError: unbound method write_nexus_data() must be called with Nexus instance as first argument (got MultipleSeqAlignment instance instead)

I understand that the alignment object is not a nexus instance, but I don't see what I should do. Can you please help?
thanks,
Natassa
_______________________________________________
Biopython mailing list  -  Biopython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython

From p.j.a.cock at googlemail.com  Tue Mar 26 06:11:24 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 26 Mar 2013 10:11:24 +0000
Subject: [Biopython] Christos Dimitrakopoulos invited you to check out Dropbox
In-Reply-To: <20130325225615.88744B01E3B@sjc-batch2.sjc.dropbox.com>
References: <20130325225615.88744B01E3B@sjc-batch2.sjc.dropbox.com>
Message-ID:

Christos,

Please don't spam the mailing lists like this.

Thanks,

Peter

On Mon, Mar 25, 2013 at 10:56 PM, Dropbox wrote:
> Hi there,
>
> Christos Dimitrakopoulos wants you to try Dropbox! Dropbox lets
> you bring all your photos, docs and videos with you anywhere and
> share them easily.
>
> Get started here.
> ...
>
> Thanks!
> - The Dropbox Team
>
> ____________________________________________________
> To stop receiving invites from Dropbox, please go to https://www.dropbox.com/l/WExn05nr6NbsaHEO77kUq12
> Dropbox, Inc., PO Box 77767, San Francisco, CA 94107
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From saladi at caltech.edu  Tue Mar 26 09:08:26 2013
From: saladi at caltech.edu (Shyam Saladi)
Date: Tue, 26 Mar 2013 09:08:26 -0400
Subject: [Biopython] Parsing GB seq files with BioPython into BioSQL
Message-ID:

Hi,

I am parsing genbank genome files for microbial genomes and loading the
sequence and annotations into a BioSQL database.

The program I have is quite simple (same as given online:
http://biopython.org/wiki/BioSQL#Loading_Sequences_into_a_database).

The issue is that each record when loaded into memory is huge. Some genomes
take up the entire 32 GB RAM + 32 GB swap.

Does anyone have suggestions on how to make this process more efficient?

Thanks,
Shyam

From p.j.a.cock at googlemail.com  Tue Mar 26 09:50:42 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 26 Mar 2013 13:50:42 +0000
Subject: [Biopython] Parsing GB seq files with BioPython into BioSQL
In-Reply-To:
References:
Message-ID:

On Tue, Mar 26, 2013 at 1:08 PM, Shyam Saladi wrote:
> Hi,
>
> I am parsing genbank genome files for microbial genomes and loading the
> sequence and annotations into a BioSQL database.
>
> The program I have is quite simple (same as given online:
> http://biopython.org/wiki/BioSQL#Loading_Sequences_into_a_database).
>
> The issue is that each record when loaded into memory is huge. Some genomes
> take up the entire 32 gb ram + 32 gb swap.
>
> Does anyone have suggestions on how to make this process more efficient?

Could you show us your code and/or give some examples where
you find a single microbial genome is taking that much RAM -
it does seem more likely there is something else happening,
like keeping old records in memory (possibly as simple as
failing to commit the data to the database regularly). Which
database are you using? How are you doing the commits?

Thanks,

Peter

From saladi at caltech.edu  Tue Mar 26 10:22:06 2013
From: saladi at caltech.edu (Shyam Saladi)
Date: Tue, 26 Mar 2013 10:22:06 -0400
Subject: [Biopython] Parsing GB seq files with BioPython into BioSQL
In-Reply-To:
References:
Message-ID:

Hi,

Thanks for the quick response. Here's the code:

server = BioSeqDatabase.open_database( ...
db = server["microbial"]

handle = open(sys.argv[1], "rU")

count = db.load(SeqIO.parse(handle, "genbank"))
print "Loaded %i records" % count
server.commit()

Since each microbial genome with its annotations comes in a single genbank
file, I guess it's processed as one record with many annotations for genes
and proteins.

We are using BioSQL running on MySQL (on a different machine). Are there any
tips on configuration here? Upon further thought, I think the commit step
might actually be the issue.
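
To make the commit question concrete, here is a minimal sketch of loading in batches with a commit after each batch, so that the pending transaction stays small. It uses the standard library's sqlite3 purely as a stand-in for the MySQL-backed BioSQL database; the `seqs` table and the `load_in_batches` helper are hypothetical and are not part of BioSQL or Biopython:

```python
import sqlite3
from itertools import islice


def load_in_batches(conn, records, batch_size=100):
    """Insert (name, sequence) pairs, committing after every batch.

    Returns the total number of rows inserted.
    """
    records = iter(records)
    total = 0
    while True:
        # Pull at most batch_size records without materialising the rest.
        batch = list(islice(records, batch_size))
        if not batch:
            break
        conn.executemany("INSERT INTO seqs (name, seq) VALUES (?, ?)", batch)
        conn.commit()  # flush this batch instead of holding everything until the end
        total += len(batch)
    return total


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seqs (name TEXT, seq TEXT)")
fake_records = (("rec%d" % i, "ACGT") for i in range(250))
loaded = load_in_batches(conn, fake_records, batch_size=100)
print("Loaded %i records" % loaded)
```

Whether this helps in the real setup depends on where the memory is actually going, which this sketch does not measure.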
Thanks, Shyam On Tue, Mar 26, 2013 at 9:50 AM, Peter Cock wrote: > On Tue, Mar 26, 2013 at 1:08 PM, Shyam Saladi wrote: > > Hi, > > > > I am parsing genbank genome files for microbial genomes and loading the > > sequence and annotations into a BioSQL database. > > > > The program I have is quite simple (same as given onlinehttp:// > > biopython.org/wiki/BioSQL#Loading_Sequences_into_a_database). > > > > The issue is that each record when loaded into memory is huge. Some > genomes > > take up the entire 32 gb ram + 32 gb swap. > > > > Does anyone have suggestions on how to make this process more efficient? > > Could you show us your code and/or give some examples were > you find a single microbial genome is taking that much RAM - > it does seem more likely there is something else happening, > like keeping old records in memory (possibly as simple as > failing to commit the data to the database regularly). Which > database are you using? How are you doing the commits? > > Thanks, > > Peter > > From p.j.a.cock at googlemail.com Tue Mar 26 10:36:06 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 26 Mar 2013 14:36:06 +0000 Subject: [Biopython] Parsing GB seq files with BioPython into BioSQL In-Reply-To: References: Message-ID: On Tue, Mar 26, 2013 at 2:22 PM, Shyam Saladi wrote: > Hi, > > Thanks for the quick response. Here's the code: > > server = BioSeqDatabase.open_database( ... > db = server["microbial"] > > handle = open(sys.argv[1], "rU") > > count = db.load(SeqIO.parse(handle, "genbank")) > print "Loaded %i records" % count > server.commit() > > Since each microbial genome with it's annotations comes in single genbank > file, I guess it's processed as one record with many annotations for genes > and proteins. > > We are using BioSQL running on MySQL (on a different machine). Are there any > tips on configuration here? Upon further thought, I think the commit step > might actually be the issue. 
>
> Thanks,
> Shyam

How many records in the file, and can you confirm there are no
memory issues just parsing the file? e.g.

count = 0
for record in SeqIO.parse(handle, "genbank"):
    count += 1
print count

My guess is that you're not using auto-commit, so the database
itself (or possibly the Python MySQL layer?) is caching all the
changes (until the explicit commit is made). This could be a lot
of data!

Either try turning on auto-commit, or use a batched loading
approach. Most simply, you could commit after each record:

count = 0
for record in SeqIO.parse(handle, "genbank"):
    assert 1 == db.load([record])
    server.commit()
    count += 1
print "Loaded %i records" % count

Peter

From natassa_g_2000 at yahoo.com  Tue Mar 26 15:43:42 2013
From: natassa_g_2000 at yahoo.com (natassa)
Date: Tue, 26 Mar 2013 12:43:42 -0700 (PDT)
Subject: [Biopython] convert to interleaved nexus
In-Reply-To:
References:
Message-ID: <1364327022.87108.YahooMailNeo@web160806.mail.bf1.yahoo.com>

Thanks Zheng,
Both your suggestions worked fine. It is weird though that the script works only if I want to convert a phylip-relaxed sequential file to a nexus interleaved, while when I try the same on a phylip interleaved (that I got from the phylip-relaxed using the same function but going through the first if block), then I get this error message (which I suspect has to do with line wrapping?)
This is my function:

def Convert_alnformat(alnfile, out, informat, outformat):
    alignment=AlignIO.read(alnfile, informat, alphabet=Gapped(IUPAC.unambiguous_dna))
    if outformat !='nexus':
        SeqIO.write(alignment,out, outformat)
    else:
        #output=StringIO()
        #AlignIO.write(alignment, output, 'nexus')
        #p = Nexus.Nexus()
        #p.read(output.getvalue())
        #p.write_nexus_data(out, interleave=True)

        minimal_record = "#NEXUS\nbegin data; dimensions ntax=0 nchar=0; format datatype=%s; end;" % "dna"
        n = Nexus.Nexus(minimal_record)
        n.alphabet = alignment._alphabet
        for record in alignment:
            n.add_sequence(record.id, record.seq.tostring())
            n.write_nexus_data(out, interleave=True)

and the error message

  File "/bigdata/agioti/scripts/Align_manips.py", line 261, in <module>
    Convert_alnformat(alignment, outaln, "phylip", "nexus")
  File "/bigdata/agioti/scripts/Align_manips.py", line 48, in Convert_alnformat
    alignment=AlignIO.read(alnfile, informat, alphabet=Gapped(IUPAC.unambiguous_dna)) ##hardcoded
  File "/opt/Python/2.7.3/lib/python2.7/site-packages/Bio/AlignIO/__init__.py", line 418, in read
    first = iterator.next()
  File "/opt/Python/2.7.3/lib/python2.7/site-packages/Bio/AlignIO/__init__.py", line 366, in parse
    for a in i:
  File "/opt/Python/2.7.3/lib/python2.7/site-packages/Bio/AlignIO/PhylipIO.py", line 257, in next
    raise ValueError("End of file mid-block")
ValueError: End of file mid-block

Anyways, problem solved, I was just curious. ...

Thanks again,

Natassa

________________________________
From: ??
To: natassa
Cc: Biopython
Sent: Monday, March 25, 2013 10:06 PM
Subject: Re:[Biopython] convert to interleaved nexus

Hi Natassa,

I'm just reading the biopython source code. It seems NexusIO does not support interleaved nexus output. Internally, NexusIO.NexusWriter.write_alignment uses Bio.Nexus module to write out nexus format. And it doesn't allow you to specify interleave option.

You can try the following code (This is how write_alignment works):

from Bio.Nexus import Nexus
from Bio import AlignIO

alignment = AlignIO.read('...', format='...', alphabet=...)

minimal_record = "#NEXUS\nbegin data; dimensions ntax=0 nchar=0; format datatype=%s; end;" % "dna"
n = Nexus.Nexus(minimal_record)
n.alphabet = alignment._alphabet
for record in alignment:
    n.add_sequence(record.id, record.seq.tostring())
n.write_nexus_data('filename', interleave=True)

Alternatively, you may write MultipleSeqAlignment instance to a StringIO instance and then give it to Nexus.Nexus
Here is the code you may try:

from StringIO import StringIO
from Bio.Nexus import Nexus
from Bio import AlignIO

alignment = AlignIO.read('...', format='...', alphabet=...)

output = StringIO()
AlignIO.write(alignment, output, 'nexus')
p = Nexus.Nexus()
p.read(output.getvalue())
p.write_nexus_data('filename', interleave=True)

Hope this helps,
Zheng

------------------ Original ------------------
From: "natassa";
Date: Mar 26, 2013
To: "biopython at biopython.org";
Subject: [Biopython] convert to interleaved nexus

Hi,
I am trying to convert a phylip alignment to interleaved nexus and I cannot. Something like:

def Convert_alnformat(alnfile, out, informat, outformat):
    alignment=AlignIO.read(alnfile, informat, alphabet=Gapped(IUPAC.unambiguous_dna))
    SeqIO.write(alignment,out, outformat)

will give me a sequential nexus, while I can't seem to get hold of the write_nexus_data method of the Bio.Nexus.Nexus.Nexus class. As this is called internally by AlignIO or SeqIO, I am unsure of how to even call it within my function. I tried:

from Bio.Nexus import Nexus

then in the function

    Nexus.Nexus.write_nexus_data( alignment,interleave=True)

and I get

TypeError: unbound method write_nexus_data() must be called with Nexus instance as first argument (got MultipleSeqAlignment instance instead)

I understand that the alignment object is not a nexus instance, but I don't see what I should do. Can you please help?

thanks,
Natassa
_______________________________________________
Biopython mailing list  -
Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Tue Mar 26 18:46:22 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 26 Mar 2013 22:46:22 +0000 Subject: [Biopython] convert to interleaved nexus In-Reply-To: <1364327022.87108.YahooMailNeo@web160806.mail.bf1.yahoo.com> References: <1364327022.87108.YahooMailNeo@web160806.mail.bf1.yahoo.com> Message-ID: On Tue, Mar 26, 2013 at 7:43 PM, natassa wrote: > Thanks Zheng, > Both your suggestions worked fine. It is weird though that the script > works only if I want to convert a phylip-relaxed sequential file to a nexus > interleaved, while when I try the same on a phylip interleaved (that I got > from the phylip-relaxed using the same function but going through the > first if block ) , then I get this error message (which i suspect has to > do with line wrapping? ) > This is my function: > > def Convert_alnformat(alnfile, out, informat, outformat): > > alignment=AlignIO.read(alnfile, informat, > alphabet=Gapped(IUPAC.unambiguous_dna)) > if outformat !='nexus': > SeqIO.write(alignment,out, outformat) > else: > #output=StringIO() > #AlignIO.write(alignment, output, 'nexus') > #p = Nexus.Nexus() > #p.read(output.getvalue()) > #p.write_nexus_data(out, interleave=True) > > minimal_record = "#NEXUS\nbegin data; dimensions ntax=0 nchar=0; > format datatype=%s; end;" % "dna" > n = Nexus.Nexus(minimal_record) > n.alphabet = alignment._alphabet > for record in alignment: > n.add_sequence(record.id, record.seq.tostring()) > n.write_nexus_data(out, interleave=True) It may be just the email formatting messing up, but there should be ONE call to write_nexus_data - it should be outside the for loop as in Zheng's example, or the write_alignment method in Bio/AlignIO/NexusIO.py That might explain things (I've not checked). 
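
For readers unfamiliar with the two layouts, a minimal plain-Python sketch of the difference between sequential and interleaved output follows. It does not use Bio.Nexus and does not produce a complete NEXUS file; the `sequential_rows` and `interleaved_rows` helpers, the taxon names and the block width are assumptions for illustration only. Note that the sequences are all collected first and the output is produced once, after the loop:

```python
def sequential_rows(records):
    """One row per sequence: the name followed by the whole sequence."""
    return ["%s %s" % (name, seq) for name, seq in records]


def interleaved_rows(records, width=10):
    """Rows grouped in blocks of `width` columns, with a blank row between blocks."""
    length = max(len(seq) for _, seq in records)
    rows = []
    for start in range(0, length, width):
        if rows:
            rows.append("")  # blank separator between consecutive blocks
        for name, seq in records:
            rows.append("%s %s" % (name, seq[start:start + width]))
    return rows


records = [("taxon1", "ACGTACGTACGTAC"), ("taxon2", "ACGTTCGTACGAAC")]
print("\n".join(sequential_rows(records)))
print()
print("\n".join(interleaved_rows(records, width=10)))
```

With long alignments the interleaved layout keeps every row short, at the cost of repeating the names in each block.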
If might be useful to make interleave an option on the NexusWriter class in Bio/AlignIO/NexusIO.py - especially if there is a good reason to sometimes need interleaved NEXUS. My impression was that non-interleaved was better for reliable parsing in most tools (from memory - I've not looked at this for a while). Regards, Peter From natassa_g_2000 at yahoo.com Tue Mar 26 20:07:54 2013 From: natassa_g_2000 at yahoo.com (natassa) Date: Tue, 26 Mar 2013 17:07:54 -0700 (PDT) Subject: [Biopython] convert to interleaved nexus In-Reply-To: References: <1364327022.87108.YahooMailNeo@web160806.mail.bf1.yahoo.com> Message-ID: <1364342874.12215.YahooMailNeo@web160801.mail.bf1.yahoo.com> Hi Peter, There is a single call to write_nexus_data within the call, the second one is quoted out. But it is within an if block-do you mean this is incorrect? I can't really tell if there is a specific need for interleaved versus sequential nexus, I use the file for a MrBayes analysis and the error message i got when running it with a sequential nexus had? something to do with max characters per line , similarly to what was discussed here: http://biopython.org/pipermail/biopython-dev/2010-December/008480.html I thus decided to go for an interleaved format. But it would be good if NexusIO supports more formats or in general, that the functionalities of the write_nexus_data were more easily accessible compared to the 'workaround' that I did here. I am not a biopython expert to assess this though :-) Thanks, Natassa ________________________________ From: Peter Cock To: natassa Cc: ?? ; "biopython at biopython.org" Sent: Tuesday, March 26, 2013 3:46 PM Subject: Re: [Biopython] convert to interleaved nexus On Tue, Mar 26, 2013 at 7:43 PM, natassa wrote: > Thanks Zheng, > Both your suggestions worked fine. 
> It is weird though that the script
> works only if I want to convert a phylip-relaxed sequential file to a nexus
> interleaved, while when I try the same on a phylip interleaved (that I got
> from the phylip-relaxed using the same function but going through the
> first if block ) , then I get this error message (which i suspect has to
> do with line wrapping? )
> This is my function:
>
> def Convert_alnformat(alnfile, out, informat, outformat):
>
>    alignment=AlignIO.read(alnfile, informat,
> alphabet=Gapped(IUPAC.unambiguous_dna))
>    if outformat !='nexus':
>        SeqIO.write(alignment,out, outformat)
>    else:
>        #output=StringIO()
>        #AlignIO.write(alignment, output, 'nexus')
>        #p = Nexus.Nexus()
>        #p.read(output.getvalue())
>        #p.write_nexus_data(out, interleave=True)
>
>        minimal_record = "#NEXUS\nbegin data; dimensions ntax=0 nchar=0;
> format datatype=%s; end;" % "dna"
>        n = Nexus.Nexus(minimal_record)
>        n.alphabet = alignment._alphabet
>        for record in alignment:
>            n.add_sequence(record.id, record.seq.tostring())
>            n.write_nexus_data(out, interleave=True)

It may be just the email formatting messing up, but there should
be ONE call to write_nexus_data - it should be outside the for
loop as in Zheng's example, or the write_alignment method in
Bio/AlignIO/NexusIO.py

That might explain things (I've not checked).

It might be useful to make interleave an option on the NexusWriter
class in Bio/AlignIO/NexusIO.py - especially if there is a good
reason to sometimes need interleaved NEXUS. My impression was
that non-interleaved was better for reliable parsing in most tools
(from memory - I've not looked at this for a while).

Regards,

Peter

From dan837446 at gmail.com Tue Mar 26 23:04:09 2013
From: dan837446 at gmail.com (Dan)
Date: Wed, 27 Mar 2013 16:04:09 +1300
Subject: [Biopython] qblast error, probably due to BLAST server overload: any way to handle this error better?
Message-ID:

Hi,

I have a script that runs Qblast over a multiline fasta file (protein).
The relevant code is:

for seq_record in SeqIO.parse(args.infile,"fasta"):
    blast_result_handle = NCBIWWW.qblast(args.program, args.database, \
        seq_record.format("fasta"),expect=args.expect, hitlist_size=args.num_hits, \
        service=args.service)
    time.sleep(5)

for a generalised case, and in the particular case I am looking at it's:

for seq_record in SeqIO.parse(args.infile,"fasta"):
    blast_result_handle = NCBIWWW.qblast("blastp", "nr", \
        seq_record.format("fasta"),expect=args.expect, hitlist_size=args.num_hits, \
        service=args.service)
    time.sleep(5)

From dan837446 at gmail.com Tue Mar 26 23:12:29 2013
From: dan837446 at gmail.com (Dan)
Date: Wed, 27 Mar 2013 16:12:29 +1300
Subject: [Biopython] Qblast
Message-ID:

Hi,

I have a script that runs Qblast over a multiline fasta file (protein).
The relevant code is:

for seq_record in SeqIO.parse(args.infile,"fasta"):
    # For each individual fasta record in a multiline fasta file..
    # check that it's an appropriate time to search and wait if not
    # do the search
    blast_result_handle = NCBIWWW.qblast(args.program, args.database, \
        seq_record.format("fasta"),expect=args.expect, hitlist_size=args.num_hits, \
        service=args.service)
    time.sleep(5)

Most of the time it works fine, but every so often it fails, like so:

Traceback (most recent call last):
  File "remote_blast_multiline_fasta.py", line 174, in <module>
    service=args.service)
  File "/usr/lib/pymodules/python2.7/Bio/Blast/NCBIWWW.py", line 122, in qblast
    handle = urllib2.urlopen(request)
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 406, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 444, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 502: Bad Gateway

I'm assuming that this is an "overloaded BLAST server" error. Is there
any way of handling this error in a better way? Sorry if this question
is a bit general.

From chris.mit7 at gmail.com Tue Mar 26 23:49:00 2013
From: chris.mit7 at gmail.com (Chris Mitchell)
Date: Tue, 26 Mar 2013 23:49:00 -0400
Subject: [Biopython] Qblast
In-Reply-To:
References:
Message-ID:

You could do a try except block and reschedule the job in 10 minutes or
whatever. It would require a few tweaks to your setup, like:

while jobs and