From biopython at maubp.freeserve.co.uk Mon May 2 18:06:21 2005
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon May 2 17:52:30 2005
Subject: [BioPython] Big GenBank files
In-Reply-To: <005b01c54e6f$55a796b0$0b413851@YSENGARD>
References: <1114422149.426cbb8506bc7@imp4-q.free.fr>
    <426CD193.5030801@maubp.freeserve.co.uk>
    <005b01c54e6f$55a796b0$0b413851@YSENGARD>
Message-ID: <4276A45D.6060001@maubp.freeserve.co.uk>

I sent Aurélie a Python file version of my patch (from bug 1747) off the
mailing list, and it looks like there is a problem using it with
GenBank.NCBIDictionary (see below), which I have never used.

http://bugzilla.open-bio.org/show_bug.cgi?id=1747

Thanks for letting me know, Aurélie!

I will try to look at this as time permits, but I will have to work out
how the NCBIDictionary code works first... so if someone else wants to
leap in, please do :)

Peter

-------- Original Message --------
Subject: Re: [BioPython] Big GenBank files
Date: Sun, 1 May 2005 19:00:53 +0200
From: Aurélie Bornot
To: Peter

Hello Peter and everybody!

Sorry Peter, it took me a long time to answer you about your patch
(GenBank.__init__.py)...

I have tried it with this code (which works with the "old" __init__.py):

fichier = open('AC008625.5.gb',"w")
record_parser = GenBank.FeatureParser()
ncbi_dict = GenBank.NCBIDictionary('nucleotide','genbank',parser=record_parser)
gb_record = ncbi_dict['AC008625.5']
fichier.close()

And I got this error:

Traceback (most recent call last):
  File "essais.py", line 112, in ?
    gb_record = ncbi_dict['AC008625.5']
  File "C:\Python24\lib\site-packages\Bio\GenBank\__init__.py", line 1736, in __getitem__
    return self.parser.parse(handle)
  File "C:\Python24\lib\site-packages\Bio\GenBank\__init__.py", line 219, in parse
    self._scanner.feed(handle, self._consumer)
  File "C:\Python24\lib\site-packages\Bio\GenBank\__init__.py", line 1261, in feed
    line = handle.readline()
AttributeError: ReseekFile instance has no attribute 'readline'

I don't really know why...

BUT!!! :) Like you said, with something like:

# connection:
fichierGB = urllib2.urlopen("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id="+ID+"&db="+database+"&retmod=text&rettype=genbank")
record_parser = GenBank.RecordParser()
gb_iterator = GenBank.Iterator(fichierGB, record_parser)
cur_record = gb_iterator.next()
fichierGB.close()

It works!!! The big files are parsed without any problem... :)
So I simply modified my code like this.

To conclude: Peter, you are my savior!!!
THANK YOU VERY VERY MUCH!!!

Aurelie

From mdehoon at ims.u-tokyo.ac.jp Tue May 3 02:45:00 2005
From: mdehoon at ims.u-tokyo.ac.jp (Michiel Jan Laurens de Hoon)
Date: Tue May 3 02:32:55 2005
Subject: [BioPython] Rethinking Seq objects
Message-ID: <42771DEC.7090100@ims.u-tokyo.ac.jp>

Hi everybody,

Recently, there was a discussion on biopython-dev about changes to the Seq
and MutableSeq classes. I'd like to ask you if any of the proposed changes
would cause you any problems.

The current proposal is:

1) Make Seq objects mutable, and get rid of MutableSeq. The Seq class and
the MutableSeq class basically describe the same thing, except that one is
read-only and the other one is not. If desired, we can add a readonly flag
to the class to describe if it is mutable or not. (Given that e.g.
Numerical Python arrays don't have such a flag, my feeling is that it is
not really needed for Seq objects either). For performance reasons, the new
Seq class will be implemented in C.

2) By default, a Seq class doesn't assume a particular alphabet. Same as
current behavior:
>>> from Bio.Seq import *
>>> Seq('ATCG')
Seq('ATCG', Alphabet())
However, if the user decides to specify the alphabet explicitly, input to
the sequence will be checked for consistency with the alphabet. So
>>> from Bio.Seq import *
>>> from Bio.Alphabet import IUPAC
>>> my_alpha = IUPAC.unambiguous_dna
>>> s = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
>>> s[:3] = "XYZ"
will raise an error.

3) Make Seq objects understand circular genomes. Many bacterial genomes are
circular. It would be nice if we could take the indices [-1000:1000] from a
Seq object, if it is circular, or [3999000:4001000] if the sequence is
circular with length 4000000.
Circular genomes will likely be implemented as an optional keyword (perhaps
"topology") when creating the Seq object, with corresponding set_topology,
get_topology methods.

4) Perhaps it would be a good idea to add transcribe and translate methods
to the Seq class. Currently, to translate a DNA sequence, we have to do
>>> from Bio.Seq import Seq
>>> from Bio import Translate
>>> from Bio.Alphabet import IUPAC
>>> my_alpha = IUPAC.unambiguous_dna
>>> my_seq = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
>>> standard_translator = Translate.unambiguous_dna_by_id[1]
>>> standard_translator.translate(my_seq)
Seq('DRWAY*DRKS', HasStopCodon(IUPACProtein(), '*'))
which is too much typing for my taste.


Questions/comments/suggestions are welcome. None of this has actually been
coded yet, so it's all still open to discussion.


--Michiel.
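
To make the circular indexing in point 3 concrete, here is a minimal sketch that works on plain Python strings rather than Seq objects. The `circular_slice` helper, and its choice to return an empty result for an empty or reversed range, are assumptions made purely for illustration; the proposal leaves the real interface (a "topology" keyword) open.

```python
def circular_slice(seq, start, stop):
    """Return seq[start:stop], wrapping indices around a circular sequence.

    Hypothetical helper, not part of Biopython. Indices may be negative or
    run past len(seq); each position is reduced modulo the sequence length.
    """
    n = len(seq)
    if n == 0 or stop <= start:
        return seq[:0]
    return "".join(seq[i % n] for i in range(start, stop))


genome = "ABCDEFGHIJ"                  # stand-in for a circular genome of length 10
print(circular_slice(genome, -3, 3))   # HIJABC
print(circular_slice(genome, 8, 12))   # IJAB
```

With a genome of length 4000000, `circular_slice(genome, 3999000, 4001000)` would return the same 2000 bases as `circular_slice(genome, -1000, 1000)`.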
--
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon

From bneron at pasteur.fr Tue May 3 05:27:39 2005
From: bneron at pasteur.fr (bneron@pasteur.fr)
Date: Tue May 3 05:20:00 2005
Subject: [BioPython] Rethinking Seq objects
In-Reply-To: <42771DEC.7090100@ims.u-tokyo.ac.jp>
References: <42771DEC.7090100@ims.u-tokyo.ac.jp>
Message-ID: <20050503092739.GB10339@kerka-sis.pasteur.fr>

* Michiel Jan Laurens de Hoon (20050503 15:45):
> Hi everybody,
>
> Recently, there was a discussion on biopython-dev about changes to the Seq
> and MutableSeq classes. I'd like to ask you if any of the proposed changes
> would cause you any problems.
>
> The current proposal is:
>
> 1) Make Seq objects mutable, and get rid of MutableSeq. The Seq class and
> the MutableSeq class basically describe the same thing, except that one is
> read-only and the other one is not. If desired, we can add a readonly flag
> to the class to describe if it is mutable or not. (Given that e.g.
> Numerical Python arrays don't have such a flag, my feeling is that it is
> not really needed for Seq objects either). For performance reasons, the new
> Seq class will be implemented in C.
>
> 2) By default, a Seq class doesn't assume a particular alphabet. Same as
> current behavior:
> >>> from Bio.Seq import *
> >>> Seq('ATCG')
> Seq('ATCG', Alphabet())
> However, if the user decides to specify the alphabet explicitly, input to
> the sequence will be checked for consistency with the alphabet. So
> >>> from Bio.Seq import *
> >>> from Bio.Alphabet import IUPAC
> >>> my_alpha = IUPAC.unambiguous_dna
> >>> s = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
> >>> s[:3] = "XYZ"
> will raise an error.
>
> 3) Make Seq objects understand circular genomes. Many bacterial genomes are
> circular. It would be nice if we could take the indices [-1000:1000] from a
> Seq object, if it is circular, or [3999000:4001000] if the sequence is
> circular with length 4000000.
> Circular genomes will likely be implemented as an optional keyword (perhaps
> "topology") when creating the Seq object, with corresponding set_topology,
> get_topology methods.
>
> 4) Perhaps it would be a good idea to add transcribe and translate methods
> to the Seq class. Currently, to translate a DNA sequence, we have to do
> >>> from Bio.Seq import Seq
> >>> from Bio import Translate
> >>> from Bio.Alphabet import IUPAC
> >>> my_alpha = IUPAC.unambiguous_dna
> >>> my_seq = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
> >>> standard_translator = Translate.unambiguous_dna_by_id[1]
> >>> standard_translator.translate(my_seq)
> Seq('DRWAY*DRKS', HasStopCodon(IUPACProtein(), '*'))
> which is too much typing for my taste.
>
>
> Questions/comments/suggestions are welcome. None of this has actually been
> coded yet, so it's all still open to discussion.
>
>
> --Michiel.
>

I agree with the suggestions above, but I'd like to add a remark on the way
the Seq object handles the alphabet used for the sequence, more precisely
the case of the sequence.

Just an example:

Python 2.3.4 (#1, Mar 11 2005, 17:34:27)
[GCC 3.3.5 (Gentoo Linux 3.3.5-r1, ssp-3.3.2-3, pie-8.7.7.1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Seq import Seq
>>> from Bio import Translate
>>> from Bio.Alphabet import IUPAC
>>> my_alpha = IUPAC.unambiguous_dna
>>> my_seq_upper = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
>>> my_seq_lower = Seq('gatcgatgggcctattaggatcgaaaatcgc', my_alpha)
>>> standard_translator = Translate.unambiguous_dna_by_id[1]
>>> standard_translator.translate(my_seq_upper)
Seq('DRWAY*DRKS', HasStopCodon(IUPACProtein(), '*'))
>>> standard_translator.translate(my_seq_lower)
Seq('**********', HasStopCodon(IUPACProtein(), '*'))
>>>

Obviously, lower case doesn't work in the Seq object. But I get no
exception, either when the Seq is created or during the translation. Worse,
the translate method does return a value, but that value is meaningless.
(Transcription behaves the same way.)

I think it would be a good thing to correct this behavior.

--
Bertrand Neron
Groupe Logiciels et Banques de Donnees
Institut Pasteur
Tel: 01 45 68 86 78
Fax: 01 40 61 30 80

From letondal at pasteur.fr Tue May 3 11:04:00 2005
From: letondal at pasteur.fr (Catherine Letondal)
Date: Tue May 3 10:52:30 2005
Subject: [BioPython] suggestions for Bio.PDB
In-Reply-To: <32950.83.92.3.59.1113745194.squirrel@www.binf.ku.dk>
References: <32950.83.92.3.59.1113745194.squirrel@www.binf.ku.dk>
Message-ID: <7bb49749d8bfd65835e1e19fdb859a37@pasteur.fr>

Hi,

On Apr 17, 2005, at 3:39 PM, thamelry@binf.ku.dk wrote:

> Hi,
>
>> Would it be possible for the get_structure() method in PDBParser to
>> accept a filehandle
>
> You're not the first to suggest this - it's already in
> the CVS version, also for generating PDB output with PDBIO.

Ok, thanks.

>> Another suggestion: it could be useful to keep a record of the read
>> structure, just in case the user would like to benefit from biopython
>> PDB modules, but also do some custom analysis.
>
> I don't think this would be very useful, it's easy enough to just read in
> the file separately.

Ok (the student who made the suggestion had a lot of files to read, that's
why - but if you implement the filehandle parameter, it's fine)

> BTW, I'm soon going to start to implement a parser for the new
> PDB XML format. Any suggestions, comments, etc. regarding this
> are welcome.

A suggestion (not related to XML) from one of the teachers of our course:
a method returning embedded elements at any level would be useful.
For instance, a get_residues() method that lets you iterate directly over
the residues, whatever the chain, would be very convenient:

p = PDBParser()
s = p.get_structure('...')
for residue in s.get_residues():
    ...

Similarly:

for atom in s.get_atoms():
    ...

I'm aware it's easy to implement with a simple function - at the same time
it might be useful enough to have it available directly in the Structure
class?

Thanks in advance,

--
Catherine Letondal -- Institut Pasteur -- Informatics in Biology Course
www.pasteur.fr/formation/infobio/infobio-en.html

From GECrooks at lbl.gov Wed May 4 12:37:18 2005
From: GECrooks at lbl.gov (Gavin Crooks)
Date: Wed May 4 12:30:15 2005
Subject: [BioPython] Rethinking Seq objects
In-Reply-To: <42771DEC.7090100@ims.u-tokyo.ac.jp>
References: <42771DEC.7090100@ims.u-tokyo.ac.jp>
Message-ID:

On May 2, 2005, at 23:45, Michiel Jan Laurens de Hoon wrote:

> 1) Make Seq objects mutable, and get rid of MutableSeq. The Seq class
> and the MutableSeq class basically describe the same thing, except
> that one is read-only and the other one is not. If desired, we can add
> a readonly flag to the class to describe if it is mutable or not.
> (Given that e.g. Numerical Python arrays don't have such a flag, my
> feeling is that it is not really needed for Seq objects either). For
> performance reasons, the new Seq class will be implemented in C.
>

Although I agree that we don't need both a Seq and a MutableSeq class, I
don't follow why we need a mutable sequence class at all. What's the use
case?

If, in the alternative, Seq was a simple immutable object, then it could
be implemented as a lightweight subclass of str, with an alphabet
attribute that is also a subclass of str. You'd edit it like you would
edit any string in Python: split it into a list, do whatever manipulations
are necessary, and then join the list back together into a new Seq.

Gavin Crooks

--
Gavin E. Crooks            Divisional Fellow
tel: (510) 486-7721        Physical Biosciences
aim:notastring             Lawrence Berkeley Natl. Lab
http://threeplusone.com/   Berkeley, CA 94720, USA
GECrooks@lbl.gov

From mdehoon at ims.u-tokyo.ac.jp Thu May 5 03:30:50 2005
From: mdehoon at ims.u-tokyo.ac.jp (Michiel Jan Laurens de Hoon)
Date: Thu May 5 03:18:34 2005
Subject: [BioPython] Rethinking Seq objects
In-Reply-To:
References: <42771DEC.7090100@ims.u-tokyo.ac.jp>
Message-ID: <4279CBAA.809@ims.u-tokyo.ac.jp>

Gavin Crooks wrote:
> On May 2, 2005, at 23:45, Michiel Jan Laurens de Hoon wrote:
>> 1) Make Seq objects mutable, and get rid of MutableSeq.
>
> Although I agree that we don't need a Seq and a MutableSeq class, I
> don't follow why we need a mutable sequence class at all. What's the use
> case?

Biopython itself uses a MutableSeq in various places, so there does seem to
be a need for a mutable sequence class. However, in some places a
MutableSeq is used where a Seq would do. As far as I can tell, Bio.GA and
Bio.NeuralNetwork actually use the MutableSeq class; in this case, a simple
array might work also. So maybe there is not much use for a mutable Seq
class. I'm a bit hesitant though to simply throw out MutableSeq, so I'd
like to ask our users: Can you give an example where you can't use an
(immutable) Seq, but have to use a MutableSeq?

> If, in the alternative, Seq was a simple immutable object then it could
> be implemented as a light weight subclass of str, with an alphabet
> attribute that is also a subclass of str. You'd edit it like you would
> edit any string in python; split it into a list, do whatever
> manipulations are necessary, and then join the list back together into a
> new Seq.

There may be performance issues with this approach, if a Seq object is
mutated often. So let's wait and see if any of our users actually want to
mutate a sequence object, and if so, if the performance is critical.

--Michiel.

--
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon

From Frederic.Sohm at iaf.cnrs-gif.fr Thu May 5 09:29:46 2005
From: Frederic.Sohm at iaf.cnrs-gif.fr (Frédéric Sohm)
Date: Thu May 5 09:26:02 2005
Subject: [BioPython] (no subject)
Message-ID: <1115299786.427a1fcaacb25@mail.iaf.cnrs-gif.fr>

Hi Michiel and everyone,

Just a thought, don't flame me for it.

Since you will be making a new Seq object, would it be worth making it
behave more like a typical Python object?

But first a disclaimer: I realise the proposed change could mean breaking a
lot of code, so it might be a very bad idea in the end.

When I first used Biopython, I was surprised by the behaviour of the Seq
object with regard to the built-in str() and repr() functions (I should
have read the manual first, but hey...)

OK, here is the current Seq behaviour:

>>> from Bio.Seq import Seq
>>> a = 'a'*80
>>> a
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
>>> s = Seq(a)
>>> s
Seq('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', Alphabet())
>>> str(s)
"Seq('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa ...', Alphabet())"
>>> repr(s)
"Seq('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', Alphabet())"
>>> s.tostring()
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'

Now here is what I was expecting at the time, following the respective
meanings of str and repr:

>>> a = 'a'*80
>>> a
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
>>> s = Seq(a)
>>> s
Seq('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', Alphabet())
>>> str(s)
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
>>> repr(s)
"Seq('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', Alphabet())"

So what I would propose is to:

- change str(seq) to return the actual sequence, as seq.tostring() does
  right now;
- leave repr(seq) as it is;
- make seq.tostring() return str(seq) for backward compatibility (it would
  eventually be removed);
- add a new method, Seq.short() for example, which would behave like the
  current str(seq).

I don't have any idea how much code this would break. The feasibility also
depends on the way the new Seq will be released (I mean, do you plan to
have the current Seq and the new one co-existing for a while, or to
directly replace the old Seq?). If the latter is the way we go, this change
is certainly not desirable; otherwise it might be something to consider.

Personally I have mixed feelings about it, but I think it is worth
discussing the matter now.
This change would make Seq objects behave more like a Python programmer
would expect; on the other hand, Biopython has been built on the current
model, and it might be a bad idea to change it after so much time.

Since the only real problem is the change to str(), it all boils down to
how frequently people use the current str() behaviour of Seq in their code.
I do not have the impression it is very frequent, but...

What do you think?

Fred

On Tuesday 3 May 2005 08:45, Michiel Jan Laurens de Hoon wrote:
> Hi everybody,
>
> Recently, there was a discussion on biopython-dev about changes to the Seq
> and MutableSeq classes. I'd like to ask you if any of the proposed changes
> would cause you any problems.
>
> The current proposal is:
>
> 1) Make Seq objects mutable, and get rid of MutableSeq. The Seq class and
> the MutableSeq class basically describe the same thing, except that one is
> read-only and the other one is not. If desired, we can add a readonly flag
> to the class to describe if it is mutable or not. (Given that e.g.
> Numerical Python arrays don't have such a flag, my feeling is that it is
> not really needed for Seq objects either). For performance reasons, the new
> Seq class will be implemented in C.
>
> 2) By default, a Seq class doesn't assume a particular alphabet. Same as
> current behavior:
> >>> from Bio.Seq import *
> >>> Seq('ATCG')
> Seq('ATCG', Alphabet())
> However, if the user decides to specify the alphabet explicitly, input to
> the sequence will be checked for consistency with the alphabet. So
> >>> from Bio.Seq import *
> >>> from Bio.Alphabet import IUPAC
> >>> my_alpha = IUPAC.unambiguous_dna
> >>> s = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
> >>> s[:3] = "XYZ"
> will raise an error.
>
> 3) Make Seq objects understand circular genomes. Many bacterial genomes are
> circular. It would be nice if we could take the indices [-1000:1000] from a
> Seq object, if it is circular, or [3999000:4001000] if the sequence is
> circular with length 4000000.
> Circular genomes will likely be implemented as an optional keyword (perhaps
> "topology") when creating the Seq object, with corresponding set_topology,
> get_topology methods.
>
> 4) Perhaps it would be a good idea to add transcribe and translate methods
> to the Seq class. Currently, to translate a DNA sequence, we have to do
> >>> from Bio.Seq import Seq
> >>> from Bio import Translate
> >>> from Bio.Alphabet import IUPAC
> >>> my_alpha = IUPAC.unambiguous_dna
> >>> my_seq = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
> >>> standard_translator = Translate.unambiguous_dna_by_id[1]
> >>> standard_translator.translate(my_seq)
> Seq('DRWAY*DRKS', HasStopCodon(IUPACProtein(), '*'))
> which is too much typing for my taste.
>
>
> Questions/comments/suggestions are welcome. None of this has actually been
> coded yet, so it's all still open to discussion.
>
>
> --Michiel.

--
Frédéric Sohm
Equipe INRA U1126
"Morphogenèse du système nerveux des Chordés"
UPR 2197 DEPSN, CNRS
Institut de Neurosciences A. Fessard
1 Avenue de la Terrasse
91 198 GIF-SUR-YVETTE
FRANCE
Phone: +33 (0) 1 69 82 34 12
Fax: +33 (0) 1 69 82 34 47

From GECrooks at lbl.gov Thu May 5 14:35:19 2005
From: GECrooks at lbl.gov (Gavin Crooks)
Date: Thu May 5 14:28:28 2005
Subject: [BioPython] Rethinking Seq objects
In-Reply-To: <4279CBAA.809@ims.u-tokyo.ac.jp>
References: <42771DEC.7090100@ims.u-tokyo.ac.jp>
    <4279CBAA.809@ims.u-tokyo.ac.jp>
Message-ID:

On May 5, 2005, at 00:30, Michiel Jan Laurens de Hoon wrote:
>
>> If, in the alternative, Seq was a simple immutable object then it
>> could be implemented as a light weight subclass of str, with an
>> alphabet attribute that is also a subclass of str.
>> You'd edit it like you would edit any string in python; split it into
>> a list, do whatever manipulations are necessary, and then join the list
>> back together into a new Seq.
>
> There may be performance issues with this approach, if a Seq object is
> mutated often. So let's wait and see if any of our users actually want
> to mutate a sequence object, and if so, if the performance is
> critical.

Performance would be no worse than for string manipulation in standard
Python. The Way of The Python is not to use MutableStrings (which are in
the standard library, but not really canonical) but to split the string
into a list or array, do whatever manipulations are necessary, and then
join the string back together. Is there any reason why Seqs can't be
mutated analogously?

Gavin Crooks

--
Gavin E. Crooks            Divisional Fellow
tel: (510) 486-7721        Physical Biosciences
aim:notastring             Lawrence Berkeley Natl. Lab
http://threeplusone.com/   Berkeley, CA 94720, USA
GECrooks@lbl.gov

From john.corradi at bms.com Fri May 6 11:51:48 2005
From: john.corradi at bms.com (John Corradi)
Date: Fri May 6 11:45:20 2005
Subject: [BioPython] handling sequence ambiguity in SeqUtils
Message-ID: <427B9294.6010605@bms.com>

Hi All,

I just noticed that the protein molecular weight calculation in
ProtParam.py chokes on sequence ambiguities (i.e. X's). It just throws a
KeyError exception. How about using either an average amino acid molecular
weight or calculating a minimum and maximum for those cases?

Thanks.

John

P.S. I guess this is a consideration for the other utils as well.
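
The minimum/maximum idea could look roughly like the sketch below. It is a standalone illustration, not the ProtParam code: the `weight_range` function is hypothetical, the mass table holds approximate average masses of the free amino acids, and only X is treated as ambiguous (any other unknown letter still raises KeyError).

```python
# Approximate average masses (Da) of the free amino acids; illustrative
# values only, a real implementation should use a curated table.
WEIGHTS = {
    "A": 89.09, "C": 121.16, "D": 133.10, "E": 147.13, "F": 165.19,
    "G": 75.07, "H": 155.16, "I": 131.18, "K": 146.19, "L": 131.18,
    "M": 149.21, "N": 132.12, "P": 115.13, "Q": 146.15, "R": 174.20,
    "S": 105.09, "T": 119.12, "V": 117.15, "W": 204.23, "Y": 181.19,
}
WATER = 18.02  # mass lost for each peptide bond formed


def weight_range(sequence, ambiguous="X"):
    """Return (minimum, maximum) molecular weight of a protein sequence.

    Hypothetical helper: each ambiguous residue counts as the lightest
    amino acid for the minimum and as the heaviest one for the maximum.
    """
    if not sequence:
        return 0.0, 0.0
    lightest = min(WEIGHTS.values())
    heaviest = max(WEIGHTS.values())
    low = high = 0.0
    for residue in sequence.upper():
        if residue == ambiguous:
            low += lightest
            high += heaviest
        else:
            mass = WEIGHTS[residue]  # KeyError for any other unknown letter
            low += mass
            high += mass
    water_lost = WATER * (len(sequence) - 1)
    return low - water_lost, high - water_lost


low, high = weight_range("MKXLV")
print("between %.2f and %.2f Da" % (low, high))
```

For a sequence without X the two bounds coincide, so callers that want a single number can check `low == high` first. Using an average residue mass instead would be a one-line change in the ambiguous branch.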

From mdehoon at ims.u-tokyo.ac.jp Sat May 7 01:25:28 2005
From: mdehoon at ims.u-tokyo.ac.jp (Michiel Jan Laurens de Hoon)
Date: Sat May 7 01:13:35 2005
Subject: [BioPython] Rethinking Seq objects
In-Reply-To: <20050503092739.GB10339@kerka-sis.pasteur.fr>
References: <42771DEC.7090100@ims.u-tokyo.ac.jp>
    <20050503092739.GB10339@kerka-sis.pasteur.fr>
Message-ID: <427C5148.9080708@ims.u-tokyo.ac.jp>

bneron@pasteur.fr wrote:
> Just an example:
>
> Python 2.3.4 (#1, Mar 11 2005, 17:34:27)
> [GCC 3.3.5 (Gentoo Linux 3.3.5-r1, ssp-3.3.2-3, pie-8.7.7.1)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from Bio.Seq import Seq
> >>> from Bio import Translate
> >>> from Bio.Alphabet import IUPAC
> >>> my_alpha = IUPAC.unambiguous_dna
> >>> my_seq_upper = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
> >>> my_seq_lower = Seq('gatcgatgggcctattaggatcgaaaatcgc', my_alpha)
> >>> standard_translator = Translate.unambiguous_dna_by_id[1]
> >>> standard_translator.translate(my_seq_upper)
> Seq('DRWAY*DRKS', HasStopCodon(IUPACProtein(), '*'))
> >>> standard_translator.translate(my_seq_lower)
> Seq('**********', HasStopCodon(IUPACProtein(), '*'))
>
> Obviously, lower case doesn't work in the Seq object.

I agree, this should be corrected. The translate and transcribe methods
should work with both uppercase and lowercase.

--Michiel.
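
For illustration, case-insensitive translation can be sketched without Biopython at all. The `translate` function below is a hypothetical free function, not the Bio.Translate API; it builds the standard genetic code (NCBI table 1) from the usual compact TCAG-ordered string, upper-cases its input, and raises an error for letters outside ACGT instead of silently returning stop codons.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI table 1), codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate unambiguous DNA, accepting upper or lower case.

    Hypothetical helper: a trailing partial codon is ignored, and any
    letter other than A, C, G or T raises ValueError.
    """
    dna = dna.upper()
    unknown = set(dna) - set(BASES)
    if unknown:
        raise ValueError("not unambiguous DNA: " + ", ".join(sorted(unknown)))
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(translate("GATCGATGGGCCTATTAGGATCGAAAATCGC"))  # DRWAY*DRKS
print(translate("gatcgatgggcctattaggatcgaaaatcgc"))  # DRWAY*DRKS
```

Normalising the case once at the entry point keeps the codon table small; the alternative of adding lowercase codons to the table would double its size for no gain.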

--
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon

From mdehoon at ims.u-tokyo.ac.jp Sat May 7 01:36:01 2005
From: mdehoon at ims.u-tokyo.ac.jp (Michiel Jan Laurens de Hoon)
Date: Sat May 7 01:23:49 2005
Subject: [BioPython] Rethinking Seq objects
In-Reply-To:
References: <42771DEC.7090100@ims.u-tokyo.ac.jp>
    <4279CBAA.809@ims.u-tokyo.ac.jp>
Message-ID: <427C53C1.7040008@ims.u-tokyo.ac.jp>

Gavin Crooks wrote:
> On May 5, 2005, at 00:30, Michiel Jan Laurens de Hoon wrote:
>>> If, in the alternative, Seq was a simple immutable object then it
>>> could be implemented as a light weight subclass of str, with an
>>> alphabet attribute that is also a subclass of str. You'd edit it like
>>> you would edit any string in python; split it into a list, do
>>> whatever manipulations are necessary, and then join the list back
>>> together into a new Seq.
>>
>> There may be performance issues with this approach, if a Seq object is
>> mutated often. So let's wait and see if any of our users actually want
>> to mutate a sequence object, and if so, if the performance is critical.
>
> Performance would be no worse than for string manipulation in standard
> python. The Way of The Python is not to use MutableString's (Which are
> in the standard library, but not really canonical) but to split string
> into lists or arrays, do whatever manipulations are necessary and then
> join the string back together. Is there any reason why Seq's can't be
> mutated analogously?
>

Well, I was gonna say that Seq objects can be very large, certainly much
larger than common usage of strings in Python, and that this will be a
performance issue. But when I tried to modify a long string by splitting
and rejoining, it doesn't seem to be bad at all. So maybe this is the way
to go.

--Michiel.
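
The split, edit and rejoin approach discussed here can be sketched on a plain string as follows. The `edit_sequence` name is hypothetical and the example assumes an immutable, str-like sequence; it says nothing about how a future Seq class would actually expose this.

```python
def edit_sequence(seq, start, stop, replacement):
    """Return a new string with seq[start:stop] replaced.

    Hypothetical helper showing the list round trip: the original
    string is never modified.
    """
    chars = list(seq)                 # split into a mutable list
    chars[start:stop] = replacement   # edit in place (lengths may differ)
    return "".join(chars)             # join back into a new immutable string


original = "GATCGATGGG"
edited = edit_sequence(original, 0, 3, "CCC")
print(edited)    # CCCCGATGGG
print(original)  # GATCGATGGG
```

When many edits are needed, it is cheaper to keep the list around, apply all of them, and join once at the end rather than converting back and forth for every change.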

--
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon

From wligtenberg at gmail.com Sat May 7 10:28:47 2005
From: wligtenberg at gmail.com (Willem Ligtenberg)
Date: Sat May 7 10:22:03 2005
Subject: [BioPython] getting the positions as a string for comparison reason
Message-ID:

Hello,

I am trying to parse a GenBank file and want to compare the start (or end)
positions of a gene with start (or end) positions I may already have parsed
from Ensembl. These positions are stored as strings, and the string is
empty ("") if no position has been stored in the gene object yet.

[code snippet]
for gen in record.features:
    if gen.location != "":
        gene.addStartBP(gen.location.start)
        gene.addStartBP(gen.location.end)
[/code snippet]

[code from gene class]
def addStartBP(self, startBP):
    if self.startBP == "":
        self.startBP = startBP
[/code from gene class]

This gives (of course) this error:
AssertionError: We can only do comparisons between Biopython Position
objects.

But how can I convert this position object to a string?

Thanks in advance,

Willem Ligtenberg

From JBonis at imim.es Mon May 9 09:34:17 2005
From: JBonis at imim.es (BONIS SANZ, JULIO)
Date: Mon May 9 09:28:54 2005
Subject: [BioPython] Problems parsing xml with sax
Message-ID: <66373AD054447F47851FCC5EB49B3611061251@basquet.imim.es>

Maybe it is not closely related to biopython, maybe it is... anyway:

I am using biopython's GenBank.EUtils.ThinClient.ThinClient() to get some
XML from NCBI. After that I have built some XML parsers with sax to extract
information.

My problem is that sax does not recognize the format that NCBI uses in
their DTD for SNP records.
I did: snpdbi = GenBank.DBIds("snp",['6313']) file = GenBank.EUtils.ThinClient.ThinClient.efetch_using_dbids(snpdbi,rettype = 'flt',retmode = 'xml') the problem is that file starts with: ### ### And sax returns this error: ValueError: unknown url type: /entrez/query/DTD/NSE.dtd When retrieving from elink I have not that problem. For example: >>> dbid = GenBank.DBIds("nucleotide",['55956922']) >>> xmlFileWithSNPsStream = eutils.elink_using_dbids(dbid,db="snp") And I can parse with sax, as the file starts with: ### ### That is a well formed URL.... Any idea? Regards, Julio Bonis Sanz MD www.juliobonis.com -----Mensaje original----- De: biopython-bounces@portal.open-bio.org [mailto:biopython-bounces@portal.open-bio.org]En nombre de Michiel Jan Laurens de Hoon Enviado el: s?bado, 07 de mayo de 2005 7:25 Para: bneron@pasteur.fr CC: Biopython mailing list Asunto: Re: [BioPython] Rethinking Seq objects bneron@pasteur.fr wrote: > just an exemple: > > Python 2.3.4 (#1, Mar 11 2005, 17:34:27) > [GCC 3.3.5 (Gentoo Linux 3.3.5-r1, ssp-3.3.2-3, pie-8.7.7.1)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > >>>>from Bio.Seq import Seq >>>>from Bio import Translate >>>>from Bio.Alphabet import IUPAC >>>>my_alpha = IUPAC.unambiguous_dna >>>>my_seq_upper = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha) >>>>my_seq_lower = Seq('gatcgatgggcctattaggatcgaaaatcgc', my_alpha) >>>>standard_translator = Translate.unambiguous_dna_by_id[1] >>>>standard_translator.translate(my_seq_upper) > > Seq('DRWAY*DRKS', HasStopCodon(IUPACProtein(), '*')) > >>>>standard_translator.translate(my_seq_lower) > > Seq('**********', HasStopCodon(IUPACProtein(), '*')) > > > obviously the lower case doesn't work in the Seq object. I agree, this should be corrected. The translate and transcribe methods should work with both uppercase and lowercase. --Michiel. 
-- Michiel de Hoon, Assistant Professor University of Tokyo, Institute of Medical Science Human Genome Center 4-6-1 Shirokane-dai, Minato-ku Tokyo 108-8639 Japan http://bonsai.ims.u-tokyo.ac.jp/~mdehoon _______________________________________________ BioPython mailing list - BioPython@biopython.org http://biopython.org/mailman/listinfo/biopython From JBonis at imim.es Mon May 9 10:22:47 2005 From: JBonis at imim.es (BONIS SANZ, JULIO) Date: Mon May 9 10:16:27 2005 Subject: [BioPython] Problems parsing xml with sax Message-ID: <66373AD054447F47851FCC5EB49B3611061252@basquet.imim.es> Solved ;) Just after defining the parser >>> parser = from xml.sax.make_parser() put: self.__parser.setFeature('http://xml.org/sax/features/external-general-entities',False) This disables the external entities solver, avoiding the problem of the DTD use. Regards, Julio Bonis Sanz MD www.juliobonis.com -----Mensaje original----- De: biopython-bounces@portal.open-bio.org [mailto:biopython-bounces@portal.open-bio.org]En nombre de BONIS SANZ, JULIO Enviado el: lunes, 09 de mayo de 2005 15:34 Para: Biopython mailing list Asunto: [BioPython] Problems parsing xml with sax Maybe it is not closely related with biopython, maybe it is... anyway: I am using biopython GenBank.EUtils.ThinClient.ThinClient() to get some xml from NCBI. After that I have build some xml parsers in sax to get information. My problem is that sax does not recognize the format that NCBI uses in their DTD for SNP records. I did: snpdbi = GenBank.DBIds("snp",['6313']) file = GenBank.EUtils.ThinClient.ThinClient.efetch_using_dbids(snpdbi,rettype = 'flt',retmode = 'xml') the problem is that file starts with: ### ### And sax returns this error: ValueError: unknown url type: /entrez/query/DTD/NSE.dtd When retrieving from elink I have not that problem. 
For example: >>> dbid = GenBank.DBIds("nucleotide",['55956922']) >>> xmlFileWithSNPsStream = eutils.elink_using_dbids(dbid,db="snp") And I can parse with sax, as the file starts with: ### ### That is a well formed URL.... Any idea? Regards, Julio Bonis Sanz MD www.juliobonis.com -----Mensaje original----- De: biopython-bounces@portal.open-bio.org [mailto:biopython-bounces@portal.open-bio.org]En nombre de Michiel Jan Laurens de Hoon Enviado el: s?bado, 07 de mayo de 2005 7:25 Para: bneron@pasteur.fr CC: Biopython mailing list Asunto: Re: [BioPython] Rethinking Seq objects bneron@pasteur.fr wrote: > just an exemple: > > Python 2.3.4 (#1, Mar 11 2005, 17:34:27) > [GCC 3.3.5 (Gentoo Linux 3.3.5-r1, ssp-3.3.2-3, pie-8.7.7.1)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > >>>>from Bio.Seq import Seq >>>>from Bio import Translate >>>>from Bio.Alphabet import IUPAC >>>>my_alpha = IUPAC.unambiguous_dna >>>>my_seq_upper = Seq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha) >>>>my_seq_lower = Seq('gatcgatgggcctattaggatcgaaaatcgc', my_alpha) >>>>standard_translator = Translate.unambiguous_dna_by_id[1] >>>>standard_translator.translate(my_seq_upper) > > Seq('DRWAY*DRKS', HasStopCodon(IUPACProtein(), '*')) > >>>>standard_translator.translate(my_seq_lower) > > Seq('**********', HasStopCodon(IUPACProtein(), '*')) > > > obviously the lower case doesn't work in the Seq object. I agree, this should be corrected. The translate and transcribe methods should work with both uppercase and lowercase. --Michiel. 
--
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon

From eirik.sonneland at student.umb.no Tue May 10 06:09:34 2005
From: eirik.sonneland at student.umb.no (Eirik Sønneland)
Date: Tue May 10 10:29:48 2005
Subject: [BioPython] Megablast
Message-ID: <4280885E.7070201@student.umb.no>

Hi!

Is there someone who knows what the correct code is if I want to use megablast instead of blastn in:

b_results = NCBIWWW.qblast('blastn', 'bta_genome/all_contig', f_record).read()

Thank you!

Cheers,
Eirik

From aurelie.bornot at free.fr Wed May 11 11:52:58 2005
From: aurelie.bornot at free.fr (aurelie.bornot@free.fr)
Date: Wed May 11 11:45:52 2005
Subject: [BioPython] pairwise alignment tools ???
Message-ID: <1115826778.42822a5a1161d@imp4-q.free.fr>

Hello everybody !

I would like to know if something exists in biopython to do optimal pairwise alignments for DNA and protein sequences ?? I need something that allows the use of BLOSUM62 and BLOSUM45 for proteins....

and if it doesn't exist : does anyone know where I can find binaries to do this ??? (Unfortunately I must work on windowsXP...
and I have trouble to find something...) Thanks a lot !! Aurelie From kris at Math.Princeton.EDU Thu May 12 11:31:23 2005 From: kris at Math.Princeton.EDU (Kristina Rogale Plazonic) Date: Thu May 12 11:24:53 2005 Subject: [BioPython] some files missing in sources of 1.40b and weird non-error Message-ID: Hi! Both the tar.gz and zip archive of 1.40b offered on the download page of the website seem to be incomplete - in particular these files are missing KDTree/__init__.py, KDTree/KDTree.py and maybe others. (source was downloaded yesterday.) As a result PDB module's NeighborSearch.py doesn't work anymore with a strange error: >>> ns = Bio.PDB.NeighborSearch(atlist) Traceback (most recent call last): File "", line 1, in ? TypeError: 'module' object is not callable - as you see, python doesn't report missing KDTree module called from NeighborSearch at all!!! (Sample scripts do report missing KDTree module.) Indeed, it seems that NeighborSearch is then a module with no classes at all; i.e. all that is contained in NeighborSearch.py after the second line from Bio.KDTree import * is ignored, with no error reported. I'm utterly confused. Does this happen because of some setting in biopython? Thanks, Kristina From kris at Math.Princeton.EDU Thu May 12 11:56:12 2005 From: kris at Math.Princeton.EDU (Kristina Rogale Plazonic) Date: Thu May 12 11:48:49 2005 Subject: [BioPython] some files missing in sources of 1.40b and weird non-error In-Reply-To: <200505121732.47339.thamelry@binf.ku.dk> References: <200505121732.47339.thamelry@binf.ku.dk> Message-ID: > KDTree is C++ code, which causes problems on some > systems, and hence compilation is disabled by default. Un-commenting > the KDTree lines in setup.py and re-installing will quite likely solve > your problem. Hi, this is the first thing I tried. Then I discovered that some of the KDTree files are MISSING from the source archives of 1.40b on the download page. I had to fetch the current CVS version to get the complete source. 
Kristina From thamelry at binf.ku.dk Thu May 12 11:32:47 2005 From: thamelry at binf.ku.dk (Thomas Hamelryck) Date: Thu May 12 12:04:42 2005 Subject: [BioPython] some files missing in sources of 1.40b and weird non-error In-Reply-To: References: Message-ID: <200505121732.47339.thamelry@binf.ku.dk> Hi Kristina, > I'm utterly confused. Does this happen because of some setting in > biopython? KDTree is C++ code, which causes problems on some systems, and hence compilation is disabled by default. Un-commenting the KDTree lines in setup.py and re-installing will quite likely solve your problem. Best regards, -Thomas From mdehoon at ims.u-tokyo.ac.jp Fri May 13 01:14:13 2005 From: mdehoon at ims.u-tokyo.ac.jp (Michiel Jan Laurens de Hoon) Date: Fri May 13 01:01:58 2005 Subject: [BioPython] some files missing in sources of 1.40b and weird non-error In-Reply-To: References: <200505121732.47339.thamelry@binf.ku.dk> Message-ID: <428437A5.1030407@ims.u-tokyo.ac.jp> KDTree's *.py were missing in MANIFEST.in, which caused them to be skipped when creating the source distribution. I've fixed MANIFEST.in in CVS, however the source distribution on www.biopython.org is still the old one. Iddo, are you planning a 1.40 (final) release? Since the current release is 1.40 beta. --Michiel. Kristina Rogale Plazonic wrote: > >> KDTree is C++ code, which causes problems on some >> systems, and hence compilation is disabled by default. Un-commenting >> the KDTree lines in setup.py and re-installing will quite likely solve >> your problem. > > > Hi, this is the first thing I tried. Then I discovered that some of the > KDTree files are MISSING from the source archives of 1.40b on the > download page. I had to fetch the current CVS version to get the > complete source. 
> > Kristina > _______________________________________________ > BioPython mailing list - BioPython@biopython.org > http://biopython.org/mailman/listinfo/biopython > > -- Michiel de Hoon, Assistant Professor University of Tokyo, Institute of Medical Science Human Genome Center 4-6-1 Shirokane-dai, Minato-ku Tokyo 108-8639 Japan http://bonsai.ims.u-tokyo.ac.jp/~mdehoon From edeveaud at pasteur.fr Tue May 17 05:57:28 2005 From: edeveaud at pasteur.fr (edeveaud@pasteur.fr) Date: Tue May 17 05:51:07 2005 Subject: [BioPython] NCBIStandalone Parser problem In-Reply-To: <426DFB17.5010409@ims.u-tokyo.ac.jp> References: <20050421091119.GA12744@hebus.sis.pasteur.fr> <426DFB17.5010409@ims.u-tokyo.ac.jp> Message-ID: <20050517095728.GA19359@hebus.sis.pasteur.fr> On Tue, Apr 26, 2005 at 05:25:59PM +0900, Michiel Jan Laurens de Hoon wrote: > Could you try this again with Biopython version 1.40b and see if the > problem still occurs there? If so, could you send me the query that you are > using so I can replicate this error? sorry for the long delay, but know as course finished I will have more time to dig into the problem. anyway I reproduced the same error with Biopython version 1.40b if you want to check with the datas I provided a tar.gz of all the necessary files here the tarball contains : *) a bank containing the following 2 sequences CV793585 and CV793586 taken from the genbank nc0421.flat update *) the generated index for this one (formatdb -p F -i test_bank) NB this ones may need to be rebuild *) 2 query files query_ok: a query giving some hits query_crash: the query that does produce '***** No hits found ******' and leads the parser to crahs *) the basic python script used python ./crash.py query_ok bank/test_bank python ./crash.py query_crash bank/test_bank Eric -- E> desole mais je n est pas trop l habitude des groupes de discutions Le?on n? 1 : on r?pond en haut et on vire le message auquel on r?pond Cette suppression facilite grandement la lecture !!! 
-+- DrN in : Le Neuneu par l'exemple -+-

From jleigh at dal.ca Tue May 17 13:13:49 2005
From: jleigh at dal.ca (Jessica Leigh)
Date: Tue May 17 13:06:31 2005
Subject: [BioPython] retrieve results with RID
Message-ID: <428A264D.9000607@dal.ca>

Hi,

I'm new to BioPython, and what I REALLY want to do is use blastp, restricting results to a particular entrez query. The blast function allows me to do this, but not the qblast... this would be a useful addition, it's really easy to add.

My real problem, though, is this: when I use either blast or qblast, instead of getting a blast result page, I get the RID page (the one that says "This page will NOT be automatically updated.") I know that this is just NCBI being a pain, but is there any function in BioPython that allows me to retrieve the results associated with an RID?

Thanks,
Jessica

From cgw501 at york.ac.uk Tue May 17 15:07:03 2005
From: cgw501 at york.ac.uk (cgw501@york.ac.uk)
Date: Tue May 17 14:59:33 2005
Subject: [BioPython] alignment processing
Message-ID:

Hi,

I have a file processing task I'm trying to do with biopython. I have to take a bunch of clustal alignment files that cover one arm of a whole chromosome, strip off the lowercase letters at the end of each sequence, and produce a file containing all the stripped sequences together in fasta format.
This is what I have so far:

import Bio.Clustalw
from Bio.Alphabet import IUPAC
import string
from Bio.Seq import Seq
from Bio.SeqIO import FASTA
from Bio.SeqRecord import SeqRecord
from sys import *
import sys

inputs = sys.argv[1:-2]
output = open(sys.argv[-1], 'w')

for f in inputs:
    align = Bio.Clustalw.parse_file(f, alphabet=IUPAC.ambiguous_dna)
    lines = align.get_all_seqs()
    strippedAlignRecord = []
    for line in lines:
        lineSeq = line.seq
        lineString = lineSeq.tostring()
        strippedSeq = lineString.rstrip('atcg-')
        strippedSeqObj = Seq(strippedSeq, IUPAC.ambiguous_dna)
        strippedRecObj = SeqRecord(strippedSeqObj, id = line.description)
        out = FASTA.FastaWriter(output)
        out.write(strippedRecObj)

When I run this from the command line I don't get any errors, but the outfile is not created. I'm a bit flummoxed. Any ideas?

Thanks,

Chris

From amairgen at gmail.com Fri May 20 13:06:25 2005
From: amairgen at gmail.com (Mattias de Hollander)
Date: Fri May 20 12:58:57 2005
Subject: [BioPython] [ClustalW] alignment score
Message-ID: <6eeafc6305052010062f2e97be@mail.gmail.com>

Is it possible to get the 'alignment score' from a clustalw alignment, just like when you run ClustalW over the web?

--
Mattias

From fkauff at duke.edu Fri May 20 13:20:04 2005
From: fkauff at duke.edu (Frank Kauff)
Date: Fri May 20 13:13:53 2005
Subject: [BioPython] [ClustalW] alignment score
In-Reply-To: <6eeafc6305052010062f2e97be@mail.gmail.com>
References: <6eeafc6305052010062f2e97be@mail.gmail.com>
Message-ID: <1116609605.4513.25.camel@osiris.biology.duke.edu>

Mattias,

On Fri, 2005-05-20 at 19:06 +0200, Mattias de Hollander wrote:
> Is it possible to get the 'alignment score' from a clustalw alignment, just
> like when you run ClustalW over the web?
>

Clustalw doesn't save the score together with the alignment, but only in the log (or the on-screen output). I think the only way to get the score might be to parse the log for the magic words 'Alignment score' and get the value associated with them.
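A minimal sketch of that log-parsing idea, assuming the ClustalW screen output has been captured as text and contains a line of the form "Alignment Score <number>". The helper name `alignment_score` and the sample log below are made up for illustration and are not taken from a real run.

```python
import re

# Matches e.g. "Alignment Score 4711"; case-insensitive, with an
# optional ':' or '=' between the words and the number.
SCORE_PATTERN = re.compile(r"alignment\s+score\s*[:=]?\s*(-?\d+)", re.IGNORECASE)


def alignment_score(log_text):
    """Return the last alignment score found in the log, or None."""
    matches = SCORE_PATTERN.findall(log_text)
    return int(matches[-1]) if matches else None


# Hypothetical captured output, for illustration only.
sample_log = """\
Sequence 1: seq_a   120 bp
Sequence 2: seq_b   118 bp
Start of Multiple Alignment
Group 1: Sequences:   2      Score:2105
Alignment Score 4711
CLUSTAL-Alignment file created  [test.aln]
"""

print(alignment_score(sample_log))
```

Taking the last match is a deliberate choice in this sketch: if a log covers several runs, the final score is the one that belongs to the most recent alignment.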
Frank -- Frank Kauff Dept. of Biology Duke University Box 90338 Durham, NC 27708 USA Phone 919-660-7382 Fax 919-660-7293 Web http://www.lutzonilab.net/member/frankkauff.shtml From julie.bernauer at ibbmc.u-psud.fr Tue May 24 13:26:13 2005 From: julie.bernauer at ibbmc.u-psud.fr (Julie Bernauer) Date: Tue May 24 13:25:15 2005 Subject: [BioPython] Comment/Suggestion about Bio.PDB.Polypeptide class. How to keep gaps information ? Message-ID: <1116955573.3946.250.camel@fifi.ibbmc.u-psud.fr> Hello Let's imagine we want a fasta file or a seq object containing gaps describing the amino acids that are present in a structure : Ex : 1t6b chain X Using this code : for pp in ppd.build_peptides(structure[0][X]): print pp We get : If we want to bind those peptides together, let's try to define an empty polypeptide : pp1=Polypeptide.Polypeptide([]) and extend it with the peptides we get : pp1=Polypeptide.Polypeptide([]) for pp in ppd.build_peptides(structurecomplex[0][chaineR]): pp1.extend(pp) print pp1 seq=pp1.get_sequence() print seq.tostring() We have : SQGLLGYYFSDLNFQAPMVVTSSTTGDLSIPSSELENIPSENQYFQSAIWSGFIKVKKSDEYTFATSADNHVTMWVDDQEVINKASNSNKIRLEKGRLYQIKIQYQRENPTEKGLDFKLYWTDSQNKKEVISSDNLQLPELKQVPDRDNDGIPDSLEVEGYTVDVKNKRTFLSPWISNIHEKKGLTKYKSSPEKWSTASDPYSDFEKVTGRIDKNVSPEARHPLVAAYPIVHVDMENIILSKNETISKNTSTSRTHTSEVVSAGFSNSNSSTVAIDHSLSLAGERTWAETMGLNTADTARLNANIRYVNTGTAPIYNVLPTTSLVLGKNQTLATIKAKENQLSQILAPNNYYPSKNLAPIALNAQDDFSSTPITMNYNQFLELEKTKQLRLDTDQVYGNIATYNFENGRVRVDTGSNWSEVLPQIQETTARIIFNGKDLNLVERRIAAVNPSDPLETTKPDMTLKEALKIAFGFNEPNGNLQYQGKDITEFDFNFDQQTSQNIKNQLAELNATNIYTVLDKIKLNAKMNILIRDKRFHYDRNNIAVGADESVVKEAHREVINSSTEGLLLNIDKDIRKILSGYIVEIEDTEGLKEVINDRYDMLNISSLRQDGKTFIDFKKYNDKLPLYISNPNYKVNVYAVTKENTIINPSENGDTSTNGIKKILIFSKKGYEIG i.e.: We totally lose the information of gaps. "pp1" still contains this information but cannot give it to "seq" even if using the gapped alphabet. I know it would be possible to get it from an iteration on residue from the structure. 
However, I think it would be better to fill gaps with an 'X' or a '-' while doing pp1.get_sequence(). I mean changing the method get_sequence to handle this case.

Instead of :

for res in self:
    resname=res.get_resname()
    if to_one_letter_code.has_key(resname):
        resname=to_one_letter_code[resname]
    else:
        resname='X'
    s=s+resname

I think it would be nice to iterate over resseq.

What do you think ?

--
Julie BERNAUER
Equipe de Génomique Structurale     http://www.genomics.eu.org
IBBMC - UMR 8619 - U.P.S. Bât.430   Tel. : +33 1 69 15 31 57
91405 Orsay - FRANCE                Fax. : +33 1 69 85 37 15

From thamelry at binf.ku.dk Tue May 24 14:26:32 2005
From: thamelry at binf.ku.dk (thamelry@binf.ku.dk)
Date: Tue May 24 14:25:44 2005
Subject: [BioPython] Comment/Suggestion about Bio.PDB.Polypeptide class. How to keep gaps information ?
In-Reply-To: <1116955573.3946.250.camel@fifi.ibbmc.u-psud.fr>
References: <1116955573.3946.250.camel@fifi.ibbmc.u-psud.fr>
Message-ID: <35376.83.92.3.59.1116959192.squirrel@www.binf.ku.dk>

Hi Julie,

> i.e.: We totally lose the information of gaps. "pp1" still contains this
> information but cannot give it to "seq" even if using the gapped
> alphabet.
> I know it would be possible to get it from an iteration on residue from
> the structure. However, I think it would be better to fill gap with an
> 'X' or a '-' while doing pp1.get_sequence(). I mean changing the method
> get_sequence to handle this case.

I'll start by pointing out that you cannot rely on the resseq numbering being meaningful AT ALL. There are plenty of structures in the PDB where residue X is firmly attached to residue X+Y (with Y>1) and structures where X is not attached to X+1. That's the reason why Bio.PDB uses a distance criterion to find polypeptides.

OTOH it would certainly be useful to have gap information, but I'd like to put that in a separate class, i.e. BrokenPolypeptide. PolypeptideBuilder could have a method build_broken_peptide that would return a BrokenPolypeptide object.
That class could have fancy methods to deal with gaps and the sequences of the missing parts, for example.

I'll try to add this, but I'm busy at the moment (4 articles in the pipeline), but you're welcome to give it a try and send me your code :-).

Best regards,

-Thomas

From julie.bernauer at ibbmc.u-psud.fr Wed May 25 04:59:58 2005
From: julie.bernauer at ibbmc.u-psud.fr (Julie Bernauer)
Date: Wed May 25 04:52:59 2005
Subject: [BioPython] Comment/Suggestion about Bio.PDB.Polypeptide class. How to keep gaps information ?
In-Reply-To: <35376.83.92.3.59.1116959192.squirrel@www.binf.ku.dk>
References: <1116955573.3946.250.camel@fifi.ibbmc.u-psud.fr> <35376.83.92.3.59.1116959192.squirrel@www.binf.ku.dk>
Message-ID: <1117011599.3946.257.camel@fifi.ibbmc.u-psud.fr>

On Tue, 2005-05-24 at 20:26 +0200, thamelry@binf.ku.dk wrote:
> Hi Julie,
> [...]
> I'll try to add this, but I'm busy at the moment (4 articles in the
> pipeline), but you're welcome to give it a try and send me your code :-).

Hi Thomas,

Thank you for your quick answer. Here is a quick and dirty hack that works for me, just see whether you may use it:

class BrokenPolypeptide(list):
    """
    A broken polypeptide is simply a list of polypeptide objects.
    """
    def get_sequence(self):
        """
        Return the AA sequence, filling gap or unknown residue with X.

        @return: polypeptide sequence
        @rtype: L{Seq}
        """
        s=""
        if self == []:
            end=0
        else :
            end=self[0][-1].get_id()[1]
        for peptide in self:
            start=peptide[0].get_id()[1]
            # residues end+1 .. start-1 are missing between two peptides
            gaplength=start-end-1
            for indexgap in range(0, gaplength):
                s=s+'X'
            for res in peptide:
                resname=res.get_resname()
                if to_one_letter_code.has_key(resname):
                    resname=to_one_letter_code[resname]
                else:
                    resname='X'
                s=s+resname
            end=peptide[-1].get_id()[1]
        seq=Seq(s, ProteinAlphabet)
        return seq

HTH, regards,

J.

--
Julie BERNAUER
Equipe de Génomique Structurale     http://www.genomics.eu.org
IBBMC - UMR 8619 - U.P.S. Bât.430   Tel. : +33 1 69 15 31 57
91405 Orsay - FRANCE                Fax.
: +33 1 69 85 37 15 From edeveaud at pasteur.fr Thu May 26 09:31:53 2005 From: edeveaud at pasteur.fr (edeveaud@pasteur.fr) Date: Thu May 26 09:26:43 2005 Subject: [BioPython] iterative ace parsing Message-ID: <20050526133153.GA23295@hebus.sis.pasteur.fr> Hi, after reading the doc for Bio.Sequencing.Ace I would like to run some analysis on an assembly composed of 174 contigs based on approximatively 49000 reads. the only problem is that parsing whole ace file at once needs 872M of memory. my idea was to itereate over the contigs in order to decrease the memory needs, but the doc claims 2) *** DEPRECATED: not entirely suitable for ACE files! Or you can iterate over the contigs of an ace file one by one in the ususal way: could someone point me to some explanation about this warning ?? is the ace parser suitable for iterative tasks ?? thank's -- > dvips -o $@ $< Faut faire gffe de pas te couper avec ton truc, t'as mis des ciseaux ($<) partout :)) -+- Dom in Guide du linuxien pervers - "J'aime pas les Makefile !" -+- From fkauff at duke.edu Thu May 26 09:59:48 2005 From: fkauff at duke.edu (Frank Kauff) Date: Thu May 26 09:52:00 2005 Subject: [BioPython] iterative ace parsing In-Reply-To: <20050526133153.GA23295@hebus.sis.pasteur.fr> References: <20050526133153.GA23295@hebus.sis.pasteur.fr> Message-ID: <1117115988.4496.26.camel@osiris.biology.duke.edu> Hi, On Thu, 2005-05-26 at 15:31 +0200, edeveaud@pasteur.fr wrote: > Hi, > > after reading the doc for Bio.Sequencing.Ace > > I would like to run some analysis on an assembly composed of 174 contigs > based on approximatively 49000 reads. > > the only problem is that parsing whole ace file at once needs 872M of memory. > > my idea was to itereate over the contigs in order to decrease the memory needs, > but the doc claims > 2) *** DEPRECATED: not entirely suitable for ACE files! 
> Or you can iterate over the contigs of an ace file one by one in > the ususal way: > > could someone point me to some explanation about this warning ?? > It works fine, in theory. The problem with ace files is, that they are not entirely suitable for contg-by-contig parsing, they can contain contig-specific information at the very end of the file. So in your case, after reading contig no. 174, there might be still some more info left in the file about contigs no. 12, 132, and 160. Depending on what kind of contigs you have, there might be no info at all or it's just irrelevant for your analysis. The phrap manual (you're using phrap to create the contigs?) lists the tags that can appear at the end of an ace file, so you might want to have a look there and decide whether they are important for you or not. If not, iterating voer contigs should just do fine. Frank > is the ace parser suitable for iterative tasks ?? > > thank's > -- Frank Kauff Dept. of Biology Duke University Box 90338 Durham, NC 27708 USA Phone 919-660-7382 Fax 919-660-7293 Web http://www.lutzonilab.net/member/frankkauff.shtml From edeveaud at pasteur.fr Thu May 26 11:49:03 2005 From: edeveaud at pasteur.fr (edeveaud@pasteur.fr) Date: Thu May 26 11:43:42 2005 Subject: [BioPython] iterative ace parsing In-Reply-To: <1117115988.4496.26.camel@osiris.biology.duke.edu> References: <20050526133153.GA23295@hebus.sis.pasteur.fr> <1117115988.4496.26.camel@osiris.biology.duke.edu> Message-ID: <20050526154903.GA23750@hebus.sis.pasteur.fr> On Thu, May 26, 2005 at 09:59:48AM -0400, Frank Kauff wrote: > Hi, > > On Thu, 2005-05-26 at 15:31 +0200, edeveaud@pasteur.fr wrote: > > Hi, > > > > after reading the doc for Bio.Sequencing.Ace > > > > my idea was to itereate over the contigs in order to decrease the memory > > needs, but the doc claims > > 2) *** DEPRECATED: not entirely suitable for ACE files! 
> > Or you can iterate over the contigs of an ace file one by one in > > the ususal way: > > > > could someone point me to some explanation about this warning ?? > > > > It works fine, in theory. The problem with ace files is, that they are > not entirely suitable for contg-by-contig parsing, they can contain > contig-specific information at the very end of the file. So in your > case, after reading contig no. 174, there might be still some more info > left in the file about contigs no. 12, 132, and 160. Depending on what > kind of contigs you have, there might be no info at all or it's just > irrelevant for your analysis. The phrap manual (you're using phrap to > create the contigs?) lists the tags that can appear at the end of an ace > file, so you might want to have a look there and decide whether they are > important for you or not. If not, iterating voer contigs should just do > fine. thank's for the clarification. yes indede we use phred/phrarp in order to create the assembly. and for the analysis we want to perform we don't care about the eventuals tag set-up by phrap. we just need the contig coverage and the read starts. I'll check the iterative way. thank's again Eric -- Ici, l'exemple est un peu capillotract?. Si on choisissait plut?t un dilemme entre fr.comp.os.unix et fr.rec.arts.os.unix ? -+- APM in: Guide du Cabaliste Usenet - La Cabale est-elle barbue ? -+- From fredgca at hotmail.com Sat May 28 11:22:59 2005 From: fredgca at hotmail.com (Frederico Arnoldi) Date: Sat May 28 11:15:42 2005 Subject: [BioPython] Tool for Biomolecular data edition and analysis in Python Message-ID: Hello, My name is Frederico G. Colombo Arnoldi, I am Phd student (subject: Molecular Evolution) in Brazil. I have been developing a software for Biomolecular data edition and analysis in python, mainly for gene analysis. It started as a little tool to help me and became a more serious work. 
Actually this tool is formed by a GTK interface that allows the user to align sequences with Malign and Clustalw; color sequences previously aligned according conservation and residues characteristics; create reverse, reverse/complement and consensus sequences; search for conserved regions and determined sequences inside a bigger (like restriction enzymes sites and ORF's); generate alignments colored reports and others. The program and Its portal has been written in English, although, as anyone can see in my bad english, I'm not a native english speaker. I would like to know about the possibility and the interest of integrate it to biopython. The portal is : http://mpalign.incubadora.fapesp.br/portal Thanks a lot. Frederico _________________________________________________________________ MSN Messenger: converse online com seus amigos . http://messenger.msn.com.br From idoerg at burnham.org Sun May 29 02:13:37 2005 From: idoerg at burnham.org (Iddo Friedberg) Date: Sun May 29 02:06:42 2005 Subject: [BioPython] Tool for Biomolecular data edition and analysis in Python In-Reply-To: Message-ID: Frederico, This looks like a really useful tool, thanks for sharing. One way I can see this fitting into Biopython is as an addition to the Align module. We'll have to think exactly how. However, I am wondering if this really fits within biopython: you have what seems to me as a standalone program. Biopython's goal is to provide buiding blocks for such tools. So your code is the next step: you have constructed a house (a very fine one, may I add), Biopython provides lumber, bricks, etc. etc. Like I said, I'll have to look at it. But with ISMB coming up, it will probably not be for a month. If someone else has any thoughts on the matter, please share. Cheers, and thanks again, Iddo -- Iddo Friedberg, Ph.D. The Burnham Institute 10901 N. Torrey Pines Rd. 
La Jolla, CA 92037, USA Tel: +1 (858) 646 3100 x3516 Fax: +1 (858) 646 3171 http://ffas.ljcrf.edu/~iddo ------------------------------------- Automated Protein Function Prediction Meeting, June 24, 2005 http://ffas.burnham.org/AFP On Sat, 28 May 2005, Frederico Arnoldi wrote: > Hello, > > My name is Frederico G. Colombo Arnoldi, I am Phd student (subject: > Molecular Evolution) in Brazil. > I have been developing a software for Biomolecular data edition and > analysis in python, mainly for gene analysis. It started as a little > tool to help me and became a more serious work. Actually this tool is > formed by a GTK interface that allows the user to align sequences with > Malign and Clustalw; color sequences previously aligned according > conservation and residues characteristics; create reverse, > reverse/complement and consensus sequences; search for conserved > regions and determined sequences inside a bigger (like restriction > enzymes sites and ORF's); generate alignments colored reports and > others. > The program and Its portal has been written in English, although, > as anyone can see in my bad english, I'm not a native english speaker. > I would like to know about the possibility and the interest of > integrate it to biopython. The portal is : > http://mpalign.incubadora.fapesp.br/portal > > Thanks a lot. > Frederico > > _________________________________________________________________ > MSN Messenger: converse online com seus amigos . > http://messenger.msn.com.br > > _______________________________________________ > BioPython mailing list - BioPython@biopython.org > http://biopython.org/mailman/listinfo/biopython > From janaspe at web.de Mon May 30 10:10:41 2005 From: janaspe at web.de (Jana Sperschneider) Date: Mon May 30 10:03:03 2005 Subject: [BioPython] mxtexttools link Message-ID: <506450146@web.de> Hi there, I need to get the mxtexttools package for Windows, the link http://www.egenix.com/files/python/eGenix-mx-Extensions.html doesn't seem to work.. 
maybe the server is down.. can anyone help?

Cheers
Jana

From biopython at maubp.freeserve.co.uk Mon May 30 10:42:45 2005
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon May 30 10:27:42 2005
Subject: [BioPython] mxtexttools link
In-Reply-To: <506450146@web.de>
References: <506450146@web.de>
Message-ID: <429B2665.1020709@maubp.freeserve.co.uk>

Jana Sperschneider wrote:
> Hi there,
>
> I need to get the mxtexttools package for Windows, the link
>
> http://www.egenix.com/files/python/eGenix-mx-Extensions.html
>
> doesn't seem to work.. maybe the server is down.. can anyone help?
>
> Cheers
> Jana

Their website does seem to be down.

I do have egenix-mx-base-2.0.5.win32-py2.3.exe on my hard disk which I could email you if you like (off the list - it's 574kb).

Note - you would have to be using Python 2.3 for this to work.

Peter

From janaspe at web.de Mon May 30 12:26:51 2005
From: janaspe at web.de (Jana Sperschneider)
Date: Mon May 30 12:23:41 2005
Subject: [BioPython] mxtexttools link
Message-ID: <506597832@web.de>

Hi Peter,

thank you so much for your help, would be great if you could send the file to my email address! I have Python 2.4 on my computer, should work?

Cheers
Jana

From amorgan at mitre.org Tue May 31 17:43:46 2005
From: amorgan at mitre.org (Alexander A. Morgan)
Date: Tue May 31 17:36:39 2005
Subject: [BioPython] qblast through a proxy
Message-ID: <429CDA92.7080704@mitre.org>

Hello:

Most of the parts of BioPython use urllib to connect to webservices which makes using a proxy (without a password at least) very straightforward. However, Blast.NCBIWWW uses the socket library in '_send_to_qblast()'.
There doesn't seem to be an easy way to get through a proxy using the low level socket library. Does anyone have a quick fix/workaround for this? Thanks, Alex
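One possible workaround is to bypass the socket-based code and send the query through the standard library's URL machinery, which does support HTTP proxies. The following is a minimal sketch in present-day Python 3, assuming a proxy that needs no password. The proxy address and the helper names `build_proxy_opener` and `build_blast_request` are placeholders, and the endpoint URL and the CMD/PROGRAM/DATABASE/QUERY parameter names are assumptions about NCBI's BLAST URL interface that should be checked against NCBI's own documentation. It is not a drop-in replacement for '_send_to_qblast()'.

```python
import urllib.parse
import urllib.request

PROXY_URL = "http://proxy.example.org:8080"  # placeholder proxy address
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"  # assumed endpoint


def build_proxy_opener(proxy_url):
    """Return an opener that routes http and https requests via the proxy."""
    proxies = {"http": proxy_url, "https": proxy_url}
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))


def build_blast_request(program, database, sequence):
    """Build a POST request that submits a query (assumed parameter names)."""
    params = urllib.parse.urlencode({
        "CMD": "Put",
        "PROGRAM": program,
        "DATABASE": database,
        "QUERY": sequence,
    })
    return urllib.request.Request(BLAST_URL, data=params.encode("ascii"))


opener = build_proxy_opener(PROXY_URL)
request = build_blast_request("blastn", "nt", "GATCGATGGGCCTATTAGGATCGAAAATCGC")

# Nothing is sent until the request is opened, for example:
# response = opener.open(request)
# page = response.read()
```

Keeping the opener separate from the request means the same request can be sent with or without a proxy, simply by choosing which opener to call.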