From mjldehoon at yahoo.com Fri Feb 1 02:22:19 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Thu, 31 Jan 2008 23:22:19 -0800 (PST)
Subject: [BioPython] blast parse
In-Reply-To: <47A07222.9000200@biotec.tu-dresden.de>
Message-ID: <807461.62532.qm@web62405.mail.re1.yahoo.com>

I have added a DeprecationWarning to NCBIXML.BlastParser.parse.

--Michiel.

Christof Winter wrote:
Michiel de Hoon wrote:
> Dear Jose,
>
> To get the records one-by-one, use
>
> from Bio.Blast import NCBIXML
> blast_parse = NCBIXML.parse(blasth)
> for blast_result in blast_parse:
>     # do whatever with blast_result
>
> This avoids having to read the complete XML file all at once.
>
> To the developers: We should probably think about removing the
> NCBIXML.BlastParser.parse, and perhaps adding a NCBIXML.read function to read
> exactly one record from the XML file.

I think removing NCBIXML.BlastParser.parse is a good idea. We should keep it
simple.

Christof

From e.picardi at unical.it Tue Feb 5 09:55:10 2008
From: e.picardi at unical.it (Ernesto)
Date: Tue, 5 Feb 2008 15:55:10 +0100
Subject: [BioPython] GFF parser
Message-ID: <9DE3866D-D345-4C88-8935-A793336259D7@unical.it>

Dear All,

I found on the Internet a very interesting GFF parser written in Python by
Martin Knudsen. Since I know that at the moment there isn't a real GFF parser
in BioPython, we could think about adding the one by Martin, after requesting
permission from the author, of course.
The parser can be downloaded from the following web page:
http://www.daimi.au.dk/~martink/birc/scripts.html

Hope this helps,
Ernesto

--------------------------------------------------------
Dr Ernesto Picardi, PhD
Dept. of Biochemistry and Molecular Biology
University of Bari
Italy
E-mail: e.picardi at unical.it
--------------------------------------------------------

From chris.lasher at gmail.com Tue Feb 5 22:27:19 2008
From: chris.lasher at gmail.com (Chris Lasher)
Date: Tue, 5 Feb 2008 22:27:19 -0500
Subject: [BioPython] Biopython to begin transition to Subversion
Message-ID: <128a885f0802051927g1d773a51l5b0e7b914e347ffd@mail.gmail.com>

Hello all Biopythonistas,

In the upcoming weeks, Biopython will begin and complete its
transition from CVS to Subversion (SVN) as its revision control
system.

This transition will likely not affect end users of Biopython except
that to get the development version, a checkout with a Subversion
client, rather than a CVS client, will be necessary.

For developers, we will need to determine a suitable range of dates (a
week) during which we will "freeze" the CVS repository for its
transition to SVN. From the freeze onward, commits to the CVS
repository will no longer be possible. Instead, commits not made
before the freeze will need to take place in the Subversion repository
once we have it running. This week, we hope to have a "dry run" of the
Subversion repository available for the developers to poke around and
make sure the transition will include everything necessary. Following
that, we'll have the freeze and complete the transition.

If you have any questions, I'll be checking posts to the list, or you
may feel free to contact me directly.

Best,
Chris

From cjfields at uiuc.edu Tue Feb 5 22:33:42 2008
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 5 Feb 2008 21:33:42 -0600
Subject: [BioPython] Biopython to begin transition to Subversion
In-Reply-To: <128a885f0802051927g1d773a51l5b0e7b914e347ffd@mail.gmail.com>
References: <128a885f0802051927g1d773a51l5b0e7b914e347ffd@mail.gmail.com>
Message-ID:

Let me know if you need any help.

chris

On Feb 5, 2008, at 9:27 PM, Chris Lasher wrote:

> Hello all Biopythonistas,
>
> In the upcoming weeks, Biopython will begin and complete its
> transition from CVS to Subversion (SVN) as its revision control
> system.
>
> This transition will likely not affect end users of Biopython except
> that to get the development version, a checkout with a Subversion
> client, rather than a CVS client, will be necessary.
>
> For developers, we will need to determine a suitable range of dates (a
> week) during which we will "freeze" the CVS repository for its
> transition to SVN. From the freeze onward, commits to the CVS
> repository will no longer be possible. Instead, commits not made
> before the freeze will need to take place in the Subversion repository
> once we have it running. This week, we hope to have a "dry run" of the
> Subversion repository available for the developers to poke around and
> make sure the transition will include everything necessary. Following
> that, we'll have the freeze and complete the transition.
>
> If you have any questions, I'll be checking posts to the list, or you
> may feel free to contact me directly.
>
> Best,
> Chris
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From dalke at dalkescientific.com Wed Feb 6 06:03:38 2008
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 6 Feb 2008 12:03:38 +0100
Subject: [BioPython] [bip] Bioinformatics Programming Language Shootout, Python performance poopoo'd
In-Reply-To: <320fb6e00802060244j6764c84apcf3eae45b0badcb0@mail.gmail.com>
References: <128a885f0802051200l1913c1a4waac6bd8e653c06ea@mail.gmail.com>
 <69b1b6300802051436q348f4f25va5f26cd82a949cca@mail.gmail.com>
 <320fb6e00802060244j6764c84apcf3eae45b0badcb0@mail.gmail.com>
Message-ID: <4A584E83-E765-49B7-A383-5F2D7D861269@dalkescientific.com>

On Feb 6, 2008, at 11:44 AM, Peter wrote:
> Am I right in thinking the authors have not made any of their sample
> input files available? In the case of the multi GB Blast file, this
> is perhaps justified. Also I didn't see any timing script.

The alignment programs contain the test data.

The fasta parser and blast parser do not contain test data. The lack
of data is not justified, as having a 9 GB file adds little to the
comparison over having a 9 MB file, since it should scale linearly. It
does show that the parsers can handle large files, but big whoop. And
the test is unaffected by having a 9 MB file duplicated 1,000 times.

The neighbor-joining code contains no test data.

There's no timing script.

Andrew
dalke at dalkescientific.com

From jblanca at btc.upv.es Wed Feb 6 11:06:08 2008
From: jblanca at btc.upv.es (Jose Blanca)
Date: Wed, 6 Feb 2008 17:06:08 +0100
Subject: [BioPython] Alignment add_sequence
Message-ID: <200802061706.08830.jblanca@btc.upv.es>

Hello,
I'm building an alignment object from a set of seqRecords using the following
code:

from Bio.Align.Generic import Alignment
from Bio.Alphabet import IUPAC
my_alpha = IUPAC.IUPACAmbiguousDNA()
ali = Alignment(my_alpha)
for seqName in sequences.keys():
    seq = sequences[seqName].seq.tostring()
    start = mesh[seqName]['location_begin']
    id = sequences[seqName].id
    ali.add_sequence(id, seq, start)

Is this the best way to do it? Everything is working as expected, but I have a
problem with this implementation. My seqRecords have additional annotations
and I'm losing them.
Maybe this could be solved with a new function like:

def add_sequence(self, seqRecord, start = None, end = None,
                 weight = 1.0):

Also, this way we wouldn't need to create a new SeqRecord for every
sequence, and it should be quicker.
The result could be something like:

from Bio.Align.Generic import Alignment
from Bio.Alphabet import IUPAC
my_alpha = IUPAC.IUPACAmbiguousDNA()
ali = Alignment(my_alpha)
for seqName in sequences.keys():
    start = mesh[seqName]['location_begin']
    ali.add_sequence(sequences[seqName], start)

With such a function a problem could appear if an annotation named 'start' or
'end' is already in the annotation dict. But this could be solved by raising
an exception in that case.
What do you think?
Thanks for your help.

--
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)

From p.j.a.cock at googlemail.com Wed Feb 6 11:20:20 2008
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 6 Feb 2008 16:20:20 +0000
Subject: [BioPython] Alignment add_sequence
In-Reply-To: <200802061706.08830.jblanca@btc.upv.es>
References: <200802061706.08830.jblanca@btc.upv.es>
Message-ID: <320fb6e00802060820h609d5f10vccba4953455794bb@mail.gmail.com>

On Feb 6, 2008 4:06 PM, Jose Blanca wrote:
> Hello,
> I'm building an alignment object from a set of seqRecords using the following
> code:
> ...
> Is this the best way to do it?

No, not really. See below...

> Everything is working as expected, but I have a
> problem with this implementation. My seqRecords have additional annotations
> and I'm losing them.

Yes, using that method the alignment is creating a new SeqRecord for
each sequence with no annotation.

> Maybe this could be solved with a new function like:
> def add_sequence(self, seqRecord, start = None, end = None,
>                  weight = 1.0):

This has been discussed before, along with other limitations of the
current alignment class, e.g. on bug 1944
http://bugzilla.open-bio.org/show_bug.cgi?id=1944

Right now I would suggest you try the Bio.SeqIO.to_alignment()
function, although this doesn't try to do anything clever with
start/end annotation:
http://biopython.org/wiki/SeqIO

Peter

From nuin at genedrift.org Wed Feb 6 11:07:41 2008
From: nuin at genedrift.org (Paulo Nuin)
Date: Wed, 06 Feb 2008 11:07:41 -0500
Subject: [BioPython] [bip] Bioinformatics Programming Language Shootout, Python performance poopoo'd
In-Reply-To: <4A584E83-E765-49B7-A383-5F2D7D861269@dalkescientific.com>
References: <128a885f0802051200l1913c1a4waac6bd8e653c06ea@mail.gmail.com>
 <69b1b6300802051436q348f4f25va5f26cd82a949cca@mail.gmail.com>
 <320fb6e00802060244j6764c84apcf3eae45b0badcb0@mail.gmail.com>
 <4A584E83-E765-49B7-A383-5F2D7D861269@dalkescientific.com>
Message-ID: <47A9DB4D.4030801@genedrift.org>

Hi all

I am running pylint on the code and getting some evaluation. Currently
alignment.py scored -10.16/10, mainly because of indentation issues
and lack of spaces between operators. The other scores:

NJ.py scored -7.66/10
parse.py scored -6.10/10
readFasta.py scored -7.00/10

Of course this test just measures the "Pythonic" level of the code, but
it does not check the code itself for quality.

Cheers
Paulo

Andrew Dalke wrote:
> On Feb 6, 2008, at 11:44 AM, Peter wrote:
>> Am I right in thinking the authors have not made any of their sample
>> input files available? In the case of the multi GB Blast file, this
>> is perhaps justified. Also I didn't see any timing script.
>
> the alignment programs contain the test data.
>
> the fasta parser and blast parser do not contain test data. The lack
> of data is not justified as having a 9GB file adds little to the
> comparison over having a 9 MB file as it should scale linearly. It
> does show that the parsers can handle large files, but big whoop. And
> the test is unaffected by having a 9MB file duplicated 1,000 times.
>
> the neighbor-joining code contains no test data
>
> There's no timing script.
>
> Andrew
> dalke at dalkescientific.com

From mcolosimo at mitre.org Wed Feb 6 10:28:15 2008
From: mcolosimo at mitre.org (Colosimo, Marc E.)
Date: Wed, 6 Feb 2008 10:28:15 -0500
Subject: [BioPython] [bip] Bioinformatics Programming Language Shootout, Python performance poopoo'd
In-Reply-To: <4A584E83-E765-49B7-A383-5F2D7D861269@dalkescientific.com>
References: <128a885f0802051200l1913c1a4waac6bd8e653c06ea@mail.gmail.com><69b1b6300802051436q348f4f25va5f26cd82a949cca@mail.gmail.com><320fb6e00802060244j6764c84apcf3eae45b0badcb0@mail.gmail.com>
 <4A584E83-E765-49B7-A383-5F2D7D861269@dalkescientific.com>
Message-ID:

What is biology in python or, more to the point, why is there yet another
mailing list (Web site?) for biology in python?

From looking at their archive messages:

1. Need to establish python/biology community.....

Isn't that what BioPython is? If not, why not?

I'll also point out that there is "CoreBio", a python toolkit for
writing computational biology applications.

I don't want to subscribe to another mailing list, install another
suite of code, keep track of another Web site.

-----Original Message-----
From: biopython-bounces at lists.open-bio.org
[mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Andrew Dalke
Sent: Wednesday, February 06, 2008 6:04 AM
To: biopython at lists.open-bio.org
Cc: biology-in-python at lists.idyll.org
Subject: Re: [BioPython] [bip] Bioinformatics Programming Language
Shootout,Python performance poopoo'd

On Feb 6, 2008, at 11:44 AM, Peter wrote:
> Am I right in thinking the authors have not made any of their sample
> input files available? In the case of the multi GB Blast file, this
> is perhaps justified. Also I didn't see any timing script.

the alignment programs contain the test data.

the fasta parser and blast parser do not contain test data. The lack
of data is not justified as having a 9GB file adds little to the
comparison over having a 9 MB file as it should scale linearly. It
does show that the parsers can handle large files, but big whoop. And
the test is unaffected by having a 9MB file duplicated 1,000 times.

the neighbor-joining code contains no test data

There's no timing script.

Andrew
dalke at dalkescientific.com

From tiagoantao at gmail.com Wed Feb 6 12:05:33 2008
From: tiagoantao at gmail.com (Tiago Antão)
Date: Wed, 6 Feb 2008 17:05:33 +0000
Subject: [BioPython] [Biopython-dev] Biopython to begin transition to Subversion
In-Reply-To: <320fb6e00802060827p37c0aeabk55fa378a4cb35abf@mail.gmail.com>
References: <128a885f0802051927g1d773a51l5b0e7b914e347ffd@mail.gmail.com>
 <320fb6e00802060827p37c0aeabk55fa378a4cb35abf@mail.gmail.com>
Message-ID: <6d941f120802060905h3bc09488tbd7ea3c85bce5914@mail.gmail.com>

Hi,

On Feb 6, 2008 4:27 PM, Peter wrote:
> Michiel - do you think we should try and do another release before the
> CVS freeze and migration? We've had lots of little changes, plus
> Tiago's PopGen work and my own efforts with BioSQL. There are still a
> few open issues, but I think a release soon would be reasonable
> (depending on your time commitments of course).

Just FYI: As I noticed that the SVN move would be happening sooner or
later, I decided to put everything into a stable state and stop at
that point. Hopefully everything PopGen related is stable
and ready to move (code, test, doc).
As soon as we move to SVN I will get back into committing (now the
really interesting stuff will start: statistics and maybe HapMap).

Tiago

From cjfields at uiuc.edu Wed Feb 6 12:19:33 2008
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 6 Feb 2008 11:19:33 -0600
Subject: [BioPython] [bip] Bioinformatics Programming Language Shootout, Python performance poopoo'd
In-Reply-To:
References: <128a885f0802051200l1913c1a4waac6bd8e653c06ea@mail.gmail.com><69b1b6300802051436q348f4f25va5f26cd82a949cca@mail.gmail.com><320fb6e00802060244j6764c84apcf3eae45b0badcb0@mail.gmail.com>
 <4A584E83-E765-49B7-A383-5F2D7D861269@dalkescientific.com>
Message-ID: <2E897128-1227-434D-815B-BA4BC88F2053@uiuc.edu>

On Feb 6, 2008, at 9:28 AM, Colosimo, Marc E. wrote:

>
> What is biology in python or more to the point why is there yet
> another
> mailing list (Web site?) for biology in python?

The BioPython group primarily focuses on the BioPython suite of
tools. Other groups might address more general computational issues
which may or may not pertain to BioPython. There are similar efforts
with perl.

> From looking at their archive messages:
>
> 1. Need to establish python/biology community.....
>
> Isn't that what BioPython is? If not, why not?
>
> I'll also point out that there is "CoreBio" a python toolkit for
> writing computational biology applications
>
>
> I don't want to subscribe to another mailing list, install another
> suite of code, keep track of another Web site.
> ...

You don't have to if you don't want to. This was probably cross-posted
by Andrew to bring in discussion on this paper with like-minds
from BioPython.

BTW, Andrew et al, speaking as a perl/BioPerl programmer, I also think
it's a terribly researched and written piece; surprised it got past
the reviewers. Programming language 'shootouts' are always
controversial (anything with a 'my language is better than yours'
conclusion is bound to cause arguments). One would think a shootout
means setting strict rules and having the best/brightest put forward
their qualifying code, but clearly in this case that didn't happen.

chris

From dalke at dalkescientific.com Wed Feb 6 13:48:20 2008
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 6 Feb 2008 19:48:20 +0100
Subject: [BioPython] [bip] Bioinformatics Programming Language Shootout, Python performance poopoo'd
In-Reply-To: <2E897128-1227-434D-815B-BA4BC88F2053@uiuc.edu>
References: <128a885f0802051200l1913c1a4waac6bd8e653c06ea@mail.gmail.com><69b1b6300802051436q348f4f25va5f26cd82a949cca@mail.gmail.com><320fb6e00802060244j6764c84apcf3eae45b0badcb0@mail.gmail.com>
 <4A584E83-E765-49B7-A383-5F2D7D861269@dalkescientific.com>
 <2E897128-1227-434D-815B-BA4BC88F2053@uiuc.edu>
Message-ID: <6BAB63D7-57AB-4F3A-A38C-FE8185F88533@dalkescientific.com>

On Feb 6, 2008, at 6:19 PM, Chris Fields wrote:
> You don't have to if you don't want to. This was probably cross-
> posted by Andrew to bring in discussion on this paper with like-
> minds from BioPython.

That was Peter who started the cross-post

> Peter
> P.S. Hello from Biopython

I'm just the one who wrote a lot last year pushing people on the BIP
list to use more from Biopython, such as
http://lists.idyll.org/pipermail/biology-in-python/2007-August/000046.html
or for that matter many of my posts from last August
http://lists.idyll.org/pipermail/biology-in-python/2007-August/

:)

Andrew
dalke at dalkescientific.com

From p.j.a.cock at googlemail.com Wed Feb 6 15:47:44 2008
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 6 Feb 2008 20:47:44 +0000
Subject: [BioPython] [bip] Bioinformatics Programming Language Shootout, Python performance poopoo'd
In-Reply-To: <6BAB63D7-57AB-4F3A-A38C-FE8185F88533@dalkescientific.com>
References: <128a885f0802051200l1913c1a4waac6bd8e653c06ea@mail.gmail.com>
 <69b1b6300802051436q348f4f25va5f26cd82a949cca@mail.gmail.com>
 <320fb6e00802060244j6764c84apcf3eae45b0badcb0@mail.gmail.com>
 <4A584E83-E765-49B7-A383-5F2D7D861269@dalkescientific.com>
 <2E897128-1227-434D-815B-BA4BC88F2053@uiuc.edu>
 <6BAB63D7-57AB-4F3A-A38C-FE8185F88533@dalkescientific.com>
Message-ID: <320fb6e00802061247v65366f95u56752325b1f797a5@mail.gmail.com>

Andrew Dalke wrote:
> That was Peter who started the
> cross-post

Ah. So it was - entirely accidentally, due to an unwittingly set
reply-to field. I've fixed my email settings, and would like to
apologise to anyone on the biopython mailing list who ended up getting
caught up in the thread as a result (especially Marc).

If any biopython people would like to join in the discussion about
this paper, please do join the BIP list - otherwise let's stop the
double posting. The original link was:
http://www.biomedcentral.com/1471-2105/9/82

Andrew Dalke wrote:
> I'm just the one who wrote a lot last year pushing people on the BIP
> list to use more from Biopython, such as ...

A sentiment I agree with ;)

Peter

From vmatthewa at gmail.com Wed Feb 6 16:21:47 2008
From: vmatthewa at gmail.com (Matthew Abravanel)
Date: Wed, 6 Feb 2008 14:21:47 -0700
Subject: [BioPython] Iterating through an alignment to calculate the number of gaps and their lengths
Message-ID: <8fc5e4c20802061321w6c6024ved8a3a0a7a413ba1@mail.gmail.com>

Hi Everyone,

I was wondering if anyone could help. I am trying to write a little python
script to iterate through an alignment, determine the number of gaps the
alignment has and their lengths, and output that information as a list.
For example, take this made-up alignment:

Seq1 ATT-AGC-C
Seq2 AT--AGCTC

For Seq1 the program would report 2 gaps of length 1, output as a
list like this [1,1] or something like that. I am still learning about
python strings and iterators and am not sure how you would approach this.
Appreciate any help you could give. Thanks.

Sincerely,

Matthew

From ruchira.datta at gmail.com Wed Feb 6 16:39:02 2008
From: ruchira.datta at gmail.com (Ruchira Datta)
Date: Wed, 6 Feb 2008 13:39:02 -0800
Subject: [BioPython] Iterating through an alignment to calculate the number of gaps and their lengths
In-Reply-To: <8fc5e4c20802061321w6c6024ved8a3a0a7a413ba1@mail.gmail.com>
References: <8fc5e4c20802061321w6c6024ved8a3a0a7a413ba1@mail.gmail.com>
Message-ID:

Hi, Matthew

try this:

import re
contiguous_gap = re.compile('-+')
gappy_regions = contiguous_gap.findall(seq)

Now gappy_regions contains a list of the gappy regions, e.g., if seq =
'ILV--F---AAS', then gappy_regions will be ['--','---']

Then to find the lengths of the gappy_regions, you can just say

[len(region) for region in gappy_regions]

which would give you in the above example [2,3]

Hope this helps,

--Ruchira

Ruchira S. Datta, Ph.D
Postdoctoral Researcher
Berkeley Phylogenomics Group
324D Stanley Hall
Department of Bioengineering
California Institute for Quantitative Biosciences (QB3)
University of California
Berkeley, CA 94720
Phone: (510) 642-6642
Email: ruchira at berkeley.edu

On Feb 6, 2008 1:21 PM, Matthew Abravanel wrote:

> Hi Everyone,
>
> I was wondering if anyone could help, I am trying to write a little python
> script to iterate through an alignment and determine the number of gaps
> the
> alignment has and their lengths and output that information as a list.
> Such as this made up alignment:
>
> Seq1 ATT-AGC-C
> Seq2 AT--AGCTC
>
> and your program runs and outputs like 2 gaps of length 1 outputted as a
> list like this [1,1] or something like that. I am still learning about
> python strings and iterators and am not sure how you would approach this?
> Appreciate any help you could give. Thanks.
>
> Sincerely,
>
> Matthew

From biopython at maubp.freeserve.co.uk Wed Feb 6 16:57:48 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 6 Feb 2008 21:57:48 +0000
Subject: [BioPython] Iterating through an alignment to calculate the number of gaps and their lengths
In-Reply-To: <8fc5e4c20802061321w6c6024ved8a3a0a7a413ba1@mail.gmail.com>
References: <8fc5e4c20802061321w6c6024ved8a3a0a7a413ba1@mail.gmail.com>
Message-ID: <320fb6e00802061357m4306b700x7fda2957d1cfb2e2@mail.gmail.com>

On Feb 6, 2008 9:21 PM, Matthew Abravanel wrote:
> Hi Everyone,
>
> I was wondering if anyone could help, I am trying to write a little python
> script to iterate through an alignment and determine the number of gaps the
> alignment has and their lengths and output that information as a list.
> Such as this made up alignment:
>
> Seq1 ATT-AGC-C
> Seq2 AT--AGCTC
>
> and your program runs and outputs like 2 gaps of length 1 outputted as a
> list like this [1,1] or something like that. I am still learning about
> python strings and iterators and am not sure how you would approach this?
> Appreciate any help you could give. Thanks.

I would start with using Bio.SeqIO to read in the sequences as
SeqRecord objects - I'm assuming you have them in a file (e.g. fasta
format, or maybe clustal?). See the tutorial or
http://biopython.org/wiki/SeqIO

e.g.

from Bio import SeqIO
handle = open("example.fasta")
for rec in SeqIO.parse(handle, "fasta") :
    print rec.id, len(rec.seq), rec.seq.count("-")

The above code will simply count the number of gap characters. I
think you wanted to look at the sequence strings and find how long each
stretch of gap characters is, rather than counting the number of gap
characters? Well, that is a little more complicated...
perhaps something like this:

from Bio import SeqIO
handle = open("example.fasta")
gap = "-"
for rec in SeqIO.parse(handle, "fasta") :
    print rec.id, rec.seq
    #TODO - Handle leading or trailing gaps
    in_gap = False
    gap_len = 0
    for letter in rec.seq :
        if letter == gap and not in_gap :
            #Start of a gap
            in_gap = True
            assert gap_len == 0, "Logic error?"
            gap_len = 1
        elif in_gap and letter == gap :
            #Continuation of a gap
            gap_len += 1
        elif in_gap and letter != gap :
            #End of the gap...
            print " - Found a gap of length %i" % gap_len
            #Reset
            in_gap = False
            gap_len = 0

Note that this doesn't record a running tally of the gap lengths
found, for which a python dictionary might be sensible.

Peter

From mjldehoon at yahoo.com Wed Feb 6 20:10:06 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Wed, 6 Feb 2008 17:10:06 -0800 (PST)
Subject: [BioPython] [Biopython-dev] Biopython to begin transition to Subversion
In-Reply-To: <320fb6e00802060827p37c0aeabk55fa378a4cb35abf@mail.gmail.com>
Message-ID: <617104.88204.qm@web62413.mail.re1.yahoo.com>

Peter wrote:
> Michiel - do you think we should try and do another release before the
> CVS freeze and migration? We've had lots of little changes, plus
> Tiago's PopGen work and my own efforts with BioSQL. There are still a
> few open issues, but I think a release soon would be reasonable
> (depending on your time commitments of course).

I think that the Subversion/CVS issue is separate from our release schedule,
so I don't think that the transition to Subversion by itself should be a
reason for a release. However, we can probably make a release soon after the
transition. I would like to finalize my work on Bio.WWW before making a
release, but hopefully that won't be too complicated.

--Michiel
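
Picking up the gap-length question from earlier in this thread: the following is a minimal sketch, assuming the aligned sequences are already available as plain Python strings. The helper name gap_lengths and its include_ends flag are hypothetical and not part of Biopython. It combines the regular-expression idea with the running tally (here a collections.Counter rather than a hand-built dictionary) that the state-machine version leaves out, and it is written for a current Python 3 interpreter rather than the Python 2 used in the messages above.

```python
import re
from collections import Counter

# One or more consecutive gap characters.
GAP_RUN = re.compile(r"-+")


def gap_lengths(seq, include_ends=True):
    """Return the length of every run of '-' in seq, left to right.

    With include_ends=False, leading and trailing runs are ignored,
    which may be preferable when end gaps only mean "not sequenced".
    """
    text = str(seq)
    if not include_ends:
        text = text.strip("-")
    return [len(run) for run in GAP_RUN.findall(text)]


# The made-up alignment from the original question.
alignment = {"Seq1": "ATT-AGC-C", "Seq2": "AT--AGCTC"}

for name, seq in alignment.items():
    lengths = gap_lengths(seq)
    tally = Counter(lengths)  # maps gap length -> number of gaps of that length
    print(name, lengths, dict(tally))
```

Because the helper calls str() on its argument, it should also accept the rec.seq objects produced by Bio.SeqIO.parse, though that use is untested here.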

From biopython-dev at maubp.freeserve.co.uk Thu Feb 7 04:33:49 2008
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Thu, 7 Feb 2008 09:33:49 +0000
Subject: [BioPython] Alignment add_sequence
In-Reply-To: <200802070925.28882.jblanca@btc.upv.es>
References: <200802061706.08830.jblanca@btc.upv.es>
 <320fb6e00802060820h609d5f10vccba4953455794bb@mail.gmail.com>
 <200802070925.28882.jblanca@btc.upv.es>
Message-ID: <320fb6e00802070133n67a549b5k8868a025f423dc82@mail.gmail.com>

On Feb 7, 2008 8:25 AM, Jose Blanca wrote:
> Hi:
> I think I can't use Bio.SeqIO.to_alignment() because the
> sequences have different lengths and start at different
> positions. It's an EST alignment not a clustal-like one.
> I have also looked at your proposal in bug 1944 and I really
> like it, especially the clever __getitem__ method. But I can't
> use it because of the different lengths of the sequences.
> I'm going to add an add_seqRecord method. Now, thanks to you I
> understand why this is not a good solution. But, at least, it
> will do for this time.

The whole idea behind the current alignment class is that all the
sequences are the same length (often with gaps). I don't think this
fits with your intended usage - unless you pad each record with
leading gap characters (according to its start) and then pad the end
until they are all the same length.

You could write a function to take a list of SeqRecords and pad them
like this (note the example will be easier to read in a mono-spaced
font):

e.g.
CONSENSUS: AGGCCTGAGGCCCCTTTT, start 0
EST1     : CGCAGGCCCGAGGCC, start -3
EST2     : GGCCTGAGGCCCCTT, start 1
EST3     : CTGAGGCCACTTTTTCGC, start 4

In this case we want to add (start+3) gaps to each line, where -3 =
min(starts).
This becomes:

---AGGCCTGAGGCCCCTTTT, start 0
CGCAGGCCCGAGGCC, start -3
----GGCCTGAGGCCCCTT, start 1
-------CTGAGGCCACTTTTTCGC, start 4

Then work out the maximum length, and pad all the sequences with trailing gaps:

---AGGCCTGAGGCCCCTTTT----
CGCAGGCCCGAGGCC----------
----GGCCTGAGGCCCCTT------
-------CTGAGGCCACTTTTTCGC

A little bit of work, but now all the sequences are the same length
and the Biopython alignment class will be happy. As far as I know,
there is nothing for this built into Biopython at the moment.

Could you tell us what your input file looks like (e.g. link to the
file format?)

Peter

From peter at maubp.freeserve.co.uk Thu Feb 7 04:36:34 2008
From: peter at maubp.freeserve.co.uk (Peter)
Date: Thu, 7 Feb 2008 09:36:34 +0000
Subject: [BioPython] [Biopython-dev] Biopython to begin transition to Subversion
In-Reply-To: <617104.88204.qm@web62413.mail.re1.yahoo.com>
References: <320fb6e00802060827p37c0aeabk55fa378a4cb35abf@mail.gmail.com>
 <617104.88204.qm@web62413.mail.re1.yahoo.com>
Message-ID: <320fb6e00802070136r7984d523rcc3c683d8f897431@mail.gmail.com>

On Feb 7, 2008 1:10 AM, Michiel de Hoon wrote:
> I think that the Subversion/CVS issue is separate from our release schedule,
> so I don't think that the transition to Subversion by itself should be a reason
> for a release. However, we can probably make a release soon after the
> transition. I would like to finalize my work on Bio.WWW before making a
> release, but hopefully that won't be too complicated.
>
> --Michiel

You're right, the CVS/SVN migration isn't directly linked - but it's a
nice excuse to get a release out ;)

I'd forgotten you still had the Bio.WWW module to sort out, sorry.
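
As an aside on the EST padding procedure described in the "Alignment add_sequence" reply above, here is a rough sketch of the two steps: shift every sequence right by its start relative to the smallest start, then fill the right-hand side with gaps up to the longest padded length. The function name pad_to_common_length and the (name, sequence, start) tuple layout are assumptions made for illustration only; the thread itself notes that nothing like this is built into Biopython. Plain strings are used instead of SeqRecord objects to keep the example self-contained (Python 3).

```python
def pad_to_common_length(records, gap="-"):
    """Pad (name, sequence, start) tuples so all sequences share one length.

    Leading gaps encode each sequence's start relative to the smallest
    start; trailing gaps fill up to the longest padded sequence.
    """
    offset = min(start for _, _, start in records)
    shifted = [
        (name, gap * (start - offset) + seq) for name, seq, start in records
    ]
    width = max(len(seq) for _, seq in shifted)
    return [(name, seq.ljust(width, gap)) for name, seq in shifted]


# The worked example from the message, with starts 0, -3, 1 and 4.
ests = [
    ("CONSENSUS", "AGGCCTGAGGCCCCTTTT", 0),
    ("EST1", "CGCAGGCCCGAGGCC", -3),
    ("EST2", "GGCCTGAGGCCCCTT", 1),
    ("EST3", "CTGAGGCCACTTTTTCGC", 4),
]

padded = pad_to_common_length(ests)
for name, seq in padded:
    print("%-9s %s" % (name, seq))
```

The padded strings could then be wrapped back into SeqRecord objects before being handed to an alignment class; that step is left out because it depends on the Biopython version in use.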

Peter

From kosa at genesilico.pl Thu Feb 7 09:15:28 2008
From: kosa at genesilico.pl (Jan Kosinski)
Date: Thu, 07 Feb 2008 15:15:28 +0100
Subject: [BioPython] Alignment class
In-Reply-To:
References:
Message-ID: <47AB1280.7040209@genesilico.pl>

Peter wrote:
> The whole idea behind the current alignment class is that all the
> sequences are the same length (often with gaps).

I have always wondered why the alignment class was designed to require
that all sequences have the same length (even if incl. gaps).

Jan Kosinski :.

From biopython at maubp.freeserve.co.uk Thu Feb 7 09:59:46 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 7 Feb 2008 14:59:46 +0000
Subject: [BioPython] Alignment class
In-Reply-To: <47AB1280.7040209@genesilico.pl>
References: <47AB1280.7040209@genesilico.pl>
Message-ID: <320fb6e00802070659w43dca423s19d98ae354e1a121@mail.gmail.com>

On Feb 7, 2008 2:15 PM, Jan Kosinski wrote:
> Peter wrote:
> > The whole idea behind the current alignment class is that all the
> > sequences are the same length (often with gaps).
> I was always wondering what is the reason that you made the alignment
> class which requires all sequences have the same length (even if incl.
> gaps)?

The design of the current alignment class predates my involvement, but
from the point of view of the code (and the column access in
particular) it assumes the sequences have the same length. This
assumption (with leading/trailing gaps) is also common to all the
alignment file formats I have worked with. I like this abstraction as
you can regard the alignment as an array of characters (using matrix
notation or whatever).

I can see that the EST alignment case is a little different, in that
by convention the leading/trailing "gaps" are not shown.
It would be
possible to write a new EST class which stored the sequences without
leading/trailing "gaps", but took into account the start offset, and
would allow access to the "columns", inserting leading/trailing gaps
where a given sequence has not started or has already finished. I
don't see that this would be any more useful (except perhaps for a
small memory saving).

In general leading/trailing gaps can mean the limits of a gene, or the
limit of a domain within a gene, or the limits of a sequenced fragment,
etc. Sometimes there really is no character to go there; in other
cases the sequence concerned does continue, but for whatever reason it
was not included in the alignment.

One possibility (depending on what you want to do with the alignment)
is to use different characters for internal gaps, leading "gaps" and
trailing "gaps".

Peter

From dalke at dalkescientific.com Thu Feb 7 11:09:25 2008
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 7 Feb 2008 17:09:25 +0100
Subject: [BioPython] psyco reference in bioinformatics article
Message-ID: <86A36405-443A-4A76-9352-5C902FB5CAC6@dalkescientific.com>

Does anyone know of a bioinformatics reference that mentions using
psyco for improving performance of a Python program, and which
mentions numbers?

I know of one for medical informatics that mentions numbers

http://www.biomedcentral.com/1472-6947/2/9/
Preparation of name and address data for record linkage using hidden
Markov models
Tim Churches, Peter Christen, Kim Lim, and Justin Xi Zhu
BMC Medical Informatics and Decision Making 2002, 2:9
doi:10.1186/1472-6947-2-9

> one million address records on the PC platform took 14,061 seconds
> (234 minutes), or 5832 seconds (97 minutes) with the Psyco just-in-
> time Python compiler enabled

and an Entrez search finds one for bioinformatics which doesn't
mention numbers

http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=1635261
Nucleic Acids Res. 2006 November; 34(20): 5730-5739.
doi: 10.1093/ nar/gkl585. A graph-based motif detection algorithm models complex nucleotide dependencies in transcription factor binding sites Brian T. Naughton, Eugene Fratkin, Serafim Batzoglou, and Douglas L. Brutlag > MotifScan will use psyco (http://psyco.sourceforge.net) for a > performance > gain, if it is installed. Andrew dalke at dalkescientific.com From jblanca at btc.upv.es Fri Feb 8 10:23:34 2008 From: jblanca at btc.upv.es (Jose Blanca) Date: Fri, 8 Feb 2008 16:23:34 +0100 Subject: [BioPython] Alignment class In-Reply-To: <320fb6e00802070659w43dca423s19d98ae354e1a121@mail.gmail.com> References: <47AB1280.7040209@genesilico.pl> <320fb6e00802070659w43dca423s19d98ae354e1a121@mail.gmail.com> Message-ID: <200802081623.34767.jblanca@btc.upv.es> Hi: I've been thinking a little little on this alignment problem. On Thursday 07 February 2008 15:59:46 Peter wrote: > On Feb 7, 2008 2:15 PM, Jan Kosinski wrote: > > Peter wrote: > > > The whole idea behind the current alignment class is that all the > > > sequences are the same length (often with gaps). > > > > I was always wondering what is the reason that you made the alignment > > class which requires all sequences have the same length (even if incl. > > gaps)? > > The design of the current alignment class predates my involvement, but > from the point of view of the code (and the column access in > particular) it assumes the sequences have the same length. This > assumption (with leading/trailing gaps) is also common to all the > alignment file formats I have worked with. I like this abstraction as > you can regard the alignment as an array of characters (using matrix > notation or what ever). This kind of alignment is useful, but in my opinion it would be better if the sequences could have different lengths and start points. > > I can see that the EST alignment case is a little different, in that > by convention the leading/trailing "gaps" are not shown. 
It would be > possible to write an new EST class which stored the sequences without > leading/trailing "gap"s, but took into account the start offset, and > would allow access to the "columns" inserting leading/trailing gaps > where a given sequence has not started or has already finished. I > don't see that this would be any more useful (except perhaps for a > small memory saving) > > In general leading/trailing gaps can mean the limits of a gene, or the > limit of a domain with an gene, or the limits of a sequenced fragment, > etc. Sometimes there really is no character to go there, in other > cases the sequence concerns does continue but for whatever reason it > was not included in the alignment. > > One possibility (depending on what you want to do with the alignment) > is to use different characters for internal gaps, leading "gaps" and > trailing "gaps". That would be a good solution for the EST case, althogh it could have some memory problems with longer sequences. Anyway I felt like experimenting a bit so I looked at bioperl for inspiration. For this problem they use ranges and LocatableSeqs. I don't know if we need a full featured BioRange class for this problem, I've coded one, but I haven't used. I have coded a draft of a LocatableSeq class and I've done some minimal modifications to the newAlignment proposal from bug 1944 (http://bugzilla.open-bio.org/show_bug.cgi?id=1944). Maybe I should have created an Alignment subclass, but I think the most relevant change is the new LocatableSeq class. This is not a finished work, but it's mostly working and I would like to know your opinions. This is my first atempt to create something in python. I'm ready to learn from you, I will take the suggestions and criticisms with a smile, so don't be shy. I guess that I could have broken some style rules, I hope to learn them with some time and help. Best regards, -- Jose M. 
Blanca Postigo
Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: newAlignment.py
Type: application/x-python
Size: 22982 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: LocatableSeq.py
Type: application/x-python
Size: 8195 bytes
Desc: not available
URL:

From rwbarrette at gmail.com Fri Feb 8 12:07:35 2008
From: rwbarrette at gmail.com (Roger Barrette)
Date: Fri, 8 Feb 2008 12:07:35 -0500
Subject: [BioPython] Clustalw error: .aln not produced
Message-ID: <2af454d50802080907w5c8b8796r7c6b99185bdf650d@mail.gmail.com>

Hey all,

I'm trying to run clustalw from python (windows) using the simple script example below;

import os
from Bio.Clustalw import MultipleAlignCL
from Bio.Clustalw import do_alignment
from sys import *

cline = MultipleAlignCL("c:\\adenotest.fasta")
cline.set_output("c:\\adeno4.aln")
print "Command line: ", cline

align = do_alignment(cline)
for seq in align.get_all_seqs():
    print seq.description
    print seq.seq

This generates the command line "clustalw c:\adenotest.fasta -OUTFILE=c:\adeno4.aln"

However, I continuously get the following error message:

IOError: Output .aln file c:\adeno4.aln not produced, commandline: clustalw c:\adenotest.fasta -OUTFILE=c:\adeno4.aln

I do have the clustalw executable in the path, and when I copy the generated command line for clustalw into the windows command line, it runs fine, and generates the alignment, with no errors.

I updated the clustalw _init_ file, but the error still remains. Any thoughts or suggestions would be greatly appreciated. Thanks.
-Roger ** ** From biopython at maubp.freeserve.co.uk Fri Feb 8 12:17:23 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 8 Feb 2008 17:17:23 +0000 Subject: [BioPython] Clustalw error: .aln not produced In-Reply-To: <2af454d50802080907w5c8b8796r7c6b99185bdf650d@mail.gmail.com> References: <2af454d50802080907w5c8b8796r7c6b99185bdf650d@mail.gmail.com> Message-ID: <320fb6e00802080917y4a8dff0cw4f52559b41fcbb0a@mail.gmail.com> On Feb 8, 2008 5:07 PM, Roger Barrette wrote: > Hey all, > > I'm trying to run clustalw from python (windows) using the simple script > example below; > ... > I do have the clustalw executable in the path, and when I copy the generated > command line for clustalw into the windows command line, it runs fine, and > generates the alignment, with no errors. > > I updated the clustalw _init_ file, but the error still remains. Any > thoughts or suggestions would be greatly appreciated. Thanks. Are you sure you are using the latest Bio/Clustalw/__init__.py from CVS? I would have expected it to try a command line like: clustalw -INFILE=c:\adenotest.fasta -OUTFILE=c:\adeno4.aln What version of clustalw do you have (in case that makes a difference)? Have you tried supplying the full path to the clustalw.exe file? Peter From rwbarrette at gmail.com Fri Feb 8 12:53:05 2008 From: rwbarrette at gmail.com (Roger Barrette) Date: Fri, 8 Feb 2008 12:53:05 -0500 Subject: [BioPython] Clustalw error: .aln not produced In-Reply-To: <320fb6e00802080917y4a8dff0cw4f52559b41fcbb0a@mail.gmail.com> References: <2af454d50802080907w5c8b8796r7c6b99185bdf650d@mail.gmail.com> <320fb6e00802080917y4a8dff0cw4f52559b41fcbb0a@mail.gmail.com> Message-ID: <2af454d50802080953l5d8ffd1j42bf3ff5e5bf80c1@mail.gmail.com> Hi Peter, I'm using the version 1.16 of the clustalw_init_file, and clustalw version 2.0. I notice that when I run clustalw from the windows command line, it generates the adeno4.aln file. 
After this file is generated, the script WILL successfully run from python. It doesn't appear to be able to create a new file when called from the python script, but it will update and modify the existing one. Am I not setting up the files correctly? I'm not sure what you mean by "supply the full path to the clustalw.exefile". I have the location of the executable clustalw.exe described in the system path, and it runs directly from the windows command line, so I would assume it is properly mapped. If you mean the path to the .fasta file and location of the output file for clustalw to use; they are being input directly at the clustalw command, or am I missing something? Thanks. -Roger On 2/8/08, Peter wrote: > > On Feb 8, 2008 5:07 PM, Roger Barrette wrote: > > Hey all, > > > > I'm trying to run clustalw from python (windows) using the simple script > > example below; > > ... > > I do have the clustalw executable in the path, and when I copy the > generated > > command line for clustalw into the windows command line, it runs fine, > and > > generates the alignment, with no errors. > > > > I updated the clustalw _init_ file, but the error still remains. Any > > thoughts or suggestions would be greatly appreciated. Thanks. > > Are you sure you are using the latest Bio/Clustalw/__init__.py from > CVS? I would have expected it to try a command line like: > > clustalw -INFILE=c:\adenotest.fasta -OUTFILE=c:\adeno4.aln > > What version of clustalw do you have (in case that makes a difference)? > Have you tried supplying the full path to the clustalw.exe file? 
> > Peter > From rwbarrette at gmail.com Fri Feb 8 13:18:16 2008 From: rwbarrette at gmail.com (Roger Barrette) Date: Fri, 8 Feb 2008 13:18:16 -0500 Subject: [BioPython] Clustalw error: .aln not produced In-Reply-To: <2af454d50802080953l5d8ffd1j42bf3ff5e5bf80c1@mail.gmail.com> References: <2af454d50802080907w5c8b8796r7c6b99185bdf650d@mail.gmail.com> <320fb6e00802080917y4a8dff0cw4f52559b41fcbb0a@mail.gmail.com> <2af454d50802080953l5d8ffd1j42bf3ff5e5bf80c1@mail.gmail.com> Message-ID: <2af454d50802081018i217e3080j97d6aef897f7934a@mail.gmail.com> Hey Peter, I went and downloaded clustalw version 1.83, and that fixed the problem, so it would appear at least that part of the problem has to do with clustalw 2.0. Thanks. -Roger On 2/8/08, Roger Barrette wrote: > > Hi Peter, > > I'm using the version 1.16 of the clustalw_init_file, and clustalw version > 2.0. > I notice that when I run clustalw from the windows command line, it > generates the adeno4.aln file. After this file is generated, the script > WILL successfully run from python. It doesn't appear to be able to create a > new file when called from the python script, but it will update and modify > the existing one. Am I not setting up the files correctly? > > I'm not sure what you mean by "supply the full path to the clustalw.exefile". I have the location of the executable > clustalw.exe described in the system path, and it runs directly from the > windows command line, so I would assume it is properly mapped. If you mean > the path to the .fasta file and location of the output file for clustalw to > use; they are being input directly at the clustalw command, or am I missing > something? Thanks. > > -Roger > > > On 2/8/08, Peter wrote: > > > > On Feb 8, 2008 5:07 PM, Roger Barrette wrote: > > > Hey all, > > > > > > I'm trying to run clustalw from python (windows) using the simple > > script > > > example below; > > > ... 
> > > I do have the clustalw executable in the path, and when I copy the > > generated > > > command line for clustalw into the windows command line, it runs fine, > > and > > > generates the alignment, with no errors. > > > > > > I updated the clustalw _init_ file, but the error still remains. Any > > > thoughts or suggestions would be greatly appreciated. Thanks. > > > > Are you sure you are using the latest Bio/Clustalw/__init__.py from > > CVS? I would have expected it to try a command line like: > > > > clustalw -INFILE=c:\adenotest.fasta -OUTFILE=c:\adeno4.aln > > > > What version of clustalw do you have (in case that makes a difference)? > > Have you tried supplying the full path to the clustalw.exe file? > > > > Peter > > > > -- Roger William Barrette II, Ph.D Microbiologist USDA /APHIS/ VS/ FADDL Plum Island Animal Disease Center P.O. Box 848, Greenport, NY 11944 631-323-3300 (Lab) 631-323-3200 x4415 (Office) RWBarrette at gmail.com Roger.W.Barrette at APHIS.USDA.GOV From biopython at maubp.freeserve.co.uk Fri Feb 8 14:10:57 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 8 Feb 2008 19:10:57 +0000 Subject: [BioPython] Clustalw error: .aln not produced In-Reply-To: <2af454d50802080942h38b881bdxc39850ce49575f7d@mail.gmail.com> References: <2af454d50802080907w5c8b8796r7c6b99185bdf650d@mail.gmail.com> <320fb6e00802080917y4a8dff0cw4f52559b41fcbb0a@mail.gmail.com> <2af454d50802080942h38b881bdxc39850ce49575f7d@mail.gmail.com> Message-ID: <320fb6e00802081110q31bd45ccv7bca1f1fa31d4595@mail.gmail.com> Hi Robin, > I'm using the version 1.16 of the clustalw_init_file, That is the latest revision of Bio/Clustalw/__init__.py in CVS, good. > and clustalw version 2.0. I haven't tried that, only version 1.83 I think. This could be the problem, but you did say that the command line work when run by hand. I might have time to check the new version this weekend... > I'm not sure what you mean by "supply the full path to the clustalw.exe > file". 
I have the location of the executable clustalw.exe described in the > system path, and it runs directly from the windows command line, so I would > assume it is properly mapped. I meant rather than trusting Windows will find the executable on the system path, try specifying it in full, e.g. C:\Program Files\Clustal\Clustalw.exe Peter From hlapp at gmx.net Wed Feb 13 20:54:28 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 14 Feb 2008 10:54:28 +0900 Subject: [BioPython] [BioSQL-l] update DBSeqRecords In-Reply-To: <47B32040.1040400@ucd.ie> References: <47B2BAD7.9000109@ucd.ie> <6B2BB40A-8F6A-4757-8A3E-944759298144@gmx.net> <47B32040.1040400@ucd.ie> Message-ID: <69A88D1B-4462-4669-BA44-FFF869947437@gmx.net> Andreas - I really don't know anything about Biopython (but many others on the list may, especially the Biopython list, which I'm cc'ing too). So - I'm passing this on to Biopythonians to respond. -hilmar On Feb 14, 2008, at 1:52 AM, Andreas De Stefani wrote: > Thanks Hilmar, > > I kind of figured this and i am just using the adaptor to execute > the sql statement to delete the entry. > I also noticed that i cannot access all the information via > biopython/biosql, i would like to show the comments for each entry > but i cant find any attribute in the DBSeqRecord to access this > information. Is this something which will be added in the near future? > > My workaround is to use the adaptor from the record and just > execute a sql query ... but that might not be the ideal way to do it!? > > thanks again, > > Andreas > > > > Hilmar Lapp wrote: >> As Peter says this is easily possible, simply delete the sequence >> (protein) first that you want to update and then reload it. >> >> This is also called the 'refresh' mode of updating. >> >> -hilmar >> >> On Feb 13, 2008, at 6:39 PM, Andreas De Stefani wrote: >> >>> Hi Guys, >>> >>> I was wondering if it is possible to update a single DBSeqRecord, >>> without having to delete the whole sub datbase first... 
>>> >>> I am using BioPython and BioSQL and what I intend todo is to >>> create a local "cache" for protein informations which i get from >>> the web, and after a month or so i would like to re-fetch the >>> info from the web and update the local protein information >>> "cache" (which uses BioSQL). >>> >>> It basically will work like this: >>> >>> if the user requests information for a certain protein the >>> program queries the local DB using the accession number and sees >>> if there is information about the protein, if not (or if the >>> protein is expired, ie older than a month) it gets the info from >>> the web (expasy) and loads (updates the protein information in) >>> the local database. However, is a update of a single protein >>> entry possible? when inserting the same protein i get the >>> following error: >>> >>> (, IntegrityError(1062, >>> "Duplicate entry 'P08317-1-0' for key 2"), >> 0xd6b170>) >>> >>> i am just using db.load(...) again, but maybe there is another >>> way to update entries? >>> >>> Hope somebody can help me with this, thanks very much in advance! 
>>> >>> Andy >>> _______________________________________________ >>> BioSQL-l mailing list >>> BioSQL-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biosql-l >> > > > -- > Biontrack - bioinformatics solutions > e: andreas.destefani at biontrack.com > w: www.biontrack.com > t: +353 (0)1 716 3760 > f: +353 (0)1 716 3709 > m: +353 85 141 9941 -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From ULNJUJERYDIX at spammotel.com Thu Feb 14 01:02:18 2008 From: ULNJUJERYDIX at spammotel.com (Kevin Lam) Date: Thu, 14 Feb 2008 14:02:18 +0800 Subject: [BioPython] retrieve sequence coordinates of exons for a stretch of genes Message-ID: <5b6410e0802132202n6dc2a828x6d29e28fcb81e17b@mail.gmail.com> Hi I have been scouring through the web for something I thought was a rather simple task but I can't find the answer. How do I get the sequence coordinates for exons of genes in a stretch of genome demarcated by say HoxA13 and Hox A1 ? below is the example of the data I am looking for. 1026087..1026688 1026807..1026834 1026839..1027045 HOXD12 1033641..1034421 1035192..1035427 1035428..1035873 HOXD11 From p.j.a.cock at googlemail.com Thu Feb 14 06:01:14 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 14 Feb 2008 11:01:14 +0000 Subject: [BioPython] retrieve sequence coordinates of exons for a stretch of genes In-Reply-To: <5b6410e0802132202n6dc2a828x6d29e28fcb81e17b@mail.gmail.com> References: <5b6410e0802132202n6dc2a828x6d29e28fcb81e17b@mail.gmail.com> Message-ID: <320fb6e00802140301s111a08famecd9a23b74aba1aa@mail.gmail.com> Hi Kevin, Where do you normally get your genomes from? I am most familiar with the NCBI formats, so I would start by examining the GenBank file for the relevant genome. 
Have a look by hand first - it may well have features for these genes, and in particular a CDS feature which marks out the introns/exons for you. Biopython will read GenBank files, although I would say dealing with the locations via the SeqFeature object is a little fiddly... have a look at the main documentation and also perhaps http://www2.warwick.ac.uk/go/peter_cock/python/genbank/ Peter On Thu, Feb 14, 2008 at 6:02 AM, Kevin Lam wrote: > Hi > I have been scouring through the web for something I thought was a rather > simple task but I can't find the answer. > > How do I get the sequence coordinates for exons of genes in a stretch of > genome demarcated by say HoxA13 and Hox A1 ? > > below is the example of the data I am looking for. > > 1026087..1026688 1026807..1026834 1026839..1027045 HOXD12 > 1033641..1034421 1035192..1035427 1035428..1035873 HOXD11 > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From tiagoantao at gmail.com Thu Feb 14 06:20:40 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 14 Feb 2008 11:20:40 +0000 Subject: [BioPython] retrieve sequence coordinates of exons for a stretch of genes In-Reply-To: <5b6410e0802132202n6dc2a828x6d29e28fcb81e17b@mail.gmail.com> References: <5b6410e0802132202n6dc2a828x6d29e28fcb81e17b@mail.gmail.com> Message-ID: <6d941f120802140320v14ef63d6h755238b26459f01@mail.gmail.com> On Thu, Feb 14, 2008 at 6:02 AM, Kevin Lam wrote: > I have been scouring through the web for something I thought was a rather > simple task but I can't find the answer. > > How do I get the sequence coordinates for exons of genes in a stretch of > genome demarcated by say HoxA13 and Hox A1 ? > > below is the example of the data I am looking for. 
> > 1026087..1026688 1026807..1026834 1026839..1027045 HOXD12 > 1033641..1034421 1035192..1035427 1035428..1035873 HOXD11 Have a look at the UCSC Genome Browser http://genome.cse.ucsc.edu/cgi-bin/hgTables on the table knownGene you have things like lists of exonStarts and exonEnds. I would like, in the long run, to support this in biopython (I have python code which I can share), but this won't happen in the next few months for sure (unless it is some sort of team work...). From sdavis2 at mail.nih.gov Thu Feb 14 06:26:05 2008 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 14 Feb 2008 06:26:05 -0500 Subject: [BioPython] retrieve sequence coordinates of exons for a stretch of genes In-Reply-To: <5b6410e0802132202n6dc2a828x6d29e28fcb81e17b@mail.gmail.com> References: <5b6410e0802132202n6dc2a828x6d29e28fcb81e17b@mail.gmail.com> Message-ID: <264855a00802140326j68f49ddbo339d1906c15b2844@mail.gmail.com> On Thu, Feb 14, 2008 at 1:02 AM, Kevin Lam wrote: > Hi > I have been scouring through the web for something I thought was a rather > simple task but I can't find the answer. > > How do I get the sequence coordinates for exons of genes in a stretch of > genome demarcated by say HoxA13 and Hox A1 ? > > below is the example of the data I am looking for. > > 1026087..1026688 1026807..1026834 1026839..1027045 HOXD12 > 1033641..1034421 1035192..1035427 1035428..1035873 HOXD11 UCSC and Ensembl both offer simple tools for doing this sort of thing. In UCSC, they call it the "table browser", while in Ensembl, they call it ensmart. Both allow you to specify a region and get various interesting pieces of information from those regions. I would look at those two interfaces, as they will do what you need. Alternatively, both offer open MySQL access to the underlying databases. Of course, this assumes that the organism that you are interested in is available in UCSC and/or Ensembl. If you need more details, feel free to ask.... 
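As a rough illustration of what working with the knownGene table mentioned in this thread could look like: its exonStarts and exonEnds columns hold comma-separated lists that can be zipped into (start, end) pairs. The sketch below assumes the two values have already been fetched as strings (from the Table Browser or the public MySQL server) and that the lists may carry a trailing comma. The function names are hypothetical, and the sample numbers in the usage note are simply the ones from the original question, reused as input.

```python
def exon_intervals(exon_starts, exon_ends):
    """Pair comma-separated exonStarts/exonEnds strings into (start, end) tuples."""
    # Skip empty fields so a trailing comma does not produce a bogus entry.
    starts = [int(x) for x in exon_starts.split(",") if x.strip()]
    ends = [int(x) for x in exon_ends.split(",") if x.strip()]
    if len(starts) != len(ends):
        raise ValueError("exonStarts and exonEnds differ in length")
    return list(zip(starts, ends))


def format_exons(gene, intervals):
    """Render intervals in the 'start..end start..end GENE' layout of the question."""
    return " ".join("%d..%d" % pair for pair in intervals) + " " + gene
```

For example, format_exons("HOXD12", exon_intervals("1026087,1026807,1026839,", "1026688,1026834,1027045,")) gives back the first line of the requested output. UCSC tables store start positions 0-based, so add 1 to each start if 1-based GenBank-style coordinates are wanted.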
Sean

From pet85 at libero.it Mon Feb 18 17:19:29 2008
From: pet85 at libero.it (Crivellaro Patrizia)
Date: Mon, 18 Feb 2008 23:19:29 +0100
Subject: [BioPython] Fwd:
Message-ID:

Does someone know how to save a sequence in FASTA format not as a text file .txt but as a file .fasta?
Thank you very, very much!

From biopython at maubp.freeserve.co.uk Mon Feb 18 19:04:35 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 19 Feb 2008 00:04:35 +0000
Subject: [BioPython] Fwd:
In-Reply-To:
References:
Message-ID: <320fb6e00802181604q476d860fx9f8e53d7bfa00174@mail.gmail.com>

On 2/18/08, Crivellaro Patrizia wrote:
> Does someone know how to save a sequence in FASTA format not
> as a text file .txt but as a file .fasta?
> Thank you very, very much!

How have you got your sequences in the first place? How about something very simple like:

name = "Test"
seq = "ATAGACTACGCATACGACT"
handle = open("example.fasta", "w")
handle.write(">%s\n%s\n" % (name, seq))
handle.close()

Maybe you should read the Biopython tutorial or http://biopython.org/wiki/SeqIO for more ideas?
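A slightly fuller sketch of the same idea, using only the standard library. The file extension is simply part of the name passed to open(), so nothing special is needed to get a .fasta file. The helper name write_fasta and the default line width of 60 are assumptions made for this illustration; it is not a Biopython function.

```python
def write_fasta(path, name, seq, width=60):
    """Write a single sequence to `path` in FASTA format (hypothetical helper)."""
    with open(path, "w") as handle:
        # Header line: '>' followed by the record name.
        handle.write(">%s\n" % name)
        # Split the sequence into lines of at most `width` characters.
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")
```

Calling write_fasta("example.fasta", "Test", "ATAGACTACGCATACGACT") would produce the same two-line file as the snippet above; longer sequences are wrapped over several lines.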
Peter

From smriti.sebastuan at gmail.com Mon Feb 18 23:59:30 2008
From: smriti.sebastuan at gmail.com (smriti Sebastian)
Date: Tue, 19 Feb 2008 10:29:30 +0530
Subject: [BioPython] Parser
Message-ID: <22c5c6390802182059w4b11d66bi1a0a76c5d4c7f018@mail.gmail.com>

Hi,
Can anyone please help me parse the description part of PSI Blast output? When I use the description method I get an error saying there is no such attribute. Thanks in advance.

From biopython at maubp.freeserve.co.uk Tue Feb 19 04:55:07 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 19 Feb 2008 09:55:07 +0000
Subject: [BioPython] Parser
In-Reply-To: <22c5c6390802182059w4b11d66bi1a0a76c5d4c7f018@mail.gmail.com>
References: <22c5c6390802182059w4b11d66bi1a0a76c5d4c7f018@mail.gmail.com>
Message-ID: <320fb6e00802190155i77354f6el17a44f86f114ad83@mail.gmail.com>

On 2/19/08, smriti Sebastian wrote:
> Hi,
> Can anyone please help me parse the description part of PSI Blast
> output? When I use the description method I get an error saying there is no
> such attribute. Thanks in advance.

Could you show us your code, and the full error message? The BLAST examples in the tutorial should be helpful...

Peter

From biopython at maubp.freeserve.co.uk Tue Feb 19 19:09:09 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 20 Feb 2008 00:09:09 +0000
Subject: [BioPython] Parser
In-Reply-To: <22c5c6390802191000h71e37ba7ha545e4b259797fb5@mail.gmail.com>
References: <22c5c6390802182059w4b11d66bi1a0a76c5d4c7f018@mail.gmail.com> <320fb6e00802190155i77354f6el17a44f86f114ad83@mail.gmail.com> <22c5c6390802191000h71e37ba7ha545e4b259797fb5@mail.gmail.com>
Message-ID: <320fb6e00802191609t4689a819m1a50bd09dc152fdb@mail.gmail.com>

Hi smriti.

That does look helpful. Assuming it's not too big, could you email me the psi_out file (off the list to avoid clogging up everyone's email). Once we sort this out, it would be a good idea for us to update the PSI Blast section of the tutorial...
Peter

On Feb 19, 2008 6:00 PM, smriti Sebastian wrote:
>
> Hi ,
> My code is like this:
>
> #!usr/bin/python
> fh=open('psi_out','r')
> import Bio
> from Bio.Blast import *
> import Bio.Blast.NCBIStandalone
> b_parser=Bio.Blast.NCBIStandalone.PSIBlastParser()
> b_record=b_parser.parse(fh)
>
> E_VALUE_THRESH=0.04
> for line in b_record.rounds:
>
>     for record in line.descriptions:
>         print record
>
> My error:
>
> Traceback (most recent call last):
>   File "parse_psi.py", line 12, in
>     for record in line.descriptions:
> AttributeError: Round instance has no attribute 'descriptions'

Whatever the "line" object does, it seems it doesn't have a "descriptions" attribute. What does dir(line) give?

Peter

From ivan at biodec.com Wed Feb 20 04:31:30 2008
From: ivan at biodec.com (Ivan Rossi)
Date: Wed, 20 Feb 2008 10:31:30 +0100 (CET)
Subject: [BioPython] plone4bio project starts
Message-ID:

Dear list members,

We are pleased to announce a new Plone project: plone4bio. As the name suggests, it is intended as a set of products to do bioinformatics within the Plone CMS. Plone4Bio takes advantage of Biopython.

What is plone4bio

The rationale of the plone4bio project is to provide an integrated environment where it is possible to manage and analyze biological sequences. The plone4bio package provides the possibility to add a new plone content type, called sequence, that can be either written by hand or imported from a FASTA file, and to apply to that sequence a program, called predictor, that gives back a plot of predicted probabilities for the sequence to have a given property (the property that the predictor tries to determine). Thus a predictor can try to assess if a protein sequence is trans-membrane, whether a signal peptide exists, and so on.

plone4bio.base

The plone4bio.base is a package that defines a skeleton predictor: deriving from that it is possible to integrate any other application and visualize all the results together.
biocomp.pscoils This is an example predictor, encapsulating the pscoils algorithm by Fariselli et al. available at http://www.biocomp.unibo.it/ It is intended both as an example on how to integrate one's own predictor in the plone4bio framework. Requirements 1. python2.4 2. python setup tools (the python-setuptools Debian package) 3. biopython 4. PIL Download and Project page The software is available at http://www.plone4bio.org Further information Available either through the web site (plone4bio.org) or subscribing to the mailing list (p4b at biodec dot com) For installation and documentation issues refer to README.txt and INSTALL.txt files from the archive, or the script published on the plone4bio wiki site. plone4bio is published under the GPL license. This product is produced independently from the product Plone, and carries no guarantee from the Plone Foundation about quality, suitability or anything else. The supplier of this product assumes all responsibility for it. -- Ivan Rossi, PhD - ivan AT biodec dot com - ivan dot rossi3 AT unibo dot it BioDec Srl, Via Calzavecchio 20/2, 40033 Casalecchio di Reno (BO), Italy Phone: (+39)-051-0548263 - Fax: (+39)-051-7459582 - http://www.biodec.com From biopython at maubp.freeserve.co.uk Wed Feb 20 07:22:48 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 20 Feb 2008 12:22:48 +0000 Subject: [BioPython] Parser In-Reply-To: <22c5c6390802192340n2f4cab8vb767eea6f098cac5@mail.gmail.com> References: <22c5c6390802182059w4b11d66bi1a0a76c5d4c7f018@mail.gmail.com> <320fb6e00802190155i77354f6el17a44f86f114ad83@mail.gmail.com> <22c5c6390802191000h71e37ba7ha545e4b259797fb5@mail.gmail.com> <320fb6e00802191609t4689a819m1a50bd09dc152fdb@mail.gmail.com> <22c5c6390802192340n2f4cab8vb767eea6f098cac5@mail.gmail.com> Message-ID: <320fb6e00802200422o597c8f56r8bfec5d5d3c9b035@mail.gmail.com> On Wed, Feb 20, 2008 at 7:40 AM, smriti Sebastian wrote: > dir(line) is not giving descriptions in it.But if we check the > 
NCBIStandalone.py file it has an attribute called descriptions.
> I am attaching the file

The b_record object is a Bio.Blast.Record.PSIBlast instance, which has different attributes to the "normal blast" object. In particular, the list "rounds" of Bio.Blast.Record.Round objects, and the boolean/integer "converged". Try:

help(Bio.Blast.Record.PSIBlast)
help(Bio.Blast.Record.Round)

I'm not sure exactly what you want to achieve, but perhaps something like this would be a start:

#!usr/bin/python
fh=open('psi_out','r')
import Bio
from Bio.Blast import *
import Bio.Blast.NCBIStandalone
b_parser=Bio.Blast.NCBIStandalone.PSIBlastParser()
b_record=b_parser.parse(fh)

E_VALUE_THRESH=0.04
for round in b_record.rounds:
    print round
    for aln in round.alignments :
        print aln

Peter

From lueck at ipk-gatersleben.de Thu Feb 21 09:50:58 2008
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Thu, 21 Feb 2008 15:50:58 +0100
Subject: [BioPython] write a genbank file
Message-ID: <004501c87499$2b461290$1022a8c0@ipkgatersleben.de>

Hi!

Can someone give me an example of how I can write a new GenBank file in Python? I want to run a BLAST search and use the location of the match as a feature in a GenBank file (and finally work on it in DNA Star). Is it at all possible?

Thanks in advance!
Stefanie

From hlapp at gmx.net Thu Feb 21 22:21:19 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 21 Feb 2008 22:21:19 -0500
Subject: [BioPython] BioSQL documentation for Biopython
Message-ID: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net>

Hi all,

there is some Biopython-related documentation of BioSQL and using Biopython's language binding within the BioSQL codebase:

http://code.open-bio.org/svnweb/index.cgi/biosql/browse/biosql-schema/trunk/doc/biopython

This is about 5 years old (since it has been last updated by Brad Chapman), according to the svn log.

Could some Biopythonist check this material whether or not it still has any relevance, and whether there are any errors?
I'll be releasing within the next couple of days, so if this is outdated I'd like to remove it from (at least) the release branch. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From ULNJUJERYDIX at spammotel.com Fri Feb 22 02:01:48 2008 From: ULNJUJERYDIX at spammotel.com (Kevin Lam) Date: Fri, 22 Feb 2008 15:01:48 +0800 Subject: [BioPython] **Fwd: write a genbank file In-Reply-To: <004501c87499$2b461290$1022a8c0@ipkgatersleben.de> References: <004501c87499$2b461290$1022a8c0@ipkgatersleben.de> Message-ID: <5b6410e0802212301q61133365tfc8e819c825d9e09@mail.gmail.com> http://www.embl-heidelberg.de/~chenna/PySAT/ might be helpful From sbassi at gmail.com Fri Feb 22 18:45:19 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Fri, 22 Feb 2008 21:45:19 -0200 Subject: [BioPython] BioSQL documentation for Biopython In-Reply-To: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net> References: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net> Message-ID: On Fri, Feb 22, 2008 at 1:21 AM, Hilmar Lapp wrote: > Could some Biopythonist check this material whether or not it still > has any relevance, and whether there are any errors? I've tried to import the schema (biosqldb-mysql.sql) into my MySQL server (5.0.45-Debian_1ubuntu3-log) and got this: Error SQL query: -- CONFIG: you may want to add this for mysql because MySQL often is broken -- with respect to using the composite index for the initial keys - - CREATE INDEX ontrel_subjectid ON term_relationship( subject_term_id ); MySQL said: #1064 - You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '--CREATE INDEX ontrel_subjectid ON term_relationship(subject_term_id)' at line 1 So I tried with "compatibility with SQL323" (option in the phpmyadmin), but got the same result. 
--
Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6
Bioinformatics news: http://www.bioinformatica.info
Tutorial libre de Python: http://tinyurl.com/2az5d5

From hlapp at gmx.net Fri Feb 22 21:41:52 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Fri, 22 Feb 2008 21:41:52 -0500
Subject: [BioPython] BioSQL documentation for Biopython
In-Reply-To:
References: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net>
Message-ID: <16F88C8D-70DE-4020-B97D-B3D43AF530AF@gmx.net>

Interesting. Apparently MySQL needs a space after the double-dash
comment prefix:

    In MySQL, the "-- " (double-dash) comment style requires the second
    dash to be followed by at least one whitespace or control character
    (such as a space, tab, newline, and so on). This syntax differs
    slightly from standard SQL comment syntax [...].

If you add that space, i.e., change the line:

--CREATE INDEX ontrel_subjectid ON term_relationship(subject_term_id);

to

-- CREATE INDEX ontrel_subjectid ON term_relationship(subject_term_id);

and do the same for all other lines where this occurs, does it work
then?

I've also updated the MySQL schema on svn:

http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/sql/biosqldb-mysql.sql

BTW the reason this hasn't come up before is that most everyone uses the
mysql command line client to instantiate the schema, which ignores lines
starting with '--'.

-hilmar

On Feb 22, 2008, at 6:45 PM, Sebastian Bassi wrote:

> On Fri, Feb 22, 2008 at 1:21 AM, Hilmar Lapp wrote:
>> Could some Biopythonist check this material whether or not it still
>> has any relevance, and whether there are any errors?
> > I've tried to import the schema (biosqldb-mysql.sql) into my MySQL > server (5.0.45-Debian_1ubuntu3-log) and got this: > > Error > SQL query: > -- CONFIG: you may want to add this for mysql because MySQL often > is broken > -- with respect to using the composite index for the initial keys > - - CREATE INDEX ontrel_subjectid ON term_relationship( > subject_term_id > ); > > MySQL said: > #1064 - You have an error in your SQL syntax; check the manual that > corresponds to your MySQL server version for the right syntax to use > near '--CREATE INDEX ontrel_subjectid ON > term_relationship(subject_term_id)' at line 1 > > So I tried with "compatibility with SQL323" (option in the > phpmyadmin), but got the same result. > > > -- > Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 > Bioinformatics news: http://www.bioinformatica.info > Tutorial libre de Python: http://tinyurl.com/2az5d5 > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From sbassi at gmail.com Fri Feb 22 22:17:36 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Sat, 23 Feb 2008 01:17:36 -0200 Subject: [BioPython] BioSQL documentation for Biopython In-Reply-To: <16F88C8D-70DE-4020-B97D-B3D43AF530AF@gmx.net> References: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net> <16F88C8D-70DE-4020-B97D-B3D43AF530AF@gmx.net> Message-ID: > If you add that space, i.e., change the line: > --CREATE INDEX ontrel_subjectid ON term_relationship(subject_term_id); > to > -- CREATE INDEX ontrel_subjectid ON term_relationship(subject_term_id); .... > and do the same for all other lines where this occurs, does it work > then? Yes, I did it and now it works. So I will keep on testing the documentation. 
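
A minimal sketch of the fix described above, for anyone loading the schema through a tool that does not skip `--` lines by itself. The helper name `space_after_double_dash` and the sample text are illustrative assumptions, not part of BioSQL; the function only touches lines that begin with `--` followed directly by a non-blank character.

```python
import re

# A line whose first non-blank characters are "--" immediately followed by
# a non-whitespace character, e.g. "--CREATE INDEX ...".
_TIGHT_COMMENT = re.compile(r"^(\s*)--(?=\S)")


def space_after_double_dash(sql_text):
    """Return sql_text with a space added after a leading "--" where it is missing."""
    fixed = []
    for line in sql_text.splitlines(keepends=True):
        fixed.append(_TIGHT_COMMENT.sub(r"\1-- ", line))
    return "".join(fixed)


if __name__ == "__main__":
    # Hypothetical excerpt in the style of the schema file discussed above.
    sample = (
        "-- CONFIG: optional index\n"
        "--CREATE INDEX ontrel_subjectid ON term_relationship(subject_term_id);\n"
        "CREATE TABLE t (id INT);\n"
    )
    print(space_after_double_dash(sample), end="")
```

Lines that already read `-- ...`, and ordinary statements, pass through unchanged, so running it twice on the same file should be harmless.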
> I've also updated the MySQL schema on svn:
> http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/sql/biosqldb-mysql.sql

I've just checked it out (via web) and it is still the same as before.
Maybe the web version is delayed?

BTW, when I click on "Blame/Annotate" in the web SVN
(http://code.open-bio.org/svnweb/index.cgi/biosql/blame/biosql-schema/trunk/sql/biosqldb-mysql.sql),
I get this:

An error occured

Error string not specified yet: Can't find a temporary directory: Error
string not specified yet at
/usr/lib/perl5/site_perl/5.8.8/SVN/Web/Blame.pm line 146

Maybe an issue with the installation.

--
Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6
Bioinformatics news: http://www.bioinformatica.info
Tutorial libre de Python: http://tinyurl.com/2az5d5

From hlapp at gmx.net Fri Feb 22 22:28:30 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Fri, 22 Feb 2008 22:28:30 -0500
Subject: [BioPython] BioSQL documentation for Biopython
In-Reply-To:
References: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net> <16F88C8D-70DE-4020-B97D-B3D43AF530AF@gmx.net>
Message-ID: <219760B8-1D60-4549-9151-9C1ECB46FE4B@gmx.net>

On Feb 22, 2008, at 10:17 PM, Sebastian Bassi wrote:

> I've just checked it out (via web) and it is still the same as before.
> Maybe the web version is delayed?

Yes, sorry, that was my mistake. The URL was to the anonymous access
mirror, which gets updated only every hour or so.

> BTW, when I click on "Blame/Annotate" in the web SVN
> [...] I get this:
>
> An error occured

Yes, I know. I don't know what the issue is but I've reported it earlier
to support at open-bio.org.

Thanks for your help and for reporting the issue.
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From sbassi at gmail.com Sat Feb 23 00:49:34 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Sat, 23 Feb 2008 03:49:34 -0200 Subject: [BioPython] BioSQL documentation for Biopython In-Reply-To: <219760B8-1D60-4549-9151-9C1ECB46FE4B@gmx.net> References: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net> <16F88C8D-70DE-4020-B97D-B3D43AF530AF@gmx.net> <219760B8-1D60-4549-9151-9C1ECB46FE4B@gmx.net> Message-ID: On Sat, Feb 23, 2008 at 1:28 AM, Hilmar Lapp wrote: > Yes, sorry that was my mistake. The URL was to the anonymous access > mirror, which gets updated only every hour or so. Thats OK. Here is my next report: There is a part here: "For this example, we are going to assume we have a GenBank file on our computer called cor6_6.gb that we are going to work with." I think the tutorial should state that the cor6_6.gb is included with Biopython (under Test/Genbank). Also a link to the file won't hurt. 
When I tried to follow the step by step guide, I found this error (I am
using Biopython 1.44):

>>> from BioSQL import BioSeqDatabase
>>> server=BioSeqDatabase.open_database(driver = "MySQLdb", user = "", passwd="", host = "localhost", db = "bioseqdb")
>>> db = server.new_database("cold")
>>> from Bio import GenBank
>>> parser = GenBank.FeatureParser()
>>> iterator = GenBank.Iterator(open("cor6_6.gb"), parser)
>>> db.load(iterator)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "BioSQL/BioSeqDatabase.py", line 414, in load
    db_loader.load_seqrecord(cur_record)
  File "BioSQL/Loader.py", line 30, in load_seqrecord
    bioentry_id = self._load_bioentry_table(record)
  File "BioSQL/Loader.py", line 250, in _load_bioentry_table
    version))
  File "BioSQL/BioSeqDatabase.py", line 277, in execute
    self.cursor.execute(sql, args or ())
  File "/var/lib/python-support/python2.5/MySQLdb/cursors.py", line 151, in execute
    query = query % db.literal(args)
TypeError: not all arguments converted during string formatting

Should I download the BioSQL from CVS?

--
Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6
Bioinformatics news: http://www.bioinformatica.info
Tutorial libre de Python: http://tinyurl.com/2az5d5

From hlapp at gmx.net Sat Feb 23 12:58:14 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 23 Feb 2008 12:58:14 -0500
Subject: [BioPython] BioSQL documentation for Biopython
In-Reply-To:
References: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net> <16F88C8D-70DE-4020-B97D-B3D43AF530AF@gmx.net> <219760B8-1D60-4549-9151-9C1ECB46FE4B@gmx.net>
Message-ID: <87D56EF5-9BC8-4401-8A09-2BB3104BE1CE@gmx.net>

On Feb 23, 2008, at 12:49 AM, Sebastian Bassi wrote:

> File "/var/lib/python-support/python2.5/MySQLdb/cursors.py", line
> 151, in execute
> query = query % db.literal(args)
> TypeError: not all arguments converted during string formatting
>
> Should I download the BioSQL from CVS?

You mean from SVN, probably?
I don't know but it seems to me that problem is in some (Bio)Python code? I.e., (re-)downloading BioSQL from anonymous SVN would only update the schema, and there was no update, so I can't imagine how that would help. Or did you mean (re-)downloading the BioSQL bindings from Biopython? That would be a question for the Biopython folks (I actually don't use Biopython). -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From srikrishnamohan at gmail.com Sat Feb 23 13:07:45 2008 From: srikrishnamohan at gmail.com (km) Date: Sat, 23 Feb 2008 23:37:45 +0530 Subject: [BioPython] Bio.SCOP problem Message-ID: Hi all, I have a problem with Bio.SCOP module in BioPython There is absolutely no documentation for Bio.SCOP module and looking at the source code, I found a way to load scop parseable files (from astral db) to get the domain information represented as attributes of scop object (Bio.SCOP.Scop (...)) Now the problem is that each domain shows the parent to be of None type object !!! How do I traverse thru the hierarchy ? ie ., given a domain, how do i Know which fold it belongs to and corresponding family and class ?? any hints ? Am i missing something ? regards, KM From sbassi at gmail.com Sat Feb 23 14:07:47 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Sat, 23 Feb 2008 17:07:47 -0200 Subject: [BioPython] BioSQL documentation for Biopython In-Reply-To: <87D56EF5-9BC8-4401-8A09-2BB3104BE1CE@gmx.net> References: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net> <16F88C8D-70DE-4020-B97D-B3D43AF530AF@gmx.net> <219760B8-1D60-4549-9151-9C1ECB46FE4B@gmx.net> <87D56EF5-9BC8-4401-8A09-2BB3104BE1CE@gmx.net> Message-ID: On Sat, Feb 23, 2008 at 3:58 PM, Hilmar Lapp wrote: > You mean from SVN, probably? I don't know but it seems to me that > problem is in some (Bio)Python code? 
Yes, the problem was that I was using 1.44 biopython without the new BioSQL code from Peter. Biopython repository is still in CVS, not SVN (at least biopython is not listed here: http://code.open-bio.org/svnweb/index.cgi/) Now with the new code, I could reproduce the tutorial, up to here: >>> from BioSQL import BioSeqDatabase >>> server=BioSeqDatabase.open_database(driver = "MySQLdb", user = "X",passwd="X", host = "localhost", db = "bioseqdb") >>> db = server.new_database("cold") >>> from Bio import GenBank >>> parser = GenBank.FeatureParser() >>> iterator = GenBank.Iterator(open("cor6_6.gb"), parser) >>> db.load(iterator) 6 But when I look into the mysql, there is no new record!. The "6" is supposed to be the number of records loaded into the database. But my database is empty (it has the schema, but w/o data). > That would be a question for the Biopython folks (I actually don't > use Biopython). I am copying this into biopython and biopython-dev mailing list. -- Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 Bioinformatics news: http://www.bioinformatica.info Tutorial libre de Python: http://tinyurl.com/2az5d5 From hlapp at gmx.net Sat Feb 23 14:20:35 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 23 Feb 2008 14:20:35 -0500 Subject: [BioPython] BioSQL documentation for Biopython In-Reply-To: References: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net> <16F88C8D-70DE-4020-B97D-B3D43AF530AF@gmx.net> <219760B8-1D60-4549-9151-9C1ECB46FE4B@gmx.net> <87D56EF5-9BC8-4401-8A09-2BB3104BE1CE@gmx.net> Message-ID: On Feb 23, 2008, at 2:07 PM, Sebastian Bassi wrote: > Now with the new code, I could reproduce the tutorial, up to here: > >>>> from BioSQL import BioSeqDatabase >>>> server=BioSeqDatabase.open_database(driver = "MySQLdb", user = > "X",passwd="X", host = "localhost", db = "bioseqdb") >>>> db = server.new_database("cold") >>>> from Bio import GenBank >>>> parser = GenBank.FeatureParser() >>>> iterator = GenBank.Iterator(open("cor6_6.gb"), 
parser) >>>> db.load(iterator) > 6 > > But when I look into the mysql, there is no new record!. The "6" is > supposed to be the number of records loaded into the database. But my > database is empty (it has the schema, but w/o data). I.e., there is no error from the db.load() command, just no data? Does the Biopython binding enable or disable auto-commit? If the latter (which would be the Right Thing(tm) to do), you will have to commit the transaction. (Obviously I don't know what the API method would be for this, but db.commit() might be a good start.) BioSQL uses InnoDB on MySQL, and hence will be transactional unless you make the language's db driver to auto-commit. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From sbassi at gmail.com Sat Feb 23 14:50:50 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Sat, 23 Feb 2008 17:50:50 -0200 Subject: [BioPython] BioSQL documentation for Biopython In-Reply-To: References: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net> <16F88C8D-70DE-4020-B97D-B3D43AF530AF@gmx.net> <219760B8-1D60-4549-9151-9C1ECB46FE4B@gmx.net> <87D56EF5-9BC8-4401-8A09-2BB3104BE1CE@gmx.net> Message-ID: On Sat, Feb 23, 2008 at 5:20 PM, Hilmar Lapp wrote: > I.e., there is no error from the db.load() command, just no data? Yes, there was no error, the only response was "6". > Does the Biopython binding enable or disable auto-commit? If the > latter (which would be the Right Thing(tm) to do), you will have to Yes, when working with MySQLdb, it does not auto-commit. You have to do DB_HANDLE.commit(). 
There is no commit method in db: >>> dir(db) ['__doc__', '__getitem__', '__init__', '__module__', '__repr__', 'adaptor', 'dbid', 'get_PrimarySeq_stream', 'get_Seq_by_acc', 'get_Seq_by_id', 'get_Seq_by_primary_id', 'get_Seq_by_ver', 'get_Seqs_by_acc', 'get_all_primary_ids', 'items', 'keys', 'load', 'lookup', 'name', 'values'] > BioSQL uses InnoDB on MySQL, and hence will be transactional unless > you make the language's db driver to auto-commit. I am looking at the DatabaseLoader class (in loader.py) but I don't see any commit statement, anyway, I don't understand this class, so I may be missing something. -- Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 Bioinformatics news: http://www.bioinformatica.info Tutorial libre de Python: http://tinyurl.com/2az5d5 From sbassi at gmail.com Sat Feb 23 15:58:03 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Sat, 23 Feb 2008 18:58:03 -0200 Subject: [BioPython] BioSQL documentation for Biopython In-Reply-To: References: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net> <16F88C8D-70DE-4020-B97D-B3D43AF530AF@gmx.net> <219760B8-1D60-4549-9151-9C1ECB46FE4B@gmx.net> <87D56EF5-9BC8-4401-8A09-2BB3104BE1CE@gmx.net> Message-ID: On Sat, Feb 23, 2008 at 5:50 PM, Sebastian Bassi wrote: > > BioSQL uses InnoDB on MySQL, and hence will be transactional unless > > you make the language's db driver to auto-commit. > I am looking at the DatabaseLoader class (in loader.py) but I don't > see any commit statement, anyway, I don't understand this class, so I > may be missing something. I've just found the answer. Here is what was missing: server.adaptor.commit() I found it here: http://www.biopython.org/wiki/BioSQL So the document IMHO should be changed, for example: ">>> db.load(iterator) 6 And the GenBank file is loaded into the database. Notice that the load function returns the number of records loaded (6 in this case). 
This is useful for sanity checking to make sure that you didn't try to load a massive file and end up with a result like 3." To: ">>> db.load(iterator) 6 >>> server.adaptor.commit() And the GenBank file is loaded into the database. Notice that the load function returns the number of records loaded (6 in this case). This is useful for sanity checking to make sure that you didn't try to load a massive file and end up with a result like 3." A link to http://www.biopython.org/wiki/BioSQL could be added. -- Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 Bioinformatics news: http://www.bioinformatica.info Tutorial libre de Python: http://tinyurl.com/2az5d5 From sbassi at gmail.com Sat Feb 23 16:31:27 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Sat, 23 Feb 2008 19:31:27 -0200 Subject: [BioPython] BioSQL documentation for Biopython In-Reply-To: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net> References: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net> Message-ID: On Fri, Feb 22, 2008 at 1:21 AM, Hilmar Lapp wrote: > Could some Biopythonist check this material whether or not it still > has any relevance, and whether there are any errors? Everything went OK. I could follow the whole document. The only minor difference I found was: >>> print feature.location (0..880) It is in fact: >>> print feature.location [0:880] > I'll be releasing within the next couple of days, so if this is > outdated I'd like to remove it from (at least) the release branch. I think there is no need to remove it, just add the ">>> server.adaptor.commit()" and a link to the wiki (http://www.biopython.org/wiki/BioSQL) Best, SB. 
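
The "6 records reported, empty table" symptom in this thread is what a transactional engine does when the writing connection never commits. A small stand-in sketch follows, using Python's built-in `sqlite3` instead of MySQLdb/InnoDB so that it runs without a server. The table name, the placeholder accession and the function names are assumptions made for illustration; this is not BioSQL code.

```python
import os
import sqlite3
import tempfile


def _count_rows(conn):
    """Count rows in the demo table and release the read lock straight away."""
    cur = conn.execute("SELECT COUNT(*) FROM records")
    n = cur.fetchone()[0]
    cur.close()
    return n


def rows_seen_before_and_after_commit():
    """Insert one row, then count rows from a second connection before and after commit."""
    handle, path = tempfile.mkstemp(suffix=".db")
    os.close(handle)
    try:
        writer = sqlite3.connect(path)
        reader = sqlite3.connect(path)
        writer.execute("CREATE TABLE records (accession TEXT)")
        writer.commit()

        # The INSERT opens a transaction implicitly; nothing is committed yet.
        writer.execute("INSERT INTO records VALUES ('ACC0001')")  # placeholder value
        before = _count_rows(reader)

        # Only after commit() does the other connection see the row.
        writer.commit()
        after = _count_rows(reader)

        writer.close()
        reader.close()
        return before, after
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(rows_seen_before_and_after_commit())
```

In the BioSQL session above the equivalent step is the `server.adaptor.commit()` call found on the wiki page; the sketch only shows why the load looks successful from the loading connection while another client still sees an empty table.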
--
Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6
Bioinformatics news: http://www.bioinformatica.info
Tutorial libre de Python: http://tinyurl.com/2az5d5

From anaryin at gmail.com Sat Feb 23 17:00:02 2008
From: anaryin at gmail.com (João Rodrigues)
Date: Sat, 23 Feb 2008 22:00:02 +0000
Subject: [BioPython] Uniprot Parser
Message-ID:

Hello all!

I've written a small parser for the uniprot_sprot.dat files that come out
once and again because I read about some incompatibilities of the
Biopython's with the source files. Now I want to rewrite and clean the
code and I'm considering (strongly) to rewrite my parser. It's a mess of
a code (though it works) and I'd rather use something more... readable!
So, I'm asking, basically, is Biopython's parser good already or are
there still some incompatibilities?

Thanks a lot!

João Rodrigues

From ruchira.datta at gmail.com Sat Feb 23 17:44:43 2008
From: ruchira.datta at gmail.com (Ruchira Datta)
Date: Sat, 23 Feb 2008 14:44:43 -0800
Subject: [BioPython] Uniprot Parser
In-Reply-To:
References:
Message-ID:

I've been using Bio.SwissProt.SProt to parse this file. The only glitch
that came up so far is that when some fields span multiple lines (e.g.,
OS, the species field), SProt puts a newline in the field. This is not
correct--it should be just a blank space. However, this can easily be
corrected within SProt itself without requiring a forked parser.

At least two other parsers for this file have been written by people in
my group, but I have pushed and implemented standardization on the
BioPython one. Part of the point of BioPython is to have one central
repository for development and maintenance of things like this, so that
hundreds of people don't have to spend their time reinventing the wheel.
It is much preferable that people contribute changes rather than
creating a forked version.

--Ruchira

On Sat, Feb 23, 2008 at 2:00 PM, João Rodrigues wrote:

> Hello all!
>
> I've written a small parser for the uniprot_sprot.dat files that come out
> once and again because I read about some incompatibilities of the
> Biopython's with the source files. Now I want to rewrite and clean the
> code and I'm considering (strongly) to rewrite my parser. It's a mess of
> a code (though it works) and I'd rather use something more... readable!
> So, I'm asking, basically, is Biopython's parser good already or are
> there still some incompatibilities?
>
> Thanks a lot!
>
> João Rodrigues
>

From hlapp at gmx.net Sat Feb 23 23:30:56 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 23 Feb 2008 23:30:56 -0500
Subject: [BioPython] BioSQL documentation for Biopython
In-Reply-To:
References: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net>
Message-ID:

Thanks a lot for your help Sebastian - I've updated the documentation
now, also changing the CVS links to SVN, and adding in some wiki links.

It took me a while to figure out the hevea tool that they had used
originally to convert to text and HTML, but it's re-converted now. I
still couldn't get the title and authors to print, so I just copied
those parts from the old txt and HTML versions.

-hilmar

On Feb 23, 2008, at 4:31 PM, Sebastian Bassi wrote:

> On Fri, Feb 22, 2008 at 1:21 AM, Hilmar Lapp wrote:
>> Could some Biopythonist check this material whether or not it still
>> has any relevance, and whether there are any errors?
>
> Everything went OK. I could follow the whole document. The only minor
> difference I found was:
>
>>>> print feature.location
> (0..880)
>
> It is in fact:
>
>>>> print feature.location
> [0:880]
>
>> I'll be releasing within the next couple of days, so if this is
>> outdated I'd like to remove it from (at least) the release branch.
>
> I think there is no need to remove it, just add the ">>>
> server.adaptor.commit()" and a link to the wiki
> (http://www.biopython.org/wiki/BioSQL)
>
> Best,
> SB.
>
>
> --
> Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6
> Bioinformatics news: http://www.bioinformatica.info
> Tutorial libre de Python: http://tinyurl.com/2az5d5

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From biopython at maubp.freeserve.co.uk Sun Feb 24 05:42:55 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 24 Feb 2008 10:42:55 +0000
Subject: [BioPython] BioSQL documentation for Biopython
In-Reply-To:
References: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net>
Message-ID: <320fb6e00802240242t1622e753r6c3f383db3f46d6c@mail.gmail.com>

I've been away a few days, but it looks like you and Sebastian have
worked out where things had been going wrong. Good job :)

> > Everything went OK. I could follow the whole document. The only minor
> > difference I found was:
> >
> >>>> print feature.location
> > (0..880)
> >
> > It is in fact:
> >
> >>>> print feature.location
> > [0:880]

That was a change made a year and a half ago - it is just cosmetic.
http://bugzilla.open-bio.org/show_bug.cgi?id=1902

> >> I'll be releasing within the next couple of days, so if this is
> >> outdated I'd like to remove it from (at least) the release branch.

As you'll have noticed, Biopython 1.44 has a few problems with BioSQL,
and the next release currently in CVS will be a lot better. It might be
worth adding a warning to the BioSQL release for any Biopython users to
wait for Biopython 1.45.
The only other thing I noticed was this example code:

>>> from Bio import GenBank
>>> parser = GenBank.FeatureParser()
>>> iterator = GenBank.Iterator(open("cor6_6.gb"), parser)

I would write this using Bio.SeqIO as we are promoting this as a uniform
sequence input/output library in Biopython (as in the wiki page
Sebastian mentioned). i.e.

>>> from Bio import SeqIO
>>> iterator = SeqIO.parse(open("cor6_6.gb"), "genbank")

(However I have not yet sat down and gone through the whole document)

Peter

From biopython at maubp.freeserve.co.uk Sun Feb 24 05:51:22 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 24 Feb 2008 10:51:22 +0000
Subject: [BioPython] write a genbank file
In-Reply-To: <004501c87499$2b461290$1022a8c0@ipkgatersleben.de>
References: <004501c87499$2b461290$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00802240251m657fe4b6i44f3b128c213268d@mail.gmail.com>

On Thu, Feb 21, 2008 at 2:50 PM, Stefanie Lück wrote:
> Hi!
>
> Can someone give me an example of how to write a new GenBank file in
> Python? I want to run a BLAST search and use the location of the match
> as a feature in a GenBank file (and finally work on it in DNA Star).
>
> Is that possible at all?

Writing GenBank files isn't easy at the moment. Depending on your needs,
creating Bio.GenBank.Record.Record objects and writing them to file may
work.

I hope to include support for writing GenBank files from SeqRecord
objects in Bio.SeqIO later... but that would still be complicated in
your case as you would have to create SeqFeatures from each BLAST match.
Peter From biopython at maubp.freeserve.co.uk Sun Feb 24 07:58:13 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 24 Feb 2008 12:58:13 +0000 Subject: [BioPython] Bio.SCOP problem In-Reply-To: References: Message-ID: <320fb6e00802240458r66b44e54qc22cf8a2c62eba2e@mail.gmail.com> On Sat, Feb 23, 2008 at 6:07 PM, km wrote: > Hi all, > I have a problem with Bio.SCOP module in BioPython > There is absolutely no documentation for Bio.SCOP module and looking at the > source code, I found a way to load scop parseable files (from astral db) to ... Have you looked at the SCOP unit tests? Those could be quite helpful. Peter From biopython at maubp.freeserve.co.uk Sun Feb 24 08:06:20 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 24 Feb 2008 13:06:20 +0000 Subject: [BioPython] Uniprot Parser In-Reply-To: References: Message-ID: <320fb6e00802240506u763a9f70rf029a4f5fcd487d5@mail.gmail.com> On Sat, Feb 23, 2008 at 10:44 PM, Ruchira Datta wrote: > I've been using Bio.SwissProt.SProt to parse this file. The only glitch > that came up so far is that when some fields span multiple lines (e.g., OS, > the species field), SProt puts a newline in the field. This is not > correct--it should be just a blank space. However, this can easily be > corrected within SProt itself without requiring a forked parser. I'm guessing you are using the parser to return Record objects, which are a fairly simple direct mapping of the raw file format - and I can understand why the newlines were included. If you use the parser to get SeqRecord objects (which are generic and not tied to the SwissProt/UniProt format), then the newlines are removed. > At least two other parsers for this file have been written by people in my > group, but I have pushed and implemented standardization on the BioPython > one. 
Part of the point of BioPython is to have one central repository for > development and maintenance of things like this, so that hundreds of people > don't have to spend their time reinventing the wheel. It is much preferable > that people contribute changes rather than creating a forked version. > > --Ruchira From hlapp at gmx.net Sun Feb 24 11:02:33 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 24 Feb 2008 11:02:33 -0500 Subject: [BioPython] BioSQL documentation for Biopython In-Reply-To: <320fb6e00802240242t1622e753r6c3f383db3f46d6c@mail.gmail.com> References: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net> <320fb6e00802240242t1622e753r6c3f383db3f46d6c@mail.gmail.com> Message-ID: <4E967ED0-13DF-45AF-A305-A32AD9B4B303@gmx.net> On Feb 24, 2008, at 5:42 AM, Peter wrote: > The only other thing I noticed was this example code: > >>>> from Bio import GenBank >>>> parser = GenBank.FeatureParser() >>>> iterator = GenBank.Iterator(open("cor6_6.gb"), parser) > > I would write this using Bio.SeqIO as we a promoting this as a uniform > sequence input/output library in Biopython (as in the wiki page > Sebastian mentioned). i.e. > >>>> from Bio import SeqIO >>>> iterator = SeqIO.parse(open("cor6_6.gb"), "genbank") > > (However I have not yet sat down and gone through the whole document) If you assure me that that would work (with a current release of Biopython), I'll change it accordingly. BTW also in regard to a previous comment from Sebastian, the file cor6_6.gb is in fact in that same directory in biosql. As another aside, if either of you would like write permission to biosql so you can maintain that document yourself that would be no problem. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From ruchira.datta at gmail.com Sun Feb 24 11:28:33 2008 From: ruchira.datta at gmail.com (Ruchira Datta) Date: Sun, 24 Feb 2008 08:28:33 -0800 Subject: [BioPython] Uniprot Parser In-Reply-To: <320fb6e00802240506u763a9f70rf029a4f5fcd487d5@mail.gmail.com> References: <320fb6e00802240506u763a9f70rf029a4f5fcd487d5@mail.gmail.com> Message-ID: On Sun, Feb 24, 2008 at 5:06 AM, Peter wrote: > On Sat, Feb 23, 2008 at 10:44 PM, Ruchira Datta > wrote: > > I've been using Bio.SwissProt.SProt to parse this file. The only glitch > > that came up so far is that when some fields span multiple lines (e.g., > OS, > > the species field), SProt puts a newline in the field. This is not > > correct--it should be just a blank space. However, this can easily be > > corrected within SProt itself without requiring a forked parser. > > I'm guessing you are using the parser to return Record objects, which > are a fairly simple direct mapping of the raw file format - and I can > understand why the newlines were included. If you use the parser to > get SeqRecord objects (which are generic and not tied to the > SwissProt/UniProt format), then the newlines are removed. > Hi Peter, I had tried SeqRecord first, but it didn't include the references, which I absolutely need. While inclusion of newlines may be understandable, it's a bug. The newline is stripped from several other fields by _RecordConsumer, e.g., def reference_number(self, line): rn = line[5:].rstrip() ... and it needs to be stripped from this one, instead of def organism_species(self, line): self.data.organism += line[5:] The newlines are never significant in any field. In a couple of weeks I might be able to check out the cvs version and provide a patch. 
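
The behaviour asked for above (continuation lines of one field joined by a single space, no embedded newline) can be sketched in a few lines. The helper `join_field_lines` and the two sample OS lines are made up for illustration and are not the Bio.SwissProt.SProt code; the only detail taken from the thread is that the data starts after the first five characters of each line.

```python
def join_field_lines(lines, prefix_width=5):
    """Join the data part of consecutive lines of one field with single spaces.

    Each line is assumed to carry a two-letter line code plus padding in its
    first `prefix_width` characters, matching the line[5:] slices quoted above.
    """
    parts = [line[prefix_width:].strip() for line in lines]
    return " ".join(part for part in parts if part)


if __name__ == "__main__":
    # Hypothetical species field wrapped over two lines.
    os_lines = [
        "OS   Homo sapiens\n",
        "OS   (Human).\n",
    ]
    print(join_field_lines(os_lines))
```

Collecting the stripped pieces and joining them once avoids both the stray newline and the risk of running two words together, which plain `+=` of stripped strings would do.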
--Ruchira

>
> > At least two other parsers for this file have been written by people in my
> > group, but I have pushed and implemented standardization on the BioPython
> > one. Part of the point of BioPython is to have one central repository for
> > development and maintenance of things like this, so that hundreds of people
> > don't have to spend their time reinventing the wheel. It is much preferable
> > that people contribute changes rather than creating a forked version.
> >
> > --Ruchira
>

From biopython at maubp.freeserve.co.uk Sun Feb 24 11:47:01 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 24 Feb 2008 16:47:01 +0000
Subject: [BioPython] Uniprot Parser
In-Reply-To:
References: <320fb6e00802240506u763a9f70rf029a4f5fcd487d5@mail.gmail.com>
Message-ID: <320fb6e00802240847k6529d1b6h53b0f84dd4c4e37a@mail.gmail.com>

On Sun, Feb 24, 2008 at 4:28 PM, Ruchira Datta wrote:
>
> Hi Peter,
>
> I had tried SeqRecord first, but it didn't include the references, which I
> absolutely need.

The good news is I think the references are included now (in Biopython
CVS), see enhancement Bug 2235:
http://bugzilla.open-bio.org/show_bug.cgi?id=2235

> While inclusion of newlines may be understandable, it's a bug. The newline
> is stripped from several other fields by _RecordConsumer, e.g.,
> ...

Off the top of my head, I would say that example is a little different
- reference number lines do not span multiple lines.

> The newlines are never significant in any field.

You are probably right - although perhaps they could be important in
long text fields where a line break has been inserted mid word and a
hyphenation added.

The newlines are also important if using the Record object to recreate
the raw file (e.g. to save to disk). However I doubt anyone is doing
this. Having a __str__ method defined, like there is in the
Bio.GenBank.Record.Record object, would make this easier.
> In a couple of weeks I might be able to check out the cvs
> version and provide a patch.

Please do.

Peter

From ruchira.datta at gmail.com Sun Feb 24 12:36:56 2008
From: ruchira.datta at gmail.com (Ruchira Datta)
Date: Sun, 24 Feb 2008 09:36:56 -0800
Subject: [BioPython] Uniprot Parser
In-Reply-To: <320fb6e00802240847k6529d1b6h53b0f84dd4c4e37a@mail.gmail.com>
References: <320fb6e00802240506u763a9f70rf029a4f5fcd487d5@mail.gmail.com> <320fb6e00802240847k6529d1b6h53b0f84dd4c4e37a@mail.gmail.com>
Message-ID:

I just found another bug, which would be a bit trickier to fix properly.

This code:

    def database_cross_reference(self, line):
        # From CLD1_HUMAN, Release 39:
        # DR EMBL; [snip]; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
        # DR PRODOM [Domain structure / List of seq. sharing at least 1 domai
        # DR SWISS-2DPAGE; GET REGION ON 2D PAGE.
        line = line[5:]
        # Remove the comments at the end of the line
        i = line.find('[')
        if i >= 0:
            line = line[:i]
        cols = line.rstrip(_CHOMP).split(';')
        cols = [col.lstrip() for col in cols]
        self.data.cross_references.append(tuple(cols))

applied to this line of the TrEMBL record for A2RB21_ASPNG:

DR GO; GO:0016277; F:[myelin basic protein]-arginine N-methyltra...; IEA:EC.

got me this tuple:

('GO', 'GO:0016277', 'F:')

The bracketed term was interpreted as a comment and the rest of the line was stripped.

Thanks,

--Ruchira

On Sun, Feb 24, 2008 at 8:47 AM, Peter wrote:

> On Sun, Feb 24, 2008 at 4:28 PM, Ruchira Datta wrote:
> >
> > Hi Peter,
> >
> > I had tried SeqRecord first, but it didn't include the references, which I
> > absolutely need.
>
> The good news is I think the references are included now (in Biopython
> CVS), see enhancement Bug 2235:
> http://bugzilla.open-bio.org/show_bug.cgi?id=2235
>
> > While inclusion of newlines may be understandable, it's a bug. The newline
> > is stripped from several other fields by _RecordConsumer, e.g.,
> > ...
> > Off the top of my head, I would say that example is a little different > - reference number lines do not span multiple lines. > > > The newlines are never significant in any field. > > You are probably right - although perhaps they could be important in > long text fields where a line break has been inserted mid word and a > hyphenation added. > > The newlines are also important if using the Record object to recreate > the raw file (e.g. to save to disk). However I doubt anyone is doing > this. Having a __str__ method defined like there is in the > Bio.GenBank.Record.Record object which would make this easier. > > > In a couple of weeks I might be able to check out the cvs > > version and provide a patch. > > Please do. > > Peter > From biopython at maubp.freeserve.co.uk Sun Feb 24 12:48:29 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 24 Feb 2008 17:48:29 +0000 Subject: [BioPython] Uniprot Parser In-Reply-To: References: <320fb6e00802240506u763a9f70rf029a4f5fcd487d5@mail.gmail.com> <320fb6e00802240847k6529d1b6h53b0f84dd4c4e37a@mail.gmail.com> Message-ID: <320fb6e00802240948s6f0aff89r29ef9cff286dd895@mail.gmail.com> On Sun, Feb 24, 2008 at 5:36 PM, Ruchira Datta wrote: > I just found another bug, which would be a bit trickier to fix properly. > > This code: > > def database_cross_reference(self, line): > # From CLD1_HUMAN, Release 39: > # DR EMBL; [snip]; -. [EMBL / GenBank / DDBJ] [CoDingSequence] > # DR PRODOM [Domain structure / List of seq. sharing at least 1 > domai > # DR SWISS-2DPAGE; GET REGION ON 2D PAGE. > line = line[5:] > # Remove the comments at the end of the line > i = line.find('[') > if i >= 0: > line = line[:i] > cols = line.rstrip(_CHOMP).split(';') > cols = [col.lstrip() for col in cols] > self.data.cross_references.append(tuple(cols)) > > applied to this line of the TrEMBL record for A2RB21_ASPNG: > > DR GO; GO:0016277; F:[myelin basic protein]-arginine N-methyltra...; > IEA:EC. 
> > got me this tuple: > > ('GO', 'GO:0016277', 'F:') > > The bracketed term was interpreted as a comment and the whole line was > stripped. That does look tricky... especially if we want to preserve backwards compatibility. This "F" cross reference looks like the partial text for the GO term. I wonder how common this is? (square brackets in the cross references themselves). I can't see the use of "F" mentioned here: http://www.expasy.org/sprot/userman.html#DR_line Could you file a bug and add a few more other examples if you find them. Thanks Peter From ruchira.datta at gmail.com Sun Feb 24 12:53:10 2008 From: ruchira.datta at gmail.com (Ruchira Datta) Date: Sun, 24 Feb 2008 09:53:10 -0800 Subject: [BioPython] Uniprot Parser In-Reply-To: <320fb6e00802240948s6f0aff89r29ef9cff286dd895@mail.gmail.com> References: <320fb6e00802240506u763a9f70rf029a4f5fcd487d5@mail.gmail.com> <320fb6e00802240847k6529d1b6h53b0f84dd4c4e37a@mail.gmail.com> <320fb6e00802240948s6f0aff89r29ef9cff286dd895@mail.gmail.com> Message-ID: On Sun, Feb 24, 2008 at 9:48 AM, Peter wrote: > On Sun, Feb 24, 2008 at 5:36 PM, Ruchira Datta > wrote: > > I just found another bug, which would be a bit trickier to fix properly. > > > > This code: > > > > def database_cross_reference(self, line): > > # From CLD1_HUMAN, Release 39: > > # DR EMBL; [snip]; -. [EMBL / GenBank / DDBJ] [CoDingSequence] > > # DR PRODOM [Domain structure / List of seq. sharing at least > 1 > > domai > > # DR SWISS-2DPAGE; GET REGION ON 2D PAGE. > > line = line[5:] > > # Remove the comments at the end of the line > > i = line.find('[') > > if i >= 0: > > line = line[:i] > > cols = line.rstrip(_CHOMP).split(';') > > cols = [col.lstrip() for col in cols] > > self.data.cross_references.append(tuple(cols)) > > > > applied to this line of the TrEMBL record for A2RB21_ASPNG: > > > > DR GO; GO:0016277; F:[myelin basic protein]-arginine N-methyltra...; > > IEA:EC. 
> >
> > got me this tuple:
> >
> > ('GO', 'GO:0016277', 'F:')
> >
> > The bracketed term was interpreted as a comment and the whole line was
> > stripped.
>
> That does look tricky... especially if we want to preserve backwards
> compatibility. This "F" cross reference looks like the partial text
> for the GO term. I wonder how common this is? (square brackets in the
> cross references themselves). I can't see the use of "F" mentioned
> here: http://www.expasy.org/sprot/userman.html#DR_line
>
> Could you file a bug and add a few more other examples if you find them.
>
> Thanks
>
> Peter

Here 'F:' means the annotation refers to the molecular function part of the Gene Ontology (as opposed to, e.g., 'P:' for biological process). I think this is quite rare, but I'll see if any other examples come up.

--Ruchira

From sbassi at gmail.com Sun Feb 24 19:30:06 2008
From: sbassi at gmail.com (Sebastian Bassi)
Date: Sun, 24 Feb 2008 21:30:06 -0300
Subject: [BioPython] BioSQL documentation for Biopython
In-Reply-To: <4E967ED0-13DF-45AF-A305-A32AD9B4B303@gmx.net>
References: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net> <320fb6e00802240242t1622e753r6c3f383db3f46d6c@mail.gmail.com> <4E967ED0-13DF-45AF-A305-A32AD9B4B303@gmx.net>
Message-ID:

On Sun, Feb 24, 2008 at 1:02 PM, Hilmar Lapp wrote:
> If you assure me that that would work (with a current release of
> Biopython), I'll change it accordingly.

Peter's proposal (using SeqIO.parse) works with Biopython 1.44; I've just tested it.

WARNING: The BioSQL module is not from 1.44, it is from CVS. So this document can be followed using the current CVS version of Biopython, not the "plain" 1.44.
-- Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 Bioinformatics news: http://www.bioinformatica.info Tutorial libre de Python: http://tinyurl.com/2az5d5 From biosql at hotmail.com Mon Feb 25 11:32:13 2008 From: biosql at hotmail.com (Jonathan Boulais) Date: Mon, 25 Feb 2008 11:32:13 -0500 Subject: [BioPython] Uniprot Parser In-Reply-To: References: Message-ID: Hi everyone, I'm a little bit concerned about the speed of the parsing/loading of the Uniprot .dat files into the Biosql database. It takes a hell of a time... Would it be faster to parse the .dat file and write the data into a temporary files and import it in one shot ? Just a suggestion, Jonathan > Date: Sat, 23 Feb 2008 22:00:02 +0000 > From: anaryin at gmail.com > To: biopython at biopython.org > Subject: [BioPython] Uniprot Parser > > Hello all! > > I've written a small parser for the uniprot_sprot.dat files that come out > once and again because I read about some incompatibilities of the > Biopython's with the source files. Now I want to rewrite and clean the code > and I'm considering (strongly) to rewrite my parser. It's a mess of a code > (though it works) and I'd rather use something more... readable! So, I'm > asking, basically, is Biopython's parser good already or are there still > some incompatibilities? > > Thanks a lot! 
>
> João Rodrigues
>
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From biopython at maubp.freeserve.co.uk Mon Feb 25 11:52:31 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 25 Feb 2008 16:52:31 +0000
Subject: [BioPython] Uniprot Parser
In-Reply-To:
References:
Message-ID: <320fb6e00802250852g2db268d9xf9cc1d2d2654071a@mail.gmail.com>

On Mon, Feb 25, 2008 at 4:32 PM, Jonathan Boulais wrote:
>
> Hi everyone,
>
> I'm a little bit concerned about the speed of the parsing/loading of the Uniprot .dat files
> into the Biosql database. It takes a hell of a time...

What version of Biopython are you using?

One thing you could try is timing a simple script that only reads in the SwissProt file but doesn't do anything with the BioSQL database - to try and get a feel for which bit is slow.

If it's the parsing that is slow, you could try commenting out the bit which deals with the EBI ** lines (see bug 2353 for details), namely line 359 in CVS, self._skip_starstar(uhandle), and see if that makes a big difference.
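
As a rough sketch of the parse-only timing suggested above: the helper below just consumes an iterable of records and reports how long that took, with no database involved. The name `time_iteration` and the file name in the comment are assumptions for illustration; with Biopython installed, the iterable would be the record iterator from the SwissProt parser (for example `SeqIO.parse(handle, "swiss")`), and a stand-in range is used here only to keep the example runnable on its own.

```python
import time


def time_iteration(records):
    """Consume an iterable of records and return (count, elapsed_seconds)."""
    start = time.perf_counter()
    count = 0
    for _ in records:
        count += 1
    return count, time.perf_counter() - start


# With Biopython available, something along these lines would time the
# parser alone (file name is hypothetical):
#     from Bio import SeqIO
#     count, elapsed = time_iteration(SeqIO.parse(open("uniprot_sprot.dat"), "swiss"))
# A stand-in iterable is used here so the sketch runs without any data file.
count, elapsed = time_iteration(iter(range(1000)))
print("%i records in %.3f s" % (count, elapsed))
```

Comparing this figure with the time for a full parse-and-load run should give a feel for how much of the total is spent in the parser.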
Peter From biopython at maubp.freeserve.co.uk Mon Feb 25 12:16:51 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 25 Feb 2008 17:16:51 +0000 Subject: [BioPython] Parser In-Reply-To: <22c5c6390802222259g31093d33m2728f054ed19fc23@mail.gmail.com> References: <22c5c6390802182059w4b11d66bi1a0a76c5d4c7f018@mail.gmail.com> <320fb6e00802190155i77354f6el17a44f86f114ad83@mail.gmail.com> <22c5c6390802191000h71e37ba7ha545e4b259797fb5@mail.gmail.com> <320fb6e00802191609t4689a819m1a50bd09dc152fdb@mail.gmail.com> <22c5c6390802192340n2f4cab8vb767eea6f098cac5@mail.gmail.com> <320fb6e00802200422o597c8f56r8bfec5d5d3c9b035@mail.gmail.com> <22c5c6390802212255v5819ff9fk304e30bbdffdd31d@mail.gmail.com> <22c5c6390802222259g31093d33m2728f054ed19fc23@mail.gmail.com> Message-ID: <320fb6e00802250916k3cdb0847va65fefa26cc5febe@mail.gmail.com> Hi Sebastian, Did you mean to send this email to me only? On Sat, Feb 23, 2008 at 6:59 AM, smriti Sebastian wrote: > hi, > One more help plz. > I need to retrieve the hits which are coming under > "Sequences not found previously or not previously below threshold:" from > PSI-Blast output file.. > or else i need to avoid those id's while parsing the psi-blast output using > PsiBlastParser. > Is there any way to do that? > I tried new_seqs attribute of rounds.But it didn't help me. > I have attached a sample output from psi-blast.Plz help > Thanks in advance. The round object has "alignments" which includes all the hits, and "reused_seqs" which is only those above the "Sequences not found previously or not previously below threshold:" line, while "new_seqs" is only those below the line. Perhaps something like this will be helpful... 
Peter

#!/usr/bin/python
import Bio.Blast.NCBIStandalone

b_parser = Bio.Blast.NCBIStandalone.PSIBlastParser()
b_record = b_parser.parse(open('trial_psi_blast.txt', 'r'))
for rnd in b_record.rounds:
    old = len(rnd.reused_seqs)
    new = len(rnd.new_seqs)
    assert old + new == len(rnd.alignments)
    print "Round number %i, with %i old and %i new" \
          % (rnd.number, old, new)
    for i, aln in enumerate(rnd.alignments):
        # The identifier is the first word (split on white space)
        identifier = aln.title.split()[0]
        # Remove the leading > if present as it isn't used
        # on the reused_seqs results.
        if identifier[0] == ">":
            identifier = identifier[1:]
        if i < old:
            reused = rnd.reused_seqs[i]
            assert reused.title.split()[0] == identifier
            print "%i - %s reused, score %i, exp %f" \
                  % (i, identifier, reused.score, reused.e)
        else:
            novel = rnd.new_seqs[i - old]
            assert novel.title.split()[0] == identifier
            print "%i - %s novel, score %i, exp %f" \
                  % (i, identifier, novel.score, novel.e)
print "Done"

From biosql at hotmail.com Mon Feb 25 12:48:11 2008
From: biosql at hotmail.com (Jonathan Boulais)
Date: Mon, 25 Feb 2008 12:48:11 -0500
Subject: [BioPython] Uniprot Parser
In-Reply-To: <320fb6e00802250852g2db268d9xf9cc1d2d2654071a@mail.gmail.com>
References: <320fb6e00802250852g2db268d9xf9cc1d2d2654071a@mail.gmail.com>
Message-ID:

I don't think the parser is the problem Peter, but surely the continuous import requests to the database.

I've already written a parsing script, where I'm parsing the entire .dat files (Trembl and Swiss Prot) into text files. After the parsing is done, it imports the files into a database schema that I've built on my own, not the BioSQL one. So instead of importing the data after each iteration, it just imports the data in one shot when the entire .dat file is parsed. I've compared the execution time and it's much faster this way (about 1 hour instead of 3 days for the Trembl .dat file parsing and importing).
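
The "parse everything first, then import in one shot" idea can be sketched roughly as follows. This is not the script described above: it swaps in the standard-library sqlite3 module with an in-memory database so that it runs anywhere, and the table name, column names and the two rows are made up for illustration. The point it shows is only that all rows go in through one batched call inside one transaction, rather than one insert and commit per record.

```python
import sqlite3


def bulk_load(conn, rows):
    """Insert all parsed rows with one batched call in a single transaction."""
    with conn:  # commits once on success, rolls back if an error is raised
        conn.executemany(
            "INSERT INTO protein (accession, description) VALUES (?, ?)",
            rows,
        )


# Hypothetical schema and placeholder rows standing in for parsed .dat records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE protein (accession TEXT, description TEXT)")
parsed = [("P12345", "example entry one"), ("Q67890", "example entry two")]

bulk_load(conn, parsed)
loaded = conn.execute("SELECT COUNT(*) FROM protein").fetchone()[0]
print(loaded)  # 2
```

How much this helps in practice depends on the database server and on any indexes defined on the target tables.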
Again Peter, I'm definitely not as good as you guys in scripting, so I've used the script lines that you proposed for the bug 2390 to compare. But 3 days of parsing and importing is... a little bit too long for me :) Anyway I hope it could help, Jonathan > Date: Mon, 25 Feb 2008 16:52:31 +0000 > From: biopython at maubp.freeserve.co.uk > To: biosql at hotmail.com > Subject: Re: [BioPython] Uniprot Parser > CC: biopython at lists.open-bio.org > > On Mon, Feb 25, 2008 at 4:32 PM, Jonathan Boulais wrote: > > > > Hi everyone, > > > > I'm a little bit concerned about the speed of the parsing/loading of the Uniprot .dat files > > into the Biosql database. It takes a hell of a time... > > What version of Biopython are you using? > > One thing you could try is timing a simple script that only reads in > the SwissProt file but doesn't do anything with the BioSQL database - > to try and get a feel for which bit is slow. > > If its the parsing that is slow, you could try commenting out the bit > which deals with the EBI ** lines (see bug 2353 for details), namely > line 359 in CVS, self._skip_starstar(uhandle), and see if that makes a > big difference. > > Peter _________________________________________________________________ From mmokrejs at ribosome.natur.cuni.cz Mon Feb 25 13:30:54 2008 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Mon, 25 Feb 2008 19:30:54 +0100 Subject: [BioPython] Uniprot Parser In-Reply-To: References: <320fb6e00802250852g2db268d9xf9cc1d2d2654071a@mail.gmail.com> Message-ID: <47C3095E.9050306@ribosome.natur.cuni.cz> Hi Jonathan, drop temporarily the indexes on all mysql rows, and make mysql introduce the indexes after importing. Otherwise index has to be updated after every change to a column. Learn 'ALTER TABLE' use. ;-) Martin Jonathan Boulais wrote: > I don't think the parser is the problem Peter, but surely the continuous importing request toward the database. 
> I've already wrote a parsing script, where I'm parsing the entire .dat files (Trembl and Swiss Prot) into text files. After the parsing is done, it imports the files into a database schema that I've built on my own, not the BioSQL one. So instead of importing the data after each iteration, it just imports the data in one shot when the entire .dat file is parsed. I've compared the execution time and it's much faster by this way (about 1 hour instead of 3 days for the Trembl .dat file parsing and importing). > > Again Peter, I'm definitely not as good as you guys in scripting, so I've used the script lines that you proposed for the bug 2390 to compare. > But 3 days of parsing and importing is... a little bit too long for me :) From biosql at hotmail.com Mon Feb 25 14:10:48 2008 From: biosql at hotmail.com (Jonathan Boulais) Date: Mon, 25 Feb 2008 14:10:48 -0500 Subject: [BioPython] Uniprot Parser In-Reply-To: <47C3095E.9050306@ribosome.natur.cuni.cz> References: <320fb6e00802250852g2db268d9xf9cc1d2d2654071a@mail.gmail.com> <47C3095E.9050306@ribosome.natur.cuni.cz> Message-ID: Many thanks Martin ! Indeed, the DISABLE KEYS sounds very logical to my problem. Jonathan > Date: Mon, 25 Feb 2008 19:30:54 +0100 > From: mmokrejs at ribosome.natur.cuni.cz > To: biosql at hotmail.com > CC: biopython at lists.open-bio.org > Subject: Re: [BioPython] Uniprot Parser > > Hi Jonathan, > drop temporarily the indexes on all mysql rows, and make mysql introduce > the indexes after importing. Otherwise index has to be updated after every > change to a column. Learn 'ALTER TABLE' use. ;-) > Martin > > Jonathan Boulais wrote: > > I don't think the parser is the problem Peter, but surely the continuous importing request toward the database. > > I've already wrote a parsing script, where I'm parsing the entire .dat files (Trembl and Swiss Prot) into text files. After the parsing is done, it imports the files into a database schema that I've built on my own, not the BioSQL one. 
So instead of importing the data after each iteration, it just imports the data in one shot when the entire .dat file is parsed. I've compared the execution time and it's much faster by this way (about 1 hour instead of 3 days for the Trembl .dat file parsing and importing). > > > > Again Peter, I'm definitely not as good as you guys in scripting, so I've used the script lines that you proposed for the bug 2390 to compare. > > But 3 days of parsing and importing is... a little bit too long for me :) _________________________________________________________________ From bsantos at biocant.pt Wed Feb 27 10:41:07 2008 From: bsantos at biocant.pt (Bruno Santos) Date: Wed, 27 Feb 2008 15:41:07 -0000 Subject: [BioPython] How to detect sequences that not produce alignments Message-ID: <000301c87957$2d24c800$876e5800$@pt> Hi people, I have been using Bio.Blast for a while to perform BLAST searches in my scripts. Now I'm trying to detect which sequences in a multifasta align against a databases and the ones that don't align at all. By some experiments I had done I noticed that even if the blast_records instance as no alignments at all I couldn't detect them because they are not incorporated in the blast_records instance as an empty list. There is any way to detect which blast_records are empty? Or the module simply ignores this cases and don't put them on the blast_records? 
Thank you all in advance,

Best Regards,
Bruno Santos

From sbassi at gmail.com Wed Feb 27 10:54:06 2008
From: sbassi at gmail.com (Sebastian Bassi)
Date: Wed, 27 Feb 2008 12:54:06 -0300
Subject: [BioPython] How to detect sequences that not produce alignments
In-Reply-To: <000301c87957$2d24c800$876e5800$@pt>
References: <000301c87957$2d24c800$876e5800$@pt>
Message-ID:

On Wed, Feb 27, 2008 at 12:41 PM, Bruno Santos wrote:
> By some experiments I had done I noticed that even if the blast_records
> instance as no alignments at all I couldn't detect them because they are not
> incorporated in the blast_records instance as an empty list. There is any
> way to detect which blast_records are empty? Or the module simply ignores
> this cases and don't put them on the blast_records?

I think that the problem is that the XML file has no record of a "no hit" sequence. So the Biopython parser can't process that record (since it is not even in the XML file). I guess that the only way to know the "negative hits" is to compare the input file with the XML output and then take the difference. I remember having done that once (I should have the script somewhere if you ask me).

--
Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6
Bioinformatics news: http://www.bioinformatica.info
Tutorial libre de Python: http://tinyurl.com/2az5d5

From sbassi at gmail.com Thu Feb 28 12:18:43 2008
From: sbassi at gmail.com (Sebastian Bassi)
Date: Thu, 28 Feb 2008 14:18:43 -0300
Subject: [BioPython] How to detect sequences that not produce alignments
In-Reply-To: <000301c87957$2d24c800$876e5800$@pt>
References: <000301c87957$2d24c800$876e5800$@pt>
Message-ID:

On Wed, Feb 27, 2008 at 12:41 PM, Bruno Santos wrote:
> way to detect which blast_records are empty? Or the module simply ignores
> this cases and don't put them on the blast_records?

Here is my code (I put a copy here http://pastebin.com/f74133375 in case the formatting gets lost in the mail).
from Bio import SeqIO
from Bio.Blast import NCBIXML

def blastcomp(fastafile, blastfile):
    handle = open(fastafile)
    fastanames = set()
    # Reads the fasta names
    for record in SeqIO.parse(handle, "fasta"):
        fastanames.add(record.name)
    handle.close()
    blastnames = set()
    # Reads the blast names
    b_records = NCBIXML.parse(open(blastfile))
    for b_record in b_records:
        blastnames.add(b_record.query)
    # Names present in the FASTA input but absent from the BLAST output
    return fastanames.difference(blastnames)

blastfile = "/home/sbassi/bioinfo/INTA/filtracMT.xml"
fastafile = 'INTA/allfiltrados.txt'
print blastcomp(fastafile, blastfile)

--
Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6
Bioinformatics news: http://www.bioinformatica.info
Tutorial libre de Python: http://tinyurl.com/2az5d5

From mjldehoon at yahoo.com Fri Feb 1 07:22:19 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Thu, 31 Jan 2008 23:22:19 -0800 (PST)
Subject: [BioPython] blast parse
In-Reply-To: <47A07222.9000200@biotec.tu-dresden.de>
Message-ID: <807461.62532.qm@web62405.mail.re1.yahoo.com>

I have added a DeprecationWarning to NCBIXML.BlastParser.parse.

--Michiel.

Christof Winter wrote:

Michiel de Hoon wrote:
> Dear Jose,
>
> To get the records one-by-one, use
>
> from Bio.Blast import NCBIXML
> blast_parse = NCBIXML.parse(blasth)
> for blast_result in blast_parse:
>     # do whatever with blast_result
>
> This avoids having to read the complete XML file all at once.
>
> To the developers: We should probably think about removing the
> NCBIXML.BlastParser.parse, and perhaps adding a NCBIXML.read function to read
> exactly one record from the XML file.

I think removing NCBIXML.BlastParser.parse is a good idea. We should keep it simple.

Christof

---------------------------------
Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now.
From e.picardi at unical.it Tue Feb 5 14:55:10 2008 From: e.picardi at unical.it (Ernesto) Date: Tue, 5 Feb 2008 15:55:10 +0100 Subject: [BioPython] GFF parser Message-ID: <9DE3866D-D345-4C88-8935-A793336259D7@unical.it> Dear All, I found around Internet a very interesting GFF parser written in Python by Martin Knudsen. Since I know that at the moment there isn't a real GFF parser in BioPython, we could think to add the one by Martin. For sure, requesting the permission to the author. The parser can be downloaded from the following web page: http:// www.daimi.au.dk/~martink/birc/scripts.html Hope this help, Ernesto -------------------------------------------------------- Dr Ernesto Picardi, PhD Dept. of Biochemistry and Molecular Biology University of Bari Italy E-mail: e.picardi at unical.it -------------------------------------------------------- From chris.lasher at gmail.com Wed Feb 6 03:27:19 2008 From: chris.lasher at gmail.com (Chris Lasher) Date: Tue, 5 Feb 2008 22:27:19 -0500 Subject: [BioPython] Biopython to begin transition to Subversion Message-ID: <128a885f0802051927g1d773a51l5b0e7b914e347ffd@mail.gmail.com> Hello all Biopythonistas, In the next upcoming weeks, Biopython will begin and complete its transition from CVS to Subversion (SVN) as its revision control system. This transition will likely not affect end users of Biopython except that to get the development version, a checkout with a Subversion client, rather than a CVS client, will be necessary. For developers, we will need to determine a suitable range of dates (a week) during which we will "freeze" the CVS repository for its transition to SVN. From the freeze and thereon, commits to the CVS repository will no longer be possible. Instead, commits not placed in during the freeze will need to take place in the Subversion repository once we have it running. 
This week, we hope to have a "dry run" of the Subversion repository available for the developers to poke around and make sure the transition will include everything necessary. Following that, we'll have the freeze and complete the transition. If you have any questions, I'll be checking posts to the list, or you may feel free contact me directly. Best, Chris From cjfields at uiuc.edu Wed Feb 6 03:33:42 2008 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 5 Feb 2008 21:33:42 -0600 Subject: [BioPython] Biopython to begin transition to Subversion In-Reply-To: <128a885f0802051927g1d773a51l5b0e7b914e347ffd@mail.gmail.com> References: <128a885f0802051927g1d773a51l5b0e7b914e347ffd@mail.gmail.com> Message-ID: Let me know if you need any help. chris On Feb 5, 2008, at 9:27 PM, Chris Lasher wrote: > Hello all Biopythonistas, > > In the next upcoming weeks, Biopython will begin and complete its > transition from CVS to Subversion (SVN) as its revision control > system. > > This transition will likely not affect end users of Biopython except > that to get the development version, a checkout with a Subversion > client, rather than a CVS client, will be necessary. > > For developers, we will need to determine a suitable range of dates (a > week) during which we will "freeze" the CVS repository for its > transition to SVN. From the freeze and thereon, commits to the CVS > repository will no longer be possible. Instead, commits not placed in > during the freeze will need to take place in the Subversion repository > once we have it running. This week, we hope to have a "dry run" of the > Subversion repository available for the developers to poke around and > make sure the transition will include everything necessary. Following > that, we'll have the freeze and complete the transition. > > If you have any questions, I'll be checking posts to the list, or you > may feel free contact me directly. 
> > Best, > Chris > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From dalke at dalkescientific.com Wed Feb 6 11:03:38 2008 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 6 Feb 2008 12:03:38 +0100 Subject: [BioPython] [bip] Bioinformatics Programming Language Shootout, Python performance poopoo'd In-Reply-To: <320fb6e00802060244j6764c84apcf3eae45b0badcb0@mail.gmail.com> References: <128a885f0802051200l1913c1a4waac6bd8e653c06ea@mail.gmail.com> <69b1b6300802051436q348f4f25va5f26cd82a949cca@mail.gmail.com> <320fb6e00802060244j6764c84apcf3eae45b0badcb0@mail.gmail.com> Message-ID: <4A584E83-E765-49B7-A383-5F2D7D861269@dalkescientific.com> On Feb 6, 2008, at 11:44 AM, Peter wrote: > Am I right in thinking the authors have not made any of their sample > input files available? In the case of the multi GB Blast file, this > is perhaps justified. Also I didn't see any timing script. the alignment programs contain the test data. the fasta parser and blast parser do not contain test data. The lack of data is not justified as having a 9GB file adds little to the comparison over having a 9 MB file as it should scale linearly. It does show that the parsers can handle large files, but big whoop. And the test is unaffected by having a 9MB file duplicated 1,000 times. the neighbor-joining code contains no test data There's no timing script. 
Andrew
dalke at dalkescientific.com

From jblanca at btc.upv.es Wed Feb 6 16:06:08 2008
From: jblanca at btc.upv.es (Jose Blanca)
Date: Wed, 6 Feb 2008 17:06:08 +0100
Subject: [BioPython] Alignment add_sequence
Message-ID: <200802061706.08830.jblanca@btc.upv.es>

Hello,

I'm building an alignment object from a set of seqRecords using the following code:

from Bio.Align.Generic import Alignment
from Bio.Alphabet import IUPAC

my_alpha = IUPAC.IUPACAmbiguousDNA()
ali = Alignment(my_alpha)
for seqName in sequences.keys():
    seq = sequences[seqName].seq.tostring()
    start = mesh[seqName]['location_begin']
    id = sequences[seqName].id
    ali.add_sequence(id, seq, start)

Is this the best way to do it? Everything is working as expected, but I have a problem with this implementation. My seqRecords have additional annotations and I'm losing them.

Maybe this could be solved with a new function like:

def add_sequence(self, seqRecord, start = None, end = None, weight = 1.0):

Also, this way we wouldn't need to create a new SeqRecord for every sequence, and it should be quicker. The result could be something like:

from Bio.Align.Generic import Alignment
from Bio.Alphabet import IUPAC

my_alpha = IUPAC.IUPACAmbiguousDNA()
ali = Alignment(my_alpha)
for seqName in sequences.keys():
    start = mesh[seqName]['location_begin']
    ali.add_sequence(sequences[seqName], start)

With such a function a problem could appear if an annotation named 'start' or 'end' is already in the annotation dict. But this could be solved by raising an exception in that case.

What do you think? Thanks for your help.

--
Jose M.
Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From p.j.a.cock at googlemail.com Wed Feb 6 16:20:20 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 6 Feb 2008 16:20:20 +0000 Subject: [BioPython] Alignment add_sequence In-Reply-To: <200802061706.08830.jblanca@btc.upv.es> References: <200802061706.08830.jblanca@btc.upv.es> Message-ID: <320fb6e00802060820h609d5f10vccba4953455794bb@mail.gmail.com> On Feb 6, 2008 4:06 PM, Jose Blanca wrote: > Hello, > I'm building an alignment object from a set of seqRecords using the following > code: > ... > Is this the best way to do it? No, not really. See below .. > Everything is working as expected, but I have a > problem with this implementation. My seqRecords have additional annotations > and I'm loosing them. Yes, using that method the alignment is creating a new SeqRecord for each sequence with no annotation. > Maybe this could be solved with a new function like: > def add_sequence(self, seqRecord, start = None, end = None, > weight = 1.0): This has been discussed before, along with other limitations of the current alignment class, e.g. 
on bug 1944 http://bugzilla.open-bio.org/show_bug.cgi?id=1944 Right now I would suggest you try the Bio.SeqIO.to_alignment() function, although this doesn't try and do anything clever with start/end annotation: http://biopython.org/wiki/SeqIO Peter From nuin at genedrift.org Wed Feb 6 16:07:41 2008 From: nuin at genedrift.org (Paulo Nuin) Date: Wed, 06 Feb 2008 11:07:41 -0500 Subject: [BioPython] [bip] Bioinformatics Programming Language Shootout, Python performance poopoo'd In-Reply-To: <4A584E83-E765-49B7-A383-5F2D7D861269@dalkescientific.com> References: <128a885f0802051200l1913c1a4waac6bd8e653c06ea@mail.gmail.com> <69b1b6300802051436q348f4f25va5f26cd82a949cca@mail.gmail.com> <320fb6e00802060244j6764c84apcf3eae45b0badcb0@mail.gmail.com> <4A584E83-E765-49B7-A383-5F2D7D861269@dalkescientific.com> Message-ID: <47A9DB4D.4030801@genedrift.org> Hi all I am running pylint on the code and getting some evaluation. Currently the alignment.py scored -10.16/10, mainly because of indentation issues and lack of spaces between operators. NJ.py scored -7.66/10 parse.py scored -6.10/10 readFasta.py scored -7.00/10 Of course this test just measures the "Pythonic" level of the code, but it does not check the code itself for quality. Cheers Paulo Andrew Dalke wrote: > On Feb 6, 2008, at 11:44 AM, Peter wrote: >> Am I right in thinking the authors have not made any of their sample >> input files available? In the case of the multi GB Blast file, this >> is perhaps justified. Also I didn't see any timing script. > > the alignment programs contain the test data. > > the fasta parser and blast parser do not contain test data. The lack > of data is not justified as having a 9GB file adds little to the > comparison over having a 9 MB file as it should scale linearly. It > does show that the parsers can handle large files, but big whoop. And > the test is unaffected by having a 9MB file duplicated 1,000 times. 
>
> the neighbor-joining code contains no test data
>
> There's no timing script.
>
> Andrew
> dalke at dalkescientific.com
>
>
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From mcolosimo at mitre.org Wed Feb 6 15:28:15 2008
From: mcolosimo at mitre.org (Colosimo, Marc E.)
Date: Wed, 6 Feb 2008 10:28:15 -0500
Subject: [BioPython] [bip] Bioinformatics Programming Language Shootout, Python performance poopoo'd
In-Reply-To: <4A584E83-E765-49B7-A383-5F2D7D861269@dalkescientific.com>
References: <128a885f0802051200l1913c1a4waac6bd8e653c06ea@mail.gmail.com>
 <69b1b6300802051436q348f4f25va5f26cd82a949cca@mail.gmail.com>
 <320fb6e00802060244j6764c84apcf3eae45b0badcb0@mail.gmail.com>
 <4A584E83-E765-49B7-A383-5F2D7D861269@dalkescientific.com>
Message-ID:

What is biology in python or more to the point why is there yet another mailing list (Web site?) for biology in python?

From looking at their archive messages:

1. Need to establish python/biology community.....

Isn't that what BioPython is? If not, why not?

I'll also point out that there is "CoreBio" a python toolkit for writing computational biology applications

I don't want to subscribe to another mailing list, install another suite of code, keep track of another Web site.

-----Original Message-----
From: biopython-bounces at lists.open-bio.org [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Andrew Dalke
Sent: Wednesday, February 06, 2008 6:04 AM
To: biopython at lists.open-bio.org
Cc: biology-in-python at lists.idyll.org
Subject: Re: [BioPython] [bip] Bioinformatics Programming Language Shootout,Python performance poopoo'd

On Feb 6, 2008, at 11:44 AM, Peter wrote:
> Am I right in thinking the authors have not made any of their sample
> input files available? In the case of the multi GB Blast file, this
> is perhaps justified. Also I didn't see any timing script.
the alignment programs contain the test data.

the fasta parser and blast parser do not contain test data. The lack of data is not justified as having a 9GB file adds little to the comparison over having a 9 MB file as it should scale linearly. It does show that the parsers can handle large files, but big whoop. And the test is unaffected by having a 9MB file duplicated 1,000 times.

the neighbor-joining code contains no test data

There's no timing script.

Andrew
dalke at dalkescientific.com

From tiagoantao at gmail.com Wed Feb 6 17:05:33 2008
From: tiagoantao at gmail.com (Tiago Antão)
Date: Wed, 6 Feb 2008 17:05:33 +0000
Subject: [BioPython] [Biopython-dev] Biopython to begin transition to Subversion
In-Reply-To: <320fb6e00802060827p37c0aeabk55fa378a4cb35abf@mail.gmail.com>
References: <128a885f0802051927g1d773a51l5b0e7b914e347ffd@mail.gmail.com>
 <320fb6e00802060827p37c0aeabk55fa378a4cb35abf@mail.gmail.com>
Message-ID: <6d941f120802060905h3bc09488tbd7ea3c85bce5914@mail.gmail.com>

Hi,

On Feb 6, 2008 4:27 PM, Peter wrote:
> Michiel - do you think we should try and do another release before the
> CVS freeze and migration? We've had lots of little changes, plus
> Tiago's PopGen work and my own efforts with BioSQL. There are still a
> few open issues, but I think a release soon would be reasonable
> (depending on your time commitments of course).

Just FYI: As I noticed that the SVN move would be happening sooner or later, I decided to put everything into a stable state and stop at that point. Hopefully everything that is PopGen related is stable and ready to move (code, test, doc). As soon as we move to SVN I will get back into committing (now the really interesting stuff will start: statistics and maybe HapMap).
Tiago

From cjfields at uiuc.edu Wed Feb 6 17:19:33 2008
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 6 Feb 2008 11:19:33 -0600
Subject: [BioPython] [bip] Bioinformatics Programming Language Shootout, Python performance poopoo'd
In-Reply-To:
References: <128a885f0802051200l1913c1a4waac6bd8e653c06ea@mail.gmail.com>
 <69b1b6300802051436q348f4f25va5f26cd82a949cca@mail.gmail.com>
 <320fb6e00802060244j6764c84apcf3eae45b0badcb0@mail.gmail.com>
 <4A584E83-E765-49B7-A383-5F2D7D861269@dalkescientific.com>
Message-ID: <2E897128-1227-434D-815B-BA4BC88F2053@uiuc.edu>

On Feb 6, 2008, at 9:28 AM, Colosimo, Marc E. wrote:
>
> What is biology in python or more to the point why is there yet
> another
> mailing list (Web site?) for biology in python?

The BioPython group primarily focuses on the BioPython suite of tools. Other groups might address more general computational issues which may or may not pertain to BioPython. There are similar efforts with perl.

> From looking at their archive messages:
>
> 1. Need to establish python/biology community.....
>
> Isn't that what BioPython is? If not, why not?
>
> I'll also point out that there is "CoreBio" a python toolkit for
> writing computational biology applications
>
>
> I don't want to subscribe to another mailing list, install another
> suite of code, keep track of another Web site.
> ...

You don't have to if you don't want to. This was probably cross-posted by Andrew to bring in discussion on this paper with like-minds from BioPython.

BTW, Andrew et al, speaking as a perl/BioPerl programmer, I also think it's a terribly researched and written piece; surprised it got past the reviewers. Programming language 'shootouts' are always controversial (anything with a 'my language is better than yours' conclusion is bound to cause arguments). One would think a shootout means setting strict rules and having the best/brightest put forward their qualifying code, but clearly in this case that didn't happen.
chris

From dalke at dalkescientific.com Wed Feb 6 18:48:20 2008
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 6 Feb 2008 19:48:20 +0100
Subject: [BioPython] [bip] Bioinformatics Programming Language Shootout, Python performance poopoo'd
In-Reply-To: <2E897128-1227-434D-815B-BA4BC88F2053@uiuc.edu>
References: <128a885f0802051200l1913c1a4waac6bd8e653c06ea@mail.gmail.com>
 <69b1b6300802051436q348f4f25va5f26cd82a949cca@mail.gmail.com>
 <320fb6e00802060244j6764c84apcf3eae45b0badcb0@mail.gmail.com>
 <4A584E83-E765-49B7-A383-5F2D7D861269@dalkescientific.com>
 <2E897128-1227-434D-815B-BA4BC88F2053@uiuc.edu>
Message-ID: <6BAB63D7-57AB-4F3A-A38C-FE8185F88533@dalkescientific.com>

On Feb 6, 2008, at 6:19 PM, Chris Fields wrote:
> You don't have to if you don't want to. This was probably cross-
> posted by Andrew to bring in discussion on this paper with like-
> minds from BioPython.

That was Peter who started the cross-post

> Peter
> P.S. Hello from Biopython

I'm just the one who wrote a lot last year pushing people on the BIP list to use more from Biopython, such as
http://lists.idyll.org/pipermail/biology-in-python/2007-August/000046.html
or for that matter many of my posts from last August
http://lists.idyll.org/pipermail/biology-in-python/2007-August/

:)

Andrew
dalke at dalkescientific.com

From p.j.a.cock at googlemail.com Wed Feb 6 20:47:44 2008
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 6 Feb 2008 20:47:44 +0000
Subject: [BioPython] [bip] Bioinformatics Programming Language Shootout, Python performance poopoo'd
In-Reply-To: <6BAB63D7-57AB-4F3A-A38C-FE8185F88533@dalkescientific.com>
References: <128a885f0802051200l1913c1a4waac6bd8e653c06ea@mail.gmail.com>
 <69b1b6300802051436q348f4f25va5f26cd82a949cca@mail.gmail.com>
 <320fb6e00802060244j6764c84apcf3eae45b0badcb0@mail.gmail.com>
 <4A584E83-E765-49B7-A383-5F2D7D861269@dalkescientific.com>
 <2E897128-1227-434D-815B-BA4BC88F2053@uiuc.edu>
 <6BAB63D7-57AB-4F3A-A38C-FE8185F88533@dalkescientific.com>
Message-ID: <320fb6e00802061247v65366f95u56752325b1f797a5@mail.gmail.com>

Andrew Dalke wrote:
> That was Peter who started the
> cross-post

Ah. So it was - entirely accidentally due to an unwittingly set reply-to field. I've fixed my email settings, and would like to apologise to anyone on the biopython mailing list who ended up getting caught up in the thread as a result (especially Marc).

If any biopython people would like to join in the discussion about this paper, please do join the BIP list - otherwise let's stop the double posting.

The original link was:
http://www.biomedcentral.com/1471-2105/9/82

Andrew Dalke wrote:
> I'm just the one who wrote a lot last year pushing people on the BIP
> list to use more from Biopython, such as ...

A sentiment I agree with ;)

Peter

From vmatthewa at gmail.com Wed Feb 6 21:21:47 2008
From: vmatthewa at gmail.com (Matthew Abravanel)
Date: Wed, 6 Feb 2008 14:21:47 -0700
Subject: [BioPython] Iterating through an alignment to calculate the number of gaps and their lengths
Message-ID: <8fc5e4c20802061321w6c6024ved8a3a0a7a413ba1@mail.gmail.com>

Hi Everyone,

I was wondering if anyone could help, I am trying to write a little python script to iterate through an alignment and determine the number of gaps the alignment has and their lengths and output that information as a list. Such as this made up alignment:

Seq1 ATT-AGC-C
Seq2 AT--AGCTC

and your program runs and outputs like 2 gaps of length 1 outputted as a list like this [1,1] or something like that. I am still learning about python strings and iterators and am not sure how you would approach this? Appreciate any help you could give. Thanks.
Sincerely,

Matthew

From ruchira.datta at gmail.com Wed Feb 6 21:39:02 2008
From: ruchira.datta at gmail.com (Ruchira Datta)
Date: Wed, 6 Feb 2008 13:39:02 -0800
Subject: [BioPython] Iterating through an alignment to calculate the number of gaps and their lengths
In-Reply-To: <8fc5e4c20802061321w6c6024ved8a3a0a7a413ba1@mail.gmail.com>
References: <8fc5e4c20802061321w6c6024ved8a3a0a7a413ba1@mail.gmail.com>
Message-ID:

Hi, Matthew

try this:

    import re
    contiguous_gap = re.compile('-+')
    gappy_regions = contiguous_gap.findall(seq)

Now gappy_regions contains a list of the gappy regions, e.g., if seq = 'ILV--F---AAS', then gappy_regions will be ['--','---']

Then to find the lengths of the gappy_regions, you can just say

    [len(region) for region in gappy_regions]

which would give you in the above example [2,3]

Hope this helps,
--Ruchira

Ruchira S. Datta, Ph.D
Postdoctoral Researcher
Berkeley Phylogenomics Group
324D Stanley Hall
Department of Bioengineering
California Institute for Quantitative Biosciences (QB3)
University of California
Berkeley, CA 94720
Phone: (510) 642-6642
Email: ruchira at berkeley.edu

On Feb 6, 2008 1:21 PM, Matthew Abravanel wrote:
> Hi Everyone,
>
> I was wondering if anyone could help, I am trying to write a little python
> script to iterate through an alignment and determine the number of gaps the
> alignment has and their lengths and output that information as a list.
> Such as this made up alignment:
>
> Seq1 ATT-AGC-C
> Seq2 AT--AGCTC
>
> and your program runs and outputs like 2 gaps of length 1 outputted as a
> list like this [1,1] or something like that. I am still learning about
> python strings and iterators and am not sure how you would approach this?
> Appreciate any help you could give. Thanks.
>
> Sincerely,
>
> Matthew

From biopython at maubp.freeserve.co.uk Wed Feb 6 21:57:48 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 6 Feb 2008 21:57:48 +0000
Subject: [BioPython] Iterating through an alignment to calculate the number of gaps and their lengths
In-Reply-To: <8fc5e4c20802061321w6c6024ved8a3a0a7a413ba1@mail.gmail.com>
References: <8fc5e4c20802061321w6c6024ved8a3a0a7a413ba1@mail.gmail.com>
Message-ID: <320fb6e00802061357m4306b700x7fda2957d1cfb2e2@mail.gmail.com>

On Feb 6, 2008 9:21 PM, Matthew Abravanel wrote:
> Hi Everyone,
>
> I was wondering if anyone could help, I am trying to write a little python
> script to iterate through an alignment and determine the number of gaps the
> alignment has and their lengths and output that information as a list.
> Such as this made up alignment:
>
> Seq1 ATT-AGC-C
> Seq2 AT--AGCTC
>
> and your program runs and outputs like 2 gaps of length 1 outputted as a
> list like this [1,1] or something like that. I am still learning about
> python strings and iterators and am not sure how you would approach this?
> Appreciate any help you could give. Thanks.

I would start with using Bio.SeqIO to read in the sequences as SeqRecord objects - I'm assuming you have them in a file (e.g. fasta format, or maybe clustal?). See the tutorial or http://biopython.org/wiki/SeqIO

e.g.

    from Bio import SeqIO
    handle = open("example.fasta")
    for rec in SeqIO.parse(handle, "fasta") :
        # Print the identifier, the sequence length and the number of gap characters
        print(rec.id, len(rec.seq), rec.seq.count("-"))

The above code will simply count the number of gap characters. I think you wanted to look at the sequence strings and how long each stretch of gap characters is? Rather than counting the number of gap characters? Well that is a little more complicated...
perhaps something like this:

    from Bio import SeqIO
    handle = open("example.fasta")
    gap = "-"
    for rec in SeqIO.parse(handle, "fasta") :
        print(rec.id, rec.seq)
        in_gap = False
        gap_len = 0
        for letter in rec.seq :
            if letter == gap and not in_gap :
                #Start of a gap (this also covers a leading gap)
                in_gap = True
                assert gap_len == 0, "Logic error?"
                gap_len = 1
            elif in_gap and letter == gap :
                #Continuation of a gap
                gap_len += 1
            elif in_gap and letter != gap :
                #End of the gap...
                print(" - Found a gap of length %i" % gap_len)
                #Reset
                in_gap = False
                gap_len = 0
        if in_gap :
            #The sequence ended inside a gap (trailing gap)
            print(" - Found a gap of length %i" % gap_len)

Note that this doesn't record a running tally of the gap lengths found, for which a python dictionary might be sensible.

Peter

From mjldehoon at yahoo.com Thu Feb 7 01:10:06 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Wed, 6 Feb 2008 17:10:06 -0800 (PST)
Subject: [BioPython] [Biopython-dev] Biopython to begin transition to Subversion
In-Reply-To: <320fb6e00802060827p37c0aeabk55fa378a4cb35abf@mail.gmail.com>
Message-ID: <617104.88204.qm@web62413.mail.re1.yahoo.com>

Peter wrote:
> Michiel - do you think we should try and do another release before the
> CVS freeze and migration? We've had lots of little changes, plus
> Tiago's PopGen work and my own efforts with BioSQL. There are still a
> few open issues, but I think a release soon would be reasonable
> (depending on your time commitments of course).

I think that the Subversion/CVS issue is separate from our release schedule, so I don't think that the transition to Subversion by itself should be a reason for a release. However, we can probably make a release soon after the transition. I would like to finalize my work on Bio.WWW before making a release, but hopefully that won't be too complicated.

--Michiel
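
Coming back to the gap-length question in the "Iterating through an alignment" thread above: the two suggestions made there (a regular expression that matches runs of "-", and a running tally of the lengths found) can be combined. What follows is only a minimal sketch in plain Python 3 that works on ordinary strings. The names gap_lengths and tally_gap_lengths and the include_ends switch are made up for illustration; they are not part of Biopython.

```python
import re
from collections import Counter

# One or more consecutive gap characters.
GAP_RUN = re.compile(r"-+")


def gap_lengths(sequence, include_ends=True):
    """Return the length of every run of '-' in sequence, left to right.

    With include_ends=False, runs that touch either end of the sequence
    (leading or trailing gaps) are skipped.
    """
    lengths = []
    for match in GAP_RUN.finditer(sequence):
        touches_end = match.start() == 0 or match.end() == len(sequence)
        if include_ends or not touches_end:
            lengths.append(match.end() - match.start())
    return lengths


def tally_gap_lengths(sequences):
    """Count how often each gap length occurs across all sequences."""
    tally = Counter()
    for sequence in sequences:
        tally.update(gap_lengths(sequence))
    return tally


# The made-up alignment from the original question.
rows = {"Seq1": "ATT-AGC-C", "Seq2": "AT--AGCTC"}
for name, sequence in rows.items():
    print(name, gap_lengths(sequence))
print(dict(tally_gap_lengths(rows.values())))
```

For records read with Bio.SeqIO.parse, passing str(rec.seq) to gap_lengths should be enough. Whether leading and trailing runs count as real gaps depends on the data, which is why the sketch leaves that choice to the caller.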
From biopython-dev at maubp.freeserve.co.uk Thu Feb 7 09:33:49 2008
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Thu, 7 Feb 2008 09:33:49 +0000
Subject: [BioPython] Alignment add_sequence
In-Reply-To: <200802070925.28882.jblanca@btc.upv.es>
References: <200802061706.08830.jblanca@btc.upv.es>
 <320fb6e00802060820h609d5f10vccba4953455794bb@mail.gmail.com>
 <200802070925.28882.jblanca@btc.upv.es>
Message-ID: <320fb6e00802070133n67a549b5k8868a025f423dc82@mail.gmail.com>

On Feb 7, 2008 8:25 AM, Jose Blanca wrote:
> Hi:
> I think I can't use Bio.SeqIO.to_alignment() because the
> sequences have different lengths and start at different
> positions. It's an EST alignment not a clustal-like one.
> I have also looked at your proposal in bug 1944 and I really
> like it, especially the clever __getitem__ method. But I can't
> use it because of the different lengths of the sequences.
> I'm going to add an add_seqRecord method. Now, thanks to you I
> understand why this is not a good solution. But, at least, it
> will do for this time.

The whole idea behind the current alignment class is that all the sequences are the same length (often with gaps). I don't think this fits with your intended usage - unless you pad each record with leading gap characters (according to its start) and then pad the end until they are all the same length.

You could write a function to take a list of SeqRecords and pad them like this (note the example will be easier to read in a mono-spaced font):

e.g.

CONSENSUS:    AGGCCTGAGGCCCCTTTT, start 0
EST1     : CGCAGGCCCGAGGCC, start -3
EST2     :     GGCCTGAGGCCCCTT, start 1
EST3     :        CTGAGGCCACTTTTTCGC, start 4

In this case we want to add (start+3) gaps to each line, where -3 = min(starts).
This becomes:

---AGGCCTGAGGCCCCTTTT, start 0
CGCAGGCCCGAGGCC, start -3
----GGCCTGAGGCCCCTT, start 1
-------CTGAGGCCACTTTTTCGC, start 4

Then work out the maximum length, and pad all the sequences with trailing gaps:

---AGGCCTGAGGCCCCTTTT----
CGCAGGCCCGAGGCC----------
----GGCCTGAGGCCCCTT------
-------CTGAGGCCACTTTTTCGC

A little bit of work, but now all the sequences are the same length and the Biopython alignment class will be happy. As far as I know, there is nothing for this built into Biopython at the moment.

Could you tell us what your input file looks like (e.g. link to the file format?)

Peter

From peter at maubp.freeserve.co.uk Thu Feb 7 09:36:34 2008
From: peter at maubp.freeserve.co.uk (Peter)
Date: Thu, 7 Feb 2008 09:36:34 +0000
Subject: [BioPython] [Biopython-dev] Biopython to begin transition to Subversion
In-Reply-To: <617104.88204.qm@web62413.mail.re1.yahoo.com>
References: <320fb6e00802060827p37c0aeabk55fa378a4cb35abf@mail.gmail.com>
 <617104.88204.qm@web62413.mail.re1.yahoo.com>
Message-ID: <320fb6e00802070136r7984d523rcc3c683d8f897431@mail.gmail.com>

On Feb 7, 2008 1:10 AM, Michiel de Hoon wrote:
> I think that the Subversion/CVS issue is separate from our release schedule,
> so I don't think that the transition to Subversion by itself should be a reason
> for a release. However, we can probably make a release soon after the
> transition. I would like to finalize my work on Bio.WWW before making a
> release, but hopefully that won't be too complicated.
>
> --Michiel

You're right the CVS/SVN migration isn't directly linked - but it's a nice excuse to get a release out ;)

I'd forgotten you still had the Bio.WWW module to sort out, sorry.
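
The padding steps described in the "Alignment add_sequence" reply above (shift every sequence by its start minus the smallest start, then fill the right-hand side up to the longest row) can be sketched as follows. This is an assumption-laden illustration in plain Python 3, not something built into Biopython: the function name pad_reads and the (name, sequence, start) tuple layout are invented here, and the start values are taken to be offsets relative to the consensus, as in the example.

```python
def pad_reads(reads, gap="-"):
    """Pad (name, sequence, start) reads so all sequences share one length.

    Leading gaps place each sequence at its start offset; trailing gaps
    fill every row up to the length of the longest padded row.
    """
    if len(gap) != 1:
        raise ValueError("gap must be a single character")
    min_start = min(start for _name, _seq, start in reads)
    # Step 1: leading gaps, (start - min_start) of them per row.
    shifted = [
        (name, gap * (start - min_start) + seq) for name, seq, start in reads
    ]
    # Step 2: trailing gaps up to the maximum row length.
    width = max(len(seq) for _name, seq in shifted)
    return [(name, seq.ljust(width, gap)) for name, seq in shifted]


# The example rows from the message, with their start offsets.
reads = [
    ("CONSENSUS", "AGGCCTGAGGCCCCTTTT", 0),
    ("EST1", "CGCAGGCCCGAGGCC", -3),
    ("EST2", "GGCCTGAGGCCCCTT", 1),
    ("EST3", "CTGAGGCCACTTTTTCGC", 4),
]
for name, padded in pad_reads(reads):
    print(f"{name:<10}{padded}")
```

The padded strings could then be wrapped back into SeqRecord objects, keeping each record's annotation, before building the alignment. Note that this treats "never sequenced" positions the same as real gaps, which is the trade-off discussed later in the "Alignment class" thread.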
Peter

From kosa at genesilico.pl Thu Feb 7 14:15:28 2008
From: kosa at genesilico.pl (Jan Kosinski)
Date: Thu, 07 Feb 2008 15:15:28 +0100
Subject: [BioPython] Alignment class
In-Reply-To:
References:
Message-ID: <47AB1280.7040209@genesilico.pl>

Peter wrote:
> The whole idea behind the current alignment class is that all the
> sequences are the same length (often with gaps).

I was always wondering what is the reason that you made the alignment class which requires all sequences have the same length (even if incl. gaps)?

Jan Kosinski

From biopython at maubp.freeserve.co.uk Thu Feb 7 14:59:46 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 7 Feb 2008 14:59:46 +0000
Subject: [BioPython] Alignment class
In-Reply-To: <47AB1280.7040209@genesilico.pl>
References: <47AB1280.7040209@genesilico.pl>
Message-ID: <320fb6e00802070659w43dca423s19d98ae354e1a121@mail.gmail.com>

On Feb 7, 2008 2:15 PM, Jan Kosinski wrote:
> Peter wrote:
> > The whole idea behind the current alignment class is that all the
> > sequences are the same length (often with gaps).
> I was always wondering what is the reason that you made the alignment
> class which requires all sequences have the same length (even if incl.
> gaps)?

The design of the current alignment class predates my involvement, but from the point of view of the code (and the column access in particular) it assumes the sequences have the same length. This assumption (with leading/trailing gaps) is also common to all the alignment file formats I have worked with. I like this abstraction as you can regard the alignment as an array of characters (using matrix notation or whatever).

I can see that the EST alignment case is a little different, in that by convention the leading/trailing "gaps" are not shown.
It would be possible to write a new EST class which stored the sequences without leading/trailing "gap"s, but took into account the start offset, and would allow access to the "columns" inserting leading/trailing gaps where a given sequence has not started or has already finished. I don't see that this would be any more useful (except perhaps for a small memory saving).

In general leading/trailing gaps can mean the limits of a gene, or the limit of a domain within a gene, or the limits of a sequenced fragment, etc. Sometimes there really is no character to go there, in other cases the sequence concerned does continue but for whatever reason it was not included in the alignment.

One possibility (depending on what you want to do with the alignment) is to use different characters for internal gaps, leading "gaps" and trailing "gaps".

Peter

From dalke at dalkescientific.com Thu Feb 7 16:09:25 2008
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 7 Feb 2008 17:09:25 +0100
Subject: [BioPython] psyco reference in bioinformatics article
Message-ID: <86A36405-443A-4A76-9352-5C902FB5CAC6@dalkescientific.com>

Does anyone know of a bioinformatics reference that mentions using psyco for improving performance of a Python program, and which mentions numbers?

I know of one for medical informatics that mentions numbers

http://www.biomedcentral.com/1472-6947/2/9/
Preparation of name and address data for record linkage using hidden Markov models
Tim Churches, Peter Christen, Kim Lim, and Justin Xi Zhu
BMC Medical Informatics and Decision Making 2002, 2:9
doi: 10.1186/1472-6947-2-9

> one million address records on the PC platform took 14,061 seconds
> (234 minutes), or 5832 seconds (97 minutes) with the Psyco just-in-
> time Python compiler enabled
>

and an Entrez search finds one for bioinformatics which doesn't mention numbers

http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=1635261
Nucleic Acids Res. 2006 November; 34(20): 5730-5739.
doi: 10.1093/nar/gkl585.
A graph-based motif detection algorithm models complex nucleotide dependencies in transcription factor binding sites
Brian T. Naughton, Eugene Fratkin, Serafim Batzoglou, and Douglas L. Brutlag

> MotifScan will use psyco (http://psyco.sourceforge.net) for a
> performance
> gain, if it is installed.

Andrew
dalke at dalkescientific.com

From jblanca at btc.upv.es Fri Feb 8 15:23:34 2008
From: jblanca at btc.upv.es (Jose Blanca)
Date: Fri, 8 Feb 2008 16:23:34 +0100
Subject: [BioPython] Alignment class
In-Reply-To: <320fb6e00802070659w43dca423s19d98ae354e1a121@mail.gmail.com>
References: <47AB1280.7040209@genesilico.pl>
 <320fb6e00802070659w43dca423s19d98ae354e1a121@mail.gmail.com>
Message-ID: <200802081623.34767.jblanca@btc.upv.es>

Hi:
I've been thinking a little on this alignment problem.

On Thursday 07 February 2008 15:59:46 Peter wrote:
> On Feb 7, 2008 2:15 PM, Jan Kosinski wrote:
> > Peter wrote:
> > > The whole idea behind the current alignment class is that all the
> > > sequences are the same length (often with gaps).
> >
> > I was always wondering what is the reason that you made the alignment
> > class which requires all sequences have the same length (even if incl.
> > gaps)?
>
> The design of the current alignment class predates my involvement, but
> from the point of view of the code (and the column access in
> particular) it assumes the sequences have the same length. This
> assumption (with leading/trailing gaps) is also common to all the
> alignment file formats I have worked with. I like this abstraction as
> you can regard the alignment as an array of characters (using matrix
> notation or what ever).

This kind of alignment is useful, but in my opinion it would be better if the sequences could have different lengths and start points.

>
> I can see that the EST alignment case is a little different, in that
> by convention the leading/trailing "gaps" are not shown. It would be
> possible to write an new EST class which stored the sequences without
> leading/trailing "gap"s, but took into account the start offset, and
> would allow access to the "columns" inserting leading/trailing gaps
> where a given sequence has not started or has already finished. I
> don't see that this would be any more useful (except perhaps for a
> small memory saving)
>
> In general leading/trailing gaps can mean the limits of a gene, or the
> limit of a domain with an gene, or the limits of a sequenced fragment,
> etc. Sometimes there really is no character to go there, in other
> cases the sequence concerns does continue but for whatever reason it
> was not included in the alignment.
>
> One possibility (depending on what you want to do with the alignment)
> is to use different characters for internal gaps, leading "gaps" and
> trailing "gaps".

That would be a good solution for the EST case, although it could have some memory problems with longer sequences.

Anyway I felt like experimenting a bit so I looked at bioperl for inspiration. For this problem they use ranges and LocatableSeqs. I don't know if we need a full featured BioRange class for this problem; I've coded one, but I haven't used it. I have coded a draft of a LocatableSeq class and I've made some minimal modifications to the newAlignment proposal from bug 1944 (http://bugzilla.open-bio.org/show_bug.cgi?id=1944). Maybe I should have created an Alignment subclass, but I think the most relevant change is the new LocatableSeq class.

This is not a finished work, but it's mostly working and I would like to know your opinions.

This is my first attempt to create something in python. I'm ready to learn from you, I will take the suggestions and criticisms with a smile, so don't be shy. I guess that I could have broken some style rules, I hope to learn them with some time and help.

Best regards,

--
Jose M.
Blanca Postigo
Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)

-------------- next part --------------
A non-text attachment was scrubbed...
Name: newAlignment.py
Type: application/x-python
Size: 22982 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: LocatableSeq.py
Type: application/x-python
Size: 8195 bytes
Desc: not available
URL:

From rwbarrette at gmail.com Fri Feb 8 17:07:35 2008
From: rwbarrette at gmail.com (Roger Barrette)
Date: Fri, 8 Feb 2008 12:07:35 -0500
Subject: [BioPython] Clustalw error: .aln not produced
Message-ID: <2af454d50802080907w5c8b8796r7c6b99185bdf650d@mail.gmail.com>

Hey all,

I'm trying to run clustalw from python (windows) using the simple script example below;

    import os
    from Bio.Clustalw import MultipleAlignCL
    from Bio.Clustalw import do_alignment
    from sys import *

    cline = MultipleAlignCL("c:\\adenotest.fasta")
    cline.set_output("c:\\adeno4.aln")
    print "Command line: ", cline

    align = do_alignment(cline)
    for seq in align.get_all_seqs():
        print seq.description
        print seq.seq

This generates the command line "clustalw c:\adenotest.fasta -OUTFILE=c:\adeno4.aln"

However, I continuously get the following error message:

IOError: Output .aln file c:\adeno4.aln not produced, commandline: clustalw c:\adenotest.fasta -OUTFILE=c:\adeno4.aln

I do have the clustalw executable in the path, and when I copy the generated command line for clustalw into the windows command line, it runs fine, and generates the alignment, with no errors.

I updated the clustalw _init_ file, but the error still remains. Any thoughts or suggestions would be greatly appreciated. Thanks.
-Roger

From biopython at maubp.freeserve.co.uk Fri Feb 8 17:17:23 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 8 Feb 2008 17:17:23 +0000
Subject: [BioPython] Clustalw error: .aln not produced
In-Reply-To: <2af454d50802080907w5c8b8796r7c6b99185bdf650d@mail.gmail.com>
References: <2af454d50802080907w5c8b8796r7c6b99185bdf650d@mail.gmail.com>
Message-ID: <320fb6e00802080917y4a8dff0cw4f52559b41fcbb0a@mail.gmail.com>

On Feb 8, 2008 5:07 PM, Roger Barrette wrote:
> Hey all,
>
> I'm trying to run clustalw from python (windows) using the simple script
> example below;
> ...
> I do have the clustalw executable in the path, and when I copy the generated
> command line for clustalw into the windows command line, it runs fine, and
> generates the alignment, with no errors.
>
> I updated the clustalw _init_ file, but the error still remains. Any
> thoughts or suggestions would be greatly appreciated. Thanks.

Are you sure you are using the latest Bio/Clustalw/__init__.py from CVS? I would have expected it to try a command line like:

clustalw -INFILE=c:\adenotest.fasta -OUTFILE=c:\adeno4.aln

What version of clustalw do you have (in case that makes a difference)? Have you tried supplying the full path to the clustalw.exe file?

Peter

From rwbarrette at gmail.com Fri Feb 8 17:53:05 2008
From: rwbarrette at gmail.com (Roger Barrette)
Date: Fri, 8 Feb 2008 12:53:05 -0500
Subject: [BioPython] Clustalw error: .aln not produced
In-Reply-To: <320fb6e00802080917y4a8dff0cw4f52559b41fcbb0a@mail.gmail.com>
References: <2af454d50802080907w5c8b8796r7c6b99185bdf650d@mail.gmail.com>
 <320fb6e00802080917y4a8dff0cw4f52559b41fcbb0a@mail.gmail.com>
Message-ID: <2af454d50802080953l5d8ffd1j42bf3ff5e5bf80c1@mail.gmail.com>

Hi Peter,

I'm using the version 1.16 of the clustalw_init_file, and clustalw version 2.0.
I notice that when I run clustalw from the windows command line, it generates the adeno4.aln file.
After this file is generated, the script WILL successfully run from python. It doesn't appear to be able to create a new file when called from the python script, but it will update and modify the existing one. Am I not setting up the files correctly?

I'm not sure what you mean by "supply the full path to the clustalw.exe file". I have the location of the executable clustalw.exe described in the system path, and it runs directly from the windows command line, so I would assume it is properly mapped. If you mean the path to the .fasta file and location of the output file for clustalw to use; they are being input directly at the clustalw command, or am I missing something? Thanks.

-Roger

On 2/8/08, Peter wrote:
>
> On Feb 8, 2008 5:07 PM, Roger Barrette wrote:
> > Hey all,
> >
> > I'm trying to run clustalw from python (windows) using the simple script
> > example below;
> > ...
> > I do have the clustalw executable in the path, and when I copy the generated
> > command line for clustalw into the windows command line, it runs fine, and
> > generates the alignment, with no errors.
> >
> > I updated the clustalw _init_ file, but the error still remains. Any
> > thoughts or suggestions would be greatly appreciated. Thanks.
>
> Are you sure you are using the latest Bio/Clustalw/__init__.py from
> CVS? I would have expected it to try a command line like:
>
> clustalw -INFILE=c:\adenotest.fasta -OUTFILE=c:\adeno4.aln
>
> What version of clustalw do you have (in case that makes a difference)?
> Have you tried supplying the full path to the clustalw.exe file?
>
> Peter
>

From rwbarrette at gmail.com Fri Feb 8 18:18:16 2008
From: rwbarrette at gmail.com (Roger Barrette)
Date: Fri, 8 Feb 2008 13:18:16 -0500
Subject: [BioPython] Clustalw error: .aln not produced
In-Reply-To: <2af454d50802080953l5d8ffd1j42bf3ff5e5bf80c1@mail.gmail.com>
References: <2af454d50802080907w5c8b8796r7c6b99185bdf650d@mail.gmail.com>
 <320fb6e00802080917y4a8dff0cw4f52559b41fcbb0a@mail.gmail.com>
 <2af454d50802080953l5d8ffd1j42bf3ff5e5bf80c1@mail.gmail.com>
Message-ID: <2af454d50802081018i217e3080j97d6aef897f7934a@mail.gmail.com>

Hey Peter,

I went and downloaded clustalw version 1.83, and that fixed the problem, so it would appear at least that part of the problem has to do with clustalw 2.0. Thanks.

-Roger

On 2/8/08, Roger Barrette wrote:
>
> Hi Peter,
>
> I'm using the version 1.16 of the clustalw_init_file, and clustalw version
> 2.0.
> I notice that when I run clustalw from the windows command line, it
> generates the adeno4.aln file. After this file is generated, the script
> WILL successfully run from python. It doesn't appear to be able to create a
> new file when called from the python script, but it will update and modify
> the existing one. Am I not setting up the files correctly?
>
> I'm not sure what you mean by "supply the full path to the clustalw.exe
> file". I have the location of the executable
> clustalw.exe described in the system path, and it runs directly from the
> windows command line, so I would assume it is properly mapped. If you mean
> the path to the .fasta file and location of the output file for clustalw to
> use; they are being input directly at the clustalw command, or am I missing
> something? Thanks.
>
> -Roger
>
>
> On 2/8/08, Peter wrote:
> >
> > On Feb 8, 2008 5:07 PM, Roger Barrette wrote:
> > > Hey all,
> > >
> > > I'm trying to run clustalw from python (windows) using the simple script
> > > example below;
> > > ...
> > > I do have the clustalw executable in the path, and when I copy the > > generated > > > command line for clustalw into the windows command line, it runs fine, > > and > > > generates the alignment, with no errors. > > > > > > I updated the clustalw _init_ file, but the error still remains. Any > > > thoughts or suggestions would be greatly appreciated. Thanks. > > > > Are you sure you are using the latest Bio/Clustalw/__init__.py from > > CVS? I would have expected it to try a command line like: > > > > clustalw -INFILE=c:\adenotest.fasta -OUTFILE=c:\adeno4.aln > > > > What version of clustalw do you have (in case that makes a difference)? > > Have you tried supplying the full path to the clustalw.exe file? > > > > Peter > > > > -- Roger William Barrette II, Ph.D Microbiologist USDA /APHIS/ VS/ FADDL Plum Island Animal Disease Center P.O. Box 848, Greenport, NY 11944 631-323-3300 (Lab) 631-323-3200 x4415 (Office) RWBarrette at gmail.com Roger.W.Barrette at APHIS.USDA.GOV From biopython at maubp.freeserve.co.uk Fri Feb 8 19:10:57 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 8 Feb 2008 19:10:57 +0000 Subject: [BioPython] Clustalw error: .aln not produced In-Reply-To: <2af454d50802080942h38b881bdxc39850ce49575f7d@mail.gmail.com> References: <2af454d50802080907w5c8b8796r7c6b99185bdf650d@mail.gmail.com> <320fb6e00802080917y4a8dff0cw4f52559b41fcbb0a@mail.gmail.com> <2af454d50802080942h38b881bdxc39850ce49575f7d@mail.gmail.com> Message-ID: <320fb6e00802081110q31bd45ccv7bca1f1fa31d4595@mail.gmail.com> Hi Robin, > I'm using the version 1.16 of the clustalw_init_file, That is the latest revision of Bio/Clustalw/__init__.py in CVS, good. > and clustalw version 2.0. I haven't tried that, only version 1.83 I think. This could be the problem, but you did say that the command line work when run by hand. I might have time to check the new version this weekend... > I'm not sure what you mean by "supply the full path to the clustalw.exe > file". 
I have the location of the executable clustalw.exe described in the > system path, and it runs directly from the windows command line, so I would > assume it is properly mapped. I meant rather than trusting Windows will find the executable on the system path, try specifying it in full, e.g. C:\Program Files\Clustal\Clustalw.exe Peter From hlapp at gmx.net Thu Feb 14 01:54:28 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 14 Feb 2008 10:54:28 +0900 Subject: [BioPython] [BioSQL-l] update DBSeqRecords In-Reply-To: <47B32040.1040400@ucd.ie> References: <47B2BAD7.9000109@ucd.ie> <6B2BB40A-8F6A-4757-8A3E-944759298144@gmx.net> <47B32040.1040400@ucd.ie> Message-ID: <69A88D1B-4462-4669-BA44-FFF869947437@gmx.net> Andreas - I really don't know anything about Biopython (but many others on the list may, especially the Biopython list, which I'm cc'ing too). So - I'm passing this on to Biopythonians to respond. -hilmar On Feb 14, 2008, at 1:52 AM, Andreas De Stefani wrote: > Thanks Hilmar, > > I kind of figured this and i am just using the adaptor to execute > the sql statement to delete the entry. > I also noticed that i cannot access all the information via > biopython/biosql, i would like to show the comments for each entry > but i cant find any attribute in the DBSeqRecord to access this > information. Is this something which will be added in the near future? > > My workaround is to use the adaptor from the record and just > execute a sql query ... but that might not be the ideal way to do it!? > > thanks again, > > Andreas > > > > Hilmar Lapp wrote: >> As Peter says this is easily possible, simply delete the sequence >> (protein) first that you want to update and then reload it. >> >> This is also called the 'refresh' mode of updating. >> >> -hilmar >> >> On Feb 13, 2008, at 6:39 PM, Andreas De Stefani wrote: >> >>> Hi Guys, >>> >>> I was wondering if it is possible to update a single DBSeqRecord, >>> without having to delete the whole sub datbase first... 
>>> >>> I am using BioPython and BioSQL and what I intend todo is to >>> create a local "cache" for protein informations which i get from >>> the web, and after a month or so i would like to re-fetch the >>> info from the web and update the local protein information >>> "cache" (which uses BioSQL). >>> >>> It basically will work like this: >>> >>> if the user requests information for a certain protein the >>> program queries the local DB using the accession number and sees >>> if there is information about the protein, if not (or if the >>> protein is expired, ie older than a month) it gets the info from >>> the web (expasy) and loads (updates the protein information in) >>> the local database. However, is a update of a single protein >>> entry possible? when inserting the same protein i get the >>> following error: >>> >>> (, IntegrityError(1062, >>> "Duplicate entry 'P08317-1-0' for key 2"), >> 0xd6b170>) >>> >>> i am just using db.load(...) again, but maybe there is another >>> way to update entries? >>> >>> Hope somebody can help me with this, thanks very much in advance! 
>>> >>> Andy >>> _______________________________________________ >>> BioSQL-l mailing list >>> BioSQL-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biosql-l >> > > > -- > Biontrack - bioinformatics solutions > e: andreas.destefani at biontrack.com > w: www.biontrack.com > t: +353 (0)1 716 3760 > f: +353 (0)1 716 3709 > m: +353 85 141 9941 -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From ULNJUJERYDIX at spammotel.com Thu Feb 14 06:02:18 2008 From: ULNJUJERYDIX at spammotel.com (Kevin Lam) Date: Thu, 14 Feb 2008 14:02:18 +0800 Subject: [BioPython] retrieve sequence coordinates of exons for a stretch of genes Message-ID: <5b6410e0802132202n6dc2a828x6d29e28fcb81e17b@mail.gmail.com> Hi I have been scouring through the web for something I thought was a rather simple task but I can't find the answer. How do I get the sequence coordinates for exons of genes in a stretch of genome demarcated by say HoxA13 and Hox A1 ? below is the example of the data I am looking for. 1026087..1026688 1026807..1026834 1026839..1027045 HOXD12 1033641..1034421 1035192..1035427 1035428..1035873 HOXD11 From p.j.a.cock at googlemail.com Thu Feb 14 11:01:14 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 14 Feb 2008 11:01:14 +0000 Subject: [BioPython] retrieve sequence coordinates of exons for a stretch of genes In-Reply-To: <5b6410e0802132202n6dc2a828x6d29e28fcb81e17b@mail.gmail.com> References: <5b6410e0802132202n6dc2a828x6d29e28fcb81e17b@mail.gmail.com> Message-ID: <320fb6e00802140301s111a08famecd9a23b74aba1aa@mail.gmail.com> Hi Kevin, Where do you normally get your genomes from? I am most familiar with the NCBI formats, so I would start by examining the GenBank file for the relevant genome. 
Have a look by hand first - it may well have features for these genes, and in particular a CDS feature which marks out the introns/exons for you.

Biopython will read GenBank files, although I would say dealing with the locations via the SeqFeature object is a little fiddly... have a look at the main documentation and also perhaps http://www2.warwick.ac.uk/go/peter_cock/python/genbank/

Peter

On Thu, Feb 14, 2008 at 6:02 AM, Kevin Lam wrote:
> Hi
> I have been scouring through the web for something I thought was a rather
> simple task but I can't find the answer.
>
> How do I get the sequence coordinates for exons of genes in a stretch of
> genome demarcated by say HoxA13 and Hox A1 ?
>
> below is the example of the data I am looking for.
>
> 1026087..1026688 1026807..1026834 1026839..1027045 HOXD12
> 1033641..1034421 1035192..1035427 1035428..1035873 HOXD11
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From tiagoantao at gmail.com Thu Feb 14 11:20:40 2008
From: tiagoantao at gmail.com (Tiago Antão)
Date: Thu, 14 Feb 2008 11:20:40 +0000
Subject: [BioPython] retrieve sequence coordinates of exons for a stretch of genes
In-Reply-To: <5b6410e0802132202n6dc2a828x6d29e28fcb81e17b@mail.gmail.com>
References: <5b6410e0802132202n6dc2a828x6d29e28fcb81e17b@mail.gmail.com>
Message-ID: <6d941f120802140320v14ef63d6h755238b26459f01@mail.gmail.com>

On Thu, Feb 14, 2008 at 6:02 AM, Kevin Lam wrote:
> I have been scouring through the web for something I thought was a rather
> simple task but I can't find the answer.
>
> How do I get the sequence coordinates for exons of genes in a stretch of
> genome demarcated by say HoxA13 and Hox A1 ?
>
> below is the example of the data I am looking for.
> > 1026087..1026688 1026807..1026834 1026839..1027045 HOXD12 > 1033641..1034421 1035192..1035427 1035428..1035873 HOXD11 Have a look at the UCSC Genome Browser http://genome.cse.ucsc.edu/cgi-bin/hgTables on the table knownGene you have things like lists of exonStarts and exonEnds. I would like, in the long run, to support this in biopython (I have python code which I can share), but this won't happen in the next few months for sure (unless it is some sort of team work...). From sdavis2 at mail.nih.gov Thu Feb 14 11:26:05 2008 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 14 Feb 2008 06:26:05 -0500 Subject: [BioPython] retrieve sequence coordinates of exons for a stretch of genes In-Reply-To: <5b6410e0802132202n6dc2a828x6d29e28fcb81e17b@mail.gmail.com> References: <5b6410e0802132202n6dc2a828x6d29e28fcb81e17b@mail.gmail.com> Message-ID: <264855a00802140326j68f49ddbo339d1906c15b2844@mail.gmail.com> On Thu, Feb 14, 2008 at 1:02 AM, Kevin Lam wrote: > Hi > I have been scouring through the web for something I thought was a rather > simple task but I can't find the answer. > > How do I get the sequence coordinates for exons of genes in a stretch of > genome demarcated by say HoxA13 and Hox A1 ? > > below is the example of the data I am looking for. > > 1026087..1026688 1026807..1026834 1026839..1027045 HOXD12 > 1033641..1034421 1035192..1035427 1035428..1035873 HOXD11 UCSC and Ensembl both offer simple tools for doing this sort of thing. In UCSC, they call it the "table browser", while in Ensembl, they call it ensmart. Both allow you to specify a region and get various interesting pieces of information from those regions. I would look at those two interfaces, as they will do what you need. Alternatively, both offer open MySQL access to the underlying databases. Of course, this assumes that the organism that you are interested in is available in UCSC and/or Ensembl. If you need more details, feel free to ask.... 
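
The exonStarts and exonEnds lists mentioned for the knownGene table can be turned into the "start..end" form shown in the question with a few lines of plain Python. This is only a minimal sketch: it assumes a row has already been fetched as two strings, that the lists are comma-separated with a trailing comma, and that starts are 0-based while ends are 1-based (the UCSC convention, as far as I know). The function names and the sample values are made up for illustration.

```python
def parse_exons(exon_starts, exon_ends):
    """Convert UCSC-style exon lists into 1-based inclusive (start, end) pairs."""
    # The lists usually end with a trailing comma, so skip empty fields.
    starts = [int(x) for x in exon_starts.split(",") if x]
    ends = [int(x) for x in exon_ends.split(",") if x]
    if len(starts) != len(ends):
        raise ValueError("exonStarts and exonEnds differ in length")
    # Assumed convention: 0-based start, 1-based end. Adding 1 to each
    # start gives GenBank-style coordinates.
    return [(start + 1, end) for start, end in zip(starts, ends)]


def format_exons(gene_name, exons):
    """Render exons as 'start..end start..end GENE'."""
    return " ".join("%d..%d" % pair for pair in exons) + " " + gene_name


# Hypothetical input values, chosen to reproduce the first line of the question.
exons = parse_exons("1026086,1026806,1026838,", "1026688,1026834,1027045,")
print(format_exons("HOXD12", exons))
```

The same two functions would apply to rows pulled from the public MySQL server or from a table browser download, provided the columns follow the same layout.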
Sean

From pet85 at libero.it Mon Feb 18 22:19:29 2008
From: pet85 at libero.it (Crivellaro Patrizia)
Date: Mon, 18 Feb 2008 23:19:29 +0100
Subject: [BioPython] Fwd:
Message-ID:

Does anyone know how to save a sequence in FASTA format not as a text file .txt but as a file .fasta?
Thank you very, very much!

From biopython at maubp.freeserve.co.uk Tue Feb 19 00:04:35 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 19 Feb 2008 00:04:35 +0000
Subject: [BioPython] Fwd:
In-Reply-To:
References:
Message-ID: <320fb6e00802181604q476d860fx9f8e53d7bfa00174@mail.gmail.com>

On 2/18/08, Crivellaro Patrizia wrote:
> Do someone know how to save a sequence in FASTA format not
> as a text file .txt but as a file .fasta??
> thank you very very much!

How have you got your sequences in the first place? How about something very simple like:

name = "Test"
seq = "ATAGACTACGCATACGACT"
# The extension is just part of the filename given to open()
handle = open("example.fasta", "w")
# FASTA: a ">" title line followed by the sequence
handle.write(">%s\n%s\n" % (name, seq))
handle.close()

Maybe you should read the Biopython tutorial or http://biopython.org/wiki/SeqIO for more ideas?
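
Building on the simple snippet above, here is a minimal sketch that also wraps long sequences over several lines, which many tools expect. It uses only plain Python; the function names and the default width of 60 are my own choices for illustration, not anything taken from Biopython.

```python
def format_fasta(name, seq, width=60):
    """Return one FASTA record as text, wrapping the sequence at `width` letters."""
    lines = [">%s" % name]
    for start in range(0, len(seq), width):
        lines.append(seq[start:start + width])
    return "\n".join(lines) + "\n"


def write_fasta(path, name, seq, width=60):
    """Write a single record to `path`; a name like 'example.fasta' sets the extension."""
    with open(path, "w") as handle:
        handle.write(format_fasta(name, seq, width))


print(format_fasta("Test", "ATAGACTACGCATACGACT"), end="")
```

Calling write_fasta("example.fasta", "Test", seq) would then produce the .fasta file directly, since the extension is nothing more than part of the filename.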
Peter From smriti.sebastuan at gmail.com Tue Feb 19 04:59:30 2008 From: smriti.sebastuan at gmail.com (smriti Sebastian) Date: Tue, 19 Feb 2008 10:29:30 +0530 Subject: [BioPython] Parser Message-ID: <22c5c6390802182059w4b11d66bi1a0a76c5d4c7f018@mail.gmail.com> Hi, Can anyone plz help me how to parse description part of PSI Blast output.When I use the description method I am getting an error there is no such attribute.Thanks in advance From biopython at maubp.freeserve.co.uk Tue Feb 19 09:55:07 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 19 Feb 2008 09:55:07 +0000 Subject: [BioPython] Parser In-Reply-To: <22c5c6390802182059w4b11d66bi1a0a76c5d4c7f018@mail.gmail.com> References: <22c5c6390802182059w4b11d66bi1a0a76c5d4c7f018@mail.gmail.com> Message-ID: <320fb6e00802190155i77354f6el17a44f86f114ad83@mail.gmail.com> On 2/19/08, smriti Sebastian wrote: > Hi, > Can anyone plz help me how to parse description part of PSI Blast > output.When I use the description method I am getting an error there is no > such attribute.Thanks in advance Could you show us your code, and the full error message? The BLAST examples in the tutorial should be helpful... Peter From biopython at maubp.freeserve.co.uk Wed Feb 20 00:09:09 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 20 Feb 2008 00:09:09 +0000 Subject: [BioPython] Parser In-Reply-To: <22c5c6390802191000h71e37ba7ha545e4b259797fb5@mail.gmail.com> References: <22c5c6390802182059w4b11d66bi1a0a76c5d4c7f018@mail.gmail.com> <320fb6e00802190155i77354f6el17a44f86f114ad83@mail.gmail.com> <22c5c6390802191000h71e37ba7ha545e4b259797fb5@mail.gmail.com> Message-ID: <320fb6e00802191609t4689a819m1a50bd09dc152fdb@mail.gmail.com> Hi smriti. That does look helpful. Assuming its not too big, could you email me the psi_out file (off the list to avoid clogging up everyone email). Once we sort this out, it would be a good idea for us to update the PSI Blast section of the tutorial... 
Peter On Feb 19, 2008 6:00 PM, smriti Sebastian wrote: > > Hi , > My code is like this: > > #!usr/bin/python > fh=open('psi_out','r') > import Bio > from Bio.Blast import * > import Bio.Blast.NCBIStandalone > b_parser=Bio.Blast.NCBIStandalone.PSIBlastParser() > b_record=b_parser.parse(fh) > > E_VALUE_THRESH=0.04 > for line in b_record.rounds: > > for record in line.descriptions: > print record > > My error: > > Traceback (most recent call last): > File "parse_psi.py", line 12, in > for record in line.descriptions: > AttributeError: Round instance has no attribute 'descriptions' Whatever the "line" object does, it seems it doesn't have a "descriptions" attribute. What does dir(line) give? Peter From ivan at biodec.com Wed Feb 20 09:31:30 2008 From: ivan at biodec.com (Ivan Rossi) Date: Wed, 20 Feb 2008 10:31:30 +0100 (CET) Subject: [BioPython] plone4bio project starts Message-ID: Dear list members, We are pleased to announce a new Plone project: plone4bio. As the name suggest it is intended as a set of products to do bioinformatics within the Plone CMS. Plone4Bio takes advantage of Biopython. What is plone4bio The rationale of the plone4bio project is to provide an integrated environment where it is possible to manage and analyze biological sequences. The plone4bio package provides the possibility to add a new plone content type, called sequence, than can be either written by hand or imported from a FASTA file, and to apply to that sequence a program, called predictor, that gives back a plot of predicted probabilities for the sequence to have a given property (the property that the predictor tries to determine). thus a predictor can try to assess if a protein sequence is trans-membrane, whether a signal peptide exists, and so on. plone4bio.base The plone4bio.base is a package that defines a skeleton predictor: deriving from that it is possible to integrate any other application and visualize all the results together. 
biocomp.pscoils This is an example predictor, encapsulating the pscoils algorithm by Fariselli et al. available at http://www.biocomp.unibo.it/ It is intended both as an example on how to integrate one's own predictor in the plone4bio framework. Requirements 1. python2.4 2. python setup tools (the python-setuptools Debian package) 3. biopython 4. PIL Download and Project page The software is available at http://www.plone4bio.org Further information Available either through the web site (plone4bio.org) or subscribing to the mailing list (p4b at biodec dot com) For installation and documentation issues refer to README.txt and INSTALL.txt files from the archive, or the script published on the plone4bio wiki site. plone4bio is published under the GPL license. This product is produced independently from the product Plone, and carries no guarantee from the Plone Foundation about quality, suitability or anything else. The supplier of this product assumes all responsibility for it. -- Ivan Rossi, PhD - ivan AT biodec dot com - ivan dot rossi3 AT unibo dot it BioDec Srl, Via Calzavecchio 20/2, 40033 Casalecchio di Reno (BO), Italy Phone: (+39)-051-0548263 - Fax: (+39)-051-7459582 - http://www.biodec.com From biopython at maubp.freeserve.co.uk Wed Feb 20 12:22:48 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 20 Feb 2008 12:22:48 +0000 Subject: [BioPython] Parser In-Reply-To: <22c5c6390802192340n2f4cab8vb767eea6f098cac5@mail.gmail.com> References: <22c5c6390802182059w4b11d66bi1a0a76c5d4c7f018@mail.gmail.com> <320fb6e00802190155i77354f6el17a44f86f114ad83@mail.gmail.com> <22c5c6390802191000h71e37ba7ha545e4b259797fb5@mail.gmail.com> <320fb6e00802191609t4689a819m1a50bd09dc152fdb@mail.gmail.com> <22c5c6390802192340n2f4cab8vb767eea6f098cac5@mail.gmail.com> Message-ID: <320fb6e00802200422o597c8f56r8bfec5d5d3c9b035@mail.gmail.com> On Wed, Feb 20, 2008 at 7:40 AM, smriti Sebastian wrote: > dir(line) is not giving descriptions in it.But if we check the > 
NCBIStandalone.py file it has an attribute called descriptions.
> I am attaching the file

The b_record object is a Bio.Blast.Record.PSIBlast instance, which has different attributes to the "normal blast" object. In particular, the list "rounds" of Bio.Blast.Record.Round objects, and the boolean/integer "converged". Try:

help(Bio.Blast.Record.PSIBlast)
help(Bio.Blast.Record.Round)

I'm not sure exactly what you want to achieve, but perhaps something like this would be a start:

#!/usr/bin/python
import Bio.Blast.NCBIStandalone

# Parse the plain-text PSI-BLAST output into a single PSIBlast record
fh = open('psi_out', 'r')
b_parser = Bio.Blast.NCBIStandalone.PSIBlastParser()
b_record = b_parser.parse(fh)

E_VALUE_THRESH = 0.04

# The record holds one Round object per PSI-BLAST iteration
for blast_round in b_record.rounds:
    print blast_round
    for aln in blast_round.alignments:
        print aln

Peter

From lueck at ipk-gatersleben.de Thu Feb 21 14:50:58 2008
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Thu, 21 Feb 2008 15:50:58 +0100
Subject: [BioPython] write a genbank file
Message-ID: <004501c87499$2b461290$1022a8c0@ipkgatersleben.de>

Hi!

Can someone give me an example of how to write a new GenBank file in Python? I want to run a BLAST search and use the location of the match as a feature in a GenBank file (and finally work on it in DNA Star). Is that possible at all?

Thanks in advance!
Stefanie

From hlapp at gmx.net Fri Feb 22 03:21:19 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 21 Feb 2008 22:21:19 -0500
Subject: [BioPython] BioSQL documentation for Biopython
Message-ID: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net>

Hi all,

there is some Biopython-related documentation of BioSQL and using Biopython's language binding within the BioSQL codebase:

http://code.open-bio.org/svnweb/index.cgi/biosql/browse/biosql-schema/trunk/doc/biopython

This is about 5 years old (since it was last updated by Brad Chapman), according to the svn log.

Could some Biopythonist check this material whether or not it still has any relevance, and whether there are any errors?
I'll be releasing within the next couple of days, so if this is outdated I'd like to remove it from (at least) the release branch. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From ULNJUJERYDIX at spammotel.com Fri Feb 22 07:01:48 2008 From: ULNJUJERYDIX at spammotel.com (Kevin Lam) Date: Fri, 22 Feb 2008 15:01:48 +0800 Subject: [BioPython] **Fwd: write a genbank file In-Reply-To: <004501c87499$2b461290$1022a8c0@ipkgatersleben.de> References: <004501c87499$2b461290$1022a8c0@ipkgatersleben.de> Message-ID: <5b6410e0802212301q61133365tfc8e819c825d9e09@mail.gmail.com> http://www.embl-heidelberg.de/~chenna/PySAT/ might be helpful From sbassi at gmail.com Fri Feb 22 23:45:19 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Fri, 22 Feb 2008 21:45:19 -0200 Subject: [BioPython] BioSQL documentation for Biopython In-Reply-To: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net> References: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net> Message-ID: On Fri, Feb 22, 2008 at 1:21 AM, Hilmar Lapp wrote: > Could some Biopythonist check this material whether or not it still > has any relevance, and whether there are any errors? I've tried to import the schema (biosqldb-mysql.sql) into my MySQL server (5.0.45-Debian_1ubuntu3-log) and got this: Error SQL query: -- CONFIG: you may want to add this for mysql because MySQL often is broken -- with respect to using the composite index for the initial keys - - CREATE INDEX ontrel_subjectid ON term_relationship( subject_term_id ); MySQL said: #1064 - You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '--CREATE INDEX ontrel_subjectid ON term_relationship(subject_term_id)' at line 1 So I tried with "compatibility with SQL323" (option in the phpmyadmin), but got the same result. 
-- 
Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6
Bioinformatics news: http://www.bioinformatica.info
Tutorial libre de Python: http://tinyurl.com/2az5d5

From hlapp at gmx.net Sat Feb 23 02:41:52 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Fri, 22 Feb 2008 21:41:52 -0500
Subject: [BioPython] BioSQL documentation for Biopython
In-Reply-To:
References: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net>
Message-ID: <16F88C8D-70DE-4020-B97D-B3D43AF530AF@gmx.net>

Interesting. Apparently MySQL needs a space after the double-dash comment prefix:

    In MySQL, the "-- " (double-dash) comment style requires the second dash to be followed by at least one whitespace or control character (such as a space, tab, newline, and so on). This syntax differs slightly from standard SQL comment syntax [...].

If you add that space, i.e., change the line:

    --CREATE INDEX ontrel_subjectid ON term_relationship(subject_term_id);

to

    -- CREATE INDEX ontrel_subjectid ON term_relationship(subject_term_id);

and do the same for all other lines where this occurs, does it work then?

I've also updated the MySQL schema on svn:

http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/sql/biosqldb-mysql.sql

BTW the reason this hasn't come up before is that most everyone uses the mysql command line client to instantiate the schema, which ignores lines starting with '--'.

-hilmar

On Feb 22, 2008, at 6:45 PM, Sebastian Bassi wrote:

> On Fri, Feb 22, 2008 at 1:21 AM, Hilmar Lapp wrote:
>> Could some Biopythonist check this material whether or not it still
>> has any relevance, and whether there are any errors?
> > I've tried to import the schema (biosqldb-mysql.sql) into my MySQL > server (5.0.45-Debian_1ubuntu3-log) and got this: > > Error > SQL query: > -- CONFIG: you may want to add this for mysql because MySQL often > is broken > -- with respect to using the composite index for the initial keys > - - CREATE INDEX ontrel_subjectid ON term_relationship( > subject_term_id > ); > > MySQL said: > #1064 - You have an error in your SQL syntax; check the manual that > corresponds to your MySQL server version for the right syntax to use > near '--CREATE INDEX ontrel_subjectid ON > term_relationship(subject_term_id)' at line 1 > > So I tried with "compatibility with SQL323" (option in the > phpmyadmin), but got the same result. > > > -- > Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 > Bioinformatics news: http://www.bioinformatica.info > Tutorial libre de Python: http://tinyurl.com/2az5d5 > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From sbassi at gmail.com Sat Feb 23 03:17:36 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Sat, 23 Feb 2008 01:17:36 -0200 Subject: [BioPython] BioSQL documentation for Biopython In-Reply-To: <16F88C8D-70DE-4020-B97D-B3D43AF530AF@gmx.net> References: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net> <16F88C8D-70DE-4020-B97D-B3D43AF530AF@gmx.net> Message-ID: > If you add that space, i.e., change the line: > --CREATE INDEX ontrel_subjectid ON term_relationship(subject_term_id); > to > -- CREATE INDEX ontrel_subjectid ON term_relationship(subject_term_id); .... > and do the same for all other lines where this occurs, does it work > then? Yes, I did it and now it works. So I will keep on testing the documentation. 
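
For anyone who has to load an older copy of the schema through a client that does not skip "--" lines, the manual edit described above can be scripted. This is a rough sketch under a few assumptions: the function name is hypothetical, and it only touches "--" at the very start of a line (after optional indentation), so a "--" that appears later in a line is left alone.

```python
import re

# "--" at the start of a line (after optional indentation) that is
# immediately followed by a non-whitespace character.
_BARE_COMMENT = re.compile(r"^(\s*)--(?=\S)")


def fix_mysql_comments(sql_text):
    """Insert a space after a leading "--" so MySQL treats the line as a comment."""
    fixed = [_BARE_COMMENT.sub(r"\1-- ", line) for line in sql_text.split("\n")]
    return "\n".join(fixed)


sample = (
    "--CREATE INDEX ontrel_subjectid ON term_relationship(subject_term_id);\n"
    "-- this comment already has its space\n"
)
print(fix_mysql_comments(sample), end="")
```

Reading biosqldb-mysql.sql into a string, passing it through the function, and writing the result back out should be enough; the version in svn has since been fixed, so this would only matter for older copies.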
> I've also updated the MySQL schema on svn: > http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/ > trunk/sql/biosqldb-mysql.sql I've just check it out (via web) and it is still the same as before. May be that the web version is delayed? BTW, when I click on "Blame/Annotate" in the web SVN (http://code.open-bio.org/svnweb/index.cgi/biosql/blame/biosql-schema/trunk/sql/biosqldb-mysql.sql), I get this: An error occured Error string not specified yet: Can't find a temporary directory: Error string not specified yet at /usr/lib/perl5/site_perl/5.8.8/SVN/Web/Blame.pm line 146 Maybe an issue with the installation. -- Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 Bioinformatics news: http://www.bioinformatica.info Tutorial libre de Python: http://tinyurl.com/2az5d5 From hlapp at gmx.net Sat Feb 23 03:28:30 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Fri, 22 Feb 2008 22:28:30 -0500 Subject: [BioPython] BioSQL documentation for Biopython In-Reply-To: References: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net> <16F88C8D-70DE-4020-B97D-B3D43AF530AF@gmx.net> Message-ID: <219760B8-1D60-4549-9151-9C1ECB46FE4B@gmx.net> On Feb 22, 2008, at 10:17 PM, Sebastian Bassi wrote: > I've just check it out (via web) and it is still the same as before. > May be that the web version is delayed? Yes, sorry that was my mistake. The URL was to the anonymous access mirror, which gets updated only every hour or so. > BTW, when I click on "Blame/Annotate" in the web SVN > [...] I get this: > > An error occured Yes, I know. I don't know what the issue is but I've reported it earlier to support at open-bio.org. Thanks for your help and for reporting the issue. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From sbassi at gmail.com Sat Feb 23 05:49:34 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Sat, 23 Feb 2008 03:49:34 -0200 Subject: [BioPython] BioSQL documentation for Biopython In-Reply-To: <219760B8-1D60-4549-9151-9C1ECB46FE4B@gmx.net> References: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net> <16F88C8D-70DE-4020-B97D-B3D43AF530AF@gmx.net> <219760B8-1D60-4549-9151-9C1ECB46FE4B@gmx.net> Message-ID: On Sat, Feb 23, 2008 at 1:28 AM, Hilmar Lapp wrote: > Yes, sorry that was my mistake. The URL was to the anonymous access > mirror, which gets updated only every hour or so. Thats OK. Here is my next report: There is a part here: "For this example, we are going to assume we have a GenBank file on our computer called cor6_6.gb that we are going to work with." I think the tutorial should state that the cor6_6.gb is included with Biopython (under Test/Genbank). Also a link to the file won't hurt. 
When I tried to follow the step-by-step guide, I got this error (I am using Biopython 1.44):

>>> from BioSQL import BioSeqDatabase
>>> server=BioSeqDatabase.open_database(driver = "MySQLdb", user = "", passwd="", host = "localhost", db = "bioseqdb")
>>> db = server.new_database("cold")
>>> from Bio import GenBank
>>> parser = GenBank.FeatureParser()
>>> iterator = GenBank.Iterator(open("cor6_6.gb"), parser)
>>> db.load(iterator)
Traceback (most recent call last):
  File "", line 1, in
  File "BioSQL/BioSeqDatabase.py", line 414, in load
    db_loader.load_seqrecord(cur_record)
  File "BioSQL/Loader.py", line 30, in load_seqrecord
    bioentry_id = self._load_bioentry_table(record)
  File "BioSQL/Loader.py", line 250, in _load_bioentry_table
    version))
  File "BioSQL/BioSeqDatabase.py", line 277, in execute
    self.cursor.execute(sql, args or ())
  File "/var/lib/python-support/python2.5/MySQLdb/cursors.py", line 151, in execute
    query = query % db.literal(args)
TypeError: not all arguments converted during string formatting

Should I download the BioSQL from CVS?

-- 
Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6
Bioinformatics news: http://www.bioinformatica.info
Tutorial libre de Python: http://tinyurl.com/2az5d5

From hlapp at gmx.net Sat Feb 23 17:58:14 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 23 Feb 2008 12:58:14 -0500
Subject: [BioPython] BioSQL documentation for Biopython
In-Reply-To:
References: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net>
    <16F88C8D-70DE-4020-B97D-B3D43AF530AF@gmx.net>
    <219760B8-1D60-4549-9151-9C1ECB46FE4B@gmx.net>
Message-ID: <87D56EF5-9BC8-4401-8A09-2BB3104BE1CE@gmx.net>

On Feb 23, 2008, at 12:49 AM, Sebastian Bassi wrote:

> File "/var/lib/python-support/python2.5/MySQLdb/cursors.py", line
> 151, in execute
> query = query % db.literal(args)
> TypeError: not all arguments converted during string formatting
>
> Should I download the BioSQL from CVS?

You mean from SVN, probably?
I don't know but it seems to me that problem is in some (Bio)Python code? I.e., (re-)downloading BioSQL from anonymous SVN would only update the schema, and there was no update, so I can't imagine how that would help. Or did you mean (re-)downloading the BioSQL bindings from Biopython? That would be a question for the Biopython folks (I actually don't use Biopython). -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From srikrishnamohan at gmail.com Sat Feb 23 18:07:45 2008 From: srikrishnamohan at gmail.com (km) Date: Sat, 23 Feb 2008 23:37:45 +0530 Subject: [BioPython] Bio.SCOP problem Message-ID: Hi all, I have a problem with Bio.SCOP module in BioPython There is absolutely no documentation for Bio.SCOP module and looking at the source code, I found a way to load scop parseable files (from astral db) to get the domain information represented as attributes of scop object (Bio.SCOP.Scop (...)) Now the problem is that each domain shows the parent to be of None type object !!! How do I traverse thru the hierarchy ? ie ., given a domain, how do i Know which fold it belongs to and corresponding family and class ?? any hints ? Am i missing something ? regards, KM From sbassi at gmail.com Sat Feb 23 19:07:47 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Sat, 23 Feb 2008 17:07:47 -0200 Subject: [BioPython] BioSQL documentation for Biopython In-Reply-To: <87D56EF5-9BC8-4401-8A09-2BB3104BE1CE@gmx.net> References: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net> <16F88C8D-70DE-4020-B97D-B3D43AF530AF@gmx.net> <219760B8-1D60-4549-9151-9C1ECB46FE4B@gmx.net> <87D56EF5-9BC8-4401-8A09-2BB3104BE1CE@gmx.net> Message-ID: On Sat, Feb 23, 2008 at 3:58 PM, Hilmar Lapp wrote: > You mean from SVN, probably? I don't know but it seems to me that > problem is in some (Bio)Python code? 
Yes, the problem was that I was using 1.44 biopython without the new BioSQL code from Peter. Biopython repository is still in CVS, not SVN (at least biopython is not listed here: http://code.open-bio.org/svnweb/index.cgi/) Now with the new code, I could reproduce the tutorial, up to here: >>> from BioSQL import BioSeqDatabase >>> server=BioSeqDatabase.open_database(driver = "MySQLdb", user = "X",passwd="X", host = "localhost", db = "bioseqdb") >>> db = server.new_database("cold") >>> from Bio import GenBank >>> parser = GenBank.FeatureParser() >>> iterator = GenBank.Iterator(open("cor6_6.gb"), parser) >>> db.load(iterator) 6 But when I look into the mysql, there is no new record!. The "6" is supposed to be the number of records loaded into the database. But my database is empty (it has the schema, but w/o data). > That would be a question for the Biopython folks (I actually don't > use Biopython). I am copying this into biopython and biopython-dev mailing list. -- Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 Bioinformatics news: http://www.bioinformatica.info Tutorial libre de Python: http://tinyurl.com/2az5d5 From hlapp at gmx.net Sat Feb 23 19:20:35 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 23 Feb 2008 14:20:35 -0500 Subject: [BioPython] BioSQL documentation for Biopython In-Reply-To: References: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net> <16F88C8D-70DE-4020-B97D-B3D43AF530AF@gmx.net> <219760B8-1D60-4549-9151-9C1ECB46FE4B@gmx.net> <87D56EF5-9BC8-4401-8A09-2BB3104BE1CE@gmx.net> Message-ID: On Feb 23, 2008, at 2:07 PM, Sebastian Bassi wrote: > Now with the new code, I could reproduce the tutorial, up to here: > >>>> from BioSQL import BioSeqDatabase >>>> server=BioSeqDatabase.open_database(driver = "MySQLdb", user = > "X",passwd="X", host = "localhost", db = "bioseqdb") >>>> db = server.new_database("cold") >>>> from Bio import GenBank >>>> parser = GenBank.FeatureParser() >>>> iterator = GenBank.Iterator(open("cor6_6.gb"), 
parser) >>>> db.load(iterator) > 6 > > But when I look into the mysql, there is no new record!. The "6" is > supposed to be the number of records loaded into the database. But my > database is empty (it has the schema, but w/o data). I.e., there is no error from the db.load() command, just no data? Does the Biopython binding enable or disable auto-commit? If the latter (which would be the Right Thing(tm) to do), you will have to commit the transaction. (Obviously I don't know what the API method would be for this, but db.commit() might be a good start.) BioSQL uses InnoDB on MySQL, and hence will be transactional unless you make the language's db driver to auto-commit. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From sbassi at gmail.com Sat Feb 23 19:50:50 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Sat, 23 Feb 2008 17:50:50 -0200 Subject: [BioPython] BioSQL documentation for Biopython In-Reply-To: References: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net> <16F88C8D-70DE-4020-B97D-B3D43AF530AF@gmx.net> <219760B8-1D60-4549-9151-9C1ECB46FE4B@gmx.net> <87D56EF5-9BC8-4401-8A09-2BB3104BE1CE@gmx.net> Message-ID: On Sat, Feb 23, 2008 at 5:20 PM, Hilmar Lapp wrote: > I.e., there is no error from the db.load() command, just no data? Yes, there was no error, the only response was "6". > Does the Biopython binding enable or disable auto-commit? If the > latter (which would be the Right Thing(tm) to do), you will have to Yes, when working with MySQLdb, it does not auto-commit. You have to do DB_HANDLE.commit(). 
There is no commit method in db: >>> dir(db) ['__doc__', '__getitem__', '__init__', '__module__', '__repr__', 'adaptor', 'dbid', 'get_PrimarySeq_stream', 'get_Seq_by_acc', 'get_Seq_by_id', 'get_Seq_by_primary_id', 'get_Seq_by_ver', 'get_Seqs_by_acc', 'get_all_primary_ids', 'items', 'keys', 'load', 'lookup', 'name', 'values'] > BioSQL uses InnoDB on MySQL, and hence will be transactional unless > you make the language's db driver to auto-commit. I am looking at the DatabaseLoader class (in loader.py) but I don't see any commit statement, anyway, I don't understand this class, so I may be missing something. -- Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 Bioinformatics news: http://www.bioinformatica.info Tutorial libre de Python: http://tinyurl.com/2az5d5 From sbassi at gmail.com Sat Feb 23 20:58:03 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Sat, 23 Feb 2008 18:58:03 -0200 Subject: [BioPython] BioSQL documentation for Biopython In-Reply-To: References: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net> <16F88C8D-70DE-4020-B97D-B3D43AF530AF@gmx.net> <219760B8-1D60-4549-9151-9C1ECB46FE4B@gmx.net> <87D56EF5-9BC8-4401-8A09-2BB3104BE1CE@gmx.net> Message-ID: On Sat, Feb 23, 2008 at 5:50 PM, Sebastian Bassi wrote: > > BioSQL uses InnoDB on MySQL, and hence will be transactional unless > > you make the language's db driver to auto-commit. > I am looking at the DatabaseLoader class (in loader.py) but I don't > see any commit statement, anyway, I don't understand this class, so I > may be missing something. I've just found the answer. Here is what was missing: server.adaptor.commit() I found it here: http://www.biopython.org/wiki/BioSQL So the document IMHO should be changed, for example: ">>> db.load(iterator) 6 And the GenBank file is loaded into the database. Notice that the load function returns the number of records loaded (6 in this case). 
This is useful for sanity checking to make sure that you didn't try to load a massive file and end up with a result like 3." To: ">>> db.load(iterator) 6 >>> server.adaptor.commit() And the GenBank file is loaded into the database. Notice that the load function returns the number of records loaded (6 in this case). This is useful for sanity checking to make sure that you didn't try to load a massive file and end up with a result like 3." A link to http://www.biopython.org/wiki/BioSQL could be added. -- Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 Bioinformatics news: http://www.bioinformatica.info Tutorial libre de Python: http://tinyurl.com/2az5d5 From sbassi at gmail.com Sat Feb 23 21:31:27 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Sat, 23 Feb 2008 19:31:27 -0200 Subject: [BioPython] BioSQL documentation for Biopython In-Reply-To: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net> References: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net> Message-ID: On Fri, Feb 22, 2008 at 1:21 AM, Hilmar Lapp wrote: > Could some Biopythonist check this material whether or not it still > has any relevance, and whether there are any errors? Everything went OK. I could follow the whole document. The only minor difference I found was: >>> print feature.location (0..880) It is in fact: >>> print feature.location [0:880] > I'll be releasing within the next couple of days, so if this is > outdated I'd like to remove it from (at least) the release branch. I think there is no need to remove it, just add the ">>> server.adaptor.commit()" and a link to the wiki (http://www.biopython.org/wiki/BioSQL) Best, SB. 
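To make the effect of the missing commit concrete, here is a minimal sketch. It assumes Python 3 and uses the standard-library sqlite3 module in place of MySQLdb and BioSQL, so the bioentry table, the accession strings and the function names are hypothetical stand-ins. Only the general pattern (insert, then commit on the connection) is meant to carry over to server.adaptor.commit().

```python
import os
import sqlite3
import tempfile


def count_rows(path):
    """Count rows through a fresh connection, i.e. as another client sees them."""
    conn = sqlite3.connect(path)
    try:
        return conn.execute("SELECT COUNT(*) FROM bioentry").fetchone()[0]
    finally:
        conn.close()


def load_then_commit(accessions):
    """Return the row counts a second connection sees before and after commit."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    writer = sqlite3.connect(path)
    try:
        # Hypothetical one-column table standing in for the BioSQL schema.
        writer.execute("CREATE TABLE bioentry (accession TEXT)")
        writer.commit()
        # The driver opens a transaction implicitly before the INSERTs.
        writer.executemany(
            "INSERT INTO bioentry (accession) VALUES (?)",
            [(acc,) for acc in accessions],
        )
        before = count_rows(path)  # the uncommitted rows are not visible here
        writer.commit()
        after = count_rows(path)   # after the commit they are
        return before, after
    finally:
        writer.close()
        os.remove(path)
```

The load call itself reports success in both cases, which matches the symptom described above: a record count comes back, but a separate client sees an empty table until the transaction is committed.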
-- Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 Bioinformatics news: http://www.bioinformatica.info Tutorial libre de Python: http://tinyurl.com/2az5d5 From anaryin at gmail.com Sat Feb 23 22:00:02 2008 From: anaryin at gmail.com (=?ISO-8859-1?Q?Jo=E3o_Rodrigues?=) Date: Sat, 23 Feb 2008 22:00:02 +0000 Subject: [BioPython] Uniprot Parser Message-ID: Hello all! I've written a small parser for the uniprot_sprot.dat files that come out once and again because I read about some incompatibilities of the Biopython's with the source files. Now I want to rewrite and clean the code and I'm considering (strongly) to rewrite my parser. It's a mess of a code (though it works) and I'd rather use something more... readable! So, I'm asking, basically, is Biopython's parser good already or are there still some incompatibilities? Thanks a lot! Jo?o Rodrigues From ruchira.datta at gmail.com Sat Feb 23 22:44:43 2008 From: ruchira.datta at gmail.com (Ruchira Datta) Date: Sat, 23 Feb 2008 14:44:43 -0800 Subject: [BioPython] Uniprot Parser In-Reply-To: References: Message-ID: I've been using Bio.SwissProt.SProt to parse this file. The only glitch that came up so far is that when some fields span multiple lines (e.g., OS, the species field), SProt puts a newline in the field. This is not correct--it should be just a blank space. However, this can easily be corrected within SProt itself without requiring a forked parser. At least two other parsers for this file have been written by people in my group, but I have pushed and implemented standardization on the BioPython one. Part of the point of BioPython is to have one central repository for development and maintenance of things like this, so that hundreds of people don't have to spend their time reinventing the wheel. It is much preferable that people contribute changes rather than creating a forked version. --Ruchira On Sat, Feb 23, 2008 at 2:00 PM, Jo?o Rodrigues wrote: > Hello all! 
>
> I've written a small parser for the uniprot_sprot.dat files that come out
> once and again because I read about some incompatibilities of the
> Biopython's with the source files. Now I want to rewrite and clean the code
> and I'm considering (strongly) to rewrite my parser. It's a mess of a code
> (though it works) and I'd rather use something more... readable! So, I'm
> asking, basically, is Biopython's parser good already or are there still
> some incompatibilities?
>
> Thanks a lot!
>
> João Rodrigues
>
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From hlapp at gmx.net Sun Feb 24 04:30:56 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 23 Feb 2008 23:30:56 -0500
Subject: [BioPython] BioSQL documentation for Biopython
In-Reply-To:
References: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net>
Message-ID:

Thanks a lot for your help Sebastian - I've updated the documentation now, also changing the CVS links to SVN, and adding in some wiki links. It took me a while to figure out the hevea tool that they had used originally to convert to text and HTML, but it's re-converted now. I still couldn't get the title and authors to be printed, so I just copied those parts from the old txt and HTML versions.

-hilmar

On Feb 23, 2008, at 4:31 PM, Sebastian Bassi wrote:

> On Fri, Feb 22, 2008 at 1:21 AM, Hilmar Lapp wrote:
>> Could some Biopythonist check this material whether or not it still
>> has any relevance, and whether there are any errors?
>
> Everything went OK. I could follow the whole document. The only minor
> difference I found was:
>
>>>> print feature.location
> (0..880)
>
> It is in fact:
>
>>>> print feature.location
> [0:880]
>
>> I'll be releasing within the next couple of days, so if this is
>> outdated I'd like to remove it from (at least) the release branch.
>
> I think there is no need to remove it, just add the ">>>
> server.adaptor.commit()" and a link to the wiki
> (http://www.biopython.org/wiki/BioSQL)
>
> Best,
> SB.
>
>
> --
> Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6
> Bioinformatics news: http://www.bioinformatica.info
> Tutorial libre de Python: http://tinyurl.com/2az5d5

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From biopython at maubp.freeserve.co.uk Sun Feb 24 10:42:55 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 24 Feb 2008 10:42:55 +0000
Subject: [BioPython] BioSQL documentation for Biopython
In-Reply-To:
References: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net>
Message-ID: <320fb6e00802240242t1622e753r6c3f383db3f46d6c@mail.gmail.com>

I've been away a few days, but it looks like you and Sebastian have worked out where things had been going wrong. Good job :)

> > Everything went OK. I could follow the whole document. The only minor
> > difference I found was:
> >
> >>>> print feature.location
> > (0..880)
> >
> > It is in fact:
> >
> >>>> print feature.location
> > [0:880]

That was a change made a year and a half ago - it is just cosmetic. http://bugzilla.open-bio.org/show_bug.cgi?id=1902

> >> I'll be releasing within the next couple of days, so if this is
> >> outdated I'd like to remove it from (at least) the release branch.

As you'll have noticed, Biopython 1.44 has a few problems with BioSQL, and the next release currently in CVS will be a lot better. It might be worth adding a warning to the BioSQL release for any Biopython users to wait for Biopython 1.45.

The only other thing I noticed was this example code:

>>> from Bio import GenBank
>>> parser = GenBank.FeatureParser()
>>> iterator = GenBank.Iterator(open("cor6_6.gb"), parser)

I would write this using Bio.SeqIO, as we are promoting this as a uniform sequence input/output library in Biopython (as in the wiki page Sebastian mentioned), i.e.

>>> from Bio import SeqIO
>>> iterator = SeqIO.parse(open("cor6_6.gb"), "genbank")

(However, I have not yet sat down and gone through the whole document.)

Peter

From biopython at maubp.freeserve.co.uk Sun Feb 24 10:51:22 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 24 Feb 2008 10:51:22 +0000
Subject: [BioPython] write a genbank file
In-Reply-To: <004501c87499$2b461290$1022a8c0@ipkgatersleben.de>
References: <004501c87499$2b461290$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00802240251m657fe4b6i44f3b128c213268d@mail.gmail.com>

On Thu, Feb 21, 2008 at 2:50 PM, Stefanie L?ck wrote:
> Hi!
>
> Does someone can give me an example how I can write in python a new
> genbank file? I want to make a blast and to use the location of the match
> as a feature in a genbank file (and finally to work on it in DNA Star).
>
> Is it at all possible?

Writing GenBank files isn't easy at the moment. Depending on your needs, creating Bio.GenBank.Record.Record objects and writing them to a file may work. I hope to include support for writing GenBank files from SeqRecord objects in Bio.SeqIO later... but that would still be complicated in your case, as you would have to create SeqFeatures from each BLAST match.
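As a rough illustration of the last step, the sketch below turns one BLAST match position into GenBank feature-table lines using plain string formatting. It assumes Python 3 and does not use Biopython at all. The function name, the choice of misc_feature as the key and the note text are hypothetical. It produces only the feature lines (key starting in column 6, location in column 22), not a complete GenBank record.

```python
def blast_hit_to_feature_lines(start, end, strand=1, note="BLAST hit"):
    """Format one match as GenBank feature-table lines.

    start and end are 0-based with an exclusive end (Python slice style).
    GenBank locations are 1-based and inclusive, so only start is shifted.
    """
    location = "%d..%d" % (start + 1, end)
    if strand == -1:
        # Matches on the reverse strand are written as complement(...).
        location = "complement(%s)" % location
    return [
        # Five spaces, then the feature key padded to 16 characters.
        "     %-16s%s" % ("misc_feature", location),
        # Qualifiers are indented to line up with the location column.
        "%s/note=\"%s\"" % (" " * 21, note),
    ]
```

For example, "\n".join(blast_hit_to_feature_lines(99, 250, strand=-1)) gives two lines that could be pasted into the FEATURES block of an existing record.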
Peter From biopython at maubp.freeserve.co.uk Sun Feb 24 12:58:13 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 24 Feb 2008 12:58:13 +0000 Subject: [BioPython] Bio.SCOP problem In-Reply-To: References: Message-ID: <320fb6e00802240458r66b44e54qc22cf8a2c62eba2e@mail.gmail.com> On Sat, Feb 23, 2008 at 6:07 PM, km wrote: > Hi all, > I have a problem with Bio.SCOP module in BioPython > There is absolutely no documentation for Bio.SCOP module and looking at the > source code, I found a way to load scop parseable files (from astral db) to ... Have you looked at the SCOP unit tests? Those could be quite helpful. Peter From biopython at maubp.freeserve.co.uk Sun Feb 24 13:06:20 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 24 Feb 2008 13:06:20 +0000 Subject: [BioPython] Uniprot Parser In-Reply-To: References: Message-ID: <320fb6e00802240506u763a9f70rf029a4f5fcd487d5@mail.gmail.com> On Sat, Feb 23, 2008 at 10:44 PM, Ruchira Datta wrote: > I've been using Bio.SwissProt.SProt to parse this file. The only glitch > that came up so far is that when some fields span multiple lines (e.g., OS, > the species field), SProt puts a newline in the field. This is not > correct--it should be just a blank space. However, this can easily be > corrected within SProt itself without requiring a forked parser. I'm guessing you are using the parser to return Record objects, which are a fairly simple direct mapping of the raw file format - and I can understand why the newlines were included. If you use the parser to get SeqRecord objects (which are generic and not tied to the SwissProt/UniProt format), then the newlines are removed. > At least two other parsers for this file have been written by people in my > group, but I have pushed and implemented standardization on the BioPython > one. 
Part of the point of BioPython is to have one central repository for > development and maintenance of things like this, so that hundreds of people > don't have to spend their time reinventing the wheel. It is much preferable > that people contribute changes rather than creating a forked version. > > --Ruchira From hlapp at gmx.net Sun Feb 24 16:02:33 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 24 Feb 2008 11:02:33 -0500 Subject: [BioPython] BioSQL documentation for Biopython In-Reply-To: <320fb6e00802240242t1622e753r6c3f383db3f46d6c@mail.gmail.com> References: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net> <320fb6e00802240242t1622e753r6c3f383db3f46d6c@mail.gmail.com> Message-ID: <4E967ED0-13DF-45AF-A305-A32AD9B4B303@gmx.net> On Feb 24, 2008, at 5:42 AM, Peter wrote: > The only other thing I noticed was this example code: > >>>> from Bio import GenBank >>>> parser = GenBank.FeatureParser() >>>> iterator = GenBank.Iterator(open("cor6_6.gb"), parser) > > I would write this using Bio.SeqIO as we a promoting this as a uniform > sequence input/output library in Biopython (as in the wiki page > Sebastian mentioned). i.e. > >>>> from Bio import SeqIO >>>> iterator = SeqIO.parse(open("cor6_6.gb"), "genbank") > > (However I have not yet sat down and gone through the whole document) If you assure me that that would work (with a current release of Biopython), I'll change it accordingly. BTW also in regard to a previous comment from Sebastian, the file cor6_6.gb is in fact in that same directory in biosql. As another aside, if either of you would like write permission to biosql so you can maintain that document yourself that would be no problem. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From ruchira.datta at gmail.com Sun Feb 24 16:28:33 2008 From: ruchira.datta at gmail.com (Ruchira Datta) Date: Sun, 24 Feb 2008 08:28:33 -0800 Subject: [BioPython] Uniprot Parser In-Reply-To: <320fb6e00802240506u763a9f70rf029a4f5fcd487d5@mail.gmail.com> References: <320fb6e00802240506u763a9f70rf029a4f5fcd487d5@mail.gmail.com> Message-ID: On Sun, Feb 24, 2008 at 5:06 AM, Peter wrote: > On Sat, Feb 23, 2008 at 10:44 PM, Ruchira Datta > wrote: > > I've been using Bio.SwissProt.SProt to parse this file. The only glitch > > that came up so far is that when some fields span multiple lines (e.g., > OS, > > the species field), SProt puts a newline in the field. This is not > > correct--it should be just a blank space. However, this can easily be > > corrected within SProt itself without requiring a forked parser. > > I'm guessing you are using the parser to return Record objects, which > are a fairly simple direct mapping of the raw file format - and I can > understand why the newlines were included. If you use the parser to > get SeqRecord objects (which are generic and not tied to the > SwissProt/UniProt format), then the newlines are removed. > Hi Peter, I had tried SeqRecord first, but it didn't include the references, which I absolutely need. While inclusion of newlines may be understandable, it's a bug. The newline is stripped from several other fields by _RecordConsumer, e.g., def reference_number(self, line): rn = line[5:].rstrip() ... and it needs to be stripped from this one, instead of def organism_species(self, line): self.data.organism += line[5:] The newlines are never significant in any field. In a couple of weeks I might be able to check out the cvs version and provide a patch. 
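The newline point can be shown without touching the parser itself. The following is only a sketch in Python 3, not the code in Bio.SwissProt.SProt: the function name and the sample lines are made up for illustration, and it assumes the usual flat-file layout of a two-letter line code followed by data starting in column 6.

```python
def join_field(lines, code):
    """Join the data part of every line that starts with the given line code.

    Each continuation line is stripped of its trailing newline and the
    pieces are joined with a single space, so no newline ends up in the
    resulting field.
    """
    parts = [line[5:].strip() for line in lines if line[:2] == code]
    return " ".join(parts)


# Illustrative input only, not copied from a real entry.
example = [
    "ID   TEST_ENTRY\n",
    "OS   Arabidopsis thaliana\n",
    "OS   (Mouse-ear cress).\n",
]
species = join_field(example, "OS")
```

Appending line[5:] directly, as organism_species does, would instead keep the "\n" between the two pieces.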
--Ruchira > > > At least two other parsers for this file have been written by people in > my > > group, but I have pushed and implemented standardization on the > BioPython > > one. Part of the point of BioPython is to have one central repository > for > > development and maintenance of things like this, so that hundreds of > people > > don't have to spend their time reinventing the wheel. It is much > preferable > > that people contribute changes rather than creating a forked version. > > > > --Ruchira > From biopython at maubp.freeserve.co.uk Sun Feb 24 16:47:01 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 24 Feb 2008 16:47:01 +0000 Subject: [BioPython] Uniprot Parser In-Reply-To: References: <320fb6e00802240506u763a9f70rf029a4f5fcd487d5@mail.gmail.com> Message-ID: <320fb6e00802240847k6529d1b6h53b0f84dd4c4e37a@mail.gmail.com> On Sun, Feb 24, 2008 at 4:28 PM, Ruchira Datta wrote: > > Hi Peter, > > I had tried SeqRecord first, but it didn't include the references, which I > absolutely need. The good news is I think the references are included now (in Biopython CVS), see enhancement Bug 2235: http://bugzilla.open-bio.org/show_bug.cgi?id=2235 > While inclusion of newlines may be understandable, it's a bug. The newline > is stripped from several other fields by _RecordConsumer, e.g., > ... Off the top of my head, I would say that example is a little different - reference number lines do not span multiple lines. > The newlines are never significant in any field. You are probably right - although perhaps they could be important in long text fields where a line break has been inserted mid word and a hyphenation added. The newlines are also important if using the Record object to recreate the raw file (e.g. to save to disk). However I doubt anyone is doing this. Having a __str__ method defined like there is in the Bio.GenBank.Record.Record object which would make this easier. 
> In a couple of weeks I might be able to check out the cvs > version and provide a patch. Please do. Peter From ruchira.datta at gmail.com Sun Feb 24 17:36:56 2008 From: ruchira.datta at gmail.com (Ruchira Datta) Date: Sun, 24 Feb 2008 09:36:56 -0800 Subject: [BioPython] Uniprot Parser In-Reply-To: <320fb6e00802240847k6529d1b6h53b0f84dd4c4e37a@mail.gmail.com> References: <320fb6e00802240506u763a9f70rf029a4f5fcd487d5@mail.gmail.com> <320fb6e00802240847k6529d1b6h53b0f84dd4c4e37a@mail.gmail.com> Message-ID: I just found another bug, which would be a bit trickier to fix properly. This code: def database_cross_reference(self, line): # From CLD1_HUMAN, Release 39: # DR EMBL; [snip]; -. [EMBL / GenBank / DDBJ] [CoDingSequence] # DR PRODOM [Domain structure / List of seq. sharing at least 1 domai # DR SWISS-2DPAGE; GET REGION ON 2D PAGE. line = line[5:] # Remove the comments at the end of the line i = line.find('[') if i >= 0: line = line[:i] cols = line.rstrip(_CHOMP).split(';') cols = [col.lstrip() for col in cols] self.data.cross_references.append(tuple(cols)) applied to this line of the TrEMBL record for A2RB21_ASPNG: DR GO; GO:0016277; F:[myelin basic protein]-arginine N-methyltra...; IEA:EC. got me this tuple: ('GO', 'GO:0016277', 'F:') The bracketed term was interpreted as a comment and the whole line was stripped. Thanks, --Ruchira On Sun, Feb 24, 2008 at 8:47 AM, Peter wrote: > On Sun, Feb 24, 2008 at 4:28 PM, Ruchira Datta > wrote: > > > > Hi Peter, > > > > I had tried SeqRecord first, but it didn't include the references, > which I > > absolutely need. > > The good news is I think the references are included now (in Biopython > CVS), see enhancement Bug 2235: > http://bugzilla.open-bio.org/show_bug.cgi?id=2235 > > > While inclusion of newlines may be understandable, it's a bug. The > newline > > is stripped from several other fields by _RecordConsumer, e.g., > > ... 
> > Off the top of my head, I would say that example is a little different > - reference number lines do not span multiple lines. > > > The newlines are never significant in any field. > > You are probably right - although perhaps they could be important in > long text fields where a line break has been inserted mid word and a > hyphenation added. > > The newlines are also important if using the Record object to recreate > the raw file (e.g. to save to disk). However I doubt anyone is doing > this. Having a __str__ method defined like there is in the > Bio.GenBank.Record.Record object which would make this easier. > > > In a couple of weeks I might be able to check out the cvs > > version and provide a patch. > > Please do. > > Peter > From biopython at maubp.freeserve.co.uk Sun Feb 24 17:48:29 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 24 Feb 2008 17:48:29 +0000 Subject: [BioPython] Uniprot Parser In-Reply-To: References: <320fb6e00802240506u763a9f70rf029a4f5fcd487d5@mail.gmail.com> <320fb6e00802240847k6529d1b6h53b0f84dd4c4e37a@mail.gmail.com> Message-ID: <320fb6e00802240948s6f0aff89r29ef9cff286dd895@mail.gmail.com> On Sun, Feb 24, 2008 at 5:36 PM, Ruchira Datta wrote: > I just found another bug, which would be a bit trickier to fix properly. > > This code: > > def database_cross_reference(self, line): > # From CLD1_HUMAN, Release 39: > # DR EMBL; [snip]; -. [EMBL / GenBank / DDBJ] [CoDingSequence] > # DR PRODOM [Domain structure / List of seq. sharing at least 1 > domai > # DR SWISS-2DPAGE; GET REGION ON 2D PAGE. > line = line[5:] > # Remove the comments at the end of the line > i = line.find('[') > if i >= 0: > line = line[:i] > cols = line.rstrip(_CHOMP).split(';') > cols = [col.lstrip() for col in cols] > self.data.cross_references.append(tuple(cols)) > > applied to this line of the TrEMBL record for A2RB21_ASPNG: > > DR GO; GO:0016277; F:[myelin basic protein]-arginine N-methyltra...; > IEA:EC. 
> > got me this tuple: > > ('GO', 'GO:0016277', 'F:') > > The bracketed term was interpreted as a comment and the whole line was > stripped. That does look tricky... especially if we want to preserve backwards compatibility. This "F" cross reference looks like the partial text for the GO term. I wonder how common this is? (square brackets in the cross references themselves). I can't see the use of "F" mentioned here: http://www.expasy.org/sprot/userman.html#DR_line Could you file a bug and add a few more other examples if you find them. Thanks Peter From ruchira.datta at gmail.com Sun Feb 24 17:53:10 2008 From: ruchira.datta at gmail.com (Ruchira Datta) Date: Sun, 24 Feb 2008 09:53:10 -0800 Subject: [BioPython] Uniprot Parser In-Reply-To: <320fb6e00802240948s6f0aff89r29ef9cff286dd895@mail.gmail.com> References: <320fb6e00802240506u763a9f70rf029a4f5fcd487d5@mail.gmail.com> <320fb6e00802240847k6529d1b6h53b0f84dd4c4e37a@mail.gmail.com> <320fb6e00802240948s6f0aff89r29ef9cff286dd895@mail.gmail.com> Message-ID: On Sun, Feb 24, 2008 at 9:48 AM, Peter wrote: > On Sun, Feb 24, 2008 at 5:36 PM, Ruchira Datta > wrote: > > I just found another bug, which would be a bit trickier to fix properly. > > > > This code: > > > > def database_cross_reference(self, line): > > # From CLD1_HUMAN, Release 39: > > # DR EMBL; [snip]; -. [EMBL / GenBank / DDBJ] [CoDingSequence] > > # DR PRODOM [Domain structure / List of seq. sharing at least > 1 > > domai > > # DR SWISS-2DPAGE; GET REGION ON 2D PAGE. > > line = line[5:] > > # Remove the comments at the end of the line > > i = line.find('[') > > if i >= 0: > > line = line[:i] > > cols = line.rstrip(_CHOMP).split(';') > > cols = [col.lstrip() for col in cols] > > self.data.cross_references.append(tuple(cols)) > > > > applied to this line of the TrEMBL record for A2RB21_ASPNG: > > > > DR GO; GO:0016277; F:[myelin basic protein]-arginine N-methyltra...; > > IEA:EC. 
> >
> > got me this tuple:
> >
> > ('GO', 'GO:0016277', 'F:')
> >
> > The bracketed term was interpreted as a comment and the whole line was
> > stripped.
>
> That does look tricky... especially if we want to preserve backwards
> compatibility. This "F" cross reference looks like the partial text
> for the GO term. I wonder how common this is? (square brackets in the
> cross references themselves). I can't see the use of "F" mentioned
> here: http://www.expasy.org/sprot/userman.html#DR_line
>
> Could you file a bug and add a few more other examples if you find them.
>
> Thanks
>
> Peter
>

Here 'F:' means the annotation refers to the molecular function part of the Gene Ontology (as opposed to, e.g., 'P:' for biological process). I think this is quite rare, but I'll see if any other examples come up.

--Ruchira

From sbassi at gmail.com Mon Feb 25 00:30:06 2008
From: sbassi at gmail.com (Sebastian Bassi)
Date: Sun, 24 Feb 2008 21:30:06 -0300
Subject: [BioPython] BioSQL documentation for Biopython
In-Reply-To: <4E967ED0-13DF-45AF-A305-A32AD9B4B303@gmx.net>
References: <895066B4-1004-4FF9-A135-7A0FEEEF8DF2@gmx.net> <320fb6e00802240242t1622e753r6c3f383db3f46d6c@mail.gmail.com> <4E967ED0-13DF-45AF-A305-A32AD9B4B303@gmx.net>
Message-ID:

On Sun, Feb 24, 2008 at 1:02 PM, Hilmar Lapp wrote:
> If you assure me that that would work (with a current release of
> Biopython), I'll change it accordingly.

Peter's proposal (using SeqIO.parse) works with Biopython 1.44; I've just tested it.

WARNING: The BioSQL module is not from 1.44, it is from CVS. So this document can be followed using the current CVS version of Biopython, not the "plain" 1.44.
-- Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 Bioinformatics news: http://www.bioinformatica.info Tutorial libre de Python: http://tinyurl.com/2az5d5 From biosql at hotmail.com Mon Feb 25 16:32:13 2008 From: biosql at hotmail.com (Jonathan Boulais) Date: Mon, 25 Feb 2008 11:32:13 -0500 Subject: [BioPython] Uniprot Parser In-Reply-To: References: Message-ID: Hi everyone, I'm a little bit concerned about the speed of the parsing/loading of the Uniprot .dat files into the Biosql database. It takes a hell of a time... Would it be faster to parse the .dat file and write the data into a temporary files and import it in one shot ? Just a suggestion, Jonathan > Date: Sat, 23 Feb 2008 22:00:02 +0000 > From: anaryin at gmail.com > To: biopython at biopython.org > Subject: [BioPython] Uniprot Parser > > Hello all! > > I've written a small parser for the uniprot_sprot.dat files that come out > once and again because I read about some incompatibilities of the > Biopython's with the source files. Now I want to rewrite and clean the code > and I'm considering (strongly) to rewrite my parser. It's a mess of a code > (though it works) and I'd rather use something more... readable! So, I'm > asking, basically, is Biopython's parser good already or are there still > some incompatibilities? > > Thanks a lot! 
> > Jo?o Rodrigues > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython _________________________________________________________________ From biopython at maubp.freeserve.co.uk Mon Feb 25 16:52:31 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 25 Feb 2008 16:52:31 +0000 Subject: [BioPython] Uniprot Parser In-Reply-To: References: Message-ID: <320fb6e00802250852g2db268d9xf9cc1d2d2654071a@mail.gmail.com> On Mon, Feb 25, 2008 at 4:32 PM, Jonathan Boulais wrote: > > Hi everyone, > > I'm a little bit concerned about the speed of the parsing/loading of the Uniprot .dat files > into the Biosql database. It takes a hell of a time... What version of Biopython are you using? One thing you could try is timing a simple script that only reads in the SwissProt file but doesn't do anything with the BioSQL database - to try and get a feel for which bit is slow. If its the parsing that is slow, you could try commenting out the bit which deals with the EBI ** lines (see bug 2353 for details), namely line 359 in CVS, self._skip_starstar(uhandle), and see if that makes a big difference. 
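One way to get that feel is to time each stage separately. The helper below is a small Python 3 sketch using only the standard library; time_call and count_records are hypothetical names. count_records does no real parsing: it just counts the "//" lines that terminate flat-file records, which gives a baseline for how long merely reading the file takes.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def count_records(lines):
    """Count records by their '//' terminator lines, without parsing them."""
    return sum(1 for line in lines if line.rstrip() == "//")
```

Calling time_call(count_records, open("uniprot_sprot.dat")) gives the read-only baseline. Wrapping the parsing loop, and then the database load, in the same helper would show which of the stages dominates.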
Peter From biopython at maubp.freeserve.co.uk Mon Feb 25 17:16:51 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 25 Feb 2008 17:16:51 +0000 Subject: [BioPython] Parser In-Reply-To: <22c5c6390802222259g31093d33m2728f054ed19fc23@mail.gmail.com> References: <22c5c6390802182059w4b11d66bi1a0a76c5d4c7f018@mail.gmail.com> <320fb6e00802190155i77354f6el17a44f86f114ad83@mail.gmail.com> <22c5c6390802191000h71e37ba7ha545e4b259797fb5@mail.gmail.com> <320fb6e00802191609t4689a819m1a50bd09dc152fdb@mail.gmail.com> <22c5c6390802192340n2f4cab8vb767eea6f098cac5@mail.gmail.com> <320fb6e00802200422o597c8f56r8bfec5d5d3c9b035@mail.gmail.com> <22c5c6390802212255v5819ff9fk304e30bbdffdd31d@mail.gmail.com> <22c5c6390802222259g31093d33m2728f054ed19fc23@mail.gmail.com> Message-ID: <320fb6e00802250916k3cdb0847va65fefa26cc5febe@mail.gmail.com> Hi Sebastian, Did you mean to send this email to me only? On Sat, Feb 23, 2008 at 6:59 AM, smriti Sebastian wrote: > hi, > One more help plz. > I need to retrieve the hits which are coming under > "Sequences not found previously or not previously below threshold:" from > PSI-Blast output file.. > or else i need to avoid those id's while parsing the psi-blast output using > PsiBlastParser. > Is there any way to do that? > I tried new_seqs attribute of rounds.But it didn't help me. > I have attached a sample output from psi-blast.Plz help > Thanks in advance. The round object has "alignments" which includes all the hits, and "reused_seqs" which is only those above the "Sequences not found previously or not previously below threshold:" line, while "new_seqs" is only those below the line. Perhaps something like this will be helpful... 
Peter

#!/usr/bin/env python
import Bio.Blast.NCBIStandalone

b_parser = Bio.Blast.NCBIStandalone.PSIBlastParser()
b_record = b_parser.parse(open('trial_psi_blast.txt', 'r'))
for rnd in b_record.rounds:
    old = len(rnd.reused_seqs)
    new = len(rnd.new_seqs)
    assert old + new == len(rnd.alignments)
    print("Round number %i, with %i old and %i new"
          % (rnd.number, old, new))
    for i, aln in enumerate(rnd.alignments):
        # The identifier is the first word (split on white space)
        identifier = aln.title.split()[0]
        # Remove the leading > if present as it isn't used
        # on the reused_seqs results.
        if identifier[0] == ">":
            identifier = identifier[1:]
        if i < old:
            # Hits above the "not found previously" line come first
            reused = rnd.reused_seqs[i]
            assert reused.title.split()[0] == identifier
            print("%i - %s reused, score %i, exp %f"
                  % (i, identifier, reused.score, reused.e))
        else:
            # The remaining hits are the ones below that line
            novel = rnd.new_seqs[i - old]
            assert novel.title.split()[0] == identifier
            print("%i - %s novel, score %i, exp %f"
                  % (i, identifier, novel.score, novel.e))
print("Done")

From biosql at hotmail.com Mon Feb 25 17:48:11 2008
From: biosql at hotmail.com (Jonathan Boulais)
Date: Mon, 25 Feb 2008 12:48:11 -0500
Subject: [BioPython] Uniprot Parser
In-Reply-To: <320fb6e00802250852g2db268d9xf9cc1d2d2654071a@mail.gmail.com>
References: <320fb6e00802250852g2db268d9xf9cc1d2d2654071a@mail.gmail.com>
Message-ID:

I don't think the parser is the problem, Peter; more likely it is the continuous stream of import requests to the database. I've already written a parsing script that parses the entire .dat files (TrEMBL and Swiss-Prot) into text files. After the parsing is done, it imports the files into a database schema that I've built on my own, not the BioSQL one. So instead of importing the data after each iteration, it imports the data in one shot once the entire .dat file is parsed. I've compared the execution times and it's much faster this way (about 1 hour instead of 3 days for parsing and importing the TrEMBL .dat file).
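The difference between the two loading styles can be sketched as follows. This is a Python 3 illustration with the standard-library sqlite3 module, not the MySQL setup or the BioSQL loader discussed here; the entry table and all function names are hypothetical. The first function commits after every record, the second sends all rows inside a single transaction.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical two-column table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entry (accession TEXT, description TEXT)")
    return conn


def load_per_record(conn, rows):
    """Insert and commit one record at a time (one transaction per record)."""
    for row in rows:
        conn.execute("INSERT INTO entry VALUES (?, ?)", row)
        conn.commit()


def load_in_one_shot(conn, rows):
    """Insert every record inside a single transaction."""
    with conn:  # commits once at the end, rolls back on an exception
        conn.executemany("INSERT INTO entry VALUES (?, ?)", rows)


def count_entries(conn):
    """Return the number of rows currently in the entry table."""
    return conn.execute("SELECT COUNT(*) FROM entry").fetchone()[0]
```

Both functions leave the same rows in the table. How much time the single transaction saves depends on the database engine, its indexes and the size of the file, so it is worth measuring on the real data.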
Again Peter, I'm definitely not as good as you guys at scripting, so I've used the script lines that you proposed for bug 2390 to compare. But 3 days of parsing and importing is... a little bit too long for me :) Anyway, I hope this helps, Jonathan > Date: Mon, 25 Feb 2008 16:52:31 +0000 > From: biopython at maubp.freeserve.co.uk > To: biosql at hotmail.com > Subject: Re: [BioPython] Uniprot Parser > CC: biopython at lists.open-bio.org > > On Mon, Feb 25, 2008 at 4:32 PM, Jonathan Boulais wrote: > > > > Hi everyone, > > > > I'm a little bit concerned about the speed of the parsing/loading of the Uniprot .dat files > > into the Biosql database. It takes a hell of a time... > > What version of Biopython are you using? > > One thing you could try is timing a simple script that only reads in > the SwissProt file but doesn't do anything with the BioSQL database - > to try and get a feel for which bit is slow. > > If it's the parsing that is slow, you could try commenting out the bit > which deals with the EBI ** lines (see bug 2353 for details), namely > line 359 in CVS, self._skip_starstar(uhandle), and see if that makes a > big difference. > > Peter From mmokrejs at ribosome.natur.cuni.cz Mon Feb 25 18:30:54 2008 From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ) Date: Mon, 25 Feb 2008 19:30:54 +0100 Subject: [BioPython] Uniprot Parser In-Reply-To: References: <320fb6e00802250852g2db268d9xf9cc1d2d2654071a@mail.gmail.com> Message-ID: <47C3095E.9050306@ribosome.natur.cuni.cz> Hi Jonathan, temporarily drop the indexes on all the MySQL tables, and have MySQL rebuild the indexes after importing. Otherwise the index has to be updated after every change to a column. Learn to use 'ALTER TABLE'. ;-) Martin Jonathan Boulais wrote: > I don't think the parser is the problem, Peter; it is surely the continuous import requests to the database.
> I've already written a parsing script that parses the entire .dat files (TrEMBL and Swiss-Prot) into text files. After the parsing is done, it imports the files into a database schema that I've built on my own, not the BioSQL one. So instead of importing the data after each iteration, it just imports the data in one shot when the entire .dat file is parsed. I've compared the execution times and it's much faster this way (about 1 hour instead of 3 days for the TrEMBL .dat file parsing and importing). > > Again Peter, I'm definitely not as good as you guys at scripting, so I've used the script lines that you proposed for bug 2390 to compare. > But 3 days of parsing and importing is... a little bit too long for me :) From biosql at hotmail.com Mon Feb 25 19:10:48 2008 From: biosql at hotmail.com (Jonathan Boulais) Date: Mon, 25 Feb 2008 14:10:48 -0500 Subject: [BioPython] Uniprot Parser In-Reply-To: <47C3095E.9050306@ribosome.natur.cuni.cz> References: <320fb6e00802250852g2db268d9xf9cc1d2d2654071a@mail.gmail.com> <47C3095E.9050306@ribosome.natur.cuni.cz> Message-ID: Many thanks, Martin! Indeed, DISABLE KEYS sounds like the logical answer to my problem. Jonathan > Date: Mon, 25 Feb 2008 19:30:54 +0100 > From: mmokrejs at ribosome.natur.cuni.cz > To: biosql at hotmail.com > CC: biopython at lists.open-bio.org > Subject: Re: [BioPython] Uniprot Parser > > Hi Jonathan, > temporarily drop the indexes on all the MySQL tables, and have MySQL rebuild > the indexes after importing. Otherwise the index has to be updated after every > change to a column. Learn to use 'ALTER TABLE'. ;-) > Martin > > Jonathan Boulais wrote: > > I don't think the parser is the problem, Peter; it is surely the continuous import requests to the database. > > I've already written a parsing script that parses the entire .dat files (TrEMBL and Swiss-Prot) into text files. After the parsing is done, it imports the files into a database schema that I've built on my own, not the BioSQL one.
So instead of importing the data after each iteration, it just imports the data in one shot when the entire .dat file is parsed. I've compared the execution times and it's much faster this way (about 1 hour instead of 3 days for the TrEMBL .dat file parsing and importing). > > > > Again Peter, I'm definitely not as good as you guys at scripting, so I've used the script lines that you proposed for bug 2390 to compare. > > But 3 days of parsing and importing is... a little bit too long for me :) From bsantos at biocant.pt Wed Feb 27 15:41:07 2008 From: bsantos at biocant.pt (Bruno Santos) Date: Wed, 27 Feb 2008 15:41:07 -0000 Subject: [BioPython] How to detect sequences that not produce alignments Message-ID: <000301c87957$2d24c800$876e5800$@pt> Hi people, I have been using Bio.Blast for a while to perform BLAST searches in my scripts. Now I'm trying to detect which sequences in a multi-FASTA file align against a database and which ones don't align at all. From some experiments I noticed that even if a sequence has no alignments at all, I couldn't detect it, because it is not incorporated in the blast_records instance as an empty list. Is there any way to detect which blast_records are empty? Or does the module simply ignore these cases and not put them in blast_records?
Thank you all in advance, Best Regards, Bruno Santos From sbassi at gmail.com Wed Feb 27 15:54:06 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Wed, 27 Feb 2008 12:54:06 -0300 Subject: [BioPython] How to detect sequences that not produce alignments In-Reply-To: <000301c87957$2d24c800$876e5800$@pt> References: <000301c87957$2d24c800$876e5800$@pt> Message-ID: On Wed, Feb 27, 2008 at 12:41 PM, Bruno Santos wrote: > From some experiments I noticed that even if a sequence has no alignments > at all, I couldn't detect it, because it is not incorporated in the > blast_records instance as an empty list. Is there any way to detect which > blast_records are empty? Or does the module simply ignore these cases and > not put them in blast_records? I think the problem is that the XML file has no record of a "no hit" sequence. So the Biopython parser can't process that record (since it is not even in the XML file). I guess that the only way to know the "negative hits" is to compare the input file with the XML output and take the difference. I remember having done that once (I should have the script somewhere if you want it). -- Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 Bioinformatics news: http://www.bioinformatica.info Tutorial libre de Python: http://tinyurl.com/2az5d5 From sbassi at gmail.com Thu Feb 28 17:18:43 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Thu, 28 Feb 2008 14:18:43 -0300 Subject: [BioPython] How to detect sequences that not produce alignments In-Reply-To: <000301c87957$2d24c800$876e5800$@pt> References: <000301c87957$2d24c800$876e5800$@pt> Message-ID: On Wed, Feb 27, 2008 at 12:41 PM, Bruno Santos wrote: > way to detect which blast_records are empty? Or does the module simply ignore > these cases and not put them in blast_records? Here is my code (I put a copy at http://pastebin.com/f74133375 in case the formatting gets lost in the mail).
from Bio import SeqIO
from Bio.Blast import NCBIXML

def blastcomp(fastafile, blastfile):
    handle = open(fastafile)
    fastanames = set()
    #Reads the fasta names
    for record in SeqIO.parse(handle, "fasta"):
        fastanames.add(record.name)
    handle.close()
    blastnames = set()
    #Reads the blast names
    b_records = NCBIXML.parse(open(blastfile))
    for b_record in b_records:
        blastnames.add(b_record.query)
    return fastanames.difference(blastnames)

blastfile = "/home/sbassi/bioinfo/INTA/filtracMT.xml"
fastafile = 'INTA/allfiltrados.txt'
print blastcomp(fastafile, blastfile)

--
Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6
Bioinformatics news: http://www.bioinformatica.info
Tutorial libre de Python: http://tinyurl.com/2az5d5
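
One possible caveat with the comparison above: record.name from SeqIO holds only the first word of each FASTA header, while the query string reported by BLAST may also carry the rest of the description line, in which case the two sets would not match directly. Below is a minimal sketch of the same set difference with both sides reduced to their first word. It uses only the standard library; the helper names (first_word, fasta_ids, queries_without_hits) and the sample data are hypothetical and only stand in for the real FASTA file and the b_record.query values.

```python
def first_word(title):
    """Return the first whitespace-separated word, without a leading '>'."""
    words = title.lstrip(">").split()
    return words[0] if words else ""


def fasta_ids(lines):
    """Collect the identifiers from the header lines of FASTA-formatted text."""
    return {first_word(line) for line in lines if line.startswith(">")}


def queries_without_hits(fasta_lines, blast_queries):
    """Identifiers present in the FASTA input but absent from the BLAST queries."""
    seen = {first_word(query) for query in blast_queries}
    return fasta_ids(fasta_lines) - seen


# Hypothetical sample data: the FASTA lines stand in for the input file,
# the query list for the b_record.query strings read from the XML output.
fasta_lines = [
    ">seq1 first test sequence",
    "ACGTACGT",
    ">seq2 second test sequence",
    "GGCCTTAA",
    ">seq3",
    "TTTTAAAA",
]
blast_queries = ["seq1 first test sequence", "seq3"]

print(sorted(queries_without_hits(fasta_lines, blast_queries)))
```

With the sample data this reports seq2 as the only sequence without a matching BLAST record. Whether the normalisation is needed depends on how the query names appear in the actual XML output, so it is worth checking a few records by hand first.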