From bugzilla-daemon at portal.open-bio.org Mon Jun 2 04:19:50 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 2 Jun 2008 04:19:50 -0400
Subject: [Biopython-dev] [Bug 2502] PSIBlastParser fails with blastpgp 2.2.18 though works with blastpgp 2.2.15
In-Reply-To:
Message-ID: <200806020819.m528JoXn006809@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2502

------- Comment #19 from ibdeno at gmail.com 2008-06-02 04:19 EST -------
Thank you, Peter. In principle, I don't use that information. I will try then
with the XML parser.

Cheers,
Miguel

(In reply to comment #18)

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Mon Jun 2 04:49:55 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 2 Jun 2008 04:49:55 -0400
Subject: [Biopython-dev] [Bug 2502] PSIBlastParser fails with blastpgp 2.2.18 though works with blastpgp 2.2.15
In-Reply-To:
Message-ID: <200806020849.m528ntdY008609@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2502

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #20 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-02 04:49 EST -------
Marking this bug as fixed.

The original report was about parsing the plain text output which is fixed -
see comment 12, and Bio/Blast/NCBIStandalone.py CVS revision 1.72. I have not
added the 2.2.18 plain text file as a unit test since it's over 750kb.

For the XML output from 2.2.18, as far as I can tell we are not ignoring any
important information from PSI-BLAST, as it is simply not included.
If the NCBI updates the XML output from blastpgp then we should revisit the
XML parsing.

Thank you Miguel for your report and assistance.

From bugzilla-daemon at portal.open-bio.org Mon Jun 2 06:37:51 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 2 Jun 2008 06:37:51 -0400
Subject: [Biopython-dev] [Bug 2503] An error when parsing NCBIWWW Blast output
In-Reply-To:
Message-ID: <200806021037.m52Abpj9019177@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2503

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-02 06:37 EST -------
Dear Prashanth,

Unless you can provide some more information, I'm going to have to close Bug
2503, as you haven't given us enough to go on.

Peter

From bugzilla-daemon at portal.open-bio.org Mon Jun 2 08:57:20 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 2 Jun 2008 08:57:20 -0400
Subject: [Biopython-dev] [Bug 1944] Align.Generic adding iterator and more
In-Reply-To:
Message-ID: <200806021257.m52CvKt4026676@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1944

------- Comment #14 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-02 08:57 EST -------
I've added simple __str__ and __repr__ methods to the alignment class in
Bio/Align/Generic.py CVS revision 1.8, which give output like this:

str(a):

DNAAlphabet() alignment with 3 rows and 14 columns
ACGATCAGCTAGCT Alpha
CCGATCAGCTAGCT Beta
ACGATGAGCTAGCT Gamma

repr(a):

<__main__.Alignment instance (3 records of length 14, DNAAlphabet()) at 9e96c2c>

The string output gets truncated to show a maximum of 20 rows and 50 columns,
which allowing for typical identifiers will still display nicely on a default
terminal.

I now intend to update the tutorial, as being able to print an alignment
should make it much easier to explain and get to grips with.

Note that there is still some interesting code in both attachment 732 (the
__getitem__ method) and in attachment 770 (e.g. subclassing list and adding
__len__, __add__, __radd__ etc).
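
The truncation rule described in comment #14 (at most 20 rows and 50 columns)
can be sketched roughly as below. This is only an illustration and not the
code from Bio/Align/Generic.py: the ToyAlignment class, its constructor
arguments, and the "..." truncation marker are all assumptions made for the
example (written for Python 3).

```python
class ToyAlignment:
    """Hypothetical stand-in for the alignment class (not Biopython's code)."""

    MAX_ROWS = 20  # row limit quoted in comment #14
    MAX_COLS = 50  # column limit quoted in comment #14

    def __init__(self, alphabet_name, records):
        # records: list of (identifier, sequence string) pairs, equal lengths
        self.alphabet_name = alphabet_name
        self.records = list(records)

    def get_alignment_length(self):
        return len(self.records[0][1]) if self.records else 0

    def __str__(self):
        lines = ["%s alignment with %d rows and %d columns"
                 % (self.alphabet_name, len(self.records),
                    self.get_alignment_length())]
        for ident, seq in self.records[:self.MAX_ROWS]:
            if len(seq) > self.MAX_COLS:
                # Assumed marker: keep the line at MAX_COLS characters
                seq = seq[:self.MAX_COLS - 3] + "..."
            lines.append("%s %s" % (seq, ident))
        if len(self.records) > self.MAX_ROWS:
            lines.append("...")  # assumed marker for omitted rows
        return "\n".join(lines)

    def __repr__(self):
        return "<%s instance (%d records of length %d, %s) at %x>" % (
            self.__class__.__name__, len(self.records),
            self.get_alignment_length(), self.alphabet_name, id(self))


a = ToyAlignment("DNAAlphabet()", [("Alpha", "ACGATCAGCTAGCT"),
                                   ("Beta", "CCGATCAGCTAGCT"),
                                   ("Gamma", "ACGATGAGCTAGCT")])
print(a)
```

Keeping the sequence first and the identifier last means the columns line up
even when the identifiers differ in length.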

From bugzilla-daemon at portal.open-bio.org Mon Jun 2 09:26:28 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 2 Jun 2008 09:26:28 -0400
Subject: [Biopython-dev] [Bug 2507] New: Adding __getitem__ to SeqRecord for element access and slicing
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2507

           Summary: Adding __getitem__ to SeqRecord for element access and
                    slicing
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk
OtherBugsDependingOnThis: 1944

With a Seq object, you can access individual letters and create sub-sequences
using slicing. You can even use a stride to reverse the sequence, or select
every third letter.

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
>>> print my_seq
GATCGATGGGCCTATATAGGATCGAAAATCGC
>>> my_seq
Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())
>>> my_seq[5:10]
Seq('ATGGG', IUPACUnambiguousDNA())
>>> my_seq[::-1]
Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG', IUPACUnambiguousDNA())
>>> my_seq[5]
'A'

Currently, these operations cannot be done with a SeqRecord object.

This enhancement bug is to allow element access and slicing (perhaps even
with a stride) on SeqRecord objects, where the annotations are taken into
consideration, and preserved as far as reasonably possible.

Looking at the different SeqRecord properties, this is what I think should
happen for creating a sub-sequence:

.id, .name, .description (three strings) - preserve?
Blindly preserving these may not always be meaningful. For example, if the
description was "Complete plasmid" then it doesn't really apply to a
sub-sequence. Perhaps we should preserve only the id and name, and set the
description to "sub-sequence"?

.annotations (dictionary) - either preserve or lose?
Some annotation entries will still be valid for a sub-sequence (e.g. "source"
or references). Others will not (e.g. anything describing its coordinates
within a larger parent sequence). There is no reliable way to decide on a
case by case basis.

.dbxrefs (list of strings) - preserve?
Any database cross-references would arguably still apply to a sub-sequence or
even a reversed sequence.

.features (list of SeqFeatures) - select only those features still in the new
sub-sequence, and adjust their locations for the new coordinates.
Supporting strides other than +1 would be complicated! For simplicity, I
would say any feature only partially within the sub-sequence should be
discarded.

In summary, one clearly defined set of actions on creating a sub-sequence
could be to preserve all the annotation data except the SeqFeatures which
would be handled sensibly.

[If we later support "per-letter-annotation" in either a Seq or SeqRecord
subclass, then this too should be sliced]

Adding a __getitem__ method to the SeqRecord as outlined above should be
compatible with the suggestion that the SeqRecord subclasses the Seq object
(see bug 2351).

A related point, when accessing single letters, e.g. record[0], should a
single letter string be returned (which lacks any annotation) as currently
happens with the Seq object?

P.S. I'm marking this new enhancement bug as blocking bug 1944. Once
SeqRecord objects support slicing, this would make annotation preserving
slicing of alignment objects much more straightforward.
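
The rule proposed above for .features (keep only features lying wholly inside
the slice, shift them to the new coordinates, discard partial overlaps) can
be sketched as follows. This is a minimal illustration under stated
assumptions, not Biopython code: ToyFeature is a hypothetical stand-in for a
SeqFeature with a half-open [start, end) location, and only non-negative
slice bounds with a stride of +1 are handled (Python 3.7+).

```python
from dataclasses import dataclass


@dataclass
class ToyFeature:
    """Hypothetical stand-in for a SeqFeature; location is [start, end)."""
    start: int
    end: int
    type: str


def slice_features(features, start, stop):
    """Return shifted copies of the features wholly inside [start, stop)."""
    kept = []
    for feature in features:
        # Features only partially inside the slice are discarded
        if start <= feature.start and feature.end <= stop:
            kept.append(ToyFeature(feature.start - start,
                                   feature.end - start,
                                   feature.type))
    return kept


features = [ToyFeature(2, 8, "gene"),     # wholly inside 2:10, kept
            ToyFeature(0, 4, "repeat"),   # starts before the slice, dropped
            ToyFeature(6, 12, "CDS")]     # runs past the slice, dropped
print(slice_features(features, 2, 10))
```

Returning new objects rather than editing the originals leaves the parent
record's features untouched, which matches the copying of annotations and
dbxrefs suggested for the other properties.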

From bugzilla-daemon at portal.open-bio.org Mon Jun 2 09:26:33 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 2 Jun 2008 09:26:33 -0400
Subject: [Biopython-dev] [Bug 1944] Align.Generic adding iterator and more
In-Reply-To:
Message-ID: <200806021326.m52DQXk2029561@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1944

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  BugsThisDependsOn|                            |2507

From bugzilla-daemon at portal.open-bio.org Mon Jun 2 10:00:15 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 2 Jun 2008 10:00:15 -0400
Subject: [Biopython-dev] [Bug 2507] Adding __getitem__ to SeqRecord for element access and slicing
In-Reply-To:
Message-ID: <200806021400.m52E0FJK032027@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2507

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-02 10:00 EST -------
Simple implementation which ignores the features (non-trivial) to be added to
the SeqRecord class in Bio/SeqRecord.py

    def __getitem__(self, index) :
        if isinstance(index, int) :
            #TODO - Should single letters be returned as just
            #strings? This prevents the inclusion of any annotation.
            #Revisit this once the Seq object is a subclass of string.
            return self.seq[index]
        elif isinstance(index, slice) :
            answer = self.__class__(self.seq[index],
                                    id=self.id,
                                    name=self.name,
                                    description=self.description)
            #COPY the annotation dict and dbxrefs list:
            answer.annotations = dict(self.annotations.iteritems())
            answer.dbxrefs = self.dbxrefs[:]
            #TODO - select relevant features, and add them with
            #adjusted coordinates. Take special care with a stride!
            return answer
        raise ValueError, "Invalid index"

From bugzilla-daemon at portal.open-bio.org Mon Jun 2 10:12:29 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 2 Jun 2008 10:12:29 -0400
Subject: [Biopython-dev] [Bug 2507] Adding __getitem__ to SeqRecord for element access and slicing
In-Reply-To:
Message-ID: <200806021412.m52ECT86000330@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2507

------- Comment #2 from jblanca at btc.upv.es 2008-06-02 10:12 EST -------
Does this mean that SeqRecord would deprecate the .seq attribute?
If the .seq attribute is not removed, slicing could be used in two ways:
my_seq[1:100] and my_seq.seq[1:100].

From biopython at maubp.freeserve.co.uk Mon Jun 2 10:14:40 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 2 Jun 2008 15:14:40 +0100
Subject: [Biopython-dev] sequence class proposal
In-Reply-To: <1211779470.483a498e18e3e@webmail.upv.es>
References: <320fb6e00805251437n34362f0bm2a323cd1194afaa@mail.gmail.com> <1211779470.483a498e18e3e@webmail.upv.es>
Message-ID: <320fb6e00806020714s2c789f61ke676a448e2ec871a@mail.gmail.com>

In reply to Jose, I (Peter) wrote:
>> One of your points seemed to be that the SeqRecord couldn't have a
>> __getitem__ and methods like reverse, complement, etc. I don't see
>> why it couldn't have these. Perhaps rather than introducing a whole
>> new class, enhancing the SeqRecord would be a better avenue.

I've filed Bug 2507 to try and show what I had in mind for the
__getitem__ method.
http://bugzilla.open-bio.org/show_bug.cgi?id=2507

Adding further methods for (reverse) complement etc could be done in
much the same way.

Returning to extending Biopython to support per-letter-annotation, I
can see two options:

Right now, the SeqRecord object HAS a Seq object. If we create a new
RichSeq which subclasses the Seq object to provide
per-letter-annotation, then you could use a SeqRecord where the .seq
property is in fact a RichSeq object. The SeqRecord class doesn't
need to have any changes made for this to work (assuming the RichSeq
provides the same API as the Seq object).

If we make the SeqRecord a subclass of the Seq object, then I would
suggest either RichSeq subclassing SeqRecord subclassing Seq, or
perhaps SeqRecord subclassing RichSeq subclassing Seq. It depends on
if you think the id/name/description/dbxrefs/etc properties would be
useful in common use cases of the RichSeq object.

It's not going to be possible for all three classes to have the same
__init__ parameters without breaking existing scripts (and only
supporting the lowest common denominator).

Peter

From jblanca at btc.upv.es Mon Jun 2 15:11:19 2008
From: jblanca at btc.upv.es (Blanca Postigo Jose Miguel)
Date: Mon, 2 Jun 2008 21:11:19 +0200
Subject: [Biopython-dev] Fwd: Re: sequence class proposal
Message-ID: <1212433879.484445d7a6117@webmail.upv.es>

----- Forwarded message from Blanca Postigo Jose Miguel -----
    Date: Mon, 2 Jun 2008 21:08:59 +0200
    From: Blanca Postigo Jose Miguel
Reply-To: Blanca Postigo Jose Miguel
 Subject: Re: [Biopython-dev] sequence class proposal
      To: Peter

Quoting Peter:

> In reply to Jose, I (Peter) wrote:
> >> One of your points seemed to be that the SeqRecord couldn't have a
> >> __getitem__ and methods like reverse, complement, etc. I don't see
> >> why it couldn't have these. Perhaps rather than introducing a whole
> >> new class, enhancing the SeqRecord would be a better avenue.
>
> I've filed Bug 2507 to try and show what I had in mind for the
> __getitem__ method.
> http://bugzilla.open-bio.org/show_bug.cgi?id=2507

I think that would be great.
I've just added to the bug a question about the .seq property of SeqRecord.

> Adding further methods for (reverse) complement etc could be done in
> much the same way.
>
> Returning to extending Biopython to support per-letter-annotation, I
> can see two options:
>
> Right now, the SeqRecord object HAS a Seq object. If we create a new
> RichSeq which subclasses the Seq object to provide
> per-letter-annotation, then you could use a SeqRecord where the .seq
> property is in fact a RichSeq object. The SeqRecord class doesn't
> need to have any changes made for this to work (assuming the RichSeq
> provides the same API as the Seq object).

Here I had a slightly different idea, but maybe yours is better. Basically my
RichSeq proposal is just a SeqRecord with slicing and without the seq
property. The problem with the approach that you describe is that the RichSeq
should have the per-letter-annotation, so SeqRecord would have a general
annotation and RichSeq (in the .seq) would have other features. I would find
that confusing.

>
> If we make the SeqRecord a subclass of the Seq object, then I would
> suggest either RichSeq subclassing SeqRecord subclassing Seq, or
> perhaps SeqRecord subclassing RichSeq subclassing Seq. It depends on
> if you think the id/name/description/dbxrefs/etc properties would be
> useful in common use cases of the RichSeq object.

If SeqRecord is a subclass of Seq, RichSeq is not necessary anymore. That's
what I was proposing. The problem is that the current users of SeqRecord
would have a hard time with the new behaviour, because in that case
supporting the seq property would be hard. To avoid that breakage I was
proposing to create RichSeq.
RichSeq would be just the SeqRecord that you propose but would allow the
users to migrate to RichSeq without forcing them to change to a new SeqRecord
behaviour.

>
> It's not going to be possible for all three classes to have the same
> __init__ parameters without breaking existing scripts (and only
> supporting the lowest common denominator).

That's another reason to rename your new proposed SeqRecord to RichSeq.

>
> Peter
>

Jose Blanca

--

----- End of forwarded message -----

--

From biopython at maubp.freeserve.co.uk Mon Jun 2 15:51:30 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 2 Jun 2008 20:51:30 +0100
Subject: [Biopython-dev] Fwd: Re: sequence class proposal
In-Reply-To: <1212433879.484445d7a6117@webmail.upv.es>
References: <1212433879.484445d7a6117@webmail.upv.es>
Message-ID: <320fb6e00806021251q6cc1a7e8p36125c1326ab7a14@mail.gmail.com>

Jose wrote:
> > I've filed Bug 2507 to try and show what I had in mind for the
> > __getitem__ method.
> > http://bugzilla.open-bio.org/show_bug.cgi?id=2507
>
> I think that would be great.

Good :) Does anyone else want to comment?

> I've just added to the bug a question about the .seq property of SeqRecord.

http://bugzilla.open-bio.org/show_bug.cgi?id=2507#c2 reads:
> Does this mean that SeqRecord would deprecate the .seq attribute?
> If the .seq attribute is not removed slicing could be used in it like:
> my_seq[1:100] and my_seq.seq[1:100].

I was not intending to deprecate the SeqRecord's .seq property at this
time (I think that should happen in preparation for if/when the
SeqRecord becomes a subclass of the Seq object).

With my idea described on bug 2507, given a SeqRecord object my_seq_record:

my_seq_record[1:100] -> another SeqRecord (with annotation)
my_seq_record.seq[1:100] -> just a Seq object (no annotation)
my_seq_record.seq.tostring()[1:100] -> just a string (no annotation or alphabet)
str(my_seq_record.seq)[1:100] -> just a string (no annotation or alphabet)

These trivial examples would all "contain" the same sequence string.

This enhancement could be done right now, and shouldn't impede any
future per-letter-annotation enhancements. Perhaps
per-letter-annotation enhancements could be added to the SeqRecord
class directly... I need to fully digest the discussion on the BioSQL
list, see:
http://lists.open-bio.org/pipermail/biosql-l/2008-May/thread.html

Peter

From mjldehoon at yahoo.com Mon Jun 2 20:19:59 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 2 Jun 2008 17:19:59 -0700 (PDT)
Subject: [Biopython-dev] Bio.Entrez & Bio.EUtil
In-Reply-To: <320fb6e00805300717v60f0b153i88b5e9a8aee1744c@mail.gmail.com>
Message-ID: <624249.42121.qm@web62408.mail.re1.yahoo.com>

OK I'll double-check. I may not have noticed some missing DTDs if they were
downloaded automatically from the internet.

I think Biopython should ship the most common DTDs. At least the ones needed
for test_Entrez, which probably covers most of the use cases of Bio.Entrez.

--Michiel.

Peter wrote:
On 24 May 2008, Michiel de Hoon wrote:
> Dear all,
>
> I have essentially completed the parser in Bio.Entrez.

The internals of the new design look more complicated to start with,
but I can see how much more general it is than the older versions :)

Should it work starting from an empty DTDs folder - or will we ship
Biopython with most of the current files?

I've had trouble with Biopython trying to fetch missing DTD files from
the internet. I think the problem is the NCBI using relative URLs.
The following quick hack seems to help in Parser.py but only in some
cases (because as listed below, the NCBI have two different base
paths):

279,280c279,288
<         warnings.warn("DTD file %s not found in Biopython installation; trying to retrieve it from NCBI" % filename)
<         handle = urllib.urlopen(systemId)
---
>         warnings.warn("DTD file %s not found in Biopython installation; trying to retrieve it from NCBI" % path)
>         if "/" in systemId :
>             #Assume this is a full path, e.g.
>             #http://www.ncbi.nlm.nih.gov/entrez/query/DTD/nlmmedline_080101.dtd
>             handle = urllib.urlopen(systemId)
>         else :
>             #Its a relative path, and I'm not sure how to best get the base path:
>             handle = urllib.urlopen("http://www.ncbi.nlm.nih.gov/entrez/query/DTD/"+systemId)

(Also note there seem to be some tab/space issues in this file).

From http://www.ncbi.nlm.nih.gov/entrez/query/DTD/ I've downloaded the
following files using wget:

egquery.dtd
eSearch_020511.dtd
nlmcommon_080101.dtd
pubmed_080101.dtd
eInfo_020511.dtd
eSpell.dtd
nlmmedline_080101.dtd
taxon.dtd
eLink_020511.dtd
eSummary_041029.dtd
nlmmedlinecitation_080101.dtd
uilist.dtd
ePost_020511.dtd
nlmsharedcatcit_080101.dtd

Additionally http://www.ncbi.nlm.nih.gov/dtd/ provided some further
DTD files needed for the test_Entrez.py unit test:

NCBI_GBSeq.dtd
NCBI_GBSeq.mod.dtd
NCBI_Entity.mod.dtd
NCBI_Mim.dtd
NCBI_Mim.mod.dtd

With all the above files, then the unit test file test_Entrez.py
doesn't give any missing DTD warnings - but still has a couple of
failures.
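
The relative-versus-absolute check in the hack above could also be written
with the standard library's URL joining, which leaves a full URL unchanged
and resolves a bare file name against a base. This is only a sketch for
Python 3 and not the code in Parser.py: the function name resolve_dtd_url and
the choice of default base are assumptions, and it does not solve the open
question of which of the two NCBI base paths applies to a given relative
name.

```python
from urllib.parse import urljoin

# Base path taken from the hack above. The second NCBI location
# (http://www.ncbi.nlm.nih.gov/dtd/) must be passed explicitly as `base`.
DEFAULT_DTD_BASE = "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/"


def resolve_dtd_url(system_id, base=DEFAULT_DTD_BASE):
    """Return an absolute URL for a DTD systemId (hypothetical helper)."""
    # urljoin returns system_id as-is when it is already an absolute URL
    return urljoin(base, system_id)


print(resolve_dtd_url("eSpell.dtd"))
print(resolve_dtd_url("NCBI_GBSeq.dtd", base="http://www.ncbi.nlm.nih.gov/dtd/"))
```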

Peter

From bugzilla-daemon at portal.open-bio.org Tue Jun 3 00:39:24 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 3 Jun 2008 00:39:24 -0400
Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects
In-Reply-To:
Message-ID: <200806030439.m534dOYI021682@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2454

------- Comment #4 from mdehoon at ims.u-tokyo.ac.jp 2008-06-03 00:39 EST -------
I agree that type checking is a problem.
I am not sure if a specialized function in Bio.File is a good idea. The
question is not if "this object is a file-like object", but "does this object
have the attributes/methods needed". So I would prefer to add checks only for
the required attributes/methods in each of the iterators.

From mjldehoon at yahoo.com Tue Jun 3 00:33:27 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 2 Jun 2008 21:33:27 -0700 (PDT)
Subject: [Biopython-dev] Bio.Entrez & Bio.EUtil
In-Reply-To: <624249.42121.qm@web62408.mail.re1.yahoo.com>
Message-ID: <112249.61498.qm@web62410.mail.re1.yahoo.com>

I checked but I did not see any missing DTDs. Most of the DTDs in the list
you sent are in Biopython's CVS under Bio/Entrez/DTDs, and are included
correctly if I do a fresh checkout from CVS. Could you maybe try with a fresh
checkout?

--Michiel.

_______________________________________________
Biopython-dev mailing list
Biopython-dev at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython-dev

From bugzilla-daemon at portal.open-bio.org Tue Jun 3 05:16:48 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 3 Jun 2008 05:16:48 -0400
Subject: [Biopython-dev] [Bug 2446] Comments in CT tags cause Bio.Sequencing.Ace.ACEParser to fail.
In-Reply-To:
Message-ID: <200806030916.m539GmwZ001955@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2446

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-03 05:16 EST -------
As pointed out on the mailing list, the test cases attached to this bug have
disappeared (some expiry issue?).

In the mean time, we could probably just edit the sole existing test case in
Tests/Ace/contig1.ace to add a comment to an existing CT tag. Looking at this
file, for example edit:

CT{
Contig1 repeat phrap 52 53 555456:555432
This is the forst line of comment for c1
and this the second for c1
}

to become:

CT{
Contig1 repeat phrap 52 53 555456:555432
COMMENT{
This is the first line of comment for c1
and this the second for c1}
}

In the short term, we could either ignore the COMMENT tags within a CT tag,
or just treat them as plain text. Supporting the nested structure would
require changes to the current Record structure.

From bugzilla-daemon at portal.open-bio.org Tue Jun 3 07:46:58 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 3 Jun 2008 07:46:58 -0400
Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects
In-Reply-To:
Message-ID: <200806031146.m53BkwAB009224@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2454

------- Comment #5 from cracka80 at gmail.com 2008-06-03 07:46 EST -------
(In reply to comment #4)
> I agree that type checking is a problem.
> I am not sure if a specialized function in Bio.File is a good idea. The
> question is not if "this object is a file-like object", but "does this object
> have the attributes/methods needed". So I would prefer to add checks only for
> the required attributes/methods in each of the iterators.
>

The function I have written does exactly this - it checks for the necessary
attributes and methods for a given object. The iterators would then only need
to call ``File.is_filelike()`` on each object passed into them, rather than a
type checking procedure. This is in accordance with the design pattern
"Program to an 'interface', not an 'implementation'." (Gang of Four).

Would you like me to provide a diff against the current revision of
Biopython, with suggested changes?
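
An attribute-based check of the kind described in comment #5 might look
roughly like this. It is a sketch only and not the function attached to the
bug: the signature, the `required` parameter, and the default attribute names
are assumptions for illustration (Python 3).

```python
import io


def is_filelike(obj, required=("read", "readline", "readlines", "__iter__")):
    """Return True if obj has every attribute named in `required`.

    Hypothetical helper: callers that only iterate over lines can pass
    required=("__iter__",) so that no unneeded method is demanded.
    """
    return all(hasattr(obj, name) for name in required)


print(is_filelike(io.StringIO(">seq\nACGT\n")))          # in-memory handle
print(is_filelike(">seq\nACGT\n"))                       # plain string
print(is_filelike([">seq", "ACGT"], required=("__iter__",)))
```

Making the attribute list a parameter lets each iterator ask only for what
it actually uses, which is the concern raised in comment #4.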

From bugzilla-daemon at portal.open-bio.org Tue Jun 3 11:07:35 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 3 Jun 2008 11:07:35 -0400
Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects
In-Reply-To:
Message-ID: <200806031507.m53F7Zm7019694@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2454

------- Comment #6 from mdehoon at ims.u-tokyo.ac.jp 2008-06-03 11:07 EST -------
Two things:

1) Some of the code that does type checking for file-like-ness seems to be
quite old and possibly outdated (e.g. Gobase.Iterator). We should take this
opportunity to go through these modules and check if they are still useful.

2) Many of these modules (especially the ones that use an "Iterator" class)
would be written differently in modern Python (in particular by making use of
a generator function instead of an Iterator class).

So I'd like to suggest the following:

-) For the modules whose usability is dubious in 2008, let's check on the
mailing list if anybody is still using them. If not, we can simply deprecate
them.

-) For the modules that are still useful, use try/except clauses to check for
the necessary attributes. The current function checks for 'read', 'readline',
'readlines', and '__iter__', whereas the parser probably only needs one of
them.

-) If possible, I'd prefer to convert to modern Python as much as possible
(though formally that is not within the scope of this bug report).
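
The two suggestions in comment #6 (a generator function in place of an
Iterator class, and a try/except that tests only the capability actually
needed) can be combined in a short sketch. The FASTA-like record layout and
the name iterate_records are assumptions chosen for illustration, not code
from any Biopython module (Python 3).

```python
import io


def iterate_records(handle):
    """Yield (title, sequence) pairs from FASTA-like lines.

    Only iteration over lines is required of `handle`, so open files,
    StringIO objects and plain lists of strings are all accepted.
    """
    try:
        lines = iter(handle)  # the single capability this parser needs
    except TypeError:
        raise TypeError("handle must be iterable over lines")
    title, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None and line:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)  # emit the final record


for title, seq in iterate_records(io.StringIO(">a\nAC\nGT\n>b\nTT\n")):
    print(title, seq)
```

Because the function is a generator, the check runs when the first record is
requested rather than when the function is called.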

From bugzilla-daemon at portal.open-bio.org Wed Jun 4 15:50:14 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 4 Jun 2008 15:50:14 -0400
Subject: [Biopython-dev] [Bug 2507] Adding __getitem__ to SeqRecord for element access and slicing
In-Reply-To:
Message-ID: <200806041950.m54JoEPj029720@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2507

------- Comment #3 from jblanca at btc.upv.es 2008-06-04 15:50 EST -------
Created an attachment (id=927)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=927&action=view)
RichSeq proposal

I have coded a sequence class that fulfils the requirements that I would like
to see. It's very similar to SeqRecord, but it is not compatible with it. It
has no seq property, although that can be solved. The problem with SeqRecord
is that it is not possible to create a class with an __init__ compatible with
Seq and SeqRecord at the same time.

This proposed class is just a draft, it needs more work but I would like to
receive comments about it. It inherits from MutableSeq so it should be named
MutableRichSeq, but it seems that I'm too lazy to type such a long name. I
promise to change the name in a later version and to create a RichSeq with
Seq as parent.

Besides RichSeq there are two other classes in the attachment, RichFeature
and BioRange, but I will comment on those in another post.

I think that it is quite important to convert Seq and MutableSeq to new-style
classes, what do you think about that? With new-style classes we can use
properties.
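
As an illustration of what properties on a new-style class make possible, the
sketch below keeps an old attribute name readable while warning about its
use. ToySeq is a hypothetical stand-in and not the Biopython Seq class; the
warning text and the read-only behaviour are assumptions made for the
example. Inheriting from object is what makes a class new-style in Python 2;
in Python 3 every class is new-style.

```python
import warnings


class ToySeq(object):
    """Hypothetical stand-in for Seq, written as a new-style class."""

    def __init__(self, data):
        self._data = data

    def __str__(self):
        return self._data

    @property
    def data(self):
        # Read-only property: old code reading .data still works, but warns
        warnings.warn("the .data property is deprecated; use str(seq)",
                      DeprecationWarning, stacklevel=2)
        return self._data


seq = ToySeq("ACGT")
print(str(seq))
```

With an old-style class the same attribute could not be intercepted this way
without overriding __getattr__.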

From bugzilla-daemon at portal.open-bio.org Wed Jun 4 16:19:41 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 4 Jun 2008 16:19:41 -0400
Subject: [Biopython-dev] [Bug 2508] New: NCBIStandalone.blastall: provide support for '-F F' and make it safe
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2508

           Summary: NCBIStandalone.blastall: provide support for '-F F' and
                    make it safe
           Product: Biopython
           Version: 1.44
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: mmokrejs at ribosome.natur.cuni.cz

The local NCBI blast by default masks low-complexity regions using the SEG
algorithm. I do not see a variable to affect this in
NCBIStandalone.blastall().

Luckily, NCBIStandalone.blastall() is an unsafe function and does not check
whether I pass multiple arguments in a value expected to be a string or
number. Thus, I can do:

_blast_out, _error_info = NCBIStandalone.blastall('/usr/bin/blastall',
    'blastn', blast_db, _blast_file, matrix='IDENTITY -F 0')

but imagine I would have done:

_blast_out, _error_info = NCBIStandalone.blastall('/usr/bin/blastall',
    'blastn', blast_db, _blast_file,
    matrix='IDENTITY -F 0; rm -rf /etc/passwd')

The function should be protected against such attacks, as if it were directly
exposed to web users as a CGI script. I propose a similar defensive strategy
for all functions calling os.system(), os.exec*(), os.popen*(), etc.
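
One defensive strategy of the kind requested in this report is to build the
command as a list of arguments and run it without a shell, so that a value
such as 'IDENTITY -F 0; rm -rf /etc/passwd' reaches the program as a single
argument and is never interpreted as shell syntax. The sketch below is not
the NCBIStandalone API: build_blastall_args and run_blastall are hypothetical
names, and the -p, -d, -i, -M and -F options are assumed to be those of the
legacy blastall command line (Python 3.7+).

```python
import subprocess

# Program names accepted by this sketch (assumed blastall -p values)
ALLOWED_PROGRAMS = {"blastn", "blastp", "blastx", "tblastn", "tblastx"}


def build_blastall_args(blastall_exe, program, database, infile,
                        matrix=None, filter_query=None):
    """Return the command as a list; every value stays one argument."""
    if program not in ALLOWED_PROGRAMS:
        raise ValueError("unknown BLAST program: %r" % (program,))
    args = [blastall_exe, "-p", program, "-d", database, "-i", infile]
    if matrix is not None:
        args += ["-M", str(matrix)]
    if filter_query is not None:
        # Explicit switch for low-complexity filtering ('-F T' or '-F F')
        args += ["-F", "T" if filter_query else "F"]
    return args


def run_blastall(args):
    """Run the command with no shell involved (shell=False is the default)."""
    return subprocess.run(args, capture_output=True, text=True)


hostile = "IDENTITY -F 0; rm -rf /etc/passwd"
print(build_blastall_args("/usr/bin/blastall", "blastn", "mydb",
                          "query.fasta", matrix=hostile))
```

A hostile matrix value is then simply an invalid matrix name for blastall to
reject, and the filter setting gets its own keyword instead of being smuggled
in through another option.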
From bugzilla-daemon at portal.open-bio.org Thu Jun 5 04:52:47 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 5 Jun 2008 04:52:47 -0400 Subject: [Biopython-dev] [Bug 2507] Adding __getitem__ to SeqRecord for element access and slicing In-Reply-To: Message-ID: <200806050852.m558qlPF031059@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2507 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-05 04:52 EST ------- I replied to comment 2 on the mailing list. I had intended this particular bugzilla entry (bug 2507) to be very narrow in scope - purely a small backwards compatible change to the current SeqRecord. Some of the questions in comment 3 might have fit better on Bug 2351, although it's getting rather long. Rather than taking this issue further off topic, I'll reply on the mailing list again. From biopython at maubp.freeserve.co.uk Thu Jun 5 05:17:00 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 5 Jun 2008 10:17:00 +0100 Subject: [Biopython-dev] Fwd: Re: sequence class proposal In-Reply-To: <320fb6e00806021251q6cc1a7e8p36125c1326ab7a14@mail.gmail.com> References: <1212433879.484445d7a6117@webmail.upv.es> <320fb6e00806021251q6cc1a7e8p36125c1326ab7a14@mail.gmail.com> Message-ID: <320fb6e00806050217y1c437b01qa7fd21d75a609e8c@mail.gmail.com> This is in reply to Jose's comment 3 on bug 2507, which was quite broad. http://bugzilla.open-bio.org/show_bug.cgi?id=2507#c3 > I have coded a sequence class that fulfils the requirements that I > would like to see. It's very similar to SeqRecord, but it is not compatible > with it. It has no seq property, although that can be solved. 
The problem > with SeqRecord is that it is not possible to create a class with an __init__ > compatible with Seq and SeqRecord at the same time. Even if one day the SeqRecord is a subclass of the Seq object, there is no requirement that it have the same __init__ arguments. In fact, they have to be different because for a SeqRecord you should also supply an identifier (and potentially a name, description and other annotation). > This proposed class is just a draft, it needs more work but I would like to > receive comments about it. It inherits from MutableSeq so it should be > named MutableRichSeq, but it seems that I'm too lazy to type such a long name. > I promise to change the name in a later version and to create a RichSeq > with Seq as parent. I agree with you here that when getting a single letter (amino acid or nucleotide) from a sequence with per-letter-annotation, e.g. my_sequence[5], it would be very nice to have the per-letter-annotation, such as the quality, included. This does mean the object returned can't just be a single one character string. However, because the current Seq and MutableSeq classes return a simple string, unless we return a subclass of a string, this risks breaking other people's code. So, I would conclude that Seq needs to subclass a string BEFORE we start including support for per-letter-annotation. Ideally we would have alphabet aware versions of all the string functions before we made this change (see Bug 2351). > Besides RichSeq there are in the attachment two other classes, RichFeature > and BioRange, but I would comment on that in another post. Your BioRange and RichFeature classes seem somewhat similar to the current SeqFeature class with its locations (and sub features). > I think that it is quite important to convert Seq and MutableSeq to new-style classes, > what do you think about that? With the new classes we can use properties. I have been thinking about deprecating the Seq.data property (and also the MutableSeq). 
The data string (or array) should really be a private implementation detail, perhaps Seq._data following the underscore for private convention. We can then add property methods to make the Seq.data available (perhaps with a deprecation warning). Peter From bugzilla-daemon at portal.open-bio.org Thu Jun 5 05:36:18 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 5 Jun 2008 05:36:18 -0400 Subject: [Biopython-dev] [Bug 2507] Adding __getitem__ to SeqRecord for element access and slicing In-Reply-To: Message-ID: <200806050936.m559aINS001028@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2507 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-05 05:36 EST ------- Created an attachment (id=928) --> (http://bugzilla.open-bio.org/attachment.cgi?id=928&action=view) Patch to Bio/SeqRecord.py adding __getitem__ and __len__ and __iter__ Patch based on my comment 1, with addition of __len__ allowing len(my_record) rather than len(my_record.seq) and an explicit __iter__ method (although this is not required, it lets us give a doc string). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
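The three methods named in comment 5 can be sketched on a toy record class. MiniRecord below is a hypothetical stand-in written for illustration; it is not the attached patch and it ignores annotation and features entirely.

```python
class MiniRecord:
    """Toy record holding a sequence string and an identifier (hypothetical)."""

    def __init__(self, seq, id="<unknown id>"):
        self.seq = seq
        self.id = id

    def __getitem__(self, index):
        # A slice gives back a new record; an integer gives a single letter.
        if isinstance(index, slice):
            return MiniRecord(self.seq[index], self.id)
        return self.seq[index]

    def __len__(self):
        """Length of the sequence, so len(record) works."""
        return len(self.seq)

    def __iter__(self):
        """Iterate over the letters of the sequence."""
        return iter(self.seq)


record = MiniRecord("ACGTAC", "demo")
print(record[5], len(record), list(record), record[1:3].seq)
```

Iteration would also work through __getitem__ alone, so the explicit __iter__ mainly provides a place for the docstring, as the comment notes.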
From bugzilla-daemon at portal.open-bio.org Thu Jun 5 06:18:11 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 5 Jun 2008 06:18:11 -0400 Subject: [Biopython-dev] [Bug 2509] New: Deprecating the .data property of the Seq and MutableSeq objects Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2509 Summary: Deprecating the .data property of the Seq and MutableSeq objects Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk OtherBugsDependingOnThis: 2351 In anticipation that the Seq and MutableSeq objects will eventually subclass the python string, their data property is unnecessary and confusing. The following patch will replace it with new-style class property methods and a docstring declaring it to be deprecated. In the case of the Seq object, the sequence should be read only but the user can currently modify the data property in place. In the case of the MutableSeq, the fact that it is internally an array of characters should be a private implementation detail. From bugzilla-daemon at portal.open-bio.org Thu Jun 5 06:18:14 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 5 Jun 2008 06:18:14 -0400 Subject: [Biopython-dev] [Bug 2351] Make Seq more like a string, even subclass string? 
In-Reply-To: Message-ID: <200806051018.m55AIE7S003198@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2351 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- BugsThisDependsOn| |2509 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jun 5 06:47:43 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 5 Jun 2008 06:47:43 -0400 Subject: [Biopython-dev] [Bug 2507] Adding __getitem__ to SeqRecord for element access and slicing In-Reply-To: Message-ID: <200806051047.m55AlhBe004755@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2507 ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-05 06:47 EST ------- Note that adding __len__ has a knock on effect when dealing with SeqRecord objects with a zero length sequence - they now evaluate to False rather than True. This was an issue for some of the unit tests where "if record" was used rather than the more explicit "if record is not None". This change could therefore have unexpected side effects in existing scripts, however adding __len__ is required if we intend to make the SeqRecord act more like the Seq object. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
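The knock-on effect described in comment 6 can be reproduced with a small sketch. RecordWithLen and the two counting helpers are hypothetical names used only for this illustration; the list stands in for an iterator that signals the end of input with None.

```python
class RecordWithLen:
    """Toy record that defines __len__ but no explicit truth value."""

    def __init__(self, seq):
        self.seq = seq

    def __len__(self):
        return len(self.seq)


def count_until_none_truthy(records):
    """Fragile: stops at the first falsy item, including empty records."""
    count = 0
    for record in records:
        if not record:
            break
        count += 1
    return count


def count_until_none_explicit(records):
    """Robust: stops only when the end marker None is seen."""
    count = 0
    for record in records:
        if record is None:
            break
        count += 1
    return count


# A zero-length record in the middle, then None as the end marker.
stream = [RecordWithLen("ACGT"), RecordWithLen(""), RecordWithLen("GG"), None]
print(count_until_none_truthy(stream), count_until_none_explicit(stream))
```

The truthiness test stops at the empty record, while the explicit comparison with None counts all three records.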
From bugzilla-daemon at portal.open-bio.org Thu Jun 5 07:03:27 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 5 Jun 2008 07:03:27 -0400 Subject: [Biopython-dev] [Bug 2508] NCBIStandalone.blastall: provide support for '-F F' and make it safe In-Reply-To: Message-ID: <200806051103.m55B3RUU005472@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2508 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-05 07:03 EST ------- You seem to have identified two issues. Adding support for -F should be fairly easy. For the security issue, the caller should be validating their input. Also, if running from a web server, the permissions should be restricted - failing to do this is asking for trouble. However, defence in layers would be good. Would you suggest a simple check for the ";" character? What about escaped semi-colons? Also, this is a platform-dependent issue. The ";" character is Unix only. At the Windows command line you have to use &&. Do you have a patch in mind? From bugzilla-daemon at portal.open-bio.org Thu Jun 5 08:56:21 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 5 Jun 2008 08:56:21 -0400 Subject: [Biopython-dev] [Bug 2508] NCBIStandalone.blastall: provide support for '-F F' and make it safe In-Reply-To: Message-ID: <200806051256.m55CuLfC010670@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2508 ------- Comment #2 from mmokrejs at ribosome.natur.cuni.cz 2008-06-05 08:56 EST ------- For the latter issue, I would use some Python library to escape shell metacharacters. cgi.escape() doesn't do what I would like it to. Or cgi.wrap()? 
A Google search returned some hints: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/498202 http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/66012 http://e-articles.info/e/a/title/Command-Injection/ https://bugs.gentoo.org/show_bug.cgi?id=187971#c5 https://bugs.gentoo.org/show_bug.cgi?id=187971#c23 http://mail.python.org/pipermail/python-3000/2007-May/007192.html http://www.owasp.org/index.php/Interpreter_Injection http://www.velocityreviews.com/forums/t352309-sql-escaping-module.html One could learn from or even use escaping functions such as MySQLdb.escape() or MySQLdb.connection.escape_string(), but I don't think that is a complete solution. I will try to think about it more later. From bugzilla-daemon at portal.open-bio.org Thu Jun 5 09:25:43 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 5 Jun 2008 09:25:43 -0400 Subject: [Biopython-dev] [Bug 2494] _retrieve_taxon in BioSQL.py needs urgent optimization In-Reply-To: Message-ID: <200806051325.m55DPhrQ012033@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2494 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-05 09:25 EST ------- I've committed this patch to CVS as part of BioSQL/BioSeq.py revision 1.24. If you could update your installation of Biopython to CVS and test this please Eric, then I think we can mark this bug as fixed. 
From bugzilla-daemon at portal.open-bio.org Thu Jun 5 09:29:25 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 5 Jun 2008 09:29:25 -0400 Subject: [Biopython-dev] [Bug 2509] Deprecating the .data property of the Seq and MutableSeq objects In-Reply-To: Message-ID: <200806051329.m55DTP30012244@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2509 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-05 09:29 EST ------- Created an attachment (id=929) --> (http://bugzilla.open-bio.org/attachment.cgi?id=929&action=view) Patch to Bio/Seq.py This turns out to be quite a big change, and while the unit tests still pass, more extensive testing would be a good idea. Alternatively, we could just leave .data exposed as a read-only property, and switch to ._data (or a string subclass) instead. From bugzilla-daemon at portal.open-bio.org Thu Jun 5 13:55:02 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 5 Jun 2008 13:55:02 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806051755.m55Ht2TS024644@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #7 from cracka80 at gmail.com 2008-06-05 13:55 EST ------- I understand your approach that these functions should be converted to modern Python, but it must also be remembered that Biopython as a whole is Python 2.3-compatible, so care must be taken not to modernise code too much. I can't remember when iterators were phased in, but it should be possible; I think it was around 2.2 anyway. 
(In reply to comment #6) > Two things: > 1) Some of the code that does type checking for file-like-ness seems to be > quite old and possibly outdated (e.g. Gobase.Iterator). We should take this > opportunity to go through these modules and check if they are still useful. > 2) Many of these modules (especially the ones that use an "Iterator" class) > would be written differently in modern Python (in particular by making use of a > generator function instead of an Iterator class). > > So I'd like to suggest the following: > -) For the modules whose usability is dubious in 2008, let's check on the > mailing list if anybody is still using them. If not, we can simply deprecate > them. > -) For the modules that are still useful, use try/except clauses to check for > the necessary attributes. The current function checks for 'read', 'readline', > 'readlines', and '__iter__', whereas the parser probably only needs one of > them. > -) If possible, I'd prefer to convert to modern Python as much as possible > (though formally that is not within the scope of this bug report). > -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Jun 7 04:26:54 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 7 Jun 2008 04:26:54 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806070826.m578Qsj4019312@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #8 from mdehoon at ims.u-tokyo.ac.jp 2008-06-07 04:26 EST ------- (In reply to comment #7) > I understand your approach that these functions should be converted to modern > Python, but it must also be remembered that Biopython as a whole is Python > 2.3-compatible, so care must be taken not to modernise code too much. 
I can't > remember when iterators were phased in, but it should be possible, I think it > was around 2.2 anyway. > Bio.Blast.NCBIXML already uses generator functions to return iterators, so I think we are fine as far as compatibility with Python 2.3 and later is concerned. I'll ask on the mailing list if Bio.Gobase has any users, to get started. From mjldehoon at yahoo.com Sat Jun 7 04:35:05 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 7 Jun 2008 01:35:05 -0700 (PDT) Subject: [Biopython-dev] Bio.Gobase, anybody? Message-ID: <844450.31822.qm@web62415.mail.re1.yahoo.com> Hi everybody, As part of bug report 2454: http://bugzilla.open-bio.org/show_bug.cgi?id=2454, I started looking at the Bio.Gobase module. This module provides access to the gobase database: http://megasun.bch.umontreal.ca/gobase/ This module is about seven years old and (AFAICT) is not actively maintained. We don't have documentation for this module, but the unit tests suggest that it parses HTML files from gobase. I am not sure exactly where the HTML files came from, but I doubt that after seven years this still works. So I was wondering: Does anybody use Bio.Gobase? If not, I suggest we deprecate it for the next release, and remove it in some future release. If there are users, we need to make some (small) changes to this module (that is what the original bug report was about). --Michiel. 
From bugzilla-daemon at portal.open-bio.org Mon Jun 9 08:45:24 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 9 Jun 2008 08:45:24 -0400 Subject: [Biopython-dev] [Bug 2511] New: setup.py problem with del sys.modules["Martel"] Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2511 Summary: setup.py problem with del sys.modules["Martel"] Product: Biopython Version: Not Applicable Platform: PC OS/Version: Mac OS Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk I'm currently trying to install Biopython from source (CVS) on a clean Mac OS X machine, without reportlab, Numeric or mxTextTools. I've run into a small issue with "python setup.py build" related to the testing for an existing Martel distribution (since Martel has been distributed separately from Biopython before) due to the lack of mxTextTools. Traceback (most recent call last): File "setup.py", line 508, in 'Bio.PopGen': ['SimCoal/data/*.par'], File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/distutils/core.py", line 151, in setup dist.run_commands() File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/distutils/dist.py", line 974, in run_commands self.run_command(cmd) File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/distutils/dist.py", line 994, in run_command cmd_obj.run() File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/distutils/command/build.py", line 112, in run self.run_command(cmd_name) File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/distutils/cmd.py", line 333, in run_command self.distribution.run_command(command) File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/distutils/dist.py", line 994, in run_command cmd_obj.run() File "setup.py", line 157, in run if not is_Martel_installed(): File "setup.py", 
line 292, in is_Martel_installed del sys.modules["Martel"] # Delete the old version of Martel. The function is_Martel_installed() starts by trying to load the bundled Martel, by calling can_import("Martel"). This is failing with an ImportError from mxTextTools - and hence the Martel version of the bundled copy cannot be determined. The next line of is_Martel_installed() causes the problem: del sys.modules["Martel"] I think this only makes sense if the module could be imported, patch to follow. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 9 08:46:51 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 9 Jun 2008 08:46:51 -0400 Subject: [Biopython-dev] [Bug 2511] setup.py problem with del sys.modules["Martel"] In-Reply-To: Message-ID: <200806091246.m59Ckpts011798@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2511 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-09 08:46 EST ------- Created an attachment (id=930) --> (http://bugzilla.open-bio.org/attachment.cgi?id=930&action=view) Patch to setup.py How does this look Michiel? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Tue Jun 10 07:37:42 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 10 Jun 2008 12:37:42 +0100 Subject: [Biopython-dev] Giving the SeqRecord a length? Evaluating it as a boolean Message-ID: <320fb6e00806100437n21e53369p36c85a810007ca19@mail.gmail.com> Something we've discussed before is making the SeqRecord more like a Seq object, perhaps even subclassing it. 
I've got a patch on Bug 2507 to make some small steps in this direction - accessing elements of the sequence by indexing the SeqRecord, i.e. letter = my_seq_record[5], or iterating over the letters in a SeqRecord's sequence. http://bugzilla.open-bio.org/show_bug.cgi?id=2507 In addition, I would like to give the SeqRecord a length, allowing len(my_seq_record) rather than len(my_seq_record.seq). However, this has a side effect on the evaluation of a SeqRecord as a boolean. Before, any SeqRecord evaluated as True, but if we add the __len__ method then any SeqRecord with a zero length sequence will evaluate as False. This is a real issue, for example you can have GenBank files without a sequence (see our unit test cases). One example where this is important is if you are using an iterator via the .next() method and had been checking for a returned None by using "if record:" (something some of the older unit tests were doing) you would have to start using "if record is not None:" instead. If the old behaviour is desirable (evaluating a SeqRecord as a boolean is always True), we could implement a __nonzero__ method to preserve it, see: http://docs.python.org/ref/customization.html What do people think? Would adding a __len__ method to the SeqRecord cause trouble? Peter From mjldehoon at yahoo.com Tue Jun 10 19:17:56 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 10 Jun 2008 16:17:56 -0700 (PDT) Subject: [Biopython-dev] Giving the SeqRecord a length? Evaluating it as a boolean In-Reply-To: <320fb6e00806100437n21e53369p36c85a810007ca19@mail.gmail.com> Message-ID: <797428.30617.qm@web62402.mail.re1.yahoo.com> +1 for adding a __len__ method, with a __nonzero__ method to let all SeqRecord objects evaluate as true. --Michiel. 
From bugzilla-daemon at portal.open-bio.org Tue Jun 10 19:30:20 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 10 Jun 2008 19:30:20 -0400 Subject: [Biopython-dev] [Bug 2511] setup.py problem with del sys.modules["Martel"] In-Reply-To: Message-ID: <200806102330.m5ANUKfo019481@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2511 ------- Comment #2 from mdehoon at ims.u-tokyo.ac.jp 2008-06-10 19:30 EST ------- (In reply to comment #1) > Created an attachment (id=930) --> (http://bugzilla.open-bio.org/attachment.cgi?id=930&action=view) [details] > Patch to setup.py > > How does this look Michiel? > That looks fine to me, though eventually I would prefer to get rid of the dependence on Martel/mxTextTools altogether. From bugzilla-daemon at portal.open-bio.org Tue Jun 10 19:42:52 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 10 Jun 2008 19:42:52 -0400 Subject: [Biopython-dev] [Bug 2511] setup.py problem with del sys.modules["Martel"] In-Reply-To: Message-ID: <200806102342.m5ANgqct019925@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2511 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-10 19:42 EST ------- In reply to comment 2, would it make sense for the unit test framework to treat the mxTextTools (or reportlab, or Numeric) import errors as a missing external dependency? In the unit tests we used to "ignore" any tests which failed with an ImportError, but have now switched to our own MissingExternalDependencyError exception. 
We want to distinguish ImportErrors which are external to Biopython (and therefore can be considered as missing dependencies) from those internal to Biopython (perhaps due to refactoring or removal of code - a real unit test failure). One way to do this would be for the bits of Biopython that try to import mxTextTools (or any other module) to raise MissingExternalDependencyError (or something that is a subclass of both MissingExternalDependencyError and the built in ImportError). From bugzilla-daemon at portal.open-bio.org Wed Jun 11 02:54:32 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 11 Jun 2008 02:54:32 -0400 Subject: [Biopython-dev] [Bug 2516] New: Make it clear what is numeric and what is numpy Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2516 Summary: Make it clear what is numeric and what is numpy Product: Biopython Version: 1.45 Platform: PC URL: http://www.biopython.org/DIST/docs/install/Installation.html OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Documentation AssignedTo: biopython-dev at biopython.org ReportedBy: mmokrejs at ribosome.natur.cuni.cz Hi, although both packages are from the same source site, numpy is the newer implementation whereas numeric is the old, deprecated implementation, right? Why do you say in the installation docs the following? "The Numerical Python distribution (also known an Numeric or Numpy) is a fast implementation of arrays and associated array functionality. This is important for a number of Biopython modules that deal with number processing. The main web site for Numeric is: http://sourceforge.net/projects/numpy and downloads are available from:..." I think this is misleading. BTW, is numpy-1.1.0 supported? 
From bugzilla-daemon at portal.open-bio.org Wed Jun 11 04:47:32 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 11 Jun 2008 04:47:32 -0400 Subject: [Biopython-dev] [Bug 2511] setup.py problem with del sys.modules["Martel"] In-Reply-To: Message-ID: <200806110847.m5B8lWxd010254@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2511 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-11 04:47 EST ------- Patch checked into CVS as Biopython/setup.py revision 1.133, marking this bug as fixed. The issue I raised in comment 3 is still outstanding (external ImportErrors and the unit tests). We may want to file a separate bug, or discuss this on the dev mailing list. 
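The outstanding idea from comment 3, an exception that is both a missing-dependency error and an ImportError, could look roughly like the sketch below. The MissingExternalDependencyError class defined here is a local stand-in for Biopython's own exception, and MissingExternalImportError and import_external are hypothetical names invented for this illustration.

```python
import importlib


class MissingExternalDependencyError(Exception):
    """Local stand-in for Biopython's exception of the same name."""


class MissingExternalImportError(MissingExternalDependencyError, ImportError):
    """Hypothetical error that is caught by handlers for either base class."""


def import_external(module_name):
    """Import a third-party module, re-raising failures as a missing dependency."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise MissingExternalImportError(
            "External dependency %r could not be imported" % module_name
        ) from err


try:
    import_external("surely_not_an_installed_module_xyz")
except MissingExternalDependencyError as err:
    # Existing code that catches ImportError would also see this error.
    print(isinstance(err, ImportError), err)
```

A test runner could then skip tests on MissingExternalDependencyError while still reporting a plain ImportError raised inside Biopython as a genuine failure.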
From bugzilla-daemon at portal.open-bio.org Wed Jun 11 04:53:30 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 11 Jun 2008 04:53:30 -0400 Subject: [Biopython-dev] [Bug 2516] Make it clear what is numeric and what is numpy In-Reply-To: Message-ID: <200806110853.m5B8rU2t010552@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2516 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-11 04:53 EST ------- That text is rather out of date - if you are familiar with the history of Numeric, numarray and numpy you'll know that the old module used with "import Numeric" was called Numerical Python or NumPy for short. This shorthand was used in lots of documentation (not just in Biopython). I think the choice to call the third generation of the array packages numpy has caused a lot of confusion. See http://numpy.scipy.org/#older_array We had updated the Biopython website and other bits of documentation, but had missed this one. Thank you for pointing this out. P.S. Supporting numpy instead of Numeric is Biopython Bug 2251. 
From bugzilla-daemon at portal.open-bio.org Wed Jun 11 05:04:47 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 11 Jun 2008 05:04:47 -0400 Subject: [Biopython-dev] [Bug 2507] Adding __getitem__ to SeqRecord for element access and slicing In-Reply-To: Message-ID: <200806110904.m5B94li8011303@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2507 ------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-11 05:04 EST ------- I raised the issue of evaluating a SeqRecord as a boolean with a proposal that we could add __len__ but also add __nonzero__ to ensure that any SeqRecord evaluates as True (even if the sequence is of length zero): http://lists.open-bio.org/pipermail/biopython-dev/2008-June/003756.html Michiel was in favour of this: > +1 for adding a __len__ method, with a __nonzero__ method to let all SeqRecord > objects evaluate as true. The patch isn't ready yet because, in addition, it doesn't yet deal with the SeqFeature objects. I think the SeqFeature class needs a _shift(offset) method to return a copy of itself with its location (and the locations of any sub-features) adjusted. I'm still not sure about handling strides, and I am tempted to rule that if a stride other than one is used then the features of the SeqRecord are lost. 
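The combination agreed on in comment 7 can be sketched as follows. AlwaysTrueRecord is a hypothetical toy class, not the patch itself. The hook was named __nonzero__ in the Python 2 releases current at the time and is named __bool__ in Python 3, so the sketch defines both names.

```python
class AlwaysTrueRecord:
    """Toy record with a length that still always evaluates as True."""

    def __init__(self, seq):
        self.seq = seq

    def __len__(self):
        return len(self.seq)

    def __bool__(self):
        # Takes priority over __len__ when the object is truth-tested.
        return True

    # Python 2 looked up this name instead of __bool__.
    __nonzero__ = __bool__


empty = AlwaysTrueRecord("")
print(len(empty), bool(empty))
```

With this in place, code written as "if record:" keeps working for records that have no sequence, while len(record) still reports zero.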
From bugzilla-daemon at portal.open-bio.org Wed Jun 11 09:57:56 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 11 Jun 2008 09:57:56 -0400 Subject: [Biopython-dev] [Bug 2507] Adding __getitem__ to SeqRecord for element access and slicing In-Reply-To: Message-ID: <200806111357.m5BDvu1I024400@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2507 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #928 is|0 |1 obsolete| | ------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-11 09:57 EST ------- Created an attachment (id=937) --> (http://bugzilla.open-bio.org/attachment.cgi?id=937&action=view) Patch to Bio/SeqRecord.py and Bio/SeqFeature.py This modifies the SeqRecord to give it __getitem__ (supporting sliced annotations including features), __len__ (to return the length of the sequence), __nonzero__ (to ensure any SeqRecord evaluates as True regardless of the length of its sequence) and __iter__ (to explicitly support iteration over the sequence with a docstring). As part of this, assorted objects in SeqFeature.py get a private _shift() method taking an integer offset to return a copy of itself with an adjusted location. Note that slices with a stride (other than one) will result in the features being lost. Handling (positive) strides would require careful consideration of whether an exact location is still present and, if not, replacing it with either a fuzzy position or a range. Negative strides are worse! The current set of unit tests seems fine, but additional checks would need to be added to validate this new behaviour. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
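The slicing behaviour described in comment #8 might look roughly like the sketch below. This is not the attached patch: ToyFeature and ToySlicedRecord are hypothetical classes, locations are plain integer start/end pairs, and keeping only features that lie wholly inside the slice is an assumption made for the illustration. What it does follow from the comment is the _shift(offset) idea and the rule that a stride other than one drops the features.

```python
import copy


class ToyFeature(object):
    """Hypothetical feature with a simple half-open [start, end) location."""

    def __init__(self, start, end, label):
        self.start = start
        self.end = end
        self.label = label

    def _shift(self, offset):
        # Return a shifted copy; the original feature is left untouched.
        shifted = copy.copy(self)
        shifted.start += offset
        shifted.end += offset
        return shifted


class ToySlicedRecord(object):
    """Hypothetical record supporting indexing, slicing and iteration."""

    def __init__(self, seq, features=None):
        self.seq = seq
        self.features = list(features or [])

    def __len__(self):
        return len(self.seq)

    def __bool__(self):
        return True

    __nonzero__ = __bool__  # Python 2 spelling

    def __iter__(self):
        return iter(self.seq)

    def __getitem__(self, index):
        if isinstance(index, int):
            # A single position returns just that letter.
            return self.seq[index]
        start, stop, step = index.indices(len(self.seq))
        sub = ToySlicedRecord(self.seq[index])
        if step == 1:
            # Assumption: keep only features wholly inside the slice.
            for feature in self.features:
                if start <= feature.start and feature.end <= stop:
                    sub.features.append(feature._shift(-start))
        # Any other stride: the features are deliberately lost.
        return sub


rec = ToySlicedRecord("ACGTACGTAC",
                      [ToyFeature(2, 5, "a"), ToyFeature(6, 9, "b")])
sub = rec[2:7]
print(sub.seq)                                        # GTACG
print([(f.label, f.start, f.end) for f in sub.features])
```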
From bugzilla-daemon at portal.open-bio.org Wed Jun 11 11:26:59 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 11 Jun 2008 11:26:59 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806111526.m5BFQxMw029057@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #9 from mdehoon at ims.u-tokyo.ac.jp 2008-06-11 11:26 EST ------- I "fixed" SwissProt.SProt.Iterator by deprecating it. Instead of SwissProt.SProt.Iterator, we recommend using Bio.SwissProt.parse and Bio.SeqIO.parse. Next on the to-do list is SwissProt.KeyWList.extract_keywords. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jun 12 10:23:16 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 12 Jun 2008 10:23:16 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806121423.m5CENG95026678@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #10 from mdehoon at ims.u-tokyo.ac.jp 2008-06-12 10:23 EST ------- SwissProt.KeyWList.extract_keywords could only parse very old SwissProt files. I deprecated it and wrote a new function "parse" that parses current SwissProt files. This function does not do the file-like check. Prosite.Iterator and Prosite.Prodoc.Iterator are next. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From fkauff at biologie.uni-kl.de Thu Jun 12 10:33:56 2008 From: fkauff at biologie.uni-kl.de (Frank Kauff) Date: Thu, 12 Jun 2008 16:33:56 +0200 Subject: [Biopython-dev] CVS access and developers web site In-Reply-To: <320fb6e00805291446x1cebf67bpe3e0818af5b9a7c5@mail.gmail.com> References: <483E7578.50402@biologie.uni-kl.de> <320fb6e00805291446x1cebf67bpe3e0818af5b9a7c5@mail.gmail.com> Message-ID: <485133D4.2060405@biologie.uni-kl.de> Peter Cock wrote: > Hi Frank, > > I would try emailing support at helpdesk.open-bio.org using the email > address associated with your CVS username. If you've changed email > address, and you run into problems, I expect Michiel or I could vouch > for you. > Is somebody monitoring that email address? I got an automated response about two weeks ago, and then nothing happened. > For the website, the wiki usernames are entirely separate and you > should be able to create a new account if you don't have one already. > If you want to update the tutorial new HTML and PDF files are loaded > with each release from the version in CVS. > Thanks Peter, got access to the wiki and updated personal data. Frank > Peter > > On Thu, May 29, 2008 at 10:20 AM, Frank Kauff wrote: > >> Hi folks, >> >> although I've been quiet for a while, I'm still doing some changes to the >> Nexus parser of biopython from time to time.... I totally lost my passwords >> to access the repository. Could someone please send me a new password to get >> write access to cvs? And I would also like to change the information on the >> biopython developers web site, as they are somewhat outdated. >> And is this the right place to ask for such things? >> >> Thanks! 
>> >> Frank >> > > From bugzilla-daemon at portal.open-bio.org Thu Jun 12 11:42:58 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 12 Jun 2008 11:42:58 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806121542.m5CFgw9t029594@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #11 from cracka80 at gmail.com 2008-06-12 11:42 EST ------- Maybe it's a good idea for any parsers/iterators to just use the iterator-like ability of file handles? Writers would have to function slightly differently, but since file objects, StringIOs and any other file-like objects must provide an __iter__ method, it's probably a good idea to take that into consideration when developing a common interface. In addition, writers could output iterators or generators, so that they can be chained together to operate on files. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jun 13 12:24:29 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 13 Jun 2008 12:24:29 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806131624.m5DGOTKw025954@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #12 from mdehoon at ims.u-tokyo.ac.jp 2008-06-13 12:24 EST ------- (In reply to comment #11) > Maybe it's a good idea for any parsers/iterators to just use the iterator-like > ability of file handles? In principle, yes. In practice, it's not so easy because many parsers in Biopython follow the framework in Bio.ParserSupport. These parsers are not really written to deal with lines pulled one-by-one from a file handle. 
To reconcile these two, I pull out data line-by-line from the file handle, store it in a string, and then call the parser to parse it. This is not ideal, and it may be a good idea for Biopython at some point to change its parser strategy. > Writers would have to function slightly differently, > but since file objects, StringIOs and any other file-like objects must provide > an __iter__ method, it's probably a good idea to take that into consideration > when developing a common interface. In addition, writers could output > iterators or generators, so that they can be chained together to operate > on files. > Writers should also be able to just print the record to the screen. I don't see how that is easily achievable with generators. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jun 13 12:27:47 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 13 Jun 2008 12:27:47 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806131627.m5DGRlTE026072@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #13 from mdehoon at ims.u-tokyo.ac.jp 2008-06-13 12:27 EST ------- Medline.Iterator, Prosite.Iterator, and Prosite.Prodoc.Iterator are now fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
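The suggestion in comments #11 and #12, relying only on the line-iteration ability of a handle, can be sketched as a generator that accepts any iterable of lines. The record format here (records terminated by a "//" line) and the function name iter_records are assumptions chosen for the illustration; this is not how the Bio.ParserSupport based parsers work.

```python
from io import StringIO


def iter_records(lines):
    """Yield each record as a list of lines; a '//' line ends a record.

    `lines` may be a file handle, a StringIO, a list of strings or any
    other iterable of lines, so no explicit file-type check is needed.
    """
    record = []
    for line in lines:
        line = line.rstrip("\n")
        if line == "//":
            yield record
            record = []
        else:
            record.append(line)
    if record:
        # Tolerate a final record that lacks its terminator.
        yield record


handle = StringIO(u"ID   first\nDE   example\n//\nID   second\n//\n")
for rec in iter_records(handle):
    print(rec[0])
```

Because only one record is held at a time, the same function works unchanged on a large file on disk or on a network handle.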
From bugzilla-daemon at portal.open-bio.org Fri Jun 13 22:29:13 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 13 Jun 2008 22:29:13 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806140229.m5E2TDdD014417@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #14 from mdehoon at ims.u-tokyo.ac.jp 2008-06-13 22:29 EST ------- I deprecated Bio.Gobase, since no users came forward on the mailing list. Bio.Rebase is also problematic. It parses HTML from the Rebase database, but it was written in 2000 and cannot parse current HTML from Rebase (which looks completely different from the HTML used in 2000). I'll ask on the mailing list if anybody is willing to update Bio.Rebase. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mjldehoon at yahoo.com Fri Jun 13 22:34:05 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 13 Jun 2008 19:34:05 -0700 (PDT) Subject: [Biopython-dev] Bio.Rebase Message-ID: <237761.5963.qm@web62409.mail.re1.yahoo.com> Hi everybody, As part of bug #2454 on Bugzilla, I am looking at the Bio.Rebase module. This module parses files (in HTML format) from the Rebase database: http://rebase.neb.com/rebase/rebase.html Unfortunately, since this module was written (in 2000) the HTML format used by the Rebase database has changed completely. This module is therefore not able to parse current Rebase HTML files. Is anybody willing to update Bio.Rebase (either by updating the HTML parser, or preferably by writing a parser for plain-text output from Bio.Rebase)? If not, I think this module should be deprecated. --Michiel. 
From bugzilla-daemon at portal.open-bio.org Fri Jun 13 22:50:42 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 13 Jun 2008 22:50:42 -0400 Subject: [Biopython-dev] [Bug 2516] Make it clear what is numeric and what is numpy In-Reply-To: Message-ID: <200806140250.m5E2ogvf014920@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2516 ------- Comment #2 from mdehoon at ims.u-tokyo.ac.jp 2008-06-13 22:50 EST ------- According to the Numerical Python website, the NumPy documentation will become freely available on September 1, 2008. That would be a good time to start thinking seriously about converting from the "old" Numerical Python to the "new" NumPy 1.1. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mjldehoon at yahoo.com Fri Jun 13 22:46:37 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 13 Jun 2008 19:46:37 -0700 (PDT) Subject: [Biopython-dev] Bio.SCOP maintainer? Message-ID: <523172.98428.qm@web62402.mail.re1.yahoo.com> Still looking at Bug 2454 (http://bugzilla.open-bio.org/show_bug.cgi?id=2454). To fix this bug, I'd like to make some changes to Bio.SCOP. Is anybody currently maintaining Bio.SCOP? The changes I'd like to make are small, but it would be better to discuss with the Bio.SCOP maintainer (if there is one) so I won't get in their way. --Michiel. 
From bugzilla-daemon at portal.open-bio.org Sat Jun 14 05:52:09 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 14 Jun 2008 05:52:09 -0400 Subject: [Biopython-dev] [Bug 2488] Adding XML parsers to Bio.Entrez In-Reply-To: Message-ID: <200806140952.m5E9q9X9032018@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2488 mdehoon at ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #8 from mdehoon at ims.u-tokyo.ac.jp 2008-06-14 05:52 EST ------- We now have parsers for XML returned by Entrez, provided the corresponding DTDs are available. Bio/Entrez/DTDs contains most (all?) DTDs currently used by Entrez. If later some DTDs appear to be missing, we can simply add them to Bio/Entrez/DTDs. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Jun 14 06:29:12 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 14 Jun 2008 06:29:12 -0400 Subject: [Biopython-dev] [Bug 2516] Make it clear what is numeric and what is numpy In-Reply-To: Message-ID: <200806141029.m5EATC64001227@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2516 mdehoon at ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #3 from mdehoon at ims.u-tokyo.ac.jp 2008-06-14 06:29 EST ------- Updated the installation instructions (in CVS, at least). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From p.j.a.cock at googlemail.com Sat Jun 14 18:51:26 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 14 Jun 2008 23:51:26 +0100 Subject: [Biopython-dev] CVS access and developers web site In-Reply-To: <485133D4.2060405@biologie.uni-kl.de> References: <483E7578.50402@biologie.uni-kl.de> <320fb6e00805291446x1cebf67bpe3e0818af5b9a7c5@mail.gmail.com> <485133D4.2060405@biologie.uni-kl.de> Message-ID: <320fb6e00806141551t56422a98v752e34bbbb38d0aa@mail.gmail.com> >> Hi Frank, >> >> I would try emailing support at helpdesk.open-bio.org using the email >> address associated with your CVS username. If you've changed email >> address, and you run into problems, I expect Michiel or I could vouch >> for you. >> > > Is somebody monitoring that email address? I got an automated response about > two weeks ago, and then nothing happened. > Maybe someone is on holiday - or they are caught up with BOSC 2008 work? I can suggest a few specific people at OBF to try and contact directly if you are still stuck. In the short term, if there are any urgent fixes you think need to be checked in, stick them on Bugzilla and I'm sure one of us will be able to commit them on your behalf. Peter From bugzilla-daemon at portal.open-bio.org Sun Jun 15 03:03:18 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 15 Jun 2008 03:03:18 -0400 Subject: [Biopython-dev] [Bug 2468] Tutorial needs a fix: Bio.WWW.NCBI In-Reply-To: Message-ID: <200806150703.m5F73IF2007099@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2468 mdehoon at ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #6 from mdehoon at ims.u-tokyo.ac.jp 2008-06-15 03:03 EST ------- I created a subsection Examples to the tutorial chapter on Bio.Entrez, and added the example from section 2.5 and Martin's taxonomy example to it. 
With the Bio.Entrez currently in CVS, finding the lineage works as follows:
>>> handle = Entrez.esearch(db="Taxonomy", term="Cypripedioideae")
>>> record = Entrez.read(handle)
>>> record["IdList"]
['158330']
>>> handle = Entrez.efetch(db="Taxonomy", id="158330", retmode='xml')
>>> records = Entrez.read(handle)
>>> records[0]['Lineage']
'cellular organisms; Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta; Liliopsida; Asparagales; Orchidaceae'
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon Jun 16 15:23:43 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 16 Jun 2008 15:23:43 -0400 Subject: [Biopython-dev] [Bug 2507] Adding __getitem__ to SeqRecord for element access and slicing In-Reply-To: Message-ID: <200806161923.m5GJNhZw012022@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2507 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #937 is|0 |1 obsolete| | ------- Comment #9 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-16 15:23 EST ------- Created an attachment (id=942) --> (http://bugzilla.open-bio.org/attachment.cgi?id=942&action=view) Patch to Bio/SeqRecord.py and Bio/SeqFeature.py I've checked in the SeqRecord __len__ and __nonzero__ methods with CVS Bio/SeqRecord.py revision 1.17 The earlier __getitem__ and __iter__ patch has been updated accordingly. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon Jun 16 16:08:00 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 16 Jun 2008 16:08:00 -0400 Subject: [Biopython-dev] [Bug 1944] Align.Generic adding iterator and more In-Reply-To: Message-ID: <200806162008.m5GK80bv014002@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1944 ------- Comment #15 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-16 16:07 EST ------- Created an attachment (id=943) --> (http://bugzilla.open-bio.org/attachment.cgi?id=943&action=view) Minimal __getitem__ method for generic alignment This patch just adds a __getitem__ to the alignment which ONLY accepts a single integer index and returns the corresponding SeqRecord object. I propose to add this NOW, as I think even just this is a worthwhile improvement. This is a natural expectation given the current __iter__ behaviour and the model of the alignment as a list of SeqRecord objects. Its also part of the more rich behaviour discussed above, which we can add more easily if/when the SeqRecord gets a __getitem__ method (bug 2507). Comments on this particular patch? Should we add __len__ at the same time giving the number of rows in the alignments? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From jblanca at btc.upv.es Tue Jun 17 03:35:38 2008 From: jblanca at btc.upv.es (Jose Blanca) Date: Tue, 17 Jun 2008 09:35:38 +0200 Subject: [Biopython-dev] [BioPython] Ace contig files in Bio.SeqIO or Bio.AlignIO In-Reply-To: <320fb6e00806160701l428584c0i30acac57338b9357@mail.gmail.com> References: <320fb6e00806160701l428584c0i30acac57338b9357@mail.gmail.com> Message-ID: <200806170935.38904.jblanca@btc.upv.es> Hi: My main use of the Alignment class is to parse Ace files. I've been thinking about that problem recently. 
My proposal to modify SeqRecord was due to this problem. I think that the best solution would be to treat the Alignment as a sequence. The consensus would be the actual sequences and the aligned read would be features with per-base-annotations. I've implemented such a class and it works fine for me. In fact the Alignment class is just a wrapper around a standard SeqRecord (I name it RichSeq in my implementation). To do that you just need a SeqRecord with a __getitem__ method. You have already proposing that so that's not a problem. Padding with spaces is not an option when you're dealing with genomic wide alignments, that's one of the problems of the actual Alignment class. If you want I can send my implementation to the list, although it could take a while because I've got my home computer dead. Best regards, Jose Blanca On Monday 16 June 2008 16:01:31 Peter wrote: > I've recently had to deal with some contig files in the Ace format > (output by CAP3, but many assembly files will produce this output). > > We have a module for parsing Ace files in Biopython, > Bio.Sequencing.Ace but I was wondering about integrating this into the > Bio.SeqIO or Bio.AlignIO framework. > http://www.biopython.org/wiki/SeqIO > http://www.biopython.org/wiki/AlignIO > > I'd like to hear from anyone currently using Ace files, on how they > tend to treat the data - and if they think a SeqRecord or Alignment > based representation would be useful. > > Each contig in an Ace file could be treated as a SeqRecord using the > consensus sequence. The identifiers of each sub-sequence used to > build the consensus could be stored as database cross-references, or > perhaps we could store these as SeqFeatures describing which part of > the consensus they support. This would then fit into Bio.SeqIO quite > well. > > Alternatively, each contig could be treated as an alignment (with a > consensus) and integrated into Bio.AlignIO. 
One drawback for this is > doing this with the current generic alignment class would require > padding the start and/or end of each sequence with gaps in order to > make every sequence the same length. However, if we did this (or > created a more specialised alignment class), the Ace file format would > then fit into Bio.AlignIO too. > > So, Ace users - would either (or both) of the above approaches make > sense for how you use the Ace contig files? > > Thanks > > Peter > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From biopython at maubp.freeserve.co.uk Tue Jun 17 04:46:22 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 17 Jun 2008 09:46:22 +0100 Subject: [Biopython-dev] [BioPython] Ace contig files in Bio.SeqIO or Bio.AlignIO In-Reply-To: <200806170935.38904.jblanca@btc.upv.es> References: <320fb6e00806160701l428584c0i30acac57338b9357@mail.gmail.com> <200806170935.38904.jblanca@btc.upv.es> Message-ID: <320fb6e00806170146j6f1843e6hed4166ad62c84423@mail.gmail.com> On Tue, Jun 17, 2008 at 8:35 AM, Jose Blanca wrote: > Hi: > My main use of the Alignment class is to parse Ace files. I've been thinking > about that problem recently. My proposal to modify SeqRecord was due to this > problem. I think that the best solution would be to treat the Alignment as a > sequence. The consensus would be the actual sequences and the aligned read > would be features with per-base-annotations. So integrating the "ace" format into Bio.SeqIO representing the consensus sequence of each contig as a SeqRecord would be useful. 
Initially I would try and represent the aligned reads as SeqFeature objects (much like when reading a genome from a GenBank file you get CDS features with their amino acid translation). Note that for memory reasons, I would be inclined to scan over the Ace file in one pass (using the existing Iterator in the Bio.Sequencing.Ace parser) returning SeqRecords as we go. As Frank points out in the code comments, this means we can't easily include the WA, CT, RT and WR tags found in the Ace file footer. Do you use this information Jose? > I've implemented such a class > and it works fine for me. In fact the Alignment class is just a wrapper > around a standard SeqRecord (I name it RichSeq in my implementation). > To do that you just need a SeqRecord with a __getitem__ method. You have > already proposing that so that's not a problem. Your enthusiasm Jose is one of the things motivating me to try and do more with the Seq and SeqRecord. Without a third party to offer feedback, making big changes is risky. > Padding with spaces is not an option when you're dealing with genomic wide > alignments, that's one of the problems of the actual Alignment class. It might make sense to talk about a "Contig Alignment" object/class, compared to the existing "multiple sequence alignment" object/class where all the sequences are the same length. Ideally these should provide as similar an API as possible - even if the internals are different. One idea is a sub-class of the current alignment class which stores an offset (>=0) for each supporting read, used when accessing columns. Maybe we should check out BioPerl etc for inspiration? > If you want I can send my implementation to the list, although it could take a > while because I've got my home computer dead. Good luck with the broken computer - I hope you have an easier time fixing it / rebuilding it than I did last time this happened to me. 
Regards, Peter From biopython at maubp.freeserve.co.uk Tue Jun 17 05:16:29 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 17 Jun 2008 10:16:29 +0100 Subject: [Biopython-dev] Iterating over Ace contig files Message-ID: <320fb6e00806170216k12ecd88fof60758db1ccec3cf@mail.gmail.com> Hello Frank, I wanted to get your opinion on iterating over the Ace file contig by contig, and what is lost in the WA, CT, RT and WR tags at the end of the file by doing this. As large sequencing runs become more common, iterating over the file in a single pass WITHOUT keeping everything in memory does seem to be desirable. Similar past discussions: http://portal.open-bio.org/pipermail/biopython/2004-February/001825.html http://portal.open-bio.org/pipermail/biopython/2005-May/002661.html Would you object to me rewording your module's header-comment not to say that the Ace Iterator is NOT deprecated, but rather that it has certain drawbacks. [The context for this is my recent thread on the Biopython dev mailing list about integrating your Bio.Sequencing.Ace parser into Bio.SeqIO and/or Bio.AlignIO - I've included a little context below.] Thanks, Peter -- Peter wrote: >> So integrating the "ace" format into Bio.SeqIO representing the >> consensus sequence of each contig as a SeqRecord would be useful. >> Initially I would try and represent the aligned reads as SeqFeature >> objects (much like when reading a genome from a GenBank file you get >> CDS features with their amino acid translation). >> >> Note that for memory reasons, I would be inclined to scan over the Ace >> file in one pass (using the existing Iterator in the >> Bio.Sequencing.Ace parser) returning SeqRecords as we go. As Frank >> points out in the code comments, this means we can't easily include >> the WA, CT, RT and WR tags found in the Ace file footer. Do you use >> this information Jose? Jose replied, > I haven't used the iterator because of the deprecation warning of the code. 
I > tried with about 40000 alignments and it worked in a computer with 8 GB of ram. > If there are more sequences, and there will be with the 454 sequencer, we will > have trouble reading all at once. I vote for the iterator approach. I have not > used the information in these tags, and I don't know what they mean either. I've > been looking for documentation about this format, but I've found none, do you > have any good ace documentation? From bugzilla-daemon at portal.open-bio.org Tue Jun 17 07:23:59 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 17 Jun 2008 07:23:59 -0400 Subject: [Biopython-dev] [Bug 2520] New: Reading ACE assembly contig files in Bio.SeqIO Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2520 Summary: Reading ACE assembly contig files in Bio.SeqIO Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk As I suggested on the mailing list, we could use Bio.Sequencing.Ace to parse ACE assembly files, and then turn each contig into a SeqRecord using the consensus sequence. I will attach a basic implementation which only uses the consensus sequence and its name. For now this ignores all the meta data and in particular the read information. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
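The "Contig Alignment" idea raised earlier in this thread, storing a non-negative offset for each supporting read instead of padding every read with gaps, could be sketched as follows. ToyContig and its method names are hypothetical and written only for this illustration; nothing here reflects an existing Biopython class or the attached AceIO code.

```python
class ToyContig(object):
    """Hypothetical contig: a consensus plus reads placed by offset."""

    def __init__(self, consensus):
        self.consensus = consensus
        self.reads = []  # list of (name, offset, sequence) tuples

    def add_read(self, name, offset, sequence):
        if offset < 0:
            raise ValueError("offset must be >= 0")
        self.reads.append((name, offset, sequence))

    def column(self, position):
        """Return {read name: base} for reads covering this position."""
        bases = {}
        for name, offset, sequence in self.reads:
            local = position - offset
            if 0 <= local < len(sequence):
                bases[name] = sequence[local]
        return bases


contig = ToyContig("ACGTACGT")
contig.add_read("r1", 0, "ACGT")
contig.add_read("r2", 2, "GTAC")
print(contig.column(3))
print(contig.column(5))
```

Memory use then grows with the total length of the reads rather than with the number of reads multiplied by the contig length, which is the concern raised about padding genome-wide alignments.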
From bugzilla-daemon at portal.open-bio.org Tue Jun 17 07:29:15 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 17 Jun 2008 07:29:15 -0400 Subject: [Biopython-dev] [Bug 2520] Reading ACE assembly contig files in Bio.SeqIO In-Reply-To: Message-ID: <200806171129.m5HBTFVG026790@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2520 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-17 07:29 EST ------- Created an attachment (id=944) --> (http://bugzilla.open-bio.org/attachment.cgi?id=944&action=view) New file Bio/SeqIO/AceIO.py This new file would be added to Bio.SeqIO in the usual way (updating Bio/SeqIO/__init__.py to import this module and map the format "ace" to the new iterator). Handling different gap characters in Bio.SeqIO (and translating them when reading and writing files) has not been formalised. Where possible, converting them into dashes on loading seems to be a sensible route to take. Therefore I deliberately map any "*" gap characters in the consensus sequence into "-" characters, which are used by default in the alphabet class and are far more commonly used. The "*" character is typically associated with a stop codon in protein sequences, which is another reason to avoid using it here. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From fkauff at biologie.uni-kl.de Tue Jun 17 09:06:34 2008 From: fkauff at biologie.uni-kl.de (Frank Kauff) Date: Tue, 17 Jun 2008 15:06:34 +0200 Subject: [Biopython-dev] Iterating over Ace contig files In-Reply-To: <320fb6e00806170216k12ecd88fof60758db1ccec3cf@mail.gmail.com> References: <320fb6e00806170216k12ecd88fof60758db1ccec3cf@mail.gmail.com> Message-ID: <4857B6DA.9040309@biologie.uni-kl.de> Hi Peter, makes totally sense to me. 
Feel free to do the changes as you see it fit Frank Peter wrote: > Hello Frank, > > I wanted to get your opinion on iterating over the Ace file contig by > contig, and what is lost in the WA, CT, RT and WR tags at the end of > the file by doing this. As large sequencing runs become more common, > iterating over the file in a single pass WITHOUT keeping everything in > memory does seem to be desirable. > > Similar past discussions: > http://portal.open-bio.org/pipermail/biopython/2004-February/001825.html > http://portal.open-bio.org/pipermail/biopython/2005-May/002661.html > > Would you object to me rewording your module's header-comment not to > say that the Ace Iterator is NOT deprecated, but rather that it has > certain drawbacks. > > [The context for this is my recent thread on the Biopython dev mailing > list about integrating your Bio.Sequencing.Ace parser into Bio.SeqIO > and/or Bio.AlignIO - I've included a little context below.] > > Thanks, > > Peter > > -- > > Peter wrote: > >>> So integrating the "ace" format into Bio.SeqIO representing the >>> consensus sequence of each contig as a SeqRecord would be useful. >>> Initially I would try and represent the aligned reads as SeqFeature >>> objects (much like when reading a genome from a GenBank file you get >>> CDS features with their amino acid translation). >>> >>> Note that for memory reasons, I would be inclined to scan over the Ace >>> file in one pass (using the existing Iterator in the >>> Bio.Sequencing.Ace parser) returning SeqRecords as we go. As Frank >>> points out in the code comments, this means we can't easily include >>> the WA, CT, RT and WR tags found in the Ace file footer. Do you use >>> this information Jose? >>> > > Jose replied, > >> I haven't used the iterator because of the deprecation warning of the code. I >> tried with about 40000 alignments and it worked in a computer with 8 GB of ram. 
>> If there are more sequences, and there will be with the 454 sequencer, we will >> have trouble reading all at once. I vote for the iterator approach. I have not >> used the information of this tag, but I also don't know what they mean. I've >> been looking for documentation about this format, but I've found none, do you >> have any good ace documentation? >> > > From biopython at maubp.freeserve.co.uk Tue Jun 17 09:53:23 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 17 Jun 2008 14:53:23 +0100 Subject: [Biopython-dev] Iterating over Ace contig files In-Reply-To: <4857B6DA.9040309@biologie.uni-kl.de> References: <320fb6e00806170216k12ecd88fof60758db1ccec3cf@mail.gmail.com> <4857B6DA.9040309@biologie.uni-kl.de> Message-ID: <320fb6e00806170653g482b104fl739107fcada06dc8@mail.gmail.com> On Tue, Jun 17, 2008 at 2:06 PM, Frank Kauff wrote: > Hi Peter, > > makes totally sense to me. Feel free to do the changes as you see it fit > > Frank Thanks Frank. I've checked in some comment changes to both Ace.py and Phd.py, aimed at both improving the documentation and trying to make epydoc happier for the automatic API documentation: http://biopython.org/DIST/docs/api/ Peter P.S. I also added an __iter__ method to the Ace Iterator (Phd already had one). From mjldehoon at yahoo.com Tue Jun 17 10:08:31 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 17 Jun 2008 07:08:31 -0700 (PDT) Subject: [Biopython-dev] Iterating over Ace contig files In-Reply-To: <320fb6e00806170653g482b104fl739107fcada06dc8@mail.gmail.com> Message-ID: <399611.60966.qm@web62415.mail.re1.yahoo.com> Note that bug #2454 also pertains to the Ace and Phd parsers. If you are modifying the Ace and Phd parsers, can you fix this bug at the same time? http://bugzilla.open-bio.org/show_bug.cgi?id=2454 --Michiel. Peter wrote: On Tue, Jun 17, 2008 at 2:06 PM, Frank Kauff wrote: > Hi Peter, > > makes totally sense to me. Feel free to do the changes as you see it fit > > Frank Thanks Frank. 
I've checked in some comment changes to both Ace.py and Phd.py, aimed at both improving the documentation and trying and make epydoc happier for the automatic API documentation: http://biopython.org/DIST/docs/api/ Peter P.S. I also added an __iter__ method to the Ace Iterator (Phd already had one). _______________________________________________ Biopython-dev mailing list Biopython-dev at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython-dev From bugzilla-daemon at portal.open-bio.org Tue Jun 17 10:43:42 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 17 Jun 2008 10:43:42 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806171443.m5HEhgua005645@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #15 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-17 10:43 EST ------- I've removed the strict file-like test in: Bio/Sequencing/Ace.py revision: 1.12 Bio/Sequencing/Phd.py revision: 1.6 In these cases, the handle is immediately turned into an UndoHandle which will be able to check for a sufficiently file like object. Hopefully that's what you meant Michiel - we could go further and introduce a parse() function and deprecate the Iterator objects in these modules. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Wed Jun 18 06:34:43 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 18 Jun 2008 06:34:43 -0400 Subject: [Biopython-dev] [Bug 2503] An error when parsing NCBIWWW Blast output In-Reply-To: Message-ID: <200806181034.m5IAYhS1026214@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2503 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |INVALID ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-18 06:34 EST ------- I'm closing this bug as "INVALID" due to a lack of information. If you are still having trouble Prashantha, and can give us some more information, please re-open this bug. Thank you. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jun 18 07:34:26 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 18 Jun 2008 07:34:26 -0400 Subject: [Biopython-dev] [Bug 2497] Unit tests do not cover Bio.Blast.NCBIWWW.qblast() In-Reply-To: Message-ID: <200806181134.m5IBYQjC032061@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2497 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-18 07:34 EST ------- I checked in a slightly revised version of this as test_NCBI_qblast.py - marking this bug as fixed. 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jun 18 08:01:11 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 18 Jun 2008 08:01:11 -0400 Subject: [Biopython-dev] [Bug 2508] NCBIStandalone.blastall: provide support for '-F F' and make it safe In-Reply-To: Message-ID: <200806181201.m5IC1BxA001255@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2508 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-18 08:01 EST ------- Created an attachment (id=946) --> (http://bugzilla.open-bio.org/attachment.cgi?id=946&action=view) Patch to Bio/Blast/NCBIStandalone.py and Tests/test_NCBIStandalone.py Suggested patch for the command injection risk. Can anyone think of a legitimate reason for a ; or & character in the parameters of a BLAST command line? This patch is very simple and will reject any keyword parameter containing the ; or && characters. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
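A minimal sketch of the kind of check comment #3 describes, not the attached patch itself. The helper name check_blast_parameters and the exact token list are assumptions made for illustration:

```python
# Hypothetical helper, not the code from attachment 946.
# Tokens that would let a shell start a second command.
UNSAFE_TOKENS = (";", "&&")


def check_blast_parameters(**keywds):
    """Return the keyword parameters unchanged, or raise ValueError
    if any value contains a shell command separator."""
    for key, value in keywds.items():
        text = str(value)
        for token in UNSAFE_TOKENS:
            if token in text:
                raise ValueError(
                    "Rejecting suspicious parameter %s=%r" % (key, value)
                )
    return keywds


safe = check_blast_parameters(program="blastp", database="nr")

try:
    check_blast_parameters(database="nr; rm -rf /")
    rejected = False
except ValueError:
    rejected = True
```

Rejecting separators is a blunt filter. Passing the arguments to the subprocess module as a list, without a shell, would avoid the quoting problem altogether, at the cost of a larger change to the calling code.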
From biopython at maubp.freeserve.co.uk Wed Jun 18 10:00:56 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 18 Jun 2008 15:00:56 +0100 Subject: [Biopython-dev] SeqRecord to file format as string In-Reply-To: <6243BAA9F5E0D24DA41B27997D1FD14402B63C@mail2.exch.c2b2.columbia.edu> References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk> <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu> <4714EB8E.3000700@maubp.freeserve.co.uk> <6243BAA9F5E0D24DA41B27997D1FD14402B63C@mail2.exch.c2b2.columbia.edu> Message-ID: <320fb6e00806180700k327e6913m7ba9c4bdc3421f67@mail.gmail.com> This is returning to a thread last year, about getting a SeqRecord into a string in a particular file format (e.g. fasta). Jared Flatow had suggest adding a method to the SeqRecord itself. Jared wrote: > > ... To always have to write to a file feels strange, but I see > > that it would be messy to go OO since there are so many formats. > > However, giving preference to fasta over other formats by making it > > innate doesn't seem like such a terrible idea. I do have mixed > > feelings about 'bloating' the code which is why I asked, and you have > > convinced me that this is not quite appropriate given existing > > convention. However the idea would be to put the to_fasta or > > to_format method inside the SeqRecord, then to call it from the IO > > when needed to actually write to a file, but call it directly when > > all that is wanted is a string... > > Its debatable isn't it? I suspect that for most users, when they want a > record in a particular file format its for writing to a file. However, > adding a to_format() method to a SeqRecord some sense (suitable for > sequential file formats only). This would take a format name and return > a string, by calling Bio.SeqIO with a StringIO object internally. 
> > Peter Jared - On reflection, do you think adding a method like this to the SeqRecord (or even just for the FASTA format) would be useful? I recently found myself wanting to use this sort of functionality, and remembered this old thread. This time I was wondering about using the method name tostring (matching the name of a Seq object method). In order to mimic the Seq object's method, the format would be optional and when omitted would give the sequence as a string. Otherwise one of the lower case strings used in Bio.SeqIO should be supplied. There is a sample implementation at the end of this email. ? On Wed, Oct 17, 2007 Michiel De Hoon wrote: > How about the following: > > SeqIO.write(sequences, handle, format) returns the properly formatted string > if handle==None. I can see the above is simpler than having to supply a StringIO handle, but it doesn't make the functionality available directly from the SeqRecord object. It also complicates the API of the SeqIO module with a special case. Peter -- ###################################### For the SeqRecord class, in Bio/SeqRecord.py ###################################### def tostring(self, format=None) : """Returns the record as a string in the specified file format. If the file format is omitted (default), the sequence itself is returned as a string. 
Otherwise the format should be a lower case string supported by Bio.SeqIO, which is used to turn the SeqRecord into a string.""" if format : from StringIO import StringIO from Bio import SeqIO handle = StringIO() SeqIO.write([self], handle, format) handle.seek(0) return handle.read() else : #Return the sequence as a string return self.seq.tostring() ############################################ From jflatow at northwestern.edu Wed Jun 18 11:25:18 2008 From: jflatow at northwestern.edu (Jared Flatow) Date: Wed, 18 Jun 2008 10:25:18 -0500 Subject: [Biopython-dev] SeqRecord to file format as string In-Reply-To: <4D53AB82-F673-4F4F-BCEC-BA06088E8721@northwestern.edu> References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk> <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu> <4714EB8E.3000700@maubp.freeserve.co.uk> <6243BAA9F5E0D24DA41B27997D1FD14402B63C@mail2.exch.c2b2.columbia.edu> <320fb6e00806180700k327e6913m7ba9c4bdc3421f67@mail.gmail.com> <4D53AB82-F673-4F4F-BCEC-BA06088E8721@northwestern.edu> Message-ID: <55567F98-C5F5-4A2F-8542-502F17F485E9@northwestern.edu> Quick correction: On Jun 18, 2008, at 10:16 AM, Jared Flatow wrote: > Hi Peter, > > On Jun 18, 2008, at 9:00 AM, Peter wrote: > >> Jared - On reflection, do you think adding a method like this to the >> SeqRecord (or even just for the FASTA format) would be useful? > > Yes I still think so. In fact, for sequences, I would say that I > pretty much never deal with a format ever than FASTA, so even making > the __str__ method of SeqRecord return the FASTA format as well > seems reasonable, though perhaps my use cases are different than > others. > > However, py3k and 2.6 will make available the functionality > described in PEP 3101: > > http://www.python.org/dev/peps/pep-3101/ > > I think it would be best to define some semantics that are > compatible with this PEP. 
This would basically mean using the > __format__ method (which could be the same as the tostring method > you have defined below). To achieve backward compatibility and/or a > more OO interface, tostring could just be an alias for __format__. > Thus, instead of calling format(seq_rec, 'fasta') one could call > seq_rec.tostring('fasta') and these would be equivalent. The PEP > also states that format(seq_rec) should be the same as str(seq_rec). On second thought it seems like a .format method (similar to the one the string class is acquiring) should be used as an alias to __format__ (somehow I think tostring should always be the same as __str__) > In short, I think creating methods to return formatted versions of > objects (SeqRecords) is a good idea, but most especially if it is > done in a way consistent with the language's vision. > > Best, > jared From bugzilla-daemon at portal.open-bio.org Wed Jun 18 11:36:48 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 18 Jun 2008 11:36:48 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806181536.m5IFamvB015695@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #16 from mdehoon at ims.u-tokyo.ac.jp 2008-06-18 11:36 EST ------- (In reply to comment #15) > I've removed the strict file-like test in: > > Bio/Sequencing/Ace.py revision: 1.12 > Bio/Sequencing/Phd.py revision: 1.6 > > In these cases, the handle is immediately turned into an UndoHandle which will > be able to check for a sufficiently file like object. > > Hopefully that's what you meant Michiel Actually, I think we should avoid using an UndoHandle altogether, now that Python has generator functions. > - we could go further and introduce a > parse() function and deprecate the Iterator objects in these modules. > That would make things a lot easier. 
An Iterator class was useful in older versions of Python, but generator functions provide a cleaner alternative.

In Ace.py, we'd need three functions:
1) read(handle), which returns one record (Contig) read from the handle, and None otherwise;
2) parse(handle), a generator function returning an iterator over the records;
3) a local function _process_line(line, record)

These functions then look like this:

    def read(handle):
        record = None
        # Skip ahead to the first contig ('CO') line
        for line in handle:
            if line[:2]=='CO':
                break
        else:
            return None
        record = Contig()
        _process_line(line, record)
        for line in handle:
            if line[:2]=='CO':
                return record
            else:
                _process_line(line, record)
        # End of file reached: return the contig read so far
        return record

    def parse(handle):
        record = None
        for line in handle:
            if line[:2]=='CO':
                if record:
                    yield record
                record = Contig()
            if record:
                _process_line(line, record)
        # A generator must yield (not return) the final record
        if record:
            yield record

The actual work is done in _process_line. So we don't need to store the read lines explicitly; this is now taken care of by the generator function. Hence, we don't need to convert the handle to an UndoHandle. In addition, handle can now also be a list of lines instead of a file handle. In this respect, I think Zachary was right in comment #11:

> Maybe it's a good idea for any parsers/iterators to just
> use the iterator-like ability of file handles?

In other words, as long as we can pull lines from the handle, we can parse it.

In Phd.py, it's even simpler. Here, we only need the read() and parse() functions:

    def read(handle):
        for line in handle:
            if line.startswith("BEGIN_SEQUENCE"):
                record = Record()
            elif line.startswith("END_SEQUENCE"):
                return record
            else:
                # do the actual processing of the other lines here
                pass

    def parse(handle):
        while True:
            record = read(handle)
            if not record:
                return
            yield record

Again, we can process each line just as it comes along. No UndoHandle, no Parser, no Consumer, no Scanner needed.

-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
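The claim in comment #16 that a generator-based parse() accepts any iterable of lines can be tried with a toy example. This is only a sketch written for a current Python: the Contig class and _process_line below are simplified stand-ins (assumptions), not the real Bio.Sequencing.Ace code, and the sample text is invented.

```python
from io import StringIO


class Contig(object):
    """Hypothetical stand-in for the real Contig record class."""

    def __init__(self):
        self.name = None
        self.lines = []


def _process_line(line, record):
    """Toy line handler: remember the contig name, keep the raw lines."""
    if line[:2] == "CO":
        record.name = line.split()[1]
    record.lines.append(line.rstrip("\n"))


def parse(handle):
    """Yield one Contig per 'CO' block from any iterable of lines."""
    record = None
    for line in handle:
        if line[:2] == "CO":
            if record is not None:
                yield record
            record = Contig()
        if record is not None:
            # Lines before the first 'CO' (the file header) are ignored here.
            _process_line(line, record)
    if record is not None:
        yield record


# Invented sample data in a loosely Ace-like layout.
ace_text = "AS 2 3\nCO Contig1 8 2 1 U\nACGTACGT\nCO Contig2 4 1 1 U\nTTGA\n"

names_from_handle = [contig.name for contig in parse(StringIO(ace_text))]
names_from_list = [contig.name for contig in parse(ace_text.splitlines(True))]
```

Both calls walk the input once and hold only one contig at a time, which is the memory behaviour asked for earlier in the Ace thread.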
From jflatow at northwestern.edu Wed Jun 18 11:16:59 2008 From: jflatow at northwestern.edu (Jared Flatow) Date: Wed, 18 Jun 2008 10:16:59 -0500 Subject: [Biopython-dev] SeqRecord to file format as string In-Reply-To: <320fb6e00806180700k327e6913m7ba9c4bdc3421f67@mail.gmail.com> References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk> <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu> <4714EB8E.3000700@maubp.freeserve.co.uk> <6243BAA9F5E0D24DA41B27997D1FD14402B63C@mail2.exch.c2b2.columbia.edu> <320fb6e00806180700k327e6913m7ba9c4bdc3421f67@mail.gmail.com> Message-ID: <4D53AB82-F673-4F4F-BCEC-BA06088E8721@northwestern.edu> Hi Peter, On Jun 18, 2008, at 9:00 AM, Peter wrote: > Jared - On reflection, do you think adding a method like this to the > SeqRecord (or even just for the FASTA format) would be useful? Yes I still think so. In fact, for sequences, I would say that I pretty much never deal with a format ever than FASTA, so even making the __str__ method of SeqRecord return the FASTA format as well seems reasonable, though perhaps my use cases are different than others. However, py3k and 2.6 will make available the functionality described in PEP 3101: http://www.python.org/dev/peps/pep-3101/ I think it would be best to define some semantics that are compatible with this PEP. This would basically mean using the __format__ method (which could be the same as the tostring method you have defined below). To achieve backward compatibility and/or a more OO interface, tostring could just be an alias for __format__. Thus, instead of calling format(seq_rec, 'fasta') one could call seq_rec.tostring('fasta') and these would be equivalent. The PEP also states that format(seq_rec) should be the same as str(seq_rec). In short, I think creating methods to return formatted versions of objects (SeqRecords) is a good idea, but most especially if it is done in a way consistent with the language's vision. 
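A minimal sketch of the PEP 3101 semantics described here, using a toy class rather than the real SeqRecord. The class name ToyRecord, its attributes, and the single supported format are assumptions for illustration only:

```python
class ToyRecord(object):
    """Hypothetical stand-in for SeqRecord with PEP 3101 style formatting."""

    def __init__(self, id, seq, description=""):
        self.id = id
        self.seq = seq
        self.description = description

    def __str__(self):
        # For this toy class, the plain string form is just the sequence.
        return self.seq

    def __format__(self, format_spec):
        # PEP 3101: an empty format spec should behave like str(obj).
        if not format_spec:
            return str(self)
        if format_spec == "fasta":
            title = ("%s %s" % (self.id, self.description)).strip()
            return ">%s\n%s\n" % (title, self.seq)
        raise ValueError("Unknown format %r" % format_spec)

    def format(self, format_spec=""):
        """Object-oriented alias for the built-in format() function."""
        return self.__format__(format_spec)


record = ToyRecord("id1", "ACGT", "demo")
as_fasta = format(record, "fasta")
via_method = record.format("fasta")
via_template = "{0:fasta}".format(record)
plain = format(record)
```

With this layout format(record, "fasta"), record.format("fasta") and "{0:fasta}".format(record) all give the same text, and format(record) matches str(record).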
Best, jared From yair.benita at gmail.com Wed Jun 18 13:26:02 2008 From: yair.benita at gmail.com (Yair Benita) Date: Wed, 18 Jun 2008 13:26:02 -0400 Subject: [Biopython-dev] BioPax parser Message-ID: Hi Guys, Does anyone have a biopax parser written in python? Thanks, Yair From biopython at maubp.freeserve.co.uk Wed Jun 18 13:42:13 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 18 Jun 2008 18:42:13 +0100 Subject: [Biopython-dev] BioPax parser In-Reply-To: References: Message-ID: <320fb6e00806181042y169f580epbd8c876eb3cb57fa@mail.gmail.com> On Wed, Jun 18, 2008 at 6:26 PM, Yair Benita wrote: > Hi Guys, > Does anyone have a biopax parser written in python? > Thanks, > Yair I don't know of any (but I haven't searched). From a quick look on www.biopax.org they use XML, so you should be able to parse it in python fairly easily - but I guess some sort of object orientated representation of the data would be very nice to have. Peter From bugzilla-daemon at portal.open-bio.org Thu Jun 19 06:08:55 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 19 Jun 2008 06:08:55 -0400 Subject: [Biopython-dev] [Bug 2508] NCBIStandalone.blastall: provide support for '-F F' and make it safe In-Reply-To: Message-ID: <200806191008.m5JA8t0v016495@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2508 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-19 06:08 EST ------- On the issue of the low-complexity filter, that is actually already supported in NCBIStandalone.blastall(), NCBIStandalone.blastpgp() and NCBIStandalone.rpsblast() using the optional argument 'filter'. This is described in the doc string too, although it doesn't use the phrase "low complexity" which might be clearer. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Jun 19 06:20:03 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 19 Jun 2008 06:20:03 -0400 Subject: [Biopython-dev] [Bug 2494] _retrieve_taxon in BioSQL.py needs urgent optimization In-Reply-To: Message-ID: <200806191020.m5JAK3OZ017201@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2494 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-19 06:20 EST ------- I'm marking this as fixed now, but if anyone does find an issue with it please re-open the bug. Thanks for your work on this Eric. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jun 19 06:41:22 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 19 Jun 2008 06:41:22 -0400 Subject: [Biopython-dev] [Bug 2408] GenBank records do not contain U's In-Reply-To: Message-ID: <200806191041.m5JAfMNK018058@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2408 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-19 06:41 EST ------- Given there were no other opinions voiced on how to handle this, I went ahead and fixed this in Bio/GenBank/__init__.py CVS revision 1.83 For records from RNA, if the sequence contains T but not U, we will use a DNA alphabet in the Seq object. Thanks for raising this Marcin. 
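The rule stated in comment #3 can be sketched as a small function. This is an illustration only, not the code from CVS revision 1.83; the function name and the use of plain strings instead of alphabet objects are assumptions:

```python
def choose_nucleotide_type(sequence, declared_type):
    """Return "DNA" or "RNA" for a record, given its residues and
    the molecule type declared in the file."""
    residues = sequence.upper()
    if declared_type == "RNA" and "T" in residues and "U" not in residues:
        # Declared as RNA but written with thymine: treat the letters as DNA.
        return "DNA"
    return declared_type


mrna_written_as_dna = choose_nucleotide_type("ATGGCCATTGTA", "RNA")
true_rna = choose_nucleotide_type("AUGGCCAUUGUA", "RNA")
```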
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mjldehoon at yahoo.com Thu Jun 19 09:04:31 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 19 Jun 2008 06:04:31 -0700 (PDT) Subject: [Biopython-dev] Bio.CDD, anyone? Message-ID: <14893.84074.qm@web62409.mail.re1.yahoo.com> Hi everybody, Bio.CDD is a module with a parser for CDD (NCBI's Conserved Domain Database) records. The parser parses HTML pages from CDD's web site. Since the parser was written about six years ago, the CDD web site has changed considerably. Bio.CDD therefore cannot parse current HTML pages from CDD. So I am wondering: 1) Is anybody using Bio.CDD? 2) Is anybody willing to update Bio.CDD to handle current HTML? 3) If not, can we deprecate it? There is not much purpose of having a parser for HTML pages from years ago. --Michiel. From biopython at maubp.freeserve.co.uk Thu Jun 19 09:38:29 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 19 Jun 2008 14:38:29 +0100 Subject: [Biopython-dev] Bio.CDD, anyone? In-Reply-To: <14893.84074.qm@web62409.mail.re1.yahoo.com> References: <14893.84074.qm@web62409.mail.re1.yahoo.com> Message-ID: <320fb6e00806190638y2e3729e1ga66561de0c962700@mail.gmail.com> > Bio.CDD is a module with a parser for CDD (NCBI's Conserved Domain Database) > records. The parser parses HTML pages from CDD's web site. Since the parser > was written about six years ago, the CDD web site has changed considerably. > Bio.CDD therefore cannot parse current HTML pages from CDD. A couple of years ago, I wanted to get the CDD domain name and description and ended up writing my own very simple and crude parser to extract just this information. Doing a proper job would mean extracting lots and lots of fields, e.g. 
http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=29475 I wonder if the NCBI make any of this available as XML via Entrez? I had a quick look and couldn't find anything. Peter From mjldehoon at yahoo.com Thu Jun 19 09:58:25 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 19 Jun 2008 06:58:25 -0700 (PDT) Subject: [Biopython-dev] Bio.CDD, anyone? In-Reply-To: <320fb6e00806190638y2e3729e1ga66561de0c962700@mail.gmail.com> Message-ID: <352888.20937.qm@web62409.mail.re1.yahoo.com> > I wonder if the NCBI make any of this available as XML via Entrez? I > had a quick look and couldn't find anything. Actually I already asked this question to NCBI. Their answer was that a subset of the information shown on the web page is available as XML via Entrez's ESummary and EFetch (and thus available from Biopython). The full CDD records are stored as one large file, which is obtainable from NCBI's ftp site, but currently it is not possible to get individual CDD records except in HTML form through the NCBI website. --Michiel. Peter wrote: > Bio.CDD is a module with a parser for CDD (NCBI's Conserved Domain Database) > records. The parser parses HTML pages from CDD's web site. Since the parser > was written about six years ago, the CDD web site has changed considerably. > Bio.CDD therefore cannot parse current HTML pages from CDD. A couple of years ago, I wanted to get the CDD domain name and description and ended up writing my own very simple and crude parser to extract just this information. Doing a proper job would mean extracting lots and lots of fields, e.g. http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=29475 I wonder if the NCBI make any of this available as XML via Entrez? I had a quick look and couldn't find anything. Peter From biopython at maubp.freeserve.co.uk Thu Jun 19 17:08:13 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 19 Jun 2008 22:08:13 +0100 Subject: [Biopython-dev] test_Entrez.py fails on Linux? 
Message-ID: <320fb6e00806191408t45a45da8hda0c2fc8a39aae57@mail.gmail.com> Hi Michiel, I've just tried the unit tests on a clean checkout on Linux, and there is a problem with test_Entrez.py (shown below). I'm pretty sure it was working for me on Mac OS X this afternoon, so this may be platform specific. I haven't using Biopython on Windows recently so I don't know if that is working or not. If you can't reproduce this, let me know and I do some investigation here. The good news is all the other tests seem fine on Linux (bar the GFF, dnal and the population genetics tests for which I don't have the external dependencies installed). Peter This is the output I get on python 2.4.3, using 64bit Ubuntu Dapper Drake (a little old now). maubp at shuttle2:~/repository/biopython/Tests$ python test_Entrez.py Test parsing database list returned by EInfo ... ok Test parsing database info returned by EInfo ... ok Test parsing XML returned by ESearch from the Journals database ... ok Test parsing XML returned by ESearch when no items were found ... ok Test parsing XML returned by ESearch from the Nucleotide database ... ok Test parsing XML returned by ESearch from PubMed Central ... ok Test parsing XML returned by ESearch from the Protein database ... ok Test parsing XML returned by ESearch from PubMed (first test) ... ok Test parsing XML returned by ESearch from PubMed (second test) ... ok Test parsing XML returned by ESearch from PubMed (third test) ... ok Test parsing XML returned by EPost ... ok Test parsing XML returned by EPost with an invalid id (overflow tag) ... ok Test parsing XML returned by EPost with incorrect arguments ... ERROR Test parsing XML returned by ESummary from the Journals database ... ok Test parsing XML returned by ESummary from the Nucleotide database ... ok Test parsing XML returned by ESummary from the Protein database ... ok Test parsing XML returned by ESummary from PubMed ... ok Test parsing XML returned by ESummary from the Structure database ... 
ok Test parsing XML returned by ESummary from the Taxonomy database ... ok Test parsing XML returned by ESummary from the UniSTS database ... ok Test parsing XML returned by ESummary with incorrect arguments ... ERROR Test parsing cancerchromosomes links returned by ELink ... ok Test parsing medline indexed articles returned by ELink ... ok Test parsing Nucleotide to Protein links returned by ELink ... ok Test parsing pubmed links returned by ELink (first test) ... ok Test parsing pubmed links returned by ELink (second test) ... ok Test parsing pubmed link returned by ELink (third test) ... ok Test parsing pubmed links returned by ELink (fourth test) ... ok Test parsing pubmed links returned by ELink (fifth test) ... ok Test parsing pubmed links returned by ELink (sixth test) ... ok Test parsing XML returned by EFetch, Journals database ... ok Test parsing XML returned by EFetch, Nucleotide database (first test) ... ok Test parsing XML returned by EFetch, Protein database ... ok Test parsing XML returned by EFetch, OMIM database ... ok Test parsing XML returned by EFetch, PubMed database (first test) ... ok Test parsing XML returned by EFetch, PubMed database (second test) ... ok Test parsing XML returned by EFetch, Taxonomy database ... ok Test parsing XML output returned by EGQuery (first test) ... ok Test parsing XML output returned by EGQuery (second test) ... ok Test parsing XML output returned by ESpell ... 
ok ====================================================================== ERROR: Test parsing XML returned by EPost with incorrect arguments ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 560, in t_wrong assert exception.message=="Wrong DB name" AttributeError: RuntimeError instance has no attribute 'message' ====================================================================== ERROR: Test parsing XML returned by ESummary with incorrect arguments ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez.py", line 943, in t_wrong assert exception.message=="Neither query_key nor id specified" AttributeError: RuntimeError instance has no attribute 'message' ---------------------------------------------------------------------- Ran 40 tests in 0.471s FAILED (errors=2) From biopython at maubp.freeserve.co.uk Fri Jun 20 05:31:21 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 20 Jun 2008 10:31:21 +0100 Subject: [Biopython-dev] test_Entrez.py fails on Linux? In-Reply-To: <320fb6e00806191408t45a45da8hda0c2fc8a39aae57@mail.gmail.com> References: <320fb6e00806191408t45a45da8hda0c2fc8a39aae57@mail.gmail.com> Message-ID: <320fb6e00806200231y716c5a1ds2495f16a56a15f88@mail.gmail.com> > Hi Michiel, > > I've just tried the unit tests on a clean checkout on Linux, and there > is a problem with test_Entrez.py (shown below). I'm pretty sure it > was working for me on Mac OS X this afternoon, so this may be platform > specific. I haven't using Biopython on Windows recently so I don't > know if that is working or not. I've just checked, and on a clean CVS checkout under Mac OS 10.5 Leopard with python 2.5.2, test_Entrez.py passes. A clean check out last night on 64bit Ubuntu Dapper Drake with python 2.4.3 failed. So whatever is going wrong is probably OS specific or perhaps python version specific. 
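The two errors above are consistent with a Python version difference: the message attribute on exceptions was only introduced in Python 2.5, so it is absent on 2.4.3. A minimal sketch of a version-independent assertion, written for a current Python, where raise_wrong_db is a hypothetical stand-in for the code under test:

```python
def raise_wrong_db():
    """Hypothetical stand-in for the parser call that reports an error."""
    raise RuntimeError("Wrong DB name")


try:
    raise_wrong_db()
except RuntimeError as exception:
    # str(exception) and exception.args do not depend on the
    # 'message' attribute, so they work across Python versions.
    text = str(exception)
    first_argument = exception.args[0]
```

Comparing str(exception) against the expected text avoids relying on an attribute that only some Python versions provide.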
Peter From bugzilla-daemon at portal.open-bio.org Fri Jun 20 06:07:59 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 20 Jun 2008 06:07:59 -0400 Subject: [Biopython-dev] [Bug 2524] New: Handle missing libraries like TextTools in run_tests.py Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2524 Summary: Handle missing libraries like TextTools in run_tests.py Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Documentation AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk Once upon a time, we treated any ImportError from a unit test as a reason to skip the test gracefully, as these are *usually* from missing external dependencies. This could hide real errors if we had (re)moved a Biopython module. We now use the Bio.MissingExternalDependencyError exception, and the unit tests themselve will raise this for missing command line tools or certain optional libraries like MySQLdb. However, the Bio.MissingExternalDependencyError exception does not get raised when the following commonly used external dependencies are missing: import TextTools import Numeric import reportlab It is now possible to install Biopython without TextTools and reportlab (and Numeric?), and make use of a lot of its functionality - but the unit tests give nasty error messages. I propose we either: (a) Add a special case to run_tests.py to catch specific ImportError cases and skip the test with a suitable message (patch to follow). Specifically TextTools, reportlab and Numeric - but potentially other third party libraries like MySQLdb could be handled too. This keeps the individual unit tests simple. or: (b) Modify all the tests using these semi-optional libraries to catch the ImportError and raise MissingExternalDependencyError instead. 
As the tests themselves generally don't directly import the external library this is perhaps messy. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jun 20 06:09:37 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 20 Jun 2008 06:09:37 -0400 Subject: [Biopython-dev] [Bug 2524] Handle missing libraries like TextTools in run_tests.py In-Reply-To: Message-ID: <200806201009.m5KA9b98019988@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2524 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-20 06:09 EST ------- Created an attachment (id=948) --> (http://bugzilla.open-bio.org/attachment.cgi?id=948&action=view) Patch to Tests/run_tests.py Adds a hard coded list of known import errors to be treated as missing external dependencies (i.e. skip the test). This is implemented as a dict allowing a URL to be given. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
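The approach described for the run_tests.py patch (a hard-coded dict of known import errors that are treated as missing external dependencies) could look roughly like the sketch below. This is not the attached patch: it is written for current Python 3, `MissingExternalDependencyError` is a local stand-in for `Bio.MissingExternalDependencyError`, and the `KNOWN_OPTIONAL` table and `import_test_module` helper are hypothetical names. The patch stores a URL per library; plain hint strings are used here to avoid guessing at URLs.

```python
import importlib


class MissingExternalDependencyError(Exception):
    """Local stand-in for Bio.MissingExternalDependencyError."""


# Hypothetical table: top-level module name -> hint shown when a test is skipped.
KNOWN_OPTIONAL = {
    "TextTools": "mxTextTools is not installed",
    "Numeric": "Numeric is not installed",
    "reportlab": "ReportLab is not installed",
}


def import_test_module(name, importer=importlib.import_module):
    """Import a test module, turning known optional ImportErrors into a skip.

    Any other ImportError (for example a Biopython module that was moved or
    removed) is re-raised unchanged so that real errors stay visible.
    """
    try:
        return importer(name)
    except ImportError as err:
        # ImportError.name holds the module that could not be imported.
        missing = (getattr(err, "name", None) or "").split(".")[0]
        if missing in KNOWN_OPTIONAL:
            raise MissingExternalDependencyError(KNOWN_OPTIONAL[missing]) from err
        raise
```

A test runner would catch `MissingExternalDependencyError` around the import and report the test as skipped with the hint, while letting every other exception count as a failure.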
From bugzilla-daemon at portal.open-bio.org Fri Jun 20 06:16:49 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 20 Jun 2008 06:16:49 -0400 Subject: [Biopython-dev] [Bug 2525] New: The unit tests GUI run_tests.py does not track skipped tests Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2525 Summary: The unit tests GUI run_tests.py does not track skipped tests Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk Running run_tests.py without the --no-gui command line option counts any skipped tests as passed (green). Furthermore, the skipped message is just printed to the command line (if run from a terminal). Ideally the test framework would report these skipped tests in the GUI, perhaps even with a clickable entry (like the failures) to show the message. [On a personal note, I never use the run_tests.py GUI, and would rather it was not the default. If no one likes it, we could just remove the GUI] -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jun 20 08:17:15 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 20 Jun 2008 08:17:15 -0400 Subject: [Biopython-dev] [Bug 2525] The unit tests GUI run_tests.py does not track skipped tests In-Reply-To: Message-ID: <200806201217.m5KCHFoF025054@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2525 ------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp 2008-06-20 08:17 EST ------- > [On a personal note, I never use the run_tests.py GUI, and would rather it was > not the default. 
If no one likes it, we could just remove the GUI] > Personally, I don't see the advantage of the GUI, and I can live without it. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mjldehoon at yahoo.com Fri Jun 20 08:14:30 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 20 Jun 2008 05:14:30 -0700 (PDT) Subject: [Biopython-dev] test_Entrez.py fails on Linux? In-Reply-To: <320fb6e00806200231y716c5a1ds2495f16a56a15f88@mail.gmail.com> Message-ID: <795994.35527.qm@web62408.mail.re1.yahoo.com> Hi Peter, Thanks for letting me know. It turned out that there were two problems with older Python versions (2.3 and 2.4). One issue was not in Bio.Entrez but in the test script itself, using a feature that is only available in Python 2.5. This is now fixed in CVS. The second issue is with Python 2.3: It does not copy data files to the build directory. Then, when you run "python run_tests.py test_Entrez.py" you will get many error messages about missing DTD files. If you run "python test_entrez.py" instead, the tests are done from the installed Biopython instead of the one in the build directory, and then no errors occur. I guess the only way to solve this is to modify run_tests.py to skip test_Entrez if Python is version 2.3. Unless somebody else has a better suggestion, I will do that. --Michiel. Peter wrote: > Hi Michiel, > > I've just tried the unit tests on a clean checkout on Linux, and there > is a problem with test_Entrez.py (shown below). I'm pretty sure it > was working for me on Mac OS X this afternoon, so this may be platform > specific. I haven't using Biopython on Windows recently so I don't > know if that is working or not. I've just checked, and on a clean CVS checkout under Mac OS 10.5 Leopard with python 2.5.2, test_Entrez.py passes. 
A clean check out last night on 64bit Ubuntu Dapper Drake with python 2.4.3 failed. So whatever is going wrong is probably OS specific or perhaps python version specific. Peter _______________________________________________ Biopython-dev mailing list Biopython-dev at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython-dev From biopython at maubp.freeserve.co.uk Fri Jun 20 08:43:55 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 20 Jun 2008 13:43:55 +0100 Subject: [Biopython-dev] test_Entrez.py fails on Linux? In-Reply-To: <795994.35527.qm@web62408.mail.re1.yahoo.com> References: <320fb6e00806200231y716c5a1ds2495f16a56a15f88@mail.gmail.com> <795994.35527.qm@web62408.mail.re1.yahoo.com> Message-ID: <320fb6e00806200543u62d385fcka3aa9026986549ba@mail.gmail.com> On Fri, Jun 20, 2008 at 1:14 PM, Michiel de Hoon wrote: > Hi Peter, > > Thanks for letting me know. > > It turned out that there were two problems with older Python versions (2.3 and 2.4). > One issue was not in Bio.Entrez but in the test script itself, using a > feature that is only available in Python 2.5. This is now fixed in CVS. Good work. > The second issue is with Python 2.3: It does not copy data files to the > build directory. Then, when you run "python run_tests.py test_Entrez.py" > you will get many error messages about missing DTD files. If you run > "python test_entrez.py" instead, the tests are done from the installed > Biopython instead of the one in the build directory, and then no errors occur. I had suspected there was something like this happening on my Windows machine (which is on python 2.3) but at the time you were still busy updating the code so I didn't worry about it. This issue with non-python files in the build directory reminds me of something Tiago found with his Population Genetics work. I'd have to go over the old emails to double check. 
> I guess the only way to solve this is to modify run_tests.py to skip > test_Entrez if Python is version 2.3. Unless somebody else has a better > suggestion, I will do that. We could modify setup.py under python 2.3 to make sure these files are copied. Is this related to the (reverted) package_data change you tried recently? Peter From biopython at maubp.freeserve.co.uk Fri Jun 20 09:23:21 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 20 Jun 2008 14:23:21 +0100 Subject: [Biopython-dev] test_Entrez.py fails on Linux? In-Reply-To: <320fb6e00806200543u62d385fcka3aa9026986549ba@mail.gmail.com> References: <320fb6e00806200231y716c5a1ds2495f16a56a15f88@mail.gmail.com> <795994.35527.qm@web62408.mail.re1.yahoo.com> <320fb6e00806200543u62d385fcka3aa9026986549ba@mail.gmail.com> Message-ID: <320fb6e00806200623n2148b735t1071aa40b0f24a7c@mail.gmail.com> >> The second issue is with Python 2.3: It does not copy data files to the >> build directory. Then, when you run "python run_tests.py test_Entrez.py" >> you will get many error messages about missing DTD files. If you run >> "python test_entrez.py" instead, the tests are done from the installed >> Biopython instead of the one in the build directory, and then no errors occur. > > ... > > This issue with non-python files in the build directory reminds me of > something Tiago found with his Population Genetics work. I'd have to > go over the old emails to double check. I was thinking of bug 2375, where Tiago had to add a workaround for data files not present in the build directory. 
http://bugzilla.open-bio.org/show_bug.cgi?id=2375 Peter From biopython at maubp.freeserve.co.uk Fri Jun 20 10:42:57 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 20 Jun 2008 15:42:57 +0100 Subject: [Biopython-dev] SeqRecord to file format as string In-Reply-To: <4D53AB82-F673-4F4F-BCEC-BA06088E8721@northwestern.edu> References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk> <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu> <4714EB8E.3000700@maubp.freeserve.co.uk> <6243BAA9F5E0D24DA41B27997D1FD14402B63C@mail2.exch.c2b2.columbia.edu> <320fb6e00806180700k327e6913m7ba9c4bdc3421f67@mail.gmail.com> <4D53AB82-F673-4F4F-BCEC-BA06088E8721@northwestern.edu> Message-ID: <320fb6e00806200742w7e9e57dbt8d0d3362573cf9a@mail.gmail.com> On Wed, Jun 18, 2008 at 4:16 PM, Jared Flatow wrote: > However, py3k and 2.6 will make available the functionality described in PEP > 3101: > > http://www.python.org/dev/peps/pep-3101/ > > I think it would be best to define some semantics that are compatible with > this PEP. That is interesting - the PEP has been accepted, but I guess we should wait and see exactly what python 2.6 and 3.0 end up using before trying to integrate this into the SeqRecord. > In short, I think creating methods to return formatted versions of objects > (SeqRecords) is a good idea, but most especially if it is done in a way > consistent with the language's vision. That does sound wise - but I'm a little hazy on how exactly PEP-3101 will work in practice for generic complex objects. 
Peter From bugzilla-daemon at portal.open-bio.org Fri Jun 20 11:01:17 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 20 Jun 2008 11:01:17 -0400 Subject: [Biopython-dev] [Bug 2526] New: SeqFeature's .id property is not preserved in BioSQL Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2526 Summary: SeqFeature's .id property is not preserved in BioSQL Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: BioSQL AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk As per the title, a SeqFeature's .id property is not preserved after a save/retrieve in BioSQL. I found this while working on Bug 2235, where my modified "swiss" parser creates SeqRecord objects with SeqFeature objects which may have their .id set. Note that in GenBank and EMBL, the SeqFeature objects do not have their id property set, and so are not affected. I need to review the BioSQL schema to see if there is a suitable field that Biopython is ignoring, and if there is, use it. If not, we can probably use a tagged qualifier - ideally with the same name as the other Bio* projects. See also test_BioSQL_SeqIO.py revision 1.17 which includes a workaround to avoid this limitation. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From jflatow at northwestern.edu Fri Jun 20 12:16:10 2008 From: jflatow at northwestern.edu (Jared Flatow) Date: Fri, 20 Jun 2008 11:16:10 -0500 Subject: [Biopython-dev] SeqRecord to file format as string In-Reply-To: <320fb6e00806200742w7e9e57dbt8d0d3362573cf9a@mail.gmail.com> References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk> <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu> <4714EB8E.3000700@maubp.freeserve.co.uk> <6243BAA9F5E0D24DA41B27997D1FD14402B63C@mail2.exch.c2b2.columbia.edu> <320fb6e00806180700k327e6913m7ba9c4bdc3421f67@mail.gmail.com> <4D53AB82-F673-4F4F-BCEC-BA06088E8721@northwestern.edu> <320fb6e00806200742w7e9e57dbt8d0d3362573cf9a@mail.gmail.com> Message-ID: <0FB6DD30-426C-43F3-BEBE-1728FA1E9D79@northwestern.edu> On Jun 20, 2008, at 9:42 AM, Peter wrote: > On Wed, Jun 18, 2008 at 4:16 PM, Jared Flatow > wrote: >> However, py3k and 2.6 will make available the functionality >> described in PEP >> 3101: >> >> http://www.python.org/dev/peps/pep-3101/ >> >> I think it would be best to define some semantics that are >> compatible with >> this PEP. > > That is interesting - the PEP has been accepted, but I guess we should > wait and see exactly what python 2.6 and 3.0 end up using before > trying to integrate this into the SeqRecord. I agree, there's a couple of things that may still change, but the betas for 2.6 and 3.0 are out and that PEP has been around a while so I would say it's pretty much stable. At least as far as how the general mechanism will work, I don't believe that is likely to change. >> In short, I think creating methods to return formatted versions of >> objects >> (SeqRecords) is a good idea, but most especially if it is done in a >> way >> consistent with the language's vision. > > That does sound wise - but I'm a little hazy on how exactly PEP-3101 > will work in practice for generic complex objects. 
Yes, I had to read it through a few times to understand how exactly it will work. Here is what I know:

All objects now get the __format__ method, which has a signature like this:

    def __format__(self, format_spec):
        # return a formatted string

The format_spec (format specifier) can be defined by the object, so essentially it's totally customizable (if you want to do really crazy things there is a Formatter that can be messed with, but we should and can avoid this). This object method works like other customizable python methods, and there's a corresponding builtin, so calling format(obj, "the format specifier") will simply call obj.__format__("the format specifier"). Thus we can define the format_spec for a SeqRecord to differentiate between FASTA and whatever other formats we want to define.

The string class is also getting a .format method which just calls the .__format__ method in an OO way instead of using the builtin. We can do the same thing, and it seems like most use cases will be to call seq_rec.format('fasta'). All this works for all python versions, except you typically can't call it using format(seq_rec, 'fasta') except in 2.6 or 3.0.

Besides the builtin format, we gain the ability to embed the format within other strings. So, using the implementation you provided earlier which just returns the underlying Seq as a string if no format is specified, we might define the __format__ method like this:

    def __format__(self, format_spec=None):
        if format_spec:
            from StringIO import StringIO
            from Bio import SeqIO
            handle = StringIO()
            # pass the requested format_spec, not the builtin format()
            SeqIO.write([self], handle, format_spec)
            handle.seek(0)
            return handle.read()
        return str(self)

    def __str__(self):
        return str(self.seq)

Now that means I can also embed this in formatted strings, like so:

    "this is my sequence: {0}".format(seq_rec)

Or:

    "this is my sequence in fasta format: {0:fasta}".format(seq_rec)

All in all, it's pretty much what you'd expect (and the same as what you had before).
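A minimal, self-contained sketch of the __format__ mechanism described above, written for current Python 3. The `Record` class is a hypothetical stand-in for SeqRecord, and the FASTA text is built by hand rather than through Bio.SeqIO, so this only illustrates how the format specifier reaches the object:

```python
class Record:
    """Hypothetical stand-in for SeqRecord: just an id and a sequence string."""

    def __init__(self, id, seq):
        self.id = id
        self.seq = seq

    def __str__(self):
        # With no format specifier, fall back to the bare sequence.
        return str(self.seq)

    def __format__(self, format_spec):
        if not format_spec:
            return str(self)
        if format_spec == "fasta":
            # Hand-written FASTA output; real code would delegate to SeqIO.
            return ">%s\n%s\n" % (self.id, self.seq)
        raise ValueError("Unknown format %r" % format_spec)

    def format(self, format_spec):
        # Method form, usable even where the format() builtin is missing.
        return self.__format__(format_spec)


seq_rec = Record("seq1", "ACGT")
plain = "this is my sequence: {0}".format(seq_rec)
fasta = "this is my sequence in fasta format: {0:fasta}".format(seq_rec)
```

Here `format(seq_rec, "fasta")`, `seq_rec.format("fasta")` and the `{0:fasta}` replacement field all end up in the same `__format__` call, which is the main attraction of following the PEP.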
There's only a few small benefits we get for doing it this way (right now), but I don't think we can go wrong using the __format__ method like it was meant to be used, and who knows what future use cases this may simplify. jared From bugzilla-daemon at portal.open-bio.org Sat Jun 21 00:19:59 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 21 Jun 2008 00:19:59 -0400 Subject: [Biopython-dev] [Bug 2375] Coalescent support through Simcoal2 In-Reply-To: Message-ID: <200806210419.m5L4JxfJ001994@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2375 mdehoon at ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |REOPENED Resolution|FIXED | ------- Comment #22 from mdehoon at ims.u-tokyo.ac.jp 2008-06-21 00:19 EST ------- (In reply to comment #15) > The solution in Bio/PopGen/SimCoal/__init__.py to find builtin_tpl_dir is not > so beautiful, but on the other hand I don't see a better way to do it. I ran into the same problem with Bio/Entrez, which needs a bunch of DTD files in Bio/Entrez/DTDs/. The attached patch to setup.py modifies the build command such that the data files are copied to the build directory when running "python setup.py build". This solves the problem with Bio.Entrez, and should also solve the problem with Bio/PopGen/SimCoal without using the workaround in Bio/PopGen/SimCoal/__init__.py. Can you guys try this patch on the platforms and python versions you have access to? Just to make sure I didn't miss anything before committing to CVS. Recently there have been quite a lot of updates to CVS, so you may need to start from a fresh CVS checkout. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Sat Jun 21 00:21:13 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 21 Jun 2008 00:21:13 -0400 Subject: [Biopython-dev] [Bug 2375] Coalescent support through Simcoal2 In-Reply-To: Message-ID: <200806210421.m5L4LDPg002064@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2375 ------- Comment #23 from mdehoon at ims.u-tokyo.ac.jp 2008-06-21 00:21 EST ------- Created an attachment (id=950) --> (http://bugzilla.open-bio.org/attachment.cgi?id=950&action=view) Patch to setup.py -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From mjldehoon at yahoo.com Sat Jun 21 01:11:18 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 20 Jun 2008 22:11:18 -0700 (PDT) Subject: [Biopython-dev] Bio.SCOP Message-ID: <251322.99482.qm@web62401.mail.re1.yahoo.com>

Bio.SCOP is one of the modules affected by Bug 2454 (http://bugzilla.open-bio.org/show_bug.cgi?id=2454), which is basically about how Biopython uses file handles. Bio.SCOP contains parsers for several file formats used by SCOP. I am using Bio.SCOP.Hie as an example here, but the same applies to the other parsers.

The Bio.SCOP parsers define a Parser and an Iterator class (similar to other older Biopython parsers). Typical usage is as follows:

>>> from Bio.SCOP import Hie
>>> handle = open("mydatafile.txt")
>>> parser = Hie.Parser()
>>> records = Hie.Iterator(handle, parser)
>>> for record in records:
...     pass # record is an instance of Bio.SCOP.Hie.Record

Now, in the SCOP file format, each record is on one line in the data file. So we don't need the Iterator:

>>> from Bio.SCOP import Hie
>>> handle = open("mydatafile.txt")
>>> parser = Hie.Parser()
>>> for line in handle:
...     record = parser.parse(line)
...     # record is an instance of Bio.SCOP.Hie.Record

This solves Bug #2454 (which occurs in the Iterator class), and is more general than the Iterator class (e.g., now we can parse a list of lines).

To take this one step further, the Parser class is not really needed either. Although Parser is a class, we are not using the functionality of a class (no inheritance, and the object self is never used). In essence, the parse() function inside the Parser class may as well live outside of it. There are several ways to simplify this module; each of them essentially amounts to moving the parse() function:

1) Move the parse() function to the Record class initializer:

>>> from Bio.SCOP import Hie
>>> handle = open("mydatafile.txt")
>>> for line in handle:
...     record = Hie.Record(line)
...     # record is an instance of Bio.SCOP.Hie.Record

2) Move the parse() function outside of the Parser class, and rename it read() for consistency with other Biopython parsers:

>>> from Bio.SCOP import Hie
>>> handle = open("mydatafile.txt")
>>> while True:
...     record = Hie.read(handle)
...     if not record: break
...     # record is an instance of Bio.SCOP.Hie.Record

3) Move the parse() function outside of the Parser class, and use it as a generator function:

>>> from Bio.SCOP import Hie
>>> handle = open("mydatafile.txt")
>>> records = Hie.parse(handle)
>>> for record in records:
...     pass # record is an instance of Bio.SCOP.Hie.Record

Comments, suggestions, preferences? --Michiel.
From bugzilla-daemon at portal.open-bio.org Sat Jun 21 07:31:14 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 21 Jun 2008 07:31:14 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806211131.m5LBVEWb019981@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #17 from mdehoon at ims.u-tokyo.ac.jp 2008-06-21 07:31 EST ------- I added a DeprecationWarning to Bio.Rebase. 
Next on the to-do list is Bio.SCOP. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mjldehoon at yahoo.com Sat Jun 21 07:36:43 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 21 Jun 2008 04:36:43 -0700 (PDT) Subject: [Biopython-dev] [BioPython] Bio.CDD, anyone? In-Reply-To: <485A70B0.1010202@gmail.com> Message-ID: <195444.96577.qm@web62403.mail.re1.yahoo.com> As far as I can tell, the test files were created by saving the HTML source code from the CDD web site to a file. As the CDD web site has changed its HTML in the meantime, we cannot reproduce the HTML files used by the Bio.CDD tests. Unless somebody objects in the next couple of days, I'll add a DeprecationWarning to Bio.CDD. --Michiel. Bruce Southey wrote: Hi, Do you know how the test files were created? If there is not an easy answer then it makes the decision easier. Anyhow, I vote to remove this module as, in addition to the things previously mentioned, it would be far better to support interproscan (http://www.ebi.ac.uk/Tools/InterProScan/ ) than just a single tool. 
Bruce _______________________________________________ BioPython mailing list - BioPython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From bugzilla-daemon at portal.open-bio.org Sun Jun 22 00:51:58 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 22 Jun 2008 00:51:58 -0400 Subject: [Biopython-dev] [Bug 2527] New: Bug in NCBIXML.py in _end_BlastOutput_version() Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2527 Summary: Bug in NCBIXML.py in _end_BlastOutput_version() Product: Biopython Version: 1.45 Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: cdputnam at ucsd.edu biopython version is from Fedora distribution: python-biopython-1.45-1.fc7 For a recently run NCBIWWW Blast (following the tutorial at http://biopython.org/DIST/docs/tutorial/Tutorial.html), I ran into a problem in parsing by _end_BlastOutput_version with the version information: BLASTP 2.2.18+ Traceback (most recent call last): File "blast2.py", line 7, in for blast_record in blast_records: File "/usr/lib/python2.5/site-packages/Bio/Blast/NCBIXML.py", line 577, in parse expat_parser.Parse(text, False) File "/usr/lib/python2.5/site-packages/Bio/Blast/NCBIXML.py", line 98, in endElement eval("self.%s()" % method) File "", line 1, in File "/usr/lib/python2.5/site-packages/Bio/Blast/NCBIXML.py", line 216, in _end_BlastOutput_version self._header.date = self._value.split()[2][1:-1] IndexError: list index out of range I've worked around this bug for now by commenting out the offending line and setting the date to an empty string: def _end_BlastOutput_version(self): """version number of the BLAST engine (e.g., 2.1.2) Save this to put on each blast record object """ self._header.version = self._value.split()[1] # self._header.date = self._value.split()[2][1:-1] self._header.date = '' -- Configure bugmail: 
http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sun Jun 22 00:52:45 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 22 Jun 2008 00:52:45 -0400 Subject: [Biopython-dev] [Bug 2527] Bug in NCBIXML.py in _end_BlastOutput_version() In-Reply-To: Message-ID: <200806220452.m5M4qjiE029058@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2527 cdputnam at ucsd.edu changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |cdputnam at ucsd.edu -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sun Jun 22 01:52:05 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 22 Jun 2008 01:52:05 -0400 Subject: [Biopython-dev] [Bug 2527] Bug in NCBIXML.py in _end_BlastOutput_version() In-Reply-To: Message-ID: <200806220552.m5M5q5rQ031580@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2527 ------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp 2008-06-22 01:52 EST ------- I believe that this is already fixed in CVS. Could you try the latest version of Bio/Blast/NCBIXML.py, available at http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/?cvsroot=biopython and let us know if it fixes the bug? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
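For illustration, the IndexError reported in Bug 2527 comes from indexing the third whitespace-separated token of a version string that only has two. A tolerant variant might look like the sketch below. This is not necessarily how the fix in CVS works; `parse_blast_version` is a hypothetical helper, and the second example string (with a bracketed date) is a made-up input whose shape is only inferred from the `[2][1:-1]` slice in the traceback.

```python
def parse_blast_version(value):
    """Split a BlastOutput_version string into (version, date).

    The date is returned as an empty string when the third token is
    absent, as with "BLASTP 2.2.18+".
    """
    parts = value.split()
    version = parts[1] if len(parts) > 1 else ""
    # Strip the surrounding brackets from the date token, if there is one.
    date = parts[2][1:-1] if len(parts) > 2 else ""
    return version, date


new_style = parse_blast_version("BLASTP 2.2.18+")
# Made-up example of the older shape, with a bracketed date token.
old_style = parse_blast_version("BLASTP 2.2.15 [Oct-15-2006]")
```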
From bugzilla-daemon at portal.open-bio.org Mon Jun 23 06:54:22 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 23 Jun 2008 06:54:22 -0400 Subject: [Biopython-dev] [Bug 2528] New: NCBIStandalone.blastall(): Replace os.popen3 with subprocess.Popen Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2528 Summary: NCBIStandalone.blastall(): Replace os.popen3 with subprocess.Popen Product: Biopython Version: 1.45 Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: mmokrejs at ribosome.natur.cuni.cz I have already mentioned this on the email list a few weeks ago ... NCBI Blast 2.2.18 (and, as far as I remember, earlier versions too) does not flush output buffers when run from under mod_python-3.3.11/apache-2.2.8. I tried to flush the buffers or disable buffering but it does not help. In the end, a working solution is to move to using the subprocess module, which was introduced in python 2.4 and is intended to replace os.system, os.spawn*, os.popen* and similar functions. The following patch works for me, so the user receives the blast stdout back in his/her web browser. Somehow, one has to copy the data into another variable and close the file descriptors used by the blastall binary. Unfortunately, a stale process can still be seen in "ps -ef" output: apache 5382 5323 47 12:31 ? 00:00:04 [blastall] But as I have said, at least the data is not buffered anymore. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon Jun 23 06:55:26 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 23 Jun 2008 06:55:26 -0400 Subject: [Biopython-dev] [Bug 2528] NCBIStandalone.blastall(): Replace os.popen3 with subprocess.Popen In-Reply-To: Message-ID: <200806231055.m5NAtQCC030683@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2528 ------- Comment #1 from mmokrejs at ribosome.natur.cuni.cz 2008-06-23 06:55 EST ------- Created an attachment (id=951) --> (http://bugzilla.open-bio.org/attachment.cgi?id=951&action=view) NCBIStandalone.py.patch -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 23 06:56:00 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 23 Jun 2008 06:56:00 -0400 Subject: [Biopython-dev] [Bug 2508] NCBIStandalone.blastall: provide support for '-F F' and make it safe In-Reply-To: Message-ID: <200806231056.m5NAu0or030728@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2508 ------- Comment #5 from mmokrejs at ribosome.natur.cuni.cz 2008-06-23 06:56 EST ------- (In reply to comment #4) Yes, the "filter" argument is not clear, please improve the docs in the sources and on the web. Ideally, I would also propose renaming the argument. Regarding the patch in comment #3, I think it should be more strict and blast* functions should only accept explicitly listed arguments in the function definition, so no kwargs, etc. But it is a good start. In general, I would propose to provide a general wrapper function to be placed in front of _ALL_ popen3() calls. And, in conjunction, replace the popen3 calls with subprocess.Popen. See Bug #2528 on NCBIStandalone.blastall(), where there is a working example of this. 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 23 11:01:17 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 23 Jun 2008 11:01:17 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806231501.m5NF1Hth014356@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #18 from mdehoon at ims.u-tokyo.ac.jp 2008-06-23 11:01 EST ------- See the discussion on the mailing list: http://lists.open-bio.org/pipermail/biopython-dev/2008-June/003819.html for some ideas for Bio.SCOP. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 23 11:16:29 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 23 Jun 2008 11:16:29 -0400 Subject: [Biopython-dev] [Bug 2527] Bug in NCBIXML.py in _end_BlastOutput_version() In-Reply-To: Message-ID: <200806231516.m5NFGTgD015331@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2527 cdputnam at ucsd.edu changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from cdputnam at ucsd.edu 2008-06-23 11:16 EST ------- The latest NCBIXML.py does fix the problem with Blast version parsing. Just so you know, I had to comment out two lines in _end_Hsp_bit_score, similar to the version of the file I already had. I'm guessing this is a version mismatch with some other file that I didn't update (I only replaced NCBIXML.py). 
The error was:

    AttributeError: Description instance has no attribute 'bits'

And the commented version of the function is:

    def _end_Hsp_bit_score(self):
        """bit score of HSP
        """
        self._hsp.bits = float(self._value)
        #if self._descr.bits == None:
        #    self._descr.bits = float(self._value)

Thanks for your help.

From bugzilla-daemon at portal.open-bio.org Tue Jun 24 05:38:54 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 24 Jun 2008 05:38:54 -0400
Subject: [Biopython-dev] [Bug 2528] NCBIStandalone.blastall(): Replace os.popen3 with subprocess.Popen
In-Reply-To:
Message-ID: <200806240938.m5O9csKZ032756@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2528

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-24 05:38 EST -------
With this patch we have to wait for the sub-process to finish before we can
read its output. This is a potential drawback as it delays the parsing.
Currently we should be able to parse this iteratively as the queries are
processed.

Also, you are loading the entire output into memory (as a list of strings,
which you then turn into a StringIO handle). This is potentially a very bad
idea, as in extreme cases Blast XML files can be GB in size.

I'm not keen on your solution, but I don't know what to suggest for your
original problem, running Blast under mod_python-3.3.11/apache-2.2.8.

Two minor points: Do you think we can do anything better on Python 2.3? Did
you intend something similar for blastpgp and rpsblast?
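
The two objections in comment #2 (waiting for the child to finish, and holding all output in memory) can be illustrated with a sketch that reads the child's output one line at a time. This assumes current Python and the standard subprocess module; iter_output_lines is a hypothetical name, and nothing here is claimed to fix the mod_python hang reported in this bug.

```python
import subprocess
import sys


def iter_output_lines(args):
    """Hypothetical helper: yield the child's stdout one line at a time.

    Lines are handed to the caller as the child writes them, so the whole
    output is never held in memory at once.
    """
    child = subprocess.Popen(
        args,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    try:
        for line in child.stdout:
            yield line
    finally:
        # Runs when the output is exhausted or the caller stops iterating.
        child.stdout.close()
        child.wait()


if __name__ == "__main__":
    # Stand-in command: a short Python one-liner instead of blastall.
    command = [sys.executable, "-c", "for i in range(3): print(i)"]
    for line in iter_output_lines(command):
        print(line.rstrip())
```

A parser that accepts any iterable of lines, or a handle, could consume such a generator directly instead of a StringIO copy of the output.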
From biopython at maubp.freeserve.co.uk Tue Jun 24 05:46:19 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 24 Jun 2008 10:46:19 +0100
Subject: [Biopython-dev] Bio.SCOP
In-Reply-To: <251322.99482.qm@web62401.mail.re1.yahoo.com>
References: <251322.99482.qm@web62401.mail.re1.yahoo.com>
Message-ID: <320fb6e00806240246u8afdb6fp51cd31000ebe3d9@mail.gmail.com>

On Sat, Jun 21, 2008 at 6:11 AM, Michiel de Hoon wrote:
> Bio.SCOP contains parsers for several file
> formats used by SCOP. I am using Bio.SCOP.Hie
> as an example here, but the same applies to
> the other parsers.
>
> The Bio.SCOP parsers define a Parser and an Iterator
> class (similar to other older Biopython parsers).

I would deprecate the Parser and Iterator objects, and introduce a
parse(handle) function to iterate over a file (following our recent
convention) and perhaps a read() function too (taking a handle or a
single line?).

Peter

From bugzilla-daemon at portal.open-bio.org Tue Jun 24 06:17:41 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 24 Jun 2008 06:17:41 -0400
Subject: [Biopython-dev] [Bug 2528] NCBIStandalone.blastall(): Replace os.popen3 with subprocess.Popen
In-Reply-To:
Message-ID: <200806241017.m5OAHfdK002192@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2528

------- Comment #3 from mmokrejs at ribosome.natur.cuni.cz 2008-06-24 06:17 EST -------
Hi Peter,
well, I am not very happy with this either, and I do understand your points.
I will try to come up with another solution. It would be best to disable
buffering in popen3(), but I failed to get it working. Will give it some more
thought next week.
From bugzilla-daemon at portal.open-bio.org Tue Jun 24 06:35:50 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 24 Jun 2008 06:35:50 -0400 Subject: [Biopython-dev] [Bug 2527] Bug in NCBIXML.py in _end_BlastOutput_version() In-Reply-To: Message-ID: <200806241035.m5OAZo3p003784@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2527 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-24 06:35 EST ------- Regarding comment 2, I think you need to update Bio/Blast/Record.py as well. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jun 24 06:36:18 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 24 Jun 2008 06:36:18 -0400 Subject: [Biopython-dev] [Bug 2528] NCBIStandalone.blastall(): Replace os.popen3 with subprocess.Popen In-Reply-To: Message-ID: <200806241036.m5OAaIIt003857@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2528 ------- Comment #4 from mdehoon at ims.u-tokyo.ac.jp 2008-06-24 06:36 EST ------- Is there an easy way to replicate this issue? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Jun 24 07:30:45 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 24 Jun 2008 07:30:45 -0400 Subject: [Biopython-dev] [Bug 2527] Bug in NCBIXML.py in _end_BlastOutput_version() In-Reply-To: Message-ID: <200806241130.m5OBUjYU007159@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2527 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |biopython- | |bugzilla at maubp.freeserve.co. | |uk ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-24 07:30 EST ------- P.S. This is a duplicate of Bug 2499 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jun 24 09:05:46 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 24 Jun 2008 09:05:46 -0400 Subject: [Biopython-dev] [Bug 2508] NCBIStandalone.blastall: provide support for '-F F' and make it safe In-Reply-To: Message-ID: <200806241305.m5OD5jZa012413@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2508 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-24 09:05 EST ------- Checking in Tests/test_NCBIStandalone.py new revision: 1.14 Checking in Bio/Blast/NCBIStandalone.py new revision: 1.73 I've checked in my suggested patch, and tried to improve the filter documentation by including the phrase "low complexity". 
It might be worth passing this suggestion on to the NCBI as their own command line tools just use the term filter. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mjldehoon at yahoo.com Wed Jun 25 10:04:09 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 25 Jun 2008 07:04:09 -0700 (PDT) Subject: [Biopython-dev] Bio.SCOP.FileIndex Message-ID: <141582.2274.qm@web62413.mail.re1.yahoo.com> Hi everybody, When I was modifying Bio.SCOP, I noticed that Bio.SCOP.FileIndex is flawed if file reading is done via a buffer (which is often the case in Python). Before we try to fix this, is anybody actually using Bio.SCOP.FileIndex? If not, I think we should deprecate it instead of trying to fix it. --Michiel. From bugzilla-daemon at portal.open-bio.org Wed Jun 25 11:55:58 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 25 Jun 2008 11:55:58 -0400 Subject: [Biopython-dev] [Bug 2529] New: NCBI BLAST XML parser does not support the online blast version 2.2.18+ Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2529 Summary: NCBI BLAST XML parser does not support the online blast version 2.2.18+ Product: Biopython Version: 1.45 Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P1 Component: Other AssignedTo: biopython-dev at biopython.org ReportedBy: lordnapi at gmail.com QAContact: lordnapi at gmail.com Hello, I have performed a blast search of PDB database. I am having a problem while parsing the blast result on both Windows and Linux machines. The following four lines of code provides me the same error. Thanks. 
Ahmet

>>> from Bio.Blast import NCBIWWW
>>> from Bio.Blast import NCBIXML
>>> results_handle = NCBIWWW.qblast( 'blastp', 'pdb', 'ASFPVEILPFLYLGCAKDSTNLDVLEEFGIKYILNVTPNLPNLFENAGEFKYKQIPISDHWSQNLSQ')
>>> blast_record = NCBIXML.parse( results_handle ).next()

From bugzilla-daemon at portal.open-bio.org Wed Jun 25 12:09:24 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 25 Jun 2008 12:09:24 -0400
Subject: [Biopython-dev] [Bug 2528] NCBIStandalone.blastall(): Replace os.popen3 with subprocess.Popen
In-Reply-To:
Message-ID: <200806251609.m5PG9OWX002384@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2528

------- Comment #5 from mmokrejs at ribosome.natur.cuni.cz 2008-06-25 12:09 EST -------
(In reply to comment #4)
> Is there an easy way to replicate this issue?
>

I believe it is enough to run a blast search under mod_python and try to
display the results on the web; that is all I actually do. On the server the
blastall process did not flush its cache, so if you attach to the running
process with the strace utility you will see that it has done a write() of
some line that is not yet the last one of the output. The process hangs like
this for ages, until you do "kill -HUP $pid"; then it flushes the write
buffer and exits successfully. Happens with blast 2.2.18 at least.
From bugzilla-daemon at portal.open-bio.org Wed Jun 25 12:24:45 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 25 Jun 2008 12:24:45 -0400
Subject: [Biopython-dev] [Bug 2529] NCBI BLAST XML parser does not support the online blast version 2.2.18+
In-Reply-To:
Message-ID: <200806251624.m5PGOjgf003205@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2529

lordnapi at gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |WORKSFORME

------- Comment #1 from lordnapi at gmail.com 2008-06-25 12:24 EST -------
The problem was caused by the date being absent from the BLASTP 2.2.18+
version string in the XML files. I fixed the problem for myself by changing
the _end_BlastOutput_version function in the Blast/NCBIXML.py file to the
following (starts at line 208). I still don't know if having the date is
important elsewhere.

    def _end_BlastOutput_version(self):
        """version number of the BLAST engine (e.g., 2.1.2)

        Save this to put on each blast record object
        """
        self._valuesplit = self._value.split()
        self._header.version = self._valuesplit[1]
        if len(self._valuesplit) > 2 :
            self._header.date = self._value.split()[2][1:-1]
        else:
            self._header.date = ''

From mjldehoon at yahoo.com Wed Jun 25 20:01:07 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Wed, 25 Jun 2008 17:01:07 -0700 (PDT)
Subject: [Biopython-dev] NCBI Abuse activity with Biopython
Message-ID: <254082.68438.qm@web62401.mail.re1.yahoo.com>

Dear all,

Recently NCBI blocked access for a Biopython user who was making 50,000
requests to NCBI at a rate of 18 requests per second during peak hours. This
user was using the search_for function in Bio.GenBank, which internally uses
Bio.EUtils.
Apparently, Bio.EUtils does not follow the 3 seconds sleep rule between
requests. NCBI also asked us to send requests for the Entrez E-Utilities to
the EUtils web address, and not to the regular NCBI web address. I don't know
if Bio.EUtils does that.

Bio.Entrez does use the 3 seconds sleep rule, and the eight E-Utilities
functions all make use of the EUtils web address, though it is possible to
pass a different web address as one of the arguments. The "query" function,
which is not part of the E-Utilities, does use the standard NCBI web address.

To avoid such problems in the future, I'd like to propose the following:

1) Deprecate Bio.EUtils. Its functionality is covered by Bio.Entrez, which
(from release 1.46) will have a parser. Bio.EUtils is currently used by the
following modules:

Bio/config/DBRegistry.py
Bio/dbdefs/fasta.py
Bio/dbdefs/genbank.py
Bio/dbdefs/medline.py
Bio/GenBank/__init__.py

We were already planning to remove Bio.config and Bio.dbdefs, so we'd only
have to modify Bio.GenBank.

2) Remove the 'query' function from Bio.Entrez. Anyway, accessing NCBI's web
site from Python to get HTML back doesn't make a lot of sense.

3) Remove the argument for a user-specified web address to make sure that
the E-Utilities address is always used.

--Michiel.

From dalke at dalkescientific.com Wed Jun 25 21:52:07 2008
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 26 Jun 2008 03:52:07 +0200
Subject: [Biopython-dev] NCBI Abuse activity with Biopython
In-Reply-To: <254082.68438.qm@web62401.mail.re1.yahoo.com>
References: <254082.68438.qm@web62401.mail.re1.yahoo.com>
Message-ID: <635E5251-830F-409C-A2D4-10EA59FA5037@dalkescientific.com>

On Jun 26, 2008, at 2:01 AM, Michiel de Hoon wrote:
> Bio.Entrez does use the 3 seconds sleep rule, and the eight E-Utilities
> functions all make use of the EUtils web address, though it is possible
> to pass a different web address as one of the arguments.
> The "query" function, which is not part of the E-Utilities, does use
> the standard NCBI web address.

What is the proper EUtils web address?

Entrez/__init__.py uses
    cgi='http://www.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi'
while the documentation at
    http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
claims "Send E-utilities requests to http://eutils.ncbi.nlm.nih.gov",
which I think should be
"http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi"

> To avoid such problems in the future, I'd like to propose the
> following:
> 1) Deprecate Bio.EUtils. Its functionality is covered by
> Bio.Entrez, which (from release 1.46) will have a parser.

I looked over Bio.Entrez and it handles only a subset of what Bio.EUtils
does. For example, it doesn't have any support to help track WebEnv as it
changes over each request, nor support for alternate format types.

I would deprecate Bio.EUtils for another reason - there's no maintainer.

> 2) Remove the 'query' function from Bio.Entrez. Anyway accessing
> NCBI's web site from Python to get HTML back doesn't make a lot of
> sense.

Okay, now I'm quite confused. This is functionality that Bio.EUtils
supports.

>>> from Bio.EUtils import HistoryClient
>>> client = HistoryClient.HistoryClient()
>>> result = client.search("Michiel de Hoon[AU]")
>>> print result.efetch("text", "docsum").read()

1: de Hoon M, Hayashizaki Y.
Deep cap analysis gene expression (CAGE): genome-wide identification of
promoters, quantification of their expression, and network inference.
Biotechniques. 2008 Apr;44(5):627-8, 630, 632. Review.
PMID: 18474037 [PubMed - indexed for MEDLINE]

2: Sierro N, Makita Y, de Hoon M, Nakai K.
DBTBS: a database of transcriptional regulation in Bacillus subtilis
containing upstream intergenic conservation information.
Nucleic Acids Res. 2008 Jan;36(Database issue):D93-6. Epub 2007 Oct 25.
PMID: 17962296 [PubMed - indexed for MEDLINE]

3: Makita Y, de Hoon MJ, Danchin A.
Hon-yaku: a biology-driven Bayesian methodology for identifying translation
initiation sites in prokaryotes.
BMC Bioinformatics. 2007 Feb 8;8:47.
PMID: 17286872 [PubMed - indexed for MEDLINE]

4: de Hoon MJ, Makita Y, Nakai K, Miyano S.
Prediction of transcriptional terminators in Bacillus subtilis and related
species.
PLoS Comput Biol. 2005 Aug;1(3):e25. Epub 2005 Aug 12.
PMID: 16110342 [PubMed - indexed for MEDLINE]

5: de Hoon MJ, Imoto S, Kobayashi K, Ogasawara N, Miyano S.
Inferring gene regulatory networks from time-ordered gene expression data of
Bacillus subtilis using differential equations.
Pac Symp Biocomput. 2003;:17-28.
PMID: 12603014 [PubMed - indexed for MEDLINE]

(The default returns this in XML format.)

>>> print result.efetch().read(500)
18474037 2008 05 13 2008 06

> 3) Remove the argument for a user-specified web address to make
> sure that always the E-Utilities address is used.

Yes.

Andrew
dalke at dalkescientific.com

From bugzilla-daemon at portal.open-bio.org Thu Jun 26 05:20:55 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 26 Jun 2008 05:20:55 -0400
Subject: [Biopython-dev] [Bug 2529] NCBI BLAST XML parser does not support the online blast version 2.2.18+
In-Reply-To:
Message-ID: <200806260920.m5Q9Ktlt019555@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2529

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|WORKSFORME                  |

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-26 05:20 EST -------
This is a duplicate of Bug 2499, reopening in order to mark this.
From bugzilla-daemon at portal.open-bio.org Thu Jun 26 05:21:38 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 26 Jun 2008 05:21:38 -0400 Subject: [Biopython-dev] [Bug 2529] NCBI BLAST XML parser does not support the online blast version 2.2.18+ In-Reply-To: Message-ID: <200806260921.m5Q9Lcp6019606@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2529 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|REOPENED |RESOLVED Resolution| |DUPLICATE ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-26 05:21 EST ------- The fix for the 2.2.18+ XML output is already in CVS, see Bug 2499 *** This bug has been marked as a duplicate of bug 2499 *** -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jun 26 05:21:40 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 26 Jun 2008 05:21:40 -0400 Subject: [Biopython-dev] [Bug 2499] Bio.Blast.NCBIXML cannot handle XML without date in BlastOutput_version In-Reply-To: Message-ID: <200806260921.m5Q9Lebn019619@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2499 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |lordnapi at gmail.com ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-26 05:21 EST ------- *** Bug 2529 has been marked as a duplicate of this bug. *** -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
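
The missing-date case behind Bug 2499 and its duplicate can be illustrated with a small standalone function. This is only a sketch of the idea in the reporter's workaround, not the fix checked into CVS; split_blast_version is a hypothetical name, and the two example strings are assumed forms of the BlastOutput_version text with and without a bracketed date.

```python
def split_blast_version(text):
    """Hypothetical helper: split a BlastOutput_version string.

    Returns (version, date); date is an empty string when the version
    string carries no bracketed date.
    """
    parts = text.split()
    version = parts[1]
    if len(parts) > 2:
        # Assumed form "[Oct-15-2006]": drop the surrounding brackets.
        date = parts[2].strip("[]")
    else:
        date = ""
    return version, date


if __name__ == "__main__":
    # Assumed example inputs, not taken from real BLAST output files.
    print(split_blast_version("blastp 2.2.15 [Oct-15-2006]"))
    print(split_blast_version("BLASTP 2.2.18+"))
```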
From biopython at maubp.freeserve.co.uk Thu Jun 26 06:25:38 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 26 Jun 2008 11:25:38 +0100
Subject: [Biopython-dev] [BioPython] Fwd: NCBI Abuse Activity with BioPython
In-Reply-To:
References: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org>
Message-ID: <320fb6e00806260325m3b92ff8n143141c73a1a60dd@mail.gmail.com>

Andrew wrote:
>
> I thought I put a rate limiter into the code, but looking at it now I see I
> didn't. The documentation clearly states that users must follow NCBI's
> recommendations, but who actually reads documentation?
>
>>> * Send E-utilities requests to http://eutils.ncbi.nlm.nih.gov, not the
>>> standard NCBI Web address.
>
> That change was announced on May 21, 2003, and most likely no one on the
> Biopython dev group tracks the EUtils mailing list. It was also after I
> wrote the code, but to be fair I was subscribed to the utilities list at the
> time and should have caught the change.
>
> I think the correct fix is to this code in ThinClient.py:
>
>     def __init__(self,
>                  opener = None,
>                  tool = TOOL,
>                  email = EMAIL,
>                  baseurl = "http://www.ncbi.nlm.nih.gov/entrez/eutils/"):
>
> Change the baseurl to "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/". I
> have not tested this.

I've tested that fix, and it seems to be OK with test_EUtils.py and
test_SeqIO_online.py, which calls Bio.EUtils via Bio.GenBank. Checked in as
Bio/EUtils/ThinClient.py revision 1.6.

I'll have a look at your other specific suggestions too.

Thanks for taking the time to go over this, Andrew.
Peter From p.j.a.cock at googlemail.com Thu Jun 26 06:47:05 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 26 Jun 2008 11:47:05 +0100 Subject: [Biopython-dev] NCBI Abuse activity with Biopython In-Reply-To: <635E5251-830F-409C-A2D4-10EA59FA5037@dalkescientific.com> References: <254082.68438.qm@web62401.mail.re1.yahoo.com> <635E5251-830F-409C-A2D4-10EA59FA5037@dalkescientific.com> Message-ID: <320fb6e00806260347i7655ba6eg490f5003a273a37d@mail.gmail.com> On Thu, Jun 26, 2008 at 2:52 AM, Andrew Dalke wrote: > On Jun 26, 2008, at 2:01 AM, Michiel de Hoon wrote: >> >> Bio.Entrez does use the 3 seconds sleep rule, and the eight E-Utilities >> functions all make use of the EUtils web address, though it is possible to >> pass a different web address as one of the arguments. The "query" function, >> which is not part of the E-Utilities, does use the standard NCBI web >> address. > > What is the proper EUtils web address? > > Entrez/__init__.py uses > cgi='http://www.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi' > while the documentation at > http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html > claims "Send E-utilities requests to http://eutils.ncbi.nlm.nih.gov", > which I think should be > "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi" Yes, for ePost that is correct: http://eutils.ncbi.nlm.nih.gov/entrez/query/static/epost_help.html [On a related note, following Andrew's suggestion, I have updated CVS to use the new base URL in Bio/EUtils/ThinClient.py] >> To avoid such problems in the future, I'd like to propose the following: >> 1) Deprecate Bio.EUtils. Its functionality is covered by Bio.Entrez, which >> (from release 1.46) will have a parser. > > I looked over Bio.Entrez and it handles only a subset of what Bio.EUtils > does. For example, it doesn't have any support to help track WebEnv as it > changes over each request, nor support for alternate format types. No, Bio.Entrez does not support the WebEnv / history interface. 
It can request data in different format types, although it will only parse
the XML output.

> I would deprecate Bio.EUtils for another reason - there's no maintainer.

This is a strong reason - although we are still using Bio.EUtils in
Bio.GenBank (and probably in other places too).

>> 2) Remove the 'query' function from Bio.Entrez. Anyway accessing NCBI's
>> web site from Python to get HTML back doesn't make a lot of sense.
>
> Okay, now I'm quite confused. This is functionality that Bio.EUtils
> supports.

I think Michiel meant getting a handle containing raw HTML isn't very
sensible, and this is what the Bio.Entrez.query() function does. If it can
only return HTML, then I agree, it's not very useful and could be removed.

>> 3) Remove the argument for a user-specified web address to make sure that
>> always the E-Utilities address is used.
>
> Yes.
>

Unlike BLAST where you may have a local webserver, is there any reason to
use a URL other than the NCBI's one?

Peter

From dalke at dalkescientific.com Thu Jun 26 07:03:19 2008
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 26 Jun 2008 13:03:19 +0200
Subject: [Biopython-dev] NCBI Abuse activity with Biopython
In-Reply-To: <320fb6e00806260347i7655ba6eg490f5003a273a37d@mail.gmail.com>
References: <254082.68438.qm@web62401.mail.re1.yahoo.com> <635E5251-830F-409C-A2D4-10EA59FA5037@dalkescientific.com> <320fb6e00806260347i7655ba6eg490f5003a273a37d@mail.gmail.com>
Message-ID: <52BDC1F6-52F8-4A42-B738-DFBB119F9C27@dalkescientific.com>

On Jun 26, 2008, at 12:47 PM, Peter Cock wrote:
> I think Michiel meant getting a handle containing raw HTML isn't very
> sensible, and this is what the Bio.Entrez.query() function does.

I meant to point out that supporting the search interface, with
machine-parseable output, is functionality in Bio.EUtils that isn't in
Bio.Entrez.

> Unlike BLAST where you may have a local webserver, is there any reason
> to use a URL other than the NCBI's one?

I can't think of any.
(I can make up one - setting up a local mock server for tests. But that's
not seriously going to happen.)

Andrew
dalke at dalkescientific.com

From biopython at maubp.freeserve.co.uk Thu Jun 26 07:40:54 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 26 Jun 2008 12:40:54 +0100
Subject: [Biopython-dev] [BioPython] Fwd: NCBI Abuse Activity with BioPython
In-Reply-To: <5CD393BF-D4FB-4700-B7CC-2417C9845010@dalkescientific.com>
References: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org> <320fb6e00806260421g48e5807ei92297b372c330e5b@mail.gmail.com> <5CD393BF-D4FB-4700-B7CC-2417C9845010@dalkescientific.com>
Message-ID: <320fb6e00806260440n4a933b60of5a7c8eee4e15a89@mail.gmail.com>

On Thu, Jun 26, 2008 at 12:26 PM, Andrew Dalke wrote:
> On Jun 26, 2008, at 1:21 PM, Peter wrote:
>>
>> Looking over the code, should this wait also be done for the
>> ThinClient's epost() method as well?
>
> Where? It gets the URL from an instance variable, which is set in the
> constructor.

The ThinClient class is defined in Bio/EUtils/ThinClient.py, and I have
added a 3 second wait to its _get() method. I think we should also add the
three second wait to the epost() method. Both methods will construct their
URL using self.baseurl, so they are both going to hit the same server.

Note that for the implementation, I would probably define a new _wait()
method to check the time since the last call, and call this _wait() method
from both _get() and epost().

>> This complexity is also daunting for anyone else considering taking
>> over the Bio.EUtils code base.
>
> My incomplete rewrite uses elementtree which does reduce some of the
> complexity. But the NCBI interface is a mess.

I can see why Michiel has kept things simple in Bio.Entrez - this should
cater to most users' needs.
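
The shared _wait() idea described above could look roughly like the following sketch, written for current Python. The RateLimiter class, its parameters, and the injectable clock and sleep functions are all hypothetical and are not the code checked into Bio/EUtils/ThinClient.py; the 3 second default comes from the sleep rule discussed in this thread.

```python
import time


class RateLimiter(object):
    """Hypothetical helper: enforce a minimum delay between calls."""

    def __init__(self, delay=3.0, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self._clock = clock      # injectable so the logic can be tested
        self._sleep = sleep
        self._last_call = None   # time of the previous request, if any

    def wait(self):
        """Sleep just long enough that calls are `delay` seconds apart."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.delay - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


if __name__ == "__main__":
    limiter = RateLimiter(delay=0.2)
    for i in range(3):
        limiter.wait()           # would precede each _get() or epost() call
        print("request", i)
```

Sharing one limiter object between the methods that build their URL from the same base address means the delay applies per server rather than per method.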
Peter From mjldehoon at yahoo.com Thu Jun 26 07:45:45 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 26 Jun 2008 04:45:45 -0700 (PDT) Subject: [Biopython-dev] NCBI Abuse activity with Biopython In-Reply-To: <320fb6e00806260347i7655ba6eg490f5003a273a37d@mail.gmail.com> Message-ID: <402220.93857.qm@web62411.mail.re1.yahoo.com> > > I would deprecate Bio.EUtils for another reason - there's no maintainer. This is what I meant. I am sure that we can fix Bio.EUtils for now, but I don't see how we can maintain it in the future. That is why originally we decided to focus on Bio.WWW.NCBI (renamed to Bio.Entrez) instead. > - although we are still using Bio.EUtils in Bio.GenBank > (and probably in other places too). As far as I can tell, Bio.GenBank is currently the only module in which Bio.EUtils is used, not counting modules that themselves have been deprecated. It shouldn't be too complicated to modify Bio.GenBank to use Bio.Entrez instead. >>> 2) Remove the 'query' function from Bio.Entrez. >>> Anyway accessing NCBI's web site from Python >>> to get HTML back doesn't make a lot of sense. > >> Okay, now I'm quite confused. This is functionality >> that Bio.EUtils supports. > > I think Michiel meant getting a handle containing > raw HTML isn't very sensible, and this is what the > Bio.Entrez.query() function does. If it can only > return HTML, then I agree, its not very useful and > could be removed. That is indeed what I meant. (It is still possible to get raw HTML by using the other EUtilities, for example efetch, but from a scripting language efetch is more likely to be used to get XML or some plain-text output). 
--Michiel

From mjldehoon at yahoo.com Thu Jun 26 08:50:10 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Thu, 26 Jun 2008 05:50:10 -0700 (PDT)
Subject: [Biopython-dev] New release
Message-ID: <390323.35893.qm@web62411.mail.re1.yahoo.com>

Hi everybody,

I think we should make a new Biopython release within the next couple of
weeks to solve the issues with NCBI and to get the fixed Blast parser out
(for output from Blast 2.2.18). There are a few outstanding issues that
hopefully can be fixed before the next release:

1) NCBI access from Bio.GenBank
2) Bug #2454 (Iterators can't use file-like objects), which affects a number
of parsers in Biopython
3) Martel-based parsers.

From a technical viewpoint, none of these are very complicated. 2) is almost
finished. With respect to 3), a small number of parsers in Biopython are
based on Martel (none of the major ones as far as I can tell). For some of
these parsers, it is not quite clear if they are still useful. For the
remaining ones, it would be nice if they could be rewritten without using
Martel -- that would let us get rid of the dependency on mxTextTools.

Any other urgent issues that need to be resolved before a release?

--Michiel.

From biopython at maubp.freeserve.co.uk Thu Jun 26 08:53:09 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 26 Jun 2008 13:53:09 +0100
Subject: [Biopython-dev] NCBI Abuse activity with Biopython
In-Reply-To: <402220.93857.qm@web62411.mail.re1.yahoo.com>
References: <320fb6e00806260347i7655ba6eg490f5003a273a37d@mail.gmail.com> <402220.93857.qm@web62411.mail.re1.yahoo.com>
Message-ID: <320fb6e00806260553i4a7c5b2cxe5ae5aa0c80e53d1@mail.gmail.com>

> As far as I can tell, Bio.GenBank is currently the only module in which
> Bio.EUtils is used, not counting modules that themselves have been
> deprecated. It shouldn't be too complicated to modify Bio.GenBank to use
> Bio.Entrez instead.
Looking back at CVS, it used to use Bio.WWW.NCBI once upon a time (which is now Bio.Entrez), and had explicit rate limiting. Then four years ago Brad moved the Bio.GenBank.download_many() and search_for() functions over to using Bio.EUtils (CVS revision 1.51 of Bio/GenBank/__init__.py). Brad also appears to have changed the functionality of Bio.GenBank.download_many() from a call back mechanism to returning a handle. We could still return a handle, but it would require fetching all the records (perhaps in batches), and concatenating them. I think it would make more sense to deprecate the Bio.GenBank.download_many() function, and direct people to Bio.Entrez.efetch() instead. The Bio.GenBank.search_for() still seems somewhat useful, but without a default limit on the number of returned IDs, this could easily be abused. Again, we could deprecate this and direct people to Bio.Entrez.esearch() instead. Peter From mjldehoon at yahoo.com Thu Jun 26 09:41:24 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 26 Jun 2008 06:41:24 -0700 (PDT) Subject: [Biopython-dev] NCBI Abuse activity with Biopython In-Reply-To: <320fb6e00806260553i4a7c5b2cxe5ae5aa0c80e53d1@mail.gmail.com> Message-ID: <8498.83228.qm@web62412.mail.re1.yahoo.com> > The Bio.GenBank.search_for() still seems somewhat > useful, but without a default limit on the number > of returned IDs, this could easily be abused. > Again, we could deprecate this and direct people > to Bio.Entrez.esearch() instead. As always, I am in favor of deprecating functions whose purpose is dubious. 
F # Using Bio.GenBank >>> from Bio import GenBank >>> gi_list = GenBank.search_for("Opuntia AND rpl16") >>> gi_list ['57240072', '57240071', '6273287', '6273291', '6273290', '6273289', '6273286', '6273285', '6273284'] # Same thing, using Bio.Entrez >>> from Bio import Entrez >>> handle = Entrez.esearch(db='nucleotide', term="Opuntia AND rpl16") >>> record = Entrez.read(handle) >>> record["IdList"] ['57240072', '57240071', '6273287', '6273291', '6273290', '6273289', '6273286', '6273285', '6273284'] --- On Thu, 6/26/08, Peter wrote: From: Peter Subject: Re: [Biopython-dev] NCBI Abuse activity with Biopython To: mjldehoon at yahoo.com Cc: "Biopython Developers Mailing List" Date: Thursday, June 26, 2008, 8:53 AM > As far as I can tell, Bio.GenBank is currently the only module in which > Bio.EUtils is used, not counting modules that themselves have been > deprecated. It shouldn't be too complicated to modify Bio.GenBank to use > Bio.Entrez instead. Looking back at CVS, it used to use Bio.WWW.NCBI once upon a time (which is now Bio.Entrez), and had explicit rate limiting. Then four years ago Brad moved the Bio.GenBank.download_many() and search_for() functions over to using Bio.EUtils (CVS revision 1.51 of Bio/GenBank/__init__.py). Brad also appears to have changed the functionality of Bio.GenBank.download_many() from a call back mechanism to returning a handle. We could still return a handle, but it would require fetching all the records (perhaps in batches), and concatenating them. I think it would make more sense to deprecate the Bio.GenBank.download_many() function, and direct people to Bio.Entrez.efetch() instead. The Bio.GenBank.search_for() still seems somewhat useful, but without a default limit on the number of returned IDs, this could easily be abused. Again, we could deprecate this and direct people to Bio.Entrez.esearch() instead. 
Peter From mjldehoon at yahoo.com Thu Jun 26 09:51:55 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 26 Jun 2008 06:51:55 -0700 (PDT) Subject: [Biopython-dev] NCBI Abuse activity with Biopython In-Reply-To: <320fb6e00806260553i4a7c5b2cxe5ae5aa0c80e53d1@mail.gmail.com> Message-ID: <597121.15112.qm@web62401.mail.re1.yahoo.com> [Sorry, hit the send button too soon] > The Bio.GenBank.search_for() still seems somewhat > useful, but without a default limit on the number > of returned IDs, this could easily be abused. > Again, we could deprecate this and direct people > to Bio.Entrez.esearch() instead. As always, I am in favor of deprecating functions whose purpose is dubious. As an example, this is a Genbank search done via Bio.GenBank and via Bio.Entrez: # Using Bio.GenBank >>> from Bio import GenBank >>> gi_list = GenBank.search_for("Opuntia AND rpl16") >>> gi_list ['57240072', '57240071', '6273287', '6273291', '6273290', '6273289', '6273286', '6273285', '6273284'] # Same thing, using Bio.Entrez >>> from Bio import Entrez >>> handle = Entrez.esearch(db='nucleotide', term="Opuntia AND rpl16") >>> record = Entrez.read(handle) >>> record["IdList"] ['57240072', '57240071', '6273287', '6273291', '6273290', '6273289', '6273286', '6273285', '6273284'] I believe that GenBank.search_for automatically takes care of the retmax parameter (the maximum number of ids to return), but I agree that this can be abused easily. > Brad also appears to have changed the functionality of > Bio.GenBank.download_many() from a call back mechanism > to returning a handle. We could still return a handle, but it would > require fetching all the records (perhaps in batches), and > concatenating them. I think it would make more sense to deprecate > the Bio.GenBank.download_many() function, and direct people to > Bio.Entrez.efetch() instead. Agree. Btw, NCBIDictionary definitely needs to go. 
From the documentation, continuing the example above: >>> ncbi_dict = GenBank.NCBIDictionary("nucleotide", "genbank") >>> gb_record = ncbi_dict[gi_list[0]] Hence, we're running efetch once for each key separately; this is exactly what NCBI advised against. --Michiel. From mjldehoon at yahoo.com Thu Jun 26 10:01:31 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 26 Jun 2008 07:01:31 -0700 (PDT) Subject: [Biopython-dev] Bio.ECell, anybody? Message-ID: <712489.88060.qm@web62410.mail.re1.yahoo.com> This is one of the Martel-based parsers whose relevance in 2008 is unclear to me. From the docstring: Ecell converts the ECell input from spreadsheet format to an intermediate format, described in http://www.e-cell.org/manual/chapter2E.html#3.2.? It provides an alternative to the perl script supplied with the Ecell2 distribution at http://bioinformatics.org/project/?group_id=49. Currently, ECell is at version 3.1.106 (and uses Python as the scripting interface! Yay!). The link to the chapter in the ECell manual is dead. Is anybody using the Bio.ECell module? --Michiel From biopython at maubp.freeserve.co.uk Thu Jun 26 10:43:10 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Jun 2008 15:43:10 +0100 Subject: [Biopython-dev] NCBI Abuse activity with Biopython In-Reply-To: <597121.15112.qm@web62401.mail.re1.yahoo.com> References: <320fb6e00806260553i4a7c5b2cxe5ae5aa0c80e53d1@mail.gmail.com> <597121.15112.qm@web62401.mail.re1.yahoo.com> Message-ID: <320fb6e00806260743u3385955dt2be06d7f8122d8e5@mail.gmail.com> OK then - I will deprecate the Bio.GenBank.search_for() and Bio.GenBank.download_many() functions, suggesting Bio.Entrez instead. I will also update the tutorial on this. On Thu, Jun 26, 2008 at 2:51 PM, Michiel de Hoon wrote: > Btw, NCBIDictionary definitely needs to go.
> From the documentation, continuing the example above: >>>> ncbi_dict = GenBank.NCBIDictionary("nucleotide", "genbank") >>>> gb_record = ncbi_dict[gi_list[0]] > Hence, we're running efetch once for each key separately; this is exactly what NCBI advised against. If the user wants to run an Entrez search and then fetch some/all of the results, then yes, the NCBI would not want us to make multiple separate efetch calls by identifier. Could you prepare an example using Bio.Entrez with the "history" (WebEnv argument)? However, if the user has provided the list of GI numbers (e.g. from a file), there is no existing NCBI search data to refer to, and I don't see any other option. So there is a use-case for the Bio.GenBank.NCBIDictionary class. Peter From mjldehoon at yahoo.com Thu Jun 26 10:49:49 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 26 Jun 2008 07:49:49 -0700 (PDT) Subject: [Biopython-dev] NCBI Abuse activity with Biopython In-Reply-To: <320fb6e00806260743u3385955dt2be06d7f8122d8e5@mail.gmail.com> Message-ID: <525848.21341.qm@web62410.mail.re1.yahoo.com> --- On Thu, 6/26/08, Peter wrote: However, if the user has provided the list of GI numbers (e.g. from a file), there is no existing NCBI search data to refer to, and I don't see any other option. So there is a use-case for the Bio.GenBank.NCBIDictionary class. In that case, the following can be used: >>> from Bio import Entrez >>> idlist = ['123','456','453',.....] # a list of GI numbers >>> ids = ",".join(idlist) >>> handle = Entrez.efetch(db='nucleotide', id=ids, retmode='xml') >>> records = Entrez.read(handle) # records is now a list of records corresponding to '123', '456', '453',... --Michiel.
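For a long list of GI numbers, the comma-joined approach above could be applied in batches rather than as one very large request. What follows is only a minimal sketch, written for current Python rather than the Python 2.3 to 2.5 of this thread. The helper names batch_ids and fetch_in_batches are hypothetical (they are not part of Biopython), and the batch size of 200 is an arbitrary assumption, not an NCBI figure.

```python
def batch_ids(id_list, batch_size=200):
    """Yield comma-joined strings holding at most batch_size IDs each.

    batch_size=200 is an illustrative assumption, not an NCBI limit.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(id_list), batch_size):
        yield ",".join(id_list[start:start + batch_size])


def fetch_in_batches(id_list, db="nucleotide", batch_size=200):
    """Hypothetical helper: one efetch call per batch, results concatenated.

    Needs Biopython and network access; not exercised by the check below.
    """
    # Imported lazily so that batch_ids() can be used without Biopython.
    from Bio import Entrez

    records = []
    for ids in batch_ids(id_list, batch_size):
        handle = Entrez.efetch(db=db, id=ids, retmode="xml")
        records.extend(Entrez.read(handle))
        handle.close()
    return records
```

With this arrangement the number of efetch calls grows with the number of batches rather than with the number of identifiers, which is the point made above about NCBIDictionary.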
From biopython at maubp.freeserve.co.uk Thu Jun 26 12:05:36 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Jun 2008 17:05:36 +0100 Subject: [Biopython-dev] [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: <79693088-0D38-459E-ADEC-FF2757E41912@dalkescientific.com> References: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org> <320fb6e00806260421g48e5807ei92297b372c330e5b@mail.gmail.com> <5CD393BF-D4FB-4700-B7CC-2417C9845010@dalkescientific.com> <320fb6e00806260440n4a933b60of5a7c8eee4e15a89@mail.gmail.com> <79693088-0D38-459E-ADEC-FF2757E41912@dalkescientific.com> Message-ID: <320fb6e00806260905i599a53f3v367045d3ee07ffbf@mail.gmail.com> On Thu, Jun 26, 2008 at 12:48 PM, Andrew Dalke wrote: >> I think we should >> also add the three second wait to the epost() method. > > I see it now. Yes, that needs it as well. Good - I've updated that in CVS, Bio/EUtils/ThinClient.py revision 1.8 >> I can see why Michiel has kept things simple in Bio.Entrez - this >> should cater to most user's needs. > > Sad, but true. EUtils (the server and the client) offer a lot more than > what most users need. > Agreed. Thanks again Andrew for your advice on where Bio.EUtils needed updating - it certainly meant this got dealt with more quickly. 
Peter From biopython at maubp.freeserve.co.uk Thu Jun 26 13:04:26 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Jun 2008 18:04:26 +0100 Subject: [Biopython-dev] NCBI Abuse activity with Biopython In-Reply-To: <320fb6e00806260743u3385955dt2be06d7f8122d8e5@mail.gmail.com> References: <320fb6e00806260553i4a7c5b2cxe5ae5aa0c80e53d1@mail.gmail.com> <597121.15112.qm@web62401.mail.re1.yahoo.com> <320fb6e00806260743u3385955dt2be06d7f8122d8e5@mail.gmail.com> Message-ID: <320fb6e00806261004r227c3340wf390779f1cc4616b@mail.gmail.com> Michiel, I started working on a patch to mark Bio.GenBank.search_for() etc as deprecated, but on reflection I don't really like the longer code needed with Bio.Entrez - for example this one liner: from Bio import GenBank gi_list = GenBank.search_for("Opuntia AND rpl16") becomes: from Bio import Entrez handle = Entrez.esearch(db='nucleotide', term="Opuntia AND rpl16") gi_list = Entrez.read(handle)["IdList"] One idea that might be worth discussing is having variations of the Entrez.e* functions which will parse the XML and return the results. i.e. something like this: def esearch2(...) : """Calls ESearch and parses the returned XML.""" return read(esearch(..., retmode="XML")) Then we can write, from Bio import Entrez gi_list = Entrez.esearch2(db='nucleotide', term="Opuntia AND rpl16")["IdList"] (An alternative naming convention like a "p" might be nicer) My initial plan was to get the search results back as plain text (retmode='uilist'), thus avoiding parsing the XML. However, after reading the Entrez documentation, and some experimentation to confirm this, I was surprised to find the ESearch will only return XML. The NCBI appear to suggest that if you want your search results in another format use the WebEnv session history, and then ask EFetch to reformat it (!). 
This does work, but means making two internet calls: from Bio import Entrez handle = Entrez.esearch(db='nucleotide', term="Opuntia AND rpl16", usehistory="y") session = Entrez.read(handle)['WebEnv'] gi_list = Entrez.efetch(db='nucleotide', WebEnv=session, query_key=1, rettype='uilist').read().split('\n') As an aside, do we really have to include the database in the efetch call above? Peter From biopython at maubp.freeserve.co.uk Thu Jun 26 16:32:07 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Jun 2008 21:32:07 +0100 Subject: [Biopython-dev] New release In-Reply-To: <390323.35893.qm@web62411.mail.re1.yahoo.com> References: <390323.35893.qm@web62411.mail.re1.yahoo.com> Message-ID: <320fb6e00806261332q408cc02boa7ee4c3342b53e4b@mail.gmail.com> On Thu, Jun 26, 2008 at 1:50 PM, Michiel de Hoon wrote: > Hi everybody, > > I think we should make a new Biopython release within the next couple of weeks > to solve the issues with NCBI and to get the fixed Blast parser out (for output > from Blast 2.2.18). There are a few outstanding issues that hopefully can be > fixed before the next release: > 1) NCBI access from Bio.GenBank > 2) Bug #2454 (Iterators can't use file-like objects), which affects a number of parsers in Biopython > 3) Martel-based parsers. Given the updates to Bio.EUtils to enforce the 3 second rule, the urgent part of issue (1) is now resolved, and any further refinements needn't hold up the release. > From a technical viewpoint, none of these are very complicated. 2) is almost finished. While there are still outstanding parsers affected by issue (2) (Bug 2454), I don't think this need hold up the release. > With respect to 3), a small number of parsers in Biopython are based on > Martel (none of the major ones as far as I can tell). For some of these > parsers, it is not quite clear if they are still useful.
For the remaining ones, > it would be nice if they could be rewritten without using Martel -- that would > let us get rid of the dependency on mxTextTools. Again, while removing the dependency on mxTextTools is a worthwhile aim, I don't think this should hold up the release. > Any other urgent issues that need to be resolved before a release? There is an AlignInfo alphabet issue I'm currently working on, and expect to have fixed tomorrow. Peter From dalke at dalkescientific.com Thu Jun 26 17:40:51 2008 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 26 Jun 2008 23:40:51 +0200 Subject: [Biopython-dev] NCBI Abuse activity with Biopython In-Reply-To: <320fb6e00806261004r227c3340wf390779f1cc4616b@mail.gmail.com> References: <320fb6e00806260553i4a7c5b2cxe5ae5aa0c80e53d1@mail.gmail.com> <597121.15112.qm@web62401.mail.re1.yahoo.com> <320fb6e00806260743u3385955dt2be06d7f8122d8e5@mail.gmail.com> <320fb6e00806261004r227c3340wf390779f1cc4616b@mail.gmail.com> Message-ID: <5DF39193-B52A-4EB9-84D3-C9626984DEA8@dalkescientific.com> On Jun 26, 2008, at 7:04 PM, Peter wrote: > I started working on a patch to mark Bio.GenBank.search_for() etc as > deprecated, but on reflection I don't really like the longer code > needed with Bio.Entrez > One idea that might be worth discussing is having variations of the > Entrez.e* functions which will parse the XML and return the results. > i.e. something like this: > > def esearch2(...) : > """Calls ESearch and parses the returned XML.""" > return read(esearch(..., retmode="XML")) What about calling it "search"? That is, the one that does everything the default way as most people expect is the one which doesn't need the prefix? > My initial plan was to get the search results back as plain text > (retmode='uilist'), thus avoiding parsing the XML. However, after > reading the Entrez documentation, and some experimentation to confirm > this, I was surprised to find the ESearch will only return XML. 
The > NCBI appear to suggest that if you want your search results in another > format use the WebEnv session history, and then ask EFetch to reformat > it (!). This does work, but means making two internet calls: That's my memory of it too. > As an aside, do we really have to include the database in the > efetch call above? Yes. Or you did 5 years ago. Andrew dalke at dalkescientific.com From biopython at maubp.freeserve.co.uk Thu Jun 26 17:53:40 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Jun 2008 22:53:40 +0100 Subject: [Biopython-dev] New release In-Reply-To: <320fb6e00806261332q408cc02boa7ee4c3342b53e4b@mail.gmail.com> References: <390323.35893.qm@web62411.mail.re1.yahoo.com> <320fb6e00806261332q408cc02boa7ee4c3342b53e4b@mail.gmail.com> Message-ID: <320fb6e00806261453l649f4ce3i83a6ed38fec54965@mail.gmail.com> >> Any other urgent issues that need to be resolved before a release? > > There is an AlignInfo alphabet issue I'm currently working on, and > expect to have fixed tomorrow. Fixed, I think. Alphabets can be annoying, especially gapped alphabets! 
Peter From biopython at maubp.freeserve.co.uk Thu Jun 26 18:05:45 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Jun 2008 23:05:45 +0100 Subject: [Biopython-dev] NCBI Abuse activity with Biopython In-Reply-To: <5DF39193-B52A-4EB9-84D3-C9626984DEA8@dalkescientific.com> References: <320fb6e00806260553i4a7c5b2cxe5ae5aa0c80e53d1@mail.gmail.com> <597121.15112.qm@web62401.mail.re1.yahoo.com> <320fb6e00806260743u3385955dt2be06d7f8122d8e5@mail.gmail.com> <320fb6e00806261004r227c3340wf390779f1cc4616b@mail.gmail.com> <5DF39193-B52A-4EB9-84D3-C9626984DEA8@dalkescientific.com> Message-ID: <320fb6e00806261505w6e51d168i78987ac109a6f015@mail.gmail.com> On Thu, Jun 26, 2008 at 10:40 PM, Andrew Dalke wrote: > On Jun 26, 2008, at 7:04 PM, Peter wrote: >> >> I started working on a patch to mark Bio.GenBank.search_for() etc as >> deprecated, but on reflection I don't really like the longer code >> needed with Bio.Entrez > >> One idea that might be worth discussing is having variations of the >> Entrez.e* functions which will parse the XML and return the results. >> i.e. something like this: >> >> def esearch2(...) : >> """Calls ESearch and parses the returned XML.""" >> return read(esearch(..., retmode="XML")) > > What about calling it "search"? That is, the one that does everything the > default way as most people expect is the one which doesn't need the prefix? I like that idea for the naming :) What do you think Michiel, as this is your module? Peter From mjldehoon at yahoo.com Thu Jun 26 19:16:23 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 26 Jun 2008 16:16:23 -0700 (PDT) Subject: [Biopython-dev] New release In-Reply-To: <320fb6e00806261332q408cc02boa7ee4c3342b53e4b@mail.gmail.com> Message-ID: <501202.26872.qm@web62413.mail.re1.yahoo.com> OK, then let's make a new release as soon as possible, and perhaps another one soon after that. Tentative date is this Sunday, around noon GMT. 
All biopython unit tests pass (at least, on my machine), so it should be straightforward to build a release. --Michiel. --- On Thu, 6/26/08, Peter wrote: From: Peter Subject: Re: [Biopython-dev] New release To: mjldehoon at yahoo.com Cc: biopython-dev at biopython.org Date: Thursday, June 26, 2008, 4:32 PM On Thu, Jun 26, 2008 at 1:50 PM, Michiel de Hoon wrote: > Hi everybody, > > I think we should make a new Biopython release within the next couple of weeks > to solve the issues with NCBI and to get the fixed Blast parser out (for output > from Blast 2.2.18). There are a few outstanding issues that hopefully can be > fixed before the next release: > 1) NCBI access from Bio.GenBank > 2) Bug #2454 (Iterators can't use file-like objects), which affects a number of parsers in Biopython > 3) Martel-based parsers. Given the updates to Bio.EUtils to enforce the 3 second rule, the urgent part of issue (1) is now resolved, and any futher refinements needn't hold up the release. >From a technical viewpoint, none of these are very complicated. 2) is almost finished. While there are still outstanding parsers affected by issue (2) (Bug 2454), I don't think this need hold up the release. > With respect to 3), a small number of parsers in Biopython are based on > Martel (none of the major ones as far as I can tell). For some of these > parsers, it is not quite clear if they are still useful. For the remaining ones, > it would be nice if they could be rewritten without using Martel -- that would > let us get rid of the dependency on mxTextTools. Again, while removing the dependency on mxTextTools is a worthwhile aim, I don't think this should hold up the release. > Any other urgent issues that need to be resolved before a release? There is an AlignInfo alphabet issue I'm currently working on, and expect to have fixed tomorrow. 
Peter From mjldehoon at yahoo.com Thu Jun 26 19:20:49 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 26 Jun 2008 16:20:49 -0700 (PDT) Subject: [Biopython-dev] NCBI Abuse activity with Biopython In-Reply-To: <320fb6e00806261505w6e51d168i78987ac109a6f015@mail.gmail.com> Message-ID: <900951.88468.qm@web62414.mail.re1.yahoo.com> There are some other possibilities, for example to use the retout parameter. This parameter lets you choose between XML, HTML, plain text, ... format for the results. We could make the rule that without an explicit value for this parameter, the Bio.Entrez.e* functions return the parsed results. If we're not sure what to do, I suggest we keep the search_for function in Bio.GenBank for the upcoming release, and take this issue up later. --Michiel. --- On Thu, 6/26/08, Peter wrote: From: Peter Subject: Re: [Biopython-dev] NCBI Abuse activity with Biopython To: "Biopython Developers Mailing List" Cc: "Andrew Dalke" Date: Thursday, June 26, 2008, 6:05 PM On Thu, Jun 26, 2008 at 10:40 PM, Andrew Dalke wrote: > On Jun 26, 2008, at 7:04 PM, Peter wrote: >> >> I started working on a patch to mark Bio.GenBank.search_for() etc as >> deprecated, but on reflection I don't really like the longer code >> needed with Bio.Entrez > >> One idea that might be worth discussing is having variations of the >> Entrez.e* functions which will parse the XML and return the results. >> i.e. something like this: >> >> def esearch2(...) : >> """Calls ESearch and parses the returned XML.""" >> return read(esearch(..., retmode="XML")) > > What about calling it "search"? That is, the one that does everything the > default way as most people expect is the one which doesn't need the prefix? I like that idea for the naming :) What do you think Michiel, as this is your module? 
Peter _______________________________________________ Biopython-dev mailing list Biopython-dev at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython-dev From biopython at maubp.freeserve.co.uk Thu Jun 26 19:45:50 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 27 Jun 2008 00:45:50 +0100 Subject: [Biopython-dev] NCBI Abuse activity with Biopython In-Reply-To: <900951.88468.qm@web62414.mail.re1.yahoo.com> References: <320fb6e00806261505w6e51d168i78987ac109a6f015@mail.gmail.com> <900951.88468.qm@web62414.mail.re1.yahoo.com> Message-ID: <320fb6e00806261645y1819cddx620d430f34d7e725@mail.gmail.com> On Fri, Jun 27, 2008 at 12:20 AM, Michiel de Hoon wrote: > There are some other possibilities, for example to use the retout parameter. > This parameter lets you choose between XML, HTML, plain text, ... format for > the results. I'm not sure if it's rettype, retmode or retout - but something like that. > We could make the rule that without an explicit value for this > parameter, the Bio.Entrez.e* functions return the parsed results. Your suggestion to automatically do the parsing when XML format is requested would prevent the user from parsing the XML themselves (e.g. using SAX or DOM). It would also spoil my plan to include some of the Entrez sequence XML formats in Bio.SeqIO as this would need Bio.Entrez.efetch(...) to return a handle with XML in it. > If we're not sure what to do, I suggest we keep the search_for function in > Bio.GenBank for the upcoming release, and take this issue up later. That would be expedient.
Peter From bugzilla-daemon at portal.open-bio.org Thu Jun 26 19:47:14 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 26 Jun 2008 19:47:14 -0400 Subject: [Biopython-dev] [Bug 2090] Blast.NCBIStandalone BlastParser fails with blastall 2.2.14 In-Reply-To: Message-ID: <200806262347.m5QNlESr031036@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2090 ------- Comment #16 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-26 19:47 EST ------- Created an attachment (id=952) --> (http://bugzilla.open-bio.org/attachment.cgi?id=952&action=view) Patch to Bio/Blast/NCBIStandalone.py This is a very rough attempt at fixing multiquery BLAST output from recent versions of NCBI BLAST. It seems to work for the file I tested, but breaks the final part of the unit test due to the alignments shown as "Flat Query-Anchored with(out) Identities", described here: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/multi_formats.html See also unit test files bt005 and bt045 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jun 26 20:37:14 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 26 Jun 2008 20:37:14 -0400 Subject: [Biopython-dev] [Bug 2375] Coalescent support through Simcoal2 In-Reply-To: Message-ID: <200806270037.m5R0bEkY000324@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2375 ------- Comment #24 from mdehoon at ims.u-tokyo.ac.jp 2008-06-26 20:37 EST ------- I committed my patch to setup.py, as it seems to work fine with Python 2.3, 2.4, and 2.5 on all platforms. Leaving this bug open, since we still need to remove the workaround in Bio/PopGen/SimCoal/__init__.py. 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Fri Jun 27 10:12:45 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 27 Jun 2008 15:12:45 +0100 Subject: [Biopython-dev] Bio.AlignIO and Bio.Entrez documentation Message-ID: <320fb6e00806270712w134e1c5cm903b811c55fc60e1@mail.gmail.com> Hi all, I've realised that there is quite a lot of new content in the Tutorial since the last release. In addition to my new chapter on Bio.AlignIO, Michiel and I have both spent a good chunk of time on the Bio.Entrez chapter of the tutorial. Michiel wrote the bulk of this chapter and has updated it to cover the new XML parser. I've just been adding information based on the NCBI guidelines (for example encouraging people to include their email address in the Entrez calls), and I've just added another section with an example using the history/webenv for a combined esearch and efetch. If anyone could spare some time to proof read the tutorial, concentrating on either or both of these new chapters (and trying the examples) it would be appreciated. Those of you with CVS access can of course check in any little fixes - but if you spot anything significant its probably worth discussing first. Ideally we can fix any little typos before Michiel releases Biopython 1.46 (tentatively this Sunday, around noon GMT). Peter P.S. If you'd like to help out and can't read or run LaTeX, let me know by email and I'll send you the latest edition of the tutorial as a PDF or HTML file. 
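The combined esearch and efetch pattern mentioned above (search once with the history enabled, then fetch the stored result set in slices) might look roughly like the sketch below. It is not the tutorial's own example. The helper names history_batches and search_and_fetch, the batch size of 100 and the FASTA return type are assumptions made for illustration, and the code targets current Python and a current Biopython, where the contact address is set through Entrez.email.

```python
def history_batches(search_result, batch_size=100):
    """Yield efetch keyword arguments covering every hit of a history search.

    search_result is the parsed output of an esearch call made with
    usehistory="y"; batch_size=100 is an illustrative assumption.
    """
    total = int(search_result["Count"])
    for start in range(0, total, batch_size):
        yield {
            "WebEnv": search_result["WebEnv"],
            "query_key": search_result["QueryKey"],
            "retstart": start,
            "retmax": batch_size,
        }


def search_and_fetch(term, email, db="nucleotide", rettype="fasta", batch_size=100):
    """Hypothetical helper: one esearch, then one efetch per slice.

    Needs Biopython and network access; not exercised by the check below.
    """
    from Bio import Entrez  # lazy import keeps history_batches() dependency-free

    Entrez.email = email  # identify yourself to the NCBI, as their guidelines ask
    handle = Entrez.esearch(db=db, term=term, usehistory="y")
    search_result = Entrez.read(handle)
    handle.close()

    chunks = []
    for params in history_batches(search_result, batch_size):
        handle = Entrez.efetch(db=db, rettype=rettype, retmode="text", **params)
        chunks.append(handle.read())
        handle.close()
    return "".join(chunks)
```

The identifiers never travel back to the client here: each efetch call refers to the result set stored on the NCBI side through WebEnv and query_key.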
From biopython at maubp.freeserve.co.uk Fri Jun 27 11:42:16 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 27 Jun 2008 16:42:16 +0100 Subject: [Biopython-dev] Removing obsolete bits of the Tutorial Message-ID: <320fb6e00806270842r6231adfdo8edff7a07a329cdf@mail.gmail.com> I'm still in documentation mode, and I've just removed bits of documentation of a few deprecated or obsolete bits of code. I've just got to the "BioRegistry - automatically finding sequence sources" section of the tutorial/cookbook, and this either needs major updating or removing. First of all, since Biopython 1.44 the line "from Bio import db" has had to be "from Bio.config.DBRegistry import db". And secondly, given this is all based on Martel parsers, the list of supported formats is now a lot thinner. Would anyone object to me removing this section of the tutorial/cookbook? We might be able to deprecate it too, but I'm not sure what side effects that might have so it's a bit risky this close to a planned release. Then there is the section on "Parser Design" which focuses on the scanner/consumer model and lists lots of the events these parsers (used to) generate. I don't think any of this is useful, and suspect that a lot of it is out of date. Again, should we just remove this section? Peter From mjldehoon at yahoo.com Fri Jun 27 11:54:13 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 27 Jun 2008 08:54:13 -0700 (PDT) Subject: [Biopython-dev] NCBI Abuse activity with Biopython In-Reply-To: <320fb6e00806261645y1819cddx620d430f34d7e725@mail.gmail.com> Message-ID: <224711.6366.qm@web62411.mail.re1.yahoo.com> > > We could make the rule that without an explicit value for this > > parameter, the Bio.Entrez.e* functions return the parsed results. > Your suggestion to automatically do the parsing when XML format is > requested would prevent the user from parsing the XML themselves (e.g.
> using SAX or DOM). Actually I was suggesting to do the parsing only if no format is requested, and to return a handle to XML if XML format is requested. But from the current examples in the Bio.Entrez chapter in the tutorial, it appears that typically users will have to write some glue code anyway to make optimal use of Bio.Entrez for their purposes. In that case, I suppose that whether or not we return a handle or an object from the Bio.Entrez.e* functions makes little difference. --Michiel. From biopython at maubp.freeserve.co.uk Fri Jun 27 12:06:58 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 27 Jun 2008 17:06:58 +0100 Subject: [Biopython-dev] NCBI Abuse activity with Biopython In-Reply-To: <224711.6366.qm@web62411.mail.re1.yahoo.com> References: <320fb6e00806261645y1819cddx620d430f34d7e725@mail.gmail.com> <224711.6366.qm@web62411.mail.re1.yahoo.com> Message-ID: <320fb6e00806270906p3d0d3a1dyf78b64bc2f0afa13@mail.gmail.com> On Fri, Jun 27, 2008 at 4:54 PM, Michiel de Hoon wrote: >> Your suggestion to automatically do the parsing when XML format is >> requested would prevent the user from parsing the XML themselves (e.g. >> using SAX or DOM). > > Actually I was suggesting to do the parsing only if no format is > requested, and to return a handle to XML if XML format is requested. Oh I see. But determining the format is a complex combination of the retmode and rettype parameters... quite confusing in its own right! Especially as there are multiple different XML file formats for the same result set. http://eutils.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html http://eutils.ncbi.nlm.nih.gov/entrez/query/static/efetchlit_help.html > But from the current examples in the Bio.Entrez chapter in the tutorial, it appears > that typically users will have to write some glue code anyway to make optimal > use of Bio.Entrez for their purposes.
In that case, I suppose that whether or not > we return a handle or an object from the Bio.Entrez.e* functions makes little difference. Fair point. Certainly the "esearch and efetch" example is relatively complicated, and having a combined "esearch then parse" function wouldn't make much difference. Let's leave this suggestion for the time being (having versions of the Bio.Entrez functions which include the call to Bio.Entrez.read() to parse the XML). Peter From mjldehoon at yahoo.com Fri Jun 27 12:01:54 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 27 Jun 2008 09:01:54 -0700 (PDT) Subject: [Biopython-dev] Removing obsolete bits of the Tutorial In-Reply-To: <320fb6e00806270842r6231adfdo8edff7a07a329cdf@mail.gmail.com> Message-ID: <215121.11545.qm@web62405.mail.re1.yahoo.com> > I've just got to the "BioRegistry - automatically finding sequence > sources" section of the tutorial/cookbook, and this either needs major > updating or removing > ... > Would anyone object to me removing this section of the > tutorial/cookbook? I think it's better to remove it. Then there is the section on "Parser Design" which focuses on the scanner/consumer model and lists lots of the events these parsers (used to) generate. I don't think any of this is useful, and suspect that a lot of it is out of date. Again, should we just remove this section? That too. Otherwise, we may inadvertently be causing new Biopython developers to write their parsers using this out of date parser design, which as far as I know is not being used in the major Biopython modules.
--Michiel From mjldehoon at yahoo.com Fri Jun 27 12:40:13 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 27 Jun 2008 09:40:13 -0700 (PDT) Subject: [Biopython-dev] Modules to be removed from Biopython Message-ID: <492634.64872.qm@web62414.mail.re1.yahoo.com> Hi everybody, In recent releases, we have been using the rule of thumb to remove all modules from a new Biopython release that were deprecated two releases ago. For the upcoming release, this means that we will remove the modules that were deprecated in Biopython 1.44. In that release, quite a lot of modules were deprecated; these modules will not appear in Biopython 1.46. Some of the modules to be removed are relatively simple cases, which I think can be removed without causing any real pain to anybody: Bio.crc (moved to Bio.SeqUtils.CheckSum) Bio.Fasta.index_file Bio.Fasta.Dictionary Bio.GenBank.index_file Bio.GenBank.Dictionary Bio.Geo.Iterator (replaced by Bio.Geo.parse) Bio.KEGG.Compound.Iterator (replaced by Bio.KEGG.Compound.parse) Bio.KEGG.Enzyme.Iterator (replaced by Bio.KEGG.Enzyme.parse) Bio.KEGG.Map.Iterator (replaced by Bio.KEGG.Enzyme.parse) Bio.lcc (moved to Bio.SeqUtils.lcc) Bio.MarkupEditor Bio.Medline.NLMMedlineXML Bio.Medline.nlmmedline_001211_format Bio.Medline.nlmmedline_010319_format Bio.Medline.nlmmedline_011101_format Bio.Medline.nlmmedline_031101_format Bio.MultiProc Bio.SeqIO.FASTA.py Bio.SeqIO.generic.py But, there is also a set of interconnected modules where it's not 100% clear if they can be removed without causing some surprises: Bio.builders Bio.config Bio.dbdefs Bio.formatdefs Bio.dbdefs Bio.expressions Bio.FormatIO Bio.Std Bio.StdHandler It is probably OK to remove these, since these were deprecated we did not get a barrage of complaints from our users. Personally, I think it is important to keep the code base clean, so I am in favor of removing these (and see if anybody complains; in that case, we can always put these modules back in and make a new release). 
But I can live with keeping these modules for another release round. If anybody thinks that that would be better, please let us know. --Michiel From biopython at maubp.freeserve.co.uk Fri Jun 27 12:50:17 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 27 Jun 2008 17:50:17 +0100 Subject: [Biopython-dev] Modules to be removed from Biopython In-Reply-To: <492634.64872.qm@web62414.mail.re1.yahoo.com> References: <492634.64872.qm@web62414.mail.re1.yahoo.com> Message-ID: <320fb6e00806270950k479eda23ia96d3c2d36557510@mail.gmail.com> On Fri, Jun 27, 2008 at 5:40 PM, Michiel de Hoon wrote: > Hi everybody, > > In recent releases, we have been using the rule of thumb to remove all > modules from a new Biopython release that were deprecated two releases ago. I was wondering if there was a stated policy on this. > For the upcoming release, this means that we will remove the modules > that were deprecated in Biopython 1.44. In that release, quite a lot of > modules were deprecated; these modules will not appear in Biopython 1.46. > > Some of the modules to be removed are relatively simple cases, which I > think can be removed without causing any real pain to anybody: > > Bio.crc (moved to Bio.SeqUtils.CheckSum) > Bio.Fasta.index_file > Bio.Fasta.Dictionary > Bio.GenBank.index_file > Bio.GenBank.Dictionary > Bio.Geo.Iterator (replaced by Bio.Geo.parse) > Bio.KEGG.Compound.Iterator (replaced by Bio.KEGG.Compound.parse) > Bio.KEGG.Enzyme.Iterator (replaced by Bio.KEGG.Enzyme.parse) > Bio.KEGG.Map.Iterator (replaced by Bio.KEGG.Enzyme.parse) > Bio.lcc (moved to Bio.SeqUtils.lcc) > Bio.MarkupEditor > Bio.Medline.NLMMedlineXML > Bio.Medline.nlmmedline_001211_format > Bio.Medline.nlmmedline_010319_format > Bio.Medline.nlmmedline_011101_format > Bio.Medline.nlmmedline_031101_format > Bio.MultiProc > Bio.SeqIO.FASTA.py > Bio.SeqIO.generic.py Those all look fine to remove. I agree here. 
> But, there is also a set of interconnected modules where it's not 100% > clear if they can be removed without causing some surprises: > Bio.builders > Bio.config > Bio.dbdefs > Bio.formatdefs > Bio.dbdefs > Bio.expressions > Bio.FormatIO > Bio.Std > Bio.StdHandler > It is probably OK to remove these, since these were deprecated we did > not get a barrage of complaints from our users. Personally, I think it is > important to keep the code base clean, so I am in favor of removing > these (and see if anybody complains; in that case, we can always put > these modules back in and make a new release). But I can live with > keeping these modules for another release round. If anybody thinks > that that would be better, please let us know. Given some of these are very interconnected, I would be inclined to leave them in for one more release. However I'm content to see them go. If no one else has any qualms, then please carry on. Peter From biopython at maubp.freeserve.co.uk Fri Jun 27 12:54:16 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 27 Jun 2008 17:54:16 +0100 Subject: [Biopython-dev] Removing obsolete bits of the Tutorial In-Reply-To: <215121.11545.qm@web62405.mail.re1.yahoo.com> References: <320fb6e00806270842r6231adfdo8edff7a07a329cdf@mail.gmail.com> <215121.11545.qm@web62405.mail.re1.yahoo.com> Message-ID: <320fb6e00806270954r4ee7b16fw3210cd77f1708a3@mail.gmail.com> On Fri, Jun 27, 2008 at 5:01 PM, Michiel de Hoon wrote: > >> I've just got the the "BioRegistry ? automatically ?nding sequence >> sources" section of the tutorial/cookbook, and this either needs major >> updating or removing >> ... >> Would anyone object to me removing this section of the >> tutorial/cookbook? > > I think it's better to remove it. Gone. >> Then there is the section on "Parser Design" which focuses on the >> scanner/consumer model and lists lots of the events these parsers >> (used to) generate. 
I don't think any of this is useful, and suspect >> that a lot of it is out of date. Again, should we just remove this >> section? > > That too. Otherwise, we may inadvertently be causing new > Biopython developers to write their parsers using this out of > date parser design, which as far as I know is not being used > in the major Biopython modules. It's not entirely out of date - don't SAX based XML parsers do something similar? And quite a few major modules still follow this scheme (e.g. Bio.GenBank and Bio.SwissProt). Anyway, I have removed most of this section leaving only a short overview. Peter From biopython at maubp.freeserve.co.uk Fri Jun 27 13:49:53 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 27 Jun 2008 18:49:53 +0100 Subject: [Biopython-dev] Recent Bio.Nexus updates Message-ID: <320fb6e00806271049vdfb15co30a05c0a93963aba@mail.gmail.com> Hi Frank, I see you've got your CVS access working again - good :) I wanted to ask you about two of your recent changes to Bio/Nexus/Nexus.py First of all, you've added a new method export_phylip(), which seems to be a simple function to record the Nexus object's alignment as a PHYLIP format alignment. One point of concern is code duplication (Bio.AlignIO can write PHYLIP files). Also, you don't seem to be following the "spec" strictly, as the taxon names are not cropped to ten characters, nor are any "illegal" characters dealt with. More generally, I wonder if this method is really needed - perhaps instead a general method to return a Bio.Align.Generic.Alignment object would be preferable. This could then be used in conjunction with any of the alignment formats supported in Bio.AlignIO. Secondly, you seem to have reverted the alphabet change to Bio/Nexus/Nexus.py made in revision 1.12 to fix Bug 2380. Was this deliberate or just accidental? 
http://bugzilla.open-bio.org/show_bug.cgi?id=2380 Thanks, Peter From biopython at maubp.freeserve.co.uk Fri Jun 27 17:58:04 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 27 Jun 2008 22:58:04 +0100 Subject: [Biopython-dev] [BioPython] Entrez In-Reply-To: <1214569152.6026.9.camel@ubuntu> References: <1214494546.6215.3.camel@ubuntu> <320fb6e00806260857i619d4947l130791ab8276f992@mail.gmail.com> <1214562160.6026.2.camel@ubuntu> <320fb6e00806270416x76d8b388mdd79577927001f32@mail.gmail.com> <1214569152.6026.9.camel@ubuntu> Message-ID: <320fb6e00806271458t4e043c39sb664c4346c8a6949@mail.gmail.com> Just forwarding this to the mailing list - Binbin's problem is resolved (although I don't know what was wrong originally). A happy ending :) Peter ---------- Forwarded message ---------- From: binbin Date: Fri, Jun 27, 2008 at 1:19 PM Subject: Re: [BioPython] Entrez To: Peter i re-install the biopyton1.45 and now i can import Entrez! thanks very much! ? 2008-06-27?? 13:16 +0200?Peter??? > On Fri, Jun 27, 2008 at 11:22 AM, binbin wrote: > > thank you for answering, i am a beginner of biopython,in the "Biopython > > Tutorial and Cookbook": > > 2.5 Connecting with biological databases: > > this is found > > "from Bio import Entrez" > > > > i tried this but it did work for me, that is why i asked. > > That should have worked if your installation of Biopython 1.45 was successful. > > We may be able to work out what is wrong. What operating system are > you using, which version of python, and how did you install Biopython? 
> > Regards, > > Peter From biopython at maubp.freeserve.co.uk Fri Jun 27 18:06:14 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 27 Jun 2008 23:06:14 +0100 Subject: [Biopython-dev] [BioPython] Bio.SCOP.FileIndex In-Reply-To: <141582.2274.qm@web62413.mail.re1.yahoo.com> References: <141582.2274.qm@web62413.mail.re1.yahoo.com> Message-ID: <320fb6e00806271506i1af1db34n1aec65605fd6f83c@mail.gmail.com> On Wed, Jun 25, 2008 at 3:04 PM, Michiel de Hoon wrote: > Hi everybody, > > When I was modifying Bio.SCOP, I noticed that Bio.SCOP.FileIndex is flawed > if file reading is done via a buffer (which is often the case in Python). Are you talking about Bio/SCOP/FileIndex.py? The whole design seems to be geared to indexing the position of record in a file - down to the fact that it takes as filename rather than a handle. Why does it need "fixing"? > Before we try to fix this, is anybody actually using Bio.SCOP.FileIndex? > If not, I think we should deprecate it instead of trying to fix it. We've deprecated similar functionality in Bio.GenBank, although if I recall correctly that was because it was using Martel and broke with mxTextTools 3.0, and therefore fixing it was non-trivial. If Bio.SCOP.FileIndex is broken, then deprecation seems sensible. Peter From mjldehoon at yahoo.com Fri Jun 27 22:21:53 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 27 Jun 2008 19:21:53 -0700 (PDT) Subject: [Biopython-dev] [BioPython] Bio.SCOP.FileIndex In-Reply-To: <320fb6e00806271506i1af1db34n1aec65605fd6f83c@mail.gmail.com> Message-ID: <216781.61321.qm@web62403.mail.re1.yahoo.com> --- On Fri, 6/27/08, Peter wrote: Are you talking about Bio/SCOP/FileIndex.py? The whole design seems to begeared to indexing the position of record in a file - down to the fact that it takes as filename rather than a handle. Why does it need "fixing"? 
FileIndex pulls out records from the iterator one by one, and then calls .tell() on the file handle to find the starting position of each record. The problem is that (due to buffered reading from the file handle) .tell() does not correspond to the record starting positions. Taking the essential pieces of FileIndex:

>>> input = open("mydatafile.txt")
>>> while True:
...     next_line = input.next()
...     print input.tell()
...
8192
8192
8192
8192
8192
...
8192
8192
18432
18432
18432
...

It works because the iterators that are actually used in Bio.SCOP call readline() internally, which reads exactly one line so that .tell() returns the expected answer. But, calling readline() in the iterator is a limitation (e.g., you cannot run it on a list of lines). Another option is to let FileIndex itself call readline():

class FileIndex(dict):
    def __init__(self, filename, record_gen, key_gen):
        ...
        f = open(filename)
        while True:
            line = f.readline()
            self[key] = f.tell() # store location
        ...
    def __getitem__(self, key):
        location = dict.__getitem__(self, key)
        f.seek(location)
        line = f.readline()
        return record_gen(line)

This works, but it means changing how users call FileIndex. Which is also OK, but before modifying FileIndex it would be good to know if anybody is actually using this functionality. --Michiel.
From mjldehoon at yahoo.com Fri Jun 27 22:28:48 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 27 Jun 2008 19:28:48 -0700 (PDT) Subject: [Biopython-dev] Bio.GenBank.NCBIDictionary, Bio.PubMed.Dictionary Message-ID: <982950.87150.qm@web62409.mail.re1.yahoo.com> Does anybody have any further objections to deprecating Bio.GenBank.NCBIDictionary and Bio.PubMed.Dictionary? These two classes download records from NCBI one by one, which is exactly what NCBI advised against.
--Michiel From bugzilla-daemon at portal.open-bio.org Sat Jun 28 16:09:44 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 28 Jun 2008 16:09:44 -0400 Subject: [Biopython-dev] [Bug 2530] New: Bio.Seq.translate() treats invalid codons as stops Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2530 Summary: Bio.Seq.translate() treats invalid codons as stops Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk The following results are with CVS. Biopython 1.45 may be different, I have recently tweaked the translate function for some less dramatic issues. I would like Bio.Seq.translate() to raise exceptions on untranslatable codons, rather than inserting a stop character. e.g. for "N@N" or "TA-". Currently: >>> from Bio.Seq import translate >>> translate("TAA") '*' >>> translate("TAG") '*' >>> translate("TAA") '*' >>> translate("TAC") 'Y' >>> translate("TAN") ... Bio.Data.CodonTable.TranslationError: 'TAN' >>> translate("NNN") ... Bio.Data.CodonTable.TranslationError: 'TAN' >>> translate("AAA") 'K' >>> translate("ANA") 'X' >>> translate("AXA") 'X' That is all fine. However, >>> translate("A@A") '*' >>> translate("A-A") '*' These should also raise a TranslationError. Suggested non-trivial patch to follow. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
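The behaviour requested in Bug 2530 can be illustrated with a small stand-alone sketch. It does not use Biopython: the names strict_translate, FORWARD_TABLE and the local TranslationError class are hypothetical, and the table is deliberately tiny. The assumption is simply that stop codons are ordinary table entries mapping to "*", so any codon missing from the table is reported as an error instead of being turned into a stop.

```python
class TranslationError(ValueError):
    """Raised for a codon that cannot be translated (hypothetical local class)."""


# Deliberately tiny table for illustration only, not a full genetic code.
# Stop codons are ordinary entries that map to "*".
FORWARD_TABLE = {
    "AAA": "K",
    "TAC": "Y",
    "TAT": "Y",
    "TAA": "*",
    "TAG": "*",
    "TGA": "*",
}


def strict_translate(sequence, table=FORWARD_TABLE):
    """Translate codon by codon, raising TranslationError on unknown codons."""
    sequence = sequence.upper()
    if len(sequence) % 3:
        raise TranslationError("length %d is not a multiple of three" % len(sequence))
    protein = []
    for start in range(0, len(sequence), 3):
        codon = sequence[start:start + 3]
        try:
            protein.append(table[codon])
        except KeyError:
            # Anything not in the table ("A@A", "A-A", ...) is an error, not a stop.
            raise TranslationError(codon)
    return "".join(protein)
```

With this table, strict_translate("AAATAA") gives "K*", while "A@A" and "A-A" raise TranslationError. Ambiguous codons such as "ANA" would also raise here, because the sketch makes no attempt to handle ambiguity codes.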
From bugzilla-daemon at portal.open-bio.org Sat Jun 28 16:19:09 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 28 Jun 2008 16:19:09 -0400 Subject: [Biopython-dev] [Bug 2530] Bio.Seq.translate() treats invalid codons as stops In-Reply-To: Message-ID: <200806282019.m5SKJ9l2011097@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2530 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-28 16:19 EST ------- Created an attachment (id=953) --> (http://bugzilla.open-bio.org/attachment.cgi?id=953&action=view) Patch to Bio/Seq.py Bio/Data/CodonTable.py and the test_seq.py unit test The basic idea of this patch is to include the stop codons in the CodonTable's forward table dictionary. Currently, when doing the translation a stop codon is inserted when the key is undefined (but this also happens for invalid codons). Instead, by including the stop codons in the forward table, we can do a single mapping. Any KeyError becomes a translation error. However, this is a fairly significant change to the existing CodonTable objects. They are a strange bunch of objects - with the ambiguous codon tables being very odd. I have replaced all of these with a single codon table which includes all the DNA and RNA codons, including the ambiguous ones. All the existing variants of DNA/RNA/Generic and (un)ambiguous CodonTables are now replaced with the single object. We still have one per NCBI codon table. I think that the CodonTable could be made simpler still, but I wanted to at least try and remain API backwards compatible (bar the dictionary change). Then, I tweaked the Bio.Seq translate method to take advantage of this. NOTE - We don't have a unit test for Bio.Data.CodonTable or Bio.Translate, so it would be wise to write one BEFORE committing this patch. If there are any other bits of code using Bio.Data.CodonTable they could also be affected.
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Sat Jun 28 16:32:09 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 28 Jun 2008 21:32:09 +0100 Subject: [Biopython-dev] Failing unit tests under Windows Message-ID: <320fb6e00806281332v44ba6139xd2531c57f53f92e@mail.gmail.com> I run python 2.3.5 on Windows, and compile from source with MSCV 6.0 (which is a different setup to the one Michiel uses for the builds). I just thought I should document the unit test oddities I see on this machine: test_ProtPram - fails with a single floating point difference, 0.562 versus 0.563. test_Wise - doesn't fail gracefully due to a problem detecting dnal http://bugzilla.open-bio.org/show_bug.cgi?id=2469 test_psw - fails due to a "doctest of" versus "Doctest: " string difference. This may be due to the different version of python? We can probably fix this in run_tests.py test_KDTree - fails with ImportError: No module named _CKDTree I do select yes when asked if I want to build Bio.KDTree - does this work for anyone under Windows? Peter From bugzilla-daemon at portal.open-bio.org Sat Jun 28 16:39:45 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 28 Jun 2008 16:39:45 -0400 Subject: [Biopython-dev] [Bug 2530] Bio.Seq.translate() treats invalid codons as stops In-Reply-To: Message-ID: <200806282039.m5SKdjUA011740@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2530 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-28 16:39 EST ------- Actually there is a unit test, test_translate.py - maybe the lower case T confused me? The bad news is this unit test fails with my patch, due to the Bio.Translate module using an incredibly strict check on the alphabet. 
I'll try and come up with a less invasive change to Bio.Data.CodonTable which makes Bio.Translate happy again - but probably not tonight. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Jun 28 21:57:54 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 28 Jun 2008 21:57:54 -0400 Subject: [Biopython-dev] [Bug 2530] Bio.Seq.translate() treats invalid codons as stops In-Reply-To: Message-ID: <200806290157.m5T1vshF022329@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2530 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #953 is|0 |1 obsolete| | ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-28 21:57 EST ------- (From update of attachment 953) There is an underlying issue in Bio.Data.CodonTable, which is at least commented: # These two are WRONG! I need to get the # list of ambiguous codons which code for # the stop codons XXX For example, R = A or G, so UAR = UAA or UAG / TAR = TAA or TAG = stop codons. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
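The point in Comment #3 (R = A or G, so TAR covers TAA and TAG and is therefore always a stop) can be sketched as a check on the expansions of an ambiguous codon. This is only an illustration under stated assumptions: expand_codon and is_ambiguous_stop are hypothetical helper names, not functions from Bio.Data.CodonTable, and the stop codon set used in the example is that of the standard genetic code.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes (DNA alphabet).
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_codon(codon):
    """Return every unambiguous codon covered by a possibly ambiguous codon."""
    return ["".join(bases) for bases in product(*(IUPAC_DNA[base] for base in codon))]


def is_ambiguous_stop(codon, stop_codons):
    """True only if every expansion of the codon is a stop codon."""
    return all(expanded in stop_codons for expanded in expand_codon(codon))


# Stop codons of the standard genetic code.
STANDARD_STOPS = set(["TAA", "TAG", "TGA"])
```

Here is_ambiguous_stop("TAR", STANDARD_STOPS) and is_ambiguous_stop("TRA", STANDARD_STOPS) are both true, whereas "TRR" is rejected because one of its expansions, TGG, is not a stop codon.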
From bugzilla-daemon at portal.open-bio.org Sat Jun 28 22:37:01 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 28 Jun 2008 22:37:01 -0400 Subject: [Biopython-dev] [Bug 2530] Bio.Seq.translate() treats invalid codons as stops In-Reply-To: Message-ID: <200806290237.m5T2b1Wu023585@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2530 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-28 22:37 EST ------- Created an attachment (id=954) --> (http://bugzilla.open-bio.org/attachment.cgi?id=954&action=view) Rough patch to Bio/Data/CodonTable.py This includes some self testing, but needs further validation before being trusted. For example, is it enough to compare just pairs of unambiguous start/stop codons when generating the set of possible ambiguous start/stop codons? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mjldehoon at yahoo.com Sun Jun 29 02:22:43 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 28 Jun 2008 23:22:43 -0700 (PDT) Subject: [Biopython-dev] [BioPython] Bio.SCOP.FileIndex In-Reply-To: <216781.61321.qm@web62403.mail.re1.yahoo.com> Message-ID: <584421.23968.qm@web62410.mail.re1.yahoo.com> It turned out that Bio.SCOP.FileIndex was used as a base class in Bio.SCOP.Cla and Bio.SCOP.Raf. Without using Bio.SCOP.FileIndex as a base class, the derived classes in Bio.SCOP.Cla and Bio.SCOP.Raf were easy to fix. So I deprecated Bio.SCOP.FileIndex, while keeping Bio.SCOP's functionality intact by fixing the derived classes. 
--Michiel From bugzilla-daemon at portal.open-bio.org Sun Jun 29 02:24:42 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 29 Jun 2008 02:24:42 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806290624.m5T6Og3F029458@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #19 from mdehoon at ims.u-tokyo.ac.jp 2008-06-29 02:24 EST ------- Bio.SCOP is fixed now (added a parse() function as a replacement for the Iterator class, which is now deprecated). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sun Jun 29 06:09:25 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 29 Jun 2008 06:09:25 -0400 Subject: [Biopython-dev] [Bug 2508] NCBIStandalone.blastall: provide support for '-F F' and make it safe In-Reply-To: Message-ID: <200806291009.m5TA9PfZ021963@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2508 ------- Comment #7 from mmokrejs at ribosome.natur.cuni.cz 2008-06-29 06:09 EST ------- Quoting from http://www.python.org/dev/peps/pep-0324/ - No implicit call of /bin/sh. This means that there is no need for escaping dangerous shell meta characters. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
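The PEP 324 remark quoted in Comment #7 can be illustrated with the subprocess module (available from Python 2.4). In this sketch run_without_shell is a hypothetical helper name, and the child process is just the Python interpreter echoing its first argument; no real blastall options are assumed. Because the command is passed as a list and no shell is involved, a string containing shell metacharacters reaches the child unchanged as a single argument.

```python
import subprocess
import sys


def run_without_shell(args):
    """Run a command given as a list of arguments and return its standard output."""
    process = subprocess.Popen(args, stdout=subprocess.PIPE, universal_newlines=True)
    output, _ = process.communicate()
    if process.returncode != 0:
        raise RuntimeError("command failed with exit status %d" % process.returncode)
    return output


# Stand-in for an external tool: a child Python process that prints its first argument.
echo_argument = [sys.executable, "-c", "import sys; print(sys.argv[1])"]

# A filename-like string that would be dangerous if it were pasted into a shell command line.
hostile = "query.fasta; echo INJECTED"
result = run_without_shell(echo_argument + [hostile])
```

The value of result is the hostile string followed by a newline: the semicolon is never interpreted, since no /bin/sh is started.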
From bugzilla-daemon at portal.open-bio.org Sun Jun 29 06:55:04 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 29 Jun 2008 06:55:04 -0400 Subject: [Biopython-dev] [Bug 2508] NCBIStandalone.blastall: provide support for '-F F' and make it safe In-Reply-To: Message-ID: <200806291055.m5TAt4qX023404@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2508 ------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-29 06:55 EST ------- Hmm. Another reason to move to Python 2.4+, see also Bug 2480. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mjldehoon at yahoo.com Sun Jun 29 07:15:00 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sun, 29 Jun 2008 04:15:00 -0700 (PDT) Subject: [Biopython-dev] CVS freeze for release 1.46 Message-ID: <799546.26730.qm@web62413.mail.re1.yahoo.com> Hi everybody, I will start to creating the new release from now. Please don't make any commits to CVS until the new release is out. Thanks! --Michiel. 
From bugzilla-daemon at portal.open-bio.org Sun Jun 29 10:35:11 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 29 Jun 2008 10:35:11 -0400 Subject: [Biopython-dev] [Bug 2530] Bio.Seq.translate() treats invalid codons as stops In-Reply-To: Message-ID: <200806291435.m5TEZBAh032091@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2530 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #954 is|0 |1 obsolete| | ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-29 10:35 EST ------- Created an attachment (id=955) --> (http://bugzilla.open-bio.org/attachment.cgi?id=955&action=view) Patches Bio/Data/CodonTable.py for ambiguous start/stop codons This implements the stub function list_ambiguous_codons, and adds a lot of in-situ asserts which could later be moved to a unit test. e.g. ['TAG', 'TAA'] -> ['TAG', 'TAA', 'TAR'] ['UAG', 'UGA'] -> ['UAG', 'UGA', 'URA'] Note that ['TAG', 'TGA'] -> ['TAG', 'TGA'], this does not add 'TRR' as this could be a stop codon or a coding amino acid. Thus only two more codons are added in the following example: e.g. ['TGA', 'TAA', 'TAG'] -> ['TGA', 'TAA', 'TAG', 'TRA', 'TAR'] -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mjldehoon at yahoo.com Sun Jun 29 10:43:25 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sun, 29 Jun 2008 07:43:25 -0700 (PDT) Subject: [Biopython-dev] New release 1.46 Message-ID: <899008.26338.qm@web62403.mail.re1.yahoo.com> Hi everybody, Release 1.46 is essentially done. Feel free to start committing to CVS again. Currently I am not able to update Biopython's wiki pages.
This looks like an problem with the wiki, since I am getting a blank screen without any error message. So I cannot update the website and send out the announcement yet. --Michiel From biopython at maubp.freeserve.co.uk Sun Jun 29 11:09:47 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 29 Jun 2008 16:09:47 +0100 Subject: [Biopython-dev] New release 1.46 In-Reply-To: <899008.26338.qm@web62403.mail.re1.yahoo.com> References: <899008.26338.qm@web62403.mail.re1.yahoo.com> Message-ID: <320fb6e00806290809r6ad238d3r3a16dfa145bc0186@mail.gmail.com> On Sun, Jun 29, 2008 at 3:43 PM, Michiel de Hoon wrote: > Hi everybody, > > Release 1.46 is essentially done. Feel free to start committing to CVS again. Well done - I hope you didn't give up your whole weekend for this. > Currently I am not able to update Biopython's wiki pages. This looks like an problem > with the wiki, since I am getting a blank screen without any error message. So I > cannot update the website and send out the announcement yet. I've been in touch with the OBF about this before. You'll notice all the other project pages are down too (check www.biosql.org and www.bioperl.org for example). I'm told they have something in place to automatically reboot the server, so it should fix itself within an hour or so, but it looks like they haven't resolved the underlying problem. I guess this means the new release files themselves are still waiting on your local machine(s)? That's a shame. Peter From mjldehoon at yahoo.com Sun Jun 29 11:07:36 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sun, 29 Jun 2008 08:07:36 -0700 (PDT) Subject: [Biopython-dev] Removing obsolete bits of the Tutorial In-Reply-To: <320fb6e00806270954r4ee7b16fw3210cd77f1708a3@mail.gmail.com> Message-ID: <176230.99034.qm@web62415.mail.re1.yahoo.com> >> Then there is the section on "Parser Design" which focuses on the >> scanner/consumer model and lists lots of the events these parsers >> (used to) generate. 
I don't think any of this is useful, and suspect >> that a lot of it is out of date. Again, should we just remove this >> section? > > That too. Otherwise, we may inadvertently be causing new > Biopython developers to write their parsers using this out of > date parser design, which as far as I know is not being used > in the major Biopython modules. It's not entirely out of date - don't SAX based XML parsers do something similar? Yes, but there's a difference: In an XML file, we need to find out where the XML tags are to be able to parse the file. These tags can appear anywhere in the file. In flat-file text formats, typically different information is stored in different lines. So finding out where one piece of information ends and another one starts becomes trivial. We just need to pull out the lines one by one, and check whether they are a new piece of information or a continuation of the current piece of information. Especially for simple formats (e.g. Fasta), using a scanner / consumer model can be unnecessarily complex. But also for more complicated formats, parsing line by line can be entirely straightforward. For example, have a look at Bio/SwissProt/KeyWList.py, which currently contains a line-by-line parser and a scanner/consumer parser (which is deprecated). The former takes 26 lines, the latter more than a 100. --Michiel. 
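The line-by-line approach described above (pull out the lines one by one and decide whether each starts a new piece of information or continues the current one) can be sketched for a simple Fasta-like format. This is a minimal illustration, not the parser in Bio.SeqIO or Bio/SwissProt/KeyWList.py; the function name parse_fasta_lines is hypothetical. It accepts any iterable of lines, so it works equally on an open file handle or on a plain list of strings.

```python
def parse_fasta_lines(lines):
    """Yield (title, sequence) pairs from an iterable of Fasta-formatted lines."""
    title = None
    chunks = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A ">" line starts a new record; emit the previous one first.
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:].strip()
            chunks = []
        elif title is not None:
            # Any other line continues the sequence of the current record.
            chunks.append(line.strip())
    if title is not None:
        yield title, "".join(chunks)
```

For example, list(parse_fasta_lines([">a desc\n", "ACG\n", "T\n", ">b\n", "GG\n"])) gives [("a desc", "ACGT"), ("b", "GG")]. Lines before the first ">" are silently skipped in this sketch; a stricter parser might raise an error instead.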
From biopython at maubp.freeserve.co.uk Sun Jun 29 11:28:04 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 29 Jun 2008 16:28:04 +0100 Subject: [Biopython-dev] Modules to be removed from Biopython In-Reply-To: <320fb6e00806270950k479eda23ia96d3c2d36557510@mail.gmail.com> References: <492634.64872.qm@web62414.mail.re1.yahoo.com> <320fb6e00806270950k479eda23ia96d3c2d36557510@mail.gmail.com> Message-ID: <320fb6e00806290828u7133ee40x8feba14b19c13be8@mail.gmail.com> > On Fri, Jun 27, 2008 at 5:40 PM, Michiel de Hoon wrote: >> For the upcoming release, this means that we will remove the modules >> that were deprecated in Biopython 1.44. In that release, quite a lot of >> modules were deprecated; these modules will not appear in Biopython 1.46. >> >> Some of the modules to be removed are relatively simple cases, which I >> think can be removed without causing any real pain to anybody: >> >> ... I see you removed most of the easy ones before making Biopython 1.46. Just to let you all know that I've just removed these three: >> Bio.SeqIO.FASTA.py >> Bio.SeqIO.generic.py >> Bio.FormatIO Peter From fkauff at biologie.uni-kl.de Mon Jun 30 04:34:30 2008 From: fkauff at biologie.uni-kl.de (Frank Kauff) Date: Mon, 30 Jun 2008 10:34:30 +0200 Subject: [Biopython-dev] Recent Bio.Nexus updates In-Reply-To: <320fb6e00806271049vdfb15co30a05c0a93963aba@mail.gmail.com> References: <320fb6e00806271049vdfb15co30a05c0a93963aba@mail.gmail.com> Message-ID: <48689A96.4010805@biologie.uni-kl.de> Hi Peter and Michiel, Peter wrote: > Hi Frank, > > I see you've got your CVS access working again - good :) > > I wanted to ask you about two of your recent changes to Bio/Nexus/Nexus.py > > First of all, you've added a new method export_phylip(), which seems > to be a simple function to record the Nexus object's alignment as a > PHYLIP format alignment. One point of concern is code duplication > (Bio.AlignIO can write PHYLIP files). 
Also, you don't seem to be > following the "spec" strictly, as the taxon names are not cropped to > ten characters, nor are any "illegal" characters dealt with. True - I ignored this deliberately. I think except for old PHYLIP itself, all software I know handles longer taxon names by default. The format I used here is sometimes referred to as "relaxed phylip" but as it has become the standard for what people call phylip format, I just kept it this way. > More > generally, I wonder if this method is really needed - perhaps instead > a general method to return a Bio.Align.Generic.Alignment object would > be preferable. This could then be used in conjunction with any of the > alignment formats supported in Bio.AlignIO. > That is a possibility. I would then vouch for adding support for "relaxed phylip" to AlignIO.PhylipIO (which I could easily do with a little modification of Nexus.export_phylip() myself) > Secondly, you seem to have reverted the alphabet change to > Bio/Nexus/Nexus.py made in revision 1.12 to fix Bug 2380. Was this > deliberate or just accidental? > http://bugzilla.open-bio.org/show_bug.cgi?id=2380 > > Sorry for that. I missed that bug. Thanks for re-fixing it. Frank > Thanks, > > Peter > > -- J-Prof. Dr. Frank Kauff Molecular Phylogenetics FB Biologie, 13/276 TU Kaiserslautern Postfach 3049 67653 Kaiserslautern Tel. +49 (0)631 205-2562 Fax.
+49 (0)631 205-2998 email: fkauff at biologie.uni-kl.de skype: frank.kauff From biopython at maubp.freeserve.co.uk Mon Jun 30 05:12:17 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 30 Jun 2008 10:12:17 +0100 Subject: [Biopython-dev] Recent Bio.Nexus updates In-Reply-To: <48689A96.4010805@biologie.uni-kl.de> References: <320fb6e00806271049vdfb15co30a05c0a93963aba@mail.gmail.com> <48689A96.4010805@biologie.uni-kl.de> Message-ID: <320fb6e00806300212m6b129a17he9dfd7c8af7cbc03@mail.gmail.com> >> First of all, you've added a new method export_phylip(), which seems >> to be a simple function to record the Nexus object's alignment as a >> PHYLIP format alignment. One point of concern is code duplication >> (Bio.AlignIO can write PHYLIP files). Also, you don't seem to be >> following the "spec" strictly, as the taxon names are not cropped to >> ten characters, nor are any "illegal" characters dealt with. > > True - I ignored this delibaretely. I think except for old PHYLIP itself, > all software I know handles longer taxon names by default. The format I used > here is sometimes refered to as "relaxed phylip" but as it has become the > standard for what people call phylip formt, so I just kept it this way. Sadly "relaxed phylip" is an even less well defined format! >> More >> generally, I wonder if this method is really needed - perhaps instead >> a general method to return a Bio.Align.Generic.Alignment object would >> be preferable. This could then be used in conjunction with any of the >> alignment formats supported in Bio.AlignIO. > > That is a possibility. I would then vouch for adding support for "relaxed > phylip" to AlignIO.PhylipIO (which I could easily do with a little > mofification of Nexus.export_phylip() myself) Would you expect spaces to be allowed in the names for "relaxed phylip" files? Writing the files is easy - checking that other tools can understand them is more hassle. 
And the flip side of this is reading assorted versions of "relaxed phylip" is also tricky. If you have a collection of various "valid" files (ideally output from or accepted by mainstream tools) we could use that to put together a test suite which would define the de-facto standard. But without that, I wouldn't be so confident about adding this to Biopython. >> Secondly, you seem to have reverted the alphabet change to >> Bio/Nexus/Nexus.py made in revision 1.12 to fix Bug 2380. Was this >> deliberate or just accidental? >> http://bugzilla.open-bio.org/show_bug.cgi?id=2380 > > Sorry for that. I missed that bug. Thaks for re-fixing it. There may be a more elegant way of fixing this. Peter From bugzilla-daemon at portal.open-bio.org Mon Jun 30 06:21:26 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 30 Jun 2008 06:21:26 -0400 Subject: [Biopython-dev] [Bug 2509] Deprecating the .data property of the Seq and MutableSeq objects In-Reply-To: Message-ID: <200806301021.m5UALQVF020449@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2509 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-30 06:21 EST ------- See also Bug 2351, Make Seq more like a string, even subclass string? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
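As a rough illustration of the strict versus "relaxed" PHYLIP distinction discussed in the Bio.Nexus thread above, here is a minimal sketch in plain Python. The function name format_phylip and the choices it makes are assumptions for illustration only: strict mode crops and pads names to ten characters, relaxed mode keeps the full name followed by a single space and refuses names containing whitespace. It is not what Bio.AlignIO or Nexus.export_phylip() actually do.

```python
def format_phylip(records, relaxed=False):
    """Return a sequential PHYLIP-style string for (name, sequence) pairs.

    Hypothetical helper, not part of Biopython. Strict mode crops/pads
    names to 10 characters; relaxed mode keeps the full name.
    """
    records = list(records)
    if not records:
        raise ValueError("need at least one record")
    length = len(records[0][1])
    if any(len(seq) != length for _, seq in records):
        raise ValueError("sequences must all have the same length")

    # Header line: number of taxa, then number of columns.
    lines = [" %d %d" % (len(records), length)]
    for name, seq in records:
        if relaxed:
            # Assumption: whitespace would make the name ambiguous to parse.
            if any(ch.isspace() for ch in name):
                raise ValueError("whitespace in name %r" % name)
            lines.append("%s %s" % (name, seq))
        else:
            # Strict: exactly ten name columns, sequence starts in column 11.
            lines.append("%s%s" % (name[:10].ljust(10), seq))
    return "\n".join(lines) + "\n"


example = [("Homo_sapiens_1", "ACGT"), ("Mus", "ACGA")]
strict_text = format_phylip(example)
relaxed_text = format_phylip(example, relaxed=True)
```

Note that cropping to ten characters can make two different names collide, so a real strict writer would also need a uniqueness check; that is left out of this sketch.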
From bugzilla-daemon at portal.open-bio.org Mon Jun 30 09:35:59 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 30 Jun 2008 09:35:59 -0400 Subject: [Biopython-dev] [Bug 2531] New: Nexus and fasta parsers have a problem with identical taxa names Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2531 Summary: Nexus and fasta parsers have a problem with identical taxa names Product: Biopython Version: 1.44 Platform: Macintosh OS/Version: Mac OS Status: NEW Severity: normal Priority: P4 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: abetanco at staffmail.ed.ac.uk When identical taxa names are used to identify different sequences, the nexus and fasta parser will output both taxa names, but output the same sequence for each of them. If it's not possible to store both sequences, maybe it would be better if only one of the sequences were written out, so at least it's obvious there's a problem? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 30 09:48:24 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 30 Jun 2008 09:48:24 -0400 Subject: [Biopython-dev] [Bug 2531] Nexus and fasta parsers have a problem with identical taxa names In-Reply-To: Message-ID: <200806301348.m5UDmO70030666@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2531 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-30 09:48 EST ------- Which Nexus and Fasta parsers? There is more than one way to load these file formats in Biopython - could you show us some sample code please? You can attach a pair of example input files if it helps. Thanks. Peter. 

From bugzilla-daemon at portal.open-bio.org Mon Jun 30 10:21:41 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 30 Jun 2008 10:21:41 -0400
Subject: [Biopython-dev] [Bug 2531] Nexus and fasta parsers have a problem with identical taxa names
In-Reply-To:
Message-ID: <200806301421.m5UELfPj000799@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2531

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-30 10:21 EST -------
Can I repeat my request that you upload an example file (by creating an attachment to this bug) of a FASTA and NEXUS file that doesn't work for you.

Here is a small Nexus file I just created by hand, with repeated taxon CYS1_DICDI (with almost the same sequence), and then below some example code using Bio.Nexus to parse it.

==================================
#NEXUS
[TITLE: NoName]

begin data;
dimensions ntax=4 nchar=50;
format interleave datatype=protein gap=- symbols="FSTNKEYVQMCLAWPHDRIG";

matrix
CYS1_DICDI -----MKVIL LFVLAVFTVF VSS------- --------RG IPPEEQ----
ALEU_HORVU MAHARVLLLA LAVLATAAVA VASSSSFADS NPIRPVTDRA ASTLESAVLG
CATH_HUMAN ------MWAT LPLLCAGAWL LGV------- -PVCGAAELS VNSLEK----
CYS1_DICDI -----MKVIL LFVLAVFTVF VSS------- --------RG IPPEEQ---X
;
end;
==================================

Then in python,

>>> filename = ...
>>> handle = open(filename)
>>> from Bio.Nexus import Nexus
>>> n = Nexus.Nexus(handle)
>>> print n.matrix.keys()
['CATH_HUMAN', 'CYS1_DICDI', 'CYS1_DICDI.copy', 'ALEU_HORVU']
>>> n.matrix['CYS1_DICDI']
Seq('-----MKVILLFVLAVFTVFVSS---------------RGIPPEEQ----', IUPACProtein())
>>> n.matrix['CYS1_DICDI.copy']
Seq('-----MKVILLFVLAVFTVFVSS---------------RGIPPEEQ---X', IUPACProtein())

Note that Bio.Nexus has automatically renamed the duplicate entry 'CYS1_DICDI.copy' and that their different sequences have been loaded correctly.

From bugzilla-daemon at portal.open-bio.org Mon Jun 30 10:36:06 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 30 Jun 2008 10:36:06 -0400
Subject: [Biopython-dev] [Bug 2531] Nexus and fasta parsers have a problem with identical taxa names
In-Reply-To:
Message-ID: <200806301436.m5UEa6WK001525@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2531

------- Comment #3 from abetanco at staffmail.ed.ac.uk 2008-06-30 10:36 EST -------
Created an attachment (id=956)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=956&action=view)
nexus file

Sorry for the overly complicated nexus file, but I can't seem to reproduce the bug with a simple example. In this case, HI99.Line5 is entered twice, and differs just at three sites (249, 417, and 452). The result I get at those three sites is the first sequence duplicated twice.

              249  417  452
nexus file
HI99.Line5     T    T    A
HI99.Line5     C    C    G
fasta output
HI99.Line5     T    T    A
HI99.Line5     T    T    A

To do the conversion, I used this, which I think is just copied off the Biopython documentation site:

#!/usr/bin/python
if __name__ == '__main__' :
    from Bio import SeqIO
    import sys
    input_handle = open(sys.argv[1], "rU")
    output_handle = open(sys.argv[1] + ".fas", "w")
    sequences = SeqIO.parse(input_handle, "nexus")
    SeqIO.write(sequences, output_handle, "fasta")
    output_handle.close()
    input_handle.close()

From bugzilla-daemon at portal.open-bio.org Mon Jun 30 10:52:08 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 30 Jun 2008 10:52:08 -0400
Subject: [Biopython-dev] [Bug 2531] Nexus and fasta parsers have a problem with identical taxa names
In-Reply-To:
Message-ID: <200806301452.m5UEq8DN002181@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2531

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-30 10:52 EST -------
Thanks for the example file - I can now reproduce a problem, which is progress.

There is a rather cryptic error message from Bio.SeqIO, due to the fact that when Bio.Nexus parses the file it doesn't create a matrix. You can see this by using Bio.Nexus directly:

>>> filename = ...
>>> handle = open(filename)
>>> from Bio.Nexus import Nexus
>>> n = Nexus.Nexus(handle)
>>> n.matrix.keys()
Traceback (most recent call last):
  File "", line 1, in
AttributeError: 'NoneType' object has no attribute 'keys'
>>> n.matrix is None
True

This explains why trying to use Bio.SeqIO gives the following exception:

TypeError: argument of type 'NoneType' is not iterable

So, from my point of view this is good news (joke) as it's not really a problem in Bio.SeqIO - although I will fix Bio.SeqIO so it fails gracefully. This seems to be a problem in Bio.Nexus, so it's a job for Frank...

I've got a couple more questions for you:

(1) Where did this file come from?
I'm not an expert on the details of the Nexus file format, but I am wondering which program wrote this file, as perhaps it is invalid in some way? (2) Could we add it to Biopython as an example for our unit tests? It might be a bit big as it is, but we could cut it down a little by hand first. P.S. I have retitled the bug from "Nexus and fasta parsers have a problem with identical taxa names" to "Bio.Nexus has a problem with identical taxa names". You don't seem to be parsing in any FASTA files, just trying to write one. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mjldehoon at yahoo.com Mon Jun 30 10:55:16 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 30 Jun 2008 07:55:16 -0700 (PDT) Subject: [Biopython-dev] New release Message-ID: <97693.82874.qm@web62401.mail.re1.yahoo.com> Sorry, but I still can't edit the Biopython wiki pages, so I can't make the new release available. Can other people edit these pages? --Michiel. From biopython at maubp.freeserve.co.uk Mon Jun 30 10:56:39 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 30 Jun 2008 15:56:39 +0100 Subject: [Biopython-dev] Bug 2531 - Bio.Nexus problem with file with repeated id Message-ID: <320fb6e00806300756l7e9f6fe6sc68cf1884cb2994@mail.gmail.com> Hi Frank, Would you be able to take a look at this new report, bug 2531: http://bugzilla.open-bio.org/show_bug.cgi?id=2531 The reporter Andrea Betancourt says she is using Biopython 1.44, while I am on CVS (which should be equivalent to Biopython 1.46 for Bio.Nexus). Her reported symptoms and what I see are different... but she has provided a test file to work from. 
Thanks, Peter From p.j.a.cock at googlemail.com Mon Jun 30 11:00:22 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 30 Jun 2008 16:00:22 +0100 Subject: [Biopython-dev] New release In-Reply-To: <97693.82874.qm@web62401.mail.re1.yahoo.com> References: <97693.82874.qm@web62401.mail.re1.yahoo.com> Message-ID: <320fb6e00806300800rd74082eqabbd1a2bef66da76@mail.gmail.com> On Mon, Jun 30, 2008 at 3:55 PM, Michiel de Hoon wrote: > Sorry, but I still can't edit the Biopython wiki pages, so I can't make the new > release available. Can other people edit these pages? No - as soon as I saw the wiki came back to life last night I tried, and have tried again today. I can make changes, view the preview and differences, but I just get a blank page when I click submit. I sent off an email to OBF to alert them in case you hadn't. I see the Biopython 1.46 files themselves are now online at http://biopython.org/DIST/ so at least some of the web-server is running properly :) We could just do the announcement by email and the news page, and fix the wiki later. But it does risk causing a little confusion in the short term. Peter From bugzilla-daemon at portal.open-bio.org Mon Jun 30 11:36:17 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 30 Jun 2008 11:36:17 -0400 Subject: [Biopython-dev] [Bug 2531] Nexus and fasta parsers have a problem with identical taxa names In-Reply-To: Message-ID: <200806301536.m5UFaHlo004669@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2531 ------- Comment #5 from abetanco at staffmail.ed.ac.uk 2008-06-30 11:36 EST ------- The file was written by a Windows program called DNAsp (http://www.ub.es/dnasp/), which is widely used by population geneticists, which is not to say that it didn't write an invalid file. But it looked OK to me, other than the too short taxa names. (Those too short names were inherited from another program). 
I don't mind you using for the test unit, but it would be nice if it were cut down or something, as it is both unwieldy and unpublished data. A. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 30 11:38:00 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 30 Jun 2008 11:38:00 -0400 Subject: [Biopython-dev] [Bug 2531] Nexus and fasta parsers have a problem with identical taxa names In-Reply-To: Message-ID: <200806301538.m5UFc0S4004813@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2531 fkauff at biologie.uni-kl.de changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |ASSIGNED ------- Comment #6 from fkauff at biologie.uni-kl.de 2008-06-30 11:38 EST ------- Handling a handle works like a charm for me with the attachment provided: >>> handle=open('eg.nex') >>> n=Nexus.Nexus(handle) >>> n.matrix.keys() ['HI99.Line5.copy', 'am', 'HI99.Line1.copy', 'ezo', 'HI99.Line0.copy', 'DI05.Line5.copy', 'DI05.Line0.copy', 'DI05.Line8.copy1', 'DI05.Line1.copy1', 'HI99.Line3.copy', 'HI99.Line1.copy1', 'DI05.Line1.copy', 'DI05.Line9.copy', 'DI05.Line8.copy', 'HI99.Line4.copy', 'vir', 'DI05.Line8', 'DI05.Line9', 'HI99.Line2.copy', 'DI05.Line2', 'DI05.Line3', 'DI05.Line0', 'DI05.Line1', 'DI05.Line6', 'DI05.Line7', 'DI05.Line4', 'DI05.Line5', 'HI99.Line1', 'HI99.Line0', 'HI99.Line3', 'HI99.Line2', 'HI99.Line5', 'HI99.Line4'] However, Nexus.py needs unique taxon names. Non-unique taxon names won't make much sense in a nexus file imho. If Nexus.py encounters non-unique names, they are unified by adding a suffix (.copy, .copy1, ...) to it. Could this cause problems to SeqIO.NexusIO? 
Frank

From bugzilla-daemon at portal.open-bio.org Mon Jun 30 12:12:29 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 30 Jun 2008 12:12:29 -0400
Subject: [Biopython-dev] [Bug 2531] Nexus and fasta parsers have a problem with identical taxa names
In-Reply-To:
Message-ID: <200806301612.m5UGCTnZ006531@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2531

------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-30 12:12 EST -------
It looks like I didn't have the latest version of Bio.Nexus on this machine which may have added to the confusion. I've just updated to CVS (i.e. almost exactly Biopython 1.46). My issue with the matrix being None has gone away. Oops.

>>> from Bio.Nexus import Nexus
>>> n = Nexus.Nexus(open('eg.nex'))
>>> n.matrix.keys()
['HI99.Line5.copy', 'am', 'HI99.Line1.copy', 'ezo', 'HI99.Line0.copy', 'DI05.Line5.copy', 'DI05.Line0.copy', 'DI05.Line8.copy1', 'DI05.Line1.copy1', 'HI99.Line3.copy', 'HI99.Line1.copy1', 'DI05.Line1.copy', 'DI05.Line9.copy', 'DI05.Line8.copy', 'HI99.Line4.copy', 'vir', 'DI05.Line8', 'DI05.Line9', 'HI99.Line2.copy', 'DI05.Line2', 'DI05.Line3', 'DI05.Line0', 'DI05.Line1', 'DI05.Line6', 'DI05.Line7', 'DI05.Line4', 'DI05.Line5', 'HI99.Line1', 'HI99.Line0', 'HI99.Line3', 'HI99.Line2', 'HI99.Line5', 'HI99.Line4']
>>> assert [id for id in n.matrix] == n.matrix.keys()
>>> n.matrix['HI99.Line5']
Seq('ATCGATAGCATTGCGG-GGACGACGATGGACATTTGGAAAACGAATATGAAAAT...GAG', IUPACAmbiguousDNA())
>>> n.matrix['HI99.Line5'][249-1]
'T'
>>> n.matrix['HI99.Line5'][417-1]
'T'
>>> n.matrix['HI99.Line5'][452-1]
'A'
>>> n.matrix['HI99.Line5.copy']
Seq('ATCGATAGCATTGCGGCGGACGACGATGGACATTTGGAAAACGAATATGAAAAT...GAG', IUPACAmbiguousDNA())
>>> n.matrix['HI99.Line5.copy'][249-1]
'C'
>>> n.matrix['HI99.Line5.copy'][417-1]
'C'
>>> n.matrix['HI99.Line5.copy'][452-1]
'G'

So far this looks good. However:

>>> n.original_taxon_order
['vir', 'am', 'ezo', 'DI05.Line5', 'DI05.Line1', 'DI05.Line9', 'DI05.Line2', 'DI05.Line3', 'HI99.Line2', 'HI99.Line1', 'HI99.Line5', 'DI05.Line4', 'DI05.Line1', 'DI05.Line7', 'HI99.Line3', 'DI05.Line6', 'DI05.Line8', 'HI99.Line4', 'DI05.Line1', 'HI99.Line1', 'DI05.Line8', 'DI05.Line5', 'HI99.Line2', 'HI99.Line0', 'HI99.Line0', 'HI99.Line5', 'DI05.Line9', 'HI99.Line3', 'DI05.Line0', 'DI05.Line0', 'HI99.Line4', 'HI99.Line1', 'DI05.Line8']

In the Bio.SeqIO code that calls Bio.Nexus, I hadn't realized that Bio.Nexus kept the un-edited taxon names around. It is this list of the non-unique original identifiers that Bio.SeqIO was using, which explains why you end up with two copies of HI99.Line5.

Sorry Frank - I was pointing fingers when it was my own bug after all!

From bugzilla-daemon at portal.open-bio.org Mon Jun 30 12:20:20 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 30 Jun 2008 12:20:20 -0400
Subject: [Biopython-dev] [Bug 2531] Nexus and fasta parsers have a problem with identical taxa names
In-Reply-To:
Message-ID: <200806301620.m5UGKK7M007026@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2531

------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-30 12:20 EST -------
Frank,

Looking back, the reason I was using the original_taxon_order list was I wanted to get the sequences in their original order. I see now that I can't use the elements in this list as keys to the matrix because the matrix keys are the modified taxon names.

Is there any way to get the modified taxon names in the original order?
Other than looping over original_taxon_order and repeating your naming algorithm? Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 30 13:07:05 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 30 Jun 2008 13:07:05 -0400 Subject: [Biopython-dev] [Bug 2531] Nexus and fasta parsers have a problem with identical taxa names In-Reply-To: Message-ID: <200806301707.m5UH75I7009356@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2531 ------- Comment #9 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-30 13:07 EST ------- Created an attachment (id=957) --> (http://bugzilla.open-bio.org/attachment.cgi?id=957&action=view) Sample input file Simple example file without a TAXA block Second example file to follow -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon Jun 30 13:22:23 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 30 Jun 2008 13:22:23 -0400 Subject: [Biopython-dev] [Bug 2531] Nexus and fasta parsers have a problem with identical taxa names In-Reply-To: Message-ID: <200806301722.m5UHMNo4010009@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2531 ------- Comment #10 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-30 13:22 EST ------- Created an attachment (id=958) --> (http://bugzilla.open-bio.org/attachment.cgi?id=958&action=view) Second example file Using the first file where there is no TAXA block: >>> from Bio.Nexus import Nexus >>> n = Nexus.Nexus(open('dup_names.nex')) >>> print n.matrix.keys() ['CATH_HUMAN', 'CYS1_DICDI', 'CYS1_DICDI.copy', 'ALEU_HORVU'] >>> print n.original_taxon_order ['CYS1_DICDI', 'ALEU_HORVU', 'CATH_HUMAN', 'CYS1_DICDI.copy'] Then with a TAXA block, >>> n2 = Nexus.Nexus(open('dup_names2.nex')) >>> print n2.matrix.keys() ['CATH_HUMAN', 'CYS1_DICDI', 'CYS1_DICDI.copy', 'ALEU_HORVU'] >>> print n2.original_taxon_order ['CYS1_DICDI', 'ALEU_HORVU', 'CATH_HUMAN', 'CYS1_DICDI'] Notice the different behaviour of the original_taxon_order list. In the first case it gets the modified names, in the second case it doesn't. Is this deliberate Frank? On the other hand, maybe Nexus files without a TAXA block are rare in real life? Are they? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
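Comment #8 asks whether the modified names could be recovered in the original order by "repeating your naming algorithm". A minimal sketch of that idea in plain Python follows. The function unique_names is hypothetical, and the suffix scheme is only inferred from keys shown in this thread such as 'DI05.Line8.copy' and 'DI05.Line8.copy1'; it is not copied from Bio/Nexus/Nexus.py, and what happens beyond the second repeat is an assumption.

```python
def unique_names(names):
    """Return names in their original order, renaming repeats.

    Assumed scheme, inferred from the output in this thread: the first
    repeat of 'X' becomes 'X.copy', the next 'X.copy1', then 'X.copy2'.
    """
    seen = set()
    result = []
    for name in names:
        candidate = name
        count = 0
        # Keep trying suffixes until the candidate has not been used yet.
        while candidate in seen:
            candidate = name + ".copy" + (str(count) if count else "")
            count += 1
        seen.add(candidate)
        result.append(candidate)
    return result


original_order = ["CYS1_DICDI", "ALEU_HORVU", "CATH_HUMAN", "CYS1_DICDI"]
renamed = unique_names(original_order)
```

With the hand-written example file from comment #2, this gives the same names as the matrix keys, but in file order, so each entry could be used to look up its own sequence.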
From fkauff at biologie.uni-kl.de Mon Jun 30 13:10:15 2008
From: fkauff at biologie.uni-kl.de (Frank Kauff)
Date: Mon, 30 Jun 2008 19:10:15 +0200
Subject: [Biopython-dev] [Bug 2531] Nexus and fasta parsers have a problem with identical taxa names
In-Reply-To: <200806301612.m5UGCTnZ006531@portal.open-bio.org>
References: <200806301612.m5UGCTnZ006531@portal.open-bio.org>
Message-ID: <48691377.803@biologie.uni-kl.de>

bugzilla-daemon at portal.open-bio.org wrote:
>
>
> In the Bio.SeqIO code that calls Bio.Nexus, I hadn't realized that Bio.Nexus
> kept the un-edited taxon names around. It is this list of the non-unique
> original identifiers that Bio.SeqIO was using, which explains why you end up
> with two copies of HI99.Line5.
>
> Sorry Frank - I was pointing fingers when it was my own bug after all!
>
>
> Looking back, the reason I was using the original_taxon_order list was I wanted
> to get the sequences in their original order. I see now that I can't use the
> elements in this list as keys to the matrix because the matrix keys are the
> modified taxon names.
>
> Is there any way to get the modified taxon names in the original order? Other
> than looping over original_taxon_order and repeating your naming algorithm?
>
Actually - this *IS* a bug. All fingers were pointing correctly... Original_taxon labels was kept just for compatibility, and is the same as taxlabels. Taxlabels is supposed to have the unique identifiers - it just doesn't work correctly with non-unique ids in interleaved data sets.

Fix following soon Frank From bugzilla-daemon at portal.open-bio.org Mon Jun 30 13:28:25 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 30 Jun 2008 13:28:25 -0400 Subject: [Biopython-dev] [Bug 2531] Nexus and fasta parsers have a problem with identical taxa names In-Reply-To: Message-ID: <200806301728.m5UHSPVk010377@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2531 ------- Comment #11 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-30 13:28 EST ------- Created an attachment (id=959) --> (http://bugzilla.open-bio.org/attachment.cgi?id=959&action=view) Tentative patch to Bio/SeqIO/NexusIO.py This seems to cope with Andrea's real input file and my two hand written ones. It works by taking the original_taxon_order lists, and applying the disambiguation algorithm if needed. Not very elegant! -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 30 15:29:32 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 30 Jun 2008 15:29:32 -0400 Subject: [Biopython-dev] [Bug 2531] Nexus and fasta parsers have a problem with identical taxa names In-Reply-To: Message-ID: <200806301929.m5UJTWYQ015982@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2531 ------- Comment #12 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-30 15:29 EST ------- Created an attachment (id=960) --> (http://bugzilla.open-bio.org/attachment.cgi?id=960&action=view) Suggested patch to Bio/Nexus/Nexus.py This modifies Bio.Nexus to ensure that the original_taxon_order uses the original (duplicated) names, resolving the discrepancy I reported in comment 10. 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 30 17:18:48 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 30 Jun 2008 17:18:48 -0400 Subject: [Biopython-dev] [Bug 2520] Reading ACE assembly contig files in Bio.SeqIO In-Reply-To: Message-ID: <200806302118.m5ULImoB021255@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2520 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-30 17:18 EST ------- Checked into CVS. We'll need to revisit this once we have a good way of dealing with per-letter-annotation which would be suitable for the quality scores. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 30 18:50:01 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 30 Jun 2008 18:50:01 -0400 Subject: [Biopython-dev] [Bug 2532] New: Using IUPAC alphabets in mixed case Seq objects Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2532 Summary: Using IUPAC alphabets in mixed case Seq objects Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk Bio.Alphabets.IUPAC defines a number of alphabets with defined lists of valid letters which are in upper case ONLY. 
Bio.Nexus and Bio.Sequencing.Phd create Seq objects which use these alphabets even with mixed case sequences. This contradicts how I think the alphabet's .letters property is intended to be used (although currently this is not enforced by the Seq object).

I suggest either:
(a) Bio.Nexus etc switch to using generic DNA/RNA alphabets for any Seq objects including lower case letters (or more simply, all Seq objects).
(b) We add lower case and mixed case variants of the alphabet objects, and use the mixed case IUPAC alphabets in Bio.Nexus etc for the Seq objects.

There is also the option of
(c) Extend the existing upper case only IUPAC alphabets to include lower case too, but I fear this could have unexpected side effects (e.g. where people are looping over the expected set of letters).

From bugzilla-daemon at portal.open-bio.org Mon Jun 30 18:51:17 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 30 Jun 2008 18:51:17 -0400
Subject: [Biopython-dev] [Bug 2532] Using IUPAC alphabets in mixed case Seq objects
In-Reply-To:
Message-ID: <200806302251.m5UMpHBf024519@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2532

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-30 18:51 EST -------
Created an attachment (id=961)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=961&action=view)
Patch to Bio.Sequencing.Phd

This takes the simple route of using a generic DNA alphabet.

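To make option (a) from Bug 2532 a little more concrete, here is a minimal sketch in plain Python of checking a sequence against an upper-case-only letter set and falling back to a generic label when it does not fit. The names fits_alphabet and choose_alphabet, and the returned label strings, are made up for illustration; the sketch does not use or reproduce the Bio.Alphabet classes.

```python
# The four unambiguous DNA letters, upper case only (as described in the bug).
UNAMBIGUOUS_DNA_LETTERS = "GATC"


def fits_alphabet(sequence, letters):
    """True if every character of sequence is one of the allowed letters."""
    return set(sequence) <= set(letters)


def choose_alphabet(sequence, letters=UNAMBIGUOUS_DNA_LETTERS):
    """Pick a label: the strict alphabet if it fits, else a generic one.

    Hypothetical helper returning plain strings rather than alphabet objects.
    """
    if fits_alphabet(sequence, letters):
        return "IUPAC unambiguous DNA"
    if fits_alphabet(sequence.upper(), letters):
        # Only the case is wrong, so a strict upper case alphabet is misleading.
        return "generic DNA (mixed case)"
    return "generic DNA"


upper_choice = choose_alphabet("GATTACA")
mixed_choice = choose_alphabet("gatTACA")
other_choice = choose_alphabet("GATNACA")
```

A parser following this approach would only attach the strict alphabet when the letters really match it, which avoids the contradiction with .letters described above.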
From bugzilla-daemon at portal.open-bio.org Mon Jun 2 12:57:20 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 2 Jun 2008 08:57:20 -0400
Subject: [Biopython-dev] [Bug 1944] Align.Generic adding iterator and more
In-Reply-To:
Message-ID: <200806021257.m52CvKt4026676@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1944

------- Comment #14 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-02 08:57 EST -------
I've added simple __str__ and __repr__ methods to the alignment class in Bio/Align/Generic.py CVS revision 1.8, which give output like this:

str(a):
DNAAlphabet() alignment with 3 rows and 14 columns
ACGATCAGCTAGCT Alpha
CCGATCAGCTAGCT Beta
ACGATGAGCTAGCT Gamma

repr(a):
<__main__.Alignment instance (3 records of length 14, DNAAlphabet()) at 9e96c2c>

The string output gets truncated to show a maximum of 20 rows and 50 columns, which allowing for typical identifiers will still display nicely on a default terminal.

I now intend to update the tutorial, as being able to print an alignment should make it much easier to explain and get to grips with.

Note that there is still some interesting code in both attachment 732 (the __getitem__ method) and in attachment 770 (e.g. subclassing list and adding __len__, __add__, __radd__ etc).

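The truncation described in comment #14 (at most 20 rows and 50 columns) could be sketched roughly as below. This is only an illustration in plain Python: the function format_alignment is hypothetical, and the way it marks cut rows and columns with "..." is an assumption, since the message does not say how the real __str__ method shows truncation.

```python
def format_alignment(rows, max_rows=20, max_cols=50):
    """Summarise (identifier, sequence) rows, truncating large alignments.

    Hypothetical helper; the "..." markers are an assumed convention.
    """
    rows = list(rows)
    width = len(rows[0][1]) if rows else 0
    lines = ["alignment with %d rows and %d columns" % (len(rows), width)]
    for name, seq in rows[:max_rows]:
        if len(seq) > max_cols:
            # Keep the start of the row and mark the cut with "..."
            seq = seq[:max_cols - 3] + "..."
        lines.append("%s %s" % (seq, name))
    if len(rows) > max_rows:
        lines.append("... (%d more rows)" % (len(rows) - max_rows))
    return "\n".join(lines)


small = format_alignment([
    ("Alpha", "ACGATCAGCTAGCT"),
    ("Beta", "CCGATCAGCTAGCT"),
    ("Gamma", "ACGATGAGCTAGCT"),
])
large = format_alignment([("row%d" % i, "A" * 60) for i in range(25)])
```

Putting the sequence before the identifier, as in the output quoted above, keeps the columns aligned regardless of identifier length.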
From bugzilla-daemon at portal.open-bio.org Mon Jun 2 13:26:28 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 2 Jun 2008 09:26:28 -0400
Subject: [Biopython-dev] [Bug 2507] New: Adding __getitem__ to SeqRecord for element access and slicing
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2507

Summary: Adding __getitem__ to SeqRecord for element access and slicing
Product: Biopython
Version: Not Applicable
Platform: All
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk
OtherBugsDependingOnThis: 1944

With a Seq object, you can access individual letters and create sub-sequences using slicing. You can even use a stride to reverse the sequence, or select every third letter.

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
>>> print my_seq
GATCGATGGGCCTATATAGGATCGAAAATCGC
>>> my_seq
Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())
>>> my_seq[5:10]
Seq('ATGGG', IUPACUnambiguousDNA())
>>> my_seq[::-1]
Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG', IUPACUnambiguousDNA())
>>> my_seq[5]
'A'

Currently, these operations cannot be done with a SeqRecord object. This enhancement bug is to allow element access and slicing (perhaps even with a stride) on SeqRecord objects, where the annotations are taken into consideration, and preserved as far as reasonably possible.

Looking at the different SeqRecord properties, this is what I think should happen for creating a sub-sequence:

.id, .name, .description (three strings) - preserve? Blindly preserving these may not always be meaningful. For example, if the description was "Complete plasmid" then it doesn't really apply to a sub-sequence. Perhaps we should preserve only the id and name, and set the description to "sub-sequence"?
.annotations (dictionary) - either preserve or lose? Some annotation entries will still be valid for a sub-sequence (e.g. "source" or references). Others will not (e.g. anything describing its coordinates within a larger parent sequence). There is no reliable way to decide on a case by case basis.

.dbxrefs (list of strings) - preserve? Any database cross-references would arguably still apply to a sub-sequence or even a reversed sequence.

.features (list of SeqFeatures) - select only those features still in the new sub-sequence, and adjust their locations for the new coordinates. Supporting strides other than +1 would be complicated! For simplicity, I would say any feature only partially within the sub-sequence should be discarded.

In summary, one clearly defined set of actions on creating a sub-sequence could be to preserve all the annotation data except the SeqFeatures which would be handled sensibly.

[If we later support "per-letter-annotation" in either a Seq or SeqRecord subclass, then this too should be sliced]

Adding a __getitem__ method to the SeqRecord as outlined above should be compatible with the suggestion that the SeqRecord subclasses the Seq object (see bug 2351).

A related point, when accessing single letters, e.g. record[0], should a single letter string be returned (which lacks any annotation) as currently happens with the Seq object?

P.S. I'm marking this new enhancement bug as blocking bug 1944. Once SeqRecord objects support slicing, this would make annotation preserving slicing of alignment objects much more straightforward.

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
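The slicing rules proposed in this report (keep id, name, description, annotations and dbxrefs; keep only features wholly inside the slice and shift their coordinates; drop features when a stride other than +1 is used) could be sketched as follows. MiniRecord and MiniFeature are hypothetical stand-ins written for this illustration, with a plain string as the sequence and half-open feature locations assumed. They are not the Biopython SeqRecord or SeqFeature classes.

```python
class MiniFeature(object):
    """Hypothetical feature with a half-open [start, end) location."""

    def __init__(self, start, end, label):
        self.start = start
        self.end = end
        self.label = label


class MiniRecord(object):
    """Hypothetical record holding a sequence string plus annotation."""

    def __init__(self, seq, id, name="", description="",
                 annotations=None, dbxrefs=None, features=None):
        self.seq = seq
        self.id = id
        self.name = name
        self.description = description
        self.annotations = dict(annotations or {})
        self.dbxrefs = list(dbxrefs or [])
        self.features = list(features or [])

    def __getitem__(self, index):
        if isinstance(index, int):
            # Single letters come back as plain strings (no annotation).
            return self.seq[index]
        if isinstance(index, slice):
            start, stop, step = index.indices(len(self.seq))
            kept = []
            if step == 1:
                for feature in self.features:
                    # Keep only features entirely inside the slice.
                    if start <= feature.start and feature.end <= stop:
                        kept.append(MiniFeature(feature.start - start,
                                                feature.end - start,
                                                feature.label))
            # With any other stride the features are simply dropped.
            return MiniRecord(self.seq[index], id=self.id, name=self.name,
                              description=self.description,
                              annotations=self.annotations,
                              dbxrefs=self.dbxrefs, features=kept)
        raise TypeError("Invalid index")


record = MiniRecord("GATCGATGGGCC", id="example",
                    annotations={"source": "synthetic"},
                    features=[MiniFeature(2, 5, "inside"),
                              MiniFeature(8, 12, "partial")])
sub_record = record[1:9]
print(sub_record.seq, [(f.start, f.end, f.label) for f in sub_record.features])
```

Discarding partially overlapping features keeps the rule simple, at the cost of losing information that a "fuzzy" location could otherwise retain.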
From bugzilla-daemon at portal.open-bio.org Mon Jun 2 13:26:33 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 2 Jun 2008 09:26:33 -0400
Subject: [Biopython-dev] [Bug 1944] Align.Generic adding iterator and more
In-Reply-To:
Message-ID: <200806021326.m52DQXk2029561@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1944

biopython-bugzilla at maubp.freeserve.co.uk changed:

What |Removed |Added
----------------------------------------------------------------------------
BugsThisDependsOn| |2507

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Mon Jun 2 14:00:15 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 2 Jun 2008 10:00:15 -0400
Subject: [Biopython-dev] [Bug 2507] Adding __getitem__ to SeqRecord for element access and slicing
In-Reply-To:
Message-ID: <200806021400.m52E0FJK032027@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2507

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-02 10:00 EST -------
Simple implementation, which ignores the features (non-trivial), to be added to the SeqRecord class in Bio/SeqRecord.py:

    def __getitem__(self, index) :
        if isinstance(index, int) :
            #TODO - Should single letters be returned as just
            #strings? This prevents the inclusion of any annotation.
            #Revisit this once the Seq object is a subclass of string.
            return self.seq[index]
        elif isinstance(index, slice) :
            answer = self.__class__(self.seq[index],
                                    id=self.id,
                                    name=self.name,
                                    description=self.description)
            #COPY the annotation dict and dbxrefs list:
            answer.annotations = dict(self.annotations)
            answer.dbxrefs = self.dbxrefs[:]
            #TODO - select relevant features, and add them with
            #adjusted coordinates. Take special care with a stride!
            return answer
        raise ValueError("Invalid index")

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Mon Jun 2 14:12:29 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 2 Jun 2008 10:12:29 -0400
Subject: [Biopython-dev] [Bug 2507] Adding __getitem__ to SeqRecord for element access and slicing
In-Reply-To:
Message-ID: <200806021412.m52ECT86000330@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2507

------- Comment #2 from jblanca at btc.upv.es 2008-06-02 10:12 EST -------
Does this mean that SeqRecord would deprecate the .seq attribute? If the .seq attribute is not removed, slicing could be used in both ways: my_seq[1:100] and my_seq.seq[1:100].

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From biopython at maubp.freeserve.co.uk Mon Jun 2 14:14:40 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 2 Jun 2008 15:14:40 +0100
Subject: [Biopython-dev] sequence class proposal
In-Reply-To: <1211779470.483a498e18e3e@webmail.upv.es>
References: <320fb6e00805251437n34362f0bm2a323cd1194afaa@mail.gmail.com> <1211779470.483a498e18e3e@webmail.upv.es>
Message-ID: <320fb6e00806020714s2c789f61ke676a448e2ec871a@mail.gmail.com>

In reply to Jose, I (Peter) wrote:
>> One of your points seemed to be that the SeqRecord couldn't have a
>> __getitem__ and methods like reverse, complement, etc. I don't see
>> why it couldn't have these. Perhaps rather than introducing a whole
>> new class, enhancing the SeqRecord would be a better avenue.

I've filed Bug 2507 to try and show what I had in mind for the __getitem__ method.
http://bugzilla.open-bio.org/show_bug.cgi?id=2507

Adding further methods for (reverse) complement etc could be done in much the same way.

Returning to extending Biopython to support per-letter-annotation, I can see two options:

Right now, the SeqRecord object HAS a Seq object. If we create a new RichSeq which subclasses the Seq object to provide per-letter-annotation, then you could use a SeqRecord where the .seq property is in fact a RichSeq object. The SeqRecord class doesn't need to have any changes made for this to work (assuming the RichSeq provides the same API as the Seq object).

If we make the SeqRecord a subclass of the Seq object, then I would suggest either RichSeq subclassing SeqRecord subclassing Seq, or perhaps SeqRecord subclassing RichSeq subclassing Seq. It depends on if you think the id/name/description/dbxrefs/etc properties would be useful in common use cases of the RichSeq object.

It's not going to be possible for all three classes to have the same __init__ parameters without breaking existing scripts (and only supporting the lowest common denominator).

Peter

From jblanca at btc.upv.es Mon Jun 2 19:11:19 2008
From: jblanca at btc.upv.es (Blanca Postigo Jose Miguel)
Date: Mon, 2 Jun 2008 21:11:19 +0200
Subject: [Biopython-dev] Fwd: Re: sequence class proposal
Message-ID: <1212433879.484445d7a6117@webmail.upv.es>

----- Forwarded message from Blanca Postigo Jose Miguel -----
Date: Mon, 2 Jun 2008 21:08:59 +0200
From: Blanca Postigo Jose Miguel
Reply-To: Blanca Postigo Jose Miguel
Subject: Re: [Biopython-dev] sequence class proposal
To: Peter

Quoting Peter:

> In reply to Jose, I (Peter) wrote:
> >> One of your points seemed to be that the SeqRecord couldn't have a
> >> __getitem__ and methods like reverse, complement, etc. I don't see
> >> why it couldn't have these. Perhaps rather than introducing a whole
> >> new class, enhancing the SeqRecord would be a better avenue.
>
> I've filed Bug 2507 to try and show what I had in mind for the
> __getitem__ method.
> http://bugzilla.open-bio.org/show_bug.cgi?id=2507

I think that would be great. I've just added to the bug a question about the .seq property of SeqRecord.

> Adding further methods for (reverse) complement etc could be done in
> much the same way.
>
> Returning to extending Biopython to support per-letter-annotation, I
> can see two options:
>
> Right now, the SeqRecord object HAS a Seq object. If we create a new
> RichSeq which subclasses the Seq object to provide
> per-letter-annotation, then you could use a SeqRecord where the .seq
> property is in fact a RichSeq object. The SeqRecord class doesn't
> need to have any changes made for this to work (assuming the RichSeq
> provides the same API as the Seq object).

Here I had a slightly different idea, but maybe yours is better. Basically my RichSeq proposal is just a SeqRecord with slicing and without the seq property. The problem with the approach that you describe is that the RichSeq should have the per-letter-annotation, so SeqRecord would have a general annotation and RichSeq (in the .seq) would have other features. I would find that confusing.

>
> If we make the SeqRecord a subclass of the Seq object, then I would
> suggest either RichSeq subclassing SeqRecord subclassing Seq, or
> perhaps SeqRecord subclassing RichSeq subclassing Seq. It depends on
> if you think the id/name/description/dbxrefs/etc properties would be
> useful in common use cases of the RichSeq object.

If SeqRecord is a subclass of Seq, RichSeq is not necessary anymore. That's what I was proposing. The problem is that the current users of SeqRecord would have a hard time with the new behaviour, because in that case supporting the seq property would be hard. To avoid that breakage I was proposing to create RichSeq.
RichSeq would be just the SeqRecord that you propose but would allow the users to migrate to RichSeq without forcing them to change to a new SeqRecord behaviour.

>
> Its not going to be possible for all three classes to have the same
> __init__ parameters without breaking existing scripts (and only
> supporting the lowest common denominator).

That's another reason to rename your new proposed SeqRecord to RichSeq.

>
> Peter
>

Jose Blanca

-- 

----- End of forwarded message -----

-- 

From biopython at maubp.freeserve.co.uk Mon Jun 2 19:51:30 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 2 Jun 2008 20:51:30 +0100
Subject: [Biopython-dev] Fwd: Re: sequence class proposal
In-Reply-To: <1212433879.484445d7a6117@webmail.upv.es>
References: <1212433879.484445d7a6117@webmail.upv.es>
Message-ID: <320fb6e00806021251q6cc1a7e8p36125c1326ab7a14@mail.gmail.com>

Jose wrote:
> > I've filed Bug 2507 to try and show what I had in mind for the
> > __getitem__ method.
> > http://bugzilla.open-bio.org/show_bug.cgi?id=2507
>
> I think that would be great.

Good :) Does anyone else want to comment?

> I've just added to the bug a question about the .seq property of SeqRecord.

http://bugzilla.open-bio.org/show_bug.cgi?id=2507#c2 reads:
> Does this mean that SeqRecord would deprecate the .seq attribute?
> If the .seq attribute is not removed slicing could be used in it like:
> my_seq[1:100] and my_seq.seq[1:100].

I was not intending to deprecate the SeqRecord's .seq property at this time (I think that should happen in preparation for if/when the SeqRecord becomes a subclass of the Seq object).
With my idea described on bug 2507, given a SeqRecord object my_seq_record:

my_seq_record[1:100] -> another SeqRecord (with annotation)
my_seq_record.seq[1:100] -> just a Seq object (no annotation)
my_seq_record.seq.tostring()[1:100] -> just a string (no annotation or alphabet)
str(my_seq_record.seq)[1:100] -> just a string (no annotation or alphabet)

These trivial examples would all "contain" the same sequence string.

This enhancement could be done right now, and shouldn't impede any future per-letter-annotation enhancements. Perhaps per-letter-annotation enhancements could be added to the SeqRecord class directly... I need to fully digest the discussion on the BioSQL list, see:
http://lists.open-bio.org/pipermail/biosql-l/2008-May/thread.html

Peter

From mjldehoon at yahoo.com Tue Jun 3 00:19:59 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 2 Jun 2008 17:19:59 -0700 (PDT)
Subject: [Biopython-dev] Bio.Entrez & Bio.EUtil
In-Reply-To: <320fb6e00805300717v60f0b153i88b5e9a8aee1744c@mail.gmail.com>
Message-ID: <624249.42121.qm@web62408.mail.re1.yahoo.com>

OK I'll double-check. I may not have noticed some missing DTDs if they were downloaded automatically from the internet. I think Biopython should ship the most common DTDs. At least the ones needed for test_Entrez, which probably covers most of the use cases of Bio.Entrez.

--Michiel.

Peter wrote:
On 24 May 2008, Michiel de Hoon wrote:
> Dear all,
>
> I have essentially completed the parser in Bio.Entrez.

The internals of the new design look more complicated to start with, but I can see how much more general it is than the older versions :)

Should it work starting from an empty DTDs folder - or will we ship Biopython with most of the current files?

I've had trouble with Biopython trying to fetch missing DTD files from the internet. I think the problem is the NCBI using relative URLs.
The following quick hack seems to help in Parser.py but only in some cases (because as listed below, the NCBI have two different base paths):

279,280c279,288
< warnings.warn("DTD file %s not found in Biopython installation; trying to retrieve it from NCBI" % filename)
< handle = urllib.urlopen(systemId)
---
> warnings.warn("DTD file %s not found in Biopython installation; trying to retrieve it from NCBI" % path)
> if "/" in systemId :
> #Assume this is a full path, e.g.
> #http://www.ncbi.nlm.nih.gov/entrez/query/DTD/nlmmedline_080101.dtd
> handle = urllib.urlopen(systemId)
> else :
> #Its a relative path, and I'm not sure how to best get the base path:
> handle = urllib.urlopen("http://www.ncbi.nlm.nih.gov/entrez/query/DTD/"+systemId)

(Also note there seem to be some tab/space issues in this file).

From http://www.ncbi.nlm.nih.gov/entrez/query/DTD/ I've downloaded the following files using wget:

egquery.dtd eSearch_020511.dtd nlmcommon_080101.dtd pubmed_080101.dtd
eInfo_020511.dtd eSpell.dtd nlmmedline_080101.dtd taxon.dtd
eLink_020511.dtd eSummary_041029.dtd nlmmedlinecitation_080101.dtd uilist.dtd
ePost_020511.dtd nlmsharedcatcit_080101.dtd

Additionally http://www.ncbi.nlm.nih.gov/dtd/ provided some further XML files needed for the test_Entrez.py unit test:

NCBI_GBSeq.dtd NCBI_GBSeq.mod.dtd NCBI_Entity.mod.dtd NCBI_Mim.dtd NCBI_Mim.mod.dtd

With all the above files, then the unit test file test_Entrez.py doesn't give any missing DTD warnings - but still has a couple of failures.
Peter From bugzilla-daemon at portal.open-bio.org Tue Jun 3 04:39:24 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 3 Jun 2008 00:39:24 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806030439.m534dOYI021682@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #4 from mdehoon at ims.u-tokyo.ac.jp 2008-06-03 00:39 EST ------- I agree that type checking is a problem. I am not sure if a specialized function in Bio.File is a good idea. The question is not if "this object is a file-like object", but "does this object have the attributes/methods needed". So I would prefer to add checks only for the required attributes/methods in each of the iterators. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mjldehoon at yahoo.com Tue Jun 3 04:33:27 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 2 Jun 2008 21:33:27 -0700 (PDT) Subject: [Biopython-dev] Bio.Entrez & Bio.EUtil In-Reply-To: <624249.42121.qm@web62408.mail.re1.yahoo.com> Message-ID: <112249.61498.qm@web62410.mail.re1.yahoo.com> I checked but I did not see any missing DTDs. Most of the DTDs in the list you sent are in Biopython's CVS under Bio/Entrez/DTDs, and are included correctly if I do a fresh checkout from CVS. Maybe could you try with a fresh checkout? --Michiel. Michiel de Hoon wrote: OK I'll double-check. I may not have noticed some missing DTDs if they were downloaded automatically from the internet. I think Biopython should ship the most common DTDs. At least the ones needed for test_Entrez, which probably covers most of the use cases of Bio.Entrez. --Michiel. Peter wrote: On 24 May 2008, Michiel de Hoon wrote: > Dear all, > > I have essentially completed the parser in Bio.Entrez. 
The internals of the new design look more complicated to start with, but I can see how much more general it is than the older versions :) Should it work starting from an empty DTDs folder - or will we ship Biopython with most of the current files? I've had trouble with Biopython trying to fetch missing DTD files from the internet. I think the problem is the NCBI using relative URLs. The following quick hack seems to help in Parser.py but only in some cases (because as listed below, the NCBI have two different base paths): 279,280c279,288 < warnings.warn("DTD file %s not found in Biopython installation; trying to retrieve it from NCBI" % filename) < handle = urllib.urlopen(systemId) --- > warnings.warn("DTD file %s not found in Biopython installation; trying to retrieve it from NCBI" % path) > if "/" in systemId : > #Assume this is a full path, e.g. > #http://www.ncbi.nlm.nih.gov/entrez/query/DTD/nlmmedline_080101.dtd > handle = urllib.urlopen(systemId) > else : > #Its a relative path, and I'm not sure how to best get the base path: > handle = urllib.urlopen("http://www.ncbi.nlm.nih.gov/entrez/query/DTD/"+systemId) (Also note there seem to be some tab/space isssues in this file). >From http://www.ncbi.nlm.nih.gov/entrez/query/DTD/ I've downloaded the following files using wget: egquery.dtd eSearch_020511.dtd nlmcommon_080101.dtd pubmed_080101.dtd eInfo_020511.dtd eSpell.dtd nlmmedline_080101.dtd taxon.dtd eLink_020511.dtd eSummary_041029.dtd nlmmedlinecitation_080101.dtd uilist.dtd ePost_020511.dtd nlmsharedcatcit_080101.dtd Additionally http://www.ncbi.nlm.nih.gov/dtd/ provided some further XML files needed for the test_Entrez.py unit test: NCBI_GBSeq.dtd NCBI_GBSeq.mod.dtd NCBI_Entity.mod.dtd NCBI_Mim.dtd NCBI_Mim.mod.dtd With all the above files, then the unit test file test_Entrez.py doesn't give any missing DTD warnings - but still has a couple of failures. 
Peter
_______________________________________________
Biopython-dev mailing list
Biopython-dev at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython-dev

From bugzilla-daemon at portal.open-bio.org Tue Jun 3 09:16:48 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 3 Jun 2008 05:16:48 -0400
Subject: [Biopython-dev] [Bug 2446] Comments in CT tags cause Bio.Sequencing.Ace.ACEParser to fail.
In-Reply-To:
Message-ID: <200806030916.m539GmwZ001955@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2446

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-03 05:16 EST -------
As pointed out on the mailing list, the test cases attached to this bug have disappeared (some expiry issue?).

In the meantime, we could probably just edit the sole existing test case in Tests/Ace/contig1.ace to add a comment to an existing CT tag. Looking at this file, for example edit:

CT{
Contig1 repeat phrap 52 53 555456:555432
This is the forst line of comment for c1
and this the second for c1
}

to become:

CT{
Contig1 repeat phrap 52 53 555456:555432
COMMENT{
This is the first line of comment for c1
and this the second for c1}
}

In the short term, we could either ignore the COMMENT tags within a CT tag, or just treat them as plain text. Supporting the nested structure would require changes to the current Record structure.

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
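To make the short-term option in this comment concrete (reading a COMMENT{ section inside a CT tag and treating its lines as plain text), here is a small stand-alone sketch. The function parse_ct_block is hypothetical and is not part of Bio.Sequencing.Ace. How the nested comment is closed in real ACE files is an assumption here, so the sketch accepts a closing brace at the end of a comment line as in the example above, or on a line of its own (written either as "}" or "C}").

```python
def parse_ct_block(text):
    """Split one CT{...} block into (header fields, plain lines, comment lines).

    Hypothetical helper for illustration only.
    """
    lines = [line.strip() for line in text.strip().splitlines()]
    if len(lines) < 3 or lines[0] != "CT{" or lines[-1] != "}":
        raise ValueError("not a CT block")
    header = lines[1].split()
    notes = []
    comments = []
    in_comment = False
    for line in lines[2:-1]:
        if line == "COMMENT{":
            in_comment = True
        elif in_comment and line.endswith("}"):
            # Assumed terminators: "}" or "C}", alone or ending a text line.
            remainder = line[:-1].rstrip()
            if remainder and remainder != "C":
                comments.append(remainder)
            in_comment = False
        elif in_comment:
            comments.append(line)
        else:
            notes.append(line)
    return header, notes, comments


EXAMPLE = """CT{
Contig1 repeat phrap 52 53 555456:555432
COMMENT{
This is the first line of comment for c1
and this the second for c1}
}"""

header, notes, comments = parse_ct_block(EXAMPLE)
print(header[0], comments)
```

Returning the comment lines separately leaves it to the caller whether to ignore them or append them to the plain text, which are the two short-term options mentioned.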
From bugzilla-daemon at portal.open-bio.org Tue Jun 3 11:46:58 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 3 Jun 2008 07:46:58 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806031146.m53BkwAB009224@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #5 from cracka80 at gmail.com 2008-06-03 07:46 EST ------- (In reply to comment #4) > I agree that type checking is a problem. > I am not sure if a specialized function in Bio.File is a good idea. The > question is not if "this object is a file-like object", but "does this object > have the attributes/methods needed". So I would prefer to add checks only for > the required attributes/methods in each of the iterators. > The function I have written does exactly this - it checks for the necessary attributes and methods for a given object. The iterators would then only need to call ``File.is_filelike()`` on each object passed into them, rather than a type checking procedure. This is in accordance with the design pattern "Program to an 'interface', not an 'implementation'." (Gang of Four). Would you like me to provide a diff against the current revision of Biopython, with suggested changes? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
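A check of the kind described in comment #5 might look roughly like the sketch below. The name is_filelike is taken from the comment, but its real signature and the list of attributes it tests are not given there, so the `required` parameter and its default are assumptions made for this illustration.

```python
import io


def is_filelike(obj, required=("read", "readline")):
    """Return True if obj offers every method named in `required`.

    This tests for an interface (duck typing), not for a concrete type.
    """
    return all(callable(getattr(obj, name, None)) for name in required)


class LineSource(object):
    """Minimal object offering only readline(), for demonstration."""

    def __init__(self, lines):
        self._lines = list(lines)

    def readline(self):
        return self._lines.pop(0) if self._lines else ""


print(is_filelike(io.StringIO(u"ACGT\n")))
print(is_filelike("not a handle"))
print(is_filelike(LineSource(["ACGT\n"]), required=("readline",)))
```

Passing the required method names explicitly lets each iterator ask only for what it actually uses, which is the concern raised in comment #4.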
From bugzilla-daemon at portal.open-bio.org Tue Jun 3 15:07:35 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 3 Jun 2008 11:07:35 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806031507.m53F7Zm7019694@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #6 from mdehoon at ims.u-tokyo.ac.jp 2008-06-03 11:07 EST ------- Two things: 1) Some of the code that does type checking for file-like-ness seems to be quite old and possibly outdated (e.g. Gobase.Iterator). We should take this opportunity to go through these modules and check if they are still useful. 2) Many of these modules (especially the ones that use an "Iterator" class) would be written differently in modern Python (in particular by making use of a generator function instead of an Iterator class). So I'd like to suggest the following: -) For the modules whose usability is dubious in 2008, let's check on the mailing list if anybody is still using them. If not, we can simply deprecate them. -) For the modules that are still useful, use try/except clauses to check for the necessary attributes. The current function checks for 'read', 'readline', 'readlines', and '__iter__', whereas the parser probably only needs one of them. -) If possible, I'd prefer to convert to modern Python as much as possible (though formally that is not within the scope of this bug report). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
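The two suggestions in comment #6 (a try/except that checks only for the method the parser needs, and a generator function in place of an Iterator class) can be combined as in this sketch. The record format, with "//" as a terminator line, and the function names are invented for the example and do not correspond to any particular Biopython parser.

```python
import io


def _records(readline, terminator):
    """Generator yielding each record as a list of lines."""
    record = []
    while True:
        line = readline()
        if not line:
            break
        line = line.rstrip("\n")
        if line == terminator:
            yield record
            record = []
        else:
            record.append(line)
    if record:
        # Tolerate a final record with no terminator line.
        yield record


def iterate_records(handle, terminator="//"):
    """Check for readline() up front, then hand back a generator."""
    try:
        readline = handle.readline
    except AttributeError:
        raise TypeError("handle must provide a readline() method")
    return _records(readline, terminator)


for record in iterate_records(io.StringIO(u"ID one\nSQ acgt\n//\nID two\n//\n")):
    print(record)
```

The attribute check lives in a plain wrapper function so that a bad handle is reported when iterate_records is called, not later when the first record is requested.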
From bugzilla-daemon at portal.open-bio.org Wed Jun 4 19:50:14 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 4 Jun 2008 15:50:14 -0400
Subject: [Biopython-dev] [Bug 2507] Adding __getitem__ to SeqRecord for element access and slicing
In-Reply-To:
Message-ID: <200806041950.m54JoEPj029720@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2507

------- Comment #3 from jblanca at btc.upv.es 2008-06-04 15:50 EST -------
Created an attachment (id=927)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=927&action=view)
RichSeq proposal

I have coded a sequence class that fulfils the requirements that I would like to see. It's very similar to SeqRecord, but it is not compatible with it. It has no seq property, although that can be solved. The problem with SeqRecord is that it is not possible to create a class with an __init__ compatible with Seq and SeqRecord at the same time.

This proposed class is just a draft; it needs more work, but I would like to receive comments about it. It inherits from MutableSeq so it should be named MutableRichSeq, but it seems that I'm too lazy to type such a long name. I promise to change the name in a later version and to create a RichSeq with Seq as parent.

Besides RichSeq, the attachment contains two other classes, RichFeature and BioRange, but I will comment on those in another post.

I think that it is quite important to convert Seq and MutableSeq to new-style classes, what do you think about that? With new-style classes we can use properties.

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
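The closing remark about properties can be illustrated with a small sketch. SketchSeq is a hypothetical class, not the Biopython Seq. It keeps the sequence in a private _data attribute and exposes a read-only data property that issues a DeprecationWarning, which is one way a new-style class could keep an old attribute name working during a transition.

```python
import warnings


class SketchSeq(object):
    """Hypothetical new-style sequence class (not Bio.Seq.Seq)."""

    def __init__(self, data):
        self._data = data  # private by the leading underscore convention

    @property
    def data(self):
        """Read-only access to the sequence string (deprecated)."""
        warnings.warn("The .data property is deprecated; use str(seq).",
                      DeprecationWarning, stacklevel=2)
        return self._data

    def __str__(self):
        return self._data


seq = SketchSeq("ACGT")
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style_value = seq.data
print(old_style_value, len(caught))
```

Because no setter is defined, assigning to seq.data raises AttributeError, so the sequence cannot be modified in place through the old name.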
From bugzilla-daemon at portal.open-bio.org Wed Jun 4 20:19:41 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 4 Jun 2008 16:19:41 -0400
Subject: [Biopython-dev] [Bug 2508] New: NCBIStandalone.blastall: provide support for '-F F' and make it safe
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2508

Summary: NCBIStandalone.blastall: provide support for '-F F' and make it safe
Product: Biopython
Version: 1.44
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: mmokrejs at ribosome.natur.cuni.cz

The local NCBI blast by default masks low-complexity regions using the SEG algorithm. I do not see a variable to affect this in NCBIStandalone.blastall(). Luckily, NCBIStandalone.blastall() is an unsafe function and does not check whether I pass multiple arguments in a value expected to be a string or number. Thus, I can do:

_blast_out, _error_info = NCBIStandalone.blastall('/usr/bin/blastall', 'blastn', blast_db, _blast_file, matrix='IDENTITY -F 0')

but imagine I had done:

_blast_out, _error_info = NCBIStandalone.blastall('/usr/bin/blastall', 'blastn', blast_db, _blast_file, matrix='IDENTITY -F 0; rm -rf /etc/passwd')

The function should be protected against such attacks as if it were directly exposed to web users as a CGI script. I propose a similar defensive strategy for all functions calling os.system(), os.exec*(), os.popen*(), etc.

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
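One defensive approach along the lines requested in this report is to build the command as a list of separate arguments, validate option values, and run it without a shell. The sketch below is stand-alone and does not use or describe NCBIStandalone.blastall. The function names, the filter_low_complexity parameter and the validation pattern are assumptions, and the blastall options used (-p, -d, -i, -M, -F) are those of the legacy NCBI command line tool.

```python
import re
import subprocess

# Assumed rule: option values are simple names (letters, digits, "_", ".").
_SAFE_VALUE = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.]*")


def build_blastall_command(blastcmd, program, database, infile,
                           matrix=None, filter_low_complexity=True):
    """Return the argument list for blastall (hypothetical helper)."""
    command = [blastcmd, "-p", program, "-d", database, "-i", infile]
    if matrix is not None:
        if not _SAFE_VALUE.fullmatch(matrix):
            raise ValueError("suspicious matrix name: %r" % (matrix,))
        command.extend(["-M", matrix])
    command.extend(["-F", "T" if filter_low_complexity else "F"])
    return command


def run_blastall(command):
    """Run the command without a shell and return (stdout, stderr)."""
    process = subprocess.Popen(command, stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE,
                               universal_newlines=True)
    return process.communicate()


print(build_blastall_command("/usr/bin/blastall", "blastn", "my_db",
                             "query.fasta", matrix="IDENTITY",
                             filter_low_complexity=False))
```

Because the arguments are passed as a list and no shell is involved, characters such as ";" or spaces inside a value are never interpreted as command separators; the explicit validation is an additional safeguard that also stops a value from smuggling in extra options.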
From bugzilla-daemon at portal.open-bio.org Thu Jun 5 08:52:47 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 5 Jun 2008 04:52:47 -0400
Subject: [Biopython-dev] [Bug 2507] Adding __getitem__ to SeqRecord for element access and slicing
In-Reply-To:
Message-ID: <200806050852.m558qlPF031059@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2507

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-05 04:52 EST -------
I replied to comment 2 on the mailing list.

I had intended this particular bugzilla entry (bug 2507) to be very narrow in scope - purely a small backwards compatible change to the current SeqRecord.

Some of the questions in comment 3 might have fit better on Bug 2351, although it's getting rather long. Rather than taking this issue further off topic, I'll reply on the mailing list again.

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From biopython at maubp.freeserve.co.uk Thu Jun 5 09:17:00 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 5 Jun 2008 10:17:00 +0100
Subject: [Biopython-dev] Fwd: Re: sequence class proposal
In-Reply-To: <320fb6e00806021251q6cc1a7e8p36125c1326ab7a14@mail.gmail.com>
References: <1212433879.484445d7a6117@webmail.upv.es> <320fb6e00806021251q6cc1a7e8p36125c1326ab7a14@mail.gmail.com>
Message-ID: <320fb6e00806050217y1c437b01qa7fd21d75a609e8c@mail.gmail.com>

This is in reply to Jose's comment 3 on bug 2507, which was quite broad.
http://bugzilla.open-bio.org/show_bug.cgi?id=2507#c3

> I have coded a sequence class that fullfils the requirements that I
> would like to see. It's very similar to SeqRecord, but it is not compatible
> with it. It has no seq property, although that can be solved. The problem
> with SeqRecord is that it is not possible to create a class with an __init__
> compatible with Seq and SeqRecord at the same time.

Even if one day the SeqRecord is a subclass of the Seq object, there is no requirement that it have the same __init__ arguments. In fact, they have to be different because for a SeqRecord you should also supply an identifier (and potentially a name, description and other annotation).

> This proposed class is just a draft, it needs more work but I would like to
> receive comments about it. It inherits from MutableSeq so it should be
> named MutableRichSeq, but it seems that I'm too lazy to such a long name,
> I promise to change the name in a later version and to create a RichSeq
> with Seq as parent.

I agree with you here that when getting a single letter (amino acid or nucleotide) from a sequence with per-letter-annotation, e.g. my_sequence[5], it would be very nice to have the per-letter-annotation like the quality included. This does mean the object returned can't just be a single one character string.

However, because the current Seq and MutableSeq classes return a simple string, unless we return a subclass of a string, this risks breaking other people's code. So, I would conclude that Seq needs to subclass a string BEFORE we start including support for per-letter-annotation. Ideally we would have alphabet aware versions of all the string functions before we made this change (see Bug 2351).

> Besides RichSeq there is in the attachment two other classes, RichFeature
> and BioRange, but I would comment on that in another post.

Your BioRange and BioFeature classes seem somewhat similar to the current SeqFeature class with its locations (and sub features).

> I think that it is quite important to convert Seq and MutableSeq to newclasses,
> what do you think about that? With the new classes we can use properties.

I have been thinking about deprecating the Seq.data property (and also the MutableSeq).
The data string (or array) should really be a private implementation detail, perhaps Seq._data following the underscore for private convention. We can then add property methods to make the Seq.data available (perhaps with a deprecation warning). Peter From bugzilla-daemon at portal.open-bio.org Thu Jun 5 09:36:18 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 5 Jun 2008 05:36:18 -0400 Subject: [Biopython-dev] [Bug 2507] Adding __getitem__ to SeqRecord for element access and slicing In-Reply-To: Message-ID: <200806050936.m559aINS001028@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2507 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-05 05:36 EST ------- Created an attachment (id=928) --> (http://bugzilla.open-bio.org/attachment.cgi?id=928&action=view) Patch to Bio/SeqRecord.py adding __getitem__ and __len__ and __iter__ Patch based on my comment 1, with addition of __len__ allowing len(my_record) rather than len(my_record.seq) and an explicit __iter__ method (although this is not required, it lets us give a doc string). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
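A minimal sketch of the property idea discussed in this thread, keeping the sequence in a private `_data` attribute and exposing `data` only through a read-only property that warns. The class below is a toy stand-in, not the real Bio.Seq.Seq, and the warning text is an assumption made for illustration.

```python
import warnings


class Seq(object):
    """Toy stand-in for Bio.Seq.Seq (hypothetical, for illustration only)."""

    def __init__(self, data):
        # Private storage, following the leading-underscore convention.
        self._data = data

    def __str__(self):
        return self._data

    def _get_data(self):
        # Warn callers that still reach for .data, then hand back the string.
        warnings.warn("Seq.data is deprecated; use str(my_seq) instead.",
                      DeprecationWarning, stacklevel=2)
        return self._data

    # No setter is supplied, so assigning to .data raises AttributeError.
    data = property(_get_data, doc="Sequence as a string (DEPRECATED).")
```

With this layout `str(my_seq)` stays silent, reading `my_seq.data` still works but emits a DeprecationWarning, and `my_seq.data = "..."` fails, which matches the read-only intent described for the Seq object.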
From bugzilla-daemon at portal.open-bio.org Thu Jun 5 10:18:11 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 5 Jun 2008 06:18:11 -0400
Subject: [Biopython-dev] [Bug 2509] New: Deprecating the .data property of the Seq and MutableSeq objects
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2509

Summary: Deprecating the .data property of the Seq and MutableSeq objects
Product: Biopython
Version: Not Applicable
Platform: All
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk
OtherBugsDependingOnThis: 2351

In anticipation that the Seq and MutableSeq objects will eventually subclass the Python string, their data property is unnecessary and confusing. The following patch will replace it with new-style class property methods and a docstring declaring it to be deprecated.

In the case of the Seq object, the sequence should be read only, but the user can currently modify the data property in place. In the case of the MutableSeq, the fact that it is internally an array of characters should be a private implementation detail.

From bugzilla-daemon at portal.open-bio.org Thu Jun 5 10:18:14 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 5 Jun 2008 06:18:14 -0400
Subject: [Biopython-dev] [Bug 2351] Make Seq more like a string, even subclass string?
In-Reply-To: Message-ID: <200806051018.m55AIE7S003198@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2351 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- BugsThisDependsOn| |2509 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jun 5 10:47:43 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 5 Jun 2008 06:47:43 -0400 Subject: [Biopython-dev] [Bug 2507] Adding __getitem__ to SeqRecord for element access and slicing In-Reply-To: Message-ID: <200806051047.m55AlhBe004755@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2507 ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-05 06:47 EST ------- Note that adding __len__ has a knock on effect when dealing with SeqRecord objects with a zero length sequence - they now evaluate to False rather than True. This was an issue for some of the unit tests where "if record" was used rather than the more explicit "if record is not None". This change could therefore have unexpected side effects in existing scripts, however adding __len__ is required if we intend to make the SeqRecord act more like the Seq object. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
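The truthiness change described in comment #6 can be reproduced with a toy class. This is only a sketch: `Record` and `RecordWithLen` are hypothetical stand-ins for SeqRecord before and after the patch, and `count_records` is an illustrative helper, not Biopython code.

```python
class Record(object):
    """Hypothetical record without __len__ (the old behaviour)."""

    def __init__(self, seq):
        self.seq = seq


class RecordWithLen(Record):
    """Same record, but with __len__ delegating to the sequence."""

    def __len__(self):
        return len(self.seq)


empty_plain = Record("")
empty_sized = RecordWithLen("")

print(bool(empty_plain))  # True: no __len__, so every instance is truthy
print(bool(empty_sized))  # False: Python falls back on __len__() == 0


def count_records(records):
    """Count records, skipping only genuine None placeholders."""
    # "if record is not None" keeps zero-length records; "if record" would not.
    return sum(1 for record in records if record is not None)
```

Code that tests `if record:` silently drops the empty record once `__len__` exists, whereas the explicit `is not None` test behaves the same before and after the change.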
From bugzilla-daemon at portal.open-bio.org Thu Jun 5 11:03:27 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 5 Jun 2008 07:03:27 -0400
Subject: [Biopython-dev] [Bug 2508] NCBIStandalone.blastall: provide support for '-F F' and make it safe
In-Reply-To:
Message-ID: <200806051103.m55B3RUU005472@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2508

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-05 07:03 EST -------
You seem to have identified two issues. Adding support for -F should be fairly easy.

For the security issue, the caller should be validating their input. Also, if running from a web-server, the permissions should be restricted - failing to do this is asking for trouble. However, defence in layers would be good.

Would you suggest a simple check for the ";" character? What about escaped semi-colons? Also, this is a platform dependent issue. The ";" character is Unix only. At the Windows command line you have to use an &&. Do you have a patch in mind?

From bugzilla-daemon at portal.open-bio.org Thu Jun 5 12:56:21 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 5 Jun 2008 08:56:21 -0400
Subject: [Biopython-dev] [Bug 2508] NCBIStandalone.blastall: provide support for '-F F' and make it safe
In-Reply-To:
Message-ID: <200806051256.m55CuLfC010670@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2508

------- Comment #2 from mmokrejs at ribosome.natur.cuni.cz 2008-06-05 08:56 EST -------
For the latter issue, I would go and use some Python library to escape shell metacharacters. cgi.escape() doesn't do what I would like it to. Or cgi.wrap()?
Google search returned some hints: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/498202 http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/66012 http://e-articles.info/e/a/title/Command-Injection/ https://bugs.gentoo.org/show_bug.cgi?id=187971#c5 https://bugs.gentoo.org/show_bug.cgi?id=187971#c23 http://mail.python.org/pipermail/python-3000/2007-May/007192.html http://www.owasp.org/index.php/Interpreter_Injection http://www.velocityreviews.com/forums/t352309-sql-escaping-module.html One could learn or even use escaping functions from e.g. MySQLdb.escape() of MySQLdb.connection.escape_string() but I don't think it is a complete solution. I will try to think of it more later. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jun 5 13:25:43 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 5 Jun 2008 09:25:43 -0400 Subject: [Biopython-dev] [Bug 2494] _retrieve_taxon in BioSQL.py needs urgent optimization In-Reply-To: Message-ID: <200806051325.m55DPhrQ012033@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2494 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-05 09:25 EST ------- I've commited this patch to CVS as part of BioSQL/BioSeq.py revision 1.24 If you could update you installation of Biopython to CVS and test this please Eric, then I think we can mark this bug as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
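Returning to the shell-metacharacter question raised for bug 2508 above, one common way to sidestep escaping is to never build a shell string at all. The sketch below is an assumption about how that could look, not the fix adopted in NCBIStandalone: `build_blast_command` is a hypothetical helper, and `shlex.quote` comes from modern Python 3, which postdates this thread.

```python
import shlex


def build_blast_command(blast_exe, program, database, infile):
    """Return an argument list (hypothetical helper, not Biopython API).

    When a list like this is handed to subprocess without shell=True,
    no shell parses it, so ';' or '&&' inside a value stay literal text.
    """
    return [blast_exe, "-p", program, "-d", database, "-i", infile]


# A hostile "filename" remains a single, inert argument in the list.
args = build_blast_command("blastall", "blastp", "nr", "query.fasta; rm -rf ~")

# If a command string is unavoidable, quote every piece (POSIX shells only).
command_line = " ".join(shlex.quote(arg) for arg in args)
print(command_line)
```

The list form works the same way on Unix and Windows, which avoids the platform-specific separator question; the quoted string form is only meaningful for POSIX-style shells.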
From bugzilla-daemon at portal.open-bio.org Thu Jun 5 13:29:25 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 5 Jun 2008 09:29:25 -0400
Subject: [Biopython-dev] [Bug 2509] Deprecating the .data property of the Seq and MutableSeq objects
In-Reply-To:
Message-ID: <200806051329.m55DTP30012244@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2509

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-05 09:29 EST -------
Created an attachment (id=929) --> (http://bugzilla.open-bio.org/attachment.cgi?id=929&action=view)
Patch to Bio/Seq.py

This turns out to be quite a big change, and while the unit tests still pass, more extensive testing would be a good idea.

Alternatively, we could just leave .data exposed as a read only property, and switch to ._data (or a string subclass) instead.

From bugzilla-daemon at portal.open-bio.org Thu Jun 5 17:55:02 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 5 Jun 2008 13:55:02 -0400
Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects
In-Reply-To:
Message-ID: <200806051755.m55Ht2TS024644@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2454

------- Comment #7 from cracka80 at gmail.com 2008-06-05 13:55 EST -------
I understand your approach that these functions should be converted to modern Python, but it must also be remembered that Biopython as a whole is Python 2.3-compatible, so care must be taken not to modernise code too much. I can't remember when iterators were phased in, but it should be possible, I think it was around 2.2 anyway.
(In reply to comment #6) > Two things: > 1) Some of the code that does type checking for file-like-ness seems to be > quite old and possibly outdated (e.g. Gobase.Iterator). We should take this > opportunity to go through these modules and check if they are still useful. > 2) Many of these modules (especially the ones that use an "Iterator" class) > would be written differently in modern Python (in particular by making use of a > generator function instead of an Iterator class). > > So I'd like to suggest the following: > -) For the modules whose usability is dubious in 2008, let's check on the > mailing list if anybody is still using them. If not, we can simply deprecate > them. > -) For the modules that are still useful, use try/except clauses to check for > the necessary attributes. The current function checks for 'read', 'readline', > 'readlines', and '__iter__', whereas the parser probably only needs one of > them. > -) If possible, I'd prefer to convert to modern Python as much as possible > (though formally that is not within the scope of this bug report). > -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Jun 7 08:26:54 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 7 Jun 2008 04:26:54 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806070826.m578Qsj4019312@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #8 from mdehoon at ims.u-tokyo.ac.jp 2008-06-07 04:26 EST ------- (In reply to comment #7) > I understand your approach that these functions should be converted to modern > Python, but it must also be remembered that Biopython as a whole is Python > 2.3-compatible, so care must be taken not to modernise code too much. 
I can't > remember when iterators were phased in, but it should be possible, I think it > was around 2.2 anyway. > Bio.Blast.NCBIXML already uses generator functions to return iterators, so I think we are fine as far as compatibility with Python 2.3 and later is concerned. I'll ask on the mailing list if Bio.Gobase has any users, to get started. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mjldehoon at yahoo.com Sat Jun 7 08:35:05 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 7 Jun 2008 01:35:05 -0700 (PDT) Subject: [Biopython-dev] Bio.Gobase, anybody? Message-ID: <844450.31822.qm@web62415.mail.re1.yahoo.com> Hi everbody, As part of bug report 2454: http://bugzilla.open-bio.org/show_bug.cgi?id=2454, I started looking at the Bio.Gobase module. This module provides access to the gobase database: http://megasun.bch.umontreal.ca/gobase/ This module is about seven years old and (AFAICT) is not actively maintained. We don't have documentation for this module, but the unit tests suggests that it parses HTML files from gobase. I am not sure exactly where the HTML files came from, but I doubt that after seven years this still works. So I was wondering: Does anybody use Bio.Gobase? If not, I suggest we deprecate it for the next release, and remove it in some future release. If there are users, we need to make some (small) changes to this module (that is what the original bug report was about). --Michiel. 
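The generator-function style mentioned in the bug 2454 discussion above can be sketched as follows. The record format here (blocks terminated by a "//" line) and the name `parse_records` are assumptions chosen for illustration; this is not the code of any Biopython parser.

```python
from io import StringIO


def parse_records(handle):
    """Yield one record (a list of lines) per '//'-terminated block.

    Only iteration over the handle is used, so real files, StringIO
    objects and plain lists of lines are all accepted without any
    explicit check for read/readline/readlines.
    """
    record = []
    for line in handle:
        line = line.rstrip("\n")
        if line == "//":
            yield record
            record = []
        else:
            record.append(line)
    if record:
        # Tolerate a final block that lacks the terminator.
        yield record


for rec in parse_records(StringIO("ID A\nSQ x\n//\nID B\n//\n")):
    print(rec)
```

Because the function relies on nothing but the iterator protocol, the "is this file-like?" type check that prompted the bug report is no longer needed.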
From bugzilla-daemon at portal.open-bio.org Mon Jun 9 12:45:24 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 9 Jun 2008 08:45:24 -0400 Subject: [Biopython-dev] [Bug 2511] New: setup.py problem with del sys.modules["Martel"] Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2511 Summary: setup.py problem with del sys.modules["Martel"] Product: Biopython Version: Not Applicable Platform: PC OS/Version: Mac OS Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk I'm currently trying to install Biopython from source (CVS) on a clean Mac OS X machine, without reportlab, Numeric or mxTextTools. I've run into a small issue with "python setup.py build" related to the testing for an existing Martel distribution (since Martel has been distributed separately from Biopython before) due to the lack of mxTextTools. Traceback (most recent call last): File "setup.py", line 508, in 'Bio.PopGen': ['SimCoal/data/*.par'], File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/distutils/core.py", line 151, in setup dist.run_commands() File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/distutils/dist.py", line 974, in run_commands self.run_command(cmd) File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/distutils/dist.py", line 994, in run_command cmd_obj.run() File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/distutils/command/build.py", line 112, in run self.run_command(cmd_name) File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/distutils/cmd.py", line 333, in run_command self.distribution.run_command(command) File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/distutils/dist.py", line 994, in run_command cmd_obj.run() File "setup.py", line 157, in run if not is_Martel_installed(): File "setup.py", 
line 292, in is_Martel_installed del sys.modules["Martel"] # Delete the old version of Martel. The function is_Martel_installed() starts by trying to load the bundled Martel, by calling can_import("Martel"). This is failing with an ImportError from mxTextTools - and hence the Martel version of the bundled copy cannot be determined. The next line of is_Martel_installed() causes the problem: del sys.modules["Martel"] I think this only makes sense if the module could be imported, patch to follow. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 9 12:46:51 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 9 Jun 2008 08:46:51 -0400 Subject: [Biopython-dev] [Bug 2511] setup.py problem with del sys.modules["Martel"] In-Reply-To: Message-ID: <200806091246.m59Ckpts011798@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2511 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-09 08:46 EST ------- Created an attachment (id=930) --> (http://bugzilla.open-bio.org/attachment.cgi?id=930&action=view) Patch to setup.py How does this look Michiel? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Tue Jun 10 11:37:42 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 10 Jun 2008 12:37:42 +0100 Subject: [Biopython-dev] Giving the SeqRecord a length? Evaluating it as a boolean Message-ID: <320fb6e00806100437n21e53369p36c85a810007ca19@mail.gmail.com> Something we've discussed before is making the SeqRecord more like a Seq object, perhaps even subclassing it. 
I've got a patch on Bug 2507 to make some small steps in this direction - accessing elements of the sequence by indexing the SeqRecord, i.e. letter = my_seq_record[5], or iterating over the letters in a SeqRecord's sequence. http://bugzilla.open-bio.org/show_bug.cgi?id=2507 In addition, I would like to give the SeqRecord a length, allowing len(my_seq_record) rather than len(my_seq_record.seq). However, this has a side effect on the evaluation of a SeqRecord as a boolean. Before, any sequence was True, but if we add the __len__ method then any SeqRecord with a zero length sequence will evaluate as False. This is a real issue, for example you can have GenBank files without a sequence (see our unit test cases). One example where this is important is if you are using an iterator via the .next() method and had been checking for a returned None by using "if record:" (something some of the older unit tests were doing) you would have to start using "if record is not None:" instead. If the old behaviour is desirable (evaluating a SeqRecord as a boolean is alway True), we could implement a __nonzero__ method to preserve it, see: http://docs.python.org/ref/customization.html What do people think? Would adding a __len__ method to the SeqRecord cause trouble? Peter From mjldehoon at yahoo.com Tue Jun 10 23:17:56 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 10 Jun 2008 16:17:56 -0700 (PDT) Subject: [Biopython-dev] Giving the SeqRecord a length? Evaluating it as a boolean In-Reply-To: <320fb6e00806100437n21e53369p36c85a810007ca19@mail.gmail.com> Message-ID: <797428.30617.qm@web62402.mail.re1.yahoo.com> +1 for adding a __len__ method, with a __nonzero__ method to let all SeqRecord objects evaluate as true. --Michiel. Peter wrote: Something we've discussed before is making the SeqRecord more like a Seq object, perhaps even subclassing it. 
I've got a patch on Bug 2507 to make some small steps in this direction - accessing elements of the sequence by indexing the SeqRecord, i.e. letter = my_seq_record[5], or iterating over the letters in a SeqRecord's sequence. http://bugzilla.open-bio.org/show_bug.cgi?id=2507 In addition, I would like to give the SeqRecord a length, allowing len(my_seq_record) rather than len(my_seq_record.seq). However, this has a side effect on the evaluation of a SeqRecord as a boolean. Before, any sequence was True, but if we add the __len__ method then any SeqRecord with a zero length sequence will evaluate as False. This is a real issue, for example you can have GenBank files without a sequence (see our unit test cases). One example where this is important is if you are using an iterator via the .next() method and had been checking for a returned None by using "if record:" (something some of the older unit tests were doing) you would have to start using "if record is not None:" instead. If the old behaviour is desirable (evaluating a SeqRecord as a boolean is alway True), we could implement a __nonzero__ method to preserve it, see: http://docs.python.org/ref/customization.html What do people think? Would adding a __len__ method to the SeqRecord cause trouble? 
Peter

From bugzilla-daemon at portal.open-bio.org Tue Jun 10 23:30:20 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 10 Jun 2008 19:30:20 -0400
Subject: [Biopython-dev] [Bug 2511] setup.py problem with del sys.modules["Martel"]
In-Reply-To:
Message-ID: <200806102330.m5ANUKfo019481@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2511

------- Comment #2 from mdehoon at ims.u-tokyo.ac.jp 2008-06-10 19:30 EST -------
(In reply to comment #1)
> Created an attachment (id=930) --> (http://bugzilla.open-bio.org/attachment.cgi?id=930&action=view) [details]
> Patch to setup.py
>
> How does this look Michiel?
>

That looks fine to me, though eventually I would prefer to get rid of the dependence on Martel/mxTextTools altogether.

From bugzilla-daemon at portal.open-bio.org Tue Jun 10 23:42:52 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 10 Jun 2008 19:42:52 -0400
Subject: [Biopython-dev] [Bug 2511] setup.py problem with del sys.modules["Martel"]
In-Reply-To:
Message-ID: <200806102342.m5ANgqct019925@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2511

------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-10 19:42 EST -------
In reply to comment 2, would it make sense for the unit test framework to treat the mxTextTools (or reportlab, or Numeric) import errors as a missing external dependency?

In the unit tests we used to "ignore" any tests which failed with an ImportError, but have now switched to our own MissingExternalDependencyError exception.
We want to distinguish ImportErrors which are external to Biopython (and therefore can be considered as missing dependencies) from those internal to Biopython (perhaps due to refactoring or removal of code - a real unit test failure). One way to do this would be in the bits of Biopython that try to import mxTextTools (or any other module) to raise MissingExternalDependencyError (or something that is a subclass of both MissingExternalDependencyError and the built in ImportError). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jun 11 06:54:32 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 11 Jun 2008 02:54:32 -0400 Subject: [Biopython-dev] [Bug 2516] New: Make it clear what is numeric and what is numpy Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2516 Summary: Make it clear what is numeric and what is numpy Product: Biopython Version: 1.45 Platform: PC URL: http://www.biopython.org/DIST/docs/install/Installation. html OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Documentation AssignedTo: biopython-dev at biopython.org ReportedBy: mmokrejs at ribosome.natur.cuni.cz Hi, although both packages are from the same source site, numpy is the newer implementation whereas numeric is the old, deprecated implementation, right? Why do you say in the installation docs the following? "The Numerical Python distribution (also known an Numeric or Numpy) is a fast implementation of arrays and associated array functionality. This is important for a number of Biopython modules that deal with number processing. The main web site for Numeric is: http://sourceforge.net/projects/numpy and downloads are available from:..." I think it is fooling. BTW, is numpy-1.1.0 supported? 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jun 11 08:47:32 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 11 Jun 2008 04:47:32 -0400 Subject: [Biopython-dev] [Bug 2511] setup.py problem with del sys.modules["Martel"] In-Reply-To: Message-ID: <200806110847.m5B8lWxd010254@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2511 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-11 04:47 EST ------- Patch checked into CVS as Biopython/setup.py revision 1.133, marking this bug as fixed. The issue I raised in comment 3 is still outstanding (external ImportErrors and the unit tests). We may want to file a separate bug, or discuss this on the dev mailing list. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
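The outstanding idea from comment 3 (an exception that is both a missing-dependency error and an ImportError) might look roughly like this. It is a sketch only: the `MissingExternalDependencyError` class here is a local stand-in for the one in the test framework, and `MissingExternalImportError` and `import_external` are hypothetical names.

```python
import importlib


class MissingExternalDependencyError(Exception):
    """Local stand-in for the exception used by the unit test framework."""


class MissingExternalImportError(MissingExternalDependencyError, ImportError):
    """Hypothetical: caught by 'except ImportError' and by the test runner."""


def import_external(module_name):
    """Import a third-party module, re-labelling failure as a missing dependency."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise MissingExternalImportError(
            "Missing external dependency %r: %s" % (module_name, err))
```

Existing callers that catch ImportError keep working, while the test framework can tell an absent third-party package apart from an import failure inside Biopython itself, which would still surface as a plain ImportError.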
From bugzilla-daemon at portal.open-bio.org Wed Jun 11 08:53:30 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 11 Jun 2008 04:53:30 -0400
Subject: [Biopython-dev] [Bug 2516] Make it clear what is numeric and what is numpy
In-Reply-To:
Message-ID: <200806110853.m5B8rU2t010552@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2516

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-11 04:53 EST -------
That text is rather out of date - if you are familiar with the history of Numeric, numarray and numpy you'll know that the old module used with "import Numeric" was called Numerical Python or NumPy for short. This shorthand was used in lots of documentation (not just in Biopython). I think the choice to call the third generation of the array packages numpy has caused a lot of confusion. See http://numpy.scipy.org/#older_array

We had updated the Biopython website and other bits of documentation, but had missed this one. Thank you for pointing this out.

P.S. Supporting numpy instead of Numeric is Biopython Bug 2251.
From bugzilla-daemon at portal.open-bio.org Wed Jun 11 09:04:47 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 11 Jun 2008 05:04:47 -0400
Subject: [Biopython-dev] [Bug 2507] Adding __getitem__ to SeqRecord for element access and slicing
In-Reply-To:
Message-ID: <200806110904.m5B94li8011303@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2507

------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-11 05:04 EST -------
I raised the issue of evaluating a SeqRecord as a boolean with a proposal that we could add __len__ but also add __nonzero__ to ensure that any SeqRecord evaluates as True (even if the sequence is of length zero):
http://lists.open-bio.org/pipermail/biopython-dev/2008-June/003756.html

Michiel was in favour of this:
> +1 for adding a __len__ method, with a __nonzero__ method to let all SeqRecord
> objects evaluate as true.

The patch isn't ready yet because, in addition, it doesn't yet deal with the SeqFeature objects. I think the SeqFeature class needs a _shift(offset) method to return a copy of itself with its location (and the locations of any sub-features) adjusted. I'm still not sure about handling strides, and I am tempted to rule that if a stride other than one is used then the features of the SeqRecord are lost.
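The agreed combination of `__len__` with `__nonzero__` can be illustrated with a toy class. `AlwaysTrueRecord` is a hypothetical stand-in for SeqRecord; the `__bool__` alias is an addition for readers on Python 3, where `__nonzero__` was renamed, and is not part of the 2008 proposal.

```python
class AlwaysTrueRecord(object):
    """Hypothetical record: has a length, yet is always truthy."""

    def __init__(self, seq):
        self.seq = seq

    def __len__(self):
        # len(record) now works and reports the sequence length.
        return len(self.seq)

    def __nonzero__(self):
        # Python 2 truth hook: takes priority over __len__.
        return True

    # Python 3 looks for __bool__ instead of __nonzero__.
    __bool__ = __nonzero__


record = AlwaysTrueRecord("")
print(len(record))   # 0
print(bool(record))  # True
```

Older scripts that test `if record:` therefore keep their behaviour even for records with an empty sequence.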
From bugzilla-daemon at portal.open-bio.org Wed Jun 11 13:57:56 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 11 Jun 2008 09:57:56 -0400
Subject: [Biopython-dev] [Bug 2507] Adding __getitem__ to SeqRecord for element access and slicing
In-Reply-To:
Message-ID: <200806111357.m5BDvu1I024400@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2507

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #928 is|0                           |1
           obsolete|                            |

------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-11 09:57 EST -------
Created an attachment (id=937) --> (http://bugzilla.open-bio.org/attachment.cgi?id=937&action=view)
Patch to Bio/SeqRecord.py and Bio/SeqFeature.py

This modifies the SeqRecord to give it __getitem__ (supporting sliced annotations including features), __len__ (to return the length of the sequence), __nonzero__ (to ensure any SeqRecord evaluates as True regardless of the length of its sequence) and __iter__ (to explicitly support iteration over the sequence with a docstring).

As part of this, assorted objects in SeqFeature.py get a private _shift() method taking an integer offset to return a copy of itself with an adjusted location.

Note that slices with a stride (other than one) will result in the features being lost. Handling (positive) strides would require complicated consideration about whether an exact location is still present, and if not, replacing it with either a fuzzy position or a range. Negative strides are worse!

The current set of unit tests seem fine, but additional checks would need to be added to validate this new behaviour.
From bugzilla-daemon at portal.open-bio.org Wed Jun 11 15:26:59 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 11 Jun 2008 11:26:59 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806111526.m5BFQxMw029057@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #9 from mdehoon at ims.u-tokyo.ac.jp 2008-06-11 11:26 EST ------- I "fixed" SwissProt.SProt.Iterator by deprecating it. Instead of SwissProt.SProt.Iterator, we recommend using Bio.SwissProt.parse and Bio.SeqIO.parse. Next on the to-do list is SwissProt.KeyWList.extract_keywords. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jun 12 14:23:16 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 12 Jun 2008 10:23:16 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806121423.m5CENG95026678@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #10 from mdehoon at ims.u-tokyo.ac.jp 2008-06-12 10:23 EST ------- SwissProt.KeyWList.extract_keywords could only parse very old SwissProt files. I deprecated it and wrote a new function "parse" that parses current SwissProt files. This function does not do the file-like check. Prosite.Iterator and Prosite.Prodoc.Iterator are next. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From fkauff at biologie.uni-kl.de Thu Jun 12 14:33:56 2008 From: fkauff at biologie.uni-kl.de (Frank Kauff) Date: Thu, 12 Jun 2008 16:33:56 +0200 Subject: [Biopython-dev] CVS access and developers web site In-Reply-To: <320fb6e00805291446x1cebf67bpe3e0818af5b9a7c5@mail.gmail.com> References: <483E7578.50402@biologie.uni-kl.de> <320fb6e00805291446x1cebf67bpe3e0818af5b9a7c5@mail.gmail.com> Message-ID: <485133D4.2060405@biologie.uni-kl.de> Peter Cock wrote: > Hi Frank, > > I would try emailing support at helpdesk.open-bio.org using the email > address associated with your CVS username. If you've changed email > address, and you run into problems, I expect Michiel or I could vouch > for you. > Is somebody monitoring that email address? I got an automated response about two weeks ago, and then nothing happened. > For the website, the wiki usernames are entirely separate and you > should be able to create a new account if you don't have one already. > If you want to update the tutorial new HTML and PDF files are loaded > with each release from the version in CVS. > Thanks Peter, got access to the wiki and updated personal data. Frank > Peter > > On Thu, May 29, 2008 at 10:20 AM, Frank Kauff wrote: > >> Hi folks, >> >> although I've been quiet for a while, I'm still doing some changes to the >> Nexus parser of biopython from time to time.... I totally lost my passwords >> to access the repository. Could someone please send me a new password to get >> write access to cvs? And I would also like to change the information on the >> biopython developers web site, as they are somewhat outdated. >> And is this the right place to ask for such things? >> >> Thanks! 
>> >> Frank >> > > From bugzilla-daemon at portal.open-bio.org Thu Jun 12 15:42:58 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 12 Jun 2008 11:42:58 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806121542.m5CFgw9t029594@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #11 from cracka80 at gmail.com 2008-06-12 11:42 EST ------- Maybe it's a good idea for any parsers/iterators to just use the iterator-like ability of file handles? Writers would have to function slightly differently, but since file objects, StringIOs and any other file-like objects must provide an __iter__ method, it's probably a good idea to take that into consideration when developing a common interface. In addition, writers could output iterators or generators, so that they can be chained together to operate on files. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jun 13 16:24:29 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 13 Jun 2008 12:24:29 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806131624.m5DGOTKw025954@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #12 from mdehoon at ims.u-tokyo.ac.jp 2008-06-13 12:24 EST ------- (In reply to comment #11) > Maybe it's a good idea for any parsers/iterators to just use the iterator-like > ability of file handles? In principle, yes. In practice, it's not so easy because many parsers in Biopython follow the framework in Bio.ParserSupport. These parsers are not really written to deal with lines pulled one-by-one from a file handle. 
To reconcile these two, I pull out data line-by-line from the file handle, store it in a string, and then call the parser to parse it. This is not ideal, and it may be a good idea for Biopython at some point to change its parser strategy. > Writers would have to function slightly differently, > but since file objects, StringIOs and any other file-like objects must provide > an __iter__ method, it's probably a good idea to take that into consideration > when developing a common interface. In addition, writers could output > iterators or generators, so that they can be chained together to operate > on files. > Writers should also be able to just print the record to the screen. I don't see how that is easily achievable with generators. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jun 13 16:27:47 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 13 Jun 2008 12:27:47 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806131627.m5DGRlTE026072@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #13 from mdehoon at ims.u-tokyo.ac.jp 2008-06-13 12:27 EST ------- Medline.Iterator, Prosite.Iterator, and Prosite.Prodoc.Iterator are now fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
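[Editorial sketch, not code from the Biopython repository.] The approach described in comment #12, pulling lines one at a time from any iterable handle, collecting them into a string, and handing that string to a whole-record parser, might look roughly like this. The "//" record terminator, the function names, and the toy parse_record() are all assumptions made for illustration; the real parsers built on Bio.ParserSupport are more involved.

```python
import io


def iter_records(handle, terminator="//"):
    """Yield each record as one string, reading the handle line by line."""
    lines = []
    for line in handle:  # works for files, StringIO, or any iterable of lines
        lines.append(line)
        if line.rstrip() == terminator:
            yield "".join(lines)
            lines = []
    if lines:
        # Trailing data without a terminator is still returned.
        yield "".join(lines)


def parse_record(text):
    """Hypothetical whole-record parser: maps line codes to lists of values."""
    record = {}
    for line in text.splitlines():
        if line.strip() and line.rstrip() != "//":
            key, _, value = line.partition("   ")
            record.setdefault(key, []).append(value.strip())
    return record


def parse(handle):
    """Chain the two steps: one parsed record per terminated block."""
    for text in iter_records(handle):
        yield parse_record(text)


# Usage with an in-memory handle instead of a real file.
handle = io.StringIO("ID   A1\nDE   first\n//\nID   B2\n//\n")
records = list(parse(handle))
```

Because parse() is a generator, only one record's text is held in memory at a time, and nothing in it checks the type of the handle beyond iterating over it.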
From bugzilla-daemon at portal.open-bio.org Sat Jun 14 02:29:13 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 13 Jun 2008 22:29:13 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806140229.m5E2TDdD014417@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #14 from mdehoon at ims.u-tokyo.ac.jp 2008-06-13 22:29 EST ------- I deprecated Bio.Gobase, since no users came forward on the mailing list. Bio.Rebase is also problematic. It parses HTML from the Rebase database, but it was written in 2000 and cannot parse current HTML from Rebase (which looks completely different from the HTML used in 2000). I'll ask on the mailing list if anybody is willing to update Bio.Rebase. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mjldehoon at yahoo.com Sat Jun 14 02:34:05 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 13 Jun 2008 19:34:05 -0700 (PDT) Subject: [Biopython-dev] Bio.Rebase Message-ID: <237761.5963.qm@web62409.mail.re1.yahoo.com> Hi everybody, As part of bug #2454 on Bugzilla, I am looking at the Bio.Rebase module. This module parses files (in HTML format) from the Rebase database: http://rebase.neb.com/rebase/rebase.html Unfortunately, since this module was written (in 2000) the HTML format used by the Rebase database has changed completely. This module is therefore not able to parse current Rebase HTML files. Is anybody willing to update Bio.Rebase (either by updating the HTML parser, or preferably by writing a parser for plain-text output from Bio.Rebase)? If not, I think this module should be deprecated. --Michiel. 
From bugzilla-daemon at portal.open-bio.org Sat Jun 14 02:50:42 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 13 Jun 2008 22:50:42 -0400 Subject: [Biopython-dev] [Bug 2516] Make it clear what is numeric and what is numpy In-Reply-To: Message-ID: <200806140250.m5E2ogvf014920@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2516 ------- Comment #2 from mdehoon at ims.u-tokyo.ac.jp 2008-06-13 22:50 EST ------- According to the Numerical Python website, the NumPy documentation will become freely available on September 1, 2008. That would be a good time to start thinking seriously about converting from the "old" Numerical Python to the "new" NumPy 1.1. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mjldehoon at yahoo.com Sat Jun 14 02:46:37 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 13 Jun 2008 19:46:37 -0700 (PDT) Subject: [Biopython-dev] Bio.SCOP maintainer? Message-ID: <523172.98428.qm@web62402.mail.re1.yahoo.com> Still looking at Bug 2454 (http://bugzilla.open-bio.org/show_bug.cgi?id=2454). To fix this bug, I'd like to make some changes to Bio.SCOP. Is anybody currently maintaining Bio.SCOP? The changes I'd like to make are small, but it would be better to discuss with the Bio.SCOP maintainer (if there is one) so I won't get in their way. --Michiel. 
From bugzilla-daemon at portal.open-bio.org Sat Jun 14 09:52:09 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 14 Jun 2008 05:52:09 -0400 Subject: [Biopython-dev] [Bug 2488] Adding XML parsers to Bio.Entrez In-Reply-To: Message-ID: <200806140952.m5E9q9X9032018@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2488 mdehoon at ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #8 from mdehoon at ims.u-tokyo.ac.jp 2008-06-14 05:52 EST ------- We now have parsers for XML returned by Entrez, provided the corresponding DTDs are available. Bio/Entrez/DTDs contains most (all?) DTDs currently used by Entrez. If later some DTDs appear to be missing, we can simply add them to Bio/Entrez/DTDs. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Jun 14 10:29:12 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 14 Jun 2008 06:29:12 -0400 Subject: [Biopython-dev] [Bug 2516] Make it clear what is numeric and what is numpy In-Reply-To: Message-ID: <200806141029.m5EATC64001227@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2516 mdehoon at ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #3 from mdehoon at ims.u-tokyo.ac.jp 2008-06-14 06:29 EST ------- Updated the installation instructions (in CVS, at least). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From p.j.a.cock at googlemail.com Sat Jun 14 22:51:26 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 14 Jun 2008 23:51:26 +0100 Subject: [Biopython-dev] CVS access and developers web site In-Reply-To: <485133D4.2060405@biologie.uni-kl.de> References: <483E7578.50402@biologie.uni-kl.de> <320fb6e00805291446x1cebf67bpe3e0818af5b9a7c5@mail.gmail.com> <485133D4.2060405@biologie.uni-kl.de> Message-ID: <320fb6e00806141551t56422a98v752e34bbbb38d0aa@mail.gmail.com> >> Hi Frank, >> >> I would try emailing support at helpdesk.open-bio.org using the email >> address associated with your CVS username. If you've changed email >> address, and you run into problems, I expect Michiel or I could vouch >> for you. >> > > Is somebody monitoring that email address? I got an automated response about > two weeks ago, and then nothing happened. > Maybe someone is on holiday - or they are caught up with BOSC 2008 work? I can suggest a few specific people at OBF to try and contact directly if you are still stuck. In the short term, if there are any urgent fixes you think need to be checked in, stick them on Bugzilla and I'm sure one of us will be able to commit them on your behalf. Peter From bugzilla-daemon at portal.open-bio.org Sun Jun 15 07:03:18 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 15 Jun 2008 03:03:18 -0400 Subject: [Biopython-dev] [Bug 2468] Tutorial needs a fix: Bio.WWW.NCBI In-Reply-To: Message-ID: <200806150703.m5F73IF2007099@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2468 mdehoon at ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #6 from mdehoon at ims.u-tokyo.ac.jp 2008-06-15 03:03 EST ------- I created a subsection Examples to the tutorial chapter on Bio.Entrez, and added the example from section 2.5 and Martin's taxonomy example to it. 
With the Bio.Entrez currently in CVS, finding the lineage works as follows:

>>> handle = Entrez.esearch(db="Taxonomy", term="Cypripedioideae")
>>> record = Entrez.read(handle)
>>> record["IdList"]
['158330']
>>> handle = Entrez.efetch(db="Taxonomy", id="158330", retmode='xml')
>>> records = Entrez.read(handle)
>>> records[0]['Lineage']
'cellular organisms; Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta; Liliopsida; Asparagales; Orchidaceae'

-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 16 19:23:43 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 16 Jun 2008 15:23:43 -0400 Subject: [Biopython-dev] [Bug 2507] Adding __getitem__ to SeqRecord for element access and slicing In-Reply-To: Message-ID: <200806161923.m5GJNhZw012022@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2507 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #937 is|0 |1 obsolete| | ------- Comment #9 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-16 15:23 EST ------- Created an attachment (id=942) --> (http://bugzilla.open-bio.org/attachment.cgi?id=942&action=view) Patch to Bio/SeqRecord.py and Bio/SeqFeature.py I've checked in the SeqRecord __len__ and __nonzero__ methods with CVS Bio/SeqRecord.py revision 1.17 The earlier __getitem__ and __iter__ patch has been updated accordingly. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon Jun 16 20:08:00 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 16 Jun 2008 16:08:00 -0400 Subject: [Biopython-dev] [Bug 1944] Align.Generic adding iterator and more In-Reply-To: Message-ID: <200806162008.m5GK80bv014002@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1944 ------- Comment #15 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-16 16:07 EST ------- Created an attachment (id=943) --> (http://bugzilla.open-bio.org/attachment.cgi?id=943&action=view) Minimal __getitem__ method for generic alignment This patch just adds a __getitem__ to the alignment which ONLY accepts a single integer index and returns the corresponding SeqRecord object. I propose to add this NOW, as I think even just this is a worthwhile improvement. This is a natural expectation given the current __iter__ behaviour and the model of the alignment as a list of SeqRecord objects. It's also part of the richer behaviour discussed above, which we can add more easily if/when the SeqRecord gets a __getitem__ method (bug 2507). Comments on this particular patch? Should we add __len__ at the same time, giving the number of rows in the alignment? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From jblanca at btc.upv.es Tue Jun 17 07:35:38 2008 From: jblanca at btc.upv.es (Jose Blanca) Date: Tue, 17 Jun 2008 09:35:38 +0200 Subject: [Biopython-dev] [BioPython] Ace contig files in Bio.SeqIO or Bio.AlignIO In-Reply-To: <320fb6e00806160701l428584c0i30acac57338b9357@mail.gmail.com> References: <320fb6e00806160701l428584c0i30acac57338b9357@mail.gmail.com> Message-ID: <200806170935.38904.jblanca@btc.upv.es> Hi: My main use of the Alignment class is to parse Ace files. I've been thinking about that problem recently. 
My proposal to modify SeqRecord was due to this problem. I think that the best solution would be to treat the Alignment as a sequence. The consensus would be the actual sequence and the aligned reads would be features with per-base annotations. I've implemented such a class and it works fine for me. In fact the Alignment class is just a wrapper around a standard SeqRecord (I name it RichSeq in my implementation). To do that you just need a SeqRecord with a __getitem__ method. You are already proposing that, so that's not a problem. Padding with spaces is not an option when you're dealing with genome-wide alignments; that's one of the problems of the current Alignment class. If you want I can send my implementation to the list, although it could take a while because my home computer is dead. Best regards, Jose Blanca On Monday 16 June 2008 16:01:31 Peter wrote: > I've recently had to deal with some contig files in the Ace format > (output by CAP3, but many assembly programs will produce this output). > > We have a module for parsing Ace files in Biopython, > Bio.Sequencing.Ace but I was wondering about integrating this into the > Bio.SeqIO or Bio.AlignIO framework. > http://www.biopython.org/wiki/SeqIO > http://www.biopython.org/wiki/AlignIO > > I'd like to hear from anyone currently using Ace files, on how they > tend to treat the data - and if they think a SeqRecord or Alignment > based representation would be useful. > > Each contig in an Ace file could be treated as a SeqRecord using the > consensus sequence. The identifiers of each sub-sequence used to > build the consensus could be stored as database cross-references, or > perhaps we could store these as SeqFeatures describing which part of > the consensus they support. This would then fit into Bio.SeqIO quite > well. > > Alternatively, each contig could be treated as an alignment (with a > consensus) and integrated into Bio.AlignIO. 
One drawback for this is > doing this with the current generic alignment class would require > padding the start and/or end of each sequence with gaps in order to > make every sequence the same length. However, if we did this (or > created a more specialised alignment class), the Ace file format would > then fit into Bio.AlignIO too. > > So, Ace users - would either (or both) of the above approaches make > sense for how you use the Ace contig files? > > Thanks > > Peter > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From biopython at maubp.freeserve.co.uk Tue Jun 17 08:46:22 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 17 Jun 2008 09:46:22 +0100 Subject: [Biopython-dev] [BioPython] Ace contig files in Bio.SeqIO or Bio.AlignIO In-Reply-To: <200806170935.38904.jblanca@btc.upv.es> References: <320fb6e00806160701l428584c0i30acac57338b9357@mail.gmail.com> <200806170935.38904.jblanca@btc.upv.es> Message-ID: <320fb6e00806170146j6f1843e6hed4166ad62c84423@mail.gmail.com> On Tue, Jun 17, 2008 at 8:35 AM, Jose Blanca wrote: > Hi: > My main use of the Alignment class is to parse Ace files. I've been thinking > about that problem recently. My proposal to modify SeqRecord was due to this > problem. I think that the best solution would be to treat the Alignment as a > sequence. The consensus would be the actual sequences and the aligned read > would be features with per-base-annotations. So integrating the "ace" format into Bio.SeqIO representing the consensus sequence of each contig as a SeqRecord would be useful. 
Initially I would try and represent the aligned reads as SeqFeature objects (much like when reading a genome from a GenBank file you get CDS features with their amino acid translation). Note that for memory reasons, I would be inclined to scan over the Ace file in one pass (using the existing Iterator in the Bio.Sequencing.Ace parser) returning SeqRecords as we go. As Frank points out in the code comments, this means we can't easily include the WA, CT, RT and WR tags found in the Ace file footer. Do you use this information Jose? > I've implemented such a class > and it works fine for me. In fact the Alignment class is just a wrapper > around a standard SeqRecord (I name it RichSeq in my implementation). > To do that you just need a SeqRecord with a __getitem__ method. You have > already proposing that so that's not a problem. Your enthusiasm Jose is one of the things motivating me to try and do more with the Seq and SeqRecord. Without a third party to offer feedback, making big changes is risky. > Padding with spaces is not an option when you're dealing with genomic wide > alignments, that's one of the problems of the actual Alignment class. It might make sense to talk about a "Contig Alignment" object/class, compared to the existing "multiple sequence alignment" object/class where all the sequences are the same length. Ideally these should provide as similar an API as possible - even if the internals are different. One idea is a sub-class of the current alignment class which stores an offset (>=0) for each supporting read, used when accessing columns. Maybe we should check out BioPerl etc for inspiration? > If you want I can send my implementation to the list, although it could take a > while because I've got my home computer dead. Good luck with the broken computer - I hope you have an easier time fixing it / rebuilding it than I did last time this happened to me. 
Regards, Peter From biopython at maubp.freeserve.co.uk Tue Jun 17 09:16:29 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 17 Jun 2008 10:16:29 +0100 Subject: [Biopython-dev] Iterating over Ace contig files Message-ID: <320fb6e00806170216k12ecd88fof60758db1ccec3cf@mail.gmail.com> Hello Frank, I wanted to get your opinion on iterating over the Ace file contig by contig, and what is lost in the WA, CT, RT and WR tags at the end of the file by doing this. As large sequencing runs become more common, iterating over the file in a single pass WITHOUT keeping everything in memory does seem to be desirable. Similar past discussions: http://portal.open-bio.org/pipermail/biopython/2004-February/001825.html http://portal.open-bio.org/pipermail/biopython/2005-May/002661.html Would you object to me rewording your module's header comment to say that the Ace Iterator is NOT deprecated, but rather that it has certain drawbacks? [The context for this is my recent thread on the Biopython dev mailing list about integrating your Bio.Sequencing.Ace parser into Bio.SeqIO and/or Bio.AlignIO - I've included a little context below.] Thanks, Peter -- Peter wrote: >> So integrating the "ace" format into Bio.SeqIO representing the >> consensus sequence of each contig as a SeqRecord would be useful. >> Initially I would try and represent the aligned reads as SeqFeature >> objects (much like when reading a genome from a GenBank file you get >> CDS features with their amino acid translation). >> >> Note that for memory reasons, I would be inclined to scan over the Ace >> file in one pass (using the existing Iterator in the >> Bio.Sequencing.Ace parser) returning SeqRecords as we go. As Frank >> points out in the code comments, this means we can't easily include >> the WA, CT, RT and WR tags found in the Ace file footer. Do you use >> this information Jose? Jose replied, > I haven't used the iterator because of the deprecation warning of the code. 
I > tried with about 40000 alignments and it worked on a computer with 8 GB of RAM. > If there are more sequences, and there will be with the 454 sequencer, we will > have trouble reading them all at once. I vote for the iterator approach. I have not > used the information in these tags, and I also don't know what they mean. I've > been looking for documentation about this format, but I've found none; do you > have any good ace documentation? From bugzilla-daemon at portal.open-bio.org Tue Jun 17 11:23:59 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 17 Jun 2008 07:23:59 -0400 Subject: [Biopython-dev] [Bug 2520] New: Reading ACE assembly contig files in Bio.SeqIO Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2520 Summary: Reading ACE assembly contig files in Bio.SeqIO Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk As I suggested on the mailing list, we could use Bio.Sequencing.Ace to parse ACE assembly files, and then turn each contig into a SeqRecord using the consensus sequence. I will attach a basic implementation which only uses the consensus sequence and its name. For now this ignores all the meta data and in particular the read information. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Jun 17 11:29:15 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 17 Jun 2008 07:29:15 -0400 Subject: [Biopython-dev] [Bug 2520] Reading ACE assembly contig files in Bio.SeqIO In-Reply-To: Message-ID: <200806171129.m5HBTFVG026790@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2520 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-17 07:29 EST ------- Created an attachment (id=944) --> (http://bugzilla.open-bio.org/attachment.cgi?id=944&action=view) New file Bio/SeqIO/AceIO.py This new file would be added to Bio.SeqIO in the usual way (updating Bio/SeqIO/__init__.py to import this module and map the format "ace" to the new iterator). Handling different gap characters in Bio.SeqIO (and translating them when reading and writing files) has not been formalised. Where possible, converting them into dashes on loading seems to be a sensible route to take. Therefore I deliberately map any "*" gap characters in the consensus sequence into "-" characters, which are used by default in the alphabet class and are far more commonly used. The "*" character is typically associated with a stop codon in protein sequences, which is another reason to avoid using it here. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From fkauff at biologie.uni-kl.de Tue Jun 17 13:06:34 2008 From: fkauff at biologie.uni-kl.de (Frank Kauff) Date: Tue, 17 Jun 2008 15:06:34 +0200 Subject: [Biopython-dev] Iterating over Ace contig files In-Reply-To: <320fb6e00806170216k12ecd88fof60758db1ccec3cf@mail.gmail.com> References: <320fb6e00806170216k12ecd88fof60758db1ccec3cf@mail.gmail.com> Message-ID: <4857B6DA.9040309@biologie.uni-kl.de> Hi Peter, makes total sense to me. 
Feel free to do the changes as you see it fit Frank Peter wrote: > Hello Frank, > > I wanted to get your opinion on iterating over the Ace file contig by > contig, and what is lost in the WA, CT, RT and WR tags at the end of > the file by doing this. As large sequencing runs become more common, > iterating over the file in a single pass WITHOUT keeping everything in > memory does seem to be desirable. > > Similar past discussions: > http://portal.open-bio.org/pipermail/biopython/2004-February/001825.html > http://portal.open-bio.org/pipermail/biopython/2005-May/002661.html > > Would you object to me rewording your module's header-comment not to > say that the Ace Iterator is NOT deprecated, but rather that it has > certain drawbacks. > > [The context for this is my recent thread on the Biopython dev mailing > list about integrating your Bio.Sequencing.Ace parser into Bio.SeqIO > and/or Bio.AlignIO - I've included a little context below.] > > Thanks, > > Peter > > -- > > Peter wrote: > >>> So integrating the "ace" format into Bio.SeqIO representing the >>> consensus sequence of each contig as a SeqRecord would be useful. >>> Initially I would try and represent the aligned reads as SeqFeature >>> objects (much like when reading a genome from a GenBank file you get >>> CDS features with their amino acid translation). >>> >>> Note that for memory reasons, I would be inclined to scan over the Ace >>> file in one pass (using the existing Iterator in the >>> Bio.Sequencing.Ace parser) returning SeqRecords as we go. As Frank >>> points out in the code comments, this means we can't easily include >>> the WA, CT, RT and WR tags found in the Ace file footer. Do you use >>> this information Jose? >>> > > Jose replied, > >> I haven't used the iterator because of the deprecation warning of the code. I >> tried with about 40000 alignments and it worked in a computer with 8 GB of ram. 
>> If there are more sequences, and there will be with the 454 sequencer, we will >> have trouble reading all at once. I vote for the iterator approach. I have not >> used the information of this tag, but I don't know also what they mean. I've >> been looking for documentation about this format, but I've found none, do you >> have any good ace documentation? >> > > From biopython at maubp.freeserve.co.uk Tue Jun 17 13:53:23 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 17 Jun 2008 14:53:23 +0100 Subject: [Biopython-dev] Iterating over Ace contig files In-Reply-To: <4857B6DA.9040309@biologie.uni-kl.de> References: <320fb6e00806170216k12ecd88fof60758db1ccec3cf@mail.gmail.com> <4857B6DA.9040309@biologie.uni-kl.de> Message-ID: <320fb6e00806170653g482b104fl739107fcada06dc8@mail.gmail.com> On Tue, Jun 17, 2008 at 2:06 PM, Frank Kauff wrote: > Hi Peter, > > makes totally sense to me. Feel free to do the changes as you see it fit > > Frank Thanks Frank. I've checked in some comment changes to both Ace.py and Phd.py, aimed at both improving the documentation and trying to make epydoc happier for the automatic API documentation: http://biopython.org/DIST/docs/api/ Peter P.S. I also added an __iter__ method to the Ace Iterator (Phd already had one). From mjldehoon at yahoo.com Tue Jun 17 14:08:31 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 17 Jun 2008 07:08:31 -0700 (PDT) Subject: [Biopython-dev] Iterating over Ace contig files In-Reply-To: <320fb6e00806170653g482b104fl739107fcada06dc8@mail.gmail.com> Message-ID: <399611.60966.qm@web62415.mail.re1.yahoo.com> Note that bug #2454 also pertains to the Ace and Phd parsers. If you are modifying the Ace and Phd parsers, can you fix this bug at the same time? http://bugzilla.open-bio.org/show_bug.cgi?id=2454 --Michiel. Peter wrote: On Tue, Jun 17, 2008 at 2:06 PM, Frank Kauff wrote: > Hi Peter, > > makes totally sense to me. Feel free to do the changes as you see it fit > > Frank Thanks Frank. 
I've checked in some comment changes to both Ace.py and Phd.py, aimed at both improving the documentation and trying and make epydoc happier for the automatic API documentation: http://biopython.org/DIST/docs/api/ Peter P.S. I also added an __iter__ method to the Ace Iterator (Phd already had one). _______________________________________________ Biopython-dev mailing list Biopython-dev at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython-dev From bugzilla-daemon at portal.open-bio.org Tue Jun 17 14:43:42 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 17 Jun 2008 10:43:42 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806171443.m5HEhgua005645@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #15 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-17 10:43 EST ------- I've removed the strict file-like test in: Bio/Sequencing/Ace.py revision: 1.12 Bio/Sequencing/Phd.py revision: 1.6 In these cases, the handle is immediately turned into an UndoHandle which will be able to check for a sufficiently file like object. Hopefully that's what you meant Michiel - we could go further and introduce a parse() function and deprecate the Iterator objects in these modules. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
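[Editorial sketch, not Biopython's actual UndoHandle.] Comment #15 relies on wrapping the handle instead of testing its type up front. A simplified, hypothetical wrapper in that spirit could look like the following; the class name PushbackHandle and the choice to require only readline() are assumptions made for illustration.

```python
import io


class PushbackHandle:
    """Hypothetical, simplified stand-in for an undo-style handle wrapper."""

    def __init__(self, handle):
        # Duck typing: only insist on the one method actually needed.
        if not hasattr(handle, "readline"):
            raise TypeError("handle must provide a readline() method")
        self._handle = handle
        self._saved = []

    def readline(self):
        # Lines pushed back earlier are returned before new input is read.
        if self._saved:
            return self._saved.pop()
        return self._handle.readline()

    def saveline(self, line):
        # Push a line back so the next readline() returns it again.
        self._saved.append(line)

    def peekline(self):
        # Look at the next line without consuming it.
        line = self.readline()
        self.saveline(line)
        return line

    def __iter__(self):
        return self

    def __next__(self):
        line = self.readline()
        if not line:
            raise StopIteration
        return line


# Usage: any object with readline() is accepted, here a StringIO.
wrapped = PushbackHandle(io.StringIO("CO Contig1\nacgt\n"))
first = wrapped.peekline()
remaining = list(wrapped)
```

The point of the design is that StringIO objects, network responses, and real files all pass the same check, so the parser no longer needs a strict file-type test of its own.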
From bugzilla-daemon at portal.open-bio.org Wed Jun 18 10:34:43 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 18 Jun 2008 06:34:43 -0400 Subject: [Biopython-dev] [Bug 2503] An error when parsing NCBIWWW Blast output In-Reply-To: Message-ID: <200806181034.m5IAYhS1026214@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2503 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |INVALID ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-18 06:34 EST ------- I'm closing this bug as "INVALID" due to a lack of information. If you are still having trouble Prashantha, and can give us some more information, please re-open this bug. Thank you. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jun 18 11:34:26 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 18 Jun 2008 07:34:26 -0400 Subject: [Biopython-dev] [Bug 2497] Unit tests do not cover Bio.Blast.NCBIWWW.qblast() In-Reply-To: Message-ID: <200806181134.m5IBYQjC032061@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2497 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-18 07:34 EST ------- I checked in a slightly revised version of this as test_NCBI_qblast.py - marking this bug as fixed. 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jun 18 12:01:11 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 18 Jun 2008 08:01:11 -0400 Subject: [Biopython-dev] [Bug 2508] NCBIStandalone.blastall: provide support for '-F F' and make it safe In-Reply-To: Message-ID: <200806181201.m5IC1BxA001255@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2508 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-18 08:01 EST ------- Created an attachment (id=946) --> (http://bugzilla.open-bio.org/attachment.cgi?id=946&action=view) Patch to Bio/Blast/NCBIStandalone.py and Tests/test_NCBIStandalone.py Suggested patch for the command injection risk. Can anyone think of a legitimate reason for a ; or & character in the parameters of a BLAST command line? This patch is very simple and will reject any keyword parameter containing the ; or && characters. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
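
A minimal sketch of the kind of check described in comment #3 above. This is not the code in attachment 946: the function name check_blast_params and the error message are hypothetical, and only the rule itself (reject any keyword parameter containing ; or &&) is taken from the comment.

```python
def check_blast_params(**keywds):
    """Reject keyword parameters that contain shell command separators.

    Hypothetical helper for illustration only; it is not the patch
    attached to the bug report.
    """
    for key, value in keywds.items():
        text = str(value)
        # ';' and '&&' would let a second command be chained onto the
        # BLAST command line if the string reaches a shell unescaped.
        if ";" in text or "&&" in text:
            raise ValueError(
                "Rejecting suspicious argument for %s: %r" % (key, value)
            )
    return keywds
```

A blacklist like this only covers the separators it names. Building the command as an argument list and running it without a shell (for example with the subprocess module) would avoid the interpretation step entirely, at the cost of a larger change.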
From biopython at maubp.freeserve.co.uk Wed Jun 18 14:00:56 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 18 Jun 2008 15:00:56 +0100
Subject: [Biopython-dev] SeqRecord to file format as string
In-Reply-To: <6243BAA9F5E0D24DA41B27997D1FD14402B63C@mail2.exch.c2b2.columbia.edu>
References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk> <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu> <4714EB8E.3000700@maubp.freeserve.co.uk> <6243BAA9F5E0D24DA41B27997D1FD14402B63C@mail2.exch.c2b2.columbia.edu>
Message-ID: <320fb6e00806180700k327e6913m7ba9c4bdc3421f67@mail.gmail.com>

This is returning to a thread last year, about getting a SeqRecord into a string in a particular file format (e.g. fasta). Jared Flatow had suggested adding a method to the SeqRecord itself.

Jared wrote:
> > ... To always have to write to a file feels strange, but I see
> > that it would be messy to go OO since there are so many formats.
> > However, giving preference to fasta over other formats by making it
> > innate doesn't seem like such a terrible idea. I do have mixed
> > feelings about 'bloating' the code which is why I asked, and you have
> > convinced me that this is not quite appropriate given existing
> > convention. However the idea would be to put the to_fasta or
> > to_format method inside the SeqRecord, then to call it from the IO
> > when needed to actually write to a file, but call it directly when
> > all that is wanted is a string...
>
> It's debatable isn't it? I suspect that for most users, when they want a
> record in a particular file format it's for writing to a file. However,
> adding a to_format() method to a SeqRecord makes some sense (suitable for
> sequential file formats only). This would take a format name and return
> a string, by calling Bio.SeqIO with a StringIO object internally.
> > Peter Jared - On reflection, do you think adding a method like this to the SeqRecord (or even just for the FASTA format) would be useful? I recently found myself wanting to use this sort of functionality, and remembered this old thread. This time I was wondering about using the method name tostring (matching the name of a Seq object method). In order to mimic the Seq object's method, the format would be optional and when omitted would give the sequence as a string. Otherwise one of the lower case strings used in Bio.SeqIO should be supplied. There is a sample implementation at the end of this email. ? On Wed, Oct 17, 2007 Michiel De Hoon wrote: > How about the following: > > SeqIO.write(sequences, handle, format) returns the properly formatted string > if handle==None. I can see the above is simpler than having to supply a StringIO handle, but it doesn't make the functionality available directly from the SeqRecord object. It also complicates the API of the SeqIO module with a special case. Peter -- ###################################### For the SeqRecord class, in Bio/SeqRecord.py ###################################### def tostring(self, format=None) : """Returns the record as a string in the specified file format. If the file format is omitted (default), the sequence itself is returned as a string. 
Otherwise the format should be a lower case string supported by Bio.SeqIO, which is used to turn the SeqRecord into a string.""" if format : from StringIO import StringIO from Bio import SeqIO handle = StringIO() SeqIO.write([self], handle, format) handle.seek(0) return handle.read() else : #Return the sequence as a string return self.seq.tostring() ############################################ From jflatow at northwestern.edu Wed Jun 18 15:25:18 2008 From: jflatow at northwestern.edu (Jared Flatow) Date: Wed, 18 Jun 2008 10:25:18 -0500 Subject: [Biopython-dev] SeqRecord to file format as string In-Reply-To: <4D53AB82-F673-4F4F-BCEC-BA06088E8721@northwestern.edu> References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk> <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu> <4714EB8E.3000700@maubp.freeserve.co.uk> <6243BAA9F5E0D24DA41B27997D1FD14402B63C@mail2.exch.c2b2.columbia.edu> <320fb6e00806180700k327e6913m7ba9c4bdc3421f67@mail.gmail.com> <4D53AB82-F673-4F4F-BCEC-BA06088E8721@northwestern.edu> Message-ID: <55567F98-C5F5-4A2F-8542-502F17F485E9@northwestern.edu> Quick correction: On Jun 18, 2008, at 10:16 AM, Jared Flatow wrote: > Hi Peter, > > On Jun 18, 2008, at 9:00 AM, Peter wrote: > >> Jared - On reflection, do you think adding a method like this to the >> SeqRecord (or even just for the FASTA format) would be useful? > > Yes I still think so. In fact, for sequences, I would say that I > pretty much never deal with a format ever than FASTA, so even making > the __str__ method of SeqRecord return the FASTA format as well > seems reasonable, though perhaps my use cases are different than > others. > > However, py3k and 2.6 will make available the functionality > described in PEP 3101: > > http://www.python.org/dev/peps/pep-3101/ > > I think it would be best to define some semantics that are > compatible with this PEP. 
This would basically mean using the > __format__ method (which could be the same as the tostring method > you have defined below). To achieve backward compatibility and/or a > more OO interface, tostring could just be an alias for __format__. > Thus, instead of calling format(seq_rec, 'fasta') one could call > seq_rec.tostring('fasta') and these would be equivalent. The PEP > also states that format(seq_rec) should be the same as str(seq_rec). On second thought it seems like a .format method (similar to the one the string class is acquiring) should be used as an alias to __format__ (somehow I think tostring should always be the same as __str__) > In short, I think creating methods to return formatted versions of > objects (SeqRecords) is a good idea, but most especially if it is > done in a way consistent with the language's vision. > > Best, > jared From bugzilla-daemon at portal.open-bio.org Wed Jun 18 15:36:48 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 18 Jun 2008 11:36:48 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806181536.m5IFamvB015695@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #16 from mdehoon at ims.u-tokyo.ac.jp 2008-06-18 11:36 EST ------- (In reply to comment #15) > I've removed the strict file-like test in: > > Bio/Sequencing/Ace.py revision: 1.12 > Bio/Sequencing/Phd.py revision: 1.6 > > In these cases, the handle is immediately turned into an UndoHandle which will > be able to check for a sufficiently file like object. > > Hopefully that's what you meant Michiel Actually, I think we should avoid using an UndoHandle altogether, now that Python has generator functions. > - we could go further and introduce a > parse() function and deprecate the Iterator objects in these modules. > That would make things a lot easier. 
An Iterator class was useful in older versions of Python, but generator functions provide a cleaner alternative.

In Ace.py, we'd need three functions:
1) read(handle), which returns one record (Contig) read from the handle, and None otherwise;
2) parse(handle), a generator function returning an iterator over the records;
3) a local function _process_line(line, record)

These functions then look like this:

    def read(handle):
        record = None
        # Skip ahead to the first contig ('CO') line
        for line in handle:
            if line[:2]=='CO':
                break
        else:
            return None
        record = Contig()
        for line in handle:
            if line[:2]=='CO':
                return record
            else:
                _process_line(line, record)
        # End of input reached: return the last record
        return record

    def parse(handle):
        record = None
        for line in handle:
            if line[:2]=='CO':
                if record:
                    yield record
                record = Contig()
            _process_line(line, record)
        # A generator must yield (not return) the final record
        if record:
            yield record

The actual work is done in _process_line. So we don't need to store the read lines explicitly; this is now taken care of by the generator function. Hence, we don't need to convert the handle to an UndoHandle. In addition, handle can now also be a list of lines instead of a file handle. In this respect, I think Zachary was right in comment #11:

> Maybe it's a good idea for any parsers/iterators to just
> use the iterator-like ability of file handles?

In other words, as long as we can pull lines from the handle, we can parse it.

In Phd.py, it's even simpler. Here, we only need the read() and parse() functions:

    def read(handle):
        for line in handle:
            if line.startswith("BEGIN_SEQUENCE"):
                record = Record()
            elif line.startswith("END_SEQUENCE"):
                return record
            else:
                # do the actual processing of the other lines here
                pass

    def parse(handle):
        while True:
            record = read(handle)
            if not record:
                return
            yield record

Again, we can process each line just as it comes along. No UndoHandle, no Parser, no Consumer, no Scanner needed.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
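
The generator approach proposed in comment #16 can be tried out with a small self-contained sketch. It is written for a current Python, and the Contig class and _process_line function here are stand-ins invented for the example (they just collect raw lines); they are not the Bio.Sequencing.Ace implementations. Unlike the outline in the comment, this version ignores any lines before the first 'CO' line, which is an assumption about the desired behaviour.

```python
class Contig:
    """Stand-in record type for this sketch (not Bio.Sequencing.Ace.Contig)."""

    def __init__(self):
        self.lines = []


def _process_line(line, record):
    # Placeholder for the real per-line parsing: just keep the raw text.
    record.lines.append(line.rstrip("\n"))


def parse(handle):
    """Yield one Contig for each block that starts with a 'CO' line.

    `handle` can be anything that yields lines when iterated over:
    an open file, a StringIO object, or a plain list of strings.
    """
    record = None
    for line in handle:
        if line[:2] == "CO":
            if record is not None:
                yield record          # previous contig is complete
            record = Contig()
        if record is not None:        # skip header lines before the first CO
            _process_line(line, record)
    if record is not None:
        yield record                  # last contig in the input
```

Because the function only iterates over its argument, something like `for contig in parse(open("example.ace")): ...` and `parse(list_of_lines)` would both work, with no UndoHandle needed (the file name is a placeholder).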
From jflatow at northwestern.edu Wed Jun 18 15:16:59 2008 From: jflatow at northwestern.edu (Jared Flatow) Date: Wed, 18 Jun 2008 10:16:59 -0500 Subject: [Biopython-dev] SeqRecord to file format as string In-Reply-To: <320fb6e00806180700k327e6913m7ba9c4bdc3421f67@mail.gmail.com> References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk> <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu> <4714EB8E.3000700@maubp.freeserve.co.uk> <6243BAA9F5E0D24DA41B27997D1FD14402B63C@mail2.exch.c2b2.columbia.edu> <320fb6e00806180700k327e6913m7ba9c4bdc3421f67@mail.gmail.com> Message-ID: <4D53AB82-F673-4F4F-BCEC-BA06088E8721@northwestern.edu> Hi Peter, On Jun 18, 2008, at 9:00 AM, Peter wrote: > Jared - On reflection, do you think adding a method like this to the > SeqRecord (or even just for the FASTA format) would be useful? Yes I still think so. In fact, for sequences, I would say that I pretty much never deal with a format ever than FASTA, so even making the __str__ method of SeqRecord return the FASTA format as well seems reasonable, though perhaps my use cases are different than others. However, py3k and 2.6 will make available the functionality described in PEP 3101: http://www.python.org/dev/peps/pep-3101/ I think it would be best to define some semantics that are compatible with this PEP. This would basically mean using the __format__ method (which could be the same as the tostring method you have defined below). To achieve backward compatibility and/or a more OO interface, tostring could just be an alias for __format__. Thus, instead of calling format(seq_rec, 'fasta') one could call seq_rec.tostring('fasta') and these would be equivalent. The PEP also states that format(seq_rec) should be the same as str(seq_rec). In short, I think creating methods to return formatted versions of objects (SeqRecords) is a good idea, but most especially if it is done in a way consistent with the language's vision. 
Best, jared From yair.benita at gmail.com Wed Jun 18 17:26:02 2008 From: yair.benita at gmail.com (Yair Benita) Date: Wed, 18 Jun 2008 13:26:02 -0400 Subject: [Biopython-dev] BioPax parser Message-ID: Hi Guys, Does anyone have a biopax parser written in python? Thanks, Yair From biopython at maubp.freeserve.co.uk Wed Jun 18 17:42:13 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 18 Jun 2008 18:42:13 +0100 Subject: [Biopython-dev] BioPax parser In-Reply-To: References: Message-ID: <320fb6e00806181042y169f580epbd8c876eb3cb57fa@mail.gmail.com> On Wed, Jun 18, 2008 at 6:26 PM, Yair Benita wrote: > Hi Guys, > Does anyone have a biopax parser written in python? > Thanks, > Yair I don't know of any (but I haven't searched). From a quick look on www.biopax.org they use XML, so you should be able to parse it in python fairly easily - but I guess some sort of object orientated representation of the data would be very nice to have. Peter From bugzilla-daemon at portal.open-bio.org Thu Jun 19 10:08:55 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 19 Jun 2008 06:08:55 -0400 Subject: [Biopython-dev] [Bug 2508] NCBIStandalone.blastall: provide support for '-F F' and make it safe In-Reply-To: Message-ID: <200806191008.m5JA8t0v016495@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2508 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-19 06:08 EST ------- On the issue of the low-complexity filter, that is actually already supported in NCBIStandalone.blastall(), NCBIStandalone.blastpgp() and NCBIStandalone.rpsblast() using the optional argument 'filter'. This is described in the doc string too, although it doesn't use the phrase "low complexity" which might be clearer. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Jun 19 10:20:03 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 19 Jun 2008 06:20:03 -0400 Subject: [Biopython-dev] [Bug 2494] _retrieve_taxon in BioSQL.py needs urgent optimization In-Reply-To: Message-ID: <200806191020.m5JAK3OZ017201@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2494 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-19 06:20 EST ------- I'm marking this as fixed now, but if anyone does find an issue with it please re-open the bug. Thanks for your work on this Eric. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jun 19 10:41:22 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 19 Jun 2008 06:41:22 -0400 Subject: [Biopython-dev] [Bug 2408] GenBank records do not contain U's In-Reply-To: Message-ID: <200806191041.m5JAfMNK018058@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2408 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-19 06:41 EST ------- Given there were no other opinions voiced on how to handle this, I went ahead and fixed this in Bio/GenBank/__init__.py CVS revision 1.83 For records from RNA, if the sequence contains T but not U, we will use a DNA alphabet in the Seq object. Thanks for raising this Marcin. 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mjldehoon at yahoo.com Thu Jun 19 13:04:31 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 19 Jun 2008 06:04:31 -0700 (PDT) Subject: [Biopython-dev] Bio.CDD, anyone? Message-ID: <14893.84074.qm@web62409.mail.re1.yahoo.com> Hi everybody, Bio.CDD is a module with a parser for CDD (NCBI's Conserved Domain Database) records. The parser parses HTML pages from CDD's web site. Since the parser was written about six years ago, the CDD web site has changed considerably. Bio.CDD therefore cannot parse current HTML pages from CDD. So I am wondering: 1) Is anybody using Bio.CDD? 2) Is anybody willing to update Bio.CDD to handle current HTML? 3) If not, can we deprecate it? There is not much purpose of having a parser for HTML pages from years ago. --Michiel. From biopython at maubp.freeserve.co.uk Thu Jun 19 13:38:29 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 19 Jun 2008 14:38:29 +0100 Subject: [Biopython-dev] Bio.CDD, anyone? In-Reply-To: <14893.84074.qm@web62409.mail.re1.yahoo.com> References: <14893.84074.qm@web62409.mail.re1.yahoo.com> Message-ID: <320fb6e00806190638y2e3729e1ga66561de0c962700@mail.gmail.com> > Bio.CDD is a module with a parser for CDD (NCBI's Conserved Domain Database) > records. The parser parses HTML pages from CDD's web site. Since the parser > was written about six years ago, the CDD web site has changed considerably. > Bio.CDD therefore cannot parse current HTML pages from CDD. A couple of years ago, I wanted to get the CDD domain name and description and ended up writing my own very simple and crude parser to extract just this information. Doing a proper job would mean extracting lots and lots of fields, e.g. 
http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=29475 I wonder if the NCBI make any of this available as XML via Entrez? I had a quick look and couldn't find anything. Peter From mjldehoon at yahoo.com Thu Jun 19 13:58:25 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 19 Jun 2008 06:58:25 -0700 (PDT) Subject: [Biopython-dev] Bio.CDD, anyone? In-Reply-To: <320fb6e00806190638y2e3729e1ga66561de0c962700@mail.gmail.com> Message-ID: <352888.20937.qm@web62409.mail.re1.yahoo.com> > I wonder if the NCBI make any of this available as XML via Entrez? I > had a quick look and couldn't find anything. Actually I already asked this question to NCBI. Their answer was that a subset of the information shown on the web page is available as XML via Entrez's ESummary and EFetch (and thus available from Biopython). The full CDD records are stored as one large file, which is obtainable from NCBI's ftp site, but currently it is not possible to get individual CDD records except in HTML form through the NCBI website. --Michiel. Peter wrote: > Bio.CDD is a module with a parser for CDD (NCBI's Conserved Domain Database) > records. The parser parses HTML pages from CDD's web site. Since the parser > was written about six years ago, the CDD web site has changed considerably. > Bio.CDD therefore cannot parse current HTML pages from CDD. A couple of years ago, I wanted to get the CDD domain name and description and ended up writing my own very simple and crude parser to extract just this information. Doing a proper job would mean extracting lots and lots of fields, e.g. http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=29475 I wonder if the NCBI make any of this available as XML via Entrez? I had a quick look and couldn't find anything. Peter From biopython at maubp.freeserve.co.uk Thu Jun 19 21:08:13 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 19 Jun 2008 22:08:13 +0100 Subject: [Biopython-dev] test_Entrez.py fails on Linux? 
Message-ID: <320fb6e00806191408t45a45da8hda0c2fc8a39aae57@mail.gmail.com>

Hi Michiel,

I've just tried the unit tests on a clean checkout on Linux, and there is a problem with test_Entrez.py (shown below). I'm pretty sure it was working for me on Mac OS X this afternoon, so this may be platform specific. I haven't been using Biopython on Windows recently so I don't know if that is working or not.

If you can't reproduce this, let me know and I'll do some investigation here.

The good news is all the other tests seem fine on Linux (bar the GFF, dnal and the population genetics tests for which I don't have the external dependencies installed).

Peter

This is the output I get on python 2.4.3, using 64bit Ubuntu Dapper Drake (a little old now).

maubp at shuttle2:~/repository/biopython/Tests$ python test_Entrez.py
Test parsing database list returned by EInfo ... ok
Test parsing database info returned by EInfo ... ok
Test parsing XML returned by ESearch from the Journals database ... ok
Test parsing XML returned by ESearch when no items were found ... ok
Test parsing XML returned by ESearch from the Nucleotide database ... ok
Test parsing XML returned by ESearch from PubMed Central ... ok
Test parsing XML returned by ESearch from the Protein database ... ok
Test parsing XML returned by ESearch from PubMed (first test) ... ok
Test parsing XML returned by ESearch from PubMed (second test) ... ok
Test parsing XML returned by ESearch from PubMed (third test) ... ok
Test parsing XML returned by EPost ... ok
Test parsing XML returned by EPost with an invalid id (overflow tag) ... ok
Test parsing XML returned by EPost with incorrect arguments ... ERROR
Test parsing XML returned by ESummary from the Journals database ... ok
Test parsing XML returned by ESummary from the Nucleotide database ... ok
Test parsing XML returned by ESummary from the Protein database ... ok
Test parsing XML returned by ESummary from PubMed ... ok
Test parsing XML returned by ESummary from the Structure database ... ok
Test parsing XML returned by ESummary from the Taxonomy database ... ok
Test parsing XML returned by ESummary from the UniSTS database ... ok
Test parsing XML returned by ESummary with incorrect arguments ... ERROR
Test parsing cancerchromosomes links returned by ELink ... ok
Test parsing medline indexed articles returned by ELink ... ok
Test parsing Nucleotide to Protein links returned by ELink ... ok
Test parsing pubmed links returned by ELink (first test) ... ok
Test parsing pubmed links returned by ELink (second test) ... ok
Test parsing pubmed link returned by ELink (third test) ... ok
Test parsing pubmed links returned by ELink (fourth test) ... ok
Test parsing pubmed links returned by ELink (fifth test) ... ok
Test parsing pubmed links returned by ELink (sixth test) ... ok
Test parsing XML returned by EFetch, Journals database ... ok
Test parsing XML returned by EFetch, Nucleotide database (first test) ... ok
Test parsing XML returned by EFetch, Protein database ... ok
Test parsing XML returned by EFetch, OMIM database ... ok
Test parsing XML returned by EFetch, PubMed database (first test) ... ok
Test parsing XML returned by EFetch, PubMed database (second test) ... ok
Test parsing XML returned by EFetch, Taxonomy database ... ok
Test parsing XML output returned by EGQuery (first test) ... ok
Test parsing XML output returned by EGQuery (second test) ... ok
Test parsing XML output returned by ESpell ... ok

======================================================================
ERROR: Test parsing XML returned by EPost with incorrect arguments
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 560, in t_wrong
    assert exception.message=="Wrong DB name"
AttributeError: RuntimeError instance has no attribute 'message'

======================================================================
ERROR: Test parsing XML returned by ESummary with incorrect arguments
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Entrez.py", line 943, in t_wrong
    assert exception.message=="Neither query_key nor id specified"
AttributeError: RuntimeError instance has no attribute 'message'

----------------------------------------------------------------------
Ran 40 tests in 0.471s

FAILED (errors=2)

From biopython at maubp.freeserve.co.uk Fri Jun 20 09:31:21 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 20 Jun 2008 10:31:21 +0100
Subject: [Biopython-dev] test_Entrez.py fails on Linux?
In-Reply-To: <320fb6e00806191408t45a45da8hda0c2fc8a39aae57@mail.gmail.com>
References: <320fb6e00806191408t45a45da8hda0c2fc8a39aae57@mail.gmail.com>
Message-ID: <320fb6e00806200231y716c5a1ds2495f16a56a15f88@mail.gmail.com>

> Hi Michiel,
>
> I've just tried the unit tests on a clean checkout on Linux, and there
> is a problem with test_Entrez.py (shown below). I'm pretty sure it
> was working for me on Mac OS X this afternoon, so this may be platform
> specific. I haven't been using Biopython on Windows recently so I don't
> know if that is working or not.

I've just checked, and on a clean CVS checkout under Mac OS 10.5 Leopard with python 2.5.2, test_Entrez.py passes. A clean checkout last night on 64bit Ubuntu Dapper Drake with python 2.4.3 failed. So whatever is going wrong is probably OS specific or perhaps python version specific.

Peter

From bugzilla-daemon at portal.open-bio.org Fri Jun 20 10:07:59 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 20 Jun 2008 06:07:59 -0400
Subject: [Biopython-dev] [Bug 2524] New: Handle missing libraries like TextTools in run_tests.py
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2524

           Summary: Handle missing libraries like TextTools in run_tests.py
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Documentation
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk

Once upon a time, we treated any ImportError from a unit test as a reason to skip the test gracefully, as these are *usually* from missing external dependencies. This could hide real errors if we had (re)moved a Biopython module.

We now use the Bio.MissingExternalDependencyError exception, and the unit tests themselves will raise this for missing command line tools or certain optional libraries like MySQLdb.

However, the Bio.MissingExternalDependencyError exception does not get raised when the following commonly used external dependencies are missing:

import TextTools
import Numeric
import reportlab

It is now possible to install Biopython without TextTools and reportlab (and Numeric?), and make use of a lot of its functionality - but the unit tests give nasty error messages.

I propose we either:

(a) Add a special case to run_tests.py to catch specific ImportError cases and skip the test with a suitable message (patch to follow). Specifically TextTools, reportlab and Numeric - but potentially other third party libraries like MySQLdb could be handled too. This keeps the individual unit tests simple.

or:

(b) Modify all the tests using these semi-optional libraries to catch the ImportError and raise MissingExternalDependencyError instead. As the tests themselves generally don't directly import the external library this is perhaps messy.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Fri Jun 20 10:09:37 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 20 Jun 2008 06:09:37 -0400
Subject: [Biopython-dev] [Bug 2524] Handle missing libraries like TextTools in run_tests.py
In-Reply-To:
Message-ID: <200806201009.m5KA9b98019988@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2524

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-20 06:09 EST -------
Created an attachment (id=948)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=948&action=view)
Patch to Tests/run_tests.py

Adds a hard coded list of known import errors to be treated as missing external dependencies (i.e. skip the test). This is implemented as a dict allowing a URL to be given.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
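
Option (a) and the dict mentioned in comment #1 could look roughly like the sketch below. It is not the code in attachment 948: the names OPTIONAL_DEPENDENCIES and run_test_module are hypothetical, the URLs are placeholders, and it relies on the `name` attribute that ImportError has in current Python versions rather than matching on the error message text.

```python
import importlib

# Hypothetical table: optional module name -> where to get it.
# The URLs are placeholders, not the real project pages.
OPTIONAL_DEPENDENCIES = {
    "TextTools": "https://example.org/texttools",
    "Numeric": "https://example.org/numeric",
    "reportlab": "https://example.org/reportlab",
}


def run_test_module(name):
    """Import a test module and report 'ok' or a 'skipped' message.

    Only ImportErrors caused by a known optional dependency are turned
    into a skip; any other ImportError is re-raised so that a (re)moved
    Biopython module still shows up as a real failure.
    """
    try:
        importlib.import_module(name)
    except ImportError as err:
        missing = getattr(err, "name", None)
        if missing in OPTIONAL_DEPENDENCIES:
            return "skipped: %s not installed, see %s" % (
                missing, OPTIONAL_DEPENDENCIES[missing])
        raise
    return "ok"
```

Keeping the table in one place means the individual tests stay unchanged, which is the advantage claimed for option (a) in the report.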
From bugzilla-daemon at portal.open-bio.org Fri Jun 20 10:16:49 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 20 Jun 2008 06:16:49 -0400 Subject: [Biopython-dev] [Bug 2525] New: The unit tests GUI run_tests.py does not track skipped tests Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2525 Summary: The unit tests GUI run_tests.py does not track skipped tests Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk Running run_tests.py without the --no-gui command line option counts any skipped tests as passed (green). Furthermore, the skipped message is just printed to the command line (if run from a terminal). Ideally the test framework would report these skipped tests in the GUI, perhaps even with a clickable entry (like the failures) to show the message. [On a personal note, I never use the run_tests.py GUI, and would rather it was not the default. If no one likes it, we could just remove the GUI] -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jun 20 12:17:15 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 20 Jun 2008 08:17:15 -0400 Subject: [Biopython-dev] [Bug 2525] The unit tests GUI run_tests.py does not track skipped tests In-Reply-To: Message-ID: <200806201217.m5KCHFoF025054@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2525 ------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp 2008-06-20 08:17 EST ------- > [On a personal note, I never use the run_tests.py GUI, and would rather it was > not the default. 
If no one likes it, we could just remove the GUI] > Personally, I don't see the advantage of the GUI, and I can live without it. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mjldehoon at yahoo.com Fri Jun 20 12:14:30 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 20 Jun 2008 05:14:30 -0700 (PDT) Subject: [Biopython-dev] test_Entrez.py fails on Linux? In-Reply-To: <320fb6e00806200231y716c5a1ds2495f16a56a15f88@mail.gmail.com> Message-ID: <795994.35527.qm@web62408.mail.re1.yahoo.com> Hi Peter, Thanks for letting me know. It turned out that there were two problems with older Python versions (2.3 and 2.4). One issue was not in Bio.Entrez but in the test script itself, using a feature that is only available in Python 2.5. This is now fixed in CVS. The second issue is with Python 2.3: It does not copy data files to the build directory. Then, when you run "python run_tests.py test_Entrez.py" you will get many error messages about missing DTD files. If you run "python test_entrez.py" instead, the tests are done from the installed Biopython instead of the one in the build directory, and then no errors occur. I guess the only way to solve this is to modify run_tests.py to skip test_Entrez if Python is version 2.3. Unless somebody else has a better suggestion, I will do that. --Michiel. Peter wrote: > Hi Michiel, > > I've just tried the unit tests on a clean checkout on Linux, and there > is a problem with test_Entrez.py (shown below). I'm pretty sure it > was working for me on Mac OS X this afternoon, so this may be platform > specific. I haven't using Biopython on Windows recently so I don't > know if that is working or not. I've just checked, and on a clean CVS checkout under Mac OS 10.5 Leopard with python 2.5.2, test_Entrez.py passes. 
A clean check out last night on 64bit Ubuntu Dapper Drake with python 2.4.3 failed. So whatever is going wrong is probably OS specific or perhaps python version specific. Peter _______________________________________________ Biopython-dev mailing list Biopython-dev at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython-dev From biopython at maubp.freeserve.co.uk Fri Jun 20 12:43:55 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 20 Jun 2008 13:43:55 +0100 Subject: [Biopython-dev] test_Entrez.py fails on Linux? In-Reply-To: <795994.35527.qm@web62408.mail.re1.yahoo.com> References: <320fb6e00806200231y716c5a1ds2495f16a56a15f88@mail.gmail.com> <795994.35527.qm@web62408.mail.re1.yahoo.com> Message-ID: <320fb6e00806200543u62d385fcka3aa9026986549ba@mail.gmail.com> On Fri, Jun 20, 2008 at 1:14 PM, Michiel de Hoon wrote: > Hi Peter, > > Thanks for letting me know. > > It turned out that there were two problems with older Python versions (2.3 and 2.4). > One issue was not in Bio.Entrez but in the test script itself, using a > feature that is only available in Python 2.5. This is now fixed in CVS. Good work. > The second issue is with Python 2.3: It does not copy data files to the > build directory. Then, when you run "python run_tests.py test_Entrez.py" > you will get many error messages about missing DTD files. If you run > "python test_entrez.py" instead, the tests are done from the installed > Biopython instead of the one in the build directory, and then no errors occur. I had suspected there was something like this happening on my Windows machine (which is on python 2.3) but at the time you were still busy updating the code so I didn't worry about it. This issue with non-python files in the build directory reminds me of something Tiago found with his Population Genetics work. I'd have to go over the old emails to double check. 
> I guess the only way to solve this is to modify run_tests.py to skip > test_Entrez if Python is version 2.3. Unless somebody else has a better > suggestion, I will do that. We could modify setup.py under python 2.3 to make sure these files are copied. Is this related to the (reverted) package_data change you tried recently? Peter From biopython at maubp.freeserve.co.uk Fri Jun 20 13:23:21 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 20 Jun 2008 14:23:21 +0100 Subject: [Biopython-dev] test_Entrez.py fails on Linux? In-Reply-To: <320fb6e00806200543u62d385fcka3aa9026986549ba@mail.gmail.com> References: <320fb6e00806200231y716c5a1ds2495f16a56a15f88@mail.gmail.com> <795994.35527.qm@web62408.mail.re1.yahoo.com> <320fb6e00806200543u62d385fcka3aa9026986549ba@mail.gmail.com> Message-ID: <320fb6e00806200623n2148b735t1071aa40b0f24a7c@mail.gmail.com> >> The second issue is with Python 2.3: It does not copy data files to the >> build directory. Then, when you run "python run_tests.py test_Entrez.py" >> you will get many error messages about missing DTD files. If you run >> "python test_entrez.py" instead, the tests are done from the installed >> Biopython instead of the one in the build directory, and then no errors occur. > > ... > > This issue with non-python files in the build directory reminds me of > something Tiago found with his Population Genetics work. I'd have to > go over the old emails to double check. I was thinking of bug 2375, where Tiago had to add a work arround for data files not present in the build directory. 
http://bugzilla.open-bio.org/show_bug.cgi?id=2375 Peter From biopython at maubp.freeserve.co.uk Fri Jun 20 14:42:57 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 20 Jun 2008 15:42:57 +0100 Subject: [Biopython-dev] SeqRecord to file format as string In-Reply-To: <4D53AB82-F673-4F4F-BCEC-BA06088E8721@northwestern.edu> References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk> <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu> <4714EB8E.3000700@maubp.freeserve.co.uk> <6243BAA9F5E0D24DA41B27997D1FD14402B63C@mail2.exch.c2b2.columbia.edu> <320fb6e00806180700k327e6913m7ba9c4bdc3421f67@mail.gmail.com> <4D53AB82-F673-4F4F-BCEC-BA06088E8721@northwestern.edu> Message-ID: <320fb6e00806200742w7e9e57dbt8d0d3362573cf9a@mail.gmail.com> On Wed, Jun 18, 2008 at 4:16 PM, Jared Flatow wrote: > However, py3k and 2.6 will make available the functionality described in PEP > 3101: > > http://www.python.org/dev/peps/pep-3101/ > > I think it would be best to define some semantics that are compatible with > this PEP. That is interesting - the PEP has been accepted, but I guess we should wait and see exactly what python 2.6 and 3.0 end up using before trying to integrate this into the SeqRecord. > In short, I think creating methods to return formatted versions of objects > (SeqRecords) is a good idea, but most especially if it is done in a way > consistent with the language's vision. That does sound wise - but I'm a little hazy on how exactly PEP-3101 will work in practice for generic complex objects. 
Peter

From bugzilla-daemon at portal.open-bio.org Fri Jun 20 15:01:17 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 20 Jun 2008 11:01:17 -0400
Subject: [Biopython-dev] [Bug 2526] New: SeqFeature's .id property is not preserved in BioSQL
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2526

           Summary: SeqFeature's .id property is not preserved in BioSQL
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: BioSQL
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk

As per the title, a SeqFeature's .id property is not preserved after a
save/retrieve in BioSQL.

I found this while working on Bug 2235, where my modified "swiss" parser
creates SeqRecord objects with SeqFeature objects which may have their .id
set. Note that in GenBank and EMBL, the SeqFeature objects do not have their
id property set, and so are not affected.

I need to review the BioSQL schema to see if there is a suitable field that
Biopython is ignoring, and if there is, use it. If not, we can probably use a
tagged qualifier - ideally with the same name as the other Bio* projects.

See also test_BioSQL_SeqIO.py revision 1.17 which includes a workaround to
avoid this limitation.

From jflatow at northwestern.edu Fri Jun 20 16:16:10 2008 From: jflatow at northwestern.edu (Jared Flatow) Date: Fri, 20 Jun 2008 11:16:10 -0500 Subject: [Biopython-dev] SeqRecord to file format as string In-Reply-To: <320fb6e00806200742w7e9e57dbt8d0d3362573cf9a@mail.gmail.com> References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk> <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu> <4714EB8E.3000700@maubp.freeserve.co.uk> <6243BAA9F5E0D24DA41B27997D1FD14402B63C@mail2.exch.c2b2.columbia.edu> <320fb6e00806180700k327e6913m7ba9c4bdc3421f67@mail.gmail.com> <4D53AB82-F673-4F4F-BCEC-BA06088E8721@northwestern.edu> <320fb6e00806200742w7e9e57dbt8d0d3362573cf9a@mail.gmail.com> Message-ID: <0FB6DD30-426C-43F3-BEBE-1728FA1E9D79@northwestern.edu> On Jun 20, 2008, at 9:42 AM, Peter wrote: > On Wed, Jun 18, 2008 at 4:16 PM, Jared Flatow > wrote: >> However, py3k and 2.6 will make available the functionality >> described in PEP >> 3101: >> >> http://www.python.org/dev/peps/pep-3101/ >> >> I think it would be best to define some semantics that are >> compatible with >> this PEP. > > That is interesting - the PEP has been accepted, but I guess we should > wait and see exactly what python 2.6 and 3.0 end up using before > trying to integrate this into the SeqRecord. I agree, there's a couple of things that may still change, but the betas for 2.6 and 3.0 are out and that PEP has been around a while so I would say it's pretty much stable. At least as far as how the general mechanism will work, I don't believe that is likely to change. >> In short, I think creating methods to return formatted versions of >> objects >> (SeqRecords) is a good idea, but most especially if it is done in a >> way >> consistent with the language's vision. > > That does sound wise - but I'm a little hazy on how exactly PEP-3101 > will work in practice for generic complex objects. 
Yes, I had to read it through a few times to understand how exactly it
will work. Here is what I know:

All objects now get the __format__ method, which has a signature like
this:

    def __format__(self, format_spec):
        # return a formatted string

The format_spec (format specifier) can be defined by the object, so
essentially it's totally customizable (if you want to do really crazy
things there is a Formatter that can be messed with, but we should and
can avoid this). This object method works like other customizable
python methods, and there's a corresponding builtin, so calling
format(obj, "the format specifier") will simply call
obj.__format__("the format specifier"). Thus we can define the
format_spec for a SeqRecord to differentiate between FASTA and
whatever other formats we want to define.

The string class is also getting a .format method which just calls the
.__format__ method in an OO way instead of using the builtin. We can
do the same thing, and it seems like most use cases will be to call
seq_rec.format('fasta'). All this works for all python versions,
except that you can only call it as format(seq_rec, 'fasta') in 2.6
or 3.0.

Besides the builtin format, we gain the ability to embed the format
within other strings. So, using the implementation you provided
earlier which just returns the underlying Seq as a string if no format
is specified, we might define the __format__ method like this:

    def __format__(self, format_spec=None):
        if format_spec:
            from StringIO import StringIO
            from Bio import SeqIO
            handle = StringIO()
            SeqIO.write([self], handle, format_spec)
            handle.seek(0)
            return handle.read()
        return str(self)

    def __str__(self):
        return str(self.seq)

Now that means I can also embed this in formatted strings, like so:

    "this is my sequence: {0}".format(seq_rec)

Or:

    "this is my sequence in fasta format: {0:fasta}".format(seq_rec)

All in all, it's pretty much what you'd expect (and the same as what
you had before).
There's only a few small benefits we get for doing it this way (right now), but I don't think we can go wrong using the __format__ method like it was meant to be used, and who knows what future use cases this may simplify. jared From bugzilla-daemon at portal.open-bio.org Sat Jun 21 04:19:59 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 21 Jun 2008 00:19:59 -0400 Subject: [Biopython-dev] [Bug 2375] Coalescent support through Simcoal2 In-Reply-To: Message-ID: <200806210419.m5L4JxfJ001994@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2375 mdehoon at ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |REOPENED Resolution|FIXED | ------- Comment #22 from mdehoon at ims.u-tokyo.ac.jp 2008-06-21 00:19 EST ------- (In reply to comment #15) > The solution in Bio/PopGen/SimCoal/__init__.py to find builtin_tpl_dir is not > so beautiful, but on the other hand I don't see a better way to do it. I ran into the same problem with Bio/Entrez, which needs a bunch of DTD files in Bio/Entrez/DTDs/. The attached patch to setup.py modifies the build command such that the data files are copied to the build directory when running "python setup.py build". This solves the problem with Bio.Entrez, and should also solve the problem with Bio/PopGen/SimCoal without using the workaround in Bio/PopGen/SimCoal/__init__.py. Can you guys try this patch on the platforms and python versions you have access to? Just to make sure I didn't miss anything before committing to CVS. Recently there have been quite a lot of updates to CVS, so you may need to start from a fresh CVS checkout. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Sat Jun 21 04:21:13 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 21 Jun 2008 00:21:13 -0400
Subject: [Biopython-dev] [Bug 2375] Coalescent support through Simcoal2
In-Reply-To:
Message-ID: <200806210421.m5L4LDPg002064@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2375

------- Comment #23 from mdehoon at ims.u-tokyo.ac.jp 2008-06-21 00:21 EST -------
Created an attachment (id=950)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=950&action=view)
Patch to setup.py

From mjldehoon at yahoo.com Sat Jun 21 05:11:18 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Fri, 20 Jun 2008 22:11:18 -0700 (PDT)
Subject: [Biopython-dev] Bio.SCOP
Message-ID: <251322.99482.qm@web62401.mail.re1.yahoo.com>

Bio.SCOP is one of the modules affected by Bug 2454
(http://bugzilla.open-bio.org/show_bug.cgi?id=2454), which is basically about
how Biopython uses file handles. Bio.SCOP contains parsers for several file
formats used by SCOP. I am using Bio.SCOP.Hie as an example here, but the same
applies to the other parsers.

The Bio.SCOP parsers define a Parser and an Iterator class (similar to other
older Biopython parsers). Typical usage is as follows:

>>> from Bio.SCOP import Hie
>>> handle = open("mydatafile.txt")
>>> parser = Hie.Parser()
>>> records = Hie.Iterator(handle, parser)
>>> for record in records:
...     # record is an instance of Bio.SCOP.Hie.Record

Now, in the SCOP file format, each record is on one line in the data file. So
we don't need the Iterator:

>>> from Bio.SCOP import Hie
>>> handle = open("mydatafile.txt")
>>> parser = Hie.Parser()
>>> for line in handle:
...     record = parser.parse(line)
...     # record is an instance of Bio.SCOP.Hie.Record

This solves Bug #2454 (which occurs in the Iterator class), and is more
general than the Iterator class (e.g., now we can parse a list of lines).

To take this one step further, the Parser class is not really needed either.
Although Parser is a class, we are not using the functionality of a class (no
inheritance, and the object self is never used). In essence, the parse()
function inside the Parser class may as well live outside of it.

There are several ways to simplify this module; each of them essentially
amounts to moving the parse() function:

1) Move the parse() function to the Record class initializer:

>>> from Bio.SCOP import Hie
>>> handle = open("mydatafile.txt")
>>> for line in handle:
...     record = Hie.Record(line)
...     # record is an instance of Bio.SCOP.Hie.Record

2) Move the parse() function outside of the Parser class, and rename it read()
for consistency with other Biopython parsers:

>>> from Bio.SCOP import Hie
>>> handle = open("mydatafile.txt")
>>> while True:
...     record = Hie.read(handle)
...     if not record: break
...     # record is an instance of Bio.SCOP.Hie.Record

3) Move the parse() function outside of the Parser class, and use it as a
generator function:

>>> from Bio.SCOP import Hie
>>> handle = open("mydatafile.txt")
>>> records = Hie.parse(handle)
>>> for record in records:
...     # record is an instance of Bio.SCOP.Hie.Record

Comments, suggestions, preferences?

--Michiel.

From bugzilla-daemon at portal.open-bio.org Sat Jun 21 11:31:14 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 21 Jun 2008 07:31:14 -0400
Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects
In-Reply-To:
Message-ID: <200806211131.m5LBVEWb019981@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2454

------- Comment #17 from mdehoon at ims.u-tokyo.ac.jp 2008-06-21 07:31 EST -------
I added a DeprecationWarning to Bio.Rebase.
Next on the to-do list is Bio.SCOP.

From mjldehoon at yahoo.com Sat Jun 21 11:36:43 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sat, 21 Jun 2008 04:36:43 -0700 (PDT)
Subject: [Biopython-dev] [BioPython] Bio.CDD, anyone?
In-Reply-To: <485A70B0.1010202@gmail.com>
Message-ID: <195444.96577.qm@web62403.mail.re1.yahoo.com>

As far as I can tell, the test files were created by saving the HTML source
code from the CDD web site to a file. As the CDD web site has changed its HTML
in the meantime, we cannot reproduce the HTML files used by the Bio.CDD tests.

Unless somebody objects in the next couple of days, I'll add a
DeprecationWarning to Bio.CDD.

--Michiel.

Bruce Southey wrote:
Hi,
Do you know how the test files were created?
If there is not an easy answer then it makes the decision easier.

Anyhow, I vote to remove this module as, in addition to the things
previously mentioned, it would be far better to support interproscan
(http://www.ebi.ac.uk/Tools/InterProScan/ ) than just a single tool.
Bruce
_______________________________________________
BioPython mailing list - BioPython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython

From bugzilla-daemon at portal.open-bio.org Sun Jun 22 04:51:58 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 22 Jun 2008 00:51:58 -0400
Subject: [Biopython-dev] [Bug 2527] New: Bug in NCBIXML.py in _end_BlastOutput_version()
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2527

           Summary: Bug in NCBIXML.py in _end_BlastOutput_version()
           Product: Biopython
           Version: 1.45
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: cdputnam at ucsd.edu

biopython version is from Fedora distribution: python-biopython-1.45-1.fc7

For a recently run NCBIWWW Blast (following the tutorial at
http://biopython.org/DIST/docs/tutorial/Tutorial.html), I ran into a problem
in parsing by _end_BlastOutput_version with the version information:

BLASTP 2.2.18+

Traceback (most recent call last):
  File "blast2.py", line 7, in <module>
    for blast_record in blast_records:
  File "/usr/lib/python2.5/site-packages/Bio/Blast/NCBIXML.py", line 577, in parse
    expat_parser.Parse(text, False)
  File "/usr/lib/python2.5/site-packages/Bio/Blast/NCBIXML.py", line 98, in endElement
    eval("self.%s()" % method)
  File "<string>", line 1, in <module>
  File "/usr/lib/python2.5/site-packages/Bio/Blast/NCBIXML.py", line 216, in _end_BlastOutput_version
    self._header.date = self._value.split()[2][1:-1]
IndexError: list index out of range

I've worked around this bug for now by commenting out the offending line and
setting the date to an empty string:

    def _end_BlastOutput_version(self):
        """version number of the BLAST engine (e.g., 2.1.2)

        Save this to put on each blast record object
        """
        self._header.version = self._value.split()[1]
        # self._header.date = self._value.split()[2][1:-1]
        self._header.date = ''

-- Configure bugmail:
http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sun Jun 22 04:52:45 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 22 Jun 2008 00:52:45 -0400 Subject: [Biopython-dev] [Bug 2527] Bug in NCBIXML.py in _end_BlastOutput_version() In-Reply-To: Message-ID: <200806220452.m5M4qjiE029058@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2527 cdputnam at ucsd.edu changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |cdputnam at ucsd.edu -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sun Jun 22 05:52:05 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 22 Jun 2008 01:52:05 -0400 Subject: [Biopython-dev] [Bug 2527] Bug in NCBIXML.py in _end_BlastOutput_version() In-Reply-To: Message-ID: <200806220552.m5M5q5rQ031580@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2527 ------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp 2008-06-22 01:52 EST ------- I believe that this is already fixed in CVS. Could you try the latest version of Bio/Blast/NCBIXML.py, available at http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/?cvsroot=biopython and let us know if it fixes the bug? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
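For readers following Bug 2527, the IndexError comes from assuming the version string always has a third, bracketed date token. A defensive version of that split could be sketched as below. This is only an illustration, not the change that went into CVS; the function name is hypothetical, and the dated example string is an assumed form inferred from the `[1:-1]` slice in the original code.

```python
def split_blast_version(value):
    """Split a BlastOutput_version string into (version, date).

    The date is returned as an empty string when the third token is
    missing, as in "BLASTP 2.2.18+".
    """
    parts = value.split()
    version = parts[1] if len(parts) > 1 else ""
    # Strip the surrounding brackets from the date token, if present.
    date = parts[2][1:-1] if len(parts) > 2 else ""
    return version, date
```

With this helper, a string without a date yields an empty date rather than raising, which matches the reporter's workaround of setting the date to ''.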
From bugzilla-daemon at portal.open-bio.org Mon Jun 23 10:54:22 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 23 Jun 2008 06:54:22 -0400 Subject: [Biopython-dev] [Bug 2528] New: NCBIStandalone.blastall(): Replace os.popen3 with subprocess.Popen Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2528 Summary: NCBIStandalone.blastall(): Replace os.popen3 with subprocess.Popen Product: Biopython Version: 1.45 Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: mmokrejs at ribosome.natur.cuni.cz I have already mentioned this on the email list few weeks ago ... NCBI Blast 2.2.18 (but was a case of also previous version as far as I remember) does not flush output buffers when run from under mod_python-3.3.11/apache-2.2.8. I tried to flush the buffers or disable buffering but it does not help. In the end, a working solution is to move the using subprocess module introduced in python 2.4 and which deprecates os.system, os.exec, os.popen* and other functions. The following patch works for me, so the user receives back into his/her web browser the blast stdout. Somehow, one has to copy the data into another variable and close the file descriptors used by blastall binary. Unfortunately, still a stale process can be seen in "ps -ef" output: apache 5382 5323 47 12:31 ? 00:00:04 [blastall] But as I have said, at least the data is not buffered anymore. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon Jun 23 10:55:26 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 23 Jun 2008 06:55:26 -0400 Subject: [Biopython-dev] [Bug 2528] NCBIStandalone.blastall(): Replace os.popen3 with subprocess.Popen In-Reply-To: Message-ID: <200806231055.m5NAtQCC030683@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2528 ------- Comment #1 from mmokrejs at ribosome.natur.cuni.cz 2008-06-23 06:55 EST ------- Created an attachment (id=951) --> (http://bugzilla.open-bio.org/attachment.cgi?id=951&action=view) NCBIStandalone.py.patch -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 23 10:56:00 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 23 Jun 2008 06:56:00 -0400 Subject: [Biopython-dev] [Bug 2508] NCBIStandalone.blastall: provide support for '-F F' and make it safe In-Reply-To: Message-ID: <200806231056.m5NAu0or030728@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2508 ------- Comment #5 from mmokrejs at ribosome.natur.cuni.cz 2008-06-23 06:56 EST ------- (In reply to comment #4) Yes, the "filter" argument is not clear, please improve the docs in the sources and on the web. At the best I would in addition propose renaming the argument. Regarding the patch in comment #3, I think it should be more strict and blast* functions should only accept explicitly listed arguments in the function definition, so no kwargs, etc. But it is a good startup. In general, I would propose to provide a general wrapper function to be placed in front of _ALL_ popen3() calls. And, conjuction, replace the popen3 calls with subprocess.Popen. See Bug #2528 on the NCBIStandalone.blastall() where is a working example of this. 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 23 15:01:17 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 23 Jun 2008 11:01:17 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806231501.m5NF1Hth014356@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #18 from mdehoon at ims.u-tokyo.ac.jp 2008-06-23 11:01 EST ------- See the discussion on the mailing list: http://lists.open-bio.org/pipermail/biopython-dev/2008-June/003819.html for some ideas for Bio.SCOP. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 23 15:16:29 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 23 Jun 2008 11:16:29 -0400 Subject: [Biopython-dev] [Bug 2527] Bug in NCBIXML.py in _end_BlastOutput_version() In-Reply-To: Message-ID: <200806231516.m5NFGTgD015331@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2527 cdputnam at ucsd.edu changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from cdputnam at ucsd.edu 2008-06-23 11:16 EST ------- The latest NCBIXML.py does fix the problem with Blast version parsing. Just so you know, I had to comment out two lines in _end_Hsp_bit_score, similar to the version of the file I already had. I'm guessing this is a version mismatch with some other file that I didn't update (I only replaced NCBIXML.py). 
The error was:

AttributeError: Description instance has no attribute 'bits'

And the commented version of the function is:

    def _end_Hsp_bit_score(self):
        """bit score of HSP
        """
        self._hsp.bits = float(self._value)
        #if self._descr.bits == None:
        #    self._descr.bits = float(self._value)

Thanks for your help.

From bugzilla-daemon at portal.open-bio.org Tue Jun 24 09:38:54 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 24 Jun 2008 05:38:54 -0400
Subject: [Biopython-dev] [Bug 2528] NCBIStandalone.blastall(): Replace os.popen3 with subprocess.Popen
In-Reply-To:
Message-ID: <200806240938.m5O9csKZ032756@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2528

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-24 05:38 EST -------
With this patch we have to wait for the sub-process to finish before we can
read its output. This is a potential drawback as it delays the parsing.
Currently we should be able to parse this iteratively as the queries are
processed.

Also, you are loading the entire output into memory (as a list of strings,
which you then turn into a StringIO handle). This is potentially a very bad
idea, as in extreme cases Blast XML files can be GB in size.

I'm not keen on your solution, but I don't know what to suggest for your
original problem, running Blast under mod_python-3.3.11/apache-2.2.8.

Two minor points: Do you think we can do anything better on Python 2.3? Did
you intend something similar for blastpgp and rpsblast?

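The concern in comment #2 on Bug 2528 is that the output is read in full before parsing starts. For comparison, subprocess.Popen also exposes the child's stdout as a file-like handle that can be consumed lazily. The sketch below shows that pattern under stated assumptions: the helper name is hypothetical, the demo command is a stand-in for blastall, and nothing here is claimed to solve the mod_python buffering problem from the original report.

```python
import subprocess
import sys


def stream_stdout(command):
    """Yield the child's stdout line by line as it is produced.

    Only stdout is piped; stderr is inherited, so an unread stderr
    pipe cannot fill up and block the child.
    """
    process = subprocess.Popen(command, stdout=subprocess.PIPE,
                               universal_newlines=True)
    try:
        for line in process.stdout:
            yield line
    finally:
        process.stdout.close()
        # Reap the child so it does not linger as a defunct process.
        process.wait()


if __name__ == "__main__":
    # Stand-in command; a real caller would pass the blastall argument list.
    demo = [sys.executable, "-c", "print('first'); print('second')"]
    for output_line in stream_stdout(demo):
        sys.stdout.write(output_line)
```

Because the lines are yielded one at a time, a parser can start on the first query while the child is still running, and the whole output never has to sit in memory at once.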
From biopython at maubp.freeserve.co.uk Tue Jun 24 09:46:19 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 24 Jun 2008 10:46:19 +0100 Subject: [Biopython-dev] Bio.SCOP In-Reply-To: <251322.99482.qm@web62401.mail.re1.yahoo.com> References: <251322.99482.qm@web62401.mail.re1.yahoo.com> Message-ID: <320fb6e00806240246u8afdb6fp51cd31000ebe3d9@mail.gmail.com> On Sat, Jun 21, 2008 at 6:11 AM, Michiel de Hoon wrote: > Bio.SCOP contains parsers for several file > formats used by SCOP. I am using Bio.SCOP.Hie > as an example here, but the same applies to > the other parsers. > > The Bio.SCOP parsers define a Parser and a Iterator > class (similar to other older Biopython parsers). I would deprecate the Parser and Iterator objects, and introduce a parse(handle) function to iterate over a file (following our recent convention) and a perhaps a read() function too (taking a handle or a single line?), Peter From bugzilla-daemon at portal.open-bio.org Tue Jun 24 10:17:41 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 24 Jun 2008 06:17:41 -0400 Subject: [Biopython-dev] [Bug 2528] NCBIStandalone.blastall(): Replace os.popen3 with subprocess.Popen In-Reply-To: Message-ID: <200806241017.m5OAHfdK002192@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2528 ------- Comment #3 from mmokrejs at ribosome.natur.cuni.cz 2008-06-24 06:17 EST ------- Hi Peter, well I am not much happy with this either, and I do understand your points. I will try to come up with another solution. Would be best to disable buffering in popen3() but I failed to get it working. Will give it some more thought next week. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
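The parse(handle) generator that Peter suggests for Bio.SCOP (option 3 in Michiel's proposal) could be sketched as below. The Record class here is a hypothetical stand-in that only splits a tab-separated line; it is not the real Bio.SCOP.Hie.Record, and skipping lines that start with "#" is an assumption about comment lines in the data files.

```python
class Record(object):
    """Hypothetical one-line record; stores the tab-separated fields."""

    def __init__(self, line):
        self.fields = line.rstrip("\n").split("\t")


def parse(handle):
    """Yield one Record per data line.

    The handle only needs to be iterable over lines, so an open file,
    a StringIO object, or a plain list of strings all work.
    """
    for line in handle:
        if line.startswith("#"):
            # Assumed comment/header line; skipped.
            continue
        yield Record(line)
```

Since the function relies only on iteration, it sidesteps the file-handle assumptions behind Bug 2454.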
From bugzilla-daemon at portal.open-bio.org Tue Jun 24 10:35:50 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 24 Jun 2008 06:35:50 -0400 Subject: [Biopython-dev] [Bug 2527] Bug in NCBIXML.py in _end_BlastOutput_version() In-Reply-To: Message-ID: <200806241035.m5OAZo3p003784@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2527 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-24 06:35 EST ------- Regarding comment 2, I think you need to update Bio/Blast/Record.py as well. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jun 24 10:36:18 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 24 Jun 2008 06:36:18 -0400 Subject: [Biopython-dev] [Bug 2528] NCBIStandalone.blastall(): Replace os.popen3 with subprocess.Popen In-Reply-To: Message-ID: <200806241036.m5OAaIIt003857@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2528 ------- Comment #4 from mdehoon at ims.u-tokyo.ac.jp 2008-06-24 06:36 EST ------- Is there an easy way to replicate this issue? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Jun 24 11:30:45 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 24 Jun 2008 07:30:45 -0400
Subject: [Biopython-dev] [Bug 2527] Bug in NCBIXML.py in _end_BlastOutput_version()
In-Reply-To:
Message-ID: <200806241130.m5OBUjYU007159@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2527

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |biopython-bugzilla at maubp.freeserve.co.uk

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-24 07:30 EST -------
P.S. This is a duplicate of Bug 2499

From bugzilla-daemon at portal.open-bio.org Tue Jun 24 13:05:46 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 24 Jun 2008 09:05:46 -0400
Subject: [Biopython-dev] [Bug 2508] NCBIStandalone.blastall: provide support for '-F F' and make it safe
In-Reply-To:
Message-ID: <200806241305.m5OD5jZa012413@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2508

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-24 09:05 EST -------
Checking in Tests/test_NCBIStandalone.py
new revision: 1.14
Checking in Bio/Blast/NCBIStandalone.py
new revision: 1.73

I've checked in my suggested patch, and tried to improve the filter documentation by including the phrase "low complexity".
It might be worth passing this suggestion on to the NCBI as their own command line tools just use the term filter. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mjldehoon at yahoo.com Wed Jun 25 14:04:09 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 25 Jun 2008 07:04:09 -0700 (PDT) Subject: [Biopython-dev] Bio.SCOP.FileIndex Message-ID: <141582.2274.qm@web62413.mail.re1.yahoo.com> Hi everybody, When I was modifying Bio.SCOP, I noticed that Bio.SCOP.FileIndex is flawed if file reading is done via a buffer (which is often the case in Python). Before we try to fix this, is anybody actually using Bio.SCOP.FileIndex? If not, I think we should deprecate it instead of trying to fix it. --Michiel. From bugzilla-daemon at portal.open-bio.org Wed Jun 25 15:55:58 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 25 Jun 2008 11:55:58 -0400 Subject: [Biopython-dev] [Bug 2529] New: NCBI BLAST XML parser does not support the online blast version 2.2.18+ Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2529 Summary: NCBI BLAST XML parser does not support the online blast version 2.2.18+ Product: Biopython Version: 1.45 Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P1 Component: Other AssignedTo: biopython-dev at biopython.org ReportedBy: lordnapi at gmail.com QAContact: lordnapi at gmail.com Hello, I have performed a blast search of PDB database. I am having a problem while parsing the blast result on both Windows and Linux machines. The following four lines of code provides me the same error. Thanks. 
Ahmet

>>> from Bio.Blast import NCBIWWW
>>> from Bio.Blast import NCBIXML
>>> results_handle = NCBIWWW.qblast( 'blastp', 'pdb', 'ASFPVEILPFLYLGCAKDSTNLDVLEEFGIKYILNVTPNLPNLFENAGEFKYKQIPISDHWSQNLSQ')
>>> blast_record = NCBIXML.parse( results_handle ).next()

From bugzilla-daemon at portal.open-bio.org Wed Jun 25 16:09:24 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 25 Jun 2008 12:09:24 -0400
Subject: [Biopython-dev] [Bug 2528] NCBIStandalone.blastall(): Replace os.popen3 with subprocess.Popen
In-Reply-To:
Message-ID: <200806251609.m5PG9OWX002384@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2528

------- Comment #5 from mmokrejs at ribosome.natur.cuni.cz 2008-06-25 12:09 EST -------
(In reply to comment #4)
> Is there an easy way to replicate this issue?
>

I believe it is enough to run a blast search under mod_python and try to display the results on the web; that's all I actually do. On the server the blastall process did not flush its cache, so if you attach to the running process with the strace utility you will see that it has done a write() of some line that is not yet the last one of the output. The process hangs like this for ages, until you do "kill -HUP $pid"; then it flushes the write buffer and exits successfully. This happens with blast 2.2.18 at least.
From bugzilla-daemon at portal.open-bio.org Wed Jun 25 16:24:45 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 25 Jun 2008 12:24:45 -0400
Subject: [Biopython-dev] [Bug 2529] NCBI BLAST XML parser does not support the online blast version 2.2.18+
In-Reply-To:
Message-ID: <200806251624.m5PGOjgf003205@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2529

lordnapi at gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |WORKSFORME

------- Comment #1 from lordnapi at gmail.com 2008-06-25 12:24 EST -------
The problem was caused by the XML files from BLASTP 2.2.18+ not having a date. I fixed the problem for myself by changing the _end_BlastOutput_version function in the Blast/NCBIXML.py file to the following (starts at line 208). I still don't know if having the date is important elsewhere.

    def _end_BlastOutput_version(self):
        """version number of the BLAST engine (e.g., 2.1.2)

        Save this to put on each blast record object
        """
        self._valuesplit = self._value.split()
        self._header.version = self._valuesplit[1]
        if len(self._valuesplit) > 2:
            self._header.date = self._value.split()[2][1:-1]
        else:
            self._header.date = ''

From mjldehoon at yahoo.com Thu Jun 26 00:01:07 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Wed, 25 Jun 2008 17:01:07 -0700 (PDT)
Subject: [Biopython-dev] NCBI Abuse activity with Biopython
Message-ID: <254082.68438.qm@web62401.mail.re1.yahoo.com>

Dear all,

Recently NCBI blocked access for a Biopython user who was making 50,000 requests to NCBI at a rate of 18 requests per second during peak hours. This user was using the search_for function in Bio.GenBank, which internally uses Bio.EUtils.
Apparently, Bio.EUtils does not follow the 3 seconds sleep rule between requests. NCBI also asked us to send requests for the Entrez E-Utilities to the EUtils web address, and not to the regular NCBI web address. I don't know if Bio.EUtils does that.

Bio.Entrez does use the 3 seconds sleep rule, and the eight E-Utilities functions all make use of the EUtils web address, though it is possible to pass a different web address as one of the arguments. The "query" function, which is not part of the E-Utilities, does use the standard NCBI web address.

To avoid such problems in the future, I'd like to propose the following:

1) Deprecate Bio.EUtils. Its functionality is covered by Bio.Entrez, which (from release 1.46) will have a parser. Bio.EUtils is currently used by the following modules:
Bio/config/DBRegistry.py
Bio/dbdefs/fasta.py
Bio/dbdefs/genbank.py
Bio/dbdefs/medline.py
Bio/GenBank/__init__.py
We were already planning to remove Bio.config and Bio.dbdefs, so we'd only have to modify Bio.GenBank.

2) Remove the 'query' function from Bio.Entrez. Anyway, accessing NCBI's web site from Python to get HTML back doesn't make a lot of sense.

3) Remove the argument for a user-specified web address to make sure that the E-Utilities address is always used.

--Michiel.

From dalke at dalkescientific.com Thu Jun 26 01:52:07 2008
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 26 Jun 2008 03:52:07 +0200
Subject: [Biopython-dev] NCBI Abuse activity with Biopython
In-Reply-To: <254082.68438.qm@web62401.mail.re1.yahoo.com>
References: <254082.68438.qm@web62401.mail.re1.yahoo.com>
Message-ID: <635E5251-830F-409C-A2D4-10EA59FA5037@dalkescientific.com>

On Jun 26, 2008, at 2:01 AM, Michiel de Hoon wrote:
> Bio.Entrez does use the 3 seconds sleep rule, and the eight E-Utilities
> functions all make use of the EUtils web address, though it is possible
> to pass a different web address as one of the arguments.
> The "query" function, which is not part of the E-Utilities, does use the
> standard NCBI web address.

What is the proper EUtils web address?

Entrez/__init__.py uses
  cgi='http://www.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi'
while the documentation at
  http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
claims "Send E-utilities requests to http://eutils.ncbi.nlm.nih.gov", which I think should be
  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi"

> To avoid such problems in the future, I'd like to propose the
> following:
> 1) Deprecate Bio.EUtils. Its functionality is covered by
> Bio.Entrez, which (from release 1.46) will have a parser.

I looked over Bio.Entrez and it handles only a subset of what Bio.EUtils does. For example, it doesn't have any support to help track WebEnv as it changes over each request, nor support for alternate format types.

I would deprecate Bio.EUtils for another reason - there's no maintainer.

> 2) Remove the 'query' function from Bio.Entrez. Anyway accessing
> NCBI's web site from Python to get HTML back doesn't make a lot of
> sense.

Okay, now I'm quite confused. This is functionality that Bio.EUtils supports.

>>> from Bio.EUtils import HistoryClient
>>> client = HistoryClient.HistoryClient()
>>> result = client.search("Michiel de Hoon[AU]")
>>> print result.efetch("text", "docsum").read()

1: de Hoon M, Hayashizaki Y.
Deep cap analysis gene expression (CAGE): genome-wide identification of promoters, quantification of their expression, and network inference.
Biotechniques. 2008 Apr;44(5):627-8, 630, 632. Review.
PMID: 18474037 [PubMed - indexed for MEDLINE]

2: Sierro N, Makita Y, de Hoon M, Nakai K.
DBTBS: a database of transcriptional regulation in Bacillus subtilis containing upstream intergenic conservation information.
Nucleic Acids Res. 2008 Jan;36(Database issue):D93-6. Epub 2007 Oct 25.
PMID: 17962296 [PubMed - indexed for MEDLINE]

3: Makita Y, de Hoon MJ, Danchin A.
Hon-yaku: a biology-driven Bayesian methodology for identifying translation initiation sites in prokaryotes. BMC Bioinformatics. 2007 Feb 8;8:47. PMID: 17286872 [PubMed - indexed for MEDLINE] 4: de Hoon MJ, Makita Y, Nakai K, Miyano S. Prediction of transcriptional terminators in Bacillus subtilis and related species. PLoS Comput Biol. 2005 Aug;1(3):e25. Epub 2005 Aug 12. PMID: 16110342 [PubMed - indexed for MEDLINE] 5: de Hoon MJ, Imoto S, Kobayashi K, Ogasawara N, Miyano S. Inferring gene regulatory networks from time-ordered gene expression data of Bacillus subtilis using differential equations. Pac Symp Biocomput. 2003;:17-28. PMID: 12603014 [PubMed - indexed for MEDLINE] (The default returns this in XML format.) >>> print result.efetch().read(500) 18474037 2008 05 13 2008 06 3) Remove the argument for a user-specified web address to make > sure that always the E-Utilities address is used. Yes. Andrew dalke at dalkescientific.com From bugzilla-daemon at portal.open-bio.org Thu Jun 26 09:20:55 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 26 Jun 2008 05:20:55 -0400 Subject: [Biopython-dev] [Bug 2529] NCBI BLAST XML parser does not support the online blast version 2.2.18+ In-Reply-To: Message-ID: <200806260920.m5Q9Ktlt019555@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2529 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |REOPENED Resolution|WORKSFORME | ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-26 05:20 EST ------- This is a duplicate of Bug 2499, reopening in order to mark this. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Jun 26 09:21:38 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 26 Jun 2008 05:21:38 -0400 Subject: [Biopython-dev] [Bug 2529] NCBI BLAST XML parser does not support the online blast version 2.2.18+ In-Reply-To: Message-ID: <200806260921.m5Q9Lcp6019606@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2529 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|REOPENED |RESOLVED Resolution| |DUPLICATE ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-26 05:21 EST ------- The fix for the 2.2.18+ XML output is already in CVS, see Bug 2499 *** This bug has been marked as a duplicate of bug 2499 *** -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jun 26 09:21:40 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 26 Jun 2008 05:21:40 -0400 Subject: [Biopython-dev] [Bug 2499] Bio.Blast.NCBIXML cannot handle XML without date in BlastOutput_version In-Reply-To: Message-ID: <200806260921.m5Q9Lebn019619@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2499 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |lordnapi at gmail.com ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-26 05:21 EST ------- *** Bug 2529 has been marked as a duplicate of this bug. *** -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
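The date-less version string behind Bug 2499 and Bug 2529 can be illustrated in isolation. The sketch below mirrors the split-and-check logic of the workaround posted in Bug 2529; `split_blast_version` is a hypothetical standalone function, not the code that was checked into CVS, and the two sample strings are assumed examples of the older (dated) and newer (undated) forms.

```python
def split_blast_version(value):
    """Split a BlastOutput_version string into (version, date).

    The date is returned as an empty string when the string has no
    bracketed date field, as reported for BLAST 2.2.18+ output.
    """
    parts = value.split()
    version = parts[1]
    # Older output carries a third field such as "[Oct-15-2006]";
    # strip the surrounding brackets when it is present.
    date = parts[2][1:-1] if len(parts) > 2 else ""
    return version, date


# Assumed sample strings, for illustration only.
old_style = split_blast_version("BLASTP 2.2.15 [Oct-15-2006]")
new_style = split_blast_version("BLASTP 2.2.18+")
```

Returning an empty date keeps downstream code that reads the date attribute working, which is the same choice the posted workaround made.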
From biopython at maubp.freeserve.co.uk Thu Jun 26 10:25:38 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 26 Jun 2008 11:25:38 +0100
Subject: [Biopython-dev] [BioPython] Fwd: NCBI Abuse Activity with BioPython
In-Reply-To:
References: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org>
Message-ID: <320fb6e00806260325m3b92ff8n143141c73a1a60dd@mail.gmail.com>

Andrew wrote:
>
> I thought I put a rate limiter into the code, but looking at it now I see I
> didn't. The documentation clearly states that users must follow NCBI's
> recommendations, but who actually reads documentation?
>
>>> * Send E-utilities requests to http://eutils.ncbi.nlm.nih.gov
>>> , not the standard NCBI Web address.
>
> That change was announced on May 21, 2003, and most likely no one on the
> Biopython dev group tracks the EUtils mailing list. It was also after I
> wrote the code, but to be fair I was subscribed to the utilities list at the
> time and should have caught the change.
>
> I think the correct fix is to this code in ThinClient.py:
>
> def __init__(self,
>              opener = None,
>              tool = TOOL,
>              email = EMAIL,
>              baseurl = "http://www.ncbi.nlm.nih.gov/entrez/eutils/"):
>
> Change the baseurl to "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/". I
> have not tested this.

I've tested that fix, and it seems to be OK with test_EUtils.py and test_SeqIO_online.py, which calls Bio.EUtils via Bio.GenBank. It is checked in as Bio/EUtils/ThinClient.py revision 1.6.

I'll have a look at your other specific suggestions too. Thanks for taking the time to go over this Andrew.
Peter From p.j.a.cock at googlemail.com Thu Jun 26 10:47:05 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 26 Jun 2008 11:47:05 +0100 Subject: [Biopython-dev] NCBI Abuse activity with Biopython In-Reply-To: <635E5251-830F-409C-A2D4-10EA59FA5037@dalkescientific.com> References: <254082.68438.qm@web62401.mail.re1.yahoo.com> <635E5251-830F-409C-A2D4-10EA59FA5037@dalkescientific.com> Message-ID: <320fb6e00806260347i7655ba6eg490f5003a273a37d@mail.gmail.com> On Thu, Jun 26, 2008 at 2:52 AM, Andrew Dalke wrote: > On Jun 26, 2008, at 2:01 AM, Michiel de Hoon wrote: >> >> Bio.Entrez does use the 3 seconds sleep rule, and the eight E-Utilities >> functions all make use of the EUtils web address, though it is possible to >> pass a different web address as one of the arguments. The "query" function, >> which is not part of the E-Utilities, does use the standard NCBI web >> address. > > What is the proper EUtils web address? > > Entrez/__init__.py uses > cgi='http://www.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi' > while the documentation at > http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html > claims "Send E-utilities requests to http://eutils.ncbi.nlm.nih.gov", > which I think should be > "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi" Yes, for ePost that is correct: http://eutils.ncbi.nlm.nih.gov/entrez/query/static/epost_help.html [On a related note, following Andrew's suggestion, I have updated CVS to use the new base URL in Bio/EUtils/ThinClient.py] >> To avoid such problems in the future, I'd like to propose the following: >> 1) Deprecate Bio.EUtils. Its functionality is covered by Bio.Entrez, which >> (from release 1.46) will have a parser. > > I looked over Bio.Entrez and it handles only a subset of what Bio.EUtils > does. For example, it doesn't have any support to help track WebEnv as it > changes over each request, nor support for alternate format types. No, Bio.Entrez does not support the WebEnv / history interface. 
It can request data in different format types though, although it will only parse the XML output. > I would deprecate Bio.EUtils for another reason - there's no maintainer. This is a strong reason - although we are still using Bio.EUtils in Bio.GenBank (and probably in other places too). >> 2) Remove the 'query' function from Bio.Entrez. Anyway accessing NCBI's >> web site from Python to get HTML back doesn't make a lot of sense. > > Okay, now I'm quite confused. This is functionality that Bio.EUtils > supports. I think Michiel meant getting a handle containing raw HTML isn't very sensible, and this is what the Bio.Entrez.query() function does. If it can only return HTML, then I agree, its not very useful and could be removed. >> 3) Remove the argument for a user-specified web address to make sure that >> always the E-Utilities address is used. > > Yes. > Unlike BLAST where you may have a local webserver, is there any reason for to use a URL other than the NCBI's one? Peter From dalke at dalkescientific.com Thu Jun 26 11:03:19 2008 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 26 Jun 2008 13:03:19 +0200 Subject: [Biopython-dev] NCBI Abuse activity with Biopython In-Reply-To: <320fb6e00806260347i7655ba6eg490f5003a273a37d@mail.gmail.com> References: <254082.68438.qm@web62401.mail.re1.yahoo.com> <635E5251-830F-409C-A2D4-10EA59FA5037@dalkescientific.com> <320fb6e00806260347i7655ba6eg490f5003a273a37d@mail.gmail.com> Message-ID: <52BDC1F6-52F8-4A42-B738-DFBB119F9C27@dalkescientific.com> On Jun 26, 2008, at 12:47 PM, Peter Cock wrote: > I think Michiel meant getting a handle containing raw HTML isn't very > sensible, and this is what the Bio.Entrez.query() function does. I meant to point out that supporting the search interface, with machine parseable, is functionality in Bio.EUtils that isn't in Bio.Entrez. > Unlike BLAST where you may have a local webserver, is there any reason > for to use a URL other than the NCBI's one? I can't think of any. 
(I can make up one - setting up a local mock server for tests. But that's not seriously going to happen.)

Andrew
dalke at dalkescientific.com

From biopython at maubp.freeserve.co.uk Thu Jun 26 11:40:54 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 26 Jun 2008 12:40:54 +0100
Subject: [Biopython-dev] [BioPython] Fwd: NCBI Abuse Activity with BioPython
In-Reply-To: <5CD393BF-D4FB-4700-B7CC-2417C9845010@dalkescientific.com>
References: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org> <320fb6e00806260421g48e5807ei92297b372c330e5b@mail.gmail.com> <5CD393BF-D4FB-4700-B7CC-2417C9845010@dalkescientific.com>
Message-ID: <320fb6e00806260440n4a933b60of5a7c8eee4e15a89@mail.gmail.com>

On Thu, Jun 26, 2008 at 12:26 PM, Andrew Dalke wrote:
> On Jun 26, 2008, at 1:21 PM, Peter wrote:
>>
>> Looking over the code, should this wait also be done for the
>> ThinClient's epost() method as well?
>
> Where? It gets the URL from an instance variable, which is set in the
> constructor.

The ThinClient class is defined in Bio/EUtils/ThinClient.py, and I have added a 3 second wait to its _get() method. I think we should also add the three second wait to the epost() method. Both methods construct their URL using self.baseurl, so they are both going to hit the same server.

Note that for the implementation, I would probably define a new _wait() method to check the time since the last call, and call this _wait() method from both _get() and epost().

>> This complexity is also daunting for anyone else considering taking
>> over the Bio.EUtils code base.
>
> My incomplete rewrite uses elementtree which does reduce some of the
> complexity. But the NCBI interface is a mess.

I can see why Michiel has kept things simple in Bio.Entrez - this should cater to most users' needs.
Peter From mjldehoon at yahoo.com Thu Jun 26 11:45:45 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 26 Jun 2008 04:45:45 -0700 (PDT) Subject: [Biopython-dev] NCBI Abuse activity with Biopython In-Reply-To: <320fb6e00806260347i7655ba6eg490f5003a273a37d@mail.gmail.com> Message-ID: <402220.93857.qm@web62411.mail.re1.yahoo.com> > > I would deprecate Bio.EUtils for another reason - there's no maintainer. This is what I meant. I am sure that we can fix Bio.EUtils for now, but I don't see how we can maintain it in the future. That is why originally we decided to focus on Bio.WWW.NCBI (renamed to Bio.Entrez) instead. > - although we are still using Bio.EUtils in Bio.GenBank > (and probably in other places too). As far as I can tell, Bio.GenBank is currently the only module in which Bio.EUtils is used, not counting modules that themselves have been deprecated. It shouldn't be too complicated to modify Bio.GenBank to use Bio.Entrez instead. >>> 2) Remove the 'query' function from Bio.Entrez. >>> Anyway accessing NCBI's web site from Python >>> to get HTML back doesn't make a lot of sense. > >> Okay, now I'm quite confused. This is functionality >> that Bio.EUtils supports. > > I think Michiel meant getting a handle containing > raw HTML isn't very sensible, and this is what the > Bio.Entrez.query() function does. If it can only > return HTML, then I agree, its not very useful and > could be removed. That is indeed what I meant. (It is still possible to get raw HTML by using the other EUtilities, for example efetch, but from a scripting language efetch is more likely to be used to get XML or some plain-text output). 
--Michiel

From mjldehoon at yahoo.com Thu Jun 26 12:50:10 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Thu, 26 Jun 2008 05:50:10 -0700 (PDT)
Subject: [Biopython-dev] New release
Message-ID: <390323.35893.qm@web62411.mail.re1.yahoo.com>

Hi everybody,

I think we should make a new Biopython release within the next couple of weeks to solve the issues with NCBI and to get the fixed Blast parser out (for output from Blast 2.2.18). There are a few outstanding issues that hopefully can be fixed before the next release:

1) NCBI access from Bio.GenBank
2) Bug #2454 (Iterators can't use file-like objects), which affects a number of parsers in Biopython
3) Martel-based parsers.

From a technical viewpoint, none of these are very complicated. 2) is almost finished. With respect to 3), a small number of parsers in Biopython are based on Martel (none of the major ones as far as I can tell). For some of these parsers, it is not quite clear if they are still useful. For the remaining ones, it would be nice if they could be rewritten without using Martel -- that would let us get rid of the dependency on mxTextTools.

Any other urgent issues that need to be resolved before a release?

--Michiel.

From biopython at maubp.freeserve.co.uk Thu Jun 26 12:53:09 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 26 Jun 2008 13:53:09 +0100
Subject: [Biopython-dev] NCBI Abuse activity with Biopython
In-Reply-To: <402220.93857.qm@web62411.mail.re1.yahoo.com>
References: <320fb6e00806260347i7655ba6eg490f5003a273a37d@mail.gmail.com> <402220.93857.qm@web62411.mail.re1.yahoo.com>
Message-ID: <320fb6e00806260553i4a7c5b2cxe5ae5aa0c80e53d1@mail.gmail.com>

> As far as I can tell, Bio.GenBank is currently the only module in which
> Bio.EUtils is used, not counting modules that themselves have been
> deprecated. It shouldn't be too complicated to modify Bio.GenBank to use
> Bio.Entrez instead.
Looking back at CVS, it used to use Bio.WWW.NCBI once upon a time (which is now Bio.Entrez), and had explicit rate limiting. Then four years ago Brad moved the Bio.GenBank.download_many() and search_for() functions over to using Bio.EUtils (CVS revision 1.51 of Bio/GenBank/__init__.py). Brad also appears to have changed the functionality of Bio.GenBank.download_many() from a call back mechanism to returning a handle. We could still return a handle, but it would require fetching all the records (perhaps in batches), and concatenating them. I think it would make more sense to deprecate the Bio.GenBank.download_many() function, and direct people to Bio.Entrez.efetch() instead. The Bio.GenBank.search_for() still seems somewhat useful, but without a default limit on the number of returned IDs, this could easily be abused. Again, we could deprecate this and direct people to Bio.Entrez.esearch() instead. Peter From mjldehoon at yahoo.com Thu Jun 26 13:41:24 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 26 Jun 2008 06:41:24 -0700 (PDT) Subject: [Biopython-dev] NCBI Abuse activity with Biopython In-Reply-To: <320fb6e00806260553i4a7c5b2cxe5ae5aa0c80e53d1@mail.gmail.com> Message-ID: <8498.83228.qm@web62412.mail.re1.yahoo.com> > The Bio.GenBank.search_for() still seems somewhat > useful, but without a default limit on the number > of returned IDs, this could easily be abused. > Again, we could deprecate this and direct people > to Bio.Entrez.esearch() instead. As always, I am in favor of deprecating functions whose purpose is dubious. 
F # Using Bio.GenBank >>> from Bio import GenBank >>> gi_list = GenBank.search_for("Opuntia AND rpl16") >>> gi_list ['57240072', '57240071', '6273287', '6273291', '6273290', '6273289', '6273286', '6273285', '6273284'] # Same thing, using Bio.Entrez >>> from Bio import Entrez >>> handle = Entrez.esearch(db='nucleotide', term="Opuntia AND rpl16") >>> record = Entrez.read(handle) >>> record["IdList"] ['57240072', '57240071', '6273287', '6273291', '6273290', '6273289', '6273286', '6273285', '6273284'] --- On Thu, 6/26/08, Peter wrote: From: Peter Subject: Re: [Biopython-dev] NCBI Abuse activity with Biopython To: mjldehoon at yahoo.com Cc: "Biopython Developers Mailing List" Date: Thursday, June 26, 2008, 8:53 AM > As far as I can tell, Bio.GenBank is currently the only module in which > Bio.EUtils is used, not counting modules that themselves have been > deprecated. It shouldn't be too complicated to modify Bio.GenBank to use > Bio.Entrez instead. Looking back at CVS, it used to use Bio.WWW.NCBI once upon a time (which is now Bio.Entrez), and had explicit rate limiting. Then four years ago Brad moved the Bio.GenBank.download_many() and search_for() functions over to using Bio.EUtils (CVS revision 1.51 of Bio/GenBank/__init__.py). Brad also appears to have changed the functionality of Bio.GenBank.download_many() from a call back mechanism to returning a handle. We could still return a handle, but it would require fetching all the records (perhaps in batches), and concatenating them. I think it would make more sense to deprecate the Bio.GenBank.download_many() function, and direct people to Bio.Entrez.efetch() instead. The Bio.GenBank.search_for() still seems somewhat useful, but without a default limit on the number of returned IDs, this could easily be abused. Again, we could deprecate this and direct people to Bio.Entrez.esearch() instead. 
Peter From mjldehoon at yahoo.com Thu Jun 26 13:51:55 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 26 Jun 2008 06:51:55 -0700 (PDT) Subject: [Biopython-dev] NCBI Abuse activity with Biopython In-Reply-To: <320fb6e00806260553i4a7c5b2cxe5ae5aa0c80e53d1@mail.gmail.com> Message-ID: <597121.15112.qm@web62401.mail.re1.yahoo.com> [Sorry, hit the send button too soon] > The Bio.GenBank.search_for() still seems somewhat > useful, but without a default limit on the number > of returned IDs, this could easily be abused. > Again, we could deprecate this and direct people > to Bio.Entrez.esearch() instead. As always, I am in favor of deprecating functions whose purpose is dubious. As an example, this is a Genbank search done via Bio.GenBank and via Bio.Entrez: # Using Bio.GenBank >>> from Bio import GenBank >>> gi_list = GenBank.search_for("Opuntia AND rpl16") >>> gi_list ['57240072', '57240071', '6273287', '6273291', '6273290', '6273289', '6273286', '6273285', '6273284'] # Same thing, using Bio.Entrez >>> from Bio import Entrez >>> handle = Entrez.esearch(db='nucleotide', term="Opuntia AND rpl16") >>> record = Entrez.read(handle) >>> record["IdList"] ['57240072', '57240071', '6273287', '6273291', '6273290', '6273289', '6273286', '6273285', '6273284'] I believe that GenBank.search_for automatically takes care of the retmax parameter (the maximum number of ids to return), but I agree that this can be abused easily. > Brad also appears to have changed the functionality of > Bio.GenBank.download_many() from a call back mechanism > to returning a handle. We could still return a handle, but it would > require fetching all the records (perhaps in batches), and > concatenating them. I think it would make more sense to deprecate > the Bio.GenBank.download_many() function, and direct people to > Bio.Entrez.efetch() instead. Agree. Btw, NCBIDictionary definitely needs to go. 
From the documentation, continuing the example above: >>> ncbi_dict = GenBank.NCBIDictionary("nucleotide", "genbank") >>> gb_record = ncbi_dict[gi_list[0]] Hence, we're running efetch once for each key separately; this is exactly what NCBI advised against. --Michiel. From mjldehoon at yahoo.com Thu Jun 26 14:01:31 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 26 Jun 2008 07:01:31 -0700 (PDT) Subject: [Biopython-dev] Bio.ECell, anybody? Message-ID: <712489.88060.qm@web62410.mail.re1.yahoo.com> This is one of the Martel-based parsers whose relevance in 2008 is unclear to me. From the docstring: Ecell converts the ECell input from spreadsheet format to an intermediate format, described in http://www.e-cell.org/manual/chapter2E.html#3.2.? It provides an alternative to the perl script supplied with the Ecell2 distribution at http://bioinformatics.org/project/?group_id=49. Currently, ECell is at version 3.1.106 (and uses Python as the scripting interface! Yay!). The link to the chapter in the ECell manual is dead. Is anybody using the Bio.ECell module? --Michiel From biopython at maubp.freeserve.co.uk Thu Jun 26 14:43:10 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Jun 2008 15:43:10 +0100 Subject: [Biopython-dev] NCBI Abuse activity with Biopython In-Reply-To: <597121.15112.qm@web62401.mail.re1.yahoo.com> References: <320fb6e00806260553i4a7c5b2cxe5ae5aa0c80e53d1@mail.gmail.com> <597121.15112.qm@web62401.mail.re1.yahoo.com> Message-ID: <320fb6e00806260743u3385955dt2be06d7f8122d8e5@mail.gmail.com> OK then - I will deprecate the Bio.GenBank.search_for() and Bio.GenBank.download_many() functions, suggesting Bio.Entrez instead. I will also update the tutorial on this. On Thu, Jun 26, 2008 at 2:51 PM, Michiel de Hoon wrote: > Btw, NCBIDictionary definitely needs to go.
> From the documentation, continuing the example above: >>>> ncbi_dict = GenBank.NCBIDictionary("nucleotide", "genbank") >>>> gb_record = ncbi_dict[gi_list[0]] > Hence, we're running efetch once for each key separately; this is exactly what NCBI advised against. If the user wants to run an Entrez search and then fetch some/all of the results, then yes, the NCBI would not want us to make multiple separate efetch calls by identifier. Could you prepare an example using Bio.Entrez with the "history" (WebEnv argument)? However, if the user has provided the list of GI numbers (e.g. from a file), there is no existing NCBI search data to refer to, and I don't see any other option. So there is a use-case for the Bio.GenBank.NCBIDictionary class. Peter From mjldehoon at yahoo.com Thu Jun 26 14:49:49 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 26 Jun 2008 07:49:49 -0700 (PDT) Subject: [Biopython-dev] NCBI Abuse activity with Biopython In-Reply-To: <320fb6e00806260743u3385955dt2be06d7f8122d8e5@mail.gmail.com> Message-ID: <525848.21341.qm@web62410.mail.re1.yahoo.com> --- On Thu, 6/26/08, Peter wrote: However, if the user has provided the list of GI numbers (e.g. from a file), there is no existing NCBI search data to refer to, and I don't see any other option. So there is a use-case for the Bio.GenBank.NCBIDictionary class. In that case, the following can be used: >>> from Bio import Entrez >>> idlist = ['123','456','453',.....] # a list of GI numbers >>> ids = ",".join(idlist) >>> handle = Entrez.efetch(db='nucleotide', id=ids, retmode='xml') >>> records = Entrez.read(handle) # records is now a list of records corresponding to '123', '456', '453',... --Michiel.
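A minimal sketch of the batching idea in the reply above, assuming a long list of GI numbers (for example read from a file): the list is split into chunks and each chunk is joined with commas, so that one efetch call covers many records instead of one call per identifier. The helper name batched_id_strings and the default batch size of 100 are illustrative assumptions and not part of Bio.Entrez; the Entrez call itself is only shown as a comment because it needs network access.

```python
def batched_id_strings(idlist, batch_size=100):
    """Yield comma-separated ID strings, at most batch_size IDs each.

    Hypothetical helper for illustration; it is not part of Biopython.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(idlist), batch_size):
        yield ",".join(idlist[start:start + batch_size])


if __name__ == "__main__":
    gi_numbers = ["123", "456", "453", "789", "1011"]  # made-up GI numbers
    for ids in batched_id_strings(gi_numbers, batch_size=2):
        # With network access, each string would go to a single call such as:
        #   handle = Entrez.efetch(db="nucleotide", id=ids, retmode="xml")
        #   records = Entrez.read(handle)
        print(ids)
```

Keeping the chunking separate from the network call makes it easy to add a pause between batches, in line with the rate limiting discussed earlier in this thread.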
From biopython at maubp.freeserve.co.uk Thu Jun 26 16:05:36 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Jun 2008 17:05:36 +0100 Subject: [Biopython-dev] [BioPython] Fwd: NCBI Abuse Activity with BioPython In-Reply-To: <79693088-0D38-459E-ADEC-FF2757E41912@dalkescientific.com> References: <55F39FBF-3CF0-4192-AFEB-100853FEE8A1@sonsorol.org> <320fb6e00806260421g48e5807ei92297b372c330e5b@mail.gmail.com> <5CD393BF-D4FB-4700-B7CC-2417C9845010@dalkescientific.com> <320fb6e00806260440n4a933b60of5a7c8eee4e15a89@mail.gmail.com> <79693088-0D38-459E-ADEC-FF2757E41912@dalkescientific.com> Message-ID: <320fb6e00806260905i599a53f3v367045d3ee07ffbf@mail.gmail.com> On Thu, Jun 26, 2008 at 12:48 PM, Andrew Dalke wrote: >> I think we should >> also add the three second wait to the epost() method. > > I see it now. Yes, that needs it as well. Good - I've updated that in CVS, Bio/EUtils/ThinClient.py revision 1.8 >> I can see why Michiel has kept things simple in Bio.Entrez - this >> should cater to most user's needs. > > Sad, but true. EUtils (the server and the client) offer a lot more than > what most users need. > Agreed. Thanks again Andrew for your advice on where Bio.EUtils needed updating - it certainly meant this got dealt with more quickly. 
Peter From biopython at maubp.freeserve.co.uk Thu Jun 26 17:04:26 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Jun 2008 18:04:26 +0100 Subject: [Biopython-dev] NCBI Abuse activity with Biopython In-Reply-To: <320fb6e00806260743u3385955dt2be06d7f8122d8e5@mail.gmail.com> References: <320fb6e00806260553i4a7c5b2cxe5ae5aa0c80e53d1@mail.gmail.com> <597121.15112.qm@web62401.mail.re1.yahoo.com> <320fb6e00806260743u3385955dt2be06d7f8122d8e5@mail.gmail.com> Message-ID: <320fb6e00806261004r227c3340wf390779f1cc4616b@mail.gmail.com> Michiel, I started working on a patch to mark Bio.GenBank.search_for() etc as deprecated, but on reflection I don't really like the longer code needed with Bio.Entrez - for example this one liner: from Bio import GenBank gi_list = GenBank.search_for("Opuntia AND rpl16") becomes: from Bio import Entrez handle = Entrez.esearch(db='nucleotide', term="Opuntia AND rpl16") gi_list = Entrez.read(handle)["IdList"] One idea that might be worth discussing is having variations of the Entrez.e* functions which will parse the XML and return the results. i.e. something like this: def esearch2(...) : """Calls ESearch and parses the returned XML.""" return read(esearch(..., retmode="XML")) Then we can write, from Bio import Entrez gi_list = Entrez.esearch2(db='nucleotide', term="Opuntia AND rpl16")["IdList"] (An alternative naming convention like a "p" might be nicer) My initial plan was to get the search results back as plain text (retmode='uilist'), thus avoiding parsing the XML. However, after reading the Entrez documentation, and some experimentation to confirm this, I was surprised to find the ESearch will only return XML. The NCBI appear to suggest that if you want your search results in another format use the WebEnv session history, and then ask EFetch to reformat it (!). 
This does work, but means making two internet calls: from Bio import Entrez handle = Entrez.esearch(db='nucleotide', term="Opuntia AND rpl16", usehistory="y") session = Entrez.read(handle)['WebEnv'] gi_list = Entrez.efetch(db='nucleotide', WebEnv=session, query_key=1, rettype='uilist').read().split('\n') As an aside, do we really have to include the database in the efetch call above? Peter From biopython at maubp.freeserve.co.uk Thu Jun 26 20:32:07 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Jun 2008 21:32:07 +0100 Subject: [Biopython-dev] New release In-Reply-To: <390323.35893.qm@web62411.mail.re1.yahoo.com> References: <390323.35893.qm@web62411.mail.re1.yahoo.com> Message-ID: <320fb6e00806261332q408cc02boa7ee4c3342b53e4b@mail.gmail.com> On Thu, Jun 26, 2008 at 1:50 PM, Michiel de Hoon wrote: > Hi everybody, > > I think we should make a new Biopython release within the next couple of weeks > to solve the issues with NCBI and to get the fixed Blast parser out (for output > from Blast 2.2.18). There are a few outstanding issues that hopefully can be > fixed before the next release: > 1) NCBI access from Bio.GenBank > 2) Bug #2454 (Iterators can't use file-like objects), which affects a number of parsers in Biopython > 3) Martel-based parsers. Given the updates to Bio.EUtils to enforce the 3 second rule, the urgent part of issue (1) is now resolved, and any further refinements needn't hold up the release. > From a technical viewpoint, none of these are very complicated. 2) is almost finished. While there are still outstanding parsers affected by issue (2) (Bug 2454), I don't think this need hold up the release. > With respect to 3), a small number of parsers in Biopython are based on > Martel (none of the major ones as far as I can tell). For some of these > parsers, it is not quite clear if they are still useful.
For the remaining ones, > it would be nice if they could be rewritten without using Martel -- that would > let us get rid of the dependency on mxTextTools. Again, while removing the dependency on mxTextTools is a worthwhile aim, I don't think this should hold up the release. > Any other urgent issues that need to be resolved before a release? There is an AlignInfo alphabet issue I'm currently working on, and expect to have fixed tomorrow. Peter From dalke at dalkescientific.com Thu Jun 26 21:40:51 2008 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 26 Jun 2008 23:40:51 +0200 Subject: [Biopython-dev] NCBI Abuse activity with Biopython In-Reply-To: <320fb6e00806261004r227c3340wf390779f1cc4616b@mail.gmail.com> References: <320fb6e00806260553i4a7c5b2cxe5ae5aa0c80e53d1@mail.gmail.com> <597121.15112.qm@web62401.mail.re1.yahoo.com> <320fb6e00806260743u3385955dt2be06d7f8122d8e5@mail.gmail.com> <320fb6e00806261004r227c3340wf390779f1cc4616b@mail.gmail.com> Message-ID: <5DF39193-B52A-4EB9-84D3-C9626984DEA8@dalkescientific.com> On Jun 26, 2008, at 7:04 PM, Peter wrote: > I started working on a patch to mark Bio.GenBank.search_for() etc as > deprecated, but on reflection I don't really like the longer code > needed with Bio.Entrez > One idea that might be worth discussing is having variations of the > Entrez.e* functions which will parse the XML and return the results. > i.e. something like this: > > def esearch2(...) : > """Calls ESearch and parses the returned XML.""" > return read(esearch(..., retmode="XML")) What about calling it "search"? That is, the one that does everything the default way as most people expect is the one which doesn't need the prefix? > My initial plan was to get the search results back as plain text > (retmode='uilist'), thus avoiding parsing the XML. However, after > reading the Entrez documentation, and some experimentation to confirm > this, I was surprised to find the ESearch will only return XML. 
The > NCBI appear to suggest that if you want your search results in another > format use the WebEnv session history, and then ask EFetch to reformat > it (!). This does work, but means making two internet calls: That's my memory of it too. > As an aside, do we really have to include the database in the > efetch call above? Yes. Or you did 5 years ago. Andrew dalke at dalkescientific.com From biopython at maubp.freeserve.co.uk Thu Jun 26 21:53:40 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Jun 2008 22:53:40 +0100 Subject: [Biopython-dev] New release In-Reply-To: <320fb6e00806261332q408cc02boa7ee4c3342b53e4b@mail.gmail.com> References: <390323.35893.qm@web62411.mail.re1.yahoo.com> <320fb6e00806261332q408cc02boa7ee4c3342b53e4b@mail.gmail.com> Message-ID: <320fb6e00806261453l649f4ce3i83a6ed38fec54965@mail.gmail.com> >> Any other urgent issues that need to be resolved before a release? > > There is an AlignInfo alphabet issue I'm currently working on, and > expect to have fixed tomorrow. Fixed, I think. Alphabets can be annoying, especially gapped alphabets! 
Peter From biopython at maubp.freeserve.co.uk Thu Jun 26 22:05:45 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Jun 2008 23:05:45 +0100 Subject: [Biopython-dev] NCBI Abuse activity with Biopython In-Reply-To: <5DF39193-B52A-4EB9-84D3-C9626984DEA8@dalkescientific.com> References: <320fb6e00806260553i4a7c5b2cxe5ae5aa0c80e53d1@mail.gmail.com> <597121.15112.qm@web62401.mail.re1.yahoo.com> <320fb6e00806260743u3385955dt2be06d7f8122d8e5@mail.gmail.com> <320fb6e00806261004r227c3340wf390779f1cc4616b@mail.gmail.com> <5DF39193-B52A-4EB9-84D3-C9626984DEA8@dalkescientific.com> Message-ID: <320fb6e00806261505w6e51d168i78987ac109a6f015@mail.gmail.com> On Thu, Jun 26, 2008 at 10:40 PM, Andrew Dalke wrote: > On Jun 26, 2008, at 7:04 PM, Peter wrote: >> >> I started working on a patch to mark Bio.GenBank.search_for() etc as >> deprecated, but on reflection I don't really like the longer code >> needed with Bio.Entrez > >> One idea that might be worth discussing is having variations of the >> Entrez.e* functions which will parse the XML and return the results. >> i.e. something like this: >> >> def esearch2(...) : >> """Calls ESearch and parses the returned XML.""" >> return read(esearch(..., retmode="XML")) > > What about calling it "search"? That is, the one that does everything the > default way as most people expect is the one which doesn't need the prefix? I like that idea for the naming :) What do you think Michiel, as this is your module? Peter From mjldehoon at yahoo.com Thu Jun 26 23:16:23 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 26 Jun 2008 16:16:23 -0700 (PDT) Subject: [Biopython-dev] New release In-Reply-To: <320fb6e00806261332q408cc02boa7ee4c3342b53e4b@mail.gmail.com> Message-ID: <501202.26872.qm@web62413.mail.re1.yahoo.com> OK, then let's make a new release as soon as possible, and perhaps another one soon after that. Tentative date is this Sunday, around noon GMT. 
All Biopython unit tests pass (at least, on my machine), so it should be straightforward to build a release. --Michiel. --- On Thu, 6/26/08, Peter wrote: From: Peter Subject: Re: [Biopython-dev] New release To: mjldehoon at yahoo.com Cc: biopython-dev at biopython.org Date: Thursday, June 26, 2008, 4:32 PM On Thu, Jun 26, 2008 at 1:50 PM, Michiel de Hoon wrote: > Hi everybody, > > I think we should make a new Biopython release within the next couple of weeks > to solve the issues with NCBI and to get the fixed Blast parser out (for output > from Blast 2.2.18). There are a few outstanding issues that hopefully can be > fixed before the next release: > 1) NCBI access from Bio.GenBank > 2) Bug #2454 (Iterators can't use file-like objects), which affects a number of parsers in Biopython > 3) Martel-based parsers. Given the updates to Bio.EUtils to enforce the 3 second rule, the urgent part of issue (1) is now resolved, and any further refinements needn't hold up the release. > From a technical viewpoint, none of these are very complicated. 2) is almost finished. While there are still outstanding parsers affected by issue (2) (Bug 2454), I don't think this need hold up the release. > With respect to 3), a small number of parsers in Biopython are based on > Martel (none of the major ones as far as I can tell). For some of these > parsers, it is not quite clear if they are still useful. For the remaining ones, > it would be nice if they could be rewritten without using Martel -- that would > let us get rid of the dependency on mxTextTools. Again, while removing the dependency on mxTextTools is a worthwhile aim, I don't think this should hold up the release. > Any other urgent issues that need to be resolved before a release? There is an AlignInfo alphabet issue I'm currently working on, and expect to have fixed tomorrow.
Peter From mjldehoon at yahoo.com Thu Jun 26 23:20:49 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 26 Jun 2008 16:20:49 -0700 (PDT) Subject: [Biopython-dev] NCBI Abuse activity with Biopython In-Reply-To: <320fb6e00806261505w6e51d168i78987ac109a6f015@mail.gmail.com> Message-ID: <900951.88468.qm@web62414.mail.re1.yahoo.com> There are some other possibilities, for example to use the retout parameter. This parameter lets you choose between XML, HTML, plain text, ... format for the results. We could make the rule that without an explicit value for this parameter, the Bio.Entrez.e* functions return the parsed results. If we're not sure what to do, I suggest we keep the search_for function in Bio.GenBank for the upcoming release, and take this issue up later. --Michiel. --- On Thu, 6/26/08, Peter wrote: From: Peter Subject: Re: [Biopython-dev] NCBI Abuse activity with Biopython To: "Biopython Developers Mailing List" Cc: "Andrew Dalke" Date: Thursday, June 26, 2008, 6:05 PM On Thu, Jun 26, 2008 at 10:40 PM, Andrew Dalke wrote: > On Jun 26, 2008, at 7:04 PM, Peter wrote: >> >> I started working on a patch to mark Bio.GenBank.search_for() etc as >> deprecated, but on reflection I don't really like the longer code >> needed with Bio.Entrez > >> One idea that might be worth discussing is having variations of the >> Entrez.e* functions which will parse the XML and return the results. >> i.e. something like this: >> >> def esearch2(...) : >> """Calls ESearch and parses the returned XML.""" >> return read(esearch(..., retmode="XML")) > > What about calling it "search"? That is, the one that does everything the > default way as most people expect is the one which doesn't need the prefix? I like that idea for the naming :) What do you think Michiel, as this is your module? 
Peter _______________________________________________ Biopython-dev mailing list Biopython-dev at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython-dev From biopython at maubp.freeserve.co.uk Thu Jun 26 23:45:50 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 27 Jun 2008 00:45:50 +0100 Subject: [Biopython-dev] NCBI Abuse activity with Biopython In-Reply-To: <900951.88468.qm@web62414.mail.re1.yahoo.com> References: <320fb6e00806261505w6e51d168i78987ac109a6f015@mail.gmail.com> <900951.88468.qm@web62414.mail.re1.yahoo.com> Message-ID: <320fb6e00806261645y1819cddx620d430f34d7e725@mail.gmail.com> On Fri, Jun 27, 2008 at 12:20 AM, Michiel de Hoon wrote: > There are some other possibilities, for example to use the retout parameter. > This parameter lets you choose between XML, HTML, plain text, ... format for > the results. I'm not sure if it's rettype, retmode or retout - but something like that. > We could make the rule that without an explicit value for this > parameter, the Bio.Entrez.e* functions return the parsed results. Your suggestion to automatically do the parsing when XML format is requested would prevent the user from parsing the XML themselves (e.g. using SAX or DOM). It would also spoil my plan to include some of the Entrez sequence XML formats in Bio.SeqIO, as this would need Bio.Entrez.efetch(...) to return a handle with XML in it. > If we're not sure what to do, I suggest we keep the search_for function in > Bio.GenBank for the upcoming release, and take this issue up later. That would be expedient.
Peter From bugzilla-daemon at portal.open-bio.org Thu Jun 26 23:47:14 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 26 Jun 2008 19:47:14 -0400 Subject: [Biopython-dev] [Bug 2090] Blast.NCBIStandalone BlastParser fails with blastall 2.2.14 In-Reply-To: Message-ID: <200806262347.m5QNlESr031036@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2090 ------- Comment #16 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-26 19:47 EST ------- Created an attachment (id=952) --> (http://bugzilla.open-bio.org/attachment.cgi?id=952&action=view) Patch to Bio/Blast/NCBIStandalone.py This is a very rough attempt at fixing multiquery BLAST output from recent versions of NCBI BLAST. It seems to work for the file I tested, but breaks the final part of the unit test due to the alignments shown as "Flat Query-Anchored with(out) Identities", described here: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/multi_formats.html See also unit test files bt005 and bt045 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jun 27 00:37:14 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 26 Jun 2008 20:37:14 -0400 Subject: [Biopython-dev] [Bug 2375] Coalescent support through Simcoal2 In-Reply-To: Message-ID: <200806270037.m5R0bEkY000324@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2375 ------- Comment #24 from mdehoon at ims.u-tokyo.ac.jp 2008-06-26 20:37 EST ------- I committed my patch to setup.py, as it seems to work fine with Python 2.3, 2.4, and 2.5 on all platforms. Leaving this bug open, since we still need to remove the workaround in Bio/PopGen/SimCoal/__init__.py. 
From biopython at maubp.freeserve.co.uk Fri Jun 27 14:12:45 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 27 Jun 2008 15:12:45 +0100 Subject: [Biopython-dev] Bio.AlignIO and Bio.Entrez documentation Message-ID: <320fb6e00806270712w134e1c5cm903b811c55fc60e1@mail.gmail.com> Hi all, I've realised that there is quite a lot of new content in the Tutorial since the last release. In addition to my new chapter on Bio.AlignIO, Michiel and I have both spent a good chunk of time on the Bio.Entrez chapter of the tutorial. Michiel wrote the bulk of this chapter and has updated it to cover the new XML parser. I've just been adding information based on the NCBI guidelines (for example encouraging people to include their email address in the Entrez calls), and I've just added another section with an example using the history/webenv for a combined esearch and efetch. If anyone could spare some time to proofread the tutorial, concentrating on either or both of these new chapters (and trying the examples), it would be appreciated. Those of you with CVS access can of course check in any little fixes - but if you spot anything significant it's probably worth discussing first. Ideally we can fix any little typos before Michiel releases Biopython 1.46 (tentatively this Sunday, around noon GMT). Peter P.S. If you'd like to help out and can't read or run LaTeX, let me know by email and I'll send you the latest edition of the tutorial as a PDF or HTML file.
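The history/webenv approach mentioned in the message above can be sketched roughly as follows. This is not the tutorial's own example: the helper history_batches is hypothetical, and it only assumes that the parsed ESearch result behaves like a dictionary with "Count", "WebEnv" and "QueryKey" entries when the search was run with usehistory="y". The keyword names WebEnv and query_key follow the efetch call quoted earlier in this thread. The helper works out the arguments for each batch without contacting the NCBI.

```python
def history_batches(search_result, batch_size=50):
    """Yield keyword arguments for fetching search results in batches.

    search_result is assumed to be the parsed output of an ESearch call
    made with usehistory="y". Hypothetical helper, not part of Biopython.
    """
    count = int(search_result["Count"])
    for start in range(0, count, batch_size):
        yield {
            "WebEnv": search_result["WebEnv"],
            "query_key": search_result["QueryKey"],
            "retstart": start,
            "retmax": batch_size,
        }


if __name__ == "__main__":
    # Made-up values standing in for a parsed ESearch result.
    fake_result = {"Count": "120", "WebEnv": "EXAMPLE_SESSION", "QueryKey": "1"}
    for kwargs in history_batches(fake_result, batch_size=50):
        # With network access, each batch would be fetched with something like:
        #   handle = Entrez.efetch(db="nucleotide", rettype="fasta", **kwargs)
        print(kwargs["retstart"], kwargs["retmax"])
```

Because the search results stay on the NCBI side, only the session identifier and the batch offsets travel with each fetch, rather than a long list of identifiers.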
From biopython at maubp.freeserve.co.uk Fri Jun 27 15:42:16 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 27 Jun 2008 16:42:16 +0100 Subject: [Biopython-dev] Removing obsolete bits of the Tutorial Message-ID: <320fb6e00806270842r6231adfdo8edff7a07a329cdf@mail.gmail.com> I'm still in documentation mode, and I've just removed bits of documentation of a few deprecated or obsolete bits of code. I've just got to the "BioRegistry - automatically finding sequence sources" section of the tutorial/cookbook, and this either needs major updating or removing. First of all, since Biopython 1.44, the line "from Bio import db" had to be "from Bio.config.DBRegistry import db". And secondly, given this is all based on Martel parsers, the list of supported formats is now a lot thinner. Would anyone object to me removing this section of the tutorial/cookbook? We might be able to deprecate it too, but I'm not sure what side effects that might have so it's a bit risky this close to a planned release. Then there is the section on "Parser Design" which focuses on the scanner/consumer model and lists lots of the events these parsers (used to) generate. I don't think any of this is useful, and suspect that a lot of it is out of date. Again, should we just remove this section? Peter From mjldehoon at yahoo.com Fri Jun 27 15:54:13 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 27 Jun 2008 08:54:13 -0700 (PDT) Subject: [Biopython-dev] NCBI Abuse activity with Biopython In-Reply-To: <320fb6e00806261645y1819cddx620d430f34d7e725@mail.gmail.com> Message-ID: <224711.6366.qm@web62411.mail.re1.yahoo.com> > > We could make the rule that without an explicit value for this > > parameter, the Bio.Entrez.e* functions return the parsed results. > Your suggestion to automatically do the parsing when XML format is > requested would prevent the user from parsing the XML themselves (e.g.
> using SAX or DOM). Actually I was suggesting to do the parsing only if no format is requested, and to return a handle to XML if XML format is requested. But from the current examples in the Bio.Entrez chapter in the tutorial, it appears that typically users will have to write some glue code anyway to make optimal use of Bio.Entrez for their purposes. In that case, I suppose that whether or not we return a handle or an object from the Bio.Entrez.e* functions makes little difference. --Michiel. From biopython at maubp.freeserve.co.uk Fri Jun 27 16:06:58 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 27 Jun 2008 17:06:58 +0100 Subject: [Biopython-dev] NCBI Abuse activity with Biopython In-Reply-To: <224711.6366.qm@web62411.mail.re1.yahoo.com> References: <320fb6e00806261645y1819cddx620d430f34d7e725@mail.gmail.com> <224711.6366.qm@web62411.mail.re1.yahoo.com> Message-ID: <320fb6e00806270906p3d0d3a1dyf78b64bc2f0afa13@mail.gmail.com> On Fri, Jun 27, 2008 at 4:54 PM, Michiel de Hoon wrote: >> Your suggestion to automatically do the parsing when XML format is >> requested would prevent the user from parsing the XML themselves (e.g. >> using SAX or DOM). > > Actually I was suggesting to do the parsing only if no format is > requested, and to return a handle to XML if XML format is requested. Oh I see. But determining the format is a complex combination of the retmode and rettype parameters... quite confusing in its own right! Especially as there are multiple different XML file formats for the same result set. http://eutils.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html http://eutils.ncbi.nlm.nih.gov/entrez/query/static/efetchlit_help.html > But from the current examples in the Bio.Entrez chapter in the tutorial, it appears > that typically users will have to write some glue code anyway to make optimal > use of Bio.Entrez for their purposes.
In that case, I suppose that whether or not > we return a handle or an object from the Bio.Entrez.e* functions makes little difference. Fair point. Certainly the "esearch and efetch" example is relatively complicated, and having a combined "esearch then parse" function wouldn't make much difference. Let's leave this suggestion for the time being (having versions of the Bio.Entrez functions which include the call to Bio.Entrez.read() to parse the XML). Peter From mjldehoon at yahoo.com Fri Jun 27 16:01:54 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 27 Jun 2008 09:01:54 -0700 (PDT) Subject: [Biopython-dev] Removing obsolete bits of the Tutorial In-Reply-To: <320fb6e00806270842r6231adfdo8edff7a07a329cdf@mail.gmail.com> Message-ID: <215121.11545.qm@web62405.mail.re1.yahoo.com> > I've just got to the "BioRegistry - automatically finding sequence > sources" section of the tutorial/cookbook, and this either needs major > updating or removing > ... > Would anyone object to me removing this section of the > tutorial/cookbook? I think it's better to remove it. Then there is the section on "Parser Design" which focuses on the scanner/consumer model and lists lots of the events these parsers (used to) generate. I don't think any of this is useful, and suspect that a lot of it is out of date. Again, should we just remove this section? That too. Otherwise, we may inadvertently be causing new Biopython developers to write their parsers using this out of date parser design, which as far as I know is not being used in the major Biopython modules.
--Michiel From mjldehoon at yahoo.com Fri Jun 27 16:40:13 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 27 Jun 2008 09:40:13 -0700 (PDT) Subject: [Biopython-dev] Modules to be removed from Biopython Message-ID: <492634.64872.qm@web62414.mail.re1.yahoo.com> Hi everybody, In recent releases, we have been using the rule of thumb to remove all modules from a new Biopython release that were deprecated two releases ago. For the upcoming release, this means that we will remove the modules that were deprecated in Biopython 1.44. In that release, quite a lot of modules were deprecated; these modules will not appear in Biopython 1.46. Some of the modules to be removed are relatively simple cases, which I think can be removed without causing any real pain to anybody: Bio.crc (moved to Bio.SeqUtils.CheckSum) Bio.Fasta.index_file Bio.Fasta.Dictionary Bio.GenBank.index_file Bio.GenBank.Dictionary Bio.Geo.Iterator (replaced by Bio.Geo.parse) Bio.KEGG.Compound.Iterator (replaced by Bio.KEGG.Compound.parse) Bio.KEGG.Enzyme.Iterator (replaced by Bio.KEGG.Enzyme.parse) Bio.KEGG.Map.Iterator (replaced by Bio.KEGG.Enzyme.parse) Bio.lcc (moved to Bio.SeqUtils.lcc) Bio.MarkupEditor Bio.Medline.NLMMedlineXML Bio.Medline.nlmmedline_001211_format Bio.Medline.nlmmedline_010319_format Bio.Medline.nlmmedline_011101_format Bio.Medline.nlmmedline_031101_format Bio.MultiProc Bio.SeqIO.FASTA.py Bio.SeqIO.generic.py But, there is also a set of interconnected modules where it's not 100% clear if they can be removed without causing some surprises: Bio.builders Bio.config Bio.dbdefs Bio.formatdefs Bio.dbdefs Bio.expressions Bio.FormatIO Bio.Std Bio.StdHandler It is probably OK to remove these, since these were deprecated we did not get a barrage of complaints from our users. Personally, I think it is important to keep the code base clean, so I am in favor of removing these (and see if anybody complains; in that case, we can always put these modules back in and make a new release). 
But I can live with keeping these modules for another release round. If anybody thinks that that would be better, please let us know. --Michiel From biopython at maubp.freeserve.co.uk Fri Jun 27 16:50:17 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 27 Jun 2008 17:50:17 +0100 Subject: [Biopython-dev] Modules to be removed from Biopython In-Reply-To: <492634.64872.qm@web62414.mail.re1.yahoo.com> References: <492634.64872.qm@web62414.mail.re1.yahoo.com> Message-ID: <320fb6e00806270950k479eda23ia96d3c2d36557510@mail.gmail.com> On Fri, Jun 27, 2008 at 5:40 PM, Michiel de Hoon wrote: > Hi everybody, > > In recent releases, we have been using the rule of thumb to remove all > modules from a new Biopython release that were deprecated two releases ago. I was wondering if there was a stated policy on this. > For the upcoming release, this means that we will remove the modules > that were deprecated in Biopython 1.44. In that release, quite a lot of > modules were deprecated; these modules will not appear in Biopython 1.46. > > Some of the modules to be removed are relatively simple cases, which I > think can be removed without causing any real pain to anybody: > > Bio.crc (moved to Bio.SeqUtils.CheckSum) > Bio.Fasta.index_file > Bio.Fasta.Dictionary > Bio.GenBank.index_file > Bio.GenBank.Dictionary > Bio.Geo.Iterator (replaced by Bio.Geo.parse) > Bio.KEGG.Compound.Iterator (replaced by Bio.KEGG.Compound.parse) > Bio.KEGG.Enzyme.Iterator (replaced by Bio.KEGG.Enzyme.parse) > Bio.KEGG.Map.Iterator (replaced by Bio.KEGG.Enzyme.parse) > Bio.lcc (moved to Bio.SeqUtils.lcc) > Bio.MarkupEditor > Bio.Medline.NLMMedlineXML > Bio.Medline.nlmmedline_001211_format > Bio.Medline.nlmmedline_010319_format > Bio.Medline.nlmmedline_011101_format > Bio.Medline.nlmmedline_031101_format > Bio.MultiProc > Bio.SeqIO.FASTA.py > Bio.SeqIO.generic.py Those all look fine to remove. I agree here. 
> But, there is also a set of interconnected modules where it's not 100% > clear if they can be removed without causing some surprises: > Bio.builders > Bio.config > Bio.dbdefs > Bio.formatdefs > Bio.dbdefs > Bio.expressions > Bio.FormatIO > Bio.Std > Bio.StdHandler > It is probably OK to remove these, since these were deprecated we did > not get a barrage of complaints from our users. Personally, I think it is > important to keep the code base clean, so I am in favor of removing > these (and see if anybody complains; in that case, we can always put > these modules back in and make a new release). But I can live with > keeping these modules for another release round. If anybody thinks > that that would be better, please let us know. Given some of these are very interconnected, I would be inclined to leave them in for one more release. However, I'm content to see them go. If no one else has any qualms, then please carry on. Peter From biopython at maubp.freeserve.co.uk Fri Jun 27 16:54:16 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 27 Jun 2008 17:54:16 +0100 Subject: [Biopython-dev] Removing obsolete bits of the Tutorial In-Reply-To: <215121.11545.qm@web62405.mail.re1.yahoo.com> References: <320fb6e00806270842r6231adfdo8edff7a07a329cdf@mail.gmail.com> <215121.11545.qm@web62405.mail.re1.yahoo.com> Message-ID: <320fb6e00806270954r4ee7b16fw3210cd77f1708a3@mail.gmail.com> On Fri, Jun 27, 2008 at 5:01 PM, Michiel de Hoon wrote: > >> I've just got to the "BioRegistry - automatically finding sequence >> sources" section of the tutorial/cookbook, and this either needs major >> updating or removing >> ... >> Would anyone object to me removing this section of the >> tutorial/cookbook? > > I think it's better to remove it. Gone. >> Then there is the section on "Parser Design" which focuses on the >> scanner/consumer model and lists lots of the events these parsers >> (used to) generate.
I don't think any of this is useful, and suspect >> that a lot of it is out of date. Again, should we just remove this >> section? > > That too. Otherwise, we may inadvertently be causing new > Biopython developers to write their parsers using this out of > date parser design, which as far as I know is not being used > in the major Biopython modules. It's not entirely out of date - don't SAX based XML parsers do something similar? And quite a few major modules still follow this scheme (e.g. Bio.GenBank and Bio.SwissProt). Anyway, I have removed most of this section leaving only a short overview. Peter From biopython at maubp.freeserve.co.uk Fri Jun 27 17:49:53 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 27 Jun 2008 18:49:53 +0100 Subject: [Biopython-dev] Recent Bio.Nexus updates Message-ID: <320fb6e00806271049vdfb15co30a05c0a93963aba@mail.gmail.com> Hi Frank, I see you've got your CVS access working again - good :) I wanted to ask you about two of your recent changes to Bio/Nexus/Nexus.py First of all, you've added a new method export_phylip(), which seems to be a simple function to record the Nexus object's alignment as a PHYLIP format alignment. One point of concern is code duplication (Bio.AlignIO can write PHYLIP files). Also, you don't seem to be following the "spec" strictly, as the taxon names are not cropped to ten characters, nor are any "illegal" characters dealt with. More generally, I wonder if this method is really needed - perhaps instead a general method to return a Bio.Align.Generic.Alignment object would be preferable. This could then be used in conjunction with any of the alignment formats supported in Bio.AlignIO. Secondly, you seem to have reverted the alphabet change to Bio/Nexus/Nexus.py made in revision 1.12 to fix Bug 2380. Was this deliberate or just accidental? 
http://bugzilla.open-bio.org/show_bug.cgi?id=2380

Thanks,

Peter

From biopython at maubp.freeserve.co.uk Fri Jun 27 21:58:04 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 27 Jun 2008 22:58:04 +0100
Subject: [Biopython-dev] [BioPython] Entrez
In-Reply-To: <1214569152.6026.9.camel@ubuntu>
References: <1214494546.6215.3.camel@ubuntu> <320fb6e00806260857i619d4947l130791ab8276f992@mail.gmail.com> <1214562160.6026.2.camel@ubuntu> <320fb6e00806270416x76d8b388mdd79577927001f32@mail.gmail.com> <1214569152.6026.9.camel@ubuntu>
Message-ID: <320fb6e00806271458t4e043c39sb664c4346c8a6949@mail.gmail.com>

Just forwarding this to the mailing list - Binbin's problem is resolved (although I don't know what was wrong originally). A happy ending :)

Peter

---------- Forwarded message ----------
From: binbin
Date: Fri, Jun 27, 2008 at 1:19 PM
Subject: Re: [BioPython] Entrez
To: Peter

I re-installed Biopython 1.45 and now I can import Entrez! Thanks very much!

On Fri, 2008-06-27 at 13:16 +0200, Peter wrote:
> On Fri, Jun 27, 2008 at 11:22 AM, binbin wrote:
> > Thank you for answering. I am a beginner with Biopython. In the "Biopython
> > Tutorial and Cookbook":
> > 2.5 Connecting with biological databases:
> > this is found
> > "from Bio import Entrez"
> >
> > I tried this but it did not work for me, that is why I asked.
>
> That should have worked if your installation of Biopython 1.45 was successful.
>
> We may be able to work out what is wrong. What operating system are
> you using, which version of python, and how did you install Biopython?
> > Regards, > > Peter From biopython at maubp.freeserve.co.uk Fri Jun 27 22:06:14 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 27 Jun 2008 23:06:14 +0100 Subject: [Biopython-dev] [BioPython] Bio.SCOP.FileIndex In-Reply-To: <141582.2274.qm@web62413.mail.re1.yahoo.com> References: <141582.2274.qm@web62413.mail.re1.yahoo.com> Message-ID: <320fb6e00806271506i1af1db34n1aec65605fd6f83c@mail.gmail.com> On Wed, Jun 25, 2008 at 3:04 PM, Michiel de Hoon wrote: > Hi everybody, > > When I was modifying Bio.SCOP, I noticed that Bio.SCOP.FileIndex is flawed > if file reading is done via a buffer (which is often the case in Python). Are you talking about Bio/SCOP/FileIndex.py? The whole design seems to be geared to indexing the position of record in a file - down to the fact that it takes as filename rather than a handle. Why does it need "fixing"? > Before we try to fix this, is anybody actually using Bio.SCOP.FileIndex? > If not, I think we should deprecate it instead of trying to fix it. We've deprecated similar functionality in Bio.GenBank, although if I recall correctly that was because it was using Martel and broke with mxTextTools 3.0, and therefore fixing it was non-trivial. If Bio.SCOP.FileIndex is broken, then deprecation seems sensible. Peter From mjldehoon at yahoo.com Sat Jun 28 02:21:53 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 27 Jun 2008 19:21:53 -0700 (PDT) Subject: [Biopython-dev] [BioPython] Bio.SCOP.FileIndex In-Reply-To: <320fb6e00806271506i1af1db34n1aec65605fd6f83c@mail.gmail.com> Message-ID: <216781.61321.qm@web62403.mail.re1.yahoo.com> --- On Fri, 6/27/08, Peter wrote: Are you talking about Bio/SCOP/FileIndex.py? The whole design seems to begeared to indexing the position of record in a file - down to the fact that it takes as filename rather than a handle. Why does it need "fixing"? 
FileIndex pulls out records from the iterator one by one, and then calls .tell() on the file handle to find the starting position of each record. The problem is that (due to buffered reading from the file handle) .tell() does not correspond to the record starting positions. Taking the essential pieces of FileIndex:

>>> input = open("mydatafile.txt")
>>> while True:
...     next_line = input.next()
...     print input.tell()
...
8192
8192
8192
8192
8192
...
8192
8192
18432
18432
18432
...

In practice FileIndex works because the iterators that are actually used in Bio.SCOP call readline() internally, which reads exactly one line so that .tell() returns the expected answer. But calling readline() in the iterator is a limitation (e.g., you cannot run it on a list of lines). Another option is to let FileIndex itself call readline():

class FileIndex(dict):
    def __init__(self, filename, record_gen, key_gen):
        ...
        f = open(filename)
        while True:
            line = f.readline()
            self[key] = f.tell() # store location
            ...
    def __getitem__(self, key):
        location = dict.__getitem__(self, key)
        f.seek(location)
        line = f.readline()
        return record_gen(line)

This works, but it means changing how users call FileIndex. Which is also OK, but before modifying FileIndex it would be good to know if anybody is actually using this functionality.

--Michiel.

From mjldehoon at yahoo.com Sat Jun 28 02:28:48 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Fri, 27 Jun 2008 19:28:48 -0700 (PDT)
Subject: [Biopython-dev] Bio.GenBank.NCBIDictionary, Bio.PubMed.Dictionary
Message-ID: <982950.87150.qm@web62409.mail.re1.yahoo.com>

Does anybody have any further objections to deprecating Bio.GenBank.NCBIDictionary and Bio.PubMed.Dictionary? These two classes download records from NCBI one by one, which is exactly what NCBI advised against.
--Michiel

From bugzilla-daemon at portal.open-bio.org Sat Jun 28 20:09:44 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 28 Jun 2008 16:09:44 -0400
Subject: [Biopython-dev] [Bug 2530] New: Bio.Seq.translate() treats invalid codons as stops
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2530

Summary: Bio.Seq.translate() treats invalid codons as stops
Product: Biopython
Version: Not Applicable
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk

The following results are with CVS. Biopython 1.45 may be different, as I have recently tweaked the translate function for some less dramatic issues.

I would like Bio.Seq.translate() to raise exceptions on untranslatable codons, rather than inserting a stop character, e.g. for "N@N" or "TA-".

Currently:

>>> from Bio.Seq import translate
>>> translate("TAA")
'*'
>>> translate("TAG")
'*'
>>> translate("TAA")
'*'
>>> translate("TAC")
'Y'
>>> translate("TAN")
...
Bio.Data.CodonTable.TranslationError: 'TAN'
>>> translate("NNN")
...
Bio.Data.CodonTable.TranslationError: 'TAN'
>>> translate("AAA")
'K'
>>> translate("ANA")
'X'
>>> translate("AXA")
'X'

That is all fine. However,

>>> translate("A@A")
'*'
>>> translate("A-A")
'*'

These should also raise a TranslationError.

Suggested non-trivial patch to follow.

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
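A minimal sketch of the behaviour requested in Bug 2530, assuming a single forward table that holds the stop codons alongside the coding codons, so that any missing key is an error rather than a silent stop. This is not Bio.Seq.translate() or the attached patch: the table is a deliberately tiny toy, and the names TOY_FORWARD_TABLE, TranslationError and translate_strict are hypothetical and chosen only for illustration.

```python
# Hypothetical toy table: a handful of codons only, with the stop codons
# stored in the same mapping as the coding codons (assumption for this sketch).
TOY_FORWARD_TABLE = {
    "AAA": "K",
    "ATG": "M",
    "TAC": "Y",
    "TAA": "*",
    "TAG": "*",
    "TGA": "*",
}


class TranslationError(ValueError):
    """Raised when a codon cannot be translated (illustrative name)."""


def translate_strict(sequence, table=TOY_FORWARD_TABLE):
    """Translate codon by codon; any codon missing from the table is an error."""
    sequence = sequence.upper()
    if len(sequence) % 3 != 0:
        raise TranslationError(
            "sequence length %d is not a multiple of three" % len(sequence))
    amino_acids = []
    for start in range(0, len(sequence), 3):
        codon = sequence[start:start + 3]
        try:
            amino_acids.append(table[codon])
        except KeyError:
            # An invalid codon such as "A-A" ends up here instead of
            # being reported as a stop.
            raise TranslationError("cannot translate codon %r" % codon)
    return "".join(amino_acids)
```

With this layout a genuine stop codon is an ordinary lookup, while "A-A" or "A@A" raise TranslationError.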
From bugzilla-daemon at portal.open-bio.org Sat Jun 28 20:19:09 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 28 Jun 2008 16:19:09 -0400
Subject: [Biopython-dev] [Bug 2530] Bio.Seq.translate() treats invalid codons as stops
In-Reply-To:
Message-ID: <200806282019.m5SKJ9l2011097@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2530

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-28 16:19 EST -------
Created an attachment (id=953)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=953&action=view)
Patch to Bio/Seq.py Bio/Data/CodonTable.py and the test_seq.py unit test

The basic idea of this patch is to include the stop codons in the CodonTable's forward table dictionary. Currently, when doing the translation a stop codon is inserted when the key is undefined (but this also happens for invalid codons). Instead, by including the stop codons in the forward table, we can do a single mapping. Any KeyError becomes a translation error.

However, this is a fairly significant change to the existing CodonTable objects. They are a strange bunch of objects - with the ambiguous codon tables being very odd. I have replaced all of these with a single codon table which includes all the DNA and RNA codons, including the ambiguous ones. All the existing variants of DNA/RNA/Generic and (un)ambiguous CodonTables are now replaced with the single object. We still have one per NCBI codon table.

I think that the CodonTable could be made simpler still, but I wanted to at least try and remain API backwards compatible (bar the dictionary change).

Then, I tweaked the Bio.Seq translate method to take advantage of this.

NOTE - We don't have a unit test for Bio.Data.CodonTable or Bio.Translate, so it would be wise to write one BEFORE committing this patch. If there are any other bits of code using Bio.Data.CodonTable they could also be affected.
-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From biopython at maubp.freeserve.co.uk Sat Jun 28 20:32:09 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 28 Jun 2008 21:32:09 +0100
Subject: [Biopython-dev] Failing unit tests under Windows
Message-ID: <320fb6e00806281332v44ba6139xd2531c57f53f92e@mail.gmail.com>

I run python 2.3.5 on Windows, and compile from source with MSVC 6.0 (which is a different setup to the one Michiel uses for the builds). I just thought I should document the unit test oddities I see on this machine:

test_ProtParam - fails with a single floating point difference, 0.562 versus 0.563.

test_Wise - doesn't fail gracefully due to a problem detecting dnal
http://bugzilla.open-bio.org/show_bug.cgi?id=2469

test_psw - fails due to a "doctest of" versus "Doctest: " string difference. This may be due to the different version of python? We can probably fix this in run_tests.py

test_KDTree - fails with ImportError: No module named _CKDTree
I do select yes when asked if I want to build Bio.KDTree - does this work for anyone under Windows?

Peter

From bugzilla-daemon at portal.open-bio.org Sat Jun 28 20:39:45 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 28 Jun 2008 16:39:45 -0400
Subject: [Biopython-dev] [Bug 2530] Bio.Seq.translate() treats invalid codons as stops
In-Reply-To:
Message-ID: <200806282039.m5SKdjUA011740@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2530

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-28 16:39 EST -------
Actually there is a unit test, test_translate.py - maybe the lower case T confused me?

The bad news is this unit test fails with my patch, due to the Bio.Translate module using an incredibly strict check on the alphabet.
I'll try and come up with a less invasive change to Bio.Data.CodonTable which makes Bio.Translate happy again - but probably not tonight. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sun Jun 29 01:57:54 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 28 Jun 2008 21:57:54 -0400 Subject: [Biopython-dev] [Bug 2530] Bio.Seq.translate() treats invalid codons as stops In-Reply-To: Message-ID: <200806290157.m5T1vshF022329@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2530 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #953 is|0 |1 obsolete| | ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-28 21:57 EST ------- (From update of attachment 953) There is an underlying issue in Bio.Data.CodonTable, which is at least commented: # These two are WRONG! I need to get the # list of ambiguous codons which code for # the stop codons XXX For example, R = A or G, so UAR = UAA or UAG / TAR = TAA or TAG = stop codons. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
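The point in comment #3 (R = A or G, so TAR = TAA or TAG = stop) can be sketched by brute force: an ambiguous codon counts as a stop codon only if every unambiguous codon it stands for is a stop codon. This is an illustration under stated assumptions, not the code in Bio/Data/CodonTable.py or in any of the attached patches; the names IUPAC_DNA, expand and ambiguous_stop_codons are hypothetical, and only the DNA alphabet is covered.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes (DNA letters only), each mapped to the
# unambiguous bases it can stand for.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(codon):
    """Return the set of unambiguous codons that an ambiguous codon covers."""
    choices = [IUPAC_DNA[letter] for letter in codon]
    return set("".join(bases) for bases in product(*choices))


def ambiguous_stop_codons(stop_codons):
    """List extra codons whose every expansion is one of the given stops."""
    stops = set(stop_codons)
    extra = []
    for letters in product(sorted(IUPAC_DNA), repeat=3):
        codon = "".join(letters)
        # Keep the codon only if it is new and can never code for an
        # amino acid, i.e. all of its expansions are stop codons.
        if codon not in stops and expand(codon) <= stops:
            extra.append(codon)
    return extra
```

Under this rule TAA and TAG give the extra codon TAR, while TAG and TGA give nothing, because TRR would also cover TAA and TGG.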
From bugzilla-daemon at portal.open-bio.org Sun Jun 29 02:37:01 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 28 Jun 2008 22:37:01 -0400 Subject: [Biopython-dev] [Bug 2530] Bio.Seq.translate() treats invalid codons as stops In-Reply-To: Message-ID: <200806290237.m5T2b1Wu023585@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2530 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-28 22:37 EST ------- Created an attachment (id=954) --> (http://bugzilla.open-bio.org/attachment.cgi?id=954&action=view) Rough patch to Bio/Data/CodonTable.py This includes some self testing, but needs further validation before being trusted. For example, is it enough to compare just pairs of unambiguous start/stop codons when generating the set of possible ambiguous start/stop codons? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mjldehoon at yahoo.com Sun Jun 29 06:22:43 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 28 Jun 2008 23:22:43 -0700 (PDT) Subject: [Biopython-dev] [BioPython] Bio.SCOP.FileIndex In-Reply-To: <216781.61321.qm@web62403.mail.re1.yahoo.com> Message-ID: <584421.23968.qm@web62410.mail.re1.yahoo.com> It turned out that Bio.SCOP.FileIndex was used as a base class in Bio.SCOP.Cla and Bio.SCOP.Raf. Without using Bio.SCOP.FileIndex as a base class, the derived classes in Bio.SCOP.Cla and Bio.SCOP.Raf were easy to fix. So I deprecated Bio.SCOP.FileIndex, while keeping Bio.SCOP's functionality intact by fixing the derived classes. 
--Michiel From bugzilla-daemon at portal.open-bio.org Sun Jun 29 06:24:42 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 29 Jun 2008 02:24:42 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806290624.m5T6Og3F029458@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #19 from mdehoon at ims.u-tokyo.ac.jp 2008-06-29 02:24 EST ------- Bio.SCOP is fixed now (added a parse() function as a replacement for the Iterator class, which is now deprecated). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sun Jun 29 10:09:25 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 29 Jun 2008 06:09:25 -0400 Subject: [Biopython-dev] [Bug 2508] NCBIStandalone.blastall: provide support for '-F F' and make it safe In-Reply-To: Message-ID: <200806291009.m5TA9PfZ021963@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2508 ------- Comment #7 from mmokrejs at ribosome.natur.cuni.cz 2008-06-29 06:09 EST ------- Quoting from http://www.python.org/dev/peps/pep-0324/ - No implicit call of /bin/sh. This means that there is no need for escaping dangerous shell meta characters. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
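The PEP 324 point quoted in comment #7 can be illustrated with the subprocess module (Python 2.4 and later). This is a sketch, not the NCBIStandalone code: rather than assuming blastall is installed, it starts a second Python interpreter as a stand-in child program, and the helper name run_without_shell is hypothetical. Each list element reaches the child as one literal argument, so a value containing shell metacharacters is never interpreted by /bin/sh.

```python
import subprocess
import sys


def run_without_shell(argv):
    """Run argv (a list of strings) directly, with no shell in between."""
    process = subprocess.Popen(argv, stdout=subprocess.PIPE)
    output, _ = process.communicate()
    return process.returncode, output


# A value that would be dangerous if pasted into a shell command line.
hostile_value = "F; echo injected"

# Stand-in child program: it simply writes back its first argument.
child_code = "import sys; sys.stdout.write(sys.argv[1])"

returncode, output = run_without_shell(
    [sys.executable, "-c", child_code, hostile_value])
```

The child receives the whole string, semicolon included, as a single argument; nothing after the semicolon is executed as a command.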
From bugzilla-daemon at portal.open-bio.org Sun Jun 29 10:55:04 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 29 Jun 2008 06:55:04 -0400 Subject: [Biopython-dev] [Bug 2508] NCBIStandalone.blastall: provide support for '-F F' and make it safe In-Reply-To: Message-ID: <200806291055.m5TAt4qX023404@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2508 ------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-29 06:55 EST ------- Hmm. Another reason to move to Python 2.4+, see also Bug 2480. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mjldehoon at yahoo.com Sun Jun 29 11:15:00 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sun, 29 Jun 2008 04:15:00 -0700 (PDT) Subject: [Biopython-dev] CVS freeze for release 1.46 Message-ID: <799546.26730.qm@web62413.mail.re1.yahoo.com> Hi everybody, I will start to creating the new release from now. Please don't make any commits to CVS until the new release is out. Thanks! --Michiel. 
From bugzilla-daemon at portal.open-bio.org Sun Jun 29 14:35:11 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 29 Jun 2008 10:35:11 -0400
Subject: [Biopython-dev] [Bug 2530] Bio.Seq.translate() treats invalid codons as stops
In-Reply-To:
Message-ID: <200806291435.m5TEZBAh032091@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2530

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #954 is|0                           |1
           obsolete|                            |

------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-29 10:35 EST -------
Created an attachment (id=955)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=955&action=view)
Patches Bio/Data/CodonTable.py for ambiguous start/stop codons

This implements the stub function list_ambiguous_codons, and adds a lot of in-situ asserts which could later be moved to a unit test.

e.g.
['TAG', 'TAA'] -> ['TAG', 'TAA', 'TAR']
['UAA', 'UGA'] -> ['UAA', 'UGA', 'URA']

Note that ['TAG', 'TGA'] -> ['TAG', 'TGA']; this does not add 'TRR', as this could be a stop codon or a coding amino acid. Thus only two more codons are added in the following example:

e.g.
['TGA', 'TAA', 'TAG'] -> ['TGA', 'TAA', 'TAG', 'TRA', 'TAR']

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From mjldehoon at yahoo.com Sun Jun 29 14:43:25 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sun, 29 Jun 2008 07:43:25 -0700 (PDT)
Subject: [Biopython-dev] New release 1.46
Message-ID: <899008.26338.qm@web62403.mail.re1.yahoo.com>

Hi everybody,

Release 1.46 is essentially done. Feel free to start committing to CVS again.

Currently I am not able to update Biopython's wiki pages.
This looks like an problem with the wiki, since I am getting a blank screen without any error message. So I cannot update the website and send out the announcement yet. --Michiel From biopython at maubp.freeserve.co.uk Sun Jun 29 15:09:47 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 29 Jun 2008 16:09:47 +0100 Subject: [Biopython-dev] New release 1.46 In-Reply-To: <899008.26338.qm@web62403.mail.re1.yahoo.com> References: <899008.26338.qm@web62403.mail.re1.yahoo.com> Message-ID: <320fb6e00806290809r6ad238d3r3a16dfa145bc0186@mail.gmail.com> On Sun, Jun 29, 2008 at 3:43 PM, Michiel de Hoon wrote: > Hi everybody, > > Release 1.46 is essentially done. Feel free to start committing to CVS again. Well done - I hope you didn't give up your whole weekend for this. > Currently I am not able to update Biopython's wiki pages. This looks like an problem > with the wiki, since I am getting a blank screen without any error message. So I > cannot update the website and send out the announcement yet. I've been in touch with the OBF about this before. You'll notice all the other project pages are down too (check www.biosql.org and www.bioperl.org for example). I'm told they have something in place to automatically reboot the server, so it should fix itself within an hour or so, but it looks like they haven't resolved the underlying problem. I guess this means the new release files themselves are still waiting on your local machine(s)? That's a shame. Peter From mjldehoon at yahoo.com Sun Jun 29 15:07:36 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sun, 29 Jun 2008 08:07:36 -0700 (PDT) Subject: [Biopython-dev] Removing obsolete bits of the Tutorial In-Reply-To: <320fb6e00806270954r4ee7b16fw3210cd77f1708a3@mail.gmail.com> Message-ID: <176230.99034.qm@web62415.mail.re1.yahoo.com> >> Then there is the section on "Parser Design" which focuses on the >> scanner/consumer model and lists lots of the events these parsers >> (used to) generate. 
I don't think any of this is useful, and suspect >> that a lot of it is out of date. Again, should we just remove this >> section? > > That too. Otherwise, we may inadvertently be causing new > Biopython developers to write their parsers using this out of > date parser design, which as far as I know is not being used > in the major Biopython modules. It's not entirely out of date - don't SAX based XML parsers do something similar? Yes, but there's a difference: In an XML file, we need to find out where the XML tags are to be able to parse the file. These tags can appear anywhere in the file. In flat-file text formats, typically different information is stored in different lines. So finding out where one piece of information ends and another one starts becomes trivial. We just need to pull out the lines one by one, and check whether they are a new piece of information or a continuation of the current piece of information. Especially for simple formats (e.g. Fasta), using a scanner / consumer model can be unnecessarily complex. But also for more complicated formats, parsing line by line can be entirely straightforward. For example, have a look at Bio/SwissProt/KeyWList.py, which currently contains a line-by-line parser and a scanner/consumer parser (which is deprecated). The former takes 26 lines, the latter more than a 100. --Michiel. 
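The line-by-line approach described above can be sketched for FASTA as follows. This is an illustration only, not the Bio.SeqIO or Bio.SwissProt.KeyWList implementation; the function name parse_fasta is hypothetical. Each line is either the start of a new record (it begins with ">") or a continuation of the current one, and because the function only iterates over its argument it accepts an open file handle or a plain list of lines.

```python
def parse_fasta(lines):
    """Yield (title, sequence) pairs from any iterable of FASTA lines."""
    title = None
    chunks = []
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            # A new record starts here, so emit the previous one first.
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:]
            chunks = []
        elif title is not None and line:
            # Continuation of the current record's sequence.
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if title is not None:
        yield title, "".join(chunks)
```

For example, list(parse_fasta(open("example.fasta"))) would collect every record in a file, where example.fasta is a placeholder file name.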
From biopython at maubp.freeserve.co.uk Sun Jun 29 15:28:04 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 29 Jun 2008 16:28:04 +0100 Subject: [Biopython-dev] Modules to be removed from Biopython In-Reply-To: <320fb6e00806270950k479eda23ia96d3c2d36557510@mail.gmail.com> References: <492634.64872.qm@web62414.mail.re1.yahoo.com> <320fb6e00806270950k479eda23ia96d3c2d36557510@mail.gmail.com> Message-ID: <320fb6e00806290828u7133ee40x8feba14b19c13be8@mail.gmail.com> > On Fri, Jun 27, 2008 at 5:40 PM, Michiel de Hoon wrote: >> For the upcoming release, this means that we will remove the modules >> that were deprecated in Biopython 1.44. In that release, quite a lot of >> modules were deprecated; these modules will not appear in Biopython 1.46. >> >> Some of the modules to be removed are relatively simple cases, which I >> think can be removed without causing any real pain to anybody: >> >> ... I see you removed most of the easy ones before making Biopython 1.46. Just to let you all know that I've just removed these three: >> Bio.SeqIO.FASTA.py >> Bio.SeqIO.generic.py >> Bio.FormatIO Peter From fkauff at biologie.uni-kl.de Mon Jun 30 08:34:30 2008 From: fkauff at biologie.uni-kl.de (Frank Kauff) Date: Mon, 30 Jun 2008 10:34:30 +0200 Subject: [Biopython-dev] Recent Bio.Nexus updates In-Reply-To: <320fb6e00806271049vdfb15co30a05c0a93963aba@mail.gmail.com> References: <320fb6e00806271049vdfb15co30a05c0a93963aba@mail.gmail.com> Message-ID: <48689A96.4010805@biologie.uni-kl.de> Hi Peter and Michiel, Peter wrote: > Hi Frank, > > I see you've got your CVS access working again - good :) > > I wanted to ask you about two of your recent changes to Bio/Nexus/Nexus.py > > First of all, you've added a new method export_phylip(), which seems > to be a simple function to record the Nexus object's alignment as a > PHYLIP format alignment. One point of concern is code duplication > (Bio.AlignIO can write PHYLIP files). 
Also, you don't seem to be
> following the "spec" strictly, as the taxon names are not cropped to
> ten characters, nor are any "illegal" characters dealt with.

True - I ignored this deliberately. I think except for old PHYLIP itself, all software I know handles longer taxon names by default. The format I used here is sometimes referred to as "relaxed phylip", but as it has become the standard for what people call phylip format, I just kept it this way.

> More
> generally, I wonder if this method is really needed - perhaps instead
> a general method to return a Bio.Align.Generic.Alignment object would
> be preferable. This could then be used in conjunction with any of the
> alignment formats supported in Bio.AlignIO.
>

That is a possibility. I would then vouch for adding support for "relaxed phylip" to AlignIO.PhylipIO (which I could easily do with a little modification of Nexus.export_phylip() myself)

> Secondly, you seem to have reverted the alphabet change to
> Bio/Nexus/Nexus.py made in revision 1.12 to fix Bug 2380. Was this
> deliberate or just accidental?
> http://bugzilla.open-bio.org/show_bug.cgi?id=2380
>
>

Sorry for that. I missed that bug. Thanks for re-fixing it.

Frank

> Thanks,
>
> Peter
>
>

-- 
J-Prof. Dr. Frank Kauff
Molecular Phylogenetics
FB Biologie, 13/276
TU Kaiserslautern
Postfach 3049
67653 Kaiserslautern
Tel. +49 (0)631 205-2562
Fax.
+49 (0)631 205-2998 email: fkauff at biologie.uni-kl.de skype: frank.kauff From biopython at maubp.freeserve.co.uk Mon Jun 30 09:12:17 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 30 Jun 2008 10:12:17 +0100 Subject: [Biopython-dev] Recent Bio.Nexus updates In-Reply-To: <48689A96.4010805@biologie.uni-kl.de> References: <320fb6e00806271049vdfb15co30a05c0a93963aba@mail.gmail.com> <48689A96.4010805@biologie.uni-kl.de> Message-ID: <320fb6e00806300212m6b129a17he9dfd7c8af7cbc03@mail.gmail.com> >> First of all, you've added a new method export_phylip(), which seems >> to be a simple function to record the Nexus object's alignment as a >> PHYLIP format alignment. One point of concern is code duplication >> (Bio.AlignIO can write PHYLIP files). Also, you don't seem to be >> following the "spec" strictly, as the taxon names are not cropped to >> ten characters, nor are any "illegal" characters dealt with. > > True - I ignored this delibaretely. I think except for old PHYLIP itself, > all software I know handles longer taxon names by default. The format I used > here is sometimes refered to as "relaxed phylip" but as it has become the > standard for what people call phylip formt, so I just kept it this way. Sadly "relaxed phylip" is an even less well defined format! >> More >> generally, I wonder if this method is really needed - perhaps instead >> a general method to return a Bio.Align.Generic.Alignment object would >> be preferable. This could then be used in conjunction with any of the >> alignment formats supported in Bio.AlignIO. > > That is a possibility. I would then vouch for adding support for "relaxed > phylip" to AlignIO.PhylipIO (which I could easily do with a little > mofification of Nexus.export_phylip() myself) Would you expect spaces to be allowed in the names for "relaxed phylip" files? Writing the files is easy - checking that other tools can understand them is more hassle. 
And the flip side of this is reading assorted versions of "relaxed phylip" is also tricky. If you have a collection of various "valid" files (ideally output from or accepted by mainstream tools) we could use that to put together a test suite which would define the de-facto standard. But without that, I wouldn't be so confident about adding this to Biopython. >> Secondly, you seem to have reverted the alphabet change to >> Bio/Nexus/Nexus.py made in revision 1.12 to fix Bug 2380. Was this >> deliberate or just accidental? >> http://bugzilla.open-bio.org/show_bug.cgi?id=2380 > > Sorry for that. I missed that bug. Thaks for re-fixing it. There may be a more elegant way of fixing this. Peter From bugzilla-daemon at portal.open-bio.org Mon Jun 30 10:21:26 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 30 Jun 2008 06:21:26 -0400 Subject: [Biopython-dev] [Bug 2509] Deprecating the .data property of the Seq and MutableSeq objects In-Reply-To: Message-ID: <200806301021.m5UALQVF020449@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2509 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-30 06:21 EST ------- See also Bug 2351, Make Seq more like a string, even subclass string? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon Jun 30 13:35:59 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 30 Jun 2008 09:35:59 -0400 Subject: [Biopython-dev] [Bug 2531] New: Nexus and fasta parsers have a problem with identical taxa names Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2531 Summary: Nexus and fasta parsers have a problem with identical taxa names Product: Biopython Version: 1.44 Platform: Macintosh OS/Version: Mac OS Status: NEW Severity: normal Priority: P4 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: abetanco at staffmail.ed.ac.uk When identical taxa names are used to identify different sequences, the nexus and fasta parser will output both taxa names, but output the same sequence for each of them. If it's not possible to store both sequences, maybe it would be better if only one of the sequences were written out, so at least it's obvious there's a problem? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 30 13:48:24 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 30 Jun 2008 09:48:24 -0400 Subject: [Biopython-dev] [Bug 2531] Nexus and fasta parsers have a problem with identical taxa names In-Reply-To: Message-ID: <200806301348.m5UDmO70030666@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2531 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-30 09:48 EST ------- Which Nexus and Fasta parsers? There is more than one way to load these file formats in Biopython - could you show us some sample code please? You can attach a pair of example input files if it helps. Thanks. Peter. 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 30 14:21:41 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 30 Jun 2008 10:21:41 -0400 Subject: [Biopython-dev] [Bug 2531] Nexus and fasta parsers have a problem with identical taxa names In-Reply-To: Message-ID: <200806301421.m5UELfPj000799@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2531 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-30 10:21 EST ------- Can I repeat my request that you upload an example file (by creating an attachment to this bug) of a FASTA and NEXUS file that doesn't work for you. Here is a small Nexus file I just created by hand, with repeated taxon CYS1_DICDI (with almost the same sequence), and then below some example code using Bio.Nexus to parse it. ================================== #NEXUS [TITLE: NoName] begin data; dimensions ntax=4 nchar=50; format interleave datatype=protein gap=- symbols="FSTNKEYVQMCLAWPHDRIG"; matrix CYS1_DICDI -----MKVIL LFVLAVFTVF VSS------- --------RG IPPEEQ---- ALEU_HORVU MAHARVLLLA LAVLATAAVA VASSSSFADS NPIRPVTDRA ASTLESAVLG CATH_HUMAN ------MWAT LPLLCAGAWL LGV------- -PVCGAAELS VNSLEK---- CYS1_DICDI -----MKVIL LFVLAVFTVF VSS------- --------RG IPPEEQ---X ; end; ================================== Then in python, >>> filename = ... 
>>> handle = open(filename) >>> from Bio.Nexus import Nexus >>> n = Nexus.Nexus(handle) >>> print n.matrix.keys() ['CATH_HUMAN', 'CYS1_DICDI', 'CYS1_DICDI.copy', 'ALEU_HORVU'] >>> n.matrix['CYS1_DICDI'] Seq('-----MKVILLFVLAVFTVFVSS---------------RGIPPEEQ----', IUPACProtein()) >>> n.matrix['CYS1_DICDI.copy'] Seq('-----MKVILLFVLAVFTVFVSS---------------RGIPPEEQ---X', IUPACProtein()) Note that Bio.Nexus has automatically renamed the duplicate entry 'CYS1_DICDI.copy' and that their different sequences have been loaded correctly. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 30 14:36:06 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 30 Jun 2008 10:36:06 -0400 Subject: [Biopython-dev] [Bug 2531] Nexus and fasta parsers have a problem with identical taxa names In-Reply-To: Message-ID: <200806301436.m5UEa6WK001525@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2531 ------- Comment #3 from abetanco at staffmail.ed.ac.uk 2008-06-30 10:36 EST ------- Created an attachment (id=956) --> (http://bugzilla.open-bio.org/attachment.cgi?id=956&action=view) nexus file Sorry for the overly complicated nexus file, but I can't seem to reproduce the bug with a simple example. In this case, HI99.Line5 is entered twice, and differs just at three sites (249, 417, and 452). The result I get at those three sites is the first sequence duplicated twice. 249 417 452 nexus file HI99.Line5 T T A HI99.Line5 C C G fasta output HI99.Line5 T T A HI99.Line5 T T A To do the conversion, I used this, which I think is just copied off the Biopython documentation site: #! 
/usr/bin/python if __name__ == '__main__' : from Bio import SeqIO import sys input_handle = open(sys.argv[1], "rU") output_handle = open(sys.argv[1] + ".fas", "w") sequences = SeqIO.parse(input_handle, "nexus") SeqIO.write(sequences, output_handle, "fasta") output_handle.close() input_handle.close() -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 30 14:52:08 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 30 Jun 2008 10:52:08 -0400 Subject: [Biopython-dev] [Bug 2531] Nexus and fasta parsers have a problem with identical taxa names In-Reply-To: Message-ID: <200806301452.m5UEq8DN002181@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2531 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-30 10:52 EST ------- Thanks for the example file - I can now reproduce a problem, which is progress. There is a rather cryptic error message from Bio.SeqIO, due to the fact that when Bio.Nexus parses the file it doesn't create a matrix. You can see this by using Bio.Nexus directly: >>> filename = ... >>> handle = open(filename) >>> from Bio.Nexus import Nexus >>> n = Nexus.Nexus(handle) >>> n.matrix.keys() Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: 'NoneType' object has no attribute 'keys' >>> n.matrix is None True This explains why trying to use Bio.SeqIO gives the following exception: TypeError: argument of type 'NoneType' is not iterable So, from my point of view this is good news (joke) as it's not really a problem in Bio.SeqIO - although I will fix Bio.SeqIO so it fails gracefully. This seems to be a problem in Bio.Nexus, so it's a job for Frank... I've got a couple more questions for you: (1) Where did this file come from? 
I'm not an expert on the details of the Nexus file format, but I am wondering which program wrote this file, as perhaps it is invalid in some way? (2) Could we add it to Biopython as an example for our unit tests? It might be a bit big as it is, but we could cut it down a little by hand first. P.S. I have retitled the bug from "Nexus and fasta parsers have a problem with identical taxa names" to "Bio.Nexus has a problem with identical taxa names". You don't seem to be parsing in any FASTA files, just trying to write one. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mjldehoon at yahoo.com Mon Jun 30 14:55:16 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 30 Jun 2008 07:55:16 -0700 (PDT) Subject: [Biopython-dev] New release Message-ID: <97693.82874.qm@web62401.mail.re1.yahoo.com> Sorry, but I still can't edit the Biopython wiki pages, so I can't make the new release available. Can other people edit these pages? --Michiel. From biopython at maubp.freeserve.co.uk Mon Jun 30 14:56:39 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 30 Jun 2008 15:56:39 +0100 Subject: [Biopython-dev] Bug 2531 - Bio.Nexus problem with file with repeated id Message-ID: <320fb6e00806300756l7e9f6fe6sc68cf1884cb2994@mail.gmail.com> Hi Frank, Would you be able to take a look at this new report, bug 2531: http://bugzilla.open-bio.org/show_bug.cgi?id=2531 The reporter Andrea Betancourt says she is using Biopython 1.44, while I am on CVS (which should be equivalent to Biopython 1.46 for Bio.Nexus). Her reported symptoms and what I see are different... but she has provided a test file to work from. 
Thanks, Peter From p.j.a.cock at googlemail.com Mon Jun 30 15:00:22 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 30 Jun 2008 16:00:22 +0100 Subject: [Biopython-dev] New release In-Reply-To: <97693.82874.qm@web62401.mail.re1.yahoo.com> References: <97693.82874.qm@web62401.mail.re1.yahoo.com> Message-ID: <320fb6e00806300800rd74082eqabbd1a2bef66da76@mail.gmail.com> On Mon, Jun 30, 2008 at 3:55 PM, Michiel de Hoon wrote: > Sorry, but I still can't edit the Biopython wiki pages, so I can't make the new > release available. Can other people edit these pages? No - as soon as I saw the wiki came back to life last night I tried, and have tried again today. I can make changes, view the preview and differences, but I just get a blank page when I click submit. I sent off an email to OBF to alert them in case you hadn't. I see the Biopython 1.46 files themselves are now online at http://biopython.org/DIST/ so at least some of the web-server is running properly :) We could just do the announcement by email and the news page, and fix the wiki later. But it does risk causing a little confusion in the short term. Peter From bugzilla-daemon at portal.open-bio.org Mon Jun 30 15:36:17 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 30 Jun 2008 11:36:17 -0400 Subject: [Biopython-dev] [Bug 2531] Nexus and fasta parsers have a problem with identical taxa names In-Reply-To: Message-ID: <200806301536.m5UFaHlo004669@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2531 ------- Comment #5 from abetanco at staffmail.ed.ac.uk 2008-06-30 11:36 EST ------- The file was written by a Windows program called DNAsp (http://www.ub.es/dnasp/), which is widely used by population geneticists, which is not to say that it didn't write an invalid file. But it looked OK to me, other than the too short taxa names. (Those too short names were inherited from another program). 
I don't mind you using it for the unit tests, but it would be nice if it were cut down or something, as it is both unwieldy and unpublished data. A. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 30 15:38:00 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 30 Jun 2008 11:38:00 -0400 Subject: [Biopython-dev] [Bug 2531] Nexus and fasta parsers have a problem with identical taxa names In-Reply-To: Message-ID: <200806301538.m5UFc0S4004813@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2531 fkauff at biologie.uni-kl.de changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |ASSIGNED ------- Comment #6 from fkauff at biologie.uni-kl.de 2008-06-30 11:38 EST ------- Handling a handle works like a charm for me with the attachment provided: >>> handle=open('eg.nex') >>> n=Nexus.Nexus(handle) >>> n.matrix.keys() ['HI99.Line5.copy', 'am', 'HI99.Line1.copy', 'ezo', 'HI99.Line0.copy', 'DI05.Line5.copy', 'DI05.Line0.copy', 'DI05.Line8.copy1', 'DI05.Line1.copy1', 'HI99.Line3.copy', 'HI99.Line1.copy1', 'DI05.Line1.copy', 'DI05.Line9.copy', 'DI05.Line8.copy', 'HI99.Line4.copy', 'vir', 'DI05.Line8', 'DI05.Line9', 'HI99.Line2.copy', 'DI05.Line2', 'DI05.Line3', 'DI05.Line0', 'DI05.Line1', 'DI05.Line6', 'DI05.Line7', 'DI05.Line4', 'DI05.Line5', 'HI99.Line1', 'HI99.Line0', 'HI99.Line3', 'HI99.Line2', 'HI99.Line5', 'HI99.Line4'] However, Nexus.py needs unique taxon names. Non-unique taxon names won't make much sense in a nexus file imho. If Nexus.py encounters non-unique names, they are made unique by adding a suffix (.copy, .copy1, ...) to them. Could this cause problems for SeqIO.NexusIO? 
Frank -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 30 16:12:29 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 30 Jun 2008 12:12:29 -0400 Subject: [Biopython-dev] [Bug 2531] Nexus and fasta parsers have a problem with identical taxa names In-Reply-To: Message-ID: <200806301612.m5UGCTnZ006531@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2531 ------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-30 12:12 EST ------- It looks like I didn't have the latest version of Bio.Nexus on this machine which may have added to the confusion. I've just updated to CVS (i.e. almost exactly Biopython 1.46). My issue with the matrix being None has gone away. Oops. >>> from Bio.Nexus import Nexus >>> n = Nexus.Nexus(open('eg.nex')) >>> n.matrix.keys() ['HI99.Line5.copy', 'am', 'HI99.Line1.copy', 'ezo', 'HI99.Line0.copy', 'DI05.Line5.copy', 'DI05.Line0.copy', 'DI05.Line8.copy1', 'DI05.Line1.copy1', 'HI99.Line3.copy', 'HI99.Line1.copy1', 'DI05.Line1.copy', 'DI05.Line9.copy', 'DI05.Line8.copy', 'HI99.Line4.copy', 'vir', 'DI05.Line8', 'DI05.Line9', 'HI99.Line2.copy', 'DI05.Line2', 'DI05.Line3', 'DI05.Line0', 'DI05.Line1', 'DI05.Line6', 'DI05.Line7', 'DI05.Line4', 'DI05.Line5', 'HI99.Line1', 'HI99.Line0', 'HI99.Line3', 'HI99.Line2', 'HI99.Line5', 'HI99.Line4'] >>> assert [id for id in n.matrix] == n.matrix.keys() >>> n.matrix['HI99.Line5'] Seq('ATCGATAGCATTGCGG-GGACGACGATGGACATTTGGAAAACGAATATGAAAAT...GAG', IUPACAmbiguousDNA()) >>> n.matrix['HI99.Line5'][249-1] 'T' >>> n.matrix['HI99.Line5'][417-1] 'T' >>> n.matrix['HI99.Line5'][452-1] 'A' >>> n.matrix['HI99.Line5.copy'] Seq('ATCGATAGCATTGCGGCGGACGACGATGGACATTTGGAAAACGAATATGAAAAT...GAG', IUPACAmbiguousDNA()) >>> n.matrix['HI99.Line5.copy'][249-1] 'C' 
>>> n.matrix['HI99.Line5.copy'][417-1] 'C' >>> n.matrix['HI99.Line5.copy'][452-1] 'G' So far this looks good. However: >>> n.original_taxon_order ['vir', 'am', 'ezo', 'DI05.Line5', 'DI05.Line1', 'DI05.Line9', 'DI05.Line2', 'DI05.Line3', 'HI99.Line2', 'HI99.Line1', 'HI99.Line5', 'DI05.Line4', 'DI05.Line1', 'DI05.Line7', 'HI99.Line3', 'DI05.Line6', 'DI05.Line8', 'HI99.Line4', 'DI05.Line1', 'HI99.Line1', 'DI05.Line8', 'DI05.Line5', 'HI99.Line2', 'HI99.Line0', 'HI99.Line0', 'HI99.Line5', 'DI05.Line9', 'HI99.Line3', 'DI05.Line0', 'DI05.Line0', 'HI99.Line4', 'HI99.Line1', 'DI05.Line8'] In the Bio.SeqIO code that calls Bio.Nexus, I hadn't realized that Bio.Nexus kept the un-edited taxon names around. It is this list of the non-unique original identifiers that Bio.SeqIO was using, which explains why you end up with two copies of HI99.Line5. Sorry Frank - I was pointing fingers when it was my own bug after all! -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 30 16:20:20 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 30 Jun 2008 12:20:20 -0400 Subject: [Biopython-dev] [Bug 2531] Nexus and fasta parsers have a problem with identical taxa names In-Reply-To: Message-ID: <200806301620.m5UGKK7M007026@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2531 ------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-30 12:20 EST ------- Frank, Looking back, the reason I was using the original_taxon_order list was I wanted to get the sequences in their original order. I see now that I can't use the elements in this list as keys to the matrix because the matrix keys are the modified taxon names. Is there any way to get the modified taxon names in the original order? 
Other than looping over original_taxon_order and repeating your naming algorithm? Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 30 17:07:05 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 30 Jun 2008 13:07:05 -0400 Subject: [Biopython-dev] [Bug 2531] Nexus and fasta parsers have a problem with identical taxa names In-Reply-To: Message-ID: <200806301707.m5UH75I7009356@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2531 ------- Comment #9 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-30 13:07 EST ------- Created an attachment (id=957) --> (http://bugzilla.open-bio.org/attachment.cgi?id=957&action=view) Sample input file Simple example file without a TAXA block Second example file to follow -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon Jun 30 17:22:23 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 30 Jun 2008 13:22:23 -0400 Subject: [Biopython-dev] [Bug 2531] Nexus and fasta parsers have a problem with identical taxa names In-Reply-To: Message-ID: <200806301722.m5UHMNo4010009@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2531 ------- Comment #10 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-30 13:22 EST ------- Created an attachment (id=958) --> (http://bugzilla.open-bio.org/attachment.cgi?id=958&action=view) Second example file Using the first file where there is no TAXA block: >>> from Bio.Nexus import Nexus >>> n = Nexus.Nexus(open('dup_names.nex')) >>> print n.matrix.keys() ['CATH_HUMAN', 'CYS1_DICDI', 'CYS1_DICDI.copy', 'ALEU_HORVU'] >>> print n.original_taxon_order ['CYS1_DICDI', 'ALEU_HORVU', 'CATH_HUMAN', 'CYS1_DICDI.copy'] Then with a TAXA block, >>> n2 = Nexus.Nexus(open('dup_names2.nex')) >>> print n2.matrix.keys() ['CATH_HUMAN', 'CYS1_DICDI', 'CYS1_DICDI.copy', 'ALEU_HORVU'] >>> print n2.original_taxon_order ['CYS1_DICDI', 'ALEU_HORVU', 'CATH_HUMAN', 'CYS1_DICDI'] Notice the different behaviour of the original_taxon_order list. In the first case it gets the modified names, in the second case it doesn't. Is this deliberate Frank? On the other hand, maybe Nexus files without a TAXA block are rare in real life? Are they? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From fkauff at biologie.uni-kl.de Mon Jun 30 17:10:15 2008 From: fkauff at biologie.uni-kl.de (Frank Kauff) Date: Mon, 30 Jun 2008 19:10:15 +0200 Subject: [Biopython-dev] [Bug 2531] Nexus and fasta parsers have a problem with identical taxa names In-Reply-To: <200806301612.m5UGCTnZ006531@portal.open-bio.org> References: <200806301612.m5UGCTnZ006531@portal.open-bio.org> Message-ID: <48691377.803@biologie.uni-kl.de> bugzilla-daemon at portal.open-bio.org wrote: > > > In the Bio.SeqIO code that calls Bio.Nexus, I hadn't realized that Bio.Nexus > kept the un-edited taxon names around. It is this list of the non-unique > original identifiers that Bio.SeqIO was using, which explains why you end up > with two copies of HI99.Line5. > > Sorry Frank - I was pointing fingers when it was my own bug after all! > > > Looking back, the reason I was using the original_taxon_order list was I wanted > to get the sequences in their original order. I see now that I can't use the > elements in this list as keys to the matrix because the matrix keys are the > modified taxon names. > > Is there any way to get the modified taxon names in the original order? Other > than looping over original_taxon_order and repeating your naming algorithm? > Actually - this *IS* a bug. All fingers were pointing correctly... original_taxon_order was kept just for compatibility, and is the same as taxlabels. Taxlabels is supposed to have the unique identifiers - it just doesn't work correctly with non-unique ids in interleaved data sets. 
Fix following soon Frank From bugzilla-daemon at portal.open-bio.org Mon Jun 30 17:28:25 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 30 Jun 2008 13:28:25 -0400 Subject: [Biopython-dev] [Bug 2531] Nexus and fasta parsers have a problem with identical taxa names In-Reply-To: Message-ID: <200806301728.m5UHSPVk010377@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2531 ------- Comment #11 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-30 13:28 EST ------- Created an attachment (id=959) --> (http://bugzilla.open-bio.org/attachment.cgi?id=959&action=view) Tentative patch to Bio/SeqIO/NexusIO.py This seems to cope with Andrea's real input file and my two hand written ones. It works by taking the original_taxon_order lists, and applying the disambiguation algorithm if needed. Not very elegant! -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 30 19:29:32 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 30 Jun 2008 15:29:32 -0400 Subject: [Biopython-dev] [Bug 2531] Nexus and fasta parsers have a problem with identical taxa names In-Reply-To: Message-ID: <200806301929.m5UJTWYQ015982@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2531 ------- Comment #12 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-30 15:29 EST ------- Created an attachment (id=960) --> (http://bugzilla.open-bio.org/attachment.cgi?id=960&action=view) Suggested patch to Bio/Nexus/Nexus.py This modifies Bio.Nexus to ensure that the original_taxon_order uses the original (duplicated) names, resolving the discrepancy I reported in comment 10. 
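The approach described in comment 11 (walk the original names in order and apply the renaming only where needed) can be sketched roughly as follows. This is not the attached patch; the function name is hypothetical, and the suffix scheme (.copy, .copy1, and by extension .copy2 and so on) is inferred from the matrix keys shown earlier in this bug.

```python
def unique_names_in_order(original_names):
    """Return the names in their original order, renaming repeats.

    The first repeat of a name gets '.copy', later repeats get
    '.copy1', '.copy2', ... (assumed to mirror what Bio.Nexus does).
    """
    seen = set()
    result = []
    for name in original_names:
        new_name = name
        if new_name in seen:
            new_name = name + ".copy"
            counter = 1
            # Keep trying numbered suffixes until the name is unused.
            while new_name in seen:
                new_name = "%s.copy%d" % (name, counter)
                counter += 1
        seen.add(new_name)
        result.append(new_name)
    return result


names = ["CYS1_DICDI", "ALEU_HORVU", "CATH_HUMAN", "CYS1_DICDI"]
print(unique_names_in_order(names))
```

A list whose names are already unique, such as the one returned for the file without a TAXA block in comment 10, passes through unchanged, so the same function covers both cases.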
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 30 21:18:48 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 30 Jun 2008 17:18:48 -0400 Subject: [Biopython-dev] [Bug 2520] Reading ACE assembly contig files in Bio.SeqIO In-Reply-To: Message-ID: <200806302118.m5ULImoB021255@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2520 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-30 17:18 EST ------- Checked into CVS. We'll need to revisit this once we have a good way of dealing with per-letter-annotation which would be suitable for the quality scores. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 30 22:50:01 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 30 Jun 2008 18:50:01 -0400 Subject: [Biopython-dev] [Bug 2532] New: Using IUPAC alphabets in mixed case Seq objects Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2532 Summary: Using IUPAC alphabets in mixed case Seq objects Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk Bio.Alphabet.IUPAC defines a number of alphabets with defined lists of valid letters which are in upper case ONLY. 
Bio.Nexus and Bio.Sequencing.Phd create Seq objects which use these alphabets even with mixed case sequences. This contradicts how I think the alphabet's .letters property is intended to be used (although currently this is not enforced by the Seq object). I suggest either: (a) Bio.Nexus etc switch to using generic DNA/RNA alphabets for any Seq objects including lower case letters (or more simply, all Seq objects). (b) We add lower case and mixed case variants of the alphabet objects, and use the mixed case IUPAC alphabets in Bio.Nexus etc for the Seq objects. There is also the option of (c) Extend the existing upper case only IUPAC alphabets to include lower case too, but I fear this could have unexpected side effects (e.g. where people are looping over the expected set of letters). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 30 22:51:17 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 30 Jun 2008 18:51:17 -0400 Subject: [Biopython-dev] [Bug 2532] Using IUPAC alphabets in mixed case Seq objects In-Reply-To: Message-ID: <200806302251.m5UMpHBf024519@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2532 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-30 18:51 EST ------- Created an attachment (id=961) --> (http://bugzilla.open-bio.org/attachment.cgi?id=961&action=view) Patch to Bio.Sequencing.Phd This takes the simple route of using a generic DNA alphabet. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
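The mismatch reported in Bug 2532 can be illustrated with plain strings. This is a sketch under stated assumptions: the constant below stands in for the .letters property of an upper case only IUPAC DNA alphabet (assumed here to be "GATC"), and the helper function is hypothetical, not part of Biopython.

```python
# Stand-in for the .letters of an upper case only DNA alphabet
# (assumed value; the real alphabet objects live in Biopython).
UPPER_CASE_DNA_LETTERS = "GATC"


def letters_outside_alphabet(sequence, letters):
    """Return the sorted letters of sequence that are not in letters."""
    return sorted(set(sequence) - set(letters))


# A mixed case sequence of the kind Bio.Nexus or Bio.Sequencing.Phd
# might produce: the lower case letters fall outside the alphabet.
print(letters_outside_alphabet("ACGTacgt", UPPER_CASE_DNA_LETTERS))

# An upper case sequence is fully covered.
print(letters_outside_alphabet("ACGT", UPPER_CASE_DNA_LETTERS))
```

A check along these lines would reject mixed case sequences under option (c) only if the letters were left upper case, which is why options (a) and (b) change which alphabet is attached rather than the letters themselves.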