From bugzilla-daemon at portal.open-bio.org Mon Jun 2 04:19:50 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 2 Jun 2008 04:19:50 -0400
Subject: [Biopython-dev] [Bug 2502] PSIBlastParser fails with blastpgp 2.2.18 though works with blastpgp 2.2.15
In-Reply-To:
Message-ID: <200806020819.m528JoXn006809@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2502

------- Comment #19 from ibdeno at gmail.com 2008-06-02 04:19 EST -------
Thank you, Peter. In principle, I don't use that information. I will try then
with the XML parser.

Cheers,
Miguel

(In reply to comment #18)

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Mon Jun 2 04:49:55 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 2 Jun 2008 04:49:55 -0400
Subject: [Biopython-dev] [Bug 2502] PSIBlastParser fails with blastpgp 2.2.18 though works with blastpgp 2.2.15
In-Reply-To:
Message-ID: <200806020849.m528ntdY008609@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2502

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #20 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-02 04:49 EST -------
Marking this bug as fixed.

The original report was about parsing the plain text output which is fixed -
see comment 12, and Bio/Blast/NCBIStandalone.py CVS revision 1.72. I have not
added the 2.2.18 plain text file as a unit test since it's over 750kb.

For the XML output from 2.2.18, as far as I can tell we are not ignoring any
important information from PSI-BLAST, as it is simply not included. If the NCBI
updates the XML output from blastpgp then we should revisit the XML parsing.

Thank you Miguel for your report and assistance.

From bugzilla-daemon at portal.open-bio.org Mon Jun 2 06:37:51 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 2 Jun 2008 06:37:51 -0400
Subject: [Biopython-dev] [Bug 2503] An error when parsing NCBIWWW Blast output
In-Reply-To:
Message-ID: <200806021037.m52Abpj9019177@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2503

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-02 06:37 EST -------
Dear Prashanth,

Unless you can provide some more information, I'm going to have to close Bug
2503, as you haven't given us enough to go on.

Peter
From bugzilla-daemon at portal.open-bio.org Mon Jun 2 08:57:20 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 2 Jun 2008 08:57:20 -0400
Subject: [Biopython-dev] [Bug 1944] Align.Generic adding iterator and more
In-Reply-To:
Message-ID: <200806021257.m52CvKt4026676@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1944

------- Comment #14 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-02 08:57 EST -------
I've added simple __str__ and __repr__ methods to the alignment class in
Bio/Align/Generic.py CVS revision 1.8, which give output like this:

str(a):

DNAAlphabet() alignment with 3 rows and 14 columns
ACGATCAGCTAGCT Alpha
CCGATCAGCTAGCT Beta
ACGATGAGCTAGCT Gamma

repr(a):

<__main__.Alignment instance (3 records of length 14, DNAAlphabet()) at 9e96c2c>

The string output gets truncated to show a maximum of 20 rows and 50 columns,
which allowing for typical identifiers will still display nicely on a default
terminal.

I now intend to update the tutorial, as being able to print an alignment
should make it much easier to explain and get to grips with.

Note that there is still some interesting code in both attachment 732 (the
__getitem__ method) and in attachment 770 (e.g. subclassing list and adding
__len__, __add__, __radd__ etc).
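
A rough sketch of the truncation rule described in comment #14 (at most 20
rows and 50 columns). This is not the code from Bio/Align/Generic.py; the
function name format_alignment, the (identifier, sequence) row layout and the
use of "..." as a truncation marker are all assumptions made for illustration,
and it is written for Python 3 rather than the Python 2 of the thread.

```python
def format_alignment(rows, alphabet_name="DNAAlphabet()", max_rows=20, max_cols=50):
    """Return a printable summary of an alignment.

    rows is a list of (identifier, sequence) pairs of equal sequence length.
    Hypothetical helper: the real __str__ method may truncate differently.
    """
    n_rows = len(rows)
    n_cols = len(rows[0][1]) if rows else 0
    lines = ["%s alignment with %i rows and %i columns"
             % (alphabet_name, n_rows, n_cols)]
    for identifier, seq in rows[:max_rows]:
        if len(seq) > max_cols:
            # Keep the shown text at max_cols characters, marker included.
            shown = seq[:max_cols - 3] + "..."
        else:
            shown = seq
        lines.append("%s %s" % (shown, identifier))
    if n_rows > max_rows:
        # Signal that further rows were left out.
        lines.append("...")
    return "\n".join(lines)


if __name__ == "__main__":
    example = [("Alpha", "ACGATCAGCTAGCT"),
               ("Beta", "CCGATCAGCTAGCT"),
               ("Gamma", "ACGATGAGCTAGCT")]
    print(format_alignment(example))
```

Putting the sequence before the identifier, as in the output quoted above,
keeps the columns lined up regardless of how long the identifiers are.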
From bugzilla-daemon at portal.open-bio.org Mon Jun 2 09:26:28 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 2 Jun 2008 09:26:28 -0400
Subject: [Biopython-dev] [Bug 2507] New: Adding __getitem__ to SeqRecord for element access and slicing
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2507

           Summary: Adding __getitem__ to SeqRecord for element access and
                    slicing
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk
OtherBugsDependingOnThis: 1944

With a Seq object, you can access individual letters and create sub-sequences
using slicing. You can even use a stride to reverse the sequence, or select
every third letter.

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
>>> print my_seq
GATCGATGGGCCTATATAGGATCGAAAATCGC
>>> my_seq
Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())
>>> my_seq[5:10]
Seq('ATGGG', IUPACUnambiguousDNA())
>>> my_seq[::-1]
Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG', IUPACUnambiguousDNA())
>>> my_seq[5]
'A'

Currently, these operations cannot be done with a SeqRecord object.

This enhancement bug is to allow element access and slicing (perhaps even
with a stride) on SeqRecord objects, where the annotations are taken into
consideration, and preserved as far as reasonably possible.

Looking at the different SeqRecord properties, this is what I think should
happen for creating a sub-sequence:

.id, .name, .description (three strings) - preserve?
Blindly preserving these may not always be meaningful. For example, if the
description was "Complete plasmid" then it doesn't really apply to a
sub-sequence. Perhaps we should preserve only the id and name, and set the
description to "sub-sequence"?

.annotations (dictionary) - either preserve or lose?
Some annotation entries will still be valid for a sub-sequence (e.g. "source"
or references). Others will not (e.g. anything describing its coordinates
within a larger parent sequence). There is no reliable way to decide on a case
by case basis.

.dbxrefs (list of strings) - preserve?
Any database cross-references would arguably still apply to a sub-sequence or
even a reversed sequence.

.features (list of SeqFeatures) - select only those features still in the new
sub-sequence, and adjust their locations for the new coordinates. Supporting
strides other than +1 would be complicated! For simplicity, I would say any
feature only partially within the sub-sequence should be discarded.

In summary, one clearly defined set of actions on creating a sub-sequence
could be to preserve all the annotation data except the SeqFeatures which would
be handled sensibly.

[If we later support "per-letter-annotation" in either a Seq or SeqRecord
subclass, then this too should be sliced]

Adding a __getitem__ method to the SeqRecord as outlined above should be
compatible with the suggestion that the SeqRecord subclasses the Seq object
(see bug 2351).

A related point, when accessing single letters, e.g. record[0], should a single
letter string be returned (which lacks any annotation) as currently happens
with the Seq object?

P.S. I'm marking this new enhancement bug as blocking bug 1944. Once SeqRecord
objects support slicing, this would make annotation preserving slicing of
alignment objects much more straightforward.
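
A minimal sketch of the feature rule proposed in the report above: keep only
features that lie entirely inside the slice, and shift their coordinates. It
does not use the real SeqFeature class; features are modelled as hypothetical
(name, start, end) tuples with Python-style half-open coordinates, and only
non-negative bounds with a stride of +1 are handled.

```python
def slice_features(features, start, stop):
    """Select and re-coordinate features for the sub-sequence [start:stop].

    features is a list of (name, feat_start, feat_end) tuples (hypothetical
    stand-ins for SeqFeature objects). Features only partially inside the
    slice are discarded, as suggested in the bug report.
    """
    kept = []
    for name, feat_start, feat_end in features:
        if start <= feat_start and feat_end <= stop:
            # Shift so that position `start` of the parent becomes 0.
            kept.append((name, feat_start - start, feat_end - start))
    return kept


if __name__ == "__main__":
    parent_features = [("gene", 2, 8), ("partial", 0, 4), ("tail", 9, 12)]
    print(slice_features(parent_features, 2, 10))
```

Negative indices and other strides would need the slice normalised first
(for example with slice.indices(len(seq))), which is where the "complicated"
cases mentioned above begin.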
From bugzilla-daemon at portal.open-bio.org Mon Jun 2 09:26:33 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 2 Jun 2008 09:26:33 -0400
Subject: [Biopython-dev] [Bug 1944] Align.Generic adding iterator and more
In-Reply-To:
Message-ID: <200806021326.m52DQXk2029561@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1944

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  BugsThisDependsOn|                            |2507

From bugzilla-daemon at portal.open-bio.org Mon Jun 2 10:00:15 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 2 Jun 2008 10:00:15 -0400
Subject: [Biopython-dev] [Bug 2507] Adding __getitem__ to SeqRecord for element access and slicing
In-Reply-To:
Message-ID: <200806021400.m52E0FJK032027@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2507

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-02 10:00 EST -------
Simple implementation which ignores the features (non-trivial), to be added to
the SeqRecord class in Bio/SeqRecord.py:

    def __getitem__(self, index) :
        if isinstance(index, int) :
            #TODO - Should single letters be returned as just
            #strings? This prevents the inclusion of any annotation.
            #Revisit this once the Seq object is a subclass of string.
            return self.seq[index]
        elif isinstance(index, slice) :
            answer = self.__class__(self.seq[index],
                                    id=self.id,
                                    name=self.name,
                                    description=self.description)
            #COPY the annotation dict and dbxrefs list:
            answer.annotations = dict(self.annotations.iteritems())
            answer.dbxrefs = self.dbxrefs[:]
            #TODO - select relevant features, and add them with
            #adjusted coordinates. Take special care with a stride!
            return answer
        raise ValueError, "Invalid index"

From bugzilla-daemon at portal.open-bio.org Mon Jun 2 10:12:29 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 2 Jun 2008 10:12:29 -0400
Subject: [Biopython-dev] [Bug 2507] Adding __getitem__ to SeqRecord for element access and slicing
In-Reply-To:
Message-ID: <200806021412.m52ECT86000330@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2507

------- Comment #2 from jblanca at btc.upv.es 2008-06-02 10:12 EST -------
Does this mean that SeqRecord would deprecate the .seq attribute?
If the .seq attribute is not removed, slicing could be used in both ways:
my_seq[1:100] and my_seq.seq[1:100].

From biopython at maubp.freeserve.co.uk Mon Jun 2 10:14:40 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 2 Jun 2008 15:14:40 +0100
Subject: [Biopython-dev] sequence class proposal
In-Reply-To: <1211779470.483a498e18e3e@webmail.upv.es>
References: <320fb6e00805251437n34362f0bm2a323cd1194afaa@mail.gmail.com>
	<1211779470.483a498e18e3e@webmail.upv.es>
Message-ID: <320fb6e00806020714s2c789f61ke676a448e2ec871a@mail.gmail.com>

In reply to Jose, I (Peter) wrote:
>> One of your points seemed to be that the SeqRecord couldn't have a
>> __getitem__ and methods like reverse, complement, etc. I don't see
>> why it couldn't have these. Perhaps rather than introducing a whole
>> new class, enhancing the SeqRecord would be a better avenue.

I've filed Bug 2507 to try and show what I had in mind for the __getitem__ method.
http://bugzilla.open-bio.org/show_bug.cgi?id=2507

Adding further methods for (reverse) complement etc could be done in
much the same way.

Returning to extending Biopython to support per-letter-annotation, I
can see two options:

Right now, the SeqRecord object HAS a Seq object. If we create a new
RichSeq which subclasses the Seq object to provide
per-letter-annotation, then you could use a SeqRecord where the .seq
property is in fact a RichSeq object. The SeqRecord class doesn't
need to have any changes made for this to work (assuming the RichSeq
provides the same API as the Seq object).

If we make the SeqRecord a subclass of the Seq object, then I would
suggest either RichSeq subclassing SeqRecord subclassing Seq, or
perhaps SeqRecord subclassing RichSeq subclassing Seq. It depends on
if you think the id/name/description/dbxrefs/etc properties would be
useful in common use cases of the RichSeq object.

It's not going to be possible for all three classes to have the same
__init__ parameters without breaking existing scripts (and only
supporting the lowest common denominator).

Peter

From jblanca at btc.upv.es Mon Jun 2 15:11:19 2008
From: jblanca at btc.upv.es (Blanca Postigo Jose Miguel)
Date: Mon, 2 Jun 2008 21:11:19 +0200
Subject: [Biopython-dev] Fwd: Re: sequence class proposal
Message-ID: <1212433879.484445d7a6117@webmail.upv.es>

----- Forwarded message from Blanca Postigo Jose Miguel -----
    Date: Mon, 2 Jun 2008 21:08:59 +0200
    From: Blanca Postigo Jose Miguel
Reply-To: Blanca Postigo Jose Miguel
 Subject: Re: [Biopython-dev] sequence class proposal
      To: Peter

Quoting Peter:

> In reply to Jose, I (Peter) wrote:
> >> One of your points seemed to be that the SeqRecord couldn't have a
> >> __getitem__ and methods like reverse, complement, etc. I don't see
> >> why it couldn't have these. Perhaps rather than introducing a whole
> >> new class, enhancing the SeqRecord would be a better avenue.
>
> I've filed Bug 2507 to try and show what I had in mind for the
> __getitem__ method.
> http://bugzilla.open-bio.org/show_bug.cgi?id=2507

I think that would be great. I've just added to the bug a question about the
.seq property of SeqRecord.

> Adding further methods for (reverse) complement etc could be done in
> much the same way.
>
> Returning to extending Biopython to support per-letter-annotation, I
> can see two options:
>
> Right now, the SeqRecord object HAS a Seq object. If we create a new
> RichSeq which subclasses the Seq object to provide
> per-letter-annotation, then you could use a SeqRecord where the .seq
> property is in fact a RichSeq object. The SeqRecord class doesn't
> need to have any changes made for this to work (assuming the RichSeq
> provides the same API as the Seq object).

Here I had a slightly different idea, but maybe yours is better. Basically my
RichSeq proposal is just a RichSeq with slicing and without the seq property.
The problem with the approach that you describe is that the RichSeq should have
the per-letter-annotation, so SeqRecord would have a general annotation and
RichSeq (in the .seq) would have other features. I would find that confusing.

>
> If we make the SeqRecord a subclass of the Seq object, then I would
> suggest either RichSeq subclassing SeqRecord subclassing Seq, or
> perhaps SeqRecord subclassing RichSeq subclassing Seq. It depends on
> if you think the id/name/description/dbxrefs/etc properties would be
> useful in common use cases of the RichSeq object.

If SeqRecord is a subclass of Seq, RichSeq is not necessary anymore. That's
what I was proposing. The problem is that the current users of SeqRecord would
have a hard time with the new behaviour, because in that case supporting the
seq property would be hard. To avoid that breakage I was proposing to create
RichSeq. RichSeq would be just the SeqRecord that you propose but would allow
the users to migrate to RichSeq without forcing them to change to a new
SeqRecord behaviour.

>
> Its not going to be possible for all three classes to have the same
> __init__ parameters without breaking existing scripts (and only
> supporting the lowest common denominator).

That's another reason to rename your new proposed SeqRecord to RichSeq.

>
> Peter
>

Jose Blanca

--

----- End of forwarded message -----

--

From biopython at maubp.freeserve.co.uk Mon Jun 2 15:51:30 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 2 Jun 2008 20:51:30 +0100
Subject: [Biopython-dev] Fwd: Re: sequence class proposal
In-Reply-To: <1212433879.484445d7a6117@webmail.upv.es>
References: <1212433879.484445d7a6117@webmail.upv.es>
Message-ID: <320fb6e00806021251q6cc1a7e8p36125c1326ab7a14@mail.gmail.com>

Jose wrote:
> > I've filed Bug 2507 to try and show what I had in mind for the
> > __getitem__ method.
> > http://bugzilla.open-bio.org/show_bug.cgi?id=2507
>
> I think that would be great.

Good :) Does anyone else want to comment?

> I've just added to the bug a question about the .seq property of SeqRecord.

http://bugzilla.open-bio.org/show_bug.cgi?id=2507#c2 reads:
> Does this means that SeqRecord would deprecate the .seq attribute?
> If the .seq attribute is not removed slicing could be used in it like:
> my_seq[1:100] and my_seq.seq[1:100].

I was not intending to deprecate the SeqRecord's .seq property at this
time (I think that should happen in preparation for if/when the
SeqRecord becomes a subclass of the Seq object).
With my idea described on bug 2507, given a SeqRecord object my_seq_record:

my_seq_record[1:100] -> another SeqRecord (with annotation)
my_seq_record.seq[1:100] -> just a Seq object (no annotation)
my_seq_record.seq.tostring()[1:100] -> just a string (no annotation or alphabet)
str(my_seq_record.seq)[1:100] -> just a string (no annotation or alphabet)

These trivial examples would all "contain" the same sequence string.

This enhancement could be done right now, and shouldn't impede any
future per-letter-annotation enhancements.

Perhaps per-letter-annotation enhancements could be added to the
SeqRecord class directly... I need to fully digest the discussion on
the BioSQL list, see:
http://lists.open-bio.org/pipermail/biosql-l/2008-May/thread.html

Peter

From mjldehoon at yahoo.com Mon Jun 2 20:19:59 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 2 Jun 2008 17:19:59 -0700 (PDT)
Subject: [Biopython-dev] Bio.Entrez & Bio.EUtil
In-Reply-To: <320fb6e00805300717v60f0b153i88b5e9a8aee1744c@mail.gmail.com>
Message-ID: <624249.42121.qm@web62408.mail.re1.yahoo.com>

OK I'll double-check. I may not have noticed some missing DTDs if they were
downloaded automatically from the internet.
I think Biopython should ship the most common DTDs. At least the ones needed
for test_Entrez, which probably covers most of the use cases of Bio.Entrez.

--Michiel.

Peter wrote:
On 24 May 2008, Michiel de Hoon wrote:
> Dear all,
>
> I have essentially completed the parser in Bio.Entrez.

The internals of the new design look more complicated to start with,
but I can see how much more general it is than the older versions :)

Should it work starting from an empty DTDs folder - or will we ship
Biopython with most of the current files?

I've had trouble with Biopython trying to fetch missing DTD files from
the internet. I think the problem is the NCBI using relative URLs.
The following quick hack seems to help in Parser.py but only in some
cases (because as listed below, the NCBI have two different base
paths):

279,280c279,288
<     warnings.warn("DTD file %s not found in Biopython installation; trying to retrieve it from NCBI" % filename)
<     handle = urllib.urlopen(systemId)
---
>     warnings.warn("DTD file %s not found in Biopython installation; trying to retrieve it from NCBI" % path)
>     if "/" in systemId :
>         #Assume this is a full path, e.g.
>         #http://www.ncbi.nlm.nih.gov/entrez/query/DTD/nlmmedline_080101.dtd
>         handle = urllib.urlopen(systemId)
>     else :
>         #Its a relative path, and I'm not sure how to best get the base path:
>         handle = urllib.urlopen("http://www.ncbi.nlm.nih.gov/entrez/query/DTD/"+systemId)

(Also note there seem to be some tab/space issues in this file).

From http://www.ncbi.nlm.nih.gov/entrez/query/DTD/ I've downloaded the
following files using wget:

egquery.dtd       eSearch_020511.dtd   nlmcommon_080101.dtd           pubmed_080101.dtd
eInfo_020511.dtd  eSpell.dtd           nlmmedline_080101.dtd          taxon.dtd
eLink_020511.dtd  eSummary_041029.dtd  nlmmedlinecitation_080101.dtd  uilist.dtd
ePost_020511.dtd  nlmsharedcatcit_080101.dtd

Additionally http://www.ncbi.nlm.nih.gov/dtd/ provided some further
XML files needed for the test_Entrez.py unit test:

NCBI_GBSeq.dtd
NCBI_GBSeq.mod.dtd
NCBI_Entity.mod.dtd
NCBI_Mim.dtd
NCBI_Mim.mod.dtd

With all the above files, then the unit test file test_Entrez.py
doesn't give any missing DTD warnings - but still has a couple of
failures.
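
One possible way to tidy up the relative-versus-absolute check in the quick
hack above is to let the standard library resolve the system identifier
against a base URL. This is only a sketch for Python 3 (urllib.parse), not the
code in Parser.py; the names DEFAULT_DTD_BASE and resolve_dtd_url are made up
here, and it does not solve the open question of which of the two NCBI base
paths applies to a given file.

```python
from urllib.parse import urljoin

# Assumed default; the thread notes that NCBI uses a second base path as well.
DEFAULT_DTD_BASE = "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/"


def resolve_dtd_url(system_id, base=DEFAULT_DTD_BASE):
    """Return an absolute URL for a DTD system identifier.

    Absolute identifiers are returned unchanged; relative ones are
    resolved against `base` (hypothetical helper, not Biopython API).
    """
    return urljoin(base, system_id)


if __name__ == "__main__":
    print(resolve_dtd_url("eSearch_020511.dtd"))
    print(resolve_dtd_url("http://www.ncbi.nlm.nih.gov/dtd/NCBI_GBSeq.dtd"))
```

A caller could try each known base in turn for relative identifiers and only
warn once all of them have failed.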
Peter

From bugzilla-daemon at portal.open-bio.org Tue Jun 3 00:39:24 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 3 Jun 2008 00:39:24 -0400
Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects
In-Reply-To:
Message-ID: <200806030439.m534dOYI021682@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2454

------- Comment #4 from mdehoon at ims.u-tokyo.ac.jp 2008-06-03 00:39 EST -------
I agree that type checking is a problem.
I am not sure if a specialized function in Bio.File is a good idea. The
question is not if "this object is a file-like object", but "does this object
have the attributes/methods needed". So I would prefer to add checks only for
the required attributes/methods in each of the iterators.

From mjldehoon at yahoo.com Tue Jun 3 00:33:27 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 2 Jun 2008 21:33:27 -0700 (PDT)
Subject: [Biopython-dev] Bio.Entrez & Bio.EUtil
In-Reply-To: <624249.42121.qm@web62408.mail.re1.yahoo.com>
Message-ID: <112249.61498.qm@web62410.mail.re1.yahoo.com>

I checked but I did not see any missing DTDs. Most of the DTDs in the list you
sent are in Biopython's CVS under Bio/Entrez/DTDs, and are included correctly
if I do a fresh checkout from CVS. Maybe you could try with a fresh checkout?

--Michiel.

_______________________________________________
Biopython-dev mailing list
Biopython-dev at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython-dev

From bugzilla-daemon at portal.open-bio.org Tue Jun 3 05:16:48 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 3 Jun 2008 05:16:48 -0400
Subject: [Biopython-dev] [Bug 2446] Comments in CT tags cause Bio.Sequencing.Ace.ACEParser to fail.
In-Reply-To:
Message-ID: <200806030916.m539GmwZ001955@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2446

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-03 05:16 EST -------
As pointed out on the mailing list, the test cases attached to this bug have
disappeared (some expiry issue?).

In the mean time, we could probably just edit the sole existing test case in
Tests/Ace/contig1.ace to add a comment to an existing CT tag. Looking at this
file, for example edit:

CT{
Contig1 repeat phrap 52 53 555456:555432
This is the forst line of comment for c1
and this the second for c1
}

to become:

CT{
Contig1 repeat phrap 52 53 555456:555432
COMMENT{
This is the first line of comment for c1
and this the second for c1}
}

In the short term, we could either ignore the COMMENT tags within a CT tag, or
just treat them as plain text. Supporting the nested structure within the
current parser would require changes to the current Record structure.
From bugzilla-daemon at portal.open-bio.org Tue Jun 3 07:46:58 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 3 Jun 2008 07:46:58 -0400
Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects
In-Reply-To:
Message-ID: <200806031146.m53BkwAB009224@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2454

------- Comment #5 from cracka80 at gmail.com 2008-06-03 07:46 EST -------
(In reply to comment #4)
> I agree that type checking is a problem.
> I am not sure if a specialized function in Bio.File is a good idea. The
> question is not if "this object is a file-like object", but "does this object
> have the attributes/methods needed". So I would prefer to add checks only for
> the required attributes/methods in each of the iterators.
>

The function I have written does exactly this - it checks for the necessary
attributes and methods for a given object. The iterators would then only need
to call ``File.is_filelike()`` on each object passed into them, rather than a
type checking procedure. This is in accordance with the design pattern "Program
to an 'interface', not an 'implementation'." (Gang of Four).

Would you like me to provide a diff against the current revision of Biopython,
with suggested changes?
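
The function attached to the bug is not shown in this thread, so the following
is only a guess at what such a duck-typing check might look like, written for
Python 3. The name is_filelike comes from the comment above; the `required`
parameter and its default are assumptions added so that each iterator can ask
only for the methods it actually uses.

```python
import io


def is_filelike(obj, required=("read", "readline")):
    """Return True if obj provides every method named in `required`.

    No isinstance() check is made: any object with callable attributes
    of the right names is accepted (hypothetical signature).
    """
    return all(callable(getattr(obj, name, None)) for name in required)


if __name__ == "__main__":
    print(is_filelike(io.StringIO(">seq1\nACGT\n")))
    print(is_filelike("not a handle"))
    # An iterator that only loops over lines could ask for just __iter__:
    print(is_filelike([">seq1", "ACGT"], required=("__iter__",)))
```

Letting the caller name the required methods is one way to reconcile the two
positions in comments #4 and #5: one shared helper, but per-iterator checks.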
From bugzilla-daemon at portal.open-bio.org Tue Jun 3 11:07:35 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 3 Jun 2008 11:07:35 -0400
Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects
In-Reply-To:
Message-ID: <200806031507.m53F7Zm7019694@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2454

------- Comment #6 from mdehoon at ims.u-tokyo.ac.jp 2008-06-03 11:07 EST -------
Two things:

1) Some of the code that does type checking for file-like-ness seems to be
quite old and possibly outdated (e.g. Gobase.Iterator). We should take this
opportunity to go through these modules and check if they are still useful.

2) Many of these modules (especially the ones that use an "Iterator" class)
would be written differently in modern Python (in particular by making use of a
generator function instead of an Iterator class).

So I'd like to suggest the following:

-) For the modules whose usability is dubious in 2008, let's check on the
mailing list if anybody is still using them. If not, we can simply deprecate
them.

-) For the modules that are still useful, use try/except clauses to check for
the necessary attributes. The current function checks for 'read', 'readline',
'readlines', and '__iter__', whereas the parser probably only needs one of
them.

-) If possible, I'd prefer to convert to modern Python as much as possible
(though formally that is not within the scope of this bug report).
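
To make the suggestions in comment #6 concrete, here is a small sketch of a
generator function that replaces an Iterator class and relies on a single
capability of the handle (iteration over lines), checked with try/except. It
is not taken from any Biopython module: iterate_records, the `marker`
parameter and the FASTA-like record layout are assumptions for illustration,
and the code targets Python 3.

```python
import io


def iterate_records(handle, marker=">"):
    """Yield records as lists of lines; a record starts at a `marker` line.

    The only requirement on `handle` is that it can be iterated over to
    give lines, so files, StringIO objects and plain lists all work.
    Because this is a generator, the check runs on the first next() call.
    """
    try:
        lines = iter(handle)
    except TypeError:
        raise TypeError("handle must support iteration over lines") from None
    record = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(marker) and record:
            # A new record begins: hand back the one collected so far.
            yield record
            record = []
        if line:
            record.append(line)
    if record:
        yield record


if __name__ == "__main__":
    handle = io.StringIO(">a\nACGT\n>b\nGG\n")
    for rec in iterate_records(handle):
        print(rec)
```

Any lines that appear before the first marker end up in the first record; a
real parser would probably want to reject or skip them explicitly.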
From bugzilla-daemon at portal.open-bio.org Wed Jun 4 15:50:14 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 4 Jun 2008 15:50:14 -0400
Subject: [Biopython-dev] [Bug 2507] Adding __getitem__ to SeqRecord for element access and slicing
In-Reply-To:
Message-ID: <200806041950.m54JoEPj029720@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2507

------- Comment #3 from jblanca at btc.upv.es 2008-06-04 15:50 EST -------
Created an attachment (id=927)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=927&action=view)
RichSeq proposal

I have coded a sequence class that fulfils the requirements that I would like
to see. It's very similar to SeqRecord, but it is not compatible with it. It
has no seq property, although that can be solved. The problem with SeqRecord is
that it is not possible to create a class with an __init__ compatible with Seq
and SeqRecord at the same time.

This proposed class is just a draft; it needs more work, but I would like to
receive comments about it. It inherits from MutableSeq so it should be named
MutableRichSeq, but it seems that I'm too lazy to type such a long name. I
promise to change the name in a later version and to create a RichSeq with Seq
as parent.

Besides RichSeq there are two other classes in the attachment, RichFeature and
BioRange, but I will comment on those in another post.

I think that it is quite important to convert Seq and MutableSeq to new-style
classes, what do you think about that? With new-style classes we can use
properties.
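
As an illustration of the last point, a property is one way the missing seq
attribute "can be solved" on a class that holds the letters itself. The class
below is not the attached RichSeq; SketchRichSeq and its constructor arguments
are invented for this sketch, which is written for Python 3 (where every class
is new-style).

```python
class SketchRichSeq(object):
    """Toy stand-in for a sequence class that carries its own annotation."""

    def __init__(self, letters, id="<unknown id>"):
        self._letters = letters
        self.id = id

    def __str__(self):
        return self._letters

    def __len__(self):
        return len(self._letters)

    @property
    def seq(self):
        """Read-only view for code written against SeqRecord.seq."""
        return self._letters


if __name__ == "__main__":
    record = SketchRichSeq("ACGT", id="example")
    print(record.seq, len(record))
```

Because no setter is defined, assigning to record.seq raises AttributeError,
which keeps the letters and the compatibility view from drifting apart.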
From bugzilla-daemon at portal.open-bio.org Wed Jun 4 16:19:41 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 4 Jun 2008 16:19:41 -0400
Subject: [Biopython-dev] [Bug 2508] New: NCBIStandalone.blastall: provide support for '-F F' and make it safe
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2508

           Summary: NCBIStandalone.blastall: provide support for '-F F' and
                    make it safe
           Product: Biopython
           Version: 1.44
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: mmokrejs at ribosome.natur.cuni.cz

The local NCBI blast by default masks low-complexity regions with the SEG
algorithm. I do not see a variable to affect this in
NCBIStandalone.blastall(). Luckily, NCBIStandalone.blastall() is an unsafe
function and does not check whether I pass multiple arguments in a value
expected to be a string or number. Thus, I can do:

_blast_out, _error_info = NCBIStandalone.blastall('/usr/bin/blastall',
'blastn', blast_db, _blast_file, matrix='IDENTITY -F 0')

but imagine I had done:

_blast_out, _error_info = NCBIStandalone.blastall('/usr/bin/blastall',
'blastn', blast_db, _blast_file, matrix='IDENTITY -F 0; rm -rf /etc/passwd')

The function should be protected against such attacks as if it were directly
exposed to web users as a CGI script.

I propose a similar defensive strategy for all functions calling os.system(),
os.exec(), os.popen*(), etc.
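
A sketch of one defensive approach to the problem reported above: build the
command as an argument list and run it without a shell, so that a value such
as 'IDENTITY -F 0; rm -rf ...' can never be split into extra options or
commands. This is not the NCBIStandalone code, and it uses the Python 3
subprocess module. The function names, the low_complexity_filter parameter
and the validation rule are assumptions; the -p, -d, -i, -M and -F switches
are those of the legacy blastall program.

```python
import subprocess


def _check_value(name, value):
    """Reject option values that look like smuggled extra arguments."""
    text = str(value)
    if not text or text.startswith("-") or any(ch.isspace() for ch in text):
        raise ValueError("Suspicious value for %s: %r" % (name, text))
    return text


def build_blastall_command(blastcmd, program, database, infile,
                           matrix=None, low_complexity_filter=None):
    """Return the blastall command as a list, one element per argument."""
    command = [blastcmd, "-p", _check_value("program", program),
               "-d", database, "-i", infile]
    if matrix is not None:
        command += ["-M", _check_value("matrix", matrix)]
    if low_complexity_filter is not None:
        # Explicit support for '-F F' (filtering off) and '-F T' (on).
        command += ["-F", "T" if low_complexity_filter else "F"]
    return command


def run_blastall(command):
    """Run the command without a shell; returns (stdout, stderr) as text."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.stdout, result.stderr


if __name__ == "__main__":
    print(build_blastall_command("/usr/bin/blastall", "blastn", "my_db",
                                 "query.fasta", matrix="IDENTITY",
                                 low_complexity_filter=False))
```

Passing a list to subprocess.run already prevents shell interpretation; the
extra validation only serves to fail early with a clear error message.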

From bugzilla-daemon at portal.open-bio.org Thu Jun 5 04:52:47 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 5 Jun 2008 04:52:47 -0400
Subject: [Biopython-dev] [Bug 2507] Adding __getitem__ to SeqRecord for element access and slicing
In-Reply-To:
Message-ID: <200806050852.m558qlPF031059@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2507

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-05 04:52 EST -------
I replied to comment 2 on the mailing list.

I had intended this particular bugzilla entry (bug 2507) to be very narrow in scope - purely a small backwards compatible change to the current SeqRecord.

Some of the questions in comment 3 might have fit better on Bug 2351, although it's getting rather long. Rather than taking this issue further off topic, I'll reply on the mailing list again.

From biopython at maubp.freeserve.co.uk Thu Jun 5 05:17:00 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 5 Jun 2008 10:17:00 +0100
Subject: [Biopython-dev] Fwd: Re: sequence class proposal
In-Reply-To: <320fb6e00806021251q6cc1a7e8p36125c1326ab7a14@mail.gmail.com>
References: <1212433879.484445d7a6117@webmail.upv.es> <320fb6e00806021251q6cc1a7e8p36125c1326ab7a14@mail.gmail.com>
Message-ID: <320fb6e00806050217y1c437b01qa7fd21d75a609e8c@mail.gmail.com>

This is in reply to Jose's comment 3 on bug 2507, which was quite broad.
http://bugzilla.open-bio.org/show_bug.cgi?id=2507#c3

> I have coded a sequence class that fulfils the requirements that I would like to see. It's very similar to SeqRecord, but it is not compatible with it. It has no seq property, although that can be solved.
> The problem with SeqRecord is that it is not possible to create a class with an __init__ compatible with Seq and SeqRecord at the same time.

Even if one day the SeqRecord is a subclass of the Seq object, there is no requirement that it have the same __init__ arguments. In fact, they have to be different because for a SeqRecord you should also supply an identifier (and potentially a name, description and other annotation).

> This proposed class is just a draft, it needs more work but I would like to receive comments about it. It inherits from MutableSeq so it should be named MutableRichSeq, but it seems that I'm too lazy to type such a long name, I promise to change the name in a later version and to create a RichSeq with Seq as parent.

I agree with you here that when getting a single letter (amino acid or nucleotide) from a sequence with per-letter-annotation, e.g. my_sequence[5], it would be very nice to have the per-letter-annotation, such as the quality, included. This does mean the object returned can't just be a single one character string.

However, because the current Seq and MutableSeq classes return a simple string, unless we return a subclass of a string, this risks breaking other people's code. So, I would conclude that Seq needs to subclass a string BEFORE we start including support for per-letter-annotation. Ideally we would have alphabet aware versions of all the string functions before we made this change (see Bug 2351).

> Besides RichSeq there is in the attachment two other classes, RichFeature and BioRange, but I would comment on that in another post.

Your BioRange and BioFeature classes seem somewhat similar to the current SeqFeature class with its locations (and sub features).

> I think that it is quite important to convert Seq and MutableSeq to newclasses, what do you think about that? With the new classes we can use properties.

I have been thinking about deprecating the Seq.data property (and also the MutableSeq).
The data string (or array) should really be a private implementation detail, perhaps Seq._data following the underscore for private convention. We can then add property methods to make the Seq.data available (perhaps with a deprecation warning).

Peter

From bugzilla-daemon at portal.open-bio.org Thu Jun 5 05:36:18 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 5 Jun 2008 05:36:18 -0400
Subject: [Biopython-dev] [Bug 2507] Adding __getitem__ to SeqRecord for element access and slicing
In-Reply-To:
Message-ID: <200806050936.m559aINS001028@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2507

------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-05 05:36 EST -------
Created an attachment (id=928)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=928&action=view)
Patch to Bio/SeqRecord.py adding __getitem__ and __len__ and __iter__

Patch based on my comment 1, with addition of __len__ allowing len(my_record) rather than len(my_record.seq) and an explicit __iter__ method (although this is not required, it lets us give a doc string).
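
For readers without the attachment, the three methods named in this patch can be sketched on a toy class. ToyRecord below is a hypothetical stand-in, not the Biopython SeqRecord, and returning a new record for a slice is an assumption about how such a patch might behave.

```python
class ToyRecord(object):
    """Hypothetical stand-in for SeqRecord; not the Biopython class."""

    def __init__(self, seq, id):
        self.seq = seq
        self.id = id

    def __getitem__(self, index):
        """Return a letter for an integer index, or a sub-record for a slice."""
        if isinstance(index, slice):
            # Assumption: slicing keeps the identifier and slices the sequence.
            return ToyRecord(self.seq[index], self.id)
        return self.seq[index]

    def __len__(self):
        """Allow len(record) instead of len(record.seq)."""
        return len(self.seq)

    def __iter__(self):
        """Iterate over the letters of the sequence."""
        return iter(self.seq)
```

Python would fall back to __getitem__ for iteration even without __iter__, which matches the remark that the explicit method is not strictly required.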

From bugzilla-daemon at portal.open-bio.org Thu Jun 5 06:18:11 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 5 Jun 2008 06:18:11 -0400
Subject: [Biopython-dev] [Bug 2509] New: Deprecating the .data property of the Seq and MutableSeq objects
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2509

Summary: Deprecating the .data property of the Seq and MutableSeq objects
Product: Biopython
Version: Not Applicable
Platform: All
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk
OtherBugsDependingOnThis: 2351

In anticipation that the Seq and MutableSeq objects will eventually subclass the python string, their data property is unnecessary and confusing.

The following patch will replace it with new-style class property methods and a docstring declaring it to be deprecated.

In the case of the Seq object, the sequence should be read only but the user can currently modify the data property in place.

In the case of the MutableSeq, the fact that it is internally an array of characters should be a private implementation detail.

From bugzilla-daemon at portal.open-bio.org Thu Jun 5 06:18:14 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 5 Jun 2008 06:18:14 -0400
Subject: [Biopython-dev] [Bug 2351] Make Seq more like a string, even subclass string?
In-Reply-To:
Message-ID: <200806051018.m55AIE7S003198@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2351

biopython-bugzilla at maubp.freeserve.co.uk changed:

What              |Removed |Added
----------------------------------------------------------------------------
BugsThisDependsOn |        |2509

From bugzilla-daemon at portal.open-bio.org Thu Jun 5 06:47:43 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 5 Jun 2008 06:47:43 -0400
Subject: [Biopython-dev] [Bug 2507] Adding __getitem__ to SeqRecord for element access and slicing
In-Reply-To:
Message-ID: <200806051047.m55AlhBe004755@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2507

------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-05 06:47 EST -------
Note that adding __len__ has a knock-on effect when dealing with SeqRecord objects with a zero length sequence - they now evaluate to False rather than True. This was an issue for some of the unit tests where "if record" was used rather than the more explicit "if record is not None".

This change could therefore have unexpected side effects in existing scripts; however, adding __len__ is required if we intend to make the SeqRecord act more like the Seq object.
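
The knock-on effect can be reproduced with a few lines of plain Python. WithLen and count_records are hypothetical names used only for this illustration; the point is that an object defining __len__ (and no explicit boolean method) is treated as false when its length is zero.

```python
class WithLen(object):
    """Hypothetical record-like class that defines only __len__."""

    def __init__(self, seq):
        self.seq = seq

    def __len__(self):
        return len(self.seq)


def count_records(records, explicit):
    """Count records, testing each with 'is not None' or plain truthiness."""
    count = 0
    for record in records:
        if explicit:
            keep = record is not None
        else:
            # Plain truthiness: an empty sequence makes the record falsy.
            keep = bool(record)
        if keep:
            count += 1
    return count


# Two real records (one with an empty sequence) and a None end marker.
records = [WithLen("ACGT"), WithLen(""), None]
```

Here count_records(records, True) counts both records, while count_records(records, False) silently skips the one with the empty sequence.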

From bugzilla-daemon at portal.open-bio.org Thu Jun 5 07:03:27 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 5 Jun 2008 07:03:27 -0400
Subject: [Biopython-dev] [Bug 2508] NCBIStandalone.blastall: provide support for '-F F' and make it safe
In-Reply-To:
Message-ID: <200806051103.m55B3RUU005472@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2508

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-05 07:03 EST -------
You seem to have identified two issues. Adding support for -F should be fairly easy.

For the security issue, the caller should be validating their input. Also, if running from a web-server, the permissions should be restricted - failing to do this is asking for trouble.

However, defence in layers would be good. Would you suggest a simple check for the ";" character? What about escaped semi-colons? Also, this is a platform-dependent issue. The ";" character is Unix only. At the Windows command line you have to use an &&. Do you have a patch in mind?

From bugzilla-daemon at portal.open-bio.org Thu Jun 5 08:56:21 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 5 Jun 2008 08:56:21 -0400
Subject: [Biopython-dev] [Bug 2508] NCBIStandalone.blastall: provide support for '-F F' and make it safe
In-Reply-To:
Message-ID: <200806051256.m55CuLfC010670@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2508

------- Comment #2 from mmokrejs at ribosome.natur.cuni.cz 2008-06-05 08:56 EST -------
For the latter issue, I would use some python library to escape shell metacharacters. cgi.escape() doesn't do what I would like it to. Or cgi.wrap()?
Google search returned some hints:

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/498202
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/66012
http://e-articles.info/e/a/title/Command-Injection/
https://bugs.gentoo.org/show_bug.cgi?id=187971#c5
https://bugs.gentoo.org/show_bug.cgi?id=187971#c23
http://mail.python.org/pipermail/python-3000/2007-May/007192.html
http://www.owasp.org/index.php/Interpreter_Injection
http://www.velocityreviews.com/forums/t352309-sql-escaping-module.html

One could learn from or even use escaping functions such as MySQLdb.escape() or MySQLdb.connection.escape_string(), but I don't think it is a complete solution. I will try to think about it more later.

From bugzilla-daemon at portal.open-bio.org Thu Jun 5 09:25:43 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 5 Jun 2008 09:25:43 -0400
Subject: [Biopython-dev] [Bug 2494] _retrieve_taxon in BioSQL.py needs urgent optimization
In-Reply-To:
Message-ID: <200806051325.m55DPhrQ012033@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2494

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-05 09:25 EST -------
I've committed this patch to CVS as part of BioSQL/BioSeq.py revision 1.24.

If you could update your installation of Biopython to CVS and test this please Eric, then I think we can mark this bug as fixed.

From bugzilla-daemon at portal.open-bio.org Thu Jun 5 09:29:25 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 5 Jun 2008 09:29:25 -0400
Subject: [Biopython-dev] [Bug 2509] Deprecating the .data property of the Seq and MutableSeq objects
In-Reply-To:
Message-ID: <200806051329.m55DTP30012244@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2509

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-05 09:29 EST -------
Created an attachment (id=929)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=929&action=view)
Patch to Bio/Seq.py

This turns out to be quite a big change, and while the unit tests still pass, more extensive testing would be a good idea.

Alternatively, we could just expose .data as a read-only property, and switch to ._data (or a string subclass) instead.

From bugzilla-daemon at portal.open-bio.org Thu Jun 5 13:55:02 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 5 Jun 2008 13:55:02 -0400
Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects
In-Reply-To:
Message-ID: <200806051755.m55Ht2TS024644@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2454

------- Comment #7 from cracka80 at gmail.com 2008-06-05 13:55 EST -------
I understand your approach that these functions should be converted to modern Python, but it must also be remembered that Biopython as a whole is Python 2.3-compatible, so care must be taken not to modernise code too much. I can't remember when iterators were phased in, but it should be possible, I think it was around 2.2 anyway.

(In reply to comment #6)
> Two things:
> 1) Some of the code that does type checking for file-like-ness seems to be quite old and possibly outdated (e.g. Gobase.Iterator). We should take this opportunity to go through these modules and check if they are still useful.
> 2) Many of these modules (especially the ones that use an "Iterator" class) would be written differently in modern Python (in particular by making use of a generator function instead of an Iterator class).
>
> So I'd like to suggest the following:
> -) For the modules whose usability is dubious in 2008, let's check on the mailing list if anybody is still using them. If not, we can simply deprecate them.
> -) For the modules that are still useful, use try/except clauses to check for the necessary attributes. The current function checks for 'read', 'readline', 'readlines', and '__iter__', whereas the parser probably only needs one of them.
> -) If possible, I'd prefer to convert to modern Python as much as possible (though formally that is not within the scope of this bug report).
>

From bugzilla-daemon at portal.open-bio.org Sat Jun 7 04:26:54 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 7 Jun 2008 04:26:54 -0400
Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects
In-Reply-To:
Message-ID: <200806070826.m578Qsj4019312@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2454

------- Comment #8 from mdehoon at ims.u-tokyo.ac.jp 2008-06-07 04:26 EST -------
(In reply to comment #7)
> I understand your approach that these functions should be converted to modern Python, but it must also be remembered that Biopython as a whole is Python 2.3-compatible, so care must be taken not to modernise code too much.
> I can't remember when iterators were phased in, but it should be possible, I think it was around 2.2 anyway.
>

Bio.Blast.NCBIXML already uses generator functions to return iterators, so I think we are fine as far as compatibility with Python 2.3 and later is concerned.

I'll ask on the mailing list if Bio.Gobase has any users, to get started.

From mjldehoon at yahoo.com Sat Jun 7 04:35:05 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sat, 7 Jun 2008 01:35:05 -0700 (PDT)
Subject: [Biopython-dev] Bio.Gobase, anybody?
Message-ID: <844450.31822.qm@web62415.mail.re1.yahoo.com>

Hi everybody,

As part of bug report 2454: http://bugzilla.open-bio.org/show_bug.cgi?id=2454, I started looking at the Bio.Gobase module. This module provides access to the gobase database: http://megasun.bch.umontreal.ca/gobase/

This module is about seven years old and (AFAICT) is not actively maintained. We don't have documentation for this module, but the unit tests suggest that it parses HTML files from gobase. I am not sure exactly where the HTML files came from, but I doubt that after seven years this still works.

So I was wondering: Does anybody use Bio.Gobase? If not, I suggest we deprecate it for the next release, and remove it in some future release. If there are users, we need to make some (small) changes to this module (that is what the original bug report was about).

--Michiel.

From bugzilla-daemon at portal.open-bio.org Mon Jun 9 08:45:24 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 9 Jun 2008 08:45:24 -0400
Subject: [Biopython-dev] [Bug 2511] New: setup.py problem with del sys.modules["Martel"]
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2511

Summary: setup.py problem with del sys.modules["Martel"]
Product: Biopython
Version: Not Applicable
Platform: PC
OS/Version: Mac OS
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk

I'm currently trying to install Biopython from source (CVS) on a clean Mac OS X machine, without reportlab, Numeric or mxTextTools.

I've run into a small issue with "python setup.py build" related to the testing for an existing Martel distribution (since Martel has been distributed separately from Biopython before) due to the lack of mxTextTools.

Traceback (most recent call last):
  File "setup.py", line 508, in
    'Bio.PopGen': ['SimCoal/data/*.par'],
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/distutils/core.py", line 151, in setup
    dist.run_commands()
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/distutils/dist.py", line 974, in run_commands
    self.run_command(cmd)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/distutils/dist.py", line 994, in run_command
    cmd_obj.run()
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/distutils/command/build.py", line 112, in run
    self.run_command(cmd_name)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/distutils/cmd.py", line 333, in run_command
    self.distribution.run_command(command)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/distutils/dist.py", line 994, in run_command
    cmd_obj.run()
  File "setup.py", line 157, in run
    if not is_Martel_installed():
  File "setup.py",
line 292, in is_Martel_installed
    del sys.modules["Martel"] # Delete the old version of Martel.

The function is_Martel_installed() starts by trying to load the bundled Martel, by calling can_import("Martel"). This is failing with an ImportError from mxTextTools - and hence the Martel version of the bundled copy cannot be determined.

The next line of is_Martel_installed() causes the problem:

    del sys.modules["Martel"]

I think this only makes sense if the module could be imported, patch to follow.

From bugzilla-daemon at portal.open-bio.org Mon Jun 9 08:46:51 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 9 Jun 2008 08:46:51 -0400
Subject: [Biopython-dev] [Bug 2511] setup.py problem with del sys.modules["Martel"]
In-Reply-To:
Message-ID: <200806091246.m59Ckpts011798@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2511

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-09 08:46 EST -------
Created an attachment (id=930)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=930&action=view)
Patch to setup.py

How does this look Michiel?

From biopython at maubp.freeserve.co.uk Tue Jun 10 07:37:42 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 10 Jun 2008 12:37:42 +0100
Subject: [Biopython-dev] Giving the SeqRecord a length? Evaluating it as a boolean
Message-ID: <320fb6e00806100437n21e53369p36c85a810007ca19@mail.gmail.com>

Something we've discussed before is making the SeqRecord more like a Seq object, perhaps even subclassing it.
I've got a patch on Bug 2507 to make some small steps in this direction - accessing elements of the sequence by indexing the SeqRecord, i.e. letter = my_seq_record[5], or iterating over the letters in a SeqRecord's sequence.
http://bugzilla.open-bio.org/show_bug.cgi?id=2507

In addition, I would like to give the SeqRecord a length, allowing len(my_seq_record) rather than len(my_seq_record.seq). However, this has a side effect on the evaluation of a SeqRecord as a boolean. Before, any sequence was True, but if we add the __len__ method then any SeqRecord with a zero length sequence will evaluate as False. This is a real issue, for example you can have GenBank files without a sequence (see our unit test cases).

One example where this is important is if you are using an iterator via the .next() method and had been checking for a returned None by using "if record:" (something some of the older unit tests were doing): you would have to start using "if record is not None:" instead.

If the old behaviour is desirable (evaluating a SeqRecord as a boolean is always True), we could implement a __nonzero__ method to preserve it, see:
http://docs.python.org/ref/customization.html

What do people think? Would adding a __len__ method to the SeqRecord cause trouble?

Peter

From mjldehoon at yahoo.com Tue Jun 10 19:17:56 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 10 Jun 2008 16:17:56 -0700 (PDT)
Subject: [Biopython-dev] Giving the SeqRecord a length? Evaluating it as a boolean
In-Reply-To: <320fb6e00806100437n21e53369p36c85a810007ca19@mail.gmail.com>
Message-ID: <797428.30617.qm@web62402.mail.re1.yahoo.com>

+1 for adding a __len__ method, with a __nonzero__ method to let all SeqRecord objects evaluate as true.

--Michiel.

Peter wrote:
Something we've discussed before is making the SeqRecord more like a Seq object, perhaps even subclassing it.
I've got a patch on Bug 2507 to make some small steps in this direction - accessing elements of the sequence by indexing the SeqRecord, i.e. letter = my_seq_record[5], or iterating over the letters in a SeqRecord's sequence.
http://bugzilla.open-bio.org/show_bug.cgi?id=2507

In addition, I would like to give the SeqRecord a length, allowing len(my_seq_record) rather than len(my_seq_record.seq). However, this has a side effect on the evaluation of a SeqRecord as a boolean. Before, any sequence was True, but if we add the __len__ method then any SeqRecord with a zero length sequence will evaluate as False. This is a real issue, for example you can have GenBank files without a sequence (see our unit test cases).

One example where this is important is if you are using an iterator via the .next() method and had been checking for a returned None by using "if record:" (something some of the older unit tests were doing): you would have to start using "if record is not None:" instead.

If the old behaviour is desirable (evaluating a SeqRecord as a boolean is always True), we could implement a __nonzero__ method to preserve it, see:
http://docs.python.org/ref/customization.html

What do people think? Would adding a __len__ method to the SeqRecord cause trouble?

Peter

_______________________________________________
Biopython-dev mailing list
Biopython-dev at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython-dev

From bugzilla-daemon at portal.open-bio.org Tue Jun 10 19:30:20 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 10 Jun 2008 19:30:20 -0400
Subject: [Biopython-dev] [Bug 2511] setup.py problem with del sys.modules["Martel"]
In-Reply-To:
Message-ID: <200806102330.m5ANUKfo019481@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2511

------- Comment #2 from mdehoon at ims.u-tokyo.ac.jp 2008-06-10 19:30 EST -------
(In reply to comment #1)
> Created an attachment (id=930)
>  --> (http://bugzilla.open-bio.org/attachment.cgi?id=930&action=view) [details]
> Patch to setup.py
>
> How does this look Michiel?
>

That looks fine to me, though eventually I would prefer to get rid of the dependence on Martel/mxTextTools altogether.

From bugzilla-daemon at portal.open-bio.org Tue Jun 10 19:42:52 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 10 Jun 2008 19:42:52 -0400
Subject: [Biopython-dev] [Bug 2511] setup.py problem with del sys.modules["Martel"]
In-Reply-To:
Message-ID: <200806102342.m5ANgqct019925@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2511

------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-10 19:42 EST -------
In reply to comment 2, would it make sense for the unit test framework to treat the mxTextTools (or reportlab, or Numeric) import errors as a missing external dependency?

In the unit tests we used to "ignore" any tests which failed with an ImportError, but have now switched to our own MissingExternalDependencyError exception.
We want to distinguish ImportErrors which are external to Biopython (and therefore can be considered as missing dependencies) from those internal to Biopython (perhaps due to refactoring or removal of code - a real unit test failure).

One way to do this would be for the bits of Biopython that try to import mxTextTools (or any other module) to raise MissingExternalDependencyError (or something that is a subclass of both MissingExternalDependencyError and the built in ImportError).

From bugzilla-daemon at portal.open-bio.org Wed Jun 11 02:54:32 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 11 Jun 2008 02:54:32 -0400
Subject: [Biopython-dev] [Bug 2516] New: Make it clear what is numeric and what is numpy
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2516

Summary: Make it clear what is numeric and what is numpy
Product: Biopython
Version: 1.45
Platform: PC
URL: http://www.biopython.org/DIST/docs/install/Installation.html
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Documentation
AssignedTo: biopython-dev at biopython.org
ReportedBy: mmokrejs at ribosome.natur.cuni.cz

Hi,
although both packages are from the same source site, numpy is the newer implementation whereas numeric is the old, deprecated implementation, right? Why do you say the following in the installation docs?

"The Numerical Python distribution (also known an Numeric or Numpy) is a fast implementation of arrays and associated array functionality. This is important for a number of Biopython modules that deal with number processing. The main web site for Numeric is: http://sourceforge.net/projects/numpy and downloads are available from:..."

I think it is misleading. BTW, is numpy-1.1.0 supported?

From bugzilla-daemon at portal.open-bio.org Wed Jun 11 04:47:32 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 11 Jun 2008 04:47:32 -0400
Subject: [Biopython-dev] [Bug 2511] setup.py problem with del sys.modules["Martel"]
In-Reply-To:
Message-ID: <200806110847.m5B8lWxd010254@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2511

biopython-bugzilla at maubp.freeserve.co.uk changed:

What       |Removed |Added
----------------------------------------------------------------------------
Status     |NEW     |RESOLVED
Resolution |        |FIXED

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-11 04:47 EST -------
Patch checked into CVS as Biopython/setup.py revision 1.133, marking this bug as fixed.

The issue I raised in comment 3 is still outstanding (external ImportErrors and the unit tests). We may want to file a separate bug, or discuss this on the dev mailing list.
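
The outstanding idea from comment 3 could look roughly like the sketch below. It is only an illustration under stated assumptions: this MissingExternalDependencyError is redefined locally as a subclass of ImportError rather than taken from Biopython, and import_optional is a hypothetical helper meant to be called only with the names of third-party packages.

```python
class MissingExternalDependencyError(ImportError):
    """Hypothetical: an optional third-party package is not installed.

    Subclassing ImportError keeps existing 'except ImportError' code working.
    """


def import_optional(module_name):
    """Import an external module, translating failure into the error above."""
    try:
        return __import__(module_name)
    except ImportError as err:
        raise MissingExternalDependencyError(
            "Install %s to use this functionality (%s)" % (module_name, err))
```

A test runner could then skip tests that raise MissingExternalDependencyError while still reporting any other ImportError as a genuine failure.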

From bugzilla-daemon at portal.open-bio.org Wed Jun 11 04:53:30 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 11 Jun 2008 04:53:30 -0400
Subject: [Biopython-dev] [Bug 2516] Make it clear what is numeric and what is numpy
In-Reply-To:
Message-ID: <200806110853.m5B8rU2t010552@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2516

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-11 04:53 EST -------
That text is rather out of date - if you are familiar with the history of Numeric, numarray and numpy you'll know that the old module used with "import Numeric" was called Numerical Python or NumPy for short. This shorthand was used in lots of documentation (not just in Biopython).

I think the choice to call the third generation of the array packages numpy has caused a lot of confusion. See http://numpy.scipy.org/#older_array

We had updated the Biopython website and other bits of documentation, but had missed this one. Thank you for pointing this out.

P.S. Supporting numpy instead of Numeric is Biopython Bug 2251.

From bugzilla-daemon at portal.open-bio.org Wed Jun 11 05:04:47 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 11 Jun 2008 05:04:47 -0400
Subject: [Biopython-dev] [Bug 2507] Adding __getitem__ to SeqRecord for element access and slicing
In-Reply-To:
Message-ID: <200806110904.m5B94li8011303@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2507

------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-11 05:04 EST -------
I raised the issue of evaluating a SeqRecord as a boolean with a proposal that we could add __len__ but also add __nonzero__ to ensure that any SeqRecord evaluates as True (even if the sequence is of length zero):
http://lists.open-bio.org/pipermail/biopython-dev/2008-June/003756.html

Michiel was in favour of this:
> +1 for adding a __len__ method, with a __nonzero__ method to let all SeqRecord
> objects evaluate as true.

The patch isn't ready yet because, in addition, it doesn't yet deal with the SeqFeature objects. I think the SeqFeature class needs a _shift(offset) method to return a copy of itself with its location (and the locations of any sub-features) adjusted.

I'm still not sure about handling strides, and I am tempted to rule that if a stride other than one is used then the features of the SeqRecord are lost.
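
The combination proposed here can be sketched as follows. AlwaysTrueRecord is a hypothetical toy class, not the patched SeqRecord. Note that __nonzero__ is the Python 2 spelling; Python 3 looks for __bool__ instead, so the sketch defines both names.

```python
class AlwaysTrueRecord(object):
    """Hypothetical record with a length that is nevertheless always true."""

    def __init__(self, seq):
        self.seq = seq

    def __len__(self):
        return len(self.seq)

    def __bool__(self):
        # Takes priority over __len__ when the object is used in a condition.
        return True

    # Python 2 consults __nonzero__ rather than __bool__.
    __nonzero__ = __bool__
```

With this in place, existing "if record:" checks keep working for records with an empty sequence, while len(record) still reports zero.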
From bugzilla-daemon at portal.open-bio.org Wed Jun 11 09:57:56 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 11 Jun 2008 09:57:56 -0400 Subject: [Biopython-dev] [Bug 2507] Adding __getitem__ to SeqRecord for element access and slicing In-Reply-To: Message-ID: <200806111357.m5BDvu1I024400@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2507 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #928 is|0 |1 obsolete| | ------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-11 09:57 EST ------- Created an attachment (id=937) --> (http://bugzilla.open-bio.org/attachment.cgi?id=937&action=view) Patch to Bio/SeqRecord.py and Bio/SeqFeature.py This modifies the SeqRecord to give it __getitem__ (supporting sliced annotations including features), __len__ (to return the length of the sequence), __nonzero__ (to ensure any SeqRecord evaluates as True regardless of the length of its sequence) and __iter__ (to explicitly support iteration over the sequence with a docstring). As part of this, assorted objects in SeqFeature.py get a private _shift() method taking an integer offset and returning a copy of the object with an adjusted location. Note that slices with a stride (other than one) will result in the features being lost. Handling (positive) strides would require careful consideration of whether an exact location is still present and, if not, replacing it with either a fuzzy position or a range. Negative strides are worse! The current set of unit tests seems fine, but additional checks would need to be added to validate this new behaviour.
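[Editorial sketch] The behaviour described in comment #8 can be illustrated with a small toy model. This is not the attached patch: ToyRecord and ToyFeature are hypothetical stand-ins for SeqRecord and SeqFeature, and the rule that only features lying wholly inside the slice are kept is an assumption made for this example. It shows __len__, an always-true __nonzero__ (spelled __bool__ in Python 3), __iter__, and a __getitem__ that shifts features on a plain slice but drops them when a stride other than one is used.

```python
import copy


class ToyFeature:
    """Hypothetical stand-in for a SeqFeature: a half-open [start, end) span."""

    def __init__(self, start, end, label):
        self.start = start
        self.end = end
        self.label = label

    def _shift(self, offset):
        # Return a copy of this feature with its location moved by `offset`.
        shifted = copy.copy(self)
        shifted.start += offset
        shifted.end += offset
        return shifted


class ToyRecord:
    """Hypothetical stand-in for a SeqRecord (sequence plus features)."""

    def __init__(self, seq, features=None):
        self.seq = seq
        self.features = list(features or [])

    def __len__(self):
        # Length of the record is the length of its sequence.
        return len(self.seq)

    def __bool__(self):
        # Always true, so an empty sequence does not make the record falsy.
        return True

    __nonzero__ = __bool__  # Python 2 spelling of the same hook

    def __iter__(self):
        # Iterate over the letters of the sequence.
        return iter(self.seq)

    def __getitem__(self, index):
        if isinstance(index, int):
            return self.seq[index]
        start, stop, step = index.indices(len(self.seq))
        sub = ToyRecord(self.seq[index])
        if step == 1:
            # Assumption: keep only features wholly inside the slice,
            # shifted so their coordinates are relative to the new start.
            sub.features = [
                f._shift(-start)
                for f in self.features
                if start <= f.start and f.end <= stop
            ]
        # With any other stride the features are deliberately lost.
        return sub
```

For instance, slicing a record of length 8 carrying a feature at [2, 5) with [1:6] gives a sub-record whose feature sits at [1, 4), while [::2] returns a sub-record with no features.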
From bugzilla-daemon at portal.open-bio.org Wed Jun 11 11:26:59 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 11 Jun 2008 11:26:59 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806111526.m5BFQxMw029057@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #9 from mdehoon at ims.u-tokyo.ac.jp 2008-06-11 11:26 EST ------- I "fixed" SwissProt.SProt.Iterator by deprecating it. Instead of SwissProt.SProt.Iterator, we recommend using Bio.SwissProt.parse and Bio.SeqIO.parse. Next on the to-do list is SwissProt.KeyWList.extract_keywords. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jun 12 10:23:16 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 12 Jun 2008 10:23:16 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806121423.m5CENG95026678@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #10 from mdehoon at ims.u-tokyo.ac.jp 2008-06-12 10:23 EST ------- SwissProt.KeyWList.extract_keywords could only parse very old SwissProt files. I deprecated it and wrote a new function "parse" that parses current SwissProt files. This function does not do the file-like check. Prosite.Iterator and Prosite.Prodoc.Iterator are next. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From fkauff at biologie.uni-kl.de Thu Jun 12 10:33:56 2008 From: fkauff at biologie.uni-kl.de (Frank Kauff) Date: Thu, 12 Jun 2008 16:33:56 +0200 Subject: [Biopython-dev] CVS access and developers web site In-Reply-To: <320fb6e00805291446x1cebf67bpe3e0818af5b9a7c5@mail.gmail.com> References: <483E7578.50402@biologie.uni-kl.de> <320fb6e00805291446x1cebf67bpe3e0818af5b9a7c5@mail.gmail.com> Message-ID: <485133D4.2060405@biologie.uni-kl.de> Peter Cock wrote: > Hi Frank, > > I would try emailing support at helpdesk.open-bio.org using the email > address associated with your CVS username. If you've changed email > address, and you run into problems, I expect Michiel or I could vouch > for you. > Is somebody monitoring that email address? I got an automated response about two weeks ago, and then nothing happened. > For the website, the wiki usernames are entirely separate and you > should be able to create a new account if you don't have one already. > If you want to update the tutorial new HTML and PDF files are loaded > with each release from the version in CVS. > Thanks Peter, got access to the wiki and updated personal data. Frank > Peter > > On Thu, May 29, 2008 at 10:20 AM, Frank Kauff wrote: > >> Hi folks, >> >> although I've been quiet for a while, I'm still doing some changes to the >> Nexus parser of biopython from time to time.... I totally lost my passwords >> to access the repository. Could someone please send me a new password to get >> write access to cvs? And I would also like to change the information on the >> biopython developers web site, as they are somewhat outdated. >> And is this the right place to ask for such things? >> >> Thanks! 
>> >> Frank >> > > From bugzilla-daemon at portal.open-bio.org Thu Jun 12 11:42:58 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 12 Jun 2008 11:42:58 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806121542.m5CFgw9t029594@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #11 from cracka80 at gmail.com 2008-06-12 11:42 EST ------- Maybe it's a good idea for any parsers/iterators to just use the iterator-like ability of file handles? Writers would have to function slightly differently, but since file objects, StringIOs and any other file-like objects must provide an __iter__ method, it's probably a good idea to take that into consideration when developing a common interface. In addition, writers could output iterators or generators, so that they can be chained together to operate on files. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jun 13 12:24:29 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 13 Jun 2008 12:24:29 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806131624.m5DGOTKw025954@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #12 from mdehoon at ims.u-tokyo.ac.jp 2008-06-13 12:24 EST ------- (In reply to comment #11) > Maybe it's a good idea for any parsers/iterators to just use the iterator-like > ability of file handles? In principle, yes. In practice, it's not so easy because many parsers in Biopython follow the framework in Bio.ParserSupport. These parsers are not really written to deal with lines pulled one-by-one from a file handle. 
To reconcile these two, I pull out data line-by-line from the file handle, store it in a string, and then call the parser to parse it. This is not ideal, and it may be a good idea for Biopython at some point to change its parser strategy. > Writers would have to function slightly differently, > but since file objects, StringIOs and any other file-like objects must provide > an __iter__ method, it's probably a good idea to take that into consideration > when developing a common interface. In addition, writers could output > iterators or generators, so that they can be chained together to operate > on files. > Writers should also be able to just print the record to the screen. I don't see how that is easily achievable with generators. From bugzilla-daemon at portal.open-bio.org Fri Jun 13 12:27:47 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 13 Jun 2008 12:27:47 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806131627.m5DGRlTE026072@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #13 from mdehoon at ims.u-tokyo.ac.jp 2008-06-13 12:27 EST ------- Medline.Iterator, Prosite.Iterator, and Prosite.Prodoc.Iterator are now fixed.
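[Editorial sketch] The suggestion in comment #11, and the question about printing raised in comment #12, can be illustrated with a minimal example. It is not Biopython code: parse() and write() here are hypothetical functions, and the record format (lines terminated by a "//" line, as in SwissProt-style flat files) is chosen only to keep the example short. The parser relies on nothing but iteration, so an open file, a StringIO or a plain list of lines all work, and the writer is a generator of text lines that can be chained after it.

```python
import io


def parse(handle):
    """Yield one record (a list of lines) at a time from any iterable of lines.

    A record ends at a line containing only '//'.
    """
    record = []
    for line in handle:
        line = line.rstrip("\n")
        if line == "//":
            yield record
            record = []
        else:
            record.append(line)
    if record:
        # Tolerate a final record with no terminator line.
        yield record


def write(records):
    """Generator 'writer': lazily turn records back into text lines."""
    for record in records:
        for line in record:
            yield line + "\n"
        yield "//\n"
```

One possible answer to the printing concern is that any file-like object with a writelines() method can consume such a generator directly, for example sys.stdout.writelines(write(parse(handle))). Whether that is convenient enough for a common interface is a separate design question.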
From bugzilla-daemon at portal.open-bio.org Fri Jun 13 22:29:13 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 13 Jun 2008 22:29:13 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806140229.m5E2TDdD014417@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #14 from mdehoon at ims.u-tokyo.ac.jp 2008-06-13 22:29 EST ------- I deprecated Bio.Gobase, since no users came forward on the mailing list. Bio.Rebase is also problematic. It parses HTML from the Rebase database, but it was written in 2000 and cannot parse current HTML from Rebase (which looks completely different from the HTML used in 2000). I'll ask on the mailing list if anybody is willing to update Bio.Rebase. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mjldehoon at yahoo.com Fri Jun 13 22:34:05 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 13 Jun 2008 19:34:05 -0700 (PDT) Subject: [Biopython-dev] Bio.Rebase Message-ID: <237761.5963.qm@web62409.mail.re1.yahoo.com> Hi everybody, As part of bug #2454 on Bugzilla, I am looking at the Bio.Rebase module. This module parses files (in HTML format) from the Rebase database: http://rebase.neb.com/rebase/rebase.html Unfortunately, since this module was written (in 2000) the HTML format used by the Rebase database has changed completely. This module is therefore not able to parse current Rebase HTML files. Is anybody willing to update Bio.Rebase (either by updating the HTML parser, or preferably by writing a parser for plain-text output from Bio.Rebase)? If not, I think this module should be deprecated. --Michiel. 
From bugzilla-daemon at portal.open-bio.org Fri Jun 13 22:50:42 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 13 Jun 2008 22:50:42 -0400 Subject: [Biopython-dev] [Bug 2516] Make it clear what is numeric and what is numpy In-Reply-To: Message-ID: <200806140250.m5E2ogvf014920@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2516 ------- Comment #2 from mdehoon at ims.u-tokyo.ac.jp 2008-06-13 22:50 EST ------- According to the Numerical Python website, the NumPy documentation will become freely available on September 1, 2008. That would be a good time to start thinking seriously about converting from the "old" Numerical Python to the "new" NumPy 1.1. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mjldehoon at yahoo.com Fri Jun 13 22:46:37 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 13 Jun 2008 19:46:37 -0700 (PDT) Subject: [Biopython-dev] Bio.SCOP maintainer? Message-ID: <523172.98428.qm@web62402.mail.re1.yahoo.com> Still looking at Bug 2454 (http://bugzilla.open-bio.org/show_bug.cgi?id=2454). To fix this bug, I'd like to make some changes to Bio.SCOP. Is anybody currently maintaining Bio.SCOP? The changes I'd like to make are small, but it would be better to discuss with the Bio.SCOP maintainer (if there is one) so I won't get in their way. --Michiel. 
From bugzilla-daemon at portal.open-bio.org Sat Jun 14 05:52:09 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 14 Jun 2008 05:52:09 -0400 Subject: [Biopython-dev] [Bug 2488] Adding XML parsers to Bio.Entrez In-Reply-To: Message-ID: <200806140952.m5E9q9X9032018@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2488 mdehoon at ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #8 from mdehoon at ims.u-tokyo.ac.jp 2008-06-14 05:52 EST ------- We now have parsers for XML returned by Entrez, provided the corresponding DTDs are available. Bio/Entrez/DTDs contains most (all?) DTDs currently used by Entrez. If later some DTDs appear to be missing, we can simply add them to Bio/Entrez/DTDs. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Jun 14 06:29:12 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 14 Jun 2008 06:29:12 -0400 Subject: [Biopython-dev] [Bug 2516] Make it clear what is numeric and what is numpy In-Reply-To: Message-ID: <200806141029.m5EATC64001227@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2516 mdehoon at ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #3 from mdehoon at ims.u-tokyo.ac.jp 2008-06-14 06:29 EST ------- Updated the installation instructions (in CVS, at least). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From p.j.a.cock at googlemail.com Sat Jun 14 18:51:26 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 14 Jun 2008 23:51:26 +0100 Subject: [Biopython-dev] CVS access and developers web site In-Reply-To: <485133D4.2060405@biologie.uni-kl.de> References: <483E7578.50402@biologie.uni-kl.de> <320fb6e00805291446x1cebf67bpe3e0818af5b9a7c5@mail.gmail.com> <485133D4.2060405@biologie.uni-kl.de> Message-ID: <320fb6e00806141551t56422a98v752e34bbbb38d0aa@mail.gmail.com> >> Hi Frank, >> >> I would try emailing support at helpdesk.open-bio.org using the email >> address associated with your CVS username. If you've changed email >> address, and you run into problems, I expect Michiel or I could vouch >> for you. >> > > Is somebody monitoring that email address? I got an automated response about > two weeks ago, and then nothing happened. > Maybe someone is on holiday - or they are caught up with BOSC 2008 work? I can suggest a few specific people at OBF to try and contact directly if you are still stuck. In the short term, if there are any urgent fixes you think need to be checked in, stick them on Bugzilla and I'm sure one of us will be able to commit them on your behalf. Peter From bugzilla-daemon at portal.open-bio.org Sun Jun 15 03:03:18 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 15 Jun 2008 03:03:18 -0400 Subject: [Biopython-dev] [Bug 2468] Tutorial needs a fix: Bio.WWW.NCBI In-Reply-To: Message-ID: <200806150703.m5F73IF2007099@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2468 mdehoon at ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #6 from mdehoon at ims.u-tokyo.ac.jp 2008-06-15 03:03 EST ------- I created a subsection Examples to the tutorial chapter on Bio.Entrez, and added the example from section 2.5 and Martin's taxonomy example to it. 
With the Bio.Entrez currently in CVS, finding the lineage works as follows:

>>> handle = Entrez.esearch(db="Taxonomy", term="Cypripedioideae")
>>> record = Entrez.read(handle)
>>> record["IdList"]
['158330']
>>> handle = Entrez.efetch(db="Taxonomy", id="158330", retmode='xml')
>>> records = Entrez.read(handle)
>>> records[0]['Lineage']
'cellular organisms; Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta; Liliopsida; Asparagales; Orchidaceae'

-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 16 15:23:43 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 16 Jun 2008 15:23:43 -0400 Subject: [Biopython-dev] [Bug 2507] Adding __getitem__ to SeqRecord for element access and slicing In-Reply-To: Message-ID: <200806161923.m5GJNhZw012022@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2507 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #937 is|0 |1 obsolete| | ------- Comment #9 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-16 15:23 EST ------- Created an attachment (id=942) --> (http://bugzilla.open-bio.org/attachment.cgi?id=942&action=view) Patch to Bio/SeqRecord.py and Bio/SeqFeature.py I've checked in the SeqRecord __len__ and __nonzero__ methods with CVS Bio/SeqRecord.py revision 1.17 The earlier __getitem__ and __iter__ patch has been updated accordingly. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon Jun 16 16:08:00 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 16 Jun 2008 16:08:00 -0400 Subject: [Biopython-dev] [Bug 1944] Align.Generic adding iterator and more In-Reply-To: Message-ID: <200806162008.m5GK80bv014002@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1944 ------- Comment #15 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-16 16:07 EST ------- Created an attachment (id=943) --> (http://bugzilla.open-bio.org/attachment.cgi?id=943&action=view) Minimal __getitem__ method for generic alignment This patch just adds a __getitem__ to the alignment which ONLY accepts a single integer index and returns the corresponding SeqRecord object. I propose to add this NOW, as I think even just this is a worthwhile improvement. This is a natural expectation given the current __iter__ behaviour and the model of the alignment as a list of SeqRecord objects. Its also part of the more rich behaviour discussed above, which we can add more easily if/when the SeqRecord gets a __getitem__ method (bug 2507). Comments on this particular patch? Should we add __len__ at the same time giving the number of rows in the alignments? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From jblanca at btc.upv.es Tue Jun 17 03:35:38 2008 From: jblanca at btc.upv.es (Jose Blanca) Date: Tue, 17 Jun 2008 09:35:38 +0200 Subject: [Biopython-dev] [BioPython] Ace contig files in Bio.SeqIO or Bio.AlignIO In-Reply-To: <320fb6e00806160701l428584c0i30acac57338b9357@mail.gmail.com> References: <320fb6e00806160701l428584c0i30acac57338b9357@mail.gmail.com> Message-ID: <200806170935.38904.jblanca@btc.upv.es> Hi: My main use of the Alignment class is to parse Ace files. I've been thinking about that problem recently. 
My proposal to modify SeqRecord was due to this problem. I think that the best solution would be to treat the Alignment as a sequence. The consensus would be the actual sequences and the aligned read would be features with per-base-annotations. I've implemented such a class and it works fine for me. In fact the Alignment class is just a wrapper around a standard SeqRecord (I name it RichSeq in my implementation). To do that you just need a SeqRecord with a __getitem__ method. You have already proposing that so that's not a problem. Padding with spaces is not an option when you're dealing with genomic wide alignments, that's one of the problems of the actual Alignment class. If you want I can send my implementation to the list, although it could take a while because I've got my home computer dead. Best regards, Jose Blanca On Monday 16 June 2008 16:01:31 Peter wrote: > I've recently had to deal with some contig files in the Ace format > (output by CAP3, but many assembly files will produce this output). > > We have a module for parsing Ace files in Biopython, > Bio.Sequencing.Ace but I was wondering about integrating this into the > Bio.SeqIO or Bio.AlignIO framework. > http://www.biopython.org/wiki/SeqIO > http://www.biopython.org/wiki/AlignIO > > I'd like to hear from anyone currently using Ace files, on how they > tend to treat the data - and if they think a SeqRecord or Alignment > based representation would be useful. > > Each contig in an Ace file could be treated as a SeqRecord using the > consensus sequence. The identifiers of each sub-sequence used to > build the consensus could be stored as database cross-references, or > perhaps we could store these as SeqFeatures describing which part of > the consensus they support. This would then fit into Bio.SeqIO quite > well. > > Alternatively, each contig could be treated as an alignment (with a > consensus) and integrated into Bio.AlignIO. 
One drawback for this is > doing this with the current generic alignment class would require > padding the start and/or end of each sequence with gaps in order to > make every sequence the same length. However, if we did this (or > created a more specialised alignment class), the Ace file format would > then fit into Bio.AlignIO too. > > So, Ace users - would either (or both) of the above approaches make > sense for how you use the Ace contig files? > > Thanks > > Peter > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From biopython at maubp.freeserve.co.uk Tue Jun 17 04:46:22 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 17 Jun 2008 09:46:22 +0100 Subject: [Biopython-dev] [BioPython] Ace contig files in Bio.SeqIO or Bio.AlignIO In-Reply-To: <200806170935.38904.jblanca@btc.upv.es> References: <320fb6e00806160701l428584c0i30acac57338b9357@mail.gmail.com> <200806170935.38904.jblanca@btc.upv.es> Message-ID: <320fb6e00806170146j6f1843e6hed4166ad62c84423@mail.gmail.com> On Tue, Jun 17, 2008 at 8:35 AM, Jose Blanca wrote: > Hi: > My main use of the Alignment class is to parse Ace files. I've been thinking > about that problem recently. My proposal to modify SeqRecord was due to this > problem. I think that the best solution would be to treat the Alignment as a > sequence. The consensus would be the actual sequences and the aligned read > would be features with per-base-annotations. So integrating the "ace" format into Bio.SeqIO representing the consensus sequence of each contig as a SeqRecord would be useful. 
Initially I would try and represent the aligned reads as SeqFeature objects (much like when reading a genome from a GenBank file you get CDS features with their amino acid translation). Note that for memory reasons, I would be inclined to scan over the Ace file in one pass (using the existing Iterator in the Bio.Sequencing.Ace parser) returning SeqRecords as we go. As Frank points out in the code comments, this means we can't easily include the WA, CT, RT and WR tags found in the Ace file footer. Do you use this information Jose? > I've implemented such a class > and it works fine for me. In fact the Alignment class is just a wrapper > around a standard SeqRecord (I name it RichSeq in my implementation). > To do that you just need a SeqRecord with a __getitem__ method. You have > already proposing that so that's not a problem. Your enthusiasm Jose is one of the things motivating me to try and do more with the Seq and SeqRecord. Without a third party to offer feedback, making big changes is risky. > Padding with spaces is not an option when you're dealing with genomic wide > alignments, that's one of the problems of the actual Alignment class. It might make sense to talk about a "Contig Alignment" object/class, compared to the existing "multiple sequence alignment" object/class where all the sequences are the same length. Ideally these should provide as similar an API as possible - even if the internals are different. One idea is a sub-class of the current alignment class which stores an offset (>=0) for each supporting read, used when accessing columns. Maybe we should check out BioPerl etc for inspiration? > If you want I can send my implementation to the list, although it could take a > while because I've got my home computer dead. Good luck with the broken computer - I hope you have an easier time fixing it / rebuilding it than I did last time this hapended to me. 
Regards, Peter From biopython at maubp.freeserve.co.uk Tue Jun 17 05:16:29 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 17 Jun 2008 10:16:29 +0100 Subject: [Biopython-dev] Iterating over Ace contig files Message-ID: <320fb6e00806170216k12ecd88fof60758db1ccec3cf@mail.gmail.com> Hello Frank, I wanted to get your opinion on iterating over the Ace file contig by contig, and what is lost in the WA, CT, RT and WR tags at the end of the file by doing this. As large sequencing runs become more common, iterating over the file in a single pass WITHOUT keeping everything in memory does seem to be desirable. Similar past discussions: http://portal.open-bio.org/pipermail/biopython/2004-February/001825.html http://portal.open-bio.org/pipermail/biopython/2005-May/002661.html Would you object to me rewording your module's header comment so that it no longer describes the Ace Iterator as deprecated, but rather says that it is NOT deprecated and has certain drawbacks? [The context for this is my recent thread on the Biopython dev mailing list about integrating your Bio.Sequencing.Ace parser into Bio.SeqIO and/or Bio.AlignIO - I've included a little context below.] Thanks, Peter -- Peter wrote: >> So integrating the "ace" format into Bio.SeqIO representing the >> consensus sequence of each contig as a SeqRecord would be useful. >> Initially I would try and represent the aligned reads as SeqFeature >> objects (much like when reading a genome from a GenBank file you get >> CDS features with their amino acid translation). >> >> Note that for memory reasons, I would be inclined to scan over the Ace >> file in one pass (using the existing Iterator in the >> Bio.Sequencing.Ace parser) returning SeqRecords as we go. As Frank >> points out in the code comments, this means we can't easily include >> the WA, CT, RT and WR tags found in the Ace file footer. Do you use >> this information Jose? Jose replied, > I haven't used the iterator because of the deprecation warning of the code. 
I > tried with about 40000 alignments and it worked in a computer with 8 GB of ram. > I there are more sequences, and there will be with the 454 sequencer, we will > have trouble reading all at once. I vote for the iterator approach. I have not > used the information of this tag, but I don't know also what they mean. I've > been looking for documentation about this format, but I've found none, do you > have any good ace documentation? From bugzilla-daemon at portal.open-bio.org Tue Jun 17 07:23:59 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 17 Jun 2008 07:23:59 -0400 Subject: [Biopython-dev] [Bug 2520] New: Reading ACE assembly contig files in Bio.SeqIO Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2520 Summary: Reading ACE assembly contig files in Bio.SeqIO Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk As I suggested on the mailing list, we could use Bio.Sequencing.Ace to parse ACE assembly files, and then turn each contig into a SeqRecord using the consensus sequence. I will attach a basic implementation which only uses the consensus sequence and its name. For now this ignores all the meta data and in particular the read information. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
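[Editorial sketch] One idea raised earlier in this thread was a "contig alignment" that stores an offset (>= 0) for each supporting read and uses it when accessing columns, instead of padding every read to the full contig length. A minimal sketch of that idea follows. ToyContig and its methods are hypothetical names invented for this illustration; they are not part of Bio.Align or Bio.Sequencing.Ace, and using a space for positions a read does not cover is an assumption.

```python
class ToyContig:
    """Hypothetical contig alignment: reads are stored unpadded with a start offset."""

    def __init__(self, consensus):
        self.consensus = consensus
        self._reads = []  # list of (name, offset, sequence) tuples

    def add_read(self, name, offset, sequence):
        # The offset is the consensus column where the read starts.
        if offset < 0:
            raise ValueError("offset must be >= 0")
        self._reads.append((name, offset, sequence))

    def __len__(self):
        # Number of rows (supporting reads), not the consensus length.
        return len(self._reads)

    def __getitem__(self, index):
        # Integer index only: return the stored (name, offset, sequence) row.
        return self._reads[index]

    def get_column(self, col):
        """Return the letters in column `col`, with ' ' where a read does not reach."""
        letters = []
        for _name, offset, sequence in self._reads:
            pos = col - offset
            letters.append(sequence[pos] if 0 <= pos < len(sequence) else " ")
        return "".join(letters)
```

Memory use then grows with the total length of the reads rather than with (number of reads) x (contig length), which is the concern with padding for genome-scale assemblies.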
From bugzilla-daemon at portal.open-bio.org Tue Jun 17 07:29:15 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 17 Jun 2008 07:29:15 -0400 Subject: [Biopython-dev] [Bug 2520] Reading ACE assembly contig files in Bio.SeqIO In-Reply-To: Message-ID: <200806171129.m5HBTFVG026790@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2520 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-17 07:29 EST ------- Created an attachment (id=944) --> (http://bugzilla.open-bio.org/attachment.cgi?id=944&action=view) New file Bio/SeqIO/AceIO.py This new file would be added to Bio.SeqIO in the usual way (updating Bio/SeqIO/__init__.py to import this module and map the format "ace" to the new iterator). Handling different gap characters in Bio.SeqIO (and translating them when reading and writing files) has not been formalised. Where possible, converting them into dashes on loading seems to be a sensible route to take. Therefore I deliberately map any "*" gap characters in the consensus sequence into "-" characters, which are used by default in the alphabet class and are far more commonly used. The "*" character is typically associated with a stop codon in protein sequences, which is another reason to avoid using it here. From fkauff at biologie.uni-kl.de Tue Jun 17 09:06:34 2008 From: fkauff at biologie.uni-kl.de (Frank Kauff) Date: Tue, 17 Jun 2008 15:06:34 +0200 Subject: [Biopython-dev] Iterating over Ace contig files In-Reply-To: <320fb6e00806170216k12ecd88fof60758db1ccec3cf@mail.gmail.com> References: <320fb6e00806170216k12ecd88fof60758db1ccec3cf@mail.gmail.com> Message-ID: <4857B6DA.9040309@biologie.uni-kl.de> Hi Peter, makes totally sense to me. 
Feel free to do the changes as you see it fit Frank Peter wrote: > Hello Frank, > > I wanted to get your opinion on iterating over the Ace file contig by > contig, and what is lost in the WA, CT, RT and WR tags at the end of > the file by doing this. As large sequencing runs become more common, > iterating over the file in a single pass WITHOUT keeping everything in > memory does seem to be desirable. > > Similar past discussions: > http://portal.open-bio.org/pipermail/biopython/2004-February/001825.html > http://portal.open-bio.org/pipermail/biopython/2005-May/002661.html > > Would you object to me rewording your module's header-comment not to > say that the Ace Iterator is NOT deprecated, but rather that it has > certain drawbacks. > > [The context for this is my recent thread on the Biopython dev mailing > list about integrating your Bio.Sequencing.Ace parser into Bio.SeqIO > and/or Bio.AlignIO - I've included a little context below.] > > Thanks, > > Peter > > -- > > Peter wrote: > >>> So integrating the "ace" format into Bio.SeqIO representing the >>> consensus sequence of each contig as a SeqRecord would be useful. >>> Initially I would try and represent the aligned reads as SeqFeature >>> objects (much like when reading a genome from a GenBank file you get >>> CDS features with their amino acid translation). >>> >>> Note that for memory reasons, I would be inclined to scan over the Ace >>> file in one pass (using the existing Iterator in the >>> Bio.Sequencing.Ace parser) returning SeqRecords as we go. As Frank >>> points out in the code comments, this means we can't easily include >>> the WA, CT, RT and WR tags found in the Ace file footer. Do you use >>> this information Jose? >>> > > Jose replied, > >> I haven't used the iterator because of the deprecation warning of the code. I >> tried with about 40000 alignments and it worked in a computer with 8 GB of ram. 
>> If there are more sequences, and there will be with the 454 sequencer, we will >> have trouble reading all at once. I vote for the iterator approach. I have not >> used the information of this tag, but I don't know also what they mean. I've >> been looking for documentation about this format, but I've found none, do you >> have any good ace documentation? >> > > From biopython at maubp.freeserve.co.uk Tue Jun 17 09:53:23 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 17 Jun 2008 14:53:23 +0100 Subject: [Biopython-dev] Iterating over Ace contig files In-Reply-To: <4857B6DA.9040309@biologie.uni-kl.de> References: <320fb6e00806170216k12ecd88fof60758db1ccec3cf@mail.gmail.com> <4857B6DA.9040309@biologie.uni-kl.de> Message-ID: <320fb6e00806170653g482b104fl739107fcada06dc8@mail.gmail.com> On Tue, Jun 17, 2008 at 2:06 PM, Frank Kauff wrote: > Hi Peter, > > makes totally sense to me. Feel free to do the changes as you see it fit > > Frank Thanks Frank. I've checked in some comment changes to both Ace.py and Phd.py, aimed at both improving the documentation and trying to make epydoc happier for the automatic API documentation: http://biopython.org/DIST/docs/api/ Peter P.S. I also added an __iter__ method to the Ace Iterator (Phd already had one). From mjldehoon at yahoo.com Tue Jun 17 10:08:31 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 17 Jun 2008 07:08:31 -0700 (PDT) Subject: [Biopython-dev] Iterating over Ace contig files In-Reply-To: <320fb6e00806170653g482b104fl739107fcada06dc8@mail.gmail.com> Message-ID: <399611.60966.qm@web62415.mail.re1.yahoo.com> Note that bug #2454 also pertains to the Ace and Phd parsers. If you are modifying the Ace and Phd parsers, can you fix this bug at the same time? http://bugzilla.open-bio.org/show_bug.cgi?id=2454 --Michiel. Peter wrote: On Tue, Jun 17, 2008 at 2:06 PM, Frank Kauff wrote: > Hi Peter, > > makes totally sense to me. Feel free to do the changes as you see it fit > > Frank Thanks Frank.
I've checked in some comment changes to both Ace.py and Phd.py, aimed at both improving the documentation and trying to make epydoc happier for the automatic API documentation: http://biopython.org/DIST/docs/api/ Peter P.S. I also added an __iter__ method to the Ace Iterator (Phd already had one). _______________________________________________ Biopython-dev mailing list Biopython-dev at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython-dev From bugzilla-daemon at portal.open-bio.org Tue Jun 17 10:43:42 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 17 Jun 2008 10:43:42 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806171443.m5HEhgua005645@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #15 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-17 10:43 EST ------- I've removed the strict file-like test in: Bio/Sequencing/Ace.py revision: 1.12 Bio/Sequencing/Phd.py revision: 1.6 In these cases, the handle is immediately turned into an UndoHandle which will be able to check for a sufficiently file-like object. Hopefully that's what you meant Michiel - we could go further and introduce a parse() function and deprecate the Iterator objects in these modules.
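As a rough illustration of the idea in comment #15 (accept anything that behaves enough like a file, rather than testing for a real file object), here is a minimal sketch. The names iter_lines and MinimalHandle are hypothetical and are not part of Bio.Sequencing or of UndoHandle; the only assumption is that a usable handle provides readline().

```python
import io


def iter_lines(handle):
    """Yield lines from any object that offers a readline() method."""
    if not hasattr(handle, "readline"):
        raise TypeError("handle must provide a readline() method")
    while True:
        line = handle.readline()
        if not line:  # an empty string signals end of input
            break
        yield line


class MinimalHandle(object):
    """Hypothetical non-file object that only implements readline()."""

    def __init__(self, lines):
        self._lines = list(lines)

    def readline(self):
        return self._lines.pop(0) if self._lines else ""


# A StringIO and a bare readline() provider are both accepted.
from_stringio = list(iter_lines(io.StringIO("CO Contig1\nBQ\n")))
from_minimal = list(iter_lines(MinimalHandle(["CO Contig2\n"])))
```

The check is deliberately loose: it asks only for the one method the reader needs, so network responses, StringIO buffers and custom wrappers all pass.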
From bugzilla-daemon at portal.open-bio.org Wed Jun 18 06:34:43 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 18 Jun 2008 06:34:43 -0400 Subject: [Biopython-dev] [Bug 2503] An error when parsing NCBIWWW Blast output In-Reply-To: Message-ID: <200806181034.m5IAYhS1026214@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2503 biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |INVALID

------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-18 06:34 EST ------- I'm closing this bug as "INVALID" due to a lack of information. If you are still having trouble Prashantha, and can give us some more information, please re-open this bug. Thank you. Peter From bugzilla-daemon at portal.open-bio.org Wed Jun 18 07:34:26 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 18 Jun 2008 07:34:26 -0400 Subject: [Biopython-dev] [Bug 2497] Unit tests do not cover Bio.Blast.NCBIWWW.qblast() In-Reply-To: Message-ID: <200806181134.m5IBYQjC032061@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2497 biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-18 07:34 EST ------- I checked in a slightly revised version of this as test_NCBI_qblast.py - marking this bug as fixed.
From bugzilla-daemon at portal.open-bio.org Wed Jun 18 08:01:11 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 18 Jun 2008 08:01:11 -0400 Subject: [Biopython-dev] [Bug 2508] NCBIStandalone.blastall: provide support for '-F F' and make it safe In-Reply-To: Message-ID: <200806181201.m5IC1BxA001255@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2508 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-18 08:01 EST ------- Created an attachment (id=946) --> (http://bugzilla.open-bio.org/attachment.cgi?id=946&action=view) Patch to Bio/Blast/NCBIStandalone.py and Tests/test_NCBIStandalone.py Suggested patch for the command injection risk. Can anyone think of a legitimate reason for a ; or & character in the parameters of a BLAST command line? This patch is very simple and will reject any keyword parameter containing the ; or && characters.
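The kind of check described in comment #3 might look roughly like the sketch below. This is not the attached patch: the function name check_blast_parameters and the error message are hypothetical, and the only behaviour taken from the comment is that any keyword parameter containing ; or && is rejected.

```python
# Substrings that would let a parameter value start a second shell command.
# The list mirrors the two mentioned in the comment; a real patch may differ.
FORBIDDEN_SUBSTRINGS = (";", "&&")


def check_blast_parameters(**keywords):
    """Raise ValueError if any keyword value contains a forbidden substring."""
    for name, value in keywords.items():
        text = str(value)
        for bad in FORBIDDEN_SUBSTRINGS:
            if bad in text:
                raise ValueError(
                    "Rejecting suspicious argument for %s: %r" % (name, text)
                )


# Ordinary values pass silently.
check_blast_parameters(expectation=0.001, matrix="BLOSUM62")

# A value that tries to chain a second command is refused.
try:
    check_blast_parameters(database="nr; rm -rf /")
    rejected = False
except ValueError:
    rejected = True
```

A deny-list like this is simple but narrow. Passing the arguments as a list to a process launcher that does not go through the shell would avoid the problem more generally.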
From biopython at maubp.freeserve.co.uk Wed Jun 18 10:00:56 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 18 Jun 2008 15:00:56 +0100 Subject: [Biopython-dev] SeqRecord to file format as string In-Reply-To: <6243BAA9F5E0D24DA41B27997D1FD14402B63C@mail2.exch.c2b2.columbia.edu> References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk> <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu> <4714EB8E.3000700@maubp.freeserve.co.uk> <6243BAA9F5E0D24DA41B27997D1FD14402B63C@mail2.exch.c2b2.columbia.edu> Message-ID: <320fb6e00806180700k327e6913m7ba9c4bdc3421f67@mail.gmail.com> This is returning to a thread from last year, about getting a SeqRecord into a string in a particular file format (e.g. fasta). Jared Flatow had suggested adding a method to the SeqRecord itself. Jared wrote: > > ... To always have to write to a file feels strange, but I see > > that it would be messy to go OO since there are so many formats. > > However, giving preference to fasta over other formats by making it > > innate doesn't seem like such a terrible idea. I do have mixed > > feelings about 'bloating' the code which is why I asked, and you have > > convinced me that this is not quite appropriate given existing > > convention. However the idea would be to put the to_fasta or > > to_format method inside the SeqRecord, then to call it from the IO > > when needed to actually write to a file, but call it directly when > > all that is wanted is a string... > > It's debatable, isn't it? I suspect that for most users, when they want a > record in a particular file format it's for writing to a file. However, > adding a to_format() method to a SeqRecord does make some sense (suitable for > sequential file formats only). This would take a format name and return > a string, by calling Bio.SeqIO with a StringIO object internally.
> > Peter Jared - On reflection, do you think adding a method like this to the SeqRecord (or even just for the FASTA format) would be useful? I recently found myself wanting to use this sort of functionality, and remembered this old thread. This time I was wondering about using the method name tostring (matching the name of a Seq object method). In order to mimic the Seq object's method, the format would be optional and when omitted would give the sequence as a string. Otherwise one of the lower case strings used in Bio.SeqIO should be supplied. There is a sample implementation at the end of this email. On Wed, Oct 17, 2007 Michiel De Hoon wrote: > How about the following: > > SeqIO.write(sequences, handle, format) returns the properly formatted string > if handle==None. I can see the above is simpler than having to supply a StringIO handle, but it doesn't make the functionality available directly from the SeqRecord object. It also complicates the API of the SeqIO module with a special case. Peter --

######################################
For the SeqRecord class, in Bio/SeqRecord.py
######################################

    def tostring(self, format=None) :
        """Returns the record as a string in the specified file format.

        If the file format is omitted (default), the sequence itself is
        returned as a string.  Otherwise the format should be a lower case
        string supported by Bio.SeqIO, which is used to turn the SeqRecord
        into a string."""
        if format :
            from StringIO import StringIO
            from Bio import SeqIO
            handle = StringIO()
            SeqIO.write([self], handle, format)
            handle.seek(0)
            return handle.read()
        else :
            #Return the sequence as a string
            return self.seq.tostring()

############################################

From jflatow at northwestern.edu Wed Jun 18 11:25:18 2008 From: jflatow at northwestern.edu (Jared Flatow) Date: Wed, 18 Jun 2008 10:25:18 -0500 Subject: [Biopython-dev] SeqRecord to file format as string In-Reply-To: <4D53AB82-F673-4F4F-BCEC-BA06088E8721@northwestern.edu> References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk> <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu> <4714EB8E.3000700@maubp.freeserve.co.uk> <6243BAA9F5E0D24DA41B27997D1FD14402B63C@mail2.exch.c2b2.columbia.edu> <320fb6e00806180700k327e6913m7ba9c4bdc3421f67@mail.gmail.com> <4D53AB82-F673-4F4F-BCEC-BA06088E8721@northwestern.edu> Message-ID: <55567F98-C5F5-4A2F-8542-502F17F485E9@northwestern.edu> Quick correction: On Jun 18, 2008, at 10:16 AM, Jared Flatow wrote: > Hi Peter, > > On Jun 18, 2008, at 9:00 AM, Peter wrote: > >> Jared - On reflection, do you think adding a method like this to the >> SeqRecord (or even just for the FASTA format) would be useful? > > Yes I still think so. In fact, for sequences, I would say that I > pretty much never deal with a format other than FASTA, so even making > the __str__ method of SeqRecord return the FASTA format as well > seems reasonable, though perhaps my use cases are different than > others. > > However, py3k and 2.6 will make available the functionality > described in PEP 3101: > > http://www.python.org/dev/peps/pep-3101/ > > I think it would be best to define some semantics that are > compatible with this PEP.
This would basically mean using the > __format__ method (which could be the same as the tostring method > you have defined below). To achieve backward compatibility and/or a > more OO interface, tostring could just be an alias for __format__. > Thus, instead of calling format(seq_rec, 'fasta') one could call > seq_rec.tostring('fasta') and these would be equivalent. The PEP > also states that format(seq_rec) should be the same as str(seq_rec). On second thought it seems like a .format method (similar to the one the string class is acquiring) should be used as an alias to __format__ (somehow I think tostring should always be the same as __str__) > In short, I think creating methods to return formatted versions of > objects (SeqRecords) is a good idea, but most especially if it is > done in a way consistent with the language's vision. > > Best, > jared From bugzilla-daemon at portal.open-bio.org Wed Jun 18 11:36:48 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 18 Jun 2008 11:36:48 -0400 Subject: [Biopython-dev] [Bug 2454] Iterators can't use file-like objects In-Reply-To: Message-ID: <200806181536.m5IFamvB015695@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2454 ------- Comment #16 from mdehoon at ims.u-tokyo.ac.jp 2008-06-18 11:36 EST ------- (In reply to comment #15) > I've removed the strict file-like test in: > > Bio/Sequencing/Ace.py revision: 1.12 > Bio/Sequencing/Phd.py revision: 1.6 > > In these cases, the handle is immediately turned into an UndoHandle which will > be able to check for a sufficiently file like object. > > Hopefully that's what you meant Michiel Actually, I think we should avoid using an UndoHandle altogether, now that Python has generator functions. > - we could go further and introduce a > parse() function and deprecate the Iterator objects in these modules. > That would make things a lot easier. 
An Iterator class was useful in older versions of Python, but generator functions provide a cleaner alternative. In Ace.py, we'd need three functions:
1) read(handle), which returns one record (Contig) read from the handle, and None otherwise;
2) parse(handle), a generator function returning an iterator over the records;
3) a local function _process_line(line, record)

These functions then look like this:

def read(handle):
    record = None
    for line in handle:
        if line[:2]=='CO':
            break
    else:
        return None
    record = Contig()
    for line in handle:
        if line[:2]=='CO':
            return record
        else:
            _process_line(line, record)
    # end of input: hand back the contig read so far
    return record

def parse(handle):
    record = None
    for line in handle:
        if line[:2]=='CO':
            if record:
                yield record
            record = Contig()
        _process_line(line, record)
    if record:
        # a generator has to yield (not return) its last record
        yield record

The actual work is done in _process_line. So we don't need to store the read lines explicitly; this is now taken care of by the generator function. Hence, we don't need to convert the handle to an UndoHandle. In addition, handle can now also be a list of lines instead of a file handle. In this respect, I think Zachary was right in comment #11: > Maybe it's a good idea for any parsers/iterators to just > use the iterator-like ability of file handles? In other words, as long as we can pull lines from the handle, we can parse it. In Phd.py, it's even simpler. Here, we only need the read() and parse() function:

def read(handle):
    for line in handle:
        if line.startswith("BEGIN_SEQUENCE"):
            record = Record()
        elif line.startswith("END_SEQUENCE"):
            return record
        else:
            # do the actual processing of the other lines here
            pass

def parse(handle):
    while True:
        record = read(handle)
        if not record:
            return
        yield record

Again, we can process each line just as they come along. No UndoHandle, no Parser, no Consumer, no Scanner needed.
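To make the generator approach from comment #16 concrete, here is a small self-contained sketch. It is not the code that went into Bio.Sequencing.Ace: Contig and _process_line are hypothetical stand-ins that only collect raw lines, lines seen before the first CO line are skipped (an assumption about how the header should be treated), and read() is simply built on top of parse(), which differs from the version in the comment.

```python
class Contig(object):
    """Hypothetical stand-in for a parsed contig; it only stores raw lines."""

    def __init__(self):
        self.lines = []


def _process_line(line, record):
    """Placeholder for the real per-line parsing work."""
    record.lines.append(line.rstrip("\n"))


def parse(handle):
    """Yield one Contig per 'CO' block from any iterable of lines."""
    record = None
    for line in handle:
        if line[:2] == "CO":
            if record is not None:
                yield record  # previous contig is complete
            record = Contig()
        if record is not None:  # skip anything before the first 'CO' line
            _process_line(line, record)
    if record is not None:
        yield record  # last contig in the input


def read(handle):
    """Return the first Contig found in handle, or None if there is none."""
    for record in parse(handle):
        return record
    return None


# A plain list of lines works just as well as an open file.
example = ["AS 2 3\n", "CO Contig1 10 2 1 U\n", "BQ\n", "CO Contig2 8 1 1 U\n", "RD r1\n"]
contigs = list(parse(example))
```

Because parse() only iterates over its argument, nothing has to be pushed back onto the handle, which is why no UndoHandle is needed in this style.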
From jflatow at northwestern.edu Wed Jun 18 11:16:59 2008 From: jflatow at northwestern.edu (Jared Flatow) Date: Wed, 18 Jun 2008 10:16:59 -0500 Subject: [Biopython-dev] SeqRecord to file format as string In-Reply-To: <320fb6e00806180700k327e6913m7ba9c4bdc3421f67@mail.gmail.com> References: <0616CDF3-C4CB-4954-916C-A307A9CB9DD0@northwestern.edu> <47147341.4020708@maubp.freeserve.co.uk> <7981A30E-BA08-4748-8FA3-4D7B82AF0F59@northwestern.edu> <4714EB8E.3000700@maubp.freeserve.co.uk> <6243BAA9F5E0D24DA41B27997D1FD14402B63C@mail2.exch.c2b2.columbia.edu> <320fb6e00806180700k327e6913m7ba9c4bdc3421f67@mail.gmail.com> Message-ID: <4D53AB82-F673-4F4F-BCEC-BA06088E8721@northwestern.edu> Hi Peter, On Jun 18, 2008, at 9:00 AM, Peter wrote: > Jared - On reflection, do you think adding a method like this to the > SeqRecord (or even just for the FASTA format) would be useful? Yes I still think so. In fact, for sequences, I would say that I pretty much never deal with a format other than FASTA, so even making the __str__ method of SeqRecord return the FASTA format as well seems reasonable, though perhaps my use cases are different than others. However, py3k and 2.6 will make available the functionality described in PEP 3101: http://www.python.org/dev/peps/pep-3101/ I think it would be best to define some semantics that are compatible with this PEP. This would basically mean using the __format__ method (which could be the same as the tostring method you have defined below). To achieve backward compatibility and/or a more OO interface, tostring could just be an alias for __format__. Thus, instead of calling format(seq_rec, 'fasta') one could call seq_rec.tostring('fasta') and these would be equivalent. The PEP also states that format(seq_rec) should be the same as str(seq_rec). In short, I think creating methods to return formatted versions of objects (SeqRecords) is a good idea, but most especially if it is done in a way consistent with the language's vision.
Best, jared From yair.benita at gmail.com Wed Jun 18 13:26:02 2008 From: yair.benita at gmail.com (Yair Benita) Date: Wed, 18 Jun 2008 13:26:02 -0400 Subject: [Biopython-dev] BioPax parser Message-ID: Hi Guys, Does anyone have a biopax parser written in python? Thanks, Yair From biopython at maubp.freeserve.co.uk Wed Jun 18 13:42:13 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 18 Jun 2008 18:42:13 +0100 Subject: [Biopython-dev] BioPax parser In-Reply-To: References: Message-ID: <320fb6e00806181042y169f580epbd8c876eb3cb57fa@mail.gmail.com> On Wed, Jun 18, 2008 at 6:26 PM, Yair Benita wrote: > Hi Guys, > Does anyone have a biopax parser written in python? > Thanks, > Yair I don't know of any (but I haven't searched). From a quick look on www.biopax.org they use XML, so you should be able to parse it in python fairly easily - but I guess some sort of object orientated representation of the data would be very nice to have. Peter From bugzilla-daemon at portal.open-bio.org Thu Jun 19 06:08:55 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 19 Jun 2008 06:08:55 -0400 Subject: [Biopython-dev] [Bug 2508] NCBIStandalone.blastall: provide support for '-F F' and make it safe In-Reply-To: Message-ID: <200806191008.m5JA8t0v016495@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2508 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-19 06:08 EST ------- On the issue of the low-complexity filter, that is actually already supported in NCBIStandalone.blastall(), NCBIStandalone.blastpgp() and NCBIStandalone.rpsblast() using the optional argument 'filter'. This is described in the doc string too, although it doesn't use the phrase "low complexity" which might be clearer.
From bugzilla-daemon at portal.open-bio.org Thu Jun 19 06:20:03 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 19 Jun 2008 06:20:03 -0400 Subject: [Biopython-dev] [Bug 2494] _retrieve_taxon in BioSQL.py needs urgent optimization In-Reply-To: Message-ID: <200806191020.m5JAK3OZ017201@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2494 biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-19 06:20 EST ------- I'm marking this as fixed now, but if anyone does find an issue with it please re-open the bug. Thanks for your work on this Eric. Peter From bugzilla-daemon at portal.open-bio.org Thu Jun 19 06:41:22 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 19 Jun 2008 06:41:22 -0400 Subject: [Biopython-dev] [Bug 2408] GenBank records do not contain U's In-Reply-To: Message-ID: <200806191041.m5JAfMNK018058@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2408 biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2008-06-19 06:41 EST ------- Given there were no other opinions voiced on how to handle this, I went ahead and fixed this in Bio/GenBank/__init__.py CVS revision 1.83 For records from RNA, if the sequence contains T but not U, we will use a DNA alphabet in the Seq object. Thanks for raising this Marcin.
From mjldehoon at yahoo.com Thu Jun 19 09:04:31 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 19 Jun 2008 06:04:31 -0700 (PDT) Subject: [Biopython-dev] Bio.CDD, anyone? Message-ID: <14893.84074.qm@web62409.mail.re1.yahoo.com> Hi everybody, Bio.CDD is a module with a parser for CDD (NCBI's Conserved Domain Database) records. The parser parses HTML pages from CDD's web site. Since the parser was written about six years ago, the CDD web site has changed considerably. Bio.CDD therefore cannot parse current HTML pages from CDD. So I am wondering: 1) Is anybody using Bio.CDD? 2) Is anybody willing to update Bio.CDD to handle current HTML? 3) If not, can we deprecate it? There is not much point in having a parser for HTML pages from years ago. --Michiel. From biopython at maubp.freeserve.co.uk Thu Jun 19 09:38:29 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 19 Jun 2008 14:38:29 +0100 Subject: [Biopython-dev] Bio.CDD, anyone? In-Reply-To: <14893.84074.qm@web62409.mail.re1.yahoo.com> References: <14893.84074.qm@web62409.mail.re1.yahoo.com> Message-ID: <320fb6e00806190638y2e3729e1ga66561de0c962700@mail.gmail.com> > Bio.CDD is a module with a parser for CDD (NCBI's Conserved Domain Database) > records. The parser parses HTML pages from CDD's web site. Since the parser > was written about six years ago, the CDD web site has changed considerably. > Bio.CDD therefore cannot parse current HTML pages from CDD. A couple of years ago, I wanted to get the CDD domain name and description and ended up writing my own very simple and crude parser to extract just this information. Doing a proper job would mean extracting lots and lots of fields, e.g. http://www.nc