From tiagoantao at gmail.com Tue Apr 1 08:23:57 2008
From: tiagoantao at gmail.com (Tiago Antão)
Date: Tue, 1 Apr 2008 13:23:57 +0100
Subject: [Biopython-dev] Bio.PopGen and CVS/SVN
In-Reply-To: <508125.38490.qm@web62401.mail.re1.yahoo.com>
References: <6d941f120803311208k6b6c9d1ah58c7808e0fbd0e2c@mail.gmail.com>
    <508125.38490.qm@web62401.mail.re1.yahoo.com>
Message-ID: <6d941f120804010523k1edefb9aq8e57ad3137f66c59@mail.gmail.com>

On Tue, Apr 1, 2008 at 1:13 PM, Michiel de Hoon wrote:
> What is the advantage of branching?
> AFAIK, the code in Bio.PopGen does not affect the rest of Biopython anyway.
> All things equal, I'd prefer not to branch to keep things simple for users,
> not to mention myself.

The idea was to make things easier for you and new developers,
actually. Mess on the branches and a clean trunk, so that things are
easy for new developers and releases are easy. Merging would probably
be the responsibility of whoever is developing the code. So you would
have a clean trunk and new people would also see something clean.

I think the problem really stems from where biopython is going: if it
is mainly in maintenance mode then branching makes no sense (as new
code is essentially refactoring and bug patching). If there are new
features and modules popping in (which might bring initial chaos) then
branching would be a good place to clean out the chaos before hitting
the main trunk.

Most of the code that I am adding now is actually quite pacific
(although being the most important), but I was trying to avoid having
the main trunk with code under heavy development.

Tiago

From mjldehoon at yahoo.com Tue Apr 1 08:13:22 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 1 Apr 2008 05:13:22 -0700 (PDT)
Subject: [Biopython-dev] Bio.PopGen and CVS/SVN
In-Reply-To: <6d941f120803311208k6b6c9d1ah58c7808e0fbd0e2c@mail.gmail.com>
Message-ID: <508125.38490.qm@web62401.mail.re1.yahoo.com>

What is the advantage of branching?
AFAIK, the code in Bio.PopGen does not affect the rest of Biopython anyway.
All things equal, I'd prefer not to branch to keep things simple for users,
not to mention myself.

--Michiel.

Tiago Antão wrote:
On Mon, Mar 31, 2008 at 8:04 PM, Peter wrote:
> There is a lot to be said for having a single stable trunk - it
> certainly makes things simpler for any new developers to get to grips
> with things.

It is one of those issues where there is no clear answer. Maybe a case
by case analysis? I think having 5 gazillion branches would not be a
good idea ever, but in the Biopython case many modules are somewhat
self contained, making merging an easier exercise.

Tiago
_______________________________________________
Biopython-dev mailing list
Biopython-dev at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython-dev

From mjldehoon at yahoo.com Tue Apr 1 08:27:55 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 1 Apr 2008 05:27:55 -0700 (PDT)
Subject: [Biopython-dev] Genbank dbSNP support
In-Reply-To: <6d941f120803311513k43139fbi97683597c15f03a2@mail.gmail.com>
Message-ID: <160489.24269.qm@web62402.mail.re1.yahoo.com>

> Any plans for dbSNP support?
> http://www.ncbi.nlm.nih.gov/SNP/index.html

No existing plans, as far as I know.

> I think I would volunteer to implement this.

Great!

> I think I would volunteer to implement this. A simple solution would
> be to add both databases and return types. Michiel (I suppose this is
> code that you are actively maintaining, or is it Peter?), can I send
> you a diff?

Opening a bug report on Bugzilla and adding your diff there is better.
A diff sent by email is likely to get lost (i.e., forgotten).

Also, please have a look at Bio.Entrez (the module formerly known as
Bio.WWW.NCBI). It has code for all of NCBI's EUtils, including efetch,
except for parsers at this point.
This is currently under development. Bio.Entrez is in release 1.45, but
there are already some additions in CVS.

--Michiel.

From mjldehoon at yahoo.com Tue Apr 1 08:52:17 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 1 Apr 2008 05:52:17 -0700 (PDT)
Subject: [Biopython-dev] Bio.Entrez XML parsing
In-Reply-To: <264855a00803301751h270ee34dg86325eb1af298369@mail.gmail.com>
Message-ID: <685989.32313.qm@web62405.mail.re1.yahoo.com>

I have added a read() function to Bio.Entrez in CVS.
Following Peter's suggestion, I put a dictionary (_NameToModule) inside
the Bio.Entrez.DataHandler class, which can be used to override the
default parser with a user-defined parser.

I am not sure though why a user-defined parser needs to go through
Bio.Entrez.read(). Wouldn't it be easier to do something like

>>> from Bio import Entrez
>>> handle = Entrez.efetch(something)
>>> record = run_my_parser(handle)

Currently, I have added only one parser (for EInfo). To try it, use

>>> from Bio import Entrez
>>> handle = Entrez.einfo()
>>> record = Entrez.read(handle)
>>> print record
['pubmed', 'protein', 'nucleotide', 'nuccore', 'nucgss', 'nucest',
'structure', 'genome', 'books', 'cancerchromosomes', 'cdd', 'gap',
'domains', 'gene', 'genomeprj', 'gensat', 'geo', 'gds', 'homologene',
'journals', 'mesh', 'ncbisearch', 'nlmcatalog', 'omia', 'omim', 'pmc',
'popset', 'probe', 'proteinclusters', 'pcassay', 'pccompound',
'pcsubstance', 'snp', 'taxonomy', 'toolkit', 'unigene', 'unists']
# To get information about the snp database
>>> handle = Entrez.einfo(db="snp")
>>> record = Entrez.read(handle)
>>> print record["Count"]
44992036
>>> print record["LastUpdate"]
2007/11/29 18:22

--Michiel.

Sean Davis wrote:
This makes sense.
However, it seems that there needs to be a way to "register" a parser
with read() so that users can extend their local installation with a
specialized parser. In other words, it seems that a way to dynamically
register a parser with read() would be helpful. Or am I missing
something?

Sean

From mjldehoon at yahoo.com Tue Apr 1 09:04:36 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 1 Apr 2008 06:04:36 -0700 (PDT)
Subject: [Biopython-dev] Bio.PopGen and CVS/SVN
In-Reply-To: <6d941f120804010523k1edefb9aq8e57ad3137f66c59@mail.gmail.com>
Message-ID: <526878.58305.qm@web62413.mail.re1.yahoo.com>

Tiago Antão wrote:
Most of the code that I am adding now is actually quite pacific
(although being the most important), but I was trying to avoid having
the main trunk with code under heavy development.

While I appreciate your consideration, personally I don't mind if some
module of the main trunk is under heavy development, as long as it
doesn't break other modules. Go ahead, knock yourself out :-).

--Michiel.

From biopython at maubp.freeserve.co.uk Tue Apr 1 09:49:14 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Apr 2008 14:49:14 +0100
Subject: [Biopython-dev] Bio.Entrez XML parsing
In-Reply-To: <685989.32313.qm@web62405.mail.re1.yahoo.com>
References: <264855a00803301751h270ee34dg86325eb1af298369@mail.gmail.com>
    <685989.32313.qm@web62405.mail.re1.yahoo.com>
Message-ID: <320fb6e00804010649x4175396fr23d1f1993d1da11f@mail.gmail.com>

Michiel wrote:
> I have added a read() function to Bio.Entrez in CVS.
> Following Peter's suggestion, I put a dictionary (_NameToModule) inside the
> Bio.Entrez.DataHandler class, which can be used to override the default parser
> with a user-defined parser.
Do you only intend to support Entrez XML files with this read()
function, or potentially other formats too?

Even for the assorted XML formats, I'm not yet clear on how you
imagine this being extended.

Have you had a chance to look at Eric's Entrez Taxonomy XML parser? It
would need some re-factoring to fit in (see attachments on Bug 2475).
http://bugzilla.open-bio.org/show_bug.cgi?id=2475

> I am not sure though why a user-defined parser needs to go through
> Bio.Entrez.read(). Wouldn't it be easier to do something like
> >>> from Bio import Entrez
> >>> handle = Entrez.efetch(something)
> >>> record = run_my_parser(handle)

Sure - you could pass the handle to any parser of your choice, e.g.
Bio.SeqIO.read() or Bio.SeqIO.parse() if you used Bio.Entrez.efetch to
get a GenBank or Fasta file.

Peter

From tiagoantao at gmail.com Tue Apr 1 10:22:43 2008
From: tiagoantao at gmail.com (Tiago Antão)
Date: Tue, 1 Apr 2008 15:22:43 +0100
Subject: [Biopython-dev] Bio.PopGen and CVS/SVN
In-Reply-To: <526878.58305.qm@web62413.mail.re1.yahoo.com>
References: <6d941f120804010523k1edefb9aq8e57ad3137f66c59@mail.gmail.com>
    <526878.58305.qm@web62413.mail.re1.yahoo.com>
Message-ID: <6d941f120804010722va1ded18q1e5e3c69ebc6c7c8@mail.gmail.com>

On Tue, Apr 1, 2008 at 2:04 PM, Michiel de Hoon wrote:
> Tiago Antão wrote:
> Most of the code that I am adding now is actually quite pacific
> (although being the most important), but I was trying to avoid having
> the main trunk with code under heavy development.
> While I appreciate your consideration, personally I don't mind if some
> module of the main trunk is under heavy development, as long as it doesn't
> break other modules. Go ahead, knock yourself out :-).
I will do that when SVN is online ;)

From mjldehoon at yahoo.com Tue Apr 1 10:23:29 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 1 Apr 2008 07:23:29 -0700 (PDT)
Subject: [Biopython-dev] Bio.Entrez XML parsing
In-Reply-To: <320fb6e00804010649x4175396fr23d1f1993d1da11f@mail.gmail.com>
Message-ID: <246475.95559.qm@web62405.mail.re1.yahoo.com>

> Do you only intend to support Entrez XML files with this read()
> function, or potentially other formats too?

As all of Entrez's EUtils can return XML output (with many of them
returning XML only), I was thinking of parsing XML files only. EUtils
output in one of the sequence formats ought to be parsed by Bio.SeqIO.
I am not sure if there are any other major file formats that we should
handle. We can think about that later if and when the need arises.

> Even for the assorted XML formats, I'm not yet clear on how you
> imagine this being extended.

This I am not clear on either; I just added this in response to Sean's
request so we have some concrete code to look at. Sean, could you give
an example of how you would extend (this or a different) parser?

> Have you had a chance to look at Eric's Entrez Taxonomy XML
> parser? It would need some re-factoring to fit in (see attachments
> on Bug 2475).
> http://bugzilla.open-bio.org/show_bug.cgi?id=2475

Eric uses a DOM parser, while I am using a SAX parser. DOM parsers have
the advantage that they allow modification of the XML tree, whereas SAX
just goes through the XML in one pass. SAX is preferable for large
files, since DOM keeps the full XML file in memory, but maybe it is not
so relevant for NCBI's EUtils. Anyway, if the end result is a Python
object representing the XML, it doesn't matter much whether we go
through DOM or SAX. Eric, do you have a strong preference for DOM?

Once we have the basic framework for the Bio.Entrez parser settled, we
can merge it with Eric's code.

--Michiel
From bugzilla-daemon at portal.open-bio.org Wed Apr 2 04:58:05 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 2 Apr 2008 04:58:05 -0400
Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage
In-Reply-To:
Message-ID: <200804020858.m328w5HX024442@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2475

------- Comment #8 from mdehoon at ims.u-tokyo.ac.jp 2008-04-02 04:58 EST -------
Eric,

I was looking at the part of your code that parses the XML. I see you
use a DOM parser instead of a SAX parser. For Bio.Entrez in general, I
have a slight preference for a SAX parser, since it does not require
having the full XML in memory. Do you have a strong preference for a
DOM parser?

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Wed Apr 2 05:31:44 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 2 Apr 2008 05:31:44 -0400
Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage
In-Reply-To:
Message-ID: <200804020931.m329Vi9p026094@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2475

------- Comment #9 from ericgibert at yahoo.fr 2008-04-02 05:31 EST -------
Well, I always use DOM as I find it easier whereas SAX seems cumbersome.
I can try to convert if you want...

Another question: do you want me to add an extra function in the class
to update the tables taxon/taxon_name, or do you prefer it to be done
in the loader.py?

Let me know and I'll provide that missing code.
I am already thinking about it with the default to be
taxon_id == ncbi_taxon_id, and if not then taxon_id will be
autogenerated.

From bugzilla-daemon at portal.open-bio.org Wed Apr 2 05:45:14 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 2 Apr 2008 05:45:14 -0400
Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage
In-Reply-To:
Message-ID: <200804020945.m329jE0u026758@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2475

------- Comment #10 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-02 05:45 EST -------
For the organisation of the code, what I had in mind was a general
purpose XML parser in Bio.Entrez.Taxonomy (with nothing to do with
BioSQL), which would be called from an updated BioSQL.Loader to parse a
handle to the XML data fetched using Bio.Entrez.efetch().

When adding a new SeqRecord to the BioSQL database, we would start with
its NCBI taxon ID, and assuming it's not already in the database, go
online to find the parent taxon ID, and repeat until we match the ID of
an existing taxon record in the database (or get to the root node). And
then add all the new taxon records to the database.

[I hope this is roughly the process you had in mind Eric]
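
The lookup loop described in comment #10 can be sketched roughly as
follows. This is only an illustration, not the BioSQL.Loader code: the
function name missing_lineage, the set known_ids (standing in for a
database query) and the callable fetch_parent (standing in for an
online Entrez Taxonomy lookup) are all hypothetical, and the taxon IDs
in the toy table are made up.

```python
def missing_lineage(ncbi_taxon_id, known_ids, fetch_parent):
    """Return the taxon IDs that still need adding, root-most first.

    known_ids    -- hypothetical stand-in for "is this taxon in the database?"
    fetch_parent -- hypothetical stand-in for the online parent lookup;
                    it must return None once the root node is passed.
    """
    missing = []
    current = ncbi_taxon_id
    # Walk towards the root until we meet a taxon we already have.
    while current is not None and current not in known_ids:
        missing.append(current)
        current = fetch_parent(current)
    # Parents must be inserted before their children.
    missing.reverse()
    return missing


# Toy parent table with made-up IDs: 4 -> 3 -> 2 -> 1 (root).
PARENTS = {4: 3, 3: 2, 2: 1, 1: None}

to_add = missing_lineage(4, {2}, PARENTS.get)
```

With taxon 2 already present, only 3 and 4 are returned, in that order,
so each new record can refer to a parent that already exists.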
From bugzilla-daemon at portal.open-bio.org Wed Apr 2 06:56:24 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 2 Apr 2008 06:56:24 -0400
Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage
In-Reply-To:
Message-ID: <200804021056.m32AuOl9029219@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2475

------- Comment #11 from ericgibert at yahoo.fr 2008-04-02 06:56 EST -------
Ok to have the code in Loader.py.

When a SeqRecord is to be added, we create a Taxonomy instance. I will
add a function to return a copy of the _NCBI_lineage list. Then from
"top" to "bottom", check if the taxon exists, if not, add it, until the
species itself (there ensure that the parent_taxon_id is well
populated).

By default, we assume that taxon_id == NCBI_taxon_id. If this is not
the case, do I raise an error or "fall to plan B" and let the database
auto-assign the taxon_id?

One missing point: the left and right value. Do you know what to do? I
have run the Perl script on a test database and plan to look into the
created records to clarify it... but you can save me the effort if you
already know their logic.

PS: because my original script was only updating the partial records
created by the previous algorithm of Loader, I need to rewrite it.
Maybe 2 man-days.
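
On the left and right values: the thread leaves their logic open. One
plausible reading, offered here purely as an assumption, is the usual
"nested set" numbering, where a depth-first walk hands out a counter on
entering and on leaving each node. The sketch below shows that
numbering on a toy tree; the function name and the tree are
hypothetical and nothing here is taken from the BioSQL scripts.

```python
def nested_set_values(children, root):
    """Assign (left, right) pairs by a depth-first walk (nested-set model).

    children -- dict mapping a node to the list of its child nodes
    """
    values = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        left = counter              # number handed out on the way down
        for child in children.get(node, []):
            visit(child)
        counter += 1
        values[node] = (left, counter)  # number handed out on the way up

    visit(root)
    return values


# Toy tree: root has children a and b; a has one child a1.
TREE = {"root": ["a", "b"], "a": ["a1"]}
numbering = nested_set_values(TREE, "root")
```

Under this scheme a node X is a descendant of Y exactly when Y's left
value is smaller than X's left value and Y's right value is larger than
X's right value. A real taxonomy would need an iterative walk, since
the recursion depth here grows with the depth of the tree.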
From bugzilla-daemon at portal.open-bio.org Wed Apr 2 07:52:18 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 2 Apr 2008 07:52:18 -0400
Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage
In-Reply-To:
Message-ID: <200804021152.m32BqIcD031558@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2475

------- Comment #12 from mdehoon at ims.u-tokyo.ac.jp 2008-04-02 07:52 EST -------
Peter wrote:
> For the organisation of the code, what I had in mind was a general purpose XML
> parser in Bio.Entrez.Taxonomy (with nothing to do with BioSQL), which would be
> called from an updated BioSQL.Loader to parse a handle to the XML data fetched
> using Bio.Entrez.efetch().

That is what I have in mind also.

Eric wrote:
> Well, I always use DOM as I find it easier whereas SAX seems cumbersome.
> I can try to convert if you want...

I can understand that for your own work, you prefer to use DOM to just
pick up the tags you are interested in. For Biopython, though, we
should have a more general solution that is useful to other users and
in other situations also. Which is why I was thinking of a parser in
Bio.Entrez that parses all of the XML returned from the Taxonomy
database.

If you're interested in writing a full parser for Taxonomy XML, maybe
the parser for EInfo that is currently in CVS may be useful as an
example. For ESearch, I already wrote a SAX parser; I just need to
upload it to CVS.
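
For readers who have not used SAX, a minimal sketch of the one-pass
style discussed above, using only the standard library's xml.sax
module. It is not the Bio.Entrez parser from CVS: the class name, the
helper read_db_list and the sample XML (a simplified guess at what an
EInfo database list might look like) are all assumptions made for
illustration.

```python
import io
import xml.sax


class DbListHandler(xml.sax.ContentHandler):
    """Collect the text of every <DbName> element in a single pass."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._buffer = None  # None means "not inside a <DbName> element"

    def startElement(self, name, attrs):
        if name == "DbName":
            self._buffer = []

    def characters(self, content):
        # SAX may deliver the text of one element in several chunks.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "DbName":
            self.names.append("".join(self._buffer))
            self._buffer = None


def read_db_list(handle):
    """Parse a file-like handle and return the list of database names."""
    handler = DbListHandler()
    xml.sax.parse(handle, handler)
    return handler.names


# Simplified, made-up sample; real EInfo output carries more structure.
SAMPLE = (b"<eInfoResult><DbList><DbName>pubmed</DbName>"
          b"<DbName>snp</DbName></DbList></eInfoResult>")
names = read_db_list(io.BytesIO(SAMPLE))
```

The handler never holds more than the current element's text, which is
the memory argument made for SAX above; the price is the explicit state
(_buffer) that a DOM-based version would not need.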
From anaryin at gmail.com Wed Apr 2 08:14:20 2008
From: anaryin at gmail.com (João Rodrigues)
Date: Wed, 2 Apr 2008 13:14:20 +0100
Subject: [Biopython-dev] Bio.Entrez XML parsing
Message-ID:

Hey all, I don't know if I'm gonna say something really stupid, because
really, I haven't read much of the discussion. From what I read you're
discussing if we can use a parser for the XML given by most entrez
methods. If you want my advice, I'd use libxml2. I'm currently working
with efetch/esearch to get PMIDs and Abstracts and I use XML as my
return mode and libxml2 as my parser. It is quite simple to use and
considerably faster when compared to, for example, minidom. If you need
some example code, or a hand with this, I'll be more than glad to lend
pieces of my code :)

Best regards!

From biopython at maubp.freeserve.co.uk Wed Apr 2 09:07:59 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 2 Apr 2008 14:07:59 +0100
Subject: [Biopython-dev] Bio.Entrez XML parsing
In-Reply-To:
References:
Message-ID: <320fb6e00804020607s53d171d6s503756ac65de875a@mail.gmail.com>

On Wed, Apr 2, 2008 at 1:14 PM, João Rodrigues wrote:
> Hey all, I don't know if I'm gonna say something really stupid, because
> really, I haven't read much of the discussion. From what I read you're
> discussing if we can use a parser for the XML given by most entrez methods.
> If you want my advice, I'd use libxml2. I'm currently working with
> efetch/esearch to get PMIDs and Abstracts and I use XML as my return mode
> and libxml2 as my parser. It is quite simple to use and considerably faster
> when compared to, for example, minidom. If you need some example code, or
> a hand with this, I'll be more than glad to lend pieces of my code :)
>
> Best regards!

Do you know how libxml2 compares to the python SAX XML parser for
speed?

One big downside is libxml2 would be yet another external dependency
for Biopython. I would be much happier if we could stick to the built
in python libraries.
Peter

From bugzilla-daemon at portal.open-bio.org Wed Apr 2 09:41:35 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 2 Apr 2008 09:41:35 -0400
Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage
In-Reply-To:
Message-ID: <200804021341.m32DfZYO006566@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2475

------- Comment #13 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-02 09:41 EST -------
In reply to comment 11,

> Ok to have the code in Loader.py.
> When a SeqRecord is to be added, we create a Taxonomy instance.
> I will add a function to return a copy of the _NCBI_lineage list.
> Then from "top" to "bottom", check if the taxon exists, if not,
> add it, until the species itself (there ensure that the
> parent_taxon_id is well populated).

Something like that sounds fine. But I think we should settle the
Bio.Entrez.Taxonomy code first.

> By default, we assume that taxon_id == NCBI_taxon_id.

Why do you say that? I don't think we should make this assumption. See
also BioSQL project Bug 2470

> If this is not the case, do I raise an error or
> "fall to plan B" and let the database auto-assign
> the taxon_id?

I am inclined to let the database assign the taxon_id, unless after
discussion on the BioSQL mailing list it is agreed that "attempting" to
use the NCBI taxon id as the taxon_id is encouraged.

> One missing point: the left and right value. Do you know
> what to do? I have run the Perl script on a test database
> and plan to look into the created records to clarify it...
> but you can save me the effort if you already know their logic.

Sorry, I haven't yet gone through this enough to be confident in the
correct usage (and Brad's comments in the relevant old bit of Loader.py
weren't very helpful). It might be worth discussing this on the BioSQL
mailing list.
Peter

From tiagoantao at gmail.com Wed Apr 2 12:57:20 2008
From: tiagoantao at gmail.com (Tiago Antão)
Date: Wed, 2 Apr 2008 17:57:20 +0100
Subject: [Biopython-dev] Statistics code
Message-ID: <6d941f120804020957o400f8160k6bb1ca236b75bc2f@mail.gmail.com>

Hi,

Is it OK to add a dependency on SciPy? Would only influence
Bio.PopGen, of course (part of it, actually). Test code would be
written so as not to fail the test suite if the library is missing.
As I see it, it is a bit like adding a dependency on an external
program (which already happens). If the dependency is not there then
that part is non-functional, but all the rest is OK

--
http://www.tiago.org

From biopython at maubp.freeserve.co.uk Wed Apr 2 13:20:54 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 2 Apr 2008 18:20:54 +0100
Subject: [Biopython-dev] Statistics code
In-Reply-To: <6d941f120804020957o400f8160k6bb1ca236b75bc2f@mail.gmail.com>
References: <6d941f120804020957o400f8160k6bb1ca236b75bc2f@mail.gmail.com>
Message-ID: <320fb6e00804021020n2a910a80x79ee6cc3829d5e51@mail.gmail.com>

On Wed, Apr 2, 2008 at 5:57 PM, Tiago Antão wrote:
> Hi,
>
> Is it OK to add a dependency on SciPy? Would only influence
> Bio.PopGen, of course (part of it, actually).

It's not out of the question, but what exactly do you need from SciPy?
If it's a very simple thing we might be better off just duplicating it,
e.g. in the Bio.Statistics module.
Peter

From tiagoantao at gmail.com Wed Apr 2 13:35:49 2008
From: tiagoantao at gmail.com (Tiago Antão)
Date: Wed, 2 Apr 2008 18:35:49 +0100
Subject: [Biopython-dev] Statistics code
In-Reply-To: <320fb6e00804021020n2a910a80x79ee6cc3829d5e51@mail.gmail.com>
References: <6d941f120804020957o400f8160k6bb1ca236b75bc2f@mail.gmail.com>
    <320fb6e00804021020n2a910a80x79ee6cc3829d5e51@mail.gmail.com>
Message-ID: <6d941f120804021035w2eb864bdo27fbe47b4be6c923@mail.gmail.com>

On Wed, Apr 2, 2008 at 6:20 PM, Peter wrote:
> It's not out of the question, but what exactly do you need from SciPy?
> If it's a very simple thing we might be better off just duplicating it,
> e.g. in the Bio.Statistics module.

I know that this will sound ridiculous, but, in the long run I will
need almost everything. Statistics was invented because of and for
population genetics.
http://en.wikipedia.org/wiki/Ronald_Fisher
It is at the core of population genetics (actually, without it,
Bio.PopGen is really nothing relevant).

Bio.Statistics has not much content now...

But this requirement could be completely isolated (as is the
requirement for external programs). Not having a statistical library
would only mean not using Bio.PopGen.Stats. The impact on other
modules would be null.

I would imagine this problem (requiring external libraries) to appear
from time to time as new functionality is included. It will only not
appear if Biopython stops in time.

SciPy is a stable project, not an obscure library.

Tiago

From mjldehoon at yahoo.com Wed Apr 2 20:13:15 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Wed, 2 Apr 2008 17:13:15 -0700 (PDT)
Subject: [Biopython-dev] Statistics code
In-Reply-To: <6d941f120804021035w2eb864bdo27fbe47b4be6c923@mail.gmail.com>
Message-ID: <806128.68558.qm@web62410.mail.re1.yahoo.com>

> > It's not out of the question, but what exactly do you need from SciPy?
>
> I know that this will sound ridiculous, but, in the long run I will
> need almost everything.
I don't think we should include a dependency now because we may need it
in the long run.

> Bio.Statistics has not much content now...

I agree, and probably it is not a good idea to have too much statistics
code in Biopython. Such code would fit in better in a numerical or
statistics library.

> SciPy is a stable project, not an obscure library.

While this is true, in my experience SciPy is also difficult to
install. It may mean fewer people using your code because they don't
want to go through the hassle of installing SciPy. Particularly users
coming from a biology rather than a computer science background.

Previously we also discussed switching from the old Numerical Python to
the new NumPy. I've heard rumors that the NumPy documentation will be
declared open at the SciPy conference this year. Not having this
documentation was my biggest argument against NumPy. In my
understanding, NumPy has more functionality than Numeric. Maybe it has
better statistics support also?

--Michiel

From tiagoantao at gmail.com Wed Apr 2 20:55:20 2008
From: tiagoantao at gmail.com (Tiago Antão)
Date: Thu, 3 Apr 2008 01:55:20 +0100
Subject: [Biopython-dev] Statistics code
In-Reply-To: <806128.68558.qm@web62410.mail.re1.yahoo.com>
References: <6d941f120804021035w2eb864bdo27fbe47b4be6c923@mail.gmail.com>
    <806128.68558.qm@web62410.mail.re1.yahoo.com>
Message-ID: <6d941f120804021755x52be0194lb3135a0153813405@mail.gmail.com>

On Thu, Apr 3, 2008 at 1:13 AM, Michiel de Hoon wrote:
> > > It's not out of the question, but what exactly do you need from SciPy?
> >
> > I know that this will sound ridiculous, but, in the long run I will
> > need almost everything.
>
> I don't think we should include a dependency now because we may need it in
> the long run.

I already need it now, but just for a very small thing: The chi-square
test.
It is quite easy to reimplement. If it ends up by being just
chisquare (which I doubt, but I might be able to externalize to the
user the conventional stats part), then I think the best thing would
be just to reimplement and not to force the dependency. But I think
that I will need to use more stats stuff as I implement functionality.
The point is, IF I need to use it extensively, can I go ahead? If I
end up with just the need for a couple of functions I would not mind
implementing them myself (but more than that is too much work, and as
you say I don't think it makes sense for biopython to be also
biostats).

> > SciPy is a stable project, not an obscure library.
> While this is true, in my experience SciPy is also difficult to install. It
> may mean fewer people using your code because they don't want to go through
> the hassle of installing SciPy. Particularly users coming from a biology
> rather than a computer science background.

And poorly documented also, in my view.
But population genetics is actually 90% statistics. One doesn't do
population genetics without statistics. So, if one does pop gen then
some kind of statistical processing will have to exist somewhere.
If SciPy is difficult to install on Windows/Mac then there is an
adoption problem as you point out (I am on Linux/Ubuntu, in this setup
it is trivial to install), but I don't see a way around statistics for
anyone that wants to do population genetics (again statistics were
invented for population genetics, it is really core for us). Of
course, better solutions than SciPy might exist...

> Previously we also discussed switching from the old Numerical Python to the
> new NumPy. I've heard rumors that the NumPy documentation will be declared
> open at the SciPy conference this year. Not having this documentation was my
> biggest argument against NumPy. In my understanding, NumPy has more
> functionality than Numeric. Maybe it has better statistics support also?
It says on http://www.scipy.org/Documentation : "fee based until SciPy
2008"
I think that NumPy has only basic stuff (standard deviation, mean). I
might be wrong, but my research points to that.

To sum it up:
1. It is still not clear to me that I will need a stats library, most
probably yes.
2. I won't mind reimplementing some stats stuff in biopython as long
as it is little work, in order to avoid a dependency. I can try as
much as possible to avoid a dependency.
3. The dependency (in case it appears) would be of zero impact outside
of Bio.PopGen.Stats (maybe just setup.py to optionally allow using
scipy)
4. I need to know "the rules of the game" before I write more code (in
order to know what I can or cannot use, in case I need to use).

Tiago
PS - In the spirit of cascade software development I could do an a
priori study of the requirement, but I really don't believe the
conclusion would be reliable.

From mjldehoon at yahoo.com Thu Apr 3 05:49:45 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Thu, 3 Apr 2008 02:49:45 -0700 (PDT)
Subject: [Biopython-dev] Statistics code
In-Reply-To: <6d941f120804021755x52be0194lb3135a0153813405@mail.gmail.com>
Message-ID: <434068.89146.qm@web62413.mail.re1.yahoo.com>

> I already need it now, but just for a very small thing: The chi-square
> test. It is quite easy to reimplement. If it ends up by being just
> chisquare (which I doubt, but I might be able to externalize to the
> user the conventional stats part), then I think the best thing would
> be just to reimplement and not to force the dependency. But I think
> that I will need to use more stats stuff as I implement functionality.

One solution is just to copy and paste whatever statistics code you
need from SciPy.

> I think that NumPy has only basic stuff (standard deviation, mean). I
> might be wrong, but my research points to that.

The ideal solution would be to move the statistics stuff from SciPy to
NumPy, or to expand the statistics stuff currently in NumPy.
Since SciPy and NumPy come from the same group of developers, they may
not mind too much. Having a statistics library in NumPy would be a big
encouragement to move from Numeric to NumPy.

> 3. The dependency (in case it appears) would be of zero impact outside
> of Bio.PopGen.Stats (maybe just setup.py to optionally allow using
> scipy)

In practice, when I make the Biopython releases it's special situations
like these that cause trouble. For example, if I don't install SciPy on
Windows, I can't test Bio.PopGen.Stats there, and errors will go
unnoticed. This has happened in the previous Biopython releases.

> 4. I need to know "the rules of the game" before I write more code (in
> order to know what I can or cannot use, in case I need to use).

I would strongly encourage not adding any new dependencies to
Biopython. We have too many already; I was actually hoping that the
number of dependencies could be reduced.

--Michiel.

From sbassi at gmail.com Thu Apr 3 08:59:05 2008
From: sbassi at gmail.com (Sebastian Bassi)
Date: Thu, 3 Apr 2008 09:59:05 -0300
Subject: [Biopython-dev] Code of BLAST XML to HTML
Message-ID:

Here is the code to convert a BLAST XML file (with one or multiple
entries) into one or multiple HTML file(s).
It is working with my input files (from BLAST 2.2.17 and 18).
See and download from:

http://www.pastecode.com.ar/f1abb1fbb

I don't know whether this should be included in Biopython or not; I
am sending it to the list since somebody may find it useful anyway.
Best,
SB.
-- Molecular Biology course for programmers: http://tinyurl.com/2vv8w6 Bioinformatics news: http://www.bioinformatica.info Free Python tutorial: http://tinyurl.com/2az5d5 From bugzilla-daemon at portal.open-bio.org Thu Apr 3 19:25:40 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 3 Apr 2008 19:25:40 -0400 Subject: [Biopython-dev] [Bug 2480] New: Local BLAST fails: Spaces in Windows file-path values Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2480 Summary: Local BLAST fails: Spaces in Windows file-path values Product: Biopython Version: 1.45 Platform: PC OS/Version: Windows XP Status: NEW Severity: blocker Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: drpatnaik at yahoo.com I am a new user trying Python on a Windows XP SP2 machine on which I do not have admin rights. Consequently, Python itself as well as all of the files/executables I work with have file-paths that contain spaces (e.g., python.exe is at C:\Documents and settings\username...). When I try to perform a local BLAST using code mentioned in one of the Biopython tutorials, the BLAST fails. I use the following code to capture the error:
my_blast_db = r"C:/Documents and Settings/patnaik/My Documents/blast/bin/mine"
my_blast_file = r"C:/Documents and Settings/patnaik/My Documents/blast/bin/hairpin"
my_blast_exe = r'C:\Documents and Settings\patnaik\My Documents\blast\bin\blastall.exe'
from Bio.Blast import NCBIStandalone
result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe, "blastn", my_blast_db, my_blast_file)
error_results = error_handle.read()
save_file = open(r"C:/Documents and Settings/patnaik/My Documents/blast/bin/my_blast_error", "w")
save_file.write(error_results)
save_file.close()
The error reported is: 'C:\Documents' is not recognized as an internal or external command, operable program or batch file.
There thus seems to be some issue because of the spaces in the file-paths. Can this be resolved by appropriately replacing 'os.popen3' with 'subprocess.call' in Bio/Blast/NCBIStandalone.py? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Apr 3 20:00:56 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 3 Apr 2008 20:00:56 -0400 Subject: [Biopython-dev] [Bug 2481] New: bitscore not parsed. Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2481 Summary: bitscore not parsed. Product: Biopython Version: 1.45 Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: sbassi at gmail.com
>>> from Bio.Blast import NCBIXML
>>> fr=NCBIXML.parse(open('/media/disk/GENES/INTA9/BLAST/seqspMOS.xml')).next()
>>> fr.descriptions[0].title
u'gnl|BL_ORD_ID|0 pMOS (vector para mt INTA)'
>>> fr.descriptions[0].bits
Traceback (most recent call last):
  File "", line 1, in
AttributeError: Description instance has no attribute 'bits'
To fix this, two files must be modified: NCBIXML.py and Record.py
In NCBIXML.py, 2 changes:
Line 94: "method = self._secure_name('_end_' + name.replace("-","_"))" (The name in the XML file is "bit-score", but a method name cannot contain "-", so the handler can only be named "_end_Hsp_bit_score". Changing "-" to "_" resolves the issue and should not disturb the rest. The same change could, and probably should, also be applied at line 63 for StartElement.)
Lines 409-410 uncommented
In Record.py:
Add "self.bits = None" in line 68
This bug was reported to me by Yoan Jacquemin when testing my code to convert BLAST XML output to HTML.
After applying these modifications, it works:
>>> from Bio.Blast import NCBIXML
>>> f_in='/mnt/hda2/bio/3vsT.xml'
>>> fr=NCBIXML.parse(open(f_in)).next()
>>> fr.descriptions[0].bits
32.210500000000003
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Apr 3 20:27:16 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 3 Apr 2008 20:27:16 -0400 Subject: [Biopython-dev] [Bug 2480] Local BLAST fails: Spaces in Windows file-path values In-Reply-To: Message-ID: <200804040027.m340RGp1003920@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2480 ------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp 2008-04-03 20:27 EST ------- Could you paste here the exact error shown by Python? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Apr 4 02:15:54 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Apr 2008 02:15:54 -0400 Subject: [Biopython-dev] [Bug 2480] Local BLAST fails: Spaces in Windows file-path values In-Reply-To: Message-ID: <200804040615.m346FsWY020451@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2480 ------- Comment #2 from drpatnaik at yahoo.com 2008-04-04 02:15 EST ------- (In reply to comment #1) I do not see any error/warning notice in the command console.
The reason that I suspect something is wrong is that the Blast results output file I get using the code below doesn't have any content (empty file): [snip -- code same as that in previous post] output_results = result_handle.read() save_file = open(r"C:/Documents and Settings/patnaik/My Documents/blast/bin/my_blast_output", "w") save_file.write(output_results) save_file.close() To capture any error, I used the code I mention in my first post. And that is where I find "'C:\Documents' is not recognized as an internal or external command, operable program or batch file." -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Apr 4 04:37:45 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Apr 2008 04:37:45 -0400 Subject: [Biopython-dev] [Bug 2480] Local BLAST fails: Spaces in Windows file-path values In-Reply-To: Message-ID: <200804040837.m348bjXJ027534@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2480 ------- Comment #3 from mdehoon at ims.u-tokyo.ac.jp 2008-04-04 04:37 EST ------- Please try my_blast_exe = r'"C:\Documents and Settings\patnaik\My Documents\blast\bin\blastall.exe"' (note the extra " in the command). If it works, an easy solution would be to add the " to Bio/Blast/NCBIStandalone.py before calling os.popen3. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
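Editorial sketch: the quoting workaround suggested in comment #3 could look roughly like the following. This is only an illustration written for current Python 3, not the code in Bio/Blast/NCBIStandalone.py; the helper names quote_if_spaces and build_command are hypothetical, and the "-p blastn" option is assumed to be the usual blastall program flag.

```python
def quote_if_spaces(arg):
    """Wrap arg in double quotes if it contains a space and is not already quoted."""
    already_quoted = len(arg) >= 2 and arg.startswith('"') and arg.endswith('"')
    if " " in arg and not already_quoted:
        return '"%s"' % arg
    return arg


def build_command(executable, *args):
    """Join an executable and its arguments into one command-line string.

    Hypothetical helper: each part is quoted only when it contains a space.
    """
    return " ".join(quote_if_spaces(part) for part in (executable,) + args)


# Path taken from the bug report; the flag is an assumed blastall option.
cmd = build_command(
    r"C:\Documents and Settings\patnaik\My Documents\blast\bin\blastall.exe",
    "-p", "blastn",
)
print(cmd)
```

One design point worth keeping in mind: a check such as os.path.exists would need the unquoted path, because a string that contains literal quote characters names a different (nonexistent) file. Quoting is probably best applied only at the moment the command line is assembled.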
From bugzilla-daemon at portal.open-bio.org Fri Apr 4 04:46:13 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Apr 2008 04:46:13 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804040846.m348kDlB027900@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #14 from ericgibert at yahoo.fr 2008-04-04 04:46 EST ------- Created an attachment (id=892) --> (http://bugzilla.open-bio.org/attachment.cgi?id=892&action=view) Recoding of the Taxonomy parser using SAX All attributes found in the XML document are now parsed and are stored as properties. Please look at the header's explanation or the tests at the end of the code for examples. Please let me know if this is ok. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Apr 4 09:14:57 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Apr 2008 09:14:57 -0400 Subject: [Biopython-dev] [Bug 2481] bitscore not parsed. In-Reply-To: Message-ID: <200804041314.m34DEv6D009448@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2481 ------- Comment #1 from sbassi at gmail.com 2008-04-04 09:14 EST ------- Created an attachment (id=893) --> (http://bugzilla.open-bio.org/attachment.cgi?id=893&action=view) Corrected NCBIXML (for bit parsing) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Apr 4 09:17:47 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Apr 2008 09:17:47 -0400 Subject: [Biopython-dev] [Bug 2481] bitscore not parsed. In-Reply-To: Message-ID: <200804041317.m34DHlEx009715@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2481 ------- Comment #2 from sbassi at gmail.com 2008-04-04 09:17 EST ------- Created an attachment (id=894) --> (http://bugzilla.open-bio.org/attachment.cgi?id=894&action=view) Record modified for "bit" parsing (bit from XML blast) Both files must be applied together. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Apr 4 16:01:01 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Apr 2008 16:01:01 -0400 Subject: [Biopython-dev] [Bug 2480] Local BLAST fails: Spaces in Windows file-path values In-Reply-To: Message-ID: <200804042001.m34K11LP030328@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2480 ------- Comment #4 from drpatnaik at yahoo.com 2008-04-04 16:01 EST ------- (In reply to comment #3) Python throws this error when I do so: File "C:\Documents and Settings\patnaik\My Documents\Python252\lib\site-packages\Bio\Blast\NCBIStandalone.py", line 1650, in blastall raise ValueError, "blastall does not exist at %s" % blastcmd ValueError: blastall does not exist at "C:\Documents and Settings\patnaik\My Documents\blast\bin\blastall.exe" I get a similar error using: my_blast_exe = r'"C:/Documents and Settings/patnaik/My Documents/blast/bin/blastall.exe"' blastall.exe _is_ at C:/Documents and Settings/patnaik/My Documents/blast/bin/blastall.exe -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this 
mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Apr 4 23:10:58 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Apr 2008 23:10:58 -0400 Subject: [Biopython-dev] [Bug 2481] bitscore not parsed. In-Reply-To: Message-ID: <200804050310.m353Awmd016642@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2481 ------- Comment #3 from mdehoon at ims.u-tokyo.ac.jp 2008-04-04 23:10 EST ------- Isn't the bit score in the hsps? Using xbt001.xml from the Biopython test suite, I find >>> fr.alignments[0].hsps[0].score 469.0 >>> fr.alignments[0].hsps[0].bits 185.267 Otherwise, which line in xbt001.xml is not being parsed? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Apr 4 23:14:28 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Apr 2008 23:14:28 -0400 Subject: [Biopython-dev] [Bug 2480] Local BLAST fails: Spaces in Windows file-path values In-Reply-To: Message-ID: <200804050314.m353ESag016750@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2480 mdehoon at ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- Severity|blocker |normal ------- Comment #5 from mdehoon at ims.u-tokyo.ac.jp 2008-04-04 23:14 EST ------- In the line shown in the Python error message, it is trying os.path.exists(your_path) with your_path = "C:/Documents and Settings/patnaik/My Documents/blast/bin/blastall.exe" and doesn't find your_path if it includes the extra ". Can you play a bit with >>> import os >>> os.path.exists(your_path) >>> os.system(your_path) to see which variation (if any) works for both? 
I mean with ", without ", maybe trying '? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Apr 5 00:01:23 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 5 Apr 2008 00:01:23 -0400 Subject: [Biopython-dev] [Bug 2480] Local BLAST fails: Spaces in Windows file-path values In-Reply-To: Message-ID: <200804050401.m3541NH5018822@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2480 ------- Comment #6 from mdehoon at ims.u-tokyo.ac.jp 2008-04-05 00:01 EST ------- After some googling, it looks like your original suggestion to use subprocess.call is probably the best solution. However, it requires Python >= 2.4, whereas currently we require Python >= 2.3. Does anybody have an objection to requiring Python >= 2.4? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Apr 5 01:24:35 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 5 Apr 2008 01:24:35 -0400 Subject: [Biopython-dev] [Bug 2481] bitscore not parsed. In-Reply-To: Message-ID: <200804050524.m355OZVU022917@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2481 ------- Comment #4 from sbassi at gmail.com 2008-04-05 01:24 EST ------- (In reply to comment #3) > Isn't the bit score in the hsps? Yes, it is. But this is not the only place it should be; keep on reading. > Otherwise, which line in xbt001.xml is not being parsed? This is not the problem (no line of the XML is being skipped by the parser). The problem is that "bit" is not in "description" as it should be. Why should it be in description?
Take this example: >>> fr.alignments[0].hsps[0].expect 8.32193 But you also have "expect" in "descriptions": >>> fr.descriptions[0].e 8.32193 Another example: >>> fr.alignments[0].hsps[0].score 16.0 and >>> fr.descriptions[0].score 16.0 "descriptions" corresponds to the description table in BLAST HTML documents, all values from table should be there. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Apr 5 06:03:30 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 5 Apr 2008 06:03:30 -0400 Subject: [Biopython-dev] [Bug 2481] bitscore not parsed. In-Reply-To: Message-ID: <200804051003.m35A3Uwu002667@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2481 mdehoon at ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |biopython- | |bugzilla at maubp.freeserve.co. | |uk ------- Comment #5 from mdehoon at ims.u-tokyo.ac.jp 2008-04-05 06:03 EST ------- OK, I see. Three more comments though: 1) In NCBIXML.py: method = self._secure_name('_end_' + name.replace("-","_")) I don't see why name.replace("-","_") is needed. Isn't that the purpose of self._secure_name in the first place? 2) In Record.py: Please add "bits" with a description to the docstring (lines 58-65). 3) About your change: > Lines 409-410 uncommented I wonder why these lines were commented out in the first place. It was done in revision 1.8 of NCBIXML.py, but I didn't see any explanation as to why those lines were commented out. Peter, do you know? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Sat Apr 5 08:33:09 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 5 Apr 2008 08:33:09 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804051233.m35CX95L008244@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #15 from ericgibert at yahoo.fr 2008-04-05 08:33 EST ------- Created an attachment (id=895) --> (http://bugzilla.open-bio.org/attachment.cgi?id=895&action=view) Parser for Taxonomic Data from NCBI for Bio.Entrez All right, following Michiel email, I wrote this third version, based on the existing parser in Bio.Entrez CVS. Integration with Bio.Entrez.__init__.py is straight forward: class DataHandler(ContentHandler): from Bio.Entrez import EInfo, ESearch, ESummary, EPost, ETaxonomy # eric gibert _NameToModule = {"eInfoResult": EInfo, "eSearchResult": ESearch, "eSummaryResult": ESummary, "ePostResult": EPost, "TaxaSet": ETaxonomy, # eric gibert } That's it. A Unit test script will be like: import Bio.Entrez def print_record(taxrec): print taxrec["Rank"], taxrec["ScientificName"], "has the TaxId ", taxrec["TaxId"], "and its parent is", taxrec["ParentTaxId"] print taxrec["OtherNames"] print taxrec["Division"], "with Genetic Code:", taxrec["GeneticCode"], "and Mitochondrial Genetic Code:", taxrec["MitoGeneticCode"] print taxrec["Lineage"] print taxrec["LineageEx"] print "Record Created on %s, updated on %s and published on %s." %(taxrec["CreateDate"],taxrec["UpdateDate"],taxrec["PubDate"]) # simple test: get the dog... 
handle = Bio.Entrez.efetch(db = "taxonomy", id = 9615, retmode = "XML") taxonomic_record = Bio.Entrez.read(handle) print_record(taxonomic_record) # get multiple answers search_handle = Bio.Entrez.esearch(db = "taxonomy", term = "orthetrum c*", retmode = "XML") IdList = Bio.Entrez.read(search_handle)["IdList"] for id in IdList: handle = Bio.Entrez.efetch(db = "taxonomy", id = id, retmode = "XML") orthetrum = Bio.Entrez.read(handle) print_record(orthetrum) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Apr 5 16:44:34 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 5 Apr 2008 16:44:34 -0400 Subject: [Biopython-dev] [Bug 2480] Local BLAST fails: Spaces in Windows file-path values In-Reply-To: Message-ID: <200804052044.m35KiYZm030885@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2480 ------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-05 16:44 EST ------- I use Python 2.3 on Windows because I have the MSVC 6.0 compiler all setup for building python extensions (later versions of Python switched to a later MS compiler). I would object to dropping support for Python 2.3 over what seems to be a minor issue. As I do have Biopython and local blast up and running on this machine, I will be able to try and investigate this issue. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Sat Apr 5 16:46:42 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 5 Apr 2008 16:46:42 -0400 Subject: [Biopython-dev] [Bug 2480] Local BLAST fails: Spaces in Windows file-path values In-Reply-To: Message-ID: <200804052046.m35KkgPq030982@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2480 ------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-05 16:46 EST ------- P.S. There were similar issues in the clustalw wrapper, where I added win32 only code to add quotes to the command line when there were spaces in the file name. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sun Apr 6 07:02:38 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 6 Apr 2008 07:02:38 -0400 Subject: [Biopython-dev] [Bug 2468] Tutorial needs a fix: Bio.WWW.NCBI In-Reply-To: Message-ID: <200804061102.m36B2cYw027716@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2468 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-06 07:02 EST ------- Once Eric and Michiel have settled on a Taxonomy XML parser for Bio.Entrez (see Bug 2475), then this section of the tutorial could be updated to use this and the new XML search results parser Michiel has already checked into CVS. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Sun Apr 6 21:54:13 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 6 Apr 2008 21:54:13 -0400 Subject: [Biopython-dev] [Bug 2480] Local BLAST fails: Spaces in Windows file-path values In-Reply-To: Message-ID: <200804070154.m371sDPG004564@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2480 ------- Comment #9 from drpatnaik at yahoo.com 2008-04-06 21:54 EST ------- (In reply to comment #5) Using code:
import os
your_path = r'"C:/Documents and Settings/patnaik/My Documents/blast/bin/blastall.exe"'
os.path.exists(your_path)
os.system(your_path)
Console output indicates [v] works (outputs: "blastall 2.2.18 arguments ..." ) and [x] doesn't work (outputs some variation of "... is not recognized as ...")
[v] r'"C:/Documents and Settings/patnaik/My Documents/blast/bin/blastall.exe"'
[x] r"'C:/Documents and Settings/patnaik/My Documents/blast/bin/blastall.exe'"
[x] r"C:/Documents and Settings/patnaik/My Documents/blast/bin/blastall.exe"
[x] r'C:/Documents and Settings/patnaik/My Documents/blast/bin/blastall.exe'
[v] '"C:/Documents and Settings/patnaik/My Documents/blast/bin/blastall.exe"'
[x] "'C:/Documents and Settings/patnaik/My Documents/blast/bin/blastall.exe'"
[x] "C:/Documents and Settings/patnaik/My Documents/blast/bin/blastall.exe"
[x] 'C:/Documents and Settings/patnaik/My Documents/blast/bin/blastall.exe'
[v] '"C:\\Documents and Settings\\patnaik\\My Documents\\blast\\bin\\blastall.exe"'
[x] "'C:\\Documents and Settings\\patnaik\\My Documents\\blast\\bin\\blastall.exe'"
[x] "C:\\Documents and Settings\\patnaik\\My Documents\\blast\\bin\\blastall.exe"
[x] 'C:\\Documents and Settings\\patnaik\\My Documents\\blast\\bin\\blastall.exe'
However, when I run the code (original post) with any of the [v] working values, I get the "blastall does not exist" error.
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Apr 7 04:42:17 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 7 Apr 2008 04:42:17 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804070842.m378gHae024590@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ericgibert at yahoo.fr changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #891 is|0 |1 obsolete| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Apr 7 04:42:37 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 7 Apr 2008 04:42:37 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804070842.m378gb0U024630@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ericgibert at yahoo.fr changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #892 is|0 |1 obsolete| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon Apr 7 04:45:53 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 7 Apr 2008 04:45:53 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804070845.m378jrSf024885@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #16 from ericgibert at yahoo.fr 2008-04-07 04:45 EST ------- Created an attachment (id=896) --> (http://bugzilla.open-bio.org/attachment.cgi?id=896&action=view) Parser for many Taxons returned from NCBI Taxonomy db This version allows the parsing of 1 to many taxons. The 'record' is a list of dictionaries. Application: # get multiple answers print "Multiple records search:" search_handle = Bio.Entrez.esearch(db = "taxonomy", term = "orthetrum c*", retmode = "XML") IdList = Bio.Entrez.read(search_handle)["IdList"] handle = Bio.Entrez.efetch(db = "taxonomy", id = IdList, retmode = "XML") orthetrum_list = Bio.Entrez.read(handle) print len(orthetrum_list), "Orthetrum match your search:" for orthetrum in orthetrum_list: print orthetrum["Rank"], orthetrum["ScientificName"], "has the TaxId", orthetrum["TaxId"], "and its parent is", orthetrum["ParentTaxId"] -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
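Editorial sketch: because the parsed 'record' described in comment #16 is a list of dictionaries, code that consumes it can be exercised without a network call. The following is a minimal illustration in current Python 3; describe_taxa is a hypothetical helper (not part of Bio.Entrez) and the sample dictionaries hold made-up placeholder values, not NCBI data. Only the keys Rank, ScientificName, TaxId and ParentTaxId are taken from the example in that comment.

```python
def describe_taxa(records):
    """Return one summary line per taxon dictionary in records."""
    lines = []
    for taxon in records:
        lines.append(
            "%s %s has the TaxId %s and its parent is %s"
            % (
                taxon["Rank"],
                taxon["ScientificName"],
                taxon["TaxId"],
                taxon["ParentTaxId"],
            )
        )
    return lines


# Placeholder records standing in for a parsed multi-taxon result.
placeholder_records = [
    {"Rank": "species", "ScientificName": "Placeholder one",
     "TaxId": "ID-A", "ParentTaxId": "ID-P"},
    {"Rank": "species", "ScientificName": "Placeholder two",
     "TaxId": "ID-B", "ParentTaxId": "ID-P"},
]

for line in describe_taxa(placeholder_records):
    print(line)
```

Keeping the formatting in a function that takes a plain list makes it possible to check the output against fixed input, independently of whatever the parser returns for a live query.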
From bugzilla-daemon at portal.open-bio.org Mon Apr 7 06:44:09 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 7 Apr 2008 06:44:09 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804071044.m37Ai9M5031608@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #17 from mdehoon at ims.u-tokyo.ac.jp 2008-04-07 06:44 EST ------- Patch (2008-04-07 04:45 EST) looks fine to me. I have uploaded it to CVS. I renamed it Taxon.py though for consistency with the existing parsers (name is the same as the corresponding DTD file). I have also modified Bio/Entrez/__init__.py accordingly. The patch (2008-04-05 08:33 EST) is obsolete, right? Thanks! -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Apr 7 08:59:31 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 7 Apr 2008 08:59:31 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804071259.m37CxVUk005741@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #18 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-07 08:59 EST ------- Michiel, It seems that Bio.Entrez.efetch can return XML files containing one record or many records, e.g. 
taxon_id_list = ['488050', '447868', '333459', '126256'] taxon_handle = Bio.Entrez.efetch(db="taxonomy", id=taxon_id_list, retmode="XML") #This handle contains four Taxon entries taxon_handle = Bio.Entrez.efetch(db="taxonomy", id='488050', retmode="XML") #This handle contains one Taxon entry Bio.Entrez.read(taxon_handle) will return a list of dictionaries (one for each taxon ID supplied). We've established a convention of sorts about "read()" versus "parse()", the first returns a single record and the second a record iterator. If a taxon single entry (currently held as a dictionary) is regarded as a record, then should Bio.Entrez.read() be called Bio.Entrez.parse() instead? I am also wondering if we should create simple record classes for the different XML data types (instead of using dictionaries). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Apr 7 09:46:52 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 7 Apr 2008 09:46:52 -0400 Subject: [Biopython-dev] [Bug 2481] bitscore not parsed. In-Reply-To: Message-ID: <200804071346.m37DkqVP009002@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2481 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-07 09:46 EST ------- I've checked this in to CVS (taking note of Michiel's comments), and confirmed the NCBI XML unit test passes. Sebastian - could you submit your suggested changes as patches next time please? It would have made life a little easier, trying to work out what exactly you wanted to change (which in the end, was fairly small). 
> 1) In NCBIXML.py: > method = self._secure_name('_end_' + name.replace("-","_")) > I don't see why name.replace("-","_") is needed. Isn't that the purpose of > self._secure_name in the first place? I agree. > 2) In Record.py: > Please add "bits" with a description to the docstring (lines 58-65). I've done this in CVS. > 3) About your change: > > Lines 409-410 uncommented > I wonder why these lines were commented out in the first > place. It was done in revision 1.8 of NCBIXML.py, but I > didn't see any explanation as to why those lines were > commented out. Peter, do you know? I made that old check-in, but it was some time ago and I don't recall the details. The self._descr.bits variable was never set up, causing an exception, and I guess at the time commenting this bit out seemed like a good solution. With Record.py fixed, the parser lines can now be uncommented. Perhaps the original code used to work on early NCBI XML files? Marking as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From p.j.a.cock at googlemail.com Mon Apr 7 11:26:28 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 7 Apr 2008 16:26:28 +0100 Subject: [Biopython-dev] Statistics code In-Reply-To: <434068.89146.qm@web62413.mail.re1.yahoo.com> References: <6d941f120804021755x52be0194lb3135a0153813405@mail.gmail.com> <434068.89146.qm@web62413.mail.re1.yahoo.com> Message-ID: <320fb6e00804070826v7d4ff977uf4e7cdff68e9fc33@mail.gmail.com> On Thu, Apr 3, 2008 at 10:49 AM, Michiel de Hoon wrote: > > But I think that I will need to use more stats stuff as I implement functionality. > > One solution is just to copy and paste whatever statistics code you need from > SciPy. That does seem to be an option based on their licence and Biopython's. > > I think that NumPy has only basic stuff (standard deviation, mean).
I > > might be wrong, but my research points to that. According to http://www.scipy.org/Numpy_Functions_by_Category they have array statistics: average(), mean(), bincount(), histogram(), corrcoef(), cov(), max(), min(), ptp(), median(), std(), var() plus a selection of random number and distribution functions. > The ideal solution would be to move the statistics stuff from SciPy to NumPy, > or to expand the statistics stuff currently in NumPy. Since SciPy and NumPy > come from the same group of developers, they may not mind too much. Is that something you want to raise with them, Michiel? > Having a statistics library in NumPy would be a big encouragement to move from > Numeric to NumPy. Speaking of which, is that still stuck on the 64bit issue? Bug 2251 - NumPy support for BioPython http://bugzilla.open-bio.org/show_bug.cgi?id=2251 Peter From bugzilla-daemon at portal.open-bio.org Mon Apr 7 21:41:24 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 7 Apr 2008 21:41:24 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804080141.m381fOL6013094@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ericgibert at yahoo.fr changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #895 is|0 |1 obsolete| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon Apr 7 21:52:58 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 7 Apr 2008 21:52:58 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804080152.m381qwwN014055@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #19 from ericgibert at yahoo.fr 2008-04-07 21:52 EST ------- I marked the attachment #895 (2008-04-05 08:33 EST) as obsolete. Waiting for Michiel's reply to Peter's reply for updating the current code. Or maybe it is only __init__.py which needs modification (as I did not see "parse()" defined yet). Eric -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Apr 8 10:17:54 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 8 Apr 2008 10:17:54 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804081417.m38EHs0q007147@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #20 from mdehoon at ims.u-tokyo.ac.jp 2008-04-08 10:17 EST ------- > Bio.Entrez.read(taxon_handle) will return a list of dictionaries (one for each > taxon ID supplied). We've established a convention of sorts about "read()" > versus "parse()", the first returns a single record and the second a record > iterator. > If a taxon single entry (currently held as a dictionary) is regarded as a > record, then should Bio.Entrez.read() be called Bio.Entrez.parse() instead? I thought about that also, but I think that having Bio.Entrez.read() only is better. 
The reason is that some XML files returned by NCBI can be regarded as a list
of records (possibly a list of only one record), but others can never be
regarded as a list of records. That means we could have a Bio.Entrez.parse()
in addition to Bio.Entrez.read(), but not instead of Bio.Entrez.read().

Now, in practical situations that could get ugly, not to say counterintuitive.
For example, take Bio.Entrez.einfo. Without an argument, Bio.Entrez.einfo()
returns a list of NCBI databases. Bio.Entrez.einfo(db="pubmed") then returns a
dictionary with information about the pubmed database. (This double usage is
not my choice; this is how NCBI has it set up). If we apply the parse/read rule
strictly, we'd get the following:

>>> from Bio import Entrez
>>> handle = Entrez.einfo()
>>> records = Entrez.parse(handle)
>>> for record in records:
...     print record
pubmed
protein
nucleotide
nuccore
....
taxonomy
toolkit
unigene
unists
>>>

To me, this seems to be a bit too much, since this is actually just a list.

Now if we want information about pubmed, we'd use

>>> handle = Entrez.einfo(db="pubmed")
>>> record = Entrez.read(handle)
# Now we have to use read() instead of parse()

And here is the really tricky part: Is the following possible?

>>> handle = Entrez.einfo(db=["pubmed","taxonomy"])

For example, Entrez.efetch allows a list of Ids; a user may guess that
Entrez.einfo can handle a list of dbs. If it can, should he then call parse()
instead of read() (in the example above, with db="pubmed")?

Unlike for example Bio.Blast.NCBIXML, where we always get a list of records,
for Bio.Entrez some XML files are more like a single record, whereas others are
more like a list of records, and it may not be obvious to the user which is
which. If you make a mistake, you have to repeat your query to NCBI, because
the handle is already partially read.

If we define the read/parse rule as "read returns an object, parse returns an
iterator", then the existing Bio.Entrez.read() is still fine.
> I am also wondering if we should create simple record classes for
> the different XML data types (instead of using dictionaries).

This can be useful if the record is an empty object deriving from a dict. It
allows us to add a docstring to each record, while still preserving the
functionality of each record as a dictionary. I don't see a good usage of
additional functionality right now. Essentially, the XML file represents a
dictionary (or a list of dictionaries); the Python object we return should
correspond to this.

One alternative is to have a record class with fields corresponding to the keys
in the dictionary. So

>>> record.abc
>>> record.ddd
>>> record.klmnop

instead of

>>> record["abc"]
>>> record["ddd"]
>>> record["klmnop"]

But I like the second form better, because it allows us to call keys() on the
record and get the names of all fields.

From mjldehoon at yahoo.com Tue Apr 8 10:43:29 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 8 Apr 2008 07:43:29 -0700 (PDT)
Subject: [Biopython-dev] Statistics code
In-Reply-To: <320fb6e00804070826v7d4ff977uf4e7cdff68e9fc33@mail.gmail.com>
Message-ID: <474123.57528.qm@web62410.mail.re1.yahoo.com>

Peter Cock wrote:
> > Having a statistics library in NumPy would be a big encouragement to move from
> > Numeric to NumPy.
>
> Speaking of which, is that still stuck on the 64bit issue?
>
> Bug 2251 - NumPy support for BioPython
> http://bugzilla.open-bio.org/show_bug.cgi?id=2251

I was not driving the issue of NumPy support because in its current state,
NumPy seems to have as many advantages as disadvantages compared to Numeric.
In addition the NumPy documentation is not free while the Numeric documentation
is, so currently in my opinion the balance is in favor of Numeric.
The situation changes when the NumPy documentation becomes freely available
this summer (at the SciPy conference). Then the scale might tip in favor of
NumPy, so we should revisit the issue then.

--Michiel.

From bugzilla-daemon at portal.open-bio.org Tue Apr 8 11:04:54 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 8 Apr 2008 11:04:54 -0400
Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage
In-Reply-To:
Message-ID: <200804081504.m38F4sph009833@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2475

------- Comment #21 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-08 11:04 EST -------
Regarding comment 20, you're right to say the one record/many records issue is
cloudy. Let's stick with using Bio.Entrez.read() then.

Regarding returning objects or dictionaries, my feeling was that attributes and
doc strings could be used to help explain how to interpret the results.
However, if as you say the python dictionary is a natural representation of the
XML data, then that should suffice - provided the NCBI have been clear with
their field naming conventions.

If we are all happy with this, then we should update the Bio.Entrez chapter of
the tutorial. I would remove some of the longer cut-n-paste sections of XML
output, as it doesn't look very good in the PDF output.

Getting back to the NCBI taxon issue in BioSQL, did Hilmar's reply on the
BioSQL mailing list clarify the use of the left/right fields?
http://lists.open-bio.org/pipermail/biosql-l/2008-April/001233.html
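
As a rough illustration of the "empty object deriving from a dict" idea raised
in comments #20 and #21, the following is only a sketch. The class name
TaxonRecord and the two keys are hypothetical and are not part of Bio.Entrez;
the code is written for a current Python 3 rather than the Python 2 used
elsewhere in this thread.

```python
class TaxonRecord(dict):
    """Hypothetical record for one taxonomy entry.

    Behaves exactly like a plain dict. Subclassing only provides a
    place to document the keys a caller can expect, for example
    "TaxId" and "ScientificName".
    """


# Build a record the same way a dict would be built.
record = TaxonRecord(TaxId="9606", ScientificName="Homo sapiens")

# Dictionary access still works, and keys() lists the field names.
print(sorted(record.keys()))
print(record["ScientificName"])

# The docstring is available through the usual help()/__doc__ route.
print(TaxonRecord.__doc__.splitlines()[0])
```

The point of the design is that nothing is lost relative to a dictionary, so
existing code that indexes the record by key would keep working.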
From bugzilla-daemon at portal.open-bio.org Tue Apr 8 19:17:59 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 8 Apr 2008 19:17:59 -0400
Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage
In-Reply-To:
Message-ID: <200804082317.m38NHxZf002619@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2475

------- Comment #22 from ericgibert at yahoo.fr 2008-04-08 19:17 EST -------
Yes, he replied with the following link:
http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html

This page provides an explanation and an algorithm.

The point is that this calculation has to be done on the whole table, not on
addition of a new taxon. Thus it will be very penalizing if, each time we add a
taxon, we force the recalculation. Better to let the batch job do so and
default the values to NULL (or -1 if the columns are NOT NULL; I did not
check).

What do you think?

From bugzilla-daemon at portal.open-bio.org Tue Apr 8 19:19:44 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 8 Apr 2008 19:19:44 -0400
Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage
In-Reply-To:
Message-ID: <200804082319.m38NJim6002688@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2475

ericgibert at yahoo.fr changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ericgibert at yahoo.fr

------- Comment #23 from ericgibert at yahoo.fr 2008-04-08 19:19 EST -------
Regarding note #20, I think that always returning a list is better.
We need to test whether the record was found anyway, so we might as well do:

if len(return_val) == 0:

From bugzilla-daemon at portal.open-bio.org Tue Apr 8 19:25:11 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 8 Apr 2008 19:25:11 -0400
Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage
In-Reply-To:
Message-ID: <200804082325.m38NPB4d002893@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2475

------- Comment #24 from ericgibert at yahoo.fr 2008-04-08 19:25 EST -------
if len(return_val) == 0:
    print "No record found"
elif len(return_val) == 1:
    print "ok, proceed with", return_val[0]
else:
    print "Ambiguity: please look at the different matches"
    for tax in return_val:
        ..... whatever print/select you need

From bugzilla-daemon at portal.open-bio.org Tue Apr 8 19:31:10 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 8 Apr 2008 19:31:10 -0400
Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage
In-Reply-To:
Message-ID: <200804082331.m38NVAMe003042@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2475

------- Comment #25 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-08 19:31 EST -------
Hi Eric,

On the BioSQL taxon issue, recalculating the whole taxon table left/right
values each time we add a new entry doesn't seem very sensible.
Could you try my patch (attachment 883 on this bug) which only records a single entry for the new NCBI taxon ID (with null left/right values)? I should have split the Bio.Entrez issue into a separate bug a while ago - but yes, as things stand it is up to the user to check if they get one or more records, depending on what they asked for. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Apr 9 09:34:08 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 9 Apr 2008 09:34:08 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804091334.m39DY84A014676@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #26 from mdehoon at ims.u-tokyo.ac.jp 2008-04-09 09:34 EST ------- > If we are all happy with this, then we should update the Bio.Entrez chapter of > the tutorial. OK I can do that. > I would remove some of the longer cut-n-paste sections of XML > output, as it doesn't look very good in the PDF output. Agreed. The raw XML output is now in the tutorial only because it was the best we could do at the time, as we didn't have any parsers in Bio.Entrez. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
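
Comment #25 leaves it to the caller to check how many records came back. A
minimal sketch of such a check is given here; the helper name single_record is
made up for illustration and does not exist in Bio.Entrez, and the input is
assumed to be the list of dictionaries described earlier in this thread.
Written for a current Python 3.

```python
def single_record(records):
    """Return the only item in `records`, or raise ValueError.

    `records` is assumed to be a list (for example a list of
    dictionaries, one per taxon ID that was requested).
    """
    if len(records) == 0:
        raise ValueError("No record found")
    if len(records) > 1:
        raise ValueError("Ambiguous: %d records found" % len(records))
    return records[0]


# Usage with stand-in data instead of a live NCBI query.
taxon = single_record([{"TaxId": "488050"}])
print(taxon["TaxId"])
```

Raising an exception instead of printing keeps the decision about what to do
with zero or several matches in the calling script.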
From bugzilla-daemon at portal.open-bio.org Thu Apr 10 08:07:32 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 10 Apr 2008 08:07:32 -0400
Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage
In-Reply-To:
Message-ID: <200804101207.m3AC7WL8017496@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2475

------- Comment #27 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-10 08:07 EST -------
Regarding inserting the lineage into the taxon/taxon_tables, the tree structure
is stored in two ways. Firstly, using the taxon.parent field, and secondly
using the left/right fields.

Over on the BioSQL mailing list we've established that updating the left/right
values by recalculating them takes about 10 minutes - doing this from Biopython
when adding a new sequence does not seem ideal.

We could add missing taxonomy nodes to the tables (based on the Bio.Entrez
data), and record the tree structure using the taxon.parent field, but leave
the left/right values as NULL. This should be enough for Biopython to recover
the full lineage when retrieving a sequence - we need to check
BioSQL.BioSeq._retrieve_taxon() is happy.

If the user wants the left/right values, they would have to (re)run the BioSQL
load_ncbi_taxonomy.pl script (which is slow).
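
To make the point of comment #27 concrete, here is a small sketch of how a
full lineage can be recovered from parent links alone, with no left/right
values. It does not touch a database: the in-memory table, the function name
lineage and the use of None as the root's parent are all assumptions made for
the example, and intermediate ranks are left out. Written for a current
Python 3.

```python
# Hypothetical stand-in for the taxon table:
# taxon ID -> (parent taxon ID, name). The root has no parent here.
taxa = {
    1: (None, "root"),
    131567: (1, "cellular organisms"),
    2759: (131567, "Eukaryota"),
    9606: (2759, "Homo sapiens"),  # intermediate ranks omitted
}


def lineage(taxon_id, table):
    """Return the names from the root down to `taxon_id`."""
    names = []
    # Walk up the tree one parent link at a time.
    while taxon_id is not None:
        parent_id, name = table[taxon_id]
        names.append(name)
        taxon_id = parent_id
    # The walk collected leaf-to-root, so flip it.
    names.reverse()
    return names


print(lineage(9606, taxa))
```

Each step is one lookup by ID, so the cost grows with the depth of the tree
and not with the size of the table, which is why the left/right values are not
needed for this particular task.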
From bugzilla-daemon at portal.open-bio.org Sat Apr 12 15:25:42 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 12 Apr 2008 15:25:42 -0400
Subject: [Biopython-dev] [Bug 2488] New: Adding XML parsers to Bio.Entrez
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2488

           Summary: Adding XML parsers to Bio.Entrez
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk

This is a placeholder bug for adding more XML parsers to Bio.Entrez to cope
with all the NCBI formats.

See also Bug 2475 which had a Taxonomy parser, now checked in.

From bugzilla-daemon at portal.open-bio.org Sat Apr 12 15:38:38 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 12 Apr 2008 15:38:38 -0400
Subject: [Biopython-dev] [Bug 2488] Adding XML parsers to Bio.Entrez
In-Reply-To:
Message-ID: <200804121938.m3CJcckt000647@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2488

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-12 15:38 EST -------
Created an attachment (id=904)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=904&action=view)
Bio/Entrez/PubmedArticle.py

This is a possible Bio/Entrez/PubmedArticle.py which implements an XML parser
for the PubMed database.

When constructing a dictionary to hold each publication, I am deliberately
flattening and simplifying the very deeply nested structure the NCBI uses.

In general, do we want to provide a faithful conversion of the full XML DOM
structure into python objects, or just a simplification?
If the user cares about the exact XML structure, or particular elements, they are probably better off writing their own parsers using DOM or SAX as they see fit. Still needs more testing, perhaps storing the dates as date objects and not as dictionaries. Also I am ignoring the "history" elements. It may be worthwhile returning a Reference object (see the GenBank parser) for these entries... Just thinking out loud about the Bio.Entrez parsers in general: Why don't the Bio/Entrez/XXX.py implement subclasses of Bio.Entrez.DataHandler, rather than just the two methods startElement() and endElement() -- I'm trying to understand why you did it this way round Michiel. Finally, in Bio/Entrez/__init__.py why is the _NameToModule dict defined within the DataHandler class? This seems to prevent it from being edited -- desirable if the user wanted to add or change the parsers called by Bio.Entrez.read() in their script. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Apr 12 20:55:03 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 12 Apr 2008 20:55:03 -0400 Subject: [Biopython-dev] [Bug 2488] Adding XML parsers to Bio.Entrez In-Reply-To: Message-ID: <200804130055.m3D0t3Fi014906@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2488 ------- Comment #2 from mdehoon at ims.u-tokyo.ac.jp 2008-04-12 20:55 EST ------- > Why don't the Bio/Entrez/XXX.py implement subclasses of Bio.Entrez.DataHandler, > rather than just the two methods startElement() and endElement() -- I'm trying > to understand why you did it this way round Michiel. We don't know what kind of XML we're handling until after we start reading it, since this is information contained in the XML. 
So we need to create the handler before knowing which handler will be needed.

> Finally, in Bio/Entrez/__init__.py why is the _NameToModule dict defined
> within the DataHandler class? This seems to prevent it from being edited
> -- desirable if the user wanted to add or change the parsers called by
> Bio.Entrez.read() in their script.

We can get to _NameToModule. Try it:

>>> from Bio import Entrez
>>> Entrez.DataHandler._NameToModule

Being able to override the parsers in Bio.Entrez is something Sean requested.
I am not sure if he still wants it, or if it's really useful. A user can also
modify the record after parsing it with the standard parser in Bio.Entrez,
which gives the same end result as modifying the parser.

From bugzilla-daemon at portal.open-bio.org Sun Apr 13 08:46:39 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 13 Apr 2008 08:46:39 -0400
Subject: [Biopython-dev] [Bug 2488] Adding XML parsers to Bio.Entrez
In-Reply-To:
Message-ID: <200804131246.m3DCkdiU029292@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2488

------- Comment #3 from mdehoon at ims.u-tokyo.ac.jp 2008-04-13 08:46 EST -------
Just one comment on your patch: It would be a good idea to include the exact
name of the DTD in the comments.
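
The design described in comment #2 (one generic handler is created first, and
the format-specific callbacks are chosen only once the XML has started to
arrive) can be sketched with the standard library's xml.sax module. This is an
illustration of the idea only, not the Bio.Entrez code: the class
DispatchingHandler, the PARSERS table and the two callback functions are
invented for the example, and the XML is a toy document loosely modelled on an
einfo reply. Written for a current Python 3.

```python
import xml.sax


class DispatchingHandler(xml.sax.ContentHandler):
    """Generic handler that picks its callbacks from the root element."""

    def __init__(self, parsers):
        xml.sax.ContentHandler.__init__(self)
        self.parsers = parsers      # root element name -> (start, end)
        self.start = self.end = None
        self.text = ""
        self.record = []

    def startElement(self, name, attrs):
        if self.start is None:
            # First element seen: only now is the document type known.
            self.start, self.end = self.parsers[name]
        self.start(self, name, attrs)

    def endElement(self, name):
        self.end(self, name)

    def characters(self, content):
        self.text += content


def einfo_start(handler, name, attrs):
    # Reset the text buffer at the start of every element.
    handler.text = ""


def einfo_end(handler, name):
    # Keep only the database names.
    if name == "DbName":
        handler.record.append(handler.text.strip())


# Hypothetical lookup table, playing the role of _NameToModule.
PARSERS = {"eInfoResult": (einfo_start, einfo_end)}

xml_data = (b"<eInfoResult><DbList><DbName>pubmed</DbName>"
            b"<DbName>taxonomy</DbName></DbList></eInfoResult>")

handler = DispatchingHandler(PARSERS)
xml.sax.parseString(xml_data, handler)
print(handler.record)
```

Because the lookup table is an ordinary dictionary, a script could add or
replace an entry before parsing, which is the kind of override discussed above.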
From bugzilla-daemon at portal.open-bio.org Sun Apr 13 09:46:06 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 13 Apr 2008 09:46:06 -0400 Subject: [Biopython-dev] [Bug 2488] Adding XML parsers to Bio.Entrez In-Reply-To: Message-ID: <200804131346.m3DDk66B032208@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2488 ------- Comment #4 from mdehoon at ims.u-tokyo.ac.jp 2008-04-13 09:46 EST ------- Uploaded a parser for SerialSet to CVS. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sun Apr 13 10:32:28 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 13 Apr 2008 10:32:28 -0400 Subject: [Biopython-dev] [Bug 2488] Adding XML parsers to Bio.Entrez In-Reply-To: Message-ID: <200804131432.m3DEWSln001836@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2488 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-13 10:32 EST ------- >> Why don't the Bio/Entrez/XXX.py implement subclasses >> of Bio.Entrez.DataHandler, rather than just the two >> methods startElement() and endElement() -- I'm trying >> to understand why you did it this way round Michiel. > > We don't know what kind of XML we're handling until > after we start reading it, since this is information > contained in the XML. So we need to create the > handler before knowing which handler will be needed. OK - I see what you mean now. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From samnemo at gmail.com Mon Apr 14 18:06:46 2008 From: samnemo at gmail.com (sam n) Date: Mon, 14 Apr 2008 18:06:46 -0400 Subject: [Biopython-dev] kdtree update Message-ID: I made some changes to the C++ side of Biopython's KDTree that allow one to perform a nearest-neighbor search without specifying a range up-front. I found this saved a considerable amount of CPU time for the problem I was using it for. It might be useful to other people so I can send the update, which is based on http://www.google.com/codesearch?hl=en&q=+kdtree+show:b099E8j0eYY:M9X8aTw_p7E:Tn8Xj-OBPYY&sa=N&cd=4&ct=rc&cs_p=ftp://ftp.diku.dk/diku/users/martinz/tabu.tar.gz&cs_f=kdtree.c#first Where do I send it to? This mailing list? Thanks Sam From biopython-dev at maubp.freeserve.co.uk Mon Apr 14 18:37:59 2008 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Mon, 14 Apr 2008 23:37:59 +0100 Subject: [Biopython-dev] kdtree update In-Reply-To: References: Message-ID: <320fb6e00804141537i73a3ccadmbbe4e71769bd5469@mail.gmail.com> On Mon, Apr 14, 2008 at 11:06 PM, sam n wrote: > I made some changes to the C++ side of Biopython's KDTree that allow one to > perform a nearest-neighbor search without specifying a range up-front. I found > this saved a considerable amount of CPU time for the problem I was using it for. > It might be useful to other people so I can send the update, which is based on > http://www.google.com/codesearch?hl=en&q=+kdtree+show:b099E8j0eYY:M9X8aTw_p7E:Tn8Xj-OBPYY&sa=N&cd=4&ct=rc&cs_p=ftp://ftp.diku.dk/diku/users/martinz/tabu.tar.gz&cs_f=kdtree.c#first > > Where do I send it to? This mailing list? Hi Sam, Could you file an "enhancement" bug on Bugzilla, and then attach a patch? http://bugzilla.open-bio.org/ It would also help to have a small example (in python) of how you would use this. 
Thanks Peter From thamelry at binf.ku.dk Tue Apr 15 02:08:03 2008 From: thamelry at binf.ku.dk (Thomas Hamelryck) Date: Tue, 15 Apr 2008 08:08:03 +0200 Subject: [Biopython-dev] kdtree update In-Reply-To: References: Message-ID: <2d7c25310804142308l477ba091tcab648f753fb1ba0@mail.gmail.com> On Tue, Apr 15, 2008 at 12:06 AM, sam n wrote: > I made some changes to the C++ side of Biopython's KDTree that allow one > to > perform a nearest-neighbor search without specifying a range up-front. What exactly do you mean by this? Could you post an example, also of the speed up? Cheers, -Thomas From bugzilla-daemon at portal.open-bio.org Tue Apr 15 10:10:06 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 15 Apr 2008 10:10:06 -0400 Subject: [Biopython-dev] [Bug 2489] New: KDTree NN search without specifying radius Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2489 Summary: KDTree NN search without specifying radius Product: Biopython Version: 1.45 Platform: PC OS/Version: Windows Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: samnemo at gmail.com CC: thamelry at binf.ku.dk All the current searches in the KDTree require specifying a radius. If you don't know what the radius is, you don't know how far to search without taking a typical estimate of the data set. I just added a function to find the nearest neighbor to a coordinate without specifying this radius up front. I made the changes on the C++ side of Biopython's KDTree. It might be useful to other people so I will post the update, which is based on http://www.google.com/codesearch?hl=en&q=+kdtree+show:b099E8j0eYY:M9X8aTw_p7E:Tn8Xj-OBPYY&sa=N&cd=4&ct=rc&cs_p=ftp://ftp.diku.dk/diku/users/martinz/tabu.tar.gz&cs_f=kdtree.c#first However, I am not currently proficient in the Python C API, so someone else may be able to write the interface in 3 minutes... 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Apr 15 10:19:04 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 15 Apr 2008 10:19:04 -0400 Subject: [Biopython-dev] [Bug 2489] KDTree NN search without specifying radius In-Reply-To: Message-ID: <200804151419.m3FEJ4Ib010877@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2489 ------- Comment #1 from samnemo at gmail.com 2008-04-15 10:19 EST ------- Created an attachment (id=908) --> (http://bugzilla.open-bio.org/attachment.cgi?id=908&action=view) updated CPP file added public function void KDTree::search_nn(float* coord,bool allowzero) and private function void KDTree::_search_r(Node* node,float* coord,bool allowzero) other changes include leaving coordinates as squared distances when storing them. the user is then responsible for calling sqrt if desired. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Apr 15 10:25:47 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 15 Apr 2008 10:25:47 -0400 Subject: [Biopython-dev] [Bug 2489] KDTree NN search without specifying radius In-Reply-To: Message-ID: <200804151425.m3FEPl7J011220@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2489 ------- Comment #2 from samnemo at gmail.com 2008-04-15 10:25 EST ------- Created an attachment (id=909) --> (http://bugzilla.open-bio.org/attachment.cgi?id=909&action=view) updated .H file added public function void KDTree::search_nn(float* coord,bool allowzero) bool allowzero specifies whether to allow zero distance between nearest neighbor searching for in tree (should be false when searching for nearest neighbor of a coordinate known to be in the tree) and private function : void KDTree::_search_r(Node* node,float* coord,bool allowzero) performs recursive search for nearest neighbor of coord starting from node also note new member variable : float _min_radius_sq; used to keep track of min distance found for a single nearest neighbor search other changes include: leaving coordinates as squared distances when storing them. the user is then responsible for calling sqrt if desired. replaced some of the static declaration of variable sized arrays as vectors since wouldn't compile on msvc++, but easy to change that back... -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
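
For readers who want to see what a nearest-neighbor search without a radius
looks like, here is a small pure-Python sketch of the usual k-d tree
technique. It is not the C++ code from the attachments and not Biopython's
KDTree API: the functions build and sq_dist are invented for the example,
while search_nn and allow_zero only borrow the names used in the attachment
descriptions. Like the patch, it works with squared distances and leaves the
square root to the caller. Written for a current Python 3.

```python
def build(points, depth=0):
    """Build a k-d tree as nested tuples (point, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build(points[:mid], depth + 1),
            build(points[mid + 1:], depth + 1))


def sq_dist(a, b):
    """Squared Euclidean distance between two coordinates."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_nn(node, coord, allow_zero=True, depth=0, best=None):
    """Return (squared distance, point) for the nearest neighbor of coord.

    With allow_zero=False, points at distance zero are skipped, which is
    useful when coord itself is stored in the tree.
    """
    if node is None:
        return best
    point, left, right = node
    d = sq_dist(point, coord)
    if (allow_zero or d > 0) and (best is None or d < best[0]):
        best = (d, point)
    axis = depth % len(coord)
    diff = coord[axis] - point[axis]
    # Descend first into the side of the split that contains coord.
    near, far = (left, right) if diff < 0 else (right, left)
    best = search_nn(near, coord, allow_zero, depth + 1, best)
    # Cross the splitting plane only if it is closer than the best match.
    if best is None or diff * diff < best[0]:
        best = search_nn(far, coord, allow_zero, depth + 1, best)
    return best


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
print(search_nn(tree, (9, 2)))
print(search_nn(tree, (8, 1), allow_zero=False))
```

The search radius is effectively discovered during the search: the best
squared distance found so far plays the role that a fixed radius plays in a
range query, and it shrinks as closer points turn up.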
From bugzilla-daemon at portal.open-bio.org Tue Apr 15 10:26:24 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 15 Apr 2008 10:26:24 -0400 Subject: [Biopython-dev] [Bug 2489] KDTree NN search without specifying radius In-Reply-To: Message-ID: <200804151426.m3FEQO1d011261@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2489 samnemo at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |samnemo at gmail.com -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Apr 15 10:45:06 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 15 Apr 2008 10:45:06 -0400 Subject: [Biopython-dev] [Bug 2489] KDTree NN search without specifying radius In-Reply-To: Message-ID: <200804151445.m3FEj6bc011983@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2489 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-15 10:45 EST ------- The python API would need to be updated to match. Do you know how to use SWIG? You mention changes to leaving coordinates as squared distances - this may be more efficient but would probably break any existing code using this. As Thomas Hamelryck said on the mailing list, an example to show why this new code is useful would be very helpful (and to demonstrate the claimed time speed up). Link to email thread for anyone not subscribed: http://lists.open-bio.org/pipermail/biopython-dev/2008-April/003601.html -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Apr 15 11:03:52 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 15 Apr 2008 11:03:52 -0400 Subject: [Biopython-dev] [Bug 2489] KDTree NN search without specifying radius In-Reply-To: Message-ID: <200804151503.m3FF3q08013312@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2489 ------- Comment #4 from samnemo at gmail.com 2008-04-15 11:03 EST ------- I never used SWIG before, but could learn how to use it...I don't have a lot of time at the moment...so I'll have to come back to this...might be sooner (or later) than I think... From bugzilla-daemon at portal.open-bio.org Wed Apr 16 07:02:41 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 16 Apr 2008 07:02:41 -0400 Subject: [Biopython-dev] [Bug 2488] Adding XML parsers to Bio.Entrez In-Reply-To: Message-ID: <200804161102.m3GB2fDi013466@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2488 ------- Comment #6 from mdehoon at ims.u-tokyo.ac.jp 2008-04-16 07:02 EST ------- Added a parser for the OMIM database, and an initial unit test.
From bugzilla-daemon at portal.open-bio.org Thu Apr 17 08:41:06 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 17 Apr 2008 08:41:06 -0400 Subject: [Biopython-dev] [Bug 2488] Adding XML parsers to Bio.Entrez In-Reply-To: Message-ID: <200804171241.m3HCf6OO031008@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2488 ------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-17 08:41 EST ------- Michiel - I see you've been doing more work on these parsers (and unit tests). I'm quite happy for you to take on PubmedArticle.py and implement it as you see fit (based on my suggested code or otherwise). Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Apr 17 22:35:29 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 17 Apr 2008 22:35:29 -0400 Subject: [Biopython-dev] [Bug 2480] Local BLAST fails: Spaces in Windows file-path values In-Reply-To: Message-ID: <200804180235.m3I2ZT1A004790@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2480 ------- Comment #10 from drpatnaik at yahoo.com 2008-04-17 22:35 EST ------- Could this be a blast.exe issue? 
Using the Windows command console, if I change the working directory (cd) to the one having 'blast.exe', both the following work: [1] blastall.exe -p blastn -d "C:\Documents and Settings\patnaik\My Documents\blast\bin\mine" -i "C:\Documents and Settings\patnaik\My Documents\blast\bin\hairpin" -m 7 [2] "C:\Documents and Settings\patnaik\My Documents\blast\bin\blastall.exe" -p blastn -d "C:\Documents and Settings\patnaik\My Documents\blast\bin\mine" -i "C:\Documents and Settings\patnaik\My Documents\blast\bin\hairpin" -m 7 But neither ([1] for obvious reason), nor 'bin/blast.exe -p ...', etc., work if I move out of the directory that has 'blast.exe'. The console displays: [NULL_Caption] WARNING: Unable to open Documents.nin [NULL_Caption] WARNING: Unable to open and.nin [NULL_Caption] WARNING: Unable to open My.nin ... -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Apr 17 22:42:35 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 17 Apr 2008 22:42:35 -0400 Subject: [Biopython-dev] [Bug 2480] Local BLAST fails: Spaces in Windows file-path values In-Reply-To: Message-ID: <200804180242.m3I2gZ1q005051@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2480 ------- Comment #11 from drpatnaik at yahoo.com 2008-04-17 22:42 EST ------- (In reply to comment #10) No, not a blast.exe issue as, e.g., this works (note the escaped quotes): "bin\blastall.exe" -p blastn -d "\"C:\Documents and Settings\patnaik\My Documents\blast\bin\mine\"" -i "C:\Documents and Settings\patnaik\My Documents\blast\bin\hairpin" -m 7 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
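Note: one way a Python caller might sidestep shell quoting is to hand the command to `subprocess` as a list, so that each path travels as a single argument. The sketch below is only an illustration under that assumption; `build_blastall_args` and `quote_database` are hypothetical names, not Biopython functions. Following comment #11, the database value is additionally wrapped in literal double quotes, on the assumption that blastall itself splits the `-d` value on spaces. Since blastall may not be installed, a small Python child process stands in for it and simply counts the arguments it receives.

```python
import subprocess
import sys


def build_blastall_args(blastall_exe, program, database, infile,
                        quote_database=True):
    """Return an argument list; each path stays one element, spaces and all."""
    if quote_database:
        # Embedded quotes, as in comment #11 (assumed to be needed by blastall).
        database = '"%s"' % database
    return [blastall_exe, "-p", program, "-d", database, "-i", infile, "-m", "7"]


args = build_blastall_args(
    r"C:\Documents and Settings\patnaik\My Documents\blast\bin\blastall.exe",
    "blastn",
    r"C:\Documents and Settings\patnaik\My Documents\blast\bin\mine",
    r"C:\Documents and Settings\patnaik\My Documents\blast\bin\hairpin",
)

# Stand-in for blastall: report how many arguments arrived, to show that
# no path was split on its spaces.
probe = [sys.executable, "-c", "import sys; print(len(sys.argv) - 1)"]
result = subprocess.run(probe + args[1:], capture_output=True, text=True,
                        check=True)
received = int(result.stdout.strip())
```

With a real installation one would pass `args` itself to `subprocess.run`; whether that alone resolves the bug on Windows is not something this thread establishes.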
From chris.lasher at gmail.com Fri Apr 18 14:45:36 2008 From: chris.lasher at gmail.com (Chris Lasher) Date: Fri, 18 Apr 2008 14:45:36 -0400 Subject: [Biopython-dev] [O|B|F Helpdesk #332] Transitioning Biopython to SVN In-Reply-To: <-2724361901926009927@unknownmsgid> References: <-2724361901926009927@unknownmsgid> Message-ID: <128a885f0804181145p7440f6bfp94c12c514519ed1@mail.gmail.com> Hi Mauricio, Right now the transition is idling. George Hartzell set up a prototype repository that was read-only and I had a successful checkout. I don't think any Biopython devs noticed any anomalies, so I guess that's a thumbs-up for the transition by default. There are a number of tickets that seem to be getting worked on at the moment but I think we can freeze the CVS repository soon and get SVN going. Also, what's the resolution for providing public read-only access to the repositories? I think that was the only unresolved matter that concerned the Biopython users. Thanks, Chris On Fri, Apr 18, 2008 at 2:34 PM, Mauricio Herrera Cuadra via RT wrote: > What is the status of this so far? > From bugzilla-daemon at portal.open-bio.org Tue Apr 22 08:27:47 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 22 Apr 2008 08:27:47 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804221227.m3MCRl95007664@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 mmokrejs at ribosome.natur.cuni.cz changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |mmokrejs at ribosome.natur.cuni | |.cz -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Apr 22 09:10:58 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 22 Apr 2008 09:10:58 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804221310.m3MDAwOo010833@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #896 is|0 |1 obsolete| | ------- Comment #28 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-22 09:10 EST ------- (From update of attachment 896) Marking this patch as obsolete since we've got something based on Eric's work for Bio.Entrez checked into CVS now. From MatatTHC at gmx.de Tue Apr 22 12:49:02 2008 From: MatatTHC at gmx.de (Matthias Bernt) Date: Tue, 22 Apr 2008 18:49:02 +0200 Subject: [Biopython-dev] derive from Seq Message-ID: <20080422164902.232560@gmx.net> Hi, Despite other messages on this list, someone needs circular sequences .. me :). I thought that the best way to get a circular behaviour is to make a derived class in order to keep all the nice features of Seq. So I've started to write a derived class which overwrites some of the methods from the Seq - especially __getitem__ - and I run quite fast into problems. E.g. the complement method returns a Seq object. The desired behaviour would be to return an instance of my derived class .. This could be done easily with self.__class__(s, self.alphabet). Unfortunately the __init__ method of my derived class has a third parameter (with default value) which sets the sequence to circular / linear.
This could be done with copy or clone methods. So you see that there are problems when deriving from the Seq class. What is the best (or a good) strategy for deriving classes from Seq? Thanks -- Matthias From n.j.loman at bham.ac.uk Tue Apr 22 12:50:36 2008 From: n.j.loman at bham.ac.uk (Nick Loman) Date: Tue, 22 Apr 2008 17:50:36 +0100 Subject: [Biopython-dev] BioSQL : BatchLoader Message-ID: <480E175C.9060301@bham.ac.uk> Dear biopython-developers, As importing data into PostgreSQL is much faster when using the batch "COPY" method I decided I would hack BioSQL.Loader to produce COPY statements for the bulk of the data in a typical GenBank file. As index updating/foreign key checking is also slow, I split the BioSQL schema. I put table definitions in one file and then indexes/foreign key constraints in a separate one. I import the schema file, then apply indexes/FK only after the data is loaded. Caveat: obviously this can't be done on a "live" database and it relies on only a single import process being run at any one time. A modified 'Loader' uses a new class called 'FakeTable'. FakeTable acts as a very, very basic data store attempting to simulate the behavior of Postgres. FakeTable.dump() outputs COPY statements to stdout instead of SQL commands. I benchmarked load_seqdatabase.pl vs. BioSQL.loader vs. FakeTable with a GenBank file 42MB large (microbial32.genomic.gbff from RefSeq). load_seqdatabase.pl - not directly comparable as needs foreign keys/rules to run correctly, but conservatively >20 minutes BioSQL.Loader/psyco - 4 minutes, 54 seconds BatchLoader/psyco - 1 minute, 38 seconds +Import the output - 8 seconds Postgres 8.3.1, Gentoo/Linux, 8GB RAM. As the number of sequence files increases, there should be even greater gains, as the interactive version will take longer to execute each query.
This is not production-quality code but might act as a starting point for hacking about with. I would be grateful for any comments. If the team felt this would be a useful inclusion in Biopython I am happy to work it up a bit more. A MySQL compatible version would not be very hard, for example. I reckon this could be faster, for example the sequence parsing could be threaded on a multi-core machine. Code is here: http://pathogenomics.bham.ac.uk/nick/snippets/biopython-sql/ I'd be grateful for any feedback on how this might be improved, and how we can make it even faster! Many thanks Nick. From biopython-dev at maubp.freeserve.co.uk Tue Apr 22 13:27:38 2008 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Tue, 22 Apr 2008 18:27:38 +0100 Subject: [Biopython-dev] derive from Seq In-Reply-To: <20080422164902.232560@gmx.net> References: <20080422164902.232560@gmx.net> Message-ID: <320fb6e00804221027x68e866eeo9f62a9e83250355c@mail.gmail.com> Hi Matthias, > ... I've started to write a derived class which overwrites some of the > methods from the Seq - especially __getitem__ - and I run quite fast > into problems. E.g. the complement method returns a Seq object. > > The desired behaviour would be to return an instance of my derived class > .. This could be done easily with self.__class__(s, self.alphabet). That has been raised before (although I don't think anyone filed a bug) and as I recall, changing this led to some unexpected side effects (failures in the test suite). If you fancy trying this change, and digging into what (if anything) breaks as a result, that would be very helpful. > Unfortunately the __init__ method of my derived class has a third > parameter (with default value) which sets the sequence to circular / linear. > This could be done with copy or clone methods. > > So you see that there are problems when deriving from the Seq class. What > is the best (or a good) strategy for deriving classes from Seq?
Maybe for now your best bet is to subclass, and then write your own (reverse)complement method which calls the base class to do the work, and then transforms the resulting Seq object into your own CircularSeq object. Unfortunately, you would probably have to do something similar for other problematic methods (until the base class is fixed). Out of interest, how do you interpret integers in your CircularSeq's __getitem__ method? Python's existing negative index behaviour seems to be ideal, for example -1 already returns the last letter. I'd guess you make values greater than the sequence length simply wrap. Is that the only change or do you alter the slice behaviour too (this is where it gets tricky)? Peter From biopython-dev at maubp.freeserve.co.uk Tue Apr 22 13:38:24 2008 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Tue, 22 Apr 2008 18:38:24 +0100 Subject: [Biopython-dev] BioSQL : BatchLoader In-Reply-To: <480E175C.9060301@bham.ac.uk> References: <480E175C.9060301@bham.ac.uk> Message-ID: <320fb6e00804221038i273ad0dbw6ec7bf866cef190c@mail.gmail.com> On Tue, Apr 22, 2008 at 5:50 PM, Nick Loman wrote: > Dear biopython-developers, > > As importing data into PostgreSQL is much faster when using the batch > "COPY" method I decided I would hack BioSQL.Loader to produce COPY > statements for the bulk of the data in a typical GenBank file. Can I ask what version of Biopython you're using? And given you've got it running on PostgreSQL, is there anything you think should be added to the wiki documentation? http://biopython.org/wiki/BioSQL > As index updating/foreign key checking is also slow, I split the BioSQL > schema. I put table definitions in one file and then indexes/foreign key > constraints in a separate one. While this is fine for your own use - you'd have to take this up on the BioSQL mailing list if you wanted it to become a standard (i.e. it's not just up to us at Biopython). It might be worth moving some of this discussion there anyway.
> I benchmarked load_seqdatabase.pl vs. BioSQL.loader vs. FakeTable with a > GenBank file 42MB large (microbial32.genomic.gbff from RefSeq). > > load_seqdatabase.pl - not directly comparable as needs foreign > keys/rules to run correctly, but > conservatively >20 minutes > > BioSQL.Loader/psyco - 4 minutes, 54 seconds > > BatchLoader/psyco - 1 minute, 38 seconds > +Import the output - 8 seconds > > Postgres 8.3.1, Gentoo/Linux, 8GB RAM. Did you run the numbers for a plain Biopython BioSQL.Loader import (without psyco)? If you do go back and run some more tests, could you also try just parsing the GenBank file without actually doing anything with the data (to see what the overhead is on your machine). > I reckon this could be faster, for example the sequence parsing could be > threaded on a multi-core machines. You should in principle be able to run multiple imports even without making any code changes to Biopython, although I suspect there is some scope for clashes (e.g. two threads both adding new entries to the taxonomy tables). > Code is here: > http://pathogenomics.bham.ac.uk/nick/snippets/biopython-sql/ > > I'd be grateful for any feedback on how this might be improved, and how we > can make it even faster! That seems to be password protected at the moment. Peter From n.j.loman at bham.ac.uk Wed Apr 23 04:09:54 2008 From: n.j.loman at bham.ac.uk (Nick Loman) Date: Wed, 23 Apr 2008 09:09:54 +0100 Subject: [Biopython-dev] BioSQL : BatchLoader In-Reply-To: <320fb6e00804221038i273ad0dbw6ec7bf866cef190c@mail.gmail.com> References: <480E175C.9060301@bham.ac.uk> <320fb6e00804221038i273ad0dbw6ec7bf866cef190c@mail.gmail.com> Message-ID: <480EEED2.5080602@bham.ac.uk> Hi Peter >> As importing data into PostgreSQL is much faster when using the batch >> "COPY" method I decided I would hack BioSQL.Loader to produce COPY >> statements for the bulk of the data in a typical GenBank file. > > Can I ask what version of Biopython you're using? 1.45. 
> is there anything you think should be > added to the wiki documentation?: > http://biopython.org/wiki/BioSQL I've added a few lines on Postgres. >> As index updating/foreign key checking is also slow, I split the BioSQL >> schema. I put table definitions in one file and then indexes/foreign key >> constraints in a separate one. > > While this is fine for your own use - you'd had have to take this up > on the BioSQL mailing list if you wanted it to become a standard (i.e. > its not just up to us at Biopython). It might be worth moving some of > this discussion there anyway. Yep, appreciate that! The problem is that you wouldn't want to have non-indexed tables ever if you were updating with the traditional 'interactive' scripts, as they will begin to slow to a crawl as more data is imported. So this approach is only really good for this kind of batch-import model. However I guess it is still reasonably friendly to ask people to import 2 scripts in a row. >> load_seqdatabase.pl - not directly comparable as needs foreign >> keys/rules to run correctly, but >> conservatively >20 minutes >> +Import the output - 8 seconds > > Did you run the numbers for a plain Biopython BioSQL.Loader import > (without psyco)? If you do go back and run some more tests, could you > also try just parsing the GenBank file without actually doing anything > with the data (to see what the overhead is on your machine). Yep, sure. GenBank parsing without psyco - 2 minutes, 15 seconds GenBank parsing with psyco - 1 minute, 20 seconds >> BioSQL.Loader/psyco - 4 minutes, 54 seconds BioSQL.Loader without psyco - 6 minutes, 10 seconds >> BatchLoader/psyco - 1 minute, 38 seconds BatchLoader without psyco - 2 minutes, 42 seconds >> I reckon this could be faster, for example the sequence parsing could be >> threaded on a multi-core machines. 
> > You should in principle be able to run multiple imports even without > making any code changes to Biopython, although I suspect there is some > scope for clashes (e.g. two threads both adding new entries to the > taxonomy tables). Yep, with the interactive version I reckon this would work without many problems (most taxa should be pulled out of NCBI anyway), but with my flat-file version this wouldn't work unless specifically designed for. I could parallelise the GB parsing stage though as that is the current bottleneck for my app. >> Code is here: >> http://pathogenomics.bham.ac.uk/nick/snippets/biopython-sql/ >> >> I'd be grateful for any feedback on how this might be improved, and how we >> can make it even faster! > > That seems to be password protected at the moment. My bad, it's open now. Regards, Nick. From biopython at maubp.freeserve.co.uk Wed Apr 23 04:56:33 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 23 Apr 2008 09:56:33 +0100 Subject: [Biopython-dev] BioSQL : BatchLoader In-Reply-To: <480EEED2.5080602@bham.ac.uk> References: <480E175C.9060301@bham.ac.uk> <320fb6e00804221038i273ad0dbw6ec7bf866cef190c@mail.gmail.com> <480EEED2.5080602@bham.ac.uk> Message-ID: <320fb6e00804230156p1c6f3b26qb1b12dcc1c1fb543@mail.gmail.com> > GenBank parsing with psyco - 1 minute, 20 seconds > GenBank parsing without psyco - 2 minutes, 15 seconds > > BioSQL.Loader/psyco - 4 minutes, 54 seconds > BioSQL.Loader without psyco - 6 minutes, 10 seconds > > BatchLoader/psyco - 1 minute, 38 seconds > BatchLoader without psyco - 2 minutes, 42 seconds That's impressive - you seem to have got the database side of things down to about 30 seconds; a fraction of the time to parse the GenBank file! Although, as you pointed out, there are a lot of provisos here. There are still some slow bits in the current GenBank parser which would be an obvious next target for you in your quest for speed. 
I did a little investigation a while ago, and concluded the parsing of the feature locations was the biggest bottleneck. However, this is a rather complicated lump of code, so its not such an easy task. I tried out a "hack" which special-cased the most common feature location types, with a fall back on the original parser, which gave much better performance. I didn't check this in as it made some already complex code WAY more complicated! > > > I reckon this could be faster, for example the sequence parsing could > > > be threaded on a multi-core machines. Did you mean simply one GenBank file per core, or something more complicated where parsing a single file is done using multiple cores? Peter From n.j.loman at bham.ac.uk Wed Apr 23 08:08:57 2008 From: n.j.loman at bham.ac.uk (Nick Loman) Date: Wed, 23 Apr 2008 13:08:57 +0100 Subject: [Biopython-dev] BioSQL : BatchLoader In-Reply-To: <320fb6e00804230156p1c6f3b26qb1b12dcc1c1fb543@mail.gmail.com> References: <480E175C.9060301@bham.ac.uk> <320fb6e00804221038i273ad0dbw6ec7bf866cef190c@mail.gmail.com> <480EEED2.5080602@bham.ac.uk> <320fb6e00804230156p1c6f3b26qb1b12dcc1c1fb543@mail.gmail.com> Message-ID: <480F26D9.8000405@bham.ac.uk> Peter wrote: > That's impressive - you seem to have got the database side of things > down to about 30 seconds; a fraction of the time to parse the GenBank > file! Although, as you pointed out, there are a lot of provisos here. Yep. Would it be helpful to do anything further with this code, i.e. put it into CVS and document on the Wiki, perhaps when its been a bit more tested? > There are still some slow bits in the current GenBank parser which > would be an obvious next target for you in your quest for speed. I > did a little investigation a while ago, and concluded the parsing of > the feature locations was the biggest bottleneck. However, this is a > rather complicated lump of code, so its not such an easy task. 
I > tried out a "hack" which special-cased the most common feature > location types, with a fall back on the original parser, which gave > much better performance. I didn't check this in as it made some > already complex code WAY more complicated! Aha, sounds good. I haven't profiled the Biopython code but I will check this. I'm dealing with bacterial sequences in the main which have mainly simple location identifiers, so there could well be some mileage here. >>>> I reckon this could be faster, for example the sequence parsing could >>>> be threaded on a multi-core machines. > > Did you mean simply one GenBank file per core, or something more > complicated where parsing a single file is done using multiple cores? I mean process one GenBank file per core. Locally that would mean on a 4-core machine you could have 3 parser threads working concurrently, each passing the generated Seq object to the Loader when read. Cheers Nick. From biopython at maubp.freeserve.co.uk Wed Apr 23 08:48:25 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 23 Apr 2008 13:48:25 +0100 Subject: [Biopython-dev] BioSQL : BatchLoader In-Reply-To: <480F26D9.8000405@bham.ac.uk> References: <480E175C.9060301@bham.ac.uk> <320fb6e00804221038i273ad0dbw6ec7bf866cef190c@mail.gmail.com> <480EEED2.5080602@bham.ac.uk> <320fb6e00804230156p1c6f3b26qb1b12dcc1c1fb543@mail.gmail.com> <480F26D9.8000405@bham.ac.uk> Message-ID: <320fb6e00804230548w71e44289q40387a131476ec4b@mail.gmail.com> > > That's impressive - you seem to have got the database side of things > > down to about 30 seconds; a fraction of the time to parse the GenBank > > file! Although, as you pointed out, there are a lot of provisos here. > > Yep. > > Would it be helpful to do anything further with this code, i.e. put it into > CVS and document on the Wiki, perhaps when its been a bit more tested? I'm not ready to put this into the main Biopython CVS. But by all means, add a new page to the wiki to describe your approach. 
Hopefully there are a few others who might be interested, and we'll see. > > There are still some slow bits in the current GenBank parser which > > would be an obvious next target for you in your quest for speed. I > > did a little investigation a while ago, and concluded the parsing of > > the feature locations was the biggest bottleneck. However, this is a > > rather complicated lump of code, so its not such an easy task. I > > tried out a "hack" which special-cased the most common feature > > location types, with a fall back on the original parser, which gave > > much better performance. I didn't check this in as it made some > > already complex code WAY more complicated! > > Aha, sounds good. I haven't profiled the Biopython code but I will check > this. I'm dealing with bacterial sequences in the main which have mainly > simple location identifiers, so there could well be some mileage here. Yes, I had been experimenting with bacterial sequences too. Beware that the location string in general can be extremely complex (and even reference other files by their identifier). A complete backwards compatible re-write of the location parsing (into sub-features) looked like a big job. That said, if you do run some profiling, you may spot some other "low hanging fruit" which would be easier to tackle. I haven't done any optimisation work since my original re-write of the GenBank parser back in August 2006 when I replaced the older slower Martel parser which didn't scale well with large input files. > I mean process one GenBank file per core. > > Locally that would mean on a 4-core machine you could have 3 parser threads > working concurrently, each passing the generated Seq object to the Loader > when read. I see - that means there is only one thread/job writing to the database, which keeps that side of things thread-safe. 
To be honest, unless you are trying to import several hundred bacterial genomes into BioSQL, I don't think this level of complexity is a worthwhile pay-off. Right now, I would target the GenBank parsing itself (which would be useful outside the task of loading sequences into BioSQL). Something else you may want to consider is timing the BioPerl scripts for importing a GenBank file into BioSQL. There will probably be some minor differences in their interpretation of the data and exactly how they store it, but it would be a useful benchmark. Peter From n.j.loman at bham.ac.uk Wed Apr 23 11:46:35 2008 From: n.j.loman at bham.ac.uk (Nick Loman) Date: Wed, 23 Apr 2008 16:46:35 +0100 Subject: [Biopython-dev] BioSQL : BatchLoader In-Reply-To: <320fb6e00804230548w71e44289q40387a131476ec4b@mail.gmail.com> References: <480E175C.9060301@bham.ac.uk> <320fb6e00804221038i273ad0dbw6ec7bf866cef190c@mail.gmail.com> <480EEED2.5080602@bham.ac.uk> <320fb6e00804230156p1c6f3b26qb1b12dcc1c1fb543@mail.gmail.com> <480F26D9.8000405@bham.ac.uk> <320fb6e00804230548w71e44289q40387a131476ec4b@mail.gmail.com> Message-ID: <480F59DB.70604@bham.ac.uk> Peter wrote: > I'm not ready to put this into the main Biopython CVS. But by all > means, add a new page to the wiki to describe your approach. > Hopefully there are a few others who might be interested, and we'll > see. Okay! >> I mean process one GenBank file per core. >> >> Locally that would mean on a 4-core machine you could have 3 parser threads >> working concurrently, each passing the generated Seq object to the Loader >> when read. > > I see - that means there is only one thread/job writing to the > database, which keeps that side of things thread-safe. To be honest, > unless you are trying to import several hundred bacterial genomes into > BioSQL, I don't think this level of complexity is a worthwhile > pay-off. Right now, I would target the GenBank parsing itself (which > would be useful outside the task of loading sequences into BioSQL).
I agree, I will take a look at GenBank parsing next, and then concurrency after that. The reason I'm doing this is that I need to import all 1686 complete/incomplete bacterial genomes in RefSeq - and plenty more besides! > Something else you may want to consider is timing the BioPerl scripts > for importing a GenBank file into BioSQL. There will probably be some > minor differences in their interpretation of the data and exactly how they > store it, but it would be a useful benchmark. I did this; it was incredibly slow, at least 5x slower. We've been using BioPerl for some time. I realised I needed a faster script so I investigated the same approach with BioPerl but I thought I'd be able to hack the Biopython stuff a bit faster as the BioSQL stuff seems a bit less complex. Plus Python is easy to read ;) Cheers Nick. From bugzilla-daemon at portal.open-bio.org Thu Apr 24 10:52:28 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Apr 2008 10:52:28 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804241452.m3OEqS2w005327@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #29 from ericgibert at yahoo.fr 2008-04-24 10:52 EST ------- Created an attachment (id=914) --> (http://bugzilla.open-bio.org/attachment.cgi?id=914&action=view) Usage of Bio/Entrez/Taxon.py parse to load a SeqRecord's taxonomy The modification is essentially in the function DatabaseLoader._get_taxon_id(self, record). Note the new optional parameter in DatabaseLoader.__init__(self, adaptor, dbid, fetch_NCBI_taxonomy=False) Note: this parameter must be passed in from BioSeqDatabase.load(self, record_iterator, fetch_NCBI_taxonomy=False) in BioSQL/BioSeqDatabase.py: See next attachment for this modification.
From bugzilla-daemon at portal.open-bio.org Thu Apr 24 10:54:38 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Apr 2008 10:54:38 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804241454.m3OEscMM005448@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #30 from ericgibert at yahoo.fr 2008-04-24 10:54 EST ------- Created an attachment (id=915) --> (http://bugzilla.open-bio.org/attachment.cgi?id=915&action=view) Addition of 'fetch_NCBI_taxonomy' as noptional parameter to DatabaseLoader.load() Extra parameter for BioSeqDatabase.load() and pass it to DatabaseLoader (where it will be a property) db_loader = Loader.DatabaseLoader(self.adaptor, self.dbid, fetch_NCBI_taxonomy)
From bugzilla-daemon at portal.open-bio.org Thu Apr 24 10:55:18 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Apr 2008 10:55:18 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804241455.m3OEtImJ005511@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ericgibert at yahoo.fr changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #915|Addition of |Addition of description|'fetch_NCBI_taxonomy' as |'fetch_NCBI_taxonomy' as |noptional parameter to |optional parameter to |DatabaseLoader.load() |DatabaseLoader.load() -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Apr 24 10:55:44 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Apr 2008 10:55:44 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804241455.m3OEtidF005561@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ericgibert at yahoo.fr changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #914|Usage of Bio/Entrez/Taxon.py|Usage of Bio/Entrez/Taxon.py description|parse to load a SeqRecord's |parser to load a SeqRecord's |taxonomy |taxonomy -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Apr 25 19:18:11 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 25 Apr 2008 19:18:11 -0400 Subject: [Biopython-dev] [Bug 2480] Local BLAST fails: Spaces in Windows file-path values In-Reply-To: Message-ID: <200804252318.m3PNIBg9008981@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2480 drpatnaik at yahoo.com changed: What |Removed |Added ---------------------------------------------------------------------------- Severity|normal |blocker -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Apr 25 23:47:28 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 25 Apr 2008 23:47:28 -0400 Subject: [Biopython-dev] [Bug 2494] New: _retrieve_taxon in BioSQL.py needs urgent optimization Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2494 Summary: _retrieve_taxon in BioSQL.py needs urgent optimization Product: Biopython Version: Not Applicable Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: BioSQL AssignedTo: biopython-dev at biopython.org ReportedBy: ericgibert at yahoo.fr I ran the Perl script to get the BioSQL tables 'taxon' and 'taxon_name' updated. Taxon contains 419036 rows and taxon_name contains 584058 rows. To retrieve the taxonomy of a DBSeqRecord, the function DBSeq._retrieve_taxon() uses a SQL based on the nested sets defined by left and right values. This approach is extremely time consuming once the tables grow large. When the issue is a bottom-up search, in this case "all taxon parent of this species", it is better to use the links child/parent based on parent_taxon_id field. Please refer to next post with attached script demonstrating my point. 
Eric

From bugzilla-daemon at portal.open-bio.org Fri Apr 25 23:54:23 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 25 Apr 2008 23:54:23 -0400
Subject: [Biopython-dev] [Bug 2494] _retrieve_taxon in BioSQL.py needs urgent optimization
In-Reply-To:
Message-ID: <200804260354.m3Q3sNhM019804@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2494

------- Comment #1 from ericgibert at yahoo.fr 2008-04-25 23:54 EST -------
Created an attachment (id=916)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=916&action=view)
script timing the current SQL and proposed bottom-up 'loop' implementation

results obtained on my dual core PC (Fedora 7 64 bit):
(resulting lists are truncated to help the reading)

go for it:
getTaxonSQLsimplex took 370.300 ms
[1L, 2759L, 6072L, ...... 229390L, 229391L]
getTaxonSQL took 6846.810 ms
['Eukaryota', 'Metazoa', '...... 'Nannophya', 'Nannophya pygmaea']
getTaxonSQLall took 6772.037 ms
['root', 'cellular organisms', '... 'Odonata', 'Anisoptera/Anisozygoptera group''Nannophya', 'Nannophya pygmaea']
getTaxonLoop took 14.559 ms
['cellular organisms',... 'Nannophya', 'Nannophya pygmaea']

Conclusion: many runs have shown that the Loop function is always under 15ms
while the current SQL will be more than 6500ms.
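[Editorial sketch] The bottom-up approach described in this report can be illustrated roughly as follows. This is not the attached script (id=916), which is not reproduced in the archive. It is a minimal sketch that assumes a simplified two-table schema using only the `taxon_id`, `parent_taxon_id`, `name` and `name_class` columns, an in-memory SQLite database, and hypothetical helper names (`lineage_by_parent_links`, `build_toy_database`).

```python
import sqlite3


def lineage_by_parent_links(conn, taxon_id):
    """Follow parent_taxon_id links upward and return names ordered root-first."""
    names = []
    seen = set()  # guards against a root node that lists itself as its parent
    current = taxon_id
    while current is not None and current not in seen:
        seen.add(current)
        row = conn.execute(
            "SELECT t.parent_taxon_id, n.name "
            "FROM taxon t JOIN taxon_name n ON n.taxon_id = t.taxon_id "
            "WHERE t.taxon_id = ? AND n.name_class = 'scientific name'",
            (current,),
        ).fetchone()
        if row is None:
            break
        parent_id, name = row
        names.append(name)
        current = parent_id
    names.reverse()
    return names


def build_toy_database():
    """Create a tiny in-memory stand-in for the taxon and taxon_name tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY, parent_taxon_id INTEGER)"
    )
    conn.execute(
        "CREATE TABLE taxon_name (taxon_id INTEGER, name TEXT, name_class TEXT)"
    )
    toy_rows = [
        (1, 1, "root"),  # the root points at itself here, hence the 'seen' set
        (2, 1, "Eukaryota"),
        (3, 2, "Metazoa"),
        (4, 3, "Nannophya"),
        (5, 4, "Nannophya pygmaea"),
    ]
    for taxon_id, parent_id, name in toy_rows:
        conn.execute("INSERT INTO taxon VALUES (?, ?)", (taxon_id, parent_id))
        conn.execute(
            "INSERT INTO taxon_name VALUES (?, ?, 'scientific name')",
            (taxon_id, name),
        )
    return conn


if __name__ == "__main__":
    print(lineage_by_parent_links(build_toy_database(), 5))
```

Each step is a single primary-key lookup, so the cost grows with the depth of the lineage rather than with the size of the tables, which is consistent with the timings reported above.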
From bugzilla-daemon at portal.open-bio.org Sat Apr 26 06:29:25 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 26 Apr 2008 06:29:25 -0400
Subject: [Biopython-dev] [Bug 2494] _retrieve_taxon in BioSQL.py needs urgent optimization
In-Reply-To:
Message-ID: <200804261029.m3QATPLW003286@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2494

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-26 06:29 EST -------
There is a small risk that your numbers will be misleading when applied to
other databases (e.g. mysql versus postgres). Other than that, using just the
parent id is probably a much better idea (especially given some of the
proposals on Bug 2475 about not writing the left/right values).

Have you ever used the "diff" command-line tool to produce a patch file? e.g.

diff old_version.py new_version.py > patch.txt

Or, if you are working on a CVS checkout, modify the local file and then:

cvs diff changed_file.py > patch.txt

Also read up on the "patch" command for applying the patch to update an
unchanged file. If you are on a Unix style platform, these are usually
installed already. For Windows I use cygwin's diff command but there are
probably other options.

There are several advantages, including the fact that patches are smaller.
For initial code review, they also highlight the area changed. Another big
advantage is if CVS has been updated in the meantime, a patch can often still
be applied automatically.
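[Editorial sketch] For readers without the `diff` tool at hand, the kind of patch file described above can also be produced with Python's standard `difflib` module. This is only an illustration of what a unified patch contains; the function name `make_patch` and the file names are assumptions for the example, and the comment above recommends the command-line tools, not this code.

```python
import difflib


def make_patch(old_text, new_text,
               old_name="old_version.py", new_name="new_version.py"):
    """Return a unified diff between two versions of a file as one string."""
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(old_lines, new_lines,
                             fromfile=old_name, tofile=new_name)
    )


if __name__ == "__main__":
    before = "def answer():\n    return 41\n"
    after = "def answer():\n    return 42\n"
    # Only the changed line and a little context appear in the output,
    # which is why patches are small and easy to review.
    print(make_patch(before, after))
```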
From bugzilla-daemon at portal.open-bio.org Mon Apr 28 08:42:05 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 28 Apr 2008 08:42:05 -0400
Subject: [Biopython-dev] [Bug 2495] New: parse element symbols for ATOM/HETATM records (Bio.PDB.PDBParser)
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2495

           Summary: parse element symbols for ATOM/HETATM records
                    (Bio.PDB.PDBParser)
           Product: Biopython
           Version: 1.45
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: macrozhu+biopy at gmail.com

Hi,
the current Bio.PDB.PDBParser does not parse column 77-78 from ATOM records in
PDB files, where element symbols are (usually) stored for ATOM. We suggest
BioPython to parse this information in the next version. The reasons are given
as follows:

1. The current remediated PDB format requires these symbols to be always
present ( http://www.wwpdb.org/documentation/format3.1-20080211.pdf ), though
in old PDB files (v2.3), these symbols are sometimes missing.

2. In some cases it is not straightforward, if not impossible, to recognize
hydrogen atoms by their identifiers in the remediated PDB files. e.g. in 1AWW,

ATOM    378 HD11 LEU A  25      46.755  -3.858   0.453  1.00  0.00           H
ATOM    379 HD12 LEU A  25      47.178  -2.160   0.234  1.00  0.00           H
ATOM    380 HD13 LEU A  25      47.054  -3.226  -1.165  1.00  0.00           H
ATOM    381 HD21 LEU A  25      49.453  -1.483   0.307  1.00  0.00           H
ATOM    382 HD22 LEU A  25      50.714  -2.537  -0.327  1.00  0.00           H
ATOM    383 HD23 LEU A  25      49.413  -1.984  -1.381  1.00  0.00           H

In this PDB entry, chemical symbols (H) are not right justified in column 13-14
for hydrogen identifiers like for other elements. A bit extra work is required
to figure it out.

What's more, sometimes it's even impossible to distinguish hydrogen from
mercury without columns 77-78. From the PDB entry format description version
2.1:
"Hydrogen naming sometimes conflicts with IUPAC conventions. For example, a
hydrogen named HG11 in columns 13 - 16 is differentiated from a mercury atom by
the element symbol in columns 77 - 78. Columns 13 - 16 present a unique name
for each atom."

Therefore we strongly suggest PDBParser to cover column 77-78 for ATOM/HETATM
records. We have looked at relevant code and it seems three files (Atom.py,
PDBParser.py, StructureBuilder.py) needed to be revised marginally for
integrating this update:

1). in Atom.py CVS Revision 1.18
line 17: add one parameter "element" to the function Atom::__init__(...)
    def __init__(self, name, coord, bfactor, occupancy, altloc, fullname,
                 serial_number, element):
line 61: add line
    self.element = element
add a set method:
    def set_element(self, element):
        self.element = element
add a public method:
    def get_element(self):
        return self.element

2). in PDBParser.py CVS Revision 1.20
line 161: add one line to parse element symbol in function
PDBParser::_parse_coordinates(self, coords_trailer)
    element=line[76:78].strip()
line 182: add one more parameter to init_atom():
    structure_builder.init_atom(name, coord, bfactor, occupancy, altloc,
                                fullname, serial_number, element)

3). in StructureBuilder.py CVS Revision 1.16
line 158: add one parameter "element" to the function
    StructureBuilder::init_atom(self, name, coord, b_factor, occupancy,
                                altloc, fullname, serial_number=None,
                                element='')
line 190: add "element" to the initialization of Atom instance.
    atom=self.atom=myAtom(name, coord, b_factor, occupancy, altloc, fullname,
                          serial_number, element)
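[Editorial sketch] The slice proposed in this report, `line[76:78].strip()`, can be tried in isolation. The example below is a minimal sketch and not the Bio.PDB code itself: `format_atom_line` and `parse_element` are hypothetical helpers written for the illustration, and the fixed-width layout follows the standard PDB ATOM column positions (element symbol right-justified in columns 77-78).

```python
def format_atom_line(serial, name, resname, chain, resseq, x, y, z,
                     occupancy, bfactor, element):
    """Build a fixed-width ATOM record with the element in columns 77-78."""
    return (
        f"{'ATOM':<6}{serial:>5} {name:<4} {resname:>3} {chain}{resseq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{occupancy:6.2f}{bfactor:6.2f}"
        f"{'':10}{element:>2}"
    )


def parse_element(line):
    """Return the element symbol from columns 77-78, or '' if they are absent."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    # Python slices are zero-based and end-exclusive: [76:78] is columns 77-78.
    return line[76:78].strip()


if __name__ == "__main__":
    hydrogen = format_atom_line(378, "HD11", "LEU", "A", 25,
                                46.755, -3.858, 0.453, 1.00, 0.00, "H")
    print(repr(parse_element(hydrogen)))
    # Older files may stop after the B-factor; slicing then yields ''.
    print(repr(parse_element(hydrogen[:66])))
```

Because slicing past the end of a short line returns an empty string rather than raising, the same expression also copes with older files in which the element columns are missing, which matches the `element=''` default suggested for `init_atom` above.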
From bugzilla-daemon at portal.open-bio.org Tue Apr 1 00:57:35 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 31 Mar 2008 20:57:35 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804010057.m310vZqG029753@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ericgibert at yahoo.fr changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #890 is|0 |1 obsolete| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From tiagoantao at gmail.com Tue Apr 1 12:23:57 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 1 Apr 2008 13:23:57 +0100 Subject: [Biopython-dev] Bio.PopGen and CVS/SVN In-Reply-To: <508125.38490.qm@web62401.mail.re1.yahoo.com> References: <6d941f120803311208k6b6c9d1ah58c7808e0fbd0e2c@mail.gmail.com> <508125.38490.qm@web62401.mail.re1.yahoo.com> Message-ID: <6d941f120804010523k1edefb9aq8e57ad3137f66c59@mail.gmail.com> On Tue, Apr 1, 2008 at 1:13 PM, Michiel de Hoon wrote: > What is the advantage of branching? > AFAIK, the code in Bio.PopGen does not affect the rest of Biopython anyway. > All things equal, I'd prefer not to branch to keep things simple for users, > not to mention myself. The idea was to make things easier to you and new developers, actually. Mess on the branches and a clean trunk to be easy for new developers and easy releases. Merging would probably be the responsability of whoever is developing the code. So you would have a clean trunk and new people would also see something clean. 
I think the problem really stems from where biopython is going: if it is mainly maintenance mode then branching makes no sense (as new code is essencially refactoring and bug patching). If there are new features and modules poping in (which might bring initial chaos) then branching would be a good place to clean out the chaos before hitting the main trunk. Most of the code that I am adding now is actually quite pacific (although being the most important), but I was trying to avoid having the main trunk with code under heavy development. Tiago From mjldehoon at yahoo.com Tue Apr 1 12:13:22 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 1 Apr 2008 05:13:22 -0700 (PDT) Subject: [Biopython-dev] Bio.PopGen and CVS/SVN In-Reply-To: <6d941f120803311208k6b6c9d1ah58c7808e0fbd0e2c@mail.gmail.com> Message-ID: <508125.38490.qm@web62401.mail.re1.yahoo.com> What is the advantage of branching? AFAIK, the code in Bio.PopGen does not affect the rest of Biopython anyway. All things equal, I'd prefer not to branch to keep things simple for users, not to mention myself. --Michiel. Tiago Ant?o wrote: On Mon, Mar 31, 2008 at 8:04 PM, Peter wrote: > There is a lot to be said for having a single stable trunk - it > certainly makes things simpler for any new developers to get to grips > with things. It is one of those issues where there is no clear answer. Maybe a case by case analysis? I think having 5 gazillion branches would not be a good idea ever, but in the Biopython case many modules are somewhat self contained, making merging an easier exercise. Tiago _______________________________________________ Biopython-dev mailing list Biopython-dev at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython-dev --------------------------------- Special deal for Yahoo! users & friends - No Cost. 
Get a month of Blockbuster Total Access now From mjldehoon at yahoo.com Tue Apr 1 12:27:55 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 1 Apr 2008 05:27:55 -0700 (PDT) Subject: [Biopython-dev] Genbank dbSNP support In-Reply-To: <6d941f120803311513k43139fbi97683597c15f03a2@mail.gmail.com> Message-ID: <160489.24269.qm@web62402.mail.re1.yahoo.com> > Any plans for dbSNP support? > http://www.ncbi.nlm.nih.gov/SNP/index.html No existing plans, as far as I know. > I think I would volunteer to implement this. Great! > I think I would volunteer to implement this. A simple solution would > be to add both databases and return types. Michiel (I suppose this is > code that you are actively maintaining, or it is Peter?), can I send > you a diff? Opening a bug report on Bugzilla and adding your diff there is better. It's likely to get lost (i.e., forgotten) if it's in an email. Also, please have a look at Bio.Entrez (the module formerly known as Bio.WWW.NCBI). It has code for all of NCBI's EUtils, including efetch, except for parsers at this point. This is currently under development. Bio.Entrez is in release 1.45, but there are already some additions in CVS. --Michiel. --------------------------------- No Cost - Get a month of Blockbuster Total Access now. Sweet deal for Yahoo! users and friends. From mjldehoon at yahoo.com Tue Apr 1 12:52:17 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 1 Apr 2008 05:52:17 -0700 (PDT) Subject: [Biopython-dev] Bio.Entrez XML parsing In-Reply-To: <264855a00803301751h270ee34dg86325eb1af298369@mail.gmail.com> Message-ID: <685989.32313.qm@web62405.mail.re1.yahoo.com> I have added a read() function to Bio.Entrez in CVS. Following Peter's suggestion, I put a dictionary (_NameToModule) inside the Bio.Entrez.DataHandler class, which can be used to override the default parser with a user-defined parser. I am not sure though why a user-defined parser needs to go through Bio.Entrez.read(). 
Wouldn't it be easier to do something like

>>> from Bio import Entrez
>>> handle = Entrez.efetch(something)
>>> record = run_my_parser(handle)

Currently, I have added only one parser (for EInfo). To try it, use

>>> from Bio import Entrez
>>> handle = Entrez.einfo()
>>> record = Entrez.read(handle)
>>> print record
['pubmed', 'protein', 'nucleotide', 'nuccore', 'nucgss', 'nucest',
'structure', 'genome', 'books', 'cancerchromosomes', 'cdd', 'gap', 'domains',
'gene', 'genomeprj', 'gensat', 'geo', 'gds', 'homologene', 'journals', 'mesh',
'ncbisearch', 'nlmcatalog', 'omia', 'omim', 'pmc', 'popset', 'probe',
'proteinclusters', 'pcassay', 'pccompound', 'pcsubstance', 'snp', 'taxonomy',
'toolkit', 'unigene', 'unists']
# To get information about the snp database
>>> handle = Entrez.einfo(db="snp")
>>> record = Entrez.read(handle)
>>> print record["Count"]
44992036
>>> print record["LastUpdate"]
2007/11/29 18:22

--Michiel.

Sean Davis wrote:
This makes sense. However, it seems that there needs to be a way to
"register" a parser with read() so that users can extend their local
installation with a specialized parser. In other words, it seems that a way
to dynamically register a parser with read() would be helpful. Or am I
missing something?

Sean

From mjldehoon at yahoo.com Tue Apr 1 13:04:36 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 1 Apr 2008 06:04:36 -0700 (PDT)
Subject: [Biopython-dev] Bio.PopGen and CVS/SVN
In-Reply-To: <6d941f120804010523k1edefb9aq8e57ad3137f66c59@mail.gmail.com>
Message-ID: <526878.58305.qm@web62413.mail.re1.yahoo.com>

Tiago Antão wrote:
Most of the code that I am adding now is actually quite pacific
(although being the most important), but I was trying to avoid having
the main trunk with code under heavy development.
While I appreciate your consideration, personally I don't mind if some module of the main trunk is under heavy development, as long as it doesn't break other modules. Go ahead, knock yourself out :-). --Michiel. --------------------------------- You rock. That's why Blockbuster's offering you one month of Blockbuster Total Access, No Cost. From biopython at maubp.freeserve.co.uk Tue Apr 1 13:49:14 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 1 Apr 2008 14:49:14 +0100 Subject: [Biopython-dev] Bio.Entrez XML parsing In-Reply-To: <685989.32313.qm@web62405.mail.re1.yahoo.com> References: <264855a00803301751h270ee34dg86325eb1af298369@mail.gmail.com> <685989.32313.qm@web62405.mail.re1.yahoo.com> Message-ID: <320fb6e00804010649x4175396fr23d1f1993d1da11f@mail.gmail.com> Michiel wrote: > I have added a read() function to Bio.Entrez in CVS. > Following Peter's suggestion, I put a dictionary (_NameToModule) inside the > Bio.Entrez.DataHandler class, which can be used to override the default parser > with a user-defined parser. Do you only intend to support Entrez XML files with this read() function, or potentially other formats too? Even for the assorted XML formats, I'm not yet clear on how you imaging this being extended. Have you had a chance to look at Eric's Entrez Taxonomy XML parser? It would need some re-factoring to fit in (see attachments on Bug 2475). http://bugzilla.open-bio.org/show_bug.cgi?id=2475 > I am not sure though why a user-defined parser needs to go through > Bio.Entrez.read(). Wouldn't it be easier to do something like > >>> from Bio import Entrez > >>> handle = Entrez.efetch(something) > >>> record = run_my_parser(handle) Sure - you could pass the handle to any parser of your choice, e.g. Bio.SeqIO.read() or Bio.SeqIO.parse() if you used Bio.Entrez.efetch to get a GenBank or Fasta file. 
Peter From tiagoantao at gmail.com Tue Apr 1 14:22:43 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 1 Apr 2008 15:22:43 +0100 Subject: [Biopython-dev] Bio.PopGen and CVS/SVN In-Reply-To: <526878.58305.qm@web62413.mail.re1.yahoo.com> References: <6d941f120804010523k1edefb9aq8e57ad3137f66c59@mail.gmail.com> <526878.58305.qm@web62413.mail.re1.yahoo.com> Message-ID: <6d941f120804010722va1ded18q1e5e3c69ebc6c7c8@mail.gmail.com> On Tue, Apr 1, 2008 at 2:04 PM, Michiel de Hoon wrote: > Tiago Ant?o wrote: > Most of the code that I am adding now is actually quite pacific > (although being the most important), but I was trying to avoid having > the main trunk with code under heavy development. > While I appreciate your consideration, personally I don't mind if some > module of the main trunk is under heavy development, as long as it doesn't > break other modules. Go ahead, knock yourself out :-). I will do that when SVN is online ;) From mjldehoon at yahoo.com Tue Apr 1 14:23:29 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 1 Apr 2008 07:23:29 -0700 (PDT) Subject: [Biopython-dev] Bio.Entrez XML parsing In-Reply-To: <320fb6e00804010649x4175396fr23d1f1993d1da11f@mail.gmail.com> Message-ID: <246475.95559.qm@web62405.mail.re1.yahoo.com> > Do you only intend to support Entrez XML files with this read() > function, or potentially other formats too? As all of Entrez's EUtils can return XML output (with many of them returning XML only), I was thinking of parsing XML files only. EUtils output in one of the sequence formats ought to be parsed by Bio.SeqIO. I am not sure if there are any other major file formats that we should handle. We can think about that later if and when the need arises. > Even for the assorted XML formats, I'm not yet clear on how you > imaging this being extended. This I am not clear on either; I just added this in response to Sean's request so we have some concrete code to look at. 
Sean, could you give an example of how you would extend (this or a different) parser? > Have you had a chance to look at Eric's Entrez Taxonomy XML > parser? It would need some re-factoring to fit in (see attachments > on Bug 2475). > http://bugzilla.open-bio.org/show_bug.cgi?id=2475 Eric uses a DOM parser, while I am using a SAX parser. DOM parsers have the advantage that they allow modification of the XML tree, whereas SAX just goes through the XML in one pass. SAX is preferable for large files, since DOM keeps the full XML file in memory, but maybe it is not so relevant for NCBI's EUtils. Anyway, if the end result is a Python object representing the XML, it doesn't matter much whether we go through DOM or SAX. Eric, do you have a strong preference for DOM? Once we have the basic framework for the Bio.Entrez parser settled, we can merge it with Eric's code. --Michiel --------------------------------- You rock. That's why Blockbuster's offering you one month of Blockbuster Total Access, No Cost. From bugzilla-daemon at portal.open-bio.org Wed Apr 2 08:58:05 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 2 Apr 2008 04:58:05 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804020858.m328w5HX024442@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #8 from mdehoon at ims.u-tokyo.ac.jp 2008-04-02 04:58 EST ------- Eric, I was looking at the part of your code that parses the XML. I see you use a DOM parser instead of a SAX parser. For Bio.Entrez in general, I have a slight preference for a SAX parser, since it does not require to have the full XML in memory. Do you have a strong preference for a DOM parser? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Wed Apr 2 09:31:44 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 2 Apr 2008 05:31:44 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804020931.m329Vi9p026094@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #9 from ericgibert at yahoo.fr 2008-04-02 05:31 EST ------- Well, I always use DOM as I find it easier whereas SAX seems cumbersome. I can try to convert if you want... Another question: do you want me to add an extra function in the class to update the tables taxon/taxon_name or you prefer it to be done in the loader.py? Let me know and I'll provide that missing code. I am already thinking about it with the default to be taxon_id == ncbi_taxon_id, and if not then taxon_id will be autogenerated. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Apr 2 09:45:14 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 2 Apr 2008 05:45:14 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804020945.m329jE0u026758@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #10 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-02 05:45 EST ------- For the organisation of the code, what I had in mind was a general purpose XML parser in Bio.Entrez.Taxonomy (with nothing to do with BioSQL), which would be called from an updated BioSQL.Loader to parse a handle to the XML data fetched using Bio.Entrez.efetch(). 
When adding a new SeqRecord to the BioSQL datanase, we would start with its NCBI taxon ID, and assuming its not already in the database, go online to find the parent taxon ID, and repeat until we match the ID of an existing taxon record in the database (or get to the root node). And then add all the new taxon records to the database. [I hope this is roughly the process you had in mind Eric] -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Apr 2 10:56:24 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 2 Apr 2008 06:56:24 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804021056.m32AuOl9029219@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #11 from ericgibert at yahoo.fr 2008-04-02 06:56 EST ------- Ok to have the code in Loader.py. When a SeqRecord is to be added, we create a Taxonomy instance. I will add a function to return a copy of the _NCBI_lineage list. Then from "top" to "bottom", check if the taxon exists, if not, add it, until the species itself (there ensure that the parent_taxon_id is well populated). By default, we assume that taxon_id == NCBI_taxon_id. If this is not the case, do I raise an error or "fall to plan B" and let the database to auto assign the taxon_id? On missing point: the left and right value. Do you know what to do? I have run the Perl script on a test database and plan to look into the created records to clarify it... but you can save me the effort if you already know their logic. PS: because my original script was only updating the partial records created by the previous algorithm of Loader, I need to rewrite it. Maybe 2 man.day. 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Apr 2 11:52:18 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 2 Apr 2008 07:52:18 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804021152.m32BqIcD031558@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #12 from mdehoon at ims.u-tokyo.ac.jp 2008-04-02 07:52 EST ------- Peter wrote: > For the organisation of the code, what I had in mind was a general purpose XML > parser in Bio.Entrez.Taxonomy (with nothing to do with BioSQL), which would be > called from an updated BioSQL.Loader to parse a handle to the XML data fetched > using Bio.Entrez.efetch(). That is what I have in mind also. Eric wrote: > Well, I always use DOM as I find it easier whereas SAX seems cumbersome. > I can try to convert if you want... I can understand that for your own work, you prefer to use DOM to just pick up the tags you are interested in. For a Biopython, though, we should have a more general solution that is useful to other users and in other situations also. Which is why I was thinking of a parser in Bio.Entrez that parses all of the XML returned from the Taxonomy database. If you're interested in writing a full parser for Taxonomy XML, maybe the parser for EInfo that is currently in CVS may be useful as an example. For ESearch, I already wrote a SAX parser; I just need to upload it to CVS. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
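[Editorial sketch] The SAX style discussed in this thread can be illustrated with Python's built-in `xml.sax` module. This is not the Bio.Entrez parser that was in CVS at the time. It is a minimal sketch that assumes a hand-written sample resembling EInfo output (`eInfoResult`, `DbList`, `DbName` elements), and the class and function names are hypothetical.

```python
import xml.sax


class DbNameHandler(xml.sax.ContentHandler):
    """Collect the text of every <DbName> element in a single pass."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._inside_dbname = False
        self._chunks = []

    def startElement(self, name, attrs):
        if name == "DbName":
            self._inside_dbname = True
            self._chunks = []

    def characters(self, content):
        # SAX may deliver the text of one element in several pieces.
        if self._inside_dbname:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "DbName":
            self.names.append("".join(self._chunks).strip())
            self._inside_dbname = False


def parse_db_names(xml_bytes):
    """Return the database names found in an EInfo-like XML document."""
    handler = DbNameHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.names


# Hand-written sample for illustration, not a captured NCBI response.
SAMPLE = (
    b"<eInfoResult><DbList>"
    b"<DbName>pubmed</DbName><DbName>snp</DbName><DbName>taxonomy</DbName>"
    b"</DbList></eInfoResult>"
)

if __name__ == "__main__":
    print(parse_db_names(SAMPLE))
```

The handler keeps only the state it needs (a flag and a list of text chunks), so memory use does not depend on the size of the document; that is the trade-off against DOM raised in the comment above.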
From anaryin at gmail.com Wed Apr 2 12:14:20 2008
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Wed, 2 Apr 2008 13:14:20 +0100
Subject: [Biopython-dev] Bio.Entrez XML parsing
Message-ID:

Hey all, I don't know if I'm gonna say something really stupid, because
really, I haven't read much of the discussion. From what I read you're
discussing if we can use a parser for the XML given by most entrez methods.
If you want my advice, I'd use libxml2. I'm currently working with
efetch/esearch to get PMIDs and Abstracts and I use XML as my return mode
and libxml2 as my parser. It is quite simple to use and quite faster when
compared to, for example, minidom. If you need some example code, or some
hand in this, I'll be more than glad to lend pieces of my code :)

Best regards!

From biopython at maubp.freeserve.co.uk Wed Apr 2 13:07:59 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 2 Apr 2008 14:07:59 +0100
Subject: [Biopython-dev] Bio.Entrez XML parsing
In-Reply-To:
References:
Message-ID: <320fb6e00804020607s53d171d6s503756ac65de875a@mail.gmail.com>

On Wed, Apr 2, 2008 at 1:14 PM, João Rodrigues wrote:
> Hey all, I don't know if I'm gonna say something really stupid, because
> really, I haven't read much of the discussion. From what I read you're
> discussing if we can use a parser for the XML given by most entrez methods.
> If you want my advice, I'd use libxml2. I'm currently working with
> efetch/esearch to get PMIDs and Abstracts and I use XML as my return mode
> and libxml2 as my parser. It is quite simple to use and quite faster when
> compared to, for example, minidom. If you need some example code, or some
> hand in this, I'll be more than glad to lend pieces of my code :)
>
> Best regards!

Do you know how libxml2 compares to the python SAX XML parser for speed?

One big downside is libxml2 would be yet another external dependency
for Biopython. I would be much happier if we could stick to the built
in python libraries.
Peter From bugzilla-daemon at portal.open-bio.org Wed Apr 2 13:41:35 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 2 Apr 2008 09:41:35 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804021341.m32DfZYO006566@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #13 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-02 09:41 EST ------- In reply to comment 11, > Ok to have the code in Loader.py. > When a SeqRecord is to be added, we create a Taxonomy instance. > I will add a function to return a copy of the _NCBI_lineage list. > Then from "top" to "bottom", check if the taxon exists, if not, > add it, until the species itself (there ensure that the > parent_taxon_id is well populated). Something like that sounds fine. But I think we should settle the Bio.Entrez.Taxonomy code first. > By default, we assume that taxon_id == NCBI_taxon_id. Why do you say that? I don't think we should make this assumption. See also BioSQL project Bug 2470 > If this is not the case, do I raise an error or > "fall to plan B" and let the database to auto assign > the taxon_id? I am inclined to let the database assign the taxon_id, unless after discussion on the BioSQL mailing list it is agreed that "attempting" to use the NCBI taxon id as the taxon_id is encouraged. > On missing point: the left and right value. Do you know > what to do? I have run the Perl script on a test database > and plan to look into the created records to clarify it... > but you can save me the effort if you already know their logic. Sorry, I haven't yet gone through this enough to be confident in the correct usage (and Brad's comments in the relevant old bit of Loader.py wasn't very helpful). It might be worth discussing this on the BioSQL mailing list. 
Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From tiagoantao at gmail.com Wed Apr 2 16:57:20 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Wed, 2 Apr 2008 17:57:20 +0100 Subject: [Biopython-dev] Statistics code Message-ID: <6d941f120804020957o400f8160k6bb1ca236b75bc2f@mail.gmail.com> Hi, Is it OK to add a dependency on SciPy? Would only influence Bio.PopGen, of course (part of it, actually). Test code would be written so that it does not fail the test suite if the library is missing. As I see it, it is a bit like adding a dependency on an external program (which already happens). If the dependency is not there then that part is non-functional, but all the rest is OK. -- http://www.tiago.org From biopython at maubp.freeserve.co.uk Wed Apr 2 17:20:54 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 2 Apr 2008 18:20:54 +0100 Subject: [Biopython-dev] Statistics code In-Reply-To: <6d941f120804020957o400f8160k6bb1ca236b75bc2f@mail.gmail.com> References: <6d941f120804020957o400f8160k6bb1ca236b75bc2f@mail.gmail.com> Message-ID: <320fb6e00804021020n2a910a80x79ee6cc3829d5e51@mail.gmail.com> On Wed, Apr 2, 2008 at 5:57 PM, Tiago Antão wrote: > Hi, > > Is it OK to add a dependency on SciPy? Would only influence > Bio.PopGen, of course (part of it, actually). It's not out of the question, but what exactly do you need from SciPy? If it's a very simple thing we might be better off just duplicating it, e.g. in the Bio.Statistics module.
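
To make the trade-off concrete, here is a minimal sketch assuming the needed piece is something as small as a chi-square statistic. The statistic itself is plain Python; only the p-value uses SciPy, and that import is optional, so a missing SciPy disables one function instead of breaking the module. The function names are hypothetical and nothing here is existing Bio.Statistics or Bio.PopGen code.

```python
try:
    # Optional dependency: only needed for p-values.
    from scipy.stats import chi2
except ImportError:
    chi2 = None


def chi_square_statistic(observed, expected):
    """Pearson's chi-square statistic: sum of (O - E)^2 / E over all classes."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / float(e) for o, e in zip(observed, expected))


def chi_square_p_value(statistic, degrees_of_freedom):
    """Return the upper-tail p-value, or None if SciPy is not installed."""
    if chi2 is None:
        return None
    return float(chi2.sf(statistic, degrees_of_freedom))
```

A test for such code could skip the p-value check when chi_square_p_value returns None, which is one way to keep the test suite passing on machines without SciPy.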
Peter From tiagoantao at gmail.com Wed Apr 2 17:35:49 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Wed, 2 Apr 2008 18:35:49 +0100 Subject: [Biopython-dev] Statistics code In-Reply-To: <320fb6e00804021020n2a910a80x79ee6cc3829d5e51@mail.gmail.com> References: <6d941f120804020957o400f8160k6bb1ca236b75bc2f@mail.gmail.com> <320fb6e00804021020n2a910a80x79ee6cc3829d5e51@mail.gmail.com> Message-ID: <6d941f120804021035w2eb864bdo27fbe47b4be6c923@mail.gmail.com> On Wed, Apr 2, 2008 at 6:20 PM, Peter wrote: > It's not out of the question, but what exactly do you need from SciPy? > If it's a very simple thing we might be better off just duplicating it, > e.g. in the Bio.Statistics module. I know that this will sound ridiculous, but, in the long run I will need almost everything. Statistics was invented because of, and for, population genetics. http://en.wikipedia.org/wiki/Ronald_Fisher It is at the core of population genetics (actually, without it, Bio.PopGen is really nothing relevant). Bio.Statistics has not much content now... But this requirement could be completely isolated (as is the requirement for external programs). Not having a statistical library would only mean not using Bio.PopGen.Stats. The impact on other modules would be null. I would imagine this problem (requiring external libraries) to appear from time to time as new functionality is included. It will only not appear if Biopython stops in time. SciPy is a stable project, not an obscure library. Tiago From mjldehoon at yahoo.com Thu Apr 3 00:13:15 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 2 Apr 2008 17:13:15 -0700 (PDT) Subject: [Biopython-dev] Statistics code In-Reply-To: <6d941f120804021035w2eb864bdo27fbe47b4be6c923@mail.gmail.com> Message-ID: <806128.68558.qm@web62410.mail.re1.yahoo.com> > > It's not out of the question, but what exactly do you need from SciPy? > > I know that this will sound ridiculous, but, in the long run I will > need almost everything.
I don't think we should include a dependency now because we may need it in the long run. > Bio.Statistics has not much content now... I agree, and probably it is not a good idea to have too much statistics code in Biopython. Such code would fit in better in a numerical or statistics library. > SciPy is a stable project, not an obscure library. While this is true, in my experience SciPy is also difficult to install. It may mean fewer people using your code because they don't want to go through the hassle of installing SciPy. Particularly users coming from a biology rather than a computer science background. Previously we also discussed switching from the old Numerical Python to the new NumPy. I've heard rumors that the NumPy documentation will be declared open at the SciPy conference this year. Not having this documentation was my biggest argument against NumPy. In my understanding, NumPy has more functionality than Numeric. Maybe it has better statistics support also? --Michiel From tiagoantao at gmail.com Thu Apr 3 00:55:20 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 3 Apr 2008 01:55:20 +0100 Subject: [Biopython-dev] Statistics code In-Reply-To: <806128.68558.qm@web62410.mail.re1.yahoo.com> References: <6d941f120804021035w2eb864bdo27fbe47b4be6c923@mail.gmail.com> <806128.68558.qm@web62410.mail.re1.yahoo.com> Message-ID: <6d941f120804021755x52be0194lb3135a0153813405@mail.gmail.com> On Thu, Apr 3, 2008 at 1:13 AM, Michiel de Hoon wrote: > > > Its not out of the question, but what exactly do you need from SciPy? > > > > I know that this will sound ridiculous, but, in the long run I will > > need almost everything. > > I don't think we should include a dependency now because we may need it in > the long run. I already need it now, but just for a very small thing: The chi-square test.
It is quite easy to reimplement. If it ends up being just chi-square (which I doubt, but I might be able to externalize to the user the conventional stats part), then I think the best thing would be just to reimplement and not to force the dependency. But I think that I will need to use more stats stuff as I implement functionality. The point is, IF I need to use it extensively, can I go ahead? If I end up needing just a couple of functions I would not mind implementing them myself (but more than that is too much work, and as you say I don't think it makes sense for Biopython to be also biostats). > > > SciPy is a stable project, not an obscure library. > While this is true, in my experience SciPy is also difficult to install. It > may mean fewer people using your code because they don't want to go through > the hassle of installing SciPy. Particularly users coming from a biology > rather than a computer science background. And poorly documented also, in my view. But population genetics is actually 90% statistics. One doesn't do population genetics without statistics. So, if one does pop gen then some kind of statistical processing will have to exist somewhere. If SciPy is difficult to install on Windows/Mac then there is an adoption problem as you point out (I am on Linux/Ubuntu, in this setup it is trivial to install), but I don't see a way around statistics for anyone that wants to do population genetics (again, statistics was invented for population genetics, it is really core for us). Of course, better solutions than SciPy might exist... > Previously we also discussed switching from the old Numerical Python to the > new NumPy. I've heard rumors that the NumPy documentation will be declared > open at the SciPy conference this year. Not having this documentation was my > biggest argument against NumPy. In my understanding, NumPy has more > functionality than Numeric. Maybe it has better statistics support also?
It says on http://www.scipy.org/Documentation : "fee based until SciPy 2008". I think that NumPy has only basic stuff (standard deviation, mean). I might be wrong, but my research points to that. To sum it up: 1. It is still not clear to me that I will need a stats library, most probably yes. 2. I won't mind reimplementing some stats stuff in Biopython as long as it is little work in order to avoid a dependency. I can try as much as possible to avoid a dependency. 3. The dependency (in case it appears) would be of zero impact outside of Bio.PopGen.Stats (maybe just setup.py to optionally allow using scipy). 4. I need to know "the rules of the game" before I write more code (in order to know what I can or cannot use, in case I need to use it). Tiago PS - In the spirit of cascade software development I could do an a priori study of the requirement, but I really don't believe the conclusion would be reliable. From mjldehoon at yahoo.com Thu Apr 3 09:49:45 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 3 Apr 2008 02:49:45 -0700 (PDT) Subject: [Biopython-dev] Statistics code In-Reply-To: <6d941f120804021755x52be0194lb3135a0153813405@mail.gmail.com> Message-ID: <434068.89146.qm@web62413.mail.re1.yahoo.com> > I already need it now, but just for a very small thing: The chi-square > test. It is quite easy to reimplement. If it ends up being just > chi-square (which I doubt, but I might be able to externalize to the > user the conventional stats part), then I think the best thing would > be just to reimplement and not to force the dependency. But I think > that I will need to use more stats stuff as I implement functionality. One solution is just to copy and paste whatever statistics code you need from SciPy. > I think that NumPy has only basic stuff (standard deviation, mean). I > might be wrong, but my research points to that. The ideal solution would be to move the statistics stuff from SciPy to NumPy, or to expand the statistics stuff currently in NumPy.
Since SciPy and NumPy come from the same group of developers, they may not mind too much. Having a statistics library in NumPy would be a big encouragement to move from Numeric to NumPy. > 3. The dependency (in case it appears) would be of zero impact outside > of Bio.PopGen.Stats (maybe just setup.py to optionally allow using > scipy). In practice, when I make the Biopython releases it's special situations like these that cause trouble. For example, if I don't install SciPy on Windows, I can't test Bio.PopGen.Stats there, and errors will go unnoticed. This has happened in the previous Biopython releases. > 4. I need to know "the rules of the game" before I write more code (in > order to know what I can or cannot use, in case I need to use it). I would strongly encourage not adding any new dependencies to Biopython. We have too many already; I was actually hoping that the number of dependencies could be reduced. --Michiel. From sbassi at gmail.com Thu Apr 3 12:59:05 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Thu, 3 Apr 2008 09:59:05 -0300 Subject: [Biopython-dev] Code of BLAST XML to HTML Message-ID: Here is the code to convert a BLAST XML file (with one or multiple entries) into one or multiple HTML file(s). It is working with my input files (from BLAST 2.2.17 and 18). See and download from: http://www.pastecode.com.ar/f1abb1fbb I don't know whether this should be included in Biopython or not; I am sending it to the list since somebody may find it useful anyway. Best, SB.
-- Molecular biology course for programmers: http://tinyurl.com/2vv8w6 Bioinformatics news: http://www.bioinformatica.info Free Python tutorial: http://tinyurl.com/2az5d5 From bugzilla-daemon at portal.open-bio.org Thu Apr 3 23:25:40 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 3 Apr 2008 19:25:40 -0400 Subject: [Biopython-dev] [Bug 2480] New: Local BLAST fails: Spaces in Windows file-path values Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2480 Summary: Local BLAST fails: Spaces in Windows file-path values Product: Biopython Version: 1.45 Platform: PC OS/Version: Windows XP Status: NEW Severity: blocker Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: drpatnaik at yahoo.com I am a new user trying Python on a Windows XP SP2 machine on which I do not have admin rights. Consequently, Python itself as well as all of the files/executables I work with have file-paths that contain spaces (e.g., python.exe is at C:\Documents and settings\username...). When I try to perform a local BLAST using code mentioned in one of the Biopython tutorials, the BLAST fails. I use the following code to capture the error:

my_blast_db = r"C:/Documents and Settings/patnaik/My Documents/blast/bin/mine"
my_blast_file = r"C:/Documents and Settings/patnaik/My Documents/blast/bin/hairpin"
my_blast_exe = r'C:\Documents and Settings\patnaik\My Documents\blast\bin\blastall.exe'
from Bio.Blast import NCBIStandalone
result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe, "blastn", my_blast_db, my_blast_file)
error_results = error_handle.read()
save_file = open(r"C:/Documents and Settings/patnaik/My Documents/blast/bin/my_blast_error", "w")
save_file.write(error_results)
save_file.close()

The error reported is: 'C:\Documents' is not recognized as an internal or external command, operable program or batch file.
There thus seems to be some issue because of the spaces in the file-paths. Can this be resolved by appropriately replacing 'os.popen3' with 'subprocess.call' in Bio/Blast/NCBIStandalone.py? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Apr 4 00:00:56 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 3 Apr 2008 20:00:56 -0400 Subject: [Biopython-dev] [Bug 2481] New: bitscore not parsed. Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2481 Summary: bitscore not parsed. Product: Biopython Version: 1.45 Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: sbassi at gmail.com

>>> from Bio.Blast import NCBIXML
>>> fr=NCBIXML.parse(open('/media/disk/GENES/INTA9/BLAST/seqspMOS.xml')).next()
>>> fr.descriptions[0].title
u'gnl|BL_ORD_ID|0 pMOS (vector para mt INTA)'
>>> fr.descriptions[0].bits
Traceback (most recent call last): File "", line 1, in AttributeError: Description instance has no attribute 'bits'

To fix this, two files must be modified: NCBIXML.py and Record.py.
In NCBIXML.py, 2 changes:
Line 94: "method = self._secure_name('_end_' + name.replace("-","_"))" # the name in the XML file is "bit-score" and the function should be named after it, but it can only be named "_end_Hsp_bit_score"; hence changing - to _ resolves the issue and should not disturb the rest. This could/should also be applied at line 63 for StartElement.
Lines 409-410 uncommented.
In Record.py: Add "self.bits = None" # in line 68
This bug was reported to me by Yoan Jacquemin when testing my code to convert BLAST XML output to HTML.
After applying these modifications, it works:

>>> from Bio.Blast import NCBIXML
>>> f_in='/mnt/hda2/bio/3vsT.xml'
>>> fr=NCBIXML.parse(open(f_in)).next()
>>> fr.descriptions[0].bits
32.210500000000003

-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Apr 4 00:27:16 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 3 Apr 2008 20:27:16 -0400 Subject: [Biopython-dev] [Bug 2480] Local BLAST fails: Spaces in Windows file-path values In-Reply-To: Message-ID: <200804040027.m340RGp1003920@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2480 ------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp 2008-04-03 20:27 EST ------- Could you paste here the exact error shown by Python? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Apr 4 06:15:54 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Apr 2008 02:15:54 -0400 Subject: [Biopython-dev] [Bug 2480] Local BLAST fails: Spaces in Windows file-path values In-Reply-To: Message-ID: <200804040615.m346FsWY020451@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2480 ------- Comment #2 from drpatnaik at yahoo.com 2008-04-04 02:15 EST ------- (In reply to comment #1) I do not see any error/warning notice in the command console.
The reason that I suspect something is wrong is that the Blast results output file I get using the code below doesn't have any content (empty file): [snip -- code same as that in previous post] output_results = result_handle.read() save_file = open(r"C:/Documents and Settings/patnaik/My Documents/blast/bin/my_blast_output", "w") save_file.write(output_results) save_file.close() To capture any error, I used the code I mention in my first post. And that is where I find "'C:\Documents' is not recognized as an internal or external command, operable program or batch file." -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Apr 4 08:37:45 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Apr 2008 04:37:45 -0400 Subject: [Biopython-dev] [Bug 2480] Local BLAST fails: Spaces in Windows file-path values In-Reply-To: Message-ID: <200804040837.m348bjXJ027534@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2480 ------- Comment #3 from mdehoon at ims.u-tokyo.ac.jp 2008-04-04 04:37 EST ------- Please try my_blast_exe = r'"C:\Documents and Settings\patnaik\My Documents\blast\bin\blastall.exe"' (note the extra " in the command). If it works, an easy solution would be to add the " to Bio/Blast/NCBIStandalone.py before calling os.popen3. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
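
A minimal sketch of the subprocess route raised in the original report, assuming Python 2.4 or later. Passing the command as a list of arguments, without a shell, leaves any platform-specific quoting to Python, so a path with spaces can be passed through unquoted. The function names are hypothetical, the -p, -d and -i options are blastall's program, database and input-file options, and this is not the code in Bio/Blast/NCBIStandalone.py.

```python
import subprocess


def build_blastall_command(blastcmd, program, database, infile):
    """Return the command as an argument list; no manual quoting is needed."""
    return [blastcmd, "-p", program, "-d", database, "-i", infile]


def run_command(args):
    """Run args without a shell and return (returncode, stdout, stderr)."""
    process = subprocess.Popen(args,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    out, err = process.communicate()
    return process.returncode, out, err
```

With this shape, os.path.exists can still be called on the first list element as-is, since it never carries extra quote characters.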
From bugzilla-daemon at portal.open-bio.org Fri Apr 4 08:46:13 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Apr 2008 04:46:13 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804040846.m348kDlB027900@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #14 from ericgibert at yahoo.fr 2008-04-04 04:46 EST ------- Created an attachment (id=892) --> (http://bugzilla.open-bio.org/attachment.cgi?id=892&action=view) Recoding of the Taxonomy parser using SAX All attributes found in the XML document are now parsed and are stored as properties. Please look at the header's explanation or the tests at the end of the code for examples. Please let me know if this is ok. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Apr 4 13:14:57 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Apr 2008 09:14:57 -0400 Subject: [Biopython-dev] [Bug 2481] bitscore not parsed. In-Reply-To: Message-ID: <200804041314.m34DEv6D009448@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2481 ------- Comment #1 from sbassi at gmail.com 2008-04-04 09:14 EST ------- Created an attachment (id=893) --> (http://bugzilla.open-bio.org/attachment.cgi?id=893&action=view) Corrected NCBIXML (for bit parsing) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Apr 4 13:17:47 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Apr 2008 09:17:47 -0400 Subject: [Biopython-dev] [Bug 2481] bitscore not parsed. In-Reply-To: Message-ID: <200804041317.m34DHlEx009715@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2481 ------- Comment #2 from sbassi at gmail.com 2008-04-04 09:17 EST ------- Created an attachment (id=894) --> (http://bugzilla.open-bio.org/attachment.cgi?id=894&action=view) Record modified for "bit" parsing (bit from XML blast) Both files must be applied together. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Apr 4 20:01:01 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Apr 2008 16:01:01 -0400 Subject: [Biopython-dev] [Bug 2480] Local BLAST fails: Spaces in Windows file-path values In-Reply-To: Message-ID: <200804042001.m34K11LP030328@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2480 ------- Comment #4 from drpatnaik at yahoo.com 2008-04-04 16:01 EST ------- (In reply to comment #3) Python throws this error when I do so: File "C:\Documents and Settings\patnaik\My Documents\Python252\lib\site-packages\Bio\Blast\NCBIStandalone.py", line 1650, in blastall raise ValueError, "blastall does not exist at %s" % blastcmd ValueError: blastall does not exist at "C:\Documents and Settings\patnaik\My Documents\blast\bin\blastall.exe" I get a similar error using: my_blast_exe = r'"C:/Documents and Settings/patnaik/My Documents/blast/bin/blastall.exe"' blastall.exe _is_ at C:/Documents and Settings/patnaik/My Documents/blast/bin/blastall.exe -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this 
mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Apr 5 03:10:58 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Apr 2008 23:10:58 -0400 Subject: [Biopython-dev] [Bug 2481] bitscore not parsed. In-Reply-To: Message-ID: <200804050310.m353Awmd016642@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2481 ------- Comment #3 from mdehoon at ims.u-tokyo.ac.jp 2008-04-04 23:10 EST ------- Isn't the bit score in the hsps? Using xbt001.xml from the Biopython test suite, I find >>> fr.alignments[0].hsps[0].score 469.0 >>> fr.alignments[0].hsps[0].bits 185.267 Otherwise, which line in xbt001.xml is not being parsed? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Apr 5 03:14:28 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 4 Apr 2008 23:14:28 -0400 Subject: [Biopython-dev] [Bug 2480] Local BLAST fails: Spaces in Windows file-path values In-Reply-To: Message-ID: <200804050314.m353ESag016750@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2480 mdehoon at ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- Severity|blocker |normal ------- Comment #5 from mdehoon at ims.u-tokyo.ac.jp 2008-04-04 23:14 EST ------- In the line shown in the Python error message, it is trying os.path.exists(your_path) with your_path = "C:/Documents and Settings/patnaik/My Documents/blast/bin/blastall.exe" and doesn't find your_path if it includes the extra ". Can you play a bit with >>> import os >>> os.path.exists(your_path) >>> os.system(your_path) to see which variation (if any) works for both? 
I mean with ", without ", maybe trying '? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Apr 5 04:01:23 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 5 Apr 2008 00:01:23 -0400 Subject: [Biopython-dev] [Bug 2480] Local BLAST fails: Spaces in Windows file-path values In-Reply-To: Message-ID: <200804050401.m3541NH5018822@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2480 ------- Comment #6 from mdehoon at ims.u-tokyo.ac.jp 2008-04-05 00:01 EST ------- After some googling, it looks like your original suggestion to use subprocess.call is probably the best solution. However, it requires Python >= 2.4, whereas currently we require Python >= 2.3. Does anybody have an objection against requiring Python >= 2.4? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Apr 5 05:24:35 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 5 Apr 2008 01:24:35 -0400 Subject: [Biopython-dev] [Bug 2481] bitscore not parsed. In-Reply-To: Message-ID: <200804050524.m355OZVU022917@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2481 ------- Comment #4 from sbassi at gmail.com 2008-04-05 01:24 EST ------- (In reply to comment #3) > Isn't the bit score in the hsps? Yes, it is. But this is not the only place it should be, keep on reading. > Otherwise, which line in xbt001.xml is not being parsed? This is not the problem (we are not missing a line from being parsed from the xml). The problem is that "bit" is not in "description" as it should be. Why it should be in description? 
Take this example: >>> fr.alignments[0].hsps[0].expect 8.32193 But you also have "expect" in "descriptions": >>> fr.descriptions[0].e 8.32193 Another example: >>> fr.alignments[0].hsps[0].score 16.0 and >>> fr.descriptions[0].score 16.0 "descriptions" corresponds to the description table in BLAST HTML documents, all values from table should be there. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Apr 5 10:03:30 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 5 Apr 2008 06:03:30 -0400 Subject: [Biopython-dev] [Bug 2481] bitscore not parsed. In-Reply-To: Message-ID: <200804051003.m35A3Uwu002667@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2481 mdehoon at ims.u-tokyo.ac.jp changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |biopython- | |bugzilla at maubp.freeserve.co. | |uk ------- Comment #5 from mdehoon at ims.u-tokyo.ac.jp 2008-04-05 06:03 EST ------- OK, I see. Three more comments though: 1) In NCBIXML.py: method = self._secure_name('_end_' + name.replace("-","_")) I don't see why name.replace("-","_") is needed. Isn't that the purpose of self._secure_name in the first place? 2) In Record.py: Please add "bits" with a description to the docstring (lines 58-65). 3) About your change: > Lines 409-410 uncommented I wonder why these lines were commented out in the first place. It was done in revision 1.8 of NCBIXML.py, but I didn't see any explanation as to why those lines were commented out. Peter, do you know? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
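
The name.replace("-","_") question can be illustrated with a small sketch. An XML element name such as Hsp_bit-score contains a hyphen, which cannot appear in a Python method name, so a handler that looks up methods by element name has to map the hyphen to something else first. The class below is hypothetical and is not the NCBIXML parser; it leaves out _secure_name, whose behavior is not shown in this thread.

```python
class TinyBlastHandler(object):
    """Hypothetical handler that dispatches on XML element names."""

    def __init__(self):
        self.bits = None

    def end_element(self, name, text):
        # "Hsp_bit-score" becomes "_end_Hsp_bit_score", a legal method name.
        method_name = "_end_" + name.replace("-", "_")
        method = getattr(self, method_name, None)
        if method is not None:
            method(text)  # elements without a matching method are ignored

    def _end_Hsp_bit_score(self, text):
        self.bits = float(text)
```

Without the replace step, the lookup would be for "_end_Hsp_bit-score", which no method can be named, so the bit score would be silently skipped.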
From bugzilla-daemon at portal.open-bio.org Sat Apr 5 12:33:09 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 5 Apr 2008 08:33:09 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804051233.m35CX95L008244@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #15 from ericgibert at yahoo.fr 2008-04-05 08:33 EST ------- Created an attachment (id=895) --> (http://bugzilla.open-bio.org/attachment.cgi?id=895&action=view) Parser for Taxonomic Data from NCBI for Bio.Entrez All right, following Michiel's email, I wrote this third version, based on the existing parser in Bio.Entrez CVS. Integration with Bio.Entrez.__init__.py is straightforward:

class DataHandler(ContentHandler):
    from Bio.Entrez import EInfo, ESearch, ESummary, EPost, ETaxonomy # eric gibert
    _NameToModule = {"eInfoResult": EInfo,
                     "eSearchResult": ESearch,
                     "eSummaryResult": ESummary,
                     "ePostResult": EPost,
                     "TaxaSet": ETaxonomy, # eric gibert
                     }

That's it. A unit test script would look like:

import Bio.Entrez

def print_record(taxrec):
    print taxrec["Rank"], taxrec["ScientificName"], "has the TaxId ", taxrec["TaxId"], "and its parent is", taxrec["ParentTaxId"]
    print taxrec["OtherNames"]
    print taxrec["Division"], "with Genetic Code:", taxrec["GeneticCode"], "and Mitochondrial Genetic Code:", taxrec["MitoGeneticCode"]
    print taxrec["Lineage"]
    print taxrec["LineageEx"]
    print "Record Created on %s, updated on %s and published on %s." %(taxrec["CreateDate"],taxrec["UpdateDate"],taxrec["PubDate"])

# simple test: get the dog...
handle = Bio.Entrez.efetch(db = "taxonomy", id = 9615, retmode = "XML")
taxonomic_record = Bio.Entrez.read(handle)
print_record(taxonomic_record)

# get multiple answers
search_handle = Bio.Entrez.esearch(db = "taxonomy", term = "orthetrum c*", retmode = "XML")
IdList = Bio.Entrez.read(search_handle)["IdList"]
for id in IdList:
    handle = Bio.Entrez.efetch(db = "taxonomy", id = id, retmode = "XML")
    orthetrum = Bio.Entrez.read(handle)
    print_record(orthetrum)

-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Apr 5 20:44:34 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 5 Apr 2008 16:44:34 -0400 Subject: [Biopython-dev] [Bug 2480] Local BLAST fails: Spaces in Windows file-path values In-Reply-To: Message-ID: <200804052044.m35KiYZm030885@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2480 ------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-05 16:44 EST ------- I use Python 2.3 on Windows because I have the MSVC 6.0 compiler all set up for building Python extensions (later versions of Python switched to a later MS compiler). I would object to dropping support for Python 2.3 over what seems to be a minor issue. As I do have Biopython and local blast up and running on this machine, I will be able to try and investigate this issue. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Sat Apr 5 20:46:42 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 5 Apr 2008 16:46:42 -0400 Subject: [Biopython-dev] [Bug 2480] Local BLAST fails: Spaces in Windows file-path values In-Reply-To: Message-ID: <200804052046.m35KkgPq030982@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2480 ------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-05 16:46 EST ------- P.S. There were similar issues in the clustalw wrapper, where I added win32 only code to add quotes to the command line when there were spaces in the file name. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sun Apr 6 11:02:38 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 6 Apr 2008 07:02:38 -0400 Subject: [Biopython-dev] [Bug 2468] Tutorial needs a fix: Bio.WWW.NCBI In-Reply-To: Message-ID: <200804061102.m36B2cYw027716@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2468 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-06 07:02 EST ------- Once Eric and Michiel have settled on a Taxonomy XML parser for Bio.Entrez (see Bug 2475), then this section of the tutorial could be updated to use this and the new XML search results parser Michiel has already checked into CVS. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon Apr 7 01:54:13 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 6 Apr 2008 21:54:13 -0400 Subject: [Biopython-dev] [Bug 2480] Local BLAST fails: Spaces in Windows file-path values In-Reply-To: Message-ID: <200804070154.m371sDPG004564@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2480 ------- Comment #9 from drpatnaik at yahoo.com 2008-04-06 21:54 EST ------- (In reply to comment #5) Using code: import os your_path =r'"C:/Documents and Settings/patnaik/My Documents/blast/bin/blastall.exe"' os.path.exists(your_path) os.system(your_path) Console output indicates [v] works (outputs: "blastall 2.2.18 arguments ..." ) and [x] doesn't work (outputs some variation of "... is not recognized as ...") [v] r'"C:/Documents and Settings/patnaik/My Documents/blast/bin/blastall.exe"' [x] r"'C:/Documents and Settings/patnaik/My Documents/blast/bin/blastall.exe'" [x] r"C:/Documents and Settings/patnaik/My Documents/blast/bin/blastall.exe" [x] r'C:/Documents and Settings/patnaik/My Documents/blast/bin/blastall.exe' [v] '"C:/Documents and Settings/patnaik/My Documents/blast/bin/blastall.exe"' [x] "'C:/Documents and Settings/patnaik/My Documents/blast/bin/blastall.exe'" [x] "C:/Documents and Settings/patnaik/My Documents/blast/bin/blastall.exe" [x] 'C:/Documents and Settings/patnaik/My Documents/blast/bin/blastall.exe' [v]'"C:\\Documents and Settings\\patnaik\\My Documents\\blast\\bin\\blastall.exe"' [x] "'C:\\Documents and Settings\\patnaik\\My Documents\\blast\\bin\\blastall.exe'" [x] "C:\\Documents and Settings\\patnaik\\My Documents\\blast\\bin\\blastall.exe" [x] 'C:\\Documents and Settings\\patnaik\\My Documents\\blast\\bin\\blastall.exe' However, when I run the code (original post) with any of the [v] working values, I get the "blastall does not exist" error. 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Apr 7 08:42:17 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 7 Apr 2008 04:42:17 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804070842.m378gHae024590@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ericgibert at yahoo.fr changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #891 is|0 |1 obsolete| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Apr 7 08:42:37 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 7 Apr 2008 04:42:37 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804070842.m378gb0U024630@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ericgibert at yahoo.fr changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #892 is|0 |1 obsolete| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon Apr 7 08:45:53 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 7 Apr 2008 04:45:53 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804070845.m378jrSf024885@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #16 from ericgibert at yahoo.fr 2008-04-07 04:45 EST ------- Created an attachment (id=896) --> (http://bugzilla.open-bio.org/attachment.cgi?id=896&action=view) Parser for many Taxons returned from NCBI Taxonomy db This version allows the parsing of 1 to many taxons. The 'record' is a list of dictionaries. Application: # get multiple answers print "Multiple records search:" search_handle = Bio.Entrez.esearch(db = "taxonomy", term = "orthetrum c*", retmode = "XML") IdList = Bio.Entrez.read(search_handle)["IdList"] handle = Bio.Entrez.efetch(db = "taxonomy", id = IdList, retmode = "XML") orthetrum_list = Bio.Entrez.read(handle) print len(orthetrum_list), "Orthetrum match your search:" for orthetrum in orthetrum_list: print orthetrum["Rank"], orthetrum["ScientificName"], "has the TaxId", orthetrum["TaxId"], "and its parent is", orthetrum["ParentTaxId"] -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon Apr 7 10:44:09 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 7 Apr 2008 06:44:09 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804071044.m37Ai9M5031608@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #17 from mdehoon at ims.u-tokyo.ac.jp 2008-04-07 06:44 EST ------- Patch (2008-04-07 04:45 EST) looks fine to me. I have uploaded it to CVS. I renamed it Taxon.py though for consistency with the existing parsers (name is the same as the corresponding DTD file). I have also modified Bio/Entrez/__init__.py accordingly. The patch (2008-04-05 08:33 EST) is obsolete, right? Thanks! -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Apr 7 12:59:31 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 7 Apr 2008 08:59:31 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804071259.m37CxVUk005741@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #18 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-07 08:59 EST ------- Michiel, It seems that Bio.Entrez.efetch can return XML files containing one record or many records, e.g. 
taxon_id_list = ['488050', '447868', '333459', '126256'] taxon_handle = Bio.Entrez.efetch(db="taxonomy", id=taxon_id_list, retmode="XML") #This handle contains four Taxon entries taxon_handle = Bio.Entrez.efetch(db="taxonomy", id='488050', retmode="XML") #This handle contains one Taxon entry Bio.Entrez.read(taxon_handle) will return a list of dictionaries (one for each taxon ID supplied). We've established a convention of sorts about "read()" versus "parse()", the first returns a single record and the second a record iterator. If a taxon single entry (currently held as a dictionary) is regarded as a record, then should Bio.Entrez.read() be called Bio.Entrez.parse() instead? I am also wondering if we should create simple record classes for the different XML data types (instead of using dictionaries). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Apr 7 13:46:52 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 7 Apr 2008 09:46:52 -0400 Subject: [Biopython-dev] [Bug 2481] bitscore not parsed. In-Reply-To: Message-ID: <200804071346.m37DkqVP009002@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2481 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-07 09:46 EST ------- I've checked this in to CVS (taking note of Michiel's comments), and confirmed the NCBI XML unit test passes. Sebastian - could you submit your suggested changes as patches next time please? It would have made life a little easier, trying to work out what exactly you wanted to change (which in the end, was fairly small). 
> 1) In NCBIXML.py: > method = self._secure_name('_end_' + name.replace("-","_")) > I don't see why name.replace("-","_") is needed. Isn't that the purpose of > self._secure_name in the first place? I agree. > 2) In Record.py: > Please add "bits" with a description to the docstring (lines 58-65). I've done this in CVS. > 3) About your change: > > Lines 409-410 uncommented > I wonder why these lines were commented out in the first > place. It was done in revision 1.8 of NCBIXML.py, but I > didn't see any explanation as to why those lines were > commented out. Peter, do you know? I made that old check-in, but it was some time ago and I don't recall the details. The self._descr.bits variable was never setup, causing an exception, and I guess at the time uncommenting this bit seemed like a good solution. With the Record.py fixed, the parser lines can now be uncommented. Perhaps the original code used to work on early NCBI XML files? Marking as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From p.j.a.cock at googlemail.com Mon Apr 7 15:26:28 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 7 Apr 2008 16:26:28 +0100 Subject: [Biopython-dev] Statistics code In-Reply-To: <434068.89146.qm@web62413.mail.re1.yahoo.com> References: <6d941f120804021755x52be0194lb3135a0153813405@mail.gmail.com> <434068.89146.qm@web62413.mail.re1.yahoo.com> Message-ID: <320fb6e00804070826v7d4ff977uf4e7cdff68e9fc33@mail.gmail.com> On Thu, Apr 3, 2008 at 10:49 AM, Michiel de Hoon wrote: > > But I think that I will need to use more stats stuff as I implement functionality. > > One solution is just to copy and paste whatever statistics code you need from S > SciPy. That does seem to be an option based on their licence and Biopython's. > > I think that NumPy has only basic stuff (standard deviation, mean). 
I > > might be wrong, but my research points to that. According to http://www.scipy.org/Numpy_Functions_by_Category they have array statistics: average(), mean(), bincount(), histogram(), corrcoef(), cov(), max(), min(), ptp(), median(), std(), var() plus a selection of random number and distribution functions. > The ideal solution would be to move the statistics stuff from SciPy to NumPy, > or to expand the statistics stuff currently in NumPy. Since SciPy and NumPy > come from the same group of developers, they may not mind too much. Is that something you want to raise with them, Michiel? > Having a statistics library in NumPy would be a big encouragement to move from > Numeric to NumPy. Speaking of which, is that still stuck on the 64bit issue? Bug 2251 - NumPy support for BioPython http://bugzilla.open-bio.org/show_bug.cgi?id=2251 Peter From bugzilla-daemon at portal.open-bio.org Tue Apr 8 01:41:24 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 7 Apr 2008 21:41:24 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804080141.m381fOL6013094@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ericgibert at yahoo.fr changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #895 is|0 |1 obsolete| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Apr 8 01:52:58 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 7 Apr 2008 21:52:58 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804080152.m381qwwN014055@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #19 from ericgibert at yahoo.fr 2008-04-07 21:52 EST ------- I marked the attachment #895 (2008-04-05 08:33 EST) as obsolete. Waiting for Michiel's reply to Peter's reply for updating the current code. Or maybe it is only __init__.py which needs modification (as I did not see "parse()" defined yet). Eric -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Apr 8 14:17:54 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 8 Apr 2008 10:17:54 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804081417.m38EHs0q007147@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #20 from mdehoon at ims.u-tokyo.ac.jp 2008-04-08 10:17 EST ------- > Bio.Entrez.read(taxon_handle) will return a list of dictionaries (one for each > taxon ID supplied). We've established a convention of sorts about "read()" > versus "parse()", the first returns a single record and the second a record > iterator. > If a taxon single entry (currently held as a dictionary) is regarded as a > record, then should Bio.Entrez.read() be called Bio.Entrez.parse() instead? I thought about that also, but I think that having Bio.Entrez.read() only is better. 
The reason is that some XML files returned by NCBI can be regarded as a list of records (possibly a list of only one record), but others can never be regarded as a list of records. That means we could have a Bio.Entrez.parse() in addition to Bio.Entrez.read(), but not instead of Bio.Entrez.read(). Now, in practical situations that could get ugly, not to say counterintuitive. For example, take Bio.Entrez.einfo. Without an argument, Bio.Entrez.einfo() returns a list of NCBI databases. Bio.Entrez.einfo(db="pubmed") then returns a dictionary with information about the pubmed database. (This double usage is not my choice; this is how NCBI has it set up). If we apply the parse/read rule strictly, we'd get the following: >>> from Bio import Entrez >>> handle = Entrez.einfo() >>> records = Entrez.parse(handle) >>> for record in records: ... print record pubmed protein nucleotide nuccore .... taxonomy toolkit unigene unists >>> To me, this seems to be a bit too much, since this is actually just a list. Now if we want information about pubmed, we'd use >>> handle = Entrez.einfo(db="pubmed") >>> record = Entrez.read(handle) # Now we have to use read() instead of parse() And here is the really tricky part: Is the following possible? >>> handle = Entrez.einfo(db=["pubmed","taxonomy"]) For example, Entrez.efetch allows a list of Ids; a user may guess that Entrez.einfo can handle a list of dbs. If it can, should he then call parse() instead of read() (in the example above, with db="pubmed")? Unlike for example Bio.Blast.NCBIXML, where we always get a list of records, for Bio.Entrez some XML files are more like a single record, whereas others are more like a list of records, and it may not be obvious to the user which is which. If you make a mistake, you have to repeat your query to NCBI, because the handle is already partially read. If we define the read/parse rule as "read returns an object, parse returns an iterator", then the existing Bio.Entrez.read() is still fine. 
> I am also wondering if we should create simple record classes for > the different XML data types (instead of using dictionaries). This can be useful if the record is an empty object deriving from a dict. It allows us to add a docstring to each record, while still preserving the functionality of each record as a dictionary. I don't see a good usage of additional functionality right now. Essentially, the XML file represents a dictionary (or a list of dictionaries); the Python object we returns should correspond to this. One alternative is to have a record class with fields corresponding to the keys in the dictionary. So >>> record.abc >>> record.ddd >>> record.klmnop instead of >>> record["abc"] >>> record["ddd"] >>> record["klmnop"] But I like the second form better, because it allows us to call keys() on the record and get the names of all fields. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mjldehoon at yahoo.com Tue Apr 8 14:43:29 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 8 Apr 2008 07:43:29 -0700 (PDT) Subject: [Biopython-dev] Statistics code In-Reply-To: <320fb6e00804070826v7d4ff977uf4e7cdff68e9fc33@mail.gmail.com> Message-ID: <474123.57528.qm@web62410.mail.re1.yahoo.com> Peter Cock wrote:> Having a statistics library in NumPy would be a big encouragement to move from > Numeric to NumPy. Speaking of which, is that still stuck on the 64bit issue? Bug 2251 - NumPy support for BioPython http://bugzilla.open-bio.org/show_bug.cgi?id=2251 I was not driving the issue of NumPy support because in its current state, NumPy seems to have as many advantages as disadvantages compared to Numeric. In addition the NumPy documentation is not free while the Numeric documentation is, so currently in my opinion the balance is in favor of Numeric. 
The situation changes when the NumPy documentation becomes freely available this summer (at the SciPy conference). Then the scale might tip in favor of NumPy, so we should revisit the issue then. --Michiel. From bugzilla-daemon at portal.open-bio.org Tue Apr 8 15:04:54 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 8 Apr 2008 11:04:54 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804081504.m38F4sph009833@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #21 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-08 11:04 EST ------- Regarding comment 20, you're right to say the one record/many records issue is cloudy. Let's stick with using Bio.Entrez.read() then. Regarding returning objects or dictionaries, my feeling was that attributes and docstrings could be used to help explain how to interpret the results. However, if as you say the Python dictionary is a natural representation of the XML data, then that should suffice - provided the NCBI have been clear with their field naming conventions. If we are all happy with this, then we should update the Bio.Entrez chapter of the tutorial. I would remove some of the longer cut-n-paste sections of XML output, as they don't look very good in the PDF output. Getting back to the NCBI taxon issue in BioSQL, did Hilmar's reply on the BioSQL mailing list clarify the use of the left/right fields? http://lists.open-bio.org/pipermail/biosql-l/2008-April/001233.html -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
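The option raised in comment 20 on Bug 2475, a record that is "an empty object deriving from a dict", might look roughly like the sketch below. This is only an illustration: the class name TaxonRecord is hypothetical and not part of Bio.Entrez, and the field values are sample data chosen for the example. The keys are the ones used in Eric's taxonomy example.

```python
class TaxonRecord(dict):
    """Hypothetical record for a single NCBI Taxonomy entry.

    Behaves exactly like a plain dict. The subclass exists only so that
    this docstring can describe the expected keys, for example:
    TaxId, ScientificName, Rank, ParentTaxId.
    """


# Illustrative values only; a real record would come from a parser.
record = TaxonRecord(
    TaxId="9615",
    ScientificName="Canis lupus familiaris",
    Rank="subspecies",
    ParentTaxId="9612",
)

# Dictionary access still works, and keys() lists every field name.
print(record["ScientificName"])
print(sorted(record.keys()))
```

Because the object is still a dict, code written against plain dictionaries keeps working, which seems to be the property both comments want to preserve.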
From bugzilla-daemon at portal.open-bio.org Tue Apr 8 23:17:59 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 8 Apr 2008 19:17:59 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804082317.m38NHxZf002619@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #22 from ericgibert at yahoo.fr 2008-04-08 19:17 EST ------- Yes, he replied with the following link: http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html This page provides an explanation and an algorithm. The point is that this calculation has to be done on the whole table, not just for the newly added taxon. Thus it will be very costly if we force the recalculation each time we add a taxon. Better to let the batch job do it and default the values to NULL (or -1 if the columns are NOT NULL; I did not check). What do you think? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Apr 8 23:19:44 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 8 Apr 2008 19:19:44 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804082319.m38NJim6002688@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ericgibert at yahoo.fr changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |ericgibert at yahoo.fr ------- Comment #23 from ericgibert at yahoo.fr 2008-04-08 19:19 EST ------- Regarding note #20, I think that always returning a list is better. 
We need anyway to test if the record was found or not thus we might as well do: if len(return_val) == 0: -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Apr 8 23:25:11 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 8 Apr 2008 19:25:11 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804082325.m38NPB4d002893@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #24 from ericgibert at yahoo.fr 2008-04-08 19:25 EST ------- if len(return_val) == 0: print "No record found" elif len(return_val) == 1: print "ok, proceed with", return_val[0] else: print "Ambiguity: please look at the different matches" for tax in return_val: ..... whatever print/select you need -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Apr 8 23:31:10 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 8 Apr 2008 19:31:10 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804082331.m38NVAMe003042@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #25 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-08 19:31 EST ------- Hi Eric, On the BioSQL taxon issue, recalculating the whole taxon table left/right values each time we add a new entry doesn't seem very sensible. 
Could you try my patch (attachment 883 on this bug) which only records a single entry for the new NCBI taxon ID (with null left/right values)? I should have split the Bio.Entrez issue into a separate bug a while ago - but yes, as things stand it is up to the user to check if they get one or more records, depending on what they asked for. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Apr 9 13:34:08 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 9 Apr 2008 09:34:08 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804091334.m39DY84A014676@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #26 from mdehoon at ims.u-tokyo.ac.jp 2008-04-09 09:34 EST ------- > If we are all happy with this, then we should update the Bio.Entrez chapter of > the tutorial. OK I can do that. > I would remove some of the longer cut-n-paste sections of XML > output, as it doesn't look very good in the PDF output. Agreed. The raw XML output is now in the tutorial only because it was the best we could do at the time, as we didn't have any parsers in Bio.Entrez. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Apr 10 12:07:32 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 10 Apr 2008 08:07:32 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804101207.m3AC7WL8017496@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #27 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-10 08:07 EST ------- Regarding inserting the lineage into the taxon/taxon_tables, the tree structure is stored in two ways. Firstly, using the taxon.parent field, and secondly using the left/right fields. Over on the BioSQL mailing list we've established that updating the left/right values by recalculating them takes about 10 minutes - doing this from Biopython when adding a new sequence does not seem ideal. We could add missing taxonomy nodes to the tables (based on the Bio.Entrez data), and record the tree structure using the taxon.parent field, but leave the left/right values as NULL. This should be enough for Biopython to recover the full lineage when retrieving a sequence - we need to check BioSQL.BioSeq._retrieve_taxon() is happy. If the user wants the left/right values, they would have to (re)run the BioSQL load_ncbi_taxonomy.pl script (which is slow). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
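To make the two storage schemes in comment 27 concrete, here is a small sketch using a toy tree. It is not the algorithm used by load_ncbi_taxonomy.pl; the function names nested_set and lineage and the node names are hypothetical, chosen only for illustration. It shows why left/right values need a pass over the whole tree, while the parent field alone is enough to recover a lineage.

```python
def nested_set(parent_of):
    """Compute {node: (left, right)} from a {child: parent} mapping.

    The root is the node whose parent is None. Values are assigned by a
    depth-first walk over the whole tree.
    """
    children = {}
    root = None
    for node, parent in parent_of.items():
        if parent is None:
            root = node
        else:
            children.setdefault(parent, []).append(node)

    left = {}
    bounds = {}
    counter = 1
    stack = [(root, False)]  # (node, True) marks "leaving this node"
    while stack:
        node, leaving = stack.pop()
        if leaving:
            bounds[node] = (left[node], counter)
        else:
            left[node] = counter
            stack.append((node, True))
            # Reverse-sorted push so children are visited in sorted order.
            for child in sorted(children.get(node, []), reverse=True):
                stack.append((child, False))
        counter += 1
    return bounds


def lineage(parent_of, node):
    """Recover root-to-node lineage using only the parent pointers."""
    path = []
    while node is not None:
        path.append(node)
        node = parent_of[node]
    return path[::-1]


# Toy tree with made-up node names (not real NCBI taxon IDs).
tree = {"root": None, "A": "root", "B": "root", "A1": "A"}
print(nested_set(tree))
print(lineage(tree, "A1"))
```

Adding one new leaf under "A" would shift the right value of "A" and "root" and both values of "B", which is why a full recalculation is usually deferred to a batch job, whereas lineage() keeps working as soon as the new parent entry is stored.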
From bugzilla-daemon at portal.open-bio.org Sat Apr 12 19:25:42 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 12 Apr 2008 15:25:42 -0400 Subject: [Biopython-dev] [Bug 2488] New: Adding XML parsers to Bio.Entrez Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2488 Summary: Adding XML parsers to Bio.Entrez Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk This is a placeholder bug for adding more XML parsers to Bio.Entrez to cope with all the NCBI formats. See also Bug 2475 which had a Taxonomy parser, now checked in. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Apr 12 19:38:38 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 12 Apr 2008 15:38:38 -0400 Subject: [Biopython-dev] [Bug 2488] Adding XML parsers to Bio.Entrez In-Reply-To: Message-ID: <200804121938.m3CJcckt000647@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2488 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-12 15:38 EST ------- Created an attachment (id=904) --> (http://bugzilla.open-bio.org/attachment.cgi?id=904&action=view) Bio/Entrez/PubmedArticle.py This is a possible Bio/Entrez/PubmedArticle.py which implements an XML parser for the PubMed database. When constructing a dictionary to hold each publication, I am deliberately flattening and simplifying the very deeply nested structure the NCBI uses. In general, do we want to provide a faithful conversion of the full XML DOM structure into Python objects, or just a simplification? 
If the user cares about the exact XML structure, or particular elements, they are probably better off writing their own parsers using DOM or SAX as they see fit. Still needs more testing, perhaps storing the dates as date objects and not as dictionaries. Also I am ignoring the "history" elements. It may be worthwhile returning a Reference object (see the GenBank parser) for these entries... Just thinking out loud about the Bio.Entrez parsers in general: Why don't the Bio/Entrez/XXX.py implement subclasses of Bio.Entrez.DataHandler, rather than just the two methods startElement() and endElement() -- I'm trying to understand why you did it this way round Michiel. Finally, in Bio/Entrez/__init__.py why is the _NameToModule dict defined within the DataHandler class? This seems to prevent it from being edited -- desirable if the user wanted to add or change the parsers called by Bio.Entrez.read() in their script. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sun Apr 13 00:55:03 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 12 Apr 2008 20:55:03 -0400 Subject: [Biopython-dev] [Bug 2488] Adding XML parsers to Bio.Entrez In-Reply-To: Message-ID: <200804130055.m3D0t3Fi014906@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2488 ------- Comment #2 from mdehoon at ims.u-tokyo.ac.jp 2008-04-12 20:55 EST ------- > Why don't the Bio/Entrez/XXX.py implement subclasses of Bio.Entrez.DataHandler, > rather than just the two methods startElement() and endElement() -- I'm trying > to understand why you did it this way round Michiel. We don't know what kind of XML we're handling until after we start reading it, since this is information contained in the XML. 
So we need to create the handler before knowing which handler will be needed. > Finally, in Bio/Entrez/__init__.py why is the _NameToModule dict defined > within the DataHandler class? This seems to prevent it from being edited > -- desirable if the user wanted to add or change the parsers called by > Bio.Entrez.read() in their script. We can get to _NameToModule. Try it: >>> from Bio import Entrez >>> Entrez.DataHandler._NameToModule Being able to override the parsers in Bio.Entrez is something Sean requested. I am not sure if he still wants it, or if it's really useful. A user can also modify the record after parsing it with the standard parser in Bio.Entrez, which gives the same end result as modifying the parser. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sun Apr 13 12:46:39 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 13 Apr 2008 08:46:39 -0400 Subject: [Biopython-dev] [Bug 2488] Adding XML parsers to Bio.Entrez In-Reply-To: Message-ID: <200804131246.m3DCkdiU029292@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2488 ------- Comment #3 from mdehoon at ims.u-tokyo.ac.jp 2008-04-13 08:46 EST ------- Just one comment on your patch: It would be a good idea to include the exact name of the DTD in the comments. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
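The design point in comment 2 on Bug 2488 (one handler is created first, and the format-specific code is chosen only once parsing has started) can be sketched as below. This is a simplified illustration, not the Bio.Entrez implementation: it selects callbacks by the root element name rather than by the DTD name, and DispatchingHandler, FORMATS, taxa_start, taxa_end and read are all hypothetical names.

```python
import xml.sax


# Hypothetical format-specific callbacks, standing in for the
# startElement()/endElement() pairs of a module like Bio/Entrez/Taxon.py.
def taxa_start(handler, name, attrs):
    if name == "Taxon":
        handler.record.append({})


def taxa_end(handler, name, content):
    if name in ("TaxId", "ScientificName"):
        handler.record[-1][name] = content


# Assumption: keyed on the root element name to keep the sketch short.
FORMATS = {"TaxaSet": (taxa_start, taxa_end)}


class DispatchingHandler(xml.sax.ContentHandler):
    """Generic handler that learns the XML type from the data itself."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.record = []
        self.callbacks = None
        self.content = ""

    def startElement(self, name, attrs):
        if self.callbacks is None:
            # First element seen: only now is the format known.
            self.callbacks = FORMATS[name]
        self.content = ""
        self.callbacks[0](self, name, attrs)

    def characters(self, text):
        self.content += text

    def endElement(self, name):
        self.callbacks[1](self, name, self.content.strip())
        self.content = ""


def read(xml_bytes):
    handler = DispatchingHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.record


# Hand-written sample input, not a real NCBI response.
sample = (b"<TaxaSet><Taxon><TaxId>9615</TaxId>"
          b"<ScientificName>Canis lupus familiaris</ScientificName>"
          b"</Taxon></TaxaSet>")
print(read(sample))
```

Keeping the lookup table at module level, as FORMATS is here, is one way a user script could add or replace parsers, which is the concern raised in comment 1.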
From bugzilla-daemon at portal.open-bio.org Sun Apr 13 13:46:06 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 13 Apr 2008 09:46:06 -0400 Subject: [Biopython-dev] [Bug 2488] Adding XML parsers to Bio.Entrez In-Reply-To: Message-ID: <200804131346.m3DDk66B032208@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2488 ------- Comment #4 from mdehoon at ims.u-tokyo.ac.jp 2008-04-13 09:46 EST ------- Uploaded a parser for SerialSet to CVS. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sun Apr 13 14:32:28 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 13 Apr 2008 10:32:28 -0400 Subject: [Biopython-dev] [Bug 2488] Adding XML parsers to Bio.Entrez In-Reply-To: Message-ID: <200804131432.m3DEWSln001836@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2488 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-13 10:32 EST ------- >> Why don't the Bio/Entrez/XXX.py implement subclasses >> of Bio.Entrez.DataHandler, rather than just the two >> methods startElement() and endElement() -- I'm trying >> to understand why you did it this way round Michiel. > > We don't know what kind of XML we're handling until > after we start reading it, since this is information > contained in the XML. So we need to create the > handler before knowing which handler will be needed. OK - I see what you mean now. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
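The point made in this exchange (the handler object has to exist before the kind of XML is known, so the per-format code can only be plugged in once parsing has started) can be sketched with the standard library's xml.sax. This is only an illustration and not the actual Bio.Entrez code: DispatchingHandler, ROOT_TO_CALLBACK and the two callback functions are hypothetical names, and the element names are just examples.

```python
import xml.sax

# Hypothetical per-format callbacks, standing in for the startElement()
# functions that live in the individual Bio/Entrez/XXX.py modules.
def article_start(handler, name, attrs):
    handler.record.append(("article", name))

def serial_start(handler, name, attrs):
    handler.record.append(("serial", name))

# Hypothetical mapping from the root element name to the callback to use.
ROOT_TO_CALLBACK = {
    "PubmedArticleSet": article_start,
    "SerialSet": serial_start,
}

class DispatchingHandler(xml.sax.ContentHandler):
    """Created before the document type is known; picks the real
    callback when the first (root) element is seen."""

    def __init__(self):
        super().__init__()
        self.record = []
        self._start = None  # chosen lazily, on the root element

    def startElement(self, name, attrs):
        if self._start is None:
            # Only now do we know what kind of XML this is.
            self._start = ROOT_TO_CALLBACK[name]
        self._start(self, name, attrs)

def read(data):
    """Parse XML given as bytes and return the collected record."""
    handler = DispatchingHandler()
    xml.sax.parseString(data, handler)
    return handler.record
```

Because the mapping is an ordinary dict, a script could in principle add or replace entries before calling read(), which is the kind of overriding discussed above.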
From samnemo at gmail.com Mon Apr 14 22:06:46 2008 From: samnemo at gmail.com (sam n) Date: Mon, 14 Apr 2008 18:06:46 -0400 Subject: [Biopython-dev] kdtree update Message-ID: I made some changes to the C++ side of Biopython's KDTree that allow one to perform a nearest-neighbor search without specifying a range up-front. I found this saved a considerable amount of CPU time for the problem I was using it for. It might be useful to other people so I can send the update, which is based on http://www.google.com/codesearch?hl=en&q=+kdtree+show:b099E8j0eYY:M9X8aTw_p7E:Tn8Xj-OBPYY&sa=N&cd=4&ct=rc&cs_p=ftp://ftp.diku.dk/diku/users/martinz/tabu.tar.gz&cs_f=kdtree.c#first Where do I send it to? This mailing list? Thanks Sam From biopython-dev at maubp.freeserve.co.uk Mon Apr 14 22:37:59 2008 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Mon, 14 Apr 2008 23:37:59 +0100 Subject: [Biopython-dev] kdtree update In-Reply-To: References: Message-ID: <320fb6e00804141537i73a3ccadmbbe4e71769bd5469@mail.gmail.com> On Mon, Apr 14, 2008 at 11:06 PM, sam n wrote: > I made some changes to the C++ side of Biopython's KDTree that allow one to > perform a nearest-neighbor search without specifying a range up-front. I found > this saved a considerable amount of CPU time for the problem I was using it for. > It might be useful to other people so I can send the update, which is based on > http://www.google.com/codesearch?hl=en&q=+kdtree+show:b099E8j0eYY:M9X8aTw_p7E:Tn8Xj-OBPYY&sa=N&cd=4&ct=rc&cs_p=ftp://ftp.diku.dk/diku/users/martinz/tabu.tar.gz&cs_f=kdtree.c#first > > Where do I send it to? This mailing list? Hi Sam, Could you file an "enhancement" bug on Bugzilla, and then attach a patch? http://bugzilla.open-bio.org/ It would also help to have a small example (in python) of how you would use this. 
Thanks Peter From thamelry at binf.ku.dk Tue Apr 15 06:08:03 2008 From: thamelry at binf.ku.dk (Thomas Hamelryck) Date: Tue, 15 Apr 2008 08:08:03 +0200 Subject: [Biopython-dev] kdtree update In-Reply-To: References: Message-ID: <2d7c25310804142308l477ba091tcab648f753fb1ba0@mail.gmail.com> On Tue, Apr 15, 2008 at 12:06 AM, sam n wrote: > I made some changes to the C++ side of Biopython's KDTree that allow one > to > perform a nearest-neighbor search without specifying a range up-front. What exactly do you mean by this? Could you post an example, also of the speed up? Cheers, -Thomas From bugzilla-daemon at portal.open-bio.org Tue Apr 15 14:10:06 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 15 Apr 2008 10:10:06 -0400 Subject: [Biopython-dev] [Bug 2489] New: KDTree NN search without specifying radius Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2489 Summary: KDTree NN search without specifying radius Product: Biopython Version: 1.45 Platform: PC OS/Version: Windows Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: samnemo at gmail.com CC: thamelry at binf.ku.dk All the current searches in the KDTree require specifying a radius. If you don't know what the radius is, you don't know how far to search without taking a typical estimate of the data set. I just added a function to find the nearest neighbor to a coordinate without specifying this radius up front. I made the changes on the C++ side of Biopython's KDTree. It might be useful to other people so I will post the update, which is based on http://www.google.com/codesearch?hl=en&q=+kdtree+show:b099E8j0eYY:M9X8aTw_p7E:Tn8Xj-OBPYY&sa=N&cd=4&ct=rc&cs_p=ftp://ftp.diku.dk/diku/users/martinz/tabu.tar.gz&cs_f=kdtree.c#first However, I am not currently proficient in the Python C API, so someone else may be able to write the interface in 3 minutes... 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Apr 15 14:19:04 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 15 Apr 2008 10:19:04 -0400 Subject: [Biopython-dev] [Bug 2489] KDTree NN search without specifying radius In-Reply-To: Message-ID: <200804151419.m3FEJ4Ib010877@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2489 ------- Comment #1 from samnemo at gmail.com 2008-04-15 10:19 EST ------- Created an attachment (id=908) --> (http://bugzilla.open-bio.org/attachment.cgi?id=908&action=view) updated CPP file added public function void KDTree::search_nn(float* coord,bool allowzero) and private function void KDTree::_search_r(Node* node,float* coord,bool allowzero) other changes include leaving coordinates as squared distances when storing them. the user is then responsible for calling sqrt if desired. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Apr 15 14:25:47 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 15 Apr 2008 10:25:47 -0400 Subject: [Biopython-dev] [Bug 2489] KDTree NN search without specifying radius In-Reply-To: Message-ID: <200804151425.m3FEPl7J011220@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2489 ------- Comment #2 from samnemo at gmail.com 2008-04-15 10:25 EST ------- Created an attachment (id=909) --> (http://bugzilla.open-bio.org/attachment.cgi?id=909&action=view) updated .H file added public function void KDTree::search_nn(float* coord,bool allowzero) bool allowzero specifies whether to allow zero distance between nearest neighbor searching for in tree (should be false when searching for nearest neighbor of a coordinate known to be in the tree) and private function : void KDTree::_search_r(Node* node,float* coord,bool allowzero) performs recursive search for nearest neighbor of coord starting from node also note new member variable : float _min_radius_sq; used to keep track of min distance found for a single nearest neighbor search other changes include: leaving coordinates as squared distances when storing them. the user is then responsible for calling sqrt if desired. replaced some of the static declaration of variable sized arrays as vectors since wouldn't compile on msvc++, but easy to change that back... -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
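For readers without the attachments, the idea described in these two comments (a recursive nearest-neighbour search that tracks the smallest squared distance found so far, with a flag to skip zero-distance hits) can be sketched in plain Python. This is not the attached C++ code: build() and search_nn() below are hypothetical helpers written for illustration, and the tree is a simple nested tuple rather than Biopython's KDTree.

```python
def build(points, depth=0):
    """Build a k-d tree as nested tuples: (point, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build(points[:mid], depth + 1),
            build(points[mid + 1:], depth + 1))

def search_nn(node, coord, allowzero=True, depth=0, best=None):
    """Return (squared_distance, point) for the nearest neighbour of coord.

    No search radius is needed: the best squared distance found so far
    plays that role. With allowzero=False, points at distance zero are
    skipped (useful when coord itself is stored in the tree).
    """
    if node is None:
        return best
    point, left, right = node
    d2 = sum((a - b) ** 2 for a, b in zip(point, coord))
    if (allowzero or d2 > 0) and (best is None or d2 < best[0]):
        best = (d2, point)
    axis = depth % len(coord)
    diff = coord[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = search_nn(near, coord, allowzero, depth + 1, best)
    # Cross the splitting plane only if a closer point could lie beyond it.
    if best is None or diff * diff <= best[0]:
        best = search_nn(far, coord, allowzero, depth + 1, best)
    return best
```

As in the patch description, distances are left squared and the caller takes the square root if the real distance is wanted.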
From bugzilla-daemon at portal.open-bio.org Tue Apr 15 14:26:24 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 15 Apr 2008 10:26:24 -0400 Subject: [Biopython-dev] [Bug 2489] KDTree NN search without specifying radius In-Reply-To: Message-ID: <200804151426.m3FEQO1d011261@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2489 samnemo at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |samnemo at gmail.com -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Apr 15 14:45:06 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 15 Apr 2008 10:45:06 -0400 Subject: [Biopython-dev] [Bug 2489] KDTree NN search without specifying radius In-Reply-To: Message-ID: <200804151445.m3FEj6bc011983@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2489 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-15 10:45 EST ------- The python API would need to be updated to match. Do you know how to use SWIG? You mention changes to leaving coordinates as squared distances - this may be more efficient but would probably break any existing code using this. As Thomas Hamelryck said on the mailing list, an example to show why this new code is useful would be very helpful (and to demonstrate the claimed time speed up). Link to email thread for anyone not subscribed: http://lists.open-bio.org/pipermail/biopython-dev/2008-April/003601.html -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Apr 15 15:03:52 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 15 Apr 2008 11:03:52 -0400 Subject: [Biopython-dev] [Bug 2489] KDTree NN search without specifying radius In-Reply-To: Message-ID: <200804151503.m3FF3q08013312@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2489 ------- Comment #4 from samnemo at gmail.com 2008-04-15 11:03 EST ------- I never used SWIG before, but could learn how to use it...I don't have a lot of time at the moment...so I'll have to come back to this...might be sooner (or later) than I think... -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Apr 16 11:02:41 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 16 Apr 2008 07:02:41 -0400 Subject: [Biopython-dev] [Bug 2488] Adding XML parsers to Bio.Entrez In-Reply-To: Message-ID: <200804161102.m3GB2fDi013466@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2488 ------- Comment #6 from mdehoon at ims.u-tokyo.ac.jp 2008-04-16 07:02 EST ------- Added a parser for the OMIM database, and an initial unit test. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Apr 17 12:41:06 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 17 Apr 2008 08:41:06 -0400 Subject: [Biopython-dev] [Bug 2488] Adding XML parsers to Bio.Entrez In-Reply-To: Message-ID: <200804171241.m3HCf6OO031008@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2488 ------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-17 08:41 EST ------- Michiel - I see you've been doing more work on these parsers (and unit tests). I'm quite happy for you to take on PubmedArticle.py and implement it as you see fit (based on my suggested code or otherwise). Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Apr 18 02:35:29 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 17 Apr 2008 22:35:29 -0400 Subject: [Biopython-dev] [Bug 2480] Local BLAST fails: Spaces in Windows file-path values In-Reply-To: Message-ID: <200804180235.m3I2ZT1A004790@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2480 ------- Comment #10 from drpatnaik at yahoo.com 2008-04-17 22:35 EST ------- Could this be a blast.exe issue? 
Using the Windows command console, if I change the working directory (cd) to the one having 'blast.exe', both the following work: [1] blastall.exe -p blastn -d "C:\Documents and Settings\patnaik\My Documents\blast\bin\mine" -i "C:\Documents and Settings\patnaik\My Documents\blast\bin\hairpin" -m 7 [2] "C:\Documents and Settings\patnaik\My Documents\blast\bin\blastall.exe" -p blastn -d "C:\Documents and Settings\patnaik\My Documents\blast\bin\mine" -i "C:\Documents and Settings\patnaik\My Documents\blast\bin\hairpin" -m 7 But neither ([1] for obvious reason), nor 'bin/blast.exe -p ...', etc., work if I move out of the directory that has 'blast.exe'. The console displays: [NULL_Caption] WARNING: Unable to open Documents.nin [NULL_Caption] WARNING: Unable to open and.nin [NULL_Caption] WARNING: Unable to open My.nin ... -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Apr 18 02:42:35 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 17 Apr 2008 22:42:35 -0400 Subject: [Biopython-dev] [Bug 2480] Local BLAST fails: Spaces in Windows file-path values In-Reply-To: Message-ID: <200804180242.m3I2gZ1q005051@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2480 ------- Comment #11 from drpatnaik at yahoo.com 2008-04-17 22:42 EST ------- (In reply to comment #10) No, not a blast.exe issue as, e.g., this works (note the escaped quotes): "bin\blastall.exe" -p blastn -d "\"C:\Documents and Settings\patnaik\My Documents\blast\bin\mine\"" -i "C:\Documents and Settings\patnaik\My Documents\blast\bin\hairpin" -m 7 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
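The escaped-quote form that works in comment #11 can be reproduced from Python. The sketch below is only an illustration of the quoting, not the fix that went into Biopython: blast_db_argument() and build_blastall_command() are hypothetical helpers, and it assumes (from the "Unable to open Documents.nin" warnings) that the -d value is split on spaces unless it carries its own literal quotes. It uses subprocess.list2cmdline, a helper present in CPython's subprocess module that, as far as I know, is not part of the documented public API.

```python
import subprocess

def blast_db_argument(path):
    """Wrap the database path in literal double quotes, so that a path
    containing spaces is passed through as a single database name."""
    return '"%s"' % path

def build_blastall_command(blastall, program, database, query):
    """Return the argument list for a blastall run with XML output (-m 7)."""
    return [blastall, "-p", program,
            "-d", blast_db_argument(database),
            "-i", query,
            "-m", "7"]

def example_command_line():
    base = r"C:\Documents and Settings\patnaik\My Documents\blast\bin"
    cmd = build_blastall_command(r"bin\blastall.exe", "blastn",
                                 base + r"\mine", base + r"\hairpin")
    # list2cmdline applies the Windows quoting rules: arguments with spaces
    # are wrapped in quotes and embedded quotes become \".
    return subprocess.list2cmdline(cmd)
```

The string returned by example_command_line() contains -d "\"C:\Documents and Settings\...\mine\"", matching the command line reported to work above.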
From chris.lasher at gmail.com Fri Apr 18 18:45:36 2008 From: chris.lasher at gmail.com (Chris Lasher) Date: Fri, 18 Apr 2008 14:45:36 -0400 Subject: [Biopython-dev] [O|B|F Helpdesk #332] Transitioning Biopython to SVN In-Reply-To: <-2724361901926009927@unknownmsgid> References: <-2724361901926009927@unknownmsgid> Message-ID: <128a885f0804181145p7440f6bfp94c12c514519ed1@mail.gmail.com> Hi Mauricio, Right now the transition is idling. George Hartzell set up a prototype repository that was read-only and I had a successful checkout. I don't think any Biopython devs noticed any anomalies, so I guess that's a thumbs-up for the transition by default. There are a number of tickets that seem to be getting worked on at the moment but I think we can freeze the CVS repository soon and get SVN going. Also, what's the resolution for providing public read-only access to the repositories? I think that was the only unresolved matter that concerned the Biopython users. Thanks, Chris On Fri, Apr 18, 2008 at 2:34 PM, Mauricio Herrera Cuadra via RT wrote: > What is the status of this so far? > From bugzilla-daemon at portal.open-bio.org Tue Apr 22 12:27:47 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 22 Apr 2008 08:27:47 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804221227.m3MCRl95007664@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 mmokrejs at ribosome.natur.cuni.cz changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |mmokrejs at ribosome.natur.cuni | |.cz -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Apr 22 13:10:58 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 22 Apr 2008 09:10:58 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804221310.m3MDAwOo010833@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #896 is|0 |1 obsolete| | ------- Comment #28 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-22 09:10 EST ------- (From update of attachment 896) Marking this patch as obsolete since we've got something based on Eric's work for Bio.Entrez checked into CVS now. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From MatatTHC at gmx.de Tue Apr 22 16:49:02 2008 From: MatatTHC at gmx.de (Matthias Bernt) Date: Tue, 22 Apr 2008 18:49:02 +0200 Subject: [Biopython-dev] derive from Seq Message-ID: <20080422164902.232560@gmx.net> Hi, Despite of other messages on this list someone needs circular sequences .. me :). I thought that the best way to get a circular behaviour is to make a derived class in order to keep all the nice features of Seq. So I've started to write a derived class which overwrites some of the methods from the Seq - especially __getitem__ - and I run quite fast into problems. E.g. the complement method returns a Seq object. The desired behaviour would be to return an instance of my derived class .. This could be done easily with self.__class__(s, self.alphabet). Unfortunately the __init__ method of my derived class has a third parameter (with default value) which sets the sequence to circular / linear. 
This could be done with copy or clone methods. So you see that there are problems when deriving from the Seq class. What is the best (or a good) strategy for deriving classes from Seq? Thanks -- Matthias From n.j.loman at bham.ac.uk Tue Apr 22 16:50:36 2008 From: n.j.loman at bham.ac.uk (Nick Loman) Date: Tue, 22 Apr 2008 17:50:36 +0100 Subject: [Biopython-dev] BioSQL : BatchLoader Message-ID: <480E175C.9060301@bham.ac.uk> Dear biopython-developers, As importing data into PostgreSQL is much faster when using the batch "COPY" method I decided I would hack BioSQL.Loader to produce COPY statements for the bulk of the data in a typical GenBank file. As index updating/foreign key checking is also slow, I split the BioSQL schema. I put table definitions in one file and then indexes/foreign key constraints in a separate one. I import the schema file, then apply indexes/FK only after the data is loaded. Caveat: obviously this can't be done on a "live" database and it relies on only a single import process being run at any one time. A modified 'Loader' uses a new class called 'FakeTable'. FakeTable acts as a very, very basic data store attempting to simulate the behavior of Postgres. FakeTable.dump() outputs COPY statements to stdout instead of SQL commands. I benchmarked load_seqdatabase.pl vs. BioSQL.Loader vs. FakeTable with a GenBank file 42MB large (microbial32.genomic.gbff from RefSeq). load_seqdatabase.pl - not directly comparable as needs foreign keys/rules to run correctly, but conservatively >20 minutes BioSQL.Loader/psyco - 4 minutes, 54 seconds BatchLoader/psyco - 1 minute, 38 seconds +Import the output - 8 seconds Postgres 8.3.1, Gentoo/Linux, 8GB RAM. As the number of sequence files increases, there should be even greater gains, as the interactive version will take longer to execute each query.
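To make the COPY approach concrete, here is a minimal sketch of an in-memory table that is dumped as a single PostgreSQL "COPY ... FROM stdin" block in the default text format (tab-separated columns, \N for NULL, terminated by a line containing only \.). It is not the FakeTable class from the code linked in this message: CopyBuffer and copy_escape are hypothetical names, and the table and column names in the usage note are only illustrative.

```python
import sys

def copy_escape(value):
    """Format one value for PostgreSQL's COPY text format."""
    if value is None:
        return r"\N"  # NULL marker
    text = str(value)
    # Backslash first, then the characters used as delimiters.
    return (text.replace("\\", "\\\\")
                .replace("\t", "\\t")
                .replace("\n", "\\n")
                .replace("\r", "\\r"))

class CopyBuffer:
    """Collects rows in memory and hands out primary keys itself,
    instead of issuing one INSERT per row."""

    def __init__(self, name, columns):
        self.name = name
        self.columns = list(columns)
        self.rows = []

    def insert(self, *values):
        """Store a row (without its key) and return the new key."""
        if len(values) != len(self.columns) - 1:
            raise ValueError("expected %d values" % (len(self.columns) - 1))
        key = len(self.rows) + 1
        self.rows.append((key,) + values)
        return key

    def dump(self, out=None):
        """Write the whole table as one COPY block."""
        out = sys.stdout if out is None else out
        out.write("COPY %s (%s) FROM stdin;\n"
                  % (self.name, ", ".join(self.columns)))
        for row in self.rows:
            out.write("\t".join(copy_escape(v) for v in row) + "\n")
        out.write("\\.\n")
```

For example, CopyBuffer("taxon", ["taxon_id", "ncbi_taxon_id"]) followed by insert(9606) and dump() writes a block that psql can load in one go, which is where the saving over row-by-row INSERT statements comes from.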
This is not production-quality code but might act as a starting point for hacking about with. I would be grateful for any comments. If the team felt this would be a useful inclusion into BioPython I am happy to work it up a bit more. A MySQL compatible version would not be very hard, for example. I reckon this could be faster, for example the sequence parsing could be threaded on multi-core machines. Code is here: http://pathogenomics.bham.ac.uk/nick/snippets/biopython-sql/ I'd be grateful for any feedback on how this might be improved, and how we can make it even faster! Many thanks Nick. From biopython-dev at maubp.freeserve.co.uk Tue Apr 22 17:27:38 2008 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Tue, 22 Apr 2008 18:27:38 +0100 Subject: [Biopython-dev] derive from Seq In-Reply-To: <20080422164902.232560@gmx.net> References: <20080422164902.232560@gmx.net> Message-ID: <320fb6e00804221027x68e866eeo9f62a9e83250355c@mail.gmail.com> Hi Matthias, > ... I've started to write a derived class which overwrites some of the > methods from the Seq - especially __getitem__ - and I run quite fast > into problems. E.g. the complement method returns a Seq object. > > The desired behaviour would be to return an instance of my derived class > .. This could be done easily with self.__class__(s, self.alphabet). That has been raised before (although I don't think anyone filed a bug) and as I recall, changing this led to some unexpected side effects (failures in the test suite). If you fancy trying this change, and digging into what (if anything) breaks as a result, that would be very helpful. > Unfortunately the __init__ method of my derived class has a third > parameter (with default value) which sets the sequence to circular / linear. > This could be done with copy or clone methods. > > So you see that there are problems when deriving from the Seq class. What > is the best (or a good) strategy for deriving classes from Seq?
Maybe for now your best bet is to subclass, and then write your own (reverse)complement method which calls the base-class to do the work, and then transforms the resulting Seq object into your own CircularSeq object. Unfortunately, you would probably have to do something similar for other problematic methods (until the base class is fixed). Out of interest, how do you interpret integers in your CircularSeq's __getitem__ method? Python's existing negative index behaviour seems to be ideal, for example -1 already returns the last letter. I'd guess you make values longer than the sequence length simply wrap. Is that the only change or do you alter the splice behaviour too (this is where it gets tricky). Peter From biopython-dev at maubp.freeserve.co.uk Tue Apr 22 17:38:24 2008 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Tue, 22 Apr 2008 18:38:24 +0100 Subject: [Biopython-dev] BioSQL : BatchLoader In-Reply-To: <480E175C.9060301@bham.ac.uk> References: <480E175C.9060301@bham.ac.uk> Message-ID: <320fb6e00804221038i273ad0dbw6ec7bf866cef190c@mail.gmail.com> On Tue, Apr 22, 2008 at 5:50 PM, Nick Loman wrote: > Dear biopython-developers, > > As importing data into PostgreSQL is much faster when using the batch > "COPY" method I decided I would hack BioSQL.Loader to produce COPY > statements for the bulk of the data in a typical GenBank file. Can I ask what version of Biopython you're using? And given you've got it running on PostgreSQL, is there anything you think should be added to the wiki documentation?: http://biopython.org/wiki/BioSQL > As index updating/foreign key checking is also slow, I split the BioSQL > schema. I put table definitions in one file and then indexes/foreign key > constraints in a separate one. While this is fine for your own use - you'd had have to take this up on the BioSQL mailing list if you wanted it to become a standard (i.e. its not just up to us at Biopython). It might be worth moving some of this discussion there anyway. 
> I benchmarked load_seqdatabase.pl vs. BioSQL.loader vs. FakeTable with a > GenBank file 42MB large (microbial32.genomic.gbff from RefSeq). > > load_seqdatabase.pl - not directly comparable as needs foreign > keys/rules to run correctly, but > conservatively >20 minutes > > BioSQL.Loader/psyco - 4 minutes, 54 seconds > > BatchLoader/psyco - 1 minute, 38 seconds > +Import the output - 8 seconds > > Postgres 8.3.1, Gentoo/Linux, 8GB RAM. Did you run the numbers for a plain Biopython BioSQL.Loader import (without psyco)? If you do go back and run some more tests, could you also try just parsing the GenBank file without actually doing anything with the data (to see what the overhead is on your machine). > I reckon this could be faster, for example the sequence parsing could be > threaded on a multi-core machines. You should in principle be able to run multiple imports even without making any code changes to Biopython, although I suspect there is some scope for clashes (e.g. two threads both adding new entries to the taxonomy tables). > Code is here: > http://pathogenomics.bham.ac.uk/nick/snippets/biopython-sql/ > > I'd be grateful for any feedback on how this might be improved, and how we > can make it even faster! That seems to be password protected at the moment. Peter From n.j.loman at bham.ac.uk Wed Apr 23 08:09:54 2008 From: n.j.loman at bham.ac.uk (Nick Loman) Date: Wed, 23 Apr 2008 09:09:54 +0100 Subject: [Biopython-dev] BioSQL : BatchLoader In-Reply-To: <320fb6e00804221038i273ad0dbw6ec7bf866cef190c@mail.gmail.com> References: <480E175C.9060301@bham.ac.uk> <320fb6e00804221038i273ad0dbw6ec7bf866cef190c@mail.gmail.com> Message-ID: <480EEED2.5080602@bham.ac.uk> Hi Peter >> As importing data into PostgreSQL is much faster when using the batch >> "COPY" method I decided I would hack BioSQL.Loader to produce COPY >> statements for the bulk of the data in a typical GenBank file. > > Can I ask what version of Biopython you're using? 1.45. 
> is there anything you think should be > added to the wiki documentation?: > http://biopython.org/wiki/BioSQL I've added a few lines on Postgres. >> As index updating/foreign key checking is also slow, I split the BioSQL >> schema. I put table definitions in one file and then indexes/foreign key >> constraints in a separate one. > > While this is fine for your own use - you'd had have to take this up > on the BioSQL mailing list if you wanted it to become a standard (i.e. > its not just up to us at Biopython). It might be worth moving some of > this discussion there anyway. Yep, appreciate that! The problem is that you wouldn't want to have non-indexed tables ever if you were updating with the traditional 'interactive' scripts, as they will begin to slow to a crawl as more data is imported. So this approach is only really good for this kind of batch-import model. However I guess it is still reasonably friendly to ask people to import 2 scripts in a row. >> load_seqdatabase.pl - not directly comparable as needs foreign >> keys/rules to run correctly, but >> conservatively >20 minutes >> +Import the output - 8 seconds > > Did you run the numbers for a plain Biopython BioSQL.Loader import > (without psyco)? If you do go back and run some more tests, could you > also try just parsing the GenBank file without actually doing anything > with the data (to see what the overhead is on your machine). Yep, sure. GenBank parsing without psyco - 2 minutes, 15 seconds GenBank parsing with psyco - 1 minute, 20 seconds >> BioSQL.Loader/psyco - 4 minutes, 54 seconds BioSQL.Loader without psyco - 6 minutes, 10 seconds >> BatchLoader/psyco - 1 minute, 38 seconds BatchLoader without psyco - 2 minutes, 42 seconds >> I reckon this could be faster, for example the sequence parsing could be >> threaded on a multi-core machines. 
> > You should in principle be able to run multiple imports even without > making any code changes to Biopython, although I suspect there is some > scope for clashes (e.g. two threads both adding new entries to the > taxonomy tables). Yep, with the interactive version I reckon this would work without many problems (most taxa should be pulled out of NCBI anyway), but with my flat-file version this wouldn't work unless specifically designed for. I could parallelise the GB parsing stage though as that is the current bottleneck for my app. >> Code is here: >> http://pathogenomics.bham.ac.uk/nick/snippets/biopython-sql/ >> >> I'd be grateful for any feedback on how this might be improved, and how we >> can make it even faster! > > That seems to be password protected at the moment. My bad, it's open now. Regards, Nick. From biopython at maubp.freeserve.co.uk Wed Apr 23 08:56:33 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 23 Apr 2008 09:56:33 +0100 Subject: [Biopython-dev] BioSQL : BatchLoader In-Reply-To: <480EEED2.5080602@bham.ac.uk> References: <480E175C.9060301@bham.ac.uk> <320fb6e00804221038i273ad0dbw6ec7bf866cef190c@mail.gmail.com> <480EEED2.5080602@bham.ac.uk> Message-ID: <320fb6e00804230156p1c6f3b26qb1b12dcc1c1fb543@mail.gmail.com> > GenBank parsing with psyco - 1 minute, 20 seconds > GenBank parsing without psyco - 2 minutes, 15 seconds > > BioSQL.Loader/psyco - 4 minutes, 54 seconds > BioSQL.Loader without psyco - 6 minutes, 10 seconds > > BatchLoader/psyco - 1 minute, 38 seconds > BatchLoader without psyco - 2 minutes, 42 seconds That's impressive - you seem to have got the database side of things down to about 30 seconds; a fraction of the time to parse the GenBank file! Although, as you pointed out, there are a lot of provisos here. There are still some slow bits in the current GenBank parser which would be an obvious next target for you in your quest for speed. 
I did a little investigation a while ago, and concluded the parsing of the feature locations was the biggest bottleneck. However, this is a rather complicated lump of code, so its not such an easy task. I tried out a "hack" which special-cased the most common feature location types, with a fall back on the original parser, which gave much better performance. I didn't check this in as it made some already complex code WAY more complicated! > > > I reckon this could be faster, for example the sequence parsing could > > > be threaded on a multi-core machines. Did you mean simply one GenBank file per core, or something more complicated where parsing a single file is done using multiple cores? Peter From n.j.loman at bham.ac.uk Wed Apr 23 12:08:57 2008 From: n.j.loman at bham.ac.uk (Nick Loman) Date: Wed, 23 Apr 2008 13:08:57 +0100 Subject: [Biopython-dev] BioSQL : BatchLoader In-Reply-To: <320fb6e00804230156p1c6f3b26qb1b12dcc1c1fb543@mail.gmail.com> References: <480E175C.9060301@bham.ac.uk> <320fb6e00804221038i273ad0dbw6ec7bf866cef190c@mail.gmail.com> <480EEED2.5080602@bham.ac.uk> <320fb6e00804230156p1c6f3b26qb1b12dcc1c1fb543@mail.gmail.com> Message-ID: <480F26D9.8000405@bham.ac.uk> Peter wrote: > That's impressive - you seem to have got the database side of things > down to about 30 seconds; a fraction of the time to parse the GenBank > file! Although, as you pointed out, there are a lot of provisos here. Yep. Would it be helpful to do anything further with this code, i.e. put it into CVS and document on the Wiki, perhaps when its been a bit more tested? > There are still some slow bits in the current GenBank parser which > would be an obvious next target for you in your quest for speed. I > did a little investigation a while ago, and concluded the parsing of > the feature locations was the biggest bottleneck. However, this is a > rather complicated lump of code, so its not such an easy task. 
I > tried out a "hack" which special-cased the most common feature > location types, with a fall back on the original parser, which gave > much better performance. I didn't check this in as it made some > already complex code WAY more complicated! Aha, sounds good. I haven't profiled the Biopython code but I will check this. I'm dealing with bacterial sequences in the main which have mainly simple location identifiers, so there could well be some mileage here. >>>> I reckon this could be faster, for example the sequence parsing could >>>> be threaded on a multi-core machines. > > Did you mean simply one GenBank file per core, or something more > complicated where parsing a single file is done using multiple cores? I mean process one GenBank file per core. Locally that would mean on a 4-core machine you could have 3 parser threads working concurrently, each passing the generated Seq object to the Loader when read. Cheers Nick. From biopython at maubp.freeserve.co.uk Wed Apr 23 12:48:25 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 23 Apr 2008 13:48:25 +0100 Subject: [Biopython-dev] BioSQL : BatchLoader In-Reply-To: <480F26D9.8000405@bham.ac.uk> References: <480E175C.9060301@bham.ac.uk> <320fb6e00804221038i273ad0dbw6ec7bf866cef190c@mail.gmail.com> <480EEED2.5080602@bham.ac.uk> <320fb6e00804230156p1c6f3b26qb1b12dcc1c1fb543@mail.gmail.com> <480F26D9.8000405@bham.ac.uk> Message-ID: <320fb6e00804230548w71e44289q40387a131476ec4b@mail.gmail.com> > > That's impressive - you seem to have got the database side of things > > down to about 30 seconds; a fraction of the time to parse the GenBank > > file! Although, as you pointed out, there are a lot of provisos here. > > Yep. > > Would it be helpful to do anything further with this code, i.e. put it into > CVS and document on the Wiki, perhaps when its been a bit more tested? I'm not ready to put this into the main Biopython CVS. But by all means, add a new page to the wiki to describe your approach. 
Hopefully there are a few others who might be interested, and we'll see. > > There are still some slow bits in the current GenBank parser which > > would be an obvious next target for you in your quest for speed. I > > did a little investigation a while ago, and concluded the parsing of > > the feature locations was the biggest bottleneck. However, this is a > > rather complicated lump of code, so its not such an easy task. I > > tried out a "hack" which special-cased the most common feature > > location types, with a fall back on the original parser, which gave > > much better performance. I didn't check this in as it made some > > already complex code WAY more complicated! > > Aha, sounds good. I haven't profiled the Biopython code but I will check > this. I'm dealing with bacterial sequences in the main which have mainly > simple location identifiers, so there could well be some mileage here. Yes, I had been experimenting with bacterial sequences too. Beware that the location string in general can be extremely complex (and even reference other files by their identifier). A complete backwards compatible re-write of the location parsing (into sub-features) looked like a big job. That said, if you do run some profiling, you may spot some other "low hanging fruit" which would be easier to tackle. I haven't done any optimisation work since my original re-write of the GenBank parser back in August 2006 when I replaced the older slower Martel parser which didn't scale well with large input files. > I mean process one GenBank file per core. > > Locally that would mean on a 4-core machine you could have 3 parser threads > working concurrently, each passing the generated Seq object to the Loader > when read. I see - that means there is only one thread/job writing to the database, which keeps that side of things thread-safe. 
To be honest, unless you are trying to import several hundred bacterial genomes into BioSQL, I don't think this level of complexity is a worthwhile pay-off. Right now, I would target the GenBank parsing itself (which would be useful outside the task of loading sequences into BioSQL). Something else you may want to consider is timing the BioPerl scripts for importing a GenBank file into BioSQL. There will probably be some minor differences in their interpretation of the data and exactly how they store it, but it would be a useful benchmark. Peter From n.j.loman at bham.ac.uk Wed Apr 23 15:46:35 2008 From: n.j.loman at bham.ac.uk (Nick Loman) Date: Wed, 23 Apr 2008 16:46:35 +0100 Subject: [Biopython-dev] BioSQL : BatchLoader In-Reply-To: <320fb6e00804230548w71e44289q40387a131476ec4b@mail.gmail.com> References: <480E175C.9060301@bham.ac.uk> <320fb6e00804221038i273ad0dbw6ec7bf866cef190c@mail.gmail.com> <480EEED2.5080602@bham.ac.uk> <320fb6e00804230156p1c6f3b26qb1b12dcc1c1fb543@mail.gmail.com> <480F26D9.8000405@bham.ac.uk> <320fb6e00804230548w71e44289q40387a131476ec4b@mail.gmail.com> Message-ID: <480F59DB.70604@bham.ac.uk> Peter wrote: > I'm not ready to put this into the main Biopython CVS. But by all > means, add a new page to the wiki to describe your approach. > Hopefully there are a few others who might be interested, and we'll > see. Okay! >> I mean process one GenBank file per core. >> >> Locally that would mean on a 4-core machine you could have 3 parser threads >> working concurrently, each passing the generated Seq object to the Loader >> when read. > > I see - that means there is only one thread/job writing to the > database, which keeps that side of things thread-safe. To be honest, > unless you are trying to import several hundred bacterial genomes into > BioSQL, I don't think this level of complexity is a worthwhile > pay-off. Right now, I would target the GenBank parsing itself (which > would be useful outside the task of loading sequences into BioSQL).
I agree, I will take a look at GenBank parsing next, and then concurrency after that. The reason I'm doing this is that I need to import all 1686 complete/incomplete bacterial genomes in RefSeq - and plenty more besides! > Something else you may want to consider is timing the BioPerl scripts > for importing a GenBank file into BioSQL. There will probably be some > minor differences in their interpretation of the data and exactly how they > store it, but it would be a useful benchmark. I did this, it was incredibly slow, at least 5x slower. We've been using BioPerl for some time. I realised I needed a faster script so I investigated the same approach with BioPerl but I thought I'd be able to hack the Biopython stuff a bit faster as the BioSQL stuff seems a bit less complex. Plus Python is easy to read ;) Cheers Nick. From bugzilla-daemon at portal.open-bio.org Thu Apr 24 14:52:28 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Apr 2008 10:52:28 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804241452.m3OEqS2w005327@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #29 from ericgibert at yahoo.fr 2008-04-24 10:52 EST ------- Created an attachment (id=914) --> (http://bugzilla.open-bio.org/attachment.cgi?id=914&action=view) Usage of Bio/Entrez/Taxon.py parse to load a SeqRecord's taxonomy Modification essentially in the function DatabaseLoader._get_taxon_id(self, record). Note the new optional parameter in DatabaseLoader.__init__(self, adaptor, dbid, fetch_NCBI_taxonomy=False) Attention: this parameter must be collected from BioSeqDatabase.load(self, record_iterator, fetch_NCBI_taxonomy=False) in BioSQL/BioSeqDatabase.py: See next attachment for this modification.
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Apr 24 14:54:38 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Apr 2008 10:54:38 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804241454.m3OEscMM005448@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ------- Comment #30 from ericgibert at yahoo.fr 2008-04-24 10:54 EST ------- Created an attachment (id=915) --> (http://bugzilla.open-bio.org/attachment.cgi?id=915&action=view) Addition of 'fetch_NCBI_taxonomy' as noptional parameter to DatabaseLoader.load() Extra parameter for BioSeqDatabase.load() and pass it to DatabaseLoader (where it will be a property) db_loader = Loader.DatabaseLoader(self.adaptor, self.dbid, fetch_NCBI_taxonomy) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Apr 24 14:55:18 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Apr 2008 10:55:18 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804241455.m3OEtImJ005511@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ericgibert at yahoo.fr changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #915|Addition of |Addition of description|'fetch_NCBI_taxonomy' as |'fetch_NCBI_taxonomy' as |noptional parameter to |optional parameter to |DatabaseLoader.load() |DatabaseLoader.load() -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Apr 24 14:55:44 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 24 Apr 2008 10:55:44 -0400 Subject: [Biopython-dev] [Bug 2475] BioSQL.Loader should reuse existing taxon entries in lineage In-Reply-To: Message-ID: <200804241455.m3OEtidF005561@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 ericgibert at yahoo.fr changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #914|Usage of Bio/Entrez/Taxon.py|Usage of Bio/Entrez/Taxon.py description|parse to load a SeqRecord's |parser to load a SeqRecord's |taxonomy |taxonomy -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Apr 25 23:18:11 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 25 Apr 2008 19:18:11 -0400 Subject: [Biopython-dev] [Bug 2480] Local BLAST fails: Spaces in Windows file-path values In-Reply-To: Message-ID: <200804252318.m3PNIBg9008981@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2480 drpatnaik at yahoo.com changed: What |Removed |Added ---------------------------------------------------------------------------- Severity|normal |blocker -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Apr 26 03:47:28 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 25 Apr 2008 23:47:28 -0400 Subject: [Biopython-dev] [Bug 2494] New: _retrieve_taxon in BioSQL.py needs urgent optimization Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2494 Summary: _retrieve_taxon in BioSQL.py needs urgent optimization Product: Biopython Version: Not Applicable Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: BioSQL AssignedTo: biopython-dev at biopython.org ReportedBy: ericgibert at yahoo.fr I ran the Perl script to get the BioSQL tables 'taxon' and 'taxon_name' updated. Taxon contains 419036 rows and taxon_name contains 584058 rows. To retrieve the taxonomy of a DBSeqRecord, the function DBSeq._retrieve_taxon() uses a SQL based on the nested sets defined by left and right values. This approach is extremely time consuming once the tables grow large. When the issue is a bottom-up search, in this case "all taxon parent of this species", it is better to use the links child/parent based on parent_taxon_id field. Please refer to next post with attached script demonstrating my point. 
Eric

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Sat Apr 26 03:54:23 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 25 Apr 2008 23:54:23 -0400
Subject: [Biopython-dev] [Bug 2494] _retrieve_taxon in BioSQL.py needs urgent optimization
In-Reply-To:
Message-ID: <200804260354.m3Q3sNhM019804@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2494

------- Comment #1 from ericgibert at yahoo.fr 2008-04-25 23:54 EST -------
Created an attachment (id=916)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=916&action=view)
script timing the current SQL and proposed bottom-up 'loop' implementation

results obtained on my dual core PC (Fedora 7 64 bit):
(resulting lists are truncated to help the reading)

go for it:
getTaxonSQLsimplex took 370.300 ms
[1L, 2759L, 6072L, ...... 229390L, 229391L]
getTaxonSQL took 6846.810 ms
['Eukaryota', 'Metazoa', '...... 'Nannophya', 'Nannophya pygmaea']
getTaxonSQLall took 6772.037 ms
['root', 'cellular organisms', '... 'Odonata', 'Anisoptera/Anisozygoptera group''Nannophya', 'Nannophya pygmaea']
getTaxonLoop took 14.559 ms
['cellular organisms',... 'Nannophya', 'Nannophya pygmaea']

Conclusion: many runs have shown that the Loop function is always under 15ms
while the current SQL will be more than 6500ms.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
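The bottom-up idea described in this bug report (follow parent_taxon_id from the species up to the root instead of querying the nested-set left/right values) can be sketched roughly as follows. This is only an illustration, not the attached script: it uses an in-memory SQLite database with a cut-down, assumed version of the BioSQL 'taxon' and 'taxon_name' tables, a toy six-node lineage, and the hypothetical helper names build_toy_db and lineage_bottom_up.

```python
import sqlite3


def build_toy_db():
    """Create a tiny in-memory stand-in for the BioSQL taxon tables.

    Assumption: only the columns needed for the parent walk are modelled;
    the real schema has more fields (ranks, left/right values, and so on).
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE taxon (
            taxon_id INTEGER PRIMARY KEY,
            parent_taxon_id INTEGER
        );
        CREATE TABLE taxon_name (
            taxon_id INTEGER,
            name TEXT,
            name_class TEXT
        );
        """
    )
    # Toy lineage, heavily shortened compared with the real NCBI taxonomy.
    rows = [
        (1, None, "root"),
        (2, 1, "cellular organisms"),
        (3, 2, "Eukaryota"),
        (4, 3, "Metazoa"),
        (5, 4, "Nannophya"),
        (6, 5, "Nannophya pygmaea"),
    ]
    for taxon_id, parent_id, name in rows:
        conn.execute("INSERT INTO taxon VALUES (?, ?)", (taxon_id, parent_id))
        conn.execute(
            "INSERT INTO taxon_name VALUES (?, ?, 'scientific name')",
            (taxon_id, name),
        )
    return conn


def lineage_bottom_up(conn, taxon_id):
    """Return scientific names from the root down to taxon_id.

    One small indexed lookup per level, following parent_taxon_id.
    """
    names = []
    seen = set()  # guards against a root row that lists itself as parent
    while taxon_id is not None and taxon_id not in seen:
        seen.add(taxon_id)
        row = conn.execute(
            "SELECT n.name, t.parent_taxon_id "
            "FROM taxon t JOIN taxon_name n ON n.taxon_id = t.taxon_id "
            "WHERE t.taxon_id = ? AND n.name_class = 'scientific name'",
            (taxon_id,),
        ).fetchone()
        if row is None:
            break
        name, taxon_id = row
        names.append(name)
    names.reverse()  # collected leaf-first, reported root-first
    return names


if __name__ == "__main__":
    print(lineage_bottom_up(build_toy_db(), 6))
```

Each level costs one primary-key lookup, so the number of queries grows with the depth of the lineage rather than with the size of the taxon table, which is consistent with the timings reported above.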
From bugzilla-daemon at portal.open-bio.org Sat Apr 26 10:29:25 2008 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 26 Apr 2008 06:29:25 -0400 Subject: [Biopython-dev] [Bug 2494] _retrieve_taxon in BioSQL.py needs urgent optimization In-Reply-To: Message-ID: <200804261029.m3QATPLW003286@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2494 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2008-04-26 06:29 EST ------- There is a small risk that your numbers will be misleading when applied to other databases (i.e. mysql versus postgres). Other than that, using just the parent id is probably a much better idea (especially given some of the proposals on Bug 2475 about not writing the left/right values). Have you ever used the "diff" command-line tool to produce a patch file? e.g. diff old_version.py new_version.py > patch.txt Or, if you are working on a CVS checkout, modify the local file and then: cvs diff changed_file.py > patch.txt Also read up on the "patch" command for applying the patch to update an unchanged file. If you are on a Unix style platform, these are usually installed already. For Windows I use cygwin's diff command but there are probably other options. There are several advantages, including the fact that patches are smaller. For initial code review, they also highlight the area changed. Another big advantage is if CVS has been updated in the meantime, a patch can often still be applied automatically. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Mon Apr 28 12:42:05 2008
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 28 Apr 2008 08:42:05 -0400
Subject: [Biopython-dev] [Bug 2495] New: parse element symbols for ATOM/HETATM records (Bio.PDB.PDBParser)
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2495

Summary: parse element symbols for ATOM/HETATM records (Bio.PDB.PDBParser)
Product: Biopython
Version: 1.45
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: macrozhu+biopy at gmail.com

Hi, the current Bio.PDB.PDBParser does not parse columns 77-78 of ATOM records in PDB files, where element symbols are (usually) stored for ATOM. We suggest that BioPython parse this information in the next version. The reasons are given as follows:

1. The current remediated PDB format requires these symbols to be always present ( http://www.wwpdb.org/documentation/format3.1-20080211.pdf ), though in old PDB files (v2.3), these symbols are sometimes missing.

2. In some cases it is not straightforward, if not impossible, to recognize hydrogen atoms by their identifiers in the remediated PDB files. e.g. in 1AWW,

ATOM    378 HD11 LEU A  25      46.755  -3.858   0.453  1.00  0.00           H
ATOM    379 HD12 LEU A  25      47.178  -2.160   0.234  1.00  0.00           H
ATOM    380 HD13 LEU A  25      47.054  -3.226  -1.165  1.00  0.00           H
ATOM    381 HD21 LEU A  25      49.453  -1.483   0.307  1.00  0.00           H
ATOM    382 HD22 LEU A  25      50.714  -2.537  -0.327  1.00  0.00           H
ATOM    383 HD23 LEU A  25      49.413  -1.984  -1.381  1.00  0.00           H

In this PDB entry, chemical symbols (H) are not right justified in columns 13-14 for hydrogen identifiers as they are for other elements. A bit of extra work is required to figure it out. What's more, sometimes it's even impossible to distinguish hydrogen from mercury without columns 77-78. From the PDB entry format description version 2.1:

"Hydrogen naming sometimes conflicts with IUPAC conventions. For example, a hydrogen named HG11 in columns 13 - 16 is differentiated from a mercury atom by the element symbol in columns 77 - 78. Columns 13 - 16 present a unique name for each atom."

Therefore we strongly suggest that PDBParser cover columns 77-78 for ATOM/HETATM records. We have looked at the relevant code and it seems three files (Atom.py, PDBParser.py, StructureBuilder.py) need to be revised slightly to integrate this update:

1). in Atom.py CVS Revision 1.18

line 17: add one parameter "element" to the function Atom::__init__(...)
    def __init__(self, name, coord, bfactor, occupancy, altloc, fullname, serial_number, element):

line 61: add line
    self.element = element

add a set method:
    def set_element(self, element):
        self.element = element

add a public method:
    def get_element(self):
        return self.element

2). in PDBParser.py CVS Revision 1.20

line 161: add one line to parse element symbol in function PDBParser::_parse_coordinates(self, coords_trailer)
    element=line[76:78].strip()

line 182: add one more parameter to init_atom():
    structure_builder.init_atom(name, coord, bfactor, occupancy, altloc, fullname, serial_number, element)

3). in StructureBuilder.py CVS Revision 1.16

line 158: add one parameter "element" to the function StructureBuilder::init_atom(self, name, coord, b_factor, occupancy, altloc, fullname, serial_number=None, element='')

line 190: add "element" to the initialization of Atom instance.
    atom=self.atom=myAtom(name, coord, b_factor, occupancy, altloc, fullname, serial_number, element)

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
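As a rough, standalone illustration of the column slicing proposed in this report (not the Bio.PDB code itself), the element symbol can be read with the same line[76:78].strip() expression. The helper names make_atom_line and parse_element are hypothetical and exist only for this sketch; the record is built from a format string so that the fixed PDB columns are guaranteed to line up.

```python
def make_atom_line(serial, name, resname, chain, resseq, x, y, z,
                   occupancy, bfactor, element, record="ATOM"):
    """Build a fixed-width PDB coordinate record (hypothetical helper).

    Columns: 1-6 record name, 7-11 serial, 13-16 atom name, 18-20 residue
    name, 22 chain, 23-26 residue number, 31-54 x/y/z, 55-60 occupancy,
    61-66 B-factor, 77-78 element symbol (right justified).
    """
    return "%-6s%5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f          %2s" % (
        record, serial, name, resname, chain, resseq,
        x, y, z, occupancy, bfactor, element,
    )


def parse_element(line):
    """Return the element symbol from columns 77-78, or '' if unavailable.

    Old-style records that stop before column 77 simply give ''.
    """
    if not line.startswith(("ATOM", "HETATM")):
        return ""
    return line[76:78].strip()


if __name__ == "__main__":
    # First hydrogen from the 1AWW excerpt quoted in the report.
    line = make_atom_line(378, "HD11", "LEU", "A", 25,
                          46.755, -3.858, 0.453, 1.00, 0.00, "H")
    print(repr(line[12:16]), repr(parse_element(line)))
```

With the element column available, an atom named HD11 or HG11 can be identified as hydrogen without guessing from the atom name alone.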