From biopython at maubp.freeserve.co.uk Mon Jun 1 06:15:03 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 1 Jun 2009 11:15:03 +0100
Subject: [Biopython-dev] More SwissProt inconsistencies
In-Reply-To: <880385.97797.qm@web62401.mail.re1.yahoo.com>
References: <880385.97797.qm@web62401.mail.re1.yahoo.com>
Message-ID: <320fb6e00906010315k5e6ed8bdm5327e51eec3ac51e@mail.gmail.com>

On Sat, May 30, 2009 at 10:37 AM, Michiel de Hoon wrote:
> 1) A multi-line author list such as the following:
> ...
> is stored without newlines by Bio.SeqIO:
> ...
> but with newlines by Bio.SwissProt:
>
> To me, the Bio.SeqIO approach seems more reasonable. I think we should
> add a space though at places where there is a newline in the file.
>
> The same happens for multiline RL such as
>
> RL   (In) Baker M.J., Crush J.R., Humphreys L.R. (eds.);
> RL   Proceedings of the XVII international grassland congress,
> RL   pp.2:1033-1034, Dunmore Press, Palmerston North (1993).
>
> and for multiline RT lines such as
>
> RT   "Genome of the host-cell transforming parasite Theileria annulata
> RT   compared with T. parva.";
>
> This is stored by Bio.SeqIO as
>
> '"Genome of the host-cell transforming parasite Theileria annulatacompared with T. parva.";'
>
> and by Bio.SwissProt as
>
> '"Genome of the host-cell transforming parasite Theileria annulata\ncompared with T. parva.";'
>
> whereas I think that both should be stored as
>
> '"Genome of the host-cell transforming parasite Theileria annulata compared with T. parva.";'

I agree with you - the missing spaces when parsed with Bio.SeqIO are a bug
and should be fixed.

> 2) Comments in a references such as the following:
> RC   STRAIN=cv. VF36; TISSUE=Anther;
> are stored as a single string by Bio.SeqIO:
> >>> seq_record.annotations['references'][i].comment
> 'STRAIN=cv. VF36; TISSUE=Anther;'
> but as a list of (key, value) pairs by Bio.SwissProt:
> [('STRAIN', 'cv. 
VF36'), ('TISSUE', 'Anther')]
> Whereas I think both are reasonable, Bio.SeqIO drops the space between
> two (key, value) pairs if they are on two separate lines:
> RC   STRAIN=C57BL/6J;
> RC   TISSUE=Bone marrow, Embryo, Kidney, Liver, Thymus, and Visual cortex;
> is stored as
> >>> seq_record.annotations['references'][i].comment
> 'STRAIN=C57BL/6J;TISSUE=Bone marrow, Embryo, Kidney, Liver, Thymus, and Visual cortex;'
> I think we should add a space here, or just store these as (key, value) pairs as Bio.SwissProt is doing.
>
> Any objections or comments?

Maybe using a list of (key, value) pairs is more sensible, but it would
probably break the BioSQL loader (and be inconsistent with reference
objects from the GenBank/EMBL parser). It would be reasonable to add the
space. This is a simple change which shouldn't hurt anything.

Peter

From bugzilla-daemon at portal.open-bio.org Mon Jun 1 06:19:04 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 1 Jun 2009 06:19:04 -0400
Subject: [Biopython-dev] [Bug 2841] SeqFeature constructor ignores qualifiers and sub_features arguments
In-Reply-To:
Message-ID: <200906011019.n51AJ4vf018935@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2841

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |chapmanb at 50mail.com

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-01 06:19 EST -------
Good point Nick. That's a change Brad Chapman (CC'd) made a long time ago -
I'd be impressed if he could recall the details now.

How about we make the code issue a deprecation warning if these "dummy"
arguments are used?

Peter

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
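
[Editorial sketch, not part of the original thread] The deprecation warning
suggested in comment #1 could look roughly like the following. The class
name Feature, its argument list and the warning text are all assumptions
made for illustration; this is not Biopython's SeqFeature code, only the
general pattern of warning when ignored "dummy" arguments are supplied.

```python
import warnings


class Feature(object):
    """Hypothetical stand-in for a feature class (not Bio.SeqFeature)."""

    def __init__(self, location=None, type="", qualifiers=None,
                 sub_features=None):
        self.location = location
        self.type = type
        # Assumption for this sketch: the two arguments below were being
        # ignored, so tell the caller instead of discarding them silently.
        if qualifiers is not None or sub_features is not None:
            warnings.warn(
                "The qualifiers and sub_features arguments are ignored by "
                "this constructor; set the attributes after creation.",
                DeprecationWarning,
                stacklevel=2,
            )
        self.qualifiers = {}
        self.sub_features = []
```

Using stacklevel=2 makes the warning point at the calling code rather than
at the constructor, which is usually what the person fixing the call needs.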

From bugzilla-daemon at portal.open-bio.org Mon Jun 1 11:18:36 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 1 Jun 2009 11:18:36 -0400
Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from BioSQL, trying to save it to another database fails with loader db_loader.load_seqrecord in _load_reference
In-Reply-To:
Message-ID: <200906011518.n51FIad7009515@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2840

------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-01 11:18 EST -------
Hi David,

I was able to reproduce this problem. When working on Bug 2838, as my test
case I was using just the file cor6_6.gb which by chance has simple reference
locations - and that worked. I have now tested with GI 28804743. Using some
of the other GenBank files in our test suite also shows the reference
location problem from BioSQL/Loader.py function _load_reference:

ValueError: invalid literal for int() with base 10: 'None'

This is now fixed in CVS, plus there are now additional unit tests. For the
fix, I have used a slight variation of Cymon's patch. Does this look sensible
Cymon?

BioSQL/BioSeq.py revision: 1.37
Tests/test_BioSQL.py revision: 1.39
Tests/seq_tests_common.py revision: 1.2

If you could retest with a clean checkout from CVS/github, to confirm the
problem is fixed, that would be great David.

Note - currently in BioSQL we only store one reference location, while GenBank
files can have a single reference covering multiple regions of the record.
This is a limitation of the current BioSQL schema (although it would be
interesting to see how BioPerl deals with this).

Note - there are four known failures in test_BioSQL.py right now: a mixed
strand feature in NC_000932.gb (which triggers two failures), the project
cross reference in NC_005816.gb, and a sub-feature location reference in
one_of.gb -- these are all unrelated to this issue (Bug 2840).

Thanks,

Peter

From tiagoantao at gmail.com Mon Jun 1 14:14:52 2009
From: tiagoantao at gmail.com (Tiago Antão)
Date: Mon, 1 Jun 2009 19:14:52 +0100
Subject: [Biopython-dev] Biopython BOF at BOSC
Message-ID: <6d941f120906011114n13c4936ak15d1859c2294be4@mail.gmail.com>

Hi,

Is there a final schedule for the Biopython BOF? I am trying to book
planes and hotel and it would be nice to have a ballpark idea if
possible.

Many thanks,
Tiago

-- 
"A man who dares to waste one hour of time has not discovered the value of life"
- Charles Darwin

From p.j.a.cock at googlemail.com Mon Jun 1 15:19:11 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 1 Jun 2009 20:19:11 +0100
Subject: [Biopython-dev] Biopython BOF at BOSC
In-Reply-To: <6d941f120906011114n13c4936ak15d1859c2294be4@mail.gmail.com>
References: <6d941f120906011114n13c4936ak15d1859c2294be4@mail.gmail.com>
Message-ID: <320fb6e00906011219u6a549153o28c0d2b4bb4ad05@mail.gmail.com>

2009/6/1 Tiago Antão:
> Hi,
>
> Is there a final schedule for the Biopython BOF? I am trying to book
> planes and hotel and it would be nice to have a ballpark idea if
> possible.
>
> Many thanks,
> Tiago

According to http://www.open-bio.org/wiki/BOSC_2009 the schedule will
be posted today (1 June 2009), but thus far I don't see any timetable
with the BOF sessions included. See
http://www.open-bio.org/wiki/BOSC_2009_Schedule which does have the
talk titles.

Based on last year's page
http://www.open-bio.org/wiki/BOSC_2008_schedule and my emails with the
organisers I expect the BOF sessions to again be on both the Saturday
and the Sunday from about 4:30 till about 6:00 (or for as long as we
can keep going?). So, if you are going to fly back Sunday night, try
and stay till at least 6pm?

I have signed up to the "Discover Stockholm Orienteering Event &
Icebreaker" at 6pm on the Saturday. With hindsight this might have
been a mistake (as it may oblige me to leave a coding session
prematurely), but on the other hand, I expect we'll all need a break
at that stage anyway. ;)

Peter

From bugzilla-daemon at portal.open-bio.org Mon Jun 1 15:45:56 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 1 Jun 2009 15:45:56 -0400
Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from BioSQL, trying to save it to another database fails with loader db_loader.load_seqrecord in _load_reference
In-Reply-To:
Message-ID: <200906011945.n51JjuB3001445@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2840

------- Comment #7 from cymon.cox at gmail.com 2009-06-01 15:45 EST -------
(In reply to comment #6)
> This is now fixed in CVS, plus there are now additional unit tests. For the
> fix, I have used a slight variation of Cymon's patch. Does this look sensible
> Cymon?

Works for me.

C.

From chapmanb at 50mail.com Mon Jun 1 18:14:52 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 1 Jun 2009 18:14:52 -0400
Subject: [Biopython-dev] Biopython BOF at BOSC
In-Reply-To: <320fb6e00906011219u6a549153o28c0d2b4bb4ad05@mail.gmail.com>
References: <6d941f120906011114n13c4936ak15d1859c2294be4@mail.gmail.com>
 <320fb6e00906011219u6a549153o28c0d2b4bb4ad05@mail.gmail.com>
Message-ID: <20090601221452.GG15913@sobchak.mgh.harvard.edu>

Peter and Tiago;

> > Is there a final schedule for the Biopython BOF?

> According to http://www.open-bio.org/wiki/BOSC_2009 the schedule will
> be posted today (1 June 2009), but thus far I don't see any timetable
> with the BOF sessions included. 
See
> http://www.open-bio.org/wiki/BOSC_2009_Schedule which does have the
> talk titles.
>
> Based on last year's page
> http://www.open-bio.org/wiki/BOSC_2008_schedule and my emails with the
> organisers I expect the BOF sessions to again be on both the Saturday
> and the Sunday from about 4:30 till about 6:00 (or for as long as we
> can keep going?). So, if you are going to fly back Sunday night, try
> and stay till at least 6pm?

I arrive in Sweden Thursday morning and leave Tuesday morning. Beyond
the BOSC talks and BoF sessions Peter mentioned, I am up for some coding
any other time people will be around. I suspect the best turnouts
will be Saturday and Sunday evenings when people are around and can
be kept motivated from a day of discussions.

Brad

From bugzilla-daemon at portal.open-bio.org Mon Jun 1 18:42:05 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 1 Jun 2009 18:42:05 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id
In-Reply-To:
Message-ID: <200906012242.n51Mg5SO014894@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833

------- Comment #15 from cymon.cox at gmail.com 2009-06-01 18:42 EST -------
Here's an attempt to circumvent the RULES in the BioSQL schema on PostgreSQL;
it checks for the presence of the RULES, and if they are present ensures
that the record is injected into the bioentry table, else raises an
IntegrityError.

A further problem arose with one of the unit tests:

======================================================================
FAIL: Make sure can't reimport existing records.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 474, in test_reload
    err.__class__.__name__ + "\n" + str(err))
AssertionError: OperationalError
currval of sequence "bioentry_pk_seq" is not yet defined in this session

----------------------------------------------------------------------

This was seemingly unrelated to the RULES issue and is a consequence of how
PostgreSQL handles sequences in sessions. A new session was started (i.e. a
new suite in the unit test) and the record failed to inject: the RULES
returned INSERT 0 0, so bioentry_pk_seq was not incremented. When
adaptor.last_id was called (it actually looks up the currval of the sequence)
it raised an OperationalError, because nextval() had not been called so far
in the session. Adding a manual call to nextval() in the unit test before
trying the load ensures that the unit test fails where expected. (At least I
think that is what is happening.)

diff --git a/BioSQL/BioSeqDatabase.py b/BioSQL/BioSeqDatabase.py
index 3f58e9c..89d0d99 100644
--- a/BioSQL/BioSeqDatabase.py
+++ b/BioSQL/BioSeqDatabase.py
@@ -330,6 +332,14 @@ class BioSeqDatabase:
         self.adaptor = adaptor
         self.name = name
         self.dbid = self.adaptor.fetch_dbid_by_dbname(name)
+
+        ##Test for presence of RULES in schema
+        self.postgres_rules_present= False
+        if "psycopg" in self.adaptor.conn.__class__.__module__:
+            sql = r"SELECT ev_class FROM pg_rewrite WHERE rulename='rule_bioentry_i1'"
+            if self.adaptor.execute_and_fetchall(sql):
+                self.postgres_rules_present = True
+
     def __repr__(self):
         return "BioSeqDatabase(%r, %r)" % (self.adaptor, self.name)
 
@@ -439,5 +449,14 @@ class BioSeqDatabase:
         num_records = 0
         for cur_record in record_iterator :
             num_records += 1
+            if self.postgres_rules_present:
+                self.adaptor.execute("SELECT count(bioentry_id) FROM bioentry")
+                curr_val = self.adaptor.cursor.fetchone()[0]
             db_loader.load_seqrecord(cur_record)
+            if self.postgres_rules_present:
+                self.adaptor.execute("SELECT count(bioentry_id) FROM bioentry")
+                after_val = self.adaptor.cursor.fetchone()[0]
+                if curr_val == after_val:
+                    raise self.adaptor.conn.IntegrityError("Duplicate record "
+                                     "detected: record has not been inserted")
         return num_records
diff --git a/Tests/test_BioSQL.py b/Tests/test_BioSQL.py
index 334fe52..bf17ba7
--- a/Tests/test_BioSQL.py
+++ b/Tests/test_BioSQL.py
@@ -581,6 +581,13 @@ class InDepthLoadTest(unittest.TestCase):
         self.assertEqual(db_record.name, record.name)
         self.assertEqual(db_record.description, record.description)
         self.assertEqual(str(db_record.seq), str(record.seq))
+
+        #We have to manually advance the sequence because when the repeat load
+        #of the record fails and returns INSERT 0,0 because of the RULES the call
+        #to get the last_id causes an OperationalError because the curr_val hasnt
+        #been defined for the session ie. next_val() hasnt been called
+        self.db.adaptor.execute(r"select nextval('bioentry_pk_seq')")
+
         #Good... 
now try reloading it!
         try :
             count = self.db.load([record])

Yeah, it's nasty, but I thought I'd put it out there for consideration...

cymon at gyra:~/git/github-master/Tests$ python ./test_BioSQL.py
GenBank file to BioSQL and back to a GenBank file, NC_000932. ... FAIL
GenBank file to BioSQL and back to a GenBank file, NC_005816. ... FAIL
GenBank file to BioSQL and back to a GenBank file, NT_019265. ... ok
GenBank file to BioSQL and back to a GenBank file, arab1. ... ok
GenBank file to BioSQL and back to a GenBank file, cor6_6. ... ok
GenBank file to BioSQL and back to a GenBank file, noref. ... ok
GenBank file to BioSQL and back to a GenBank file, one_of. ... FAIL
GenBank file to BioSQL and back to a GenBank file, protein_refseq2. ... ok
Make sure can't import records with same ID (in one go). ... ok
Make sure can't import a single record twice (in one go). ... ok
Make sure can't import a single record twice (in steps). ... ok
Make sure all records are correctly loaded. ... ok
Make sure can't reimport existing records. ... ok
Indepth check that SeqFeatures are transmitted through the db. ... ok
Make sure can load record into another namespace. ... ok
Load SeqRecord objects into a BioSQL database. ... ok
Get a list of all items in the database. ... ok
Test retrieval of items using various ids. ... ok
Check can add DBSeq objects together. ... ok
Check can turn a DBSeq object into a Seq or MutableSeq. ... ok
Make sure Seqs from BioSQL implement the right interface. ... ok
Check SeqFeatures of a sequence. ... ok
Make sure SeqRecords from BioSQL implement the right interface. ... ok
Check that slices of sequences are retrieved properly. ... ok
GenBank file to BioSQL, then again to a new namespace, NC_000932. ... FAIL
GenBank file to BioSQL, then again to a new namespace, NC_005816. ... ok
GenBank file to BioSQL, then again to a new namespace, NT_019265. ... ok
GenBank file to BioSQL, then again to a new namespace, arab1. ... 
ok
GenBank file to BioSQL, then again to a new namespace, cor6_6. ... ok
GenBank file to BioSQL, then again to a new namespace, noref. ... ok
GenBank file to BioSQL, then again to a new namespace, one_of. ... ok
GenBank file to BioSQL, then again to a new namespace, protein_refseq2. ... ok

Cheers, C.

From bugzilla-daemon at portal.open-bio.org Mon Jun 1 19:08:06 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 1 Jun 2009 19:08:06 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id
In-Reply-To:
Message-ID: <200906012308.n51N86DB016766@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833

------- Comment #16 from cymon.cox at gmail.com 2009-06-01 19:08 EST -------
(In reply to comment #15)
> @@ -439,5 +449,14 @@ class BioSeqDatabase:
>          num_records = 0
>          for cur_record in record_iterator :
>              num_records += 1
> +            if self.postgres_rules_present:
> +                self.adaptor.execute("SELECT count(bioentry_id) FROM
> bioentry")
> +                curr_val = self.adaptor.cursor.fetchone()[0]
>              db_loader.load_seqrecord(cur_record)
> +            if self.postgres_rules_present:
> +                self.adaptor.execute("SELECT count(bioentry_id) FROM
> bioentry")
> +                after_val = self.adaptor.cursor.fetchone()[0]
> +                if curr_val == after_val:
> +                    raise self.adaptor.conn.IntegrityError("Duplicate record "
> +                                     "detected: record has not been inserted")
>          return num_records

Actually, I don't think this is going to solve the original problem because
the SeqFeatures of the second record will still be inserted over the first
record, before the IntegrityError is raised.

So the check needs to surround line 50 in Loader.py:

    bioentry_id = self._load_bioentry_table(record)

which means passing a postgres_rules_present parameter to
Loader.load_seqrecord().

C.

From bugzilla-daemon at portal.open-bio.org Tue Jun 2 06:51:34 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 2 Jun 2009 06:51:34 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id
In-Reply-To:
Message-ID: <200906021051.n52ApY7j030845@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833

------- Comment #17 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-02 06:51 EST -------
Cymon (and Andrea),

How do you feel about these pragmatic (short term) options:

Option (a): At import time, if the rules are present issue a warning,
suggesting the user fix their database schema.

Option (b): If the user tries to load a record and the rules are present,
raise an exception suggesting the user fix their database schema.

Either option (a) or (b) would be a problem for anyone trying to use BioPerl
and Biopython with a PostgreSQL BioSQL database - but this is still an
improvement on the current situation.

However, I'm pleased to see you (Cymon) have made some progress towards an
ideal situation (until Bug 2839 is fixed), where Biopython could cope with
the "evil" rules in the default PostgreSQL schema:

Option (c): Even if the rules are present, and a key clash would happen,
issue an IntegrityError.

The details of how we might do this remain to be resolved... The idea of
checking the bioentry table count imposes a performance penalty from this
extra query, but also, as you note in comment 16, in some ways this is "too
late" (a transaction rollback is required).

How do you feel about this simplistic solution?: if the rules are present,
before loading a new record, do a query to check to make sure there isn't a
duplicate already present, and if there is raise an IntegrityError.

There is still a performance penalty from this extra query, but it avoids any
issues with having to roll back partial transactions.

Peter

From tiagoantao at gmail.com Tue Jun 2 10:22:37 2009
From: tiagoantao at gmail.com (Tiago Antão)
Date: Tue, 2 Jun 2009 15:22:37 +0100
Subject: [Biopython-dev] Biopython BOF at BOSC
In-Reply-To: <20090601221452.GG15913@sobchak.mgh.harvard.edu>
References: <6d941f120906011114n13c4936ak15d1859c2294be4@mail.gmail.com>
 <320fb6e00906011219u6a549153o28c0d2b4bb4ad05@mail.gmail.com>
 <20090601221452.GG15913@sobchak.mgh.harvard.edu>
Message-ID: <6d941f120906020722g7d0b2963ka1d6fe42ac8b097f@mail.gmail.com>

Hi,

On Mon, Jun 1, 2009 at 11:14 PM, Brad Chapman wrote:
> I arrive in Sweden Thursday morning and leave Tuesday morning. Beyond
> the BOSC talks and BoF sessions Peter mentioned, I am up for some coding
> any other time people will be around. I suspect the best turnouts
> will be Saturday and Sunday evenings when people are around and can

Because of a meeting on Thursday (and Friday, which I am skipping) I
can only arrive on Friday around 6pm. I am staying at the Rica Talk
which seems to be in the same place as the conference. If any of you
are around that evening and want to talk/code/have dinner/drink, send
me an email.
I leave on Tuesday. Though I am not at ISMB, I am available to
code/discuss/drink/whatever on Monday.

Regards,
Tiago

From p.j.a.cock at googlemail.com Tue Jun 2 11:41:35 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 2 Jun 2009 16:41:35 +0100
Subject: [Biopython-dev] Biopython BOF at BOSC
In-Reply-To: <6d941f120906020722g7d0b2963ka1d6fe42ac8b097f@mail.gmail.com>
References: <6d941f120906011114n13c4936ak15d1859c2294be4@mail.gmail.com>
 <320fb6e00906011219u6a549153o28c0d2b4bb4ad05@mail.gmail.com>
 <20090601221452.GG15913@sobchak.mgh.harvard.edu>
 <6d941f120906020722g7d0b2963ka1d6fe42ac8b097f@mail.gmail.com>
Message-ID: <320fb6e00906020841p4ecd5ac9t85b004848deddbed@mail.gmail.com>

2009/6/2 Tiago Antão:
> Hi,
>
> On Mon, Jun 1, 2009 at 11:14 PM, Brad Chapman wrote:
>> I arrive in Sweden Thursday morning and leave Tuesday morning. Beyond
>> the BOSC talks and BoF sessions Peter mentioned, I am up for some coding
>> any other time people will be around. I suspect the best turnouts
>> will be Saturday and Sunday evenings when people are around and can
>
> Because of a meeting on Thursday (and Friday, which I am skipping) I
> can only arrive on Friday around 6pm. I am staying at the Rica Talk
> which seems to be in the same place as the conference. If any of you
> are around that evening and want to talk/code/have dinner/drink, send
> me an email.
> I leave on Tuesday. Though I am not at ISMB, I am available to
> code/discuss/drink/whatever on Monday.

So in summary so far:

Friday: Brad (and Bartek?)
Saturday: Brad, Peter, Tiago and Bartek
Sunday: Brad, Peter, Tiago and Bartek
Monday: Brad, Peter, Tiago (and Bartek?)
Tuesday: Peter (and Bartek?)

So BoF sessions on both Saturday and Sunday would be fine :)

Also, it looks like we can try and schedule some kind of
coding/discussion/dinner session on the Monday as well. There are
still some gaps on the ISMB schedule so I'm not quite sure when I
might be free during Monday - but the evening should be fine.

Peter

From bugzilla-daemon at portal.open-bio.org Tue Jun 2 11:54:23 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 2 Jun 2009 11:54:23 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id
In-Reply-To:
Message-ID: <200906021554.n52FsMLv023158@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833

------- Comment #18 from cymon.cox at gmail.com 2009-06-02 11:54 EST -------
(In reply to comment #17)
> How do you feel about this simplistic solution?: if the rules are present,
> before loading a new record, do a query to check to make sure there isn't a
> duplicate already present, and if there is raise an IntegrityError.

Now that's a much better solution than the way I've been trying to go...

This does the trick:

diff --git a/BioSQL/BioSeqDatabase.py b/BioSQL/BioSeqDatabase.py
index 3f58e9c..a7a2470 100644
--- a/BioSQL/BioSeqDatabase.py
+++ b/BioSQL/BioSeqDatabase.py
@@ -330,6 +332,14 @@ class BioSeqDatabase:
         self.adaptor = adaptor
         self.name = name
         self.dbid = self.adaptor.fetch_dbid_by_dbname(name)
+
+        ##Test for presence of RULES in schema
+        self.postgres_rules_present= False
+        if "psycopg" in self.adaptor.conn.__class__.__module__:
+            sql = r"SELECT ev_class FROM pg_rewrite WHERE rulename='rule_bioentry_i1'"
+            if self.adaptor.execute_and_fetchall(sql):
+                self.postgres_rules_present = True
+
     def __repr__(self):
         return "BioSeqDatabase(%r, %r)" % (self.adaptor, self.name)
 
@@ -439,5 +449,11 @@ class BioSeqDatabase:
         num_records = 0
         for cur_record in record_iterator :
             num_records += 1
+            if self.postgres_rules_present:
+                self.adaptor.execute("SELECT bioentry_id FROM bioentry "
+                                     "WHERE identifier = '%s'" % cur_record.id)
+                if self.adaptor.cursor.fetchone():
+                    raise self.adaptor.conn.IntegrityError("Duplicate record "
+                                     "detected: record has not been inserted")
             db_loader.load_seqrecord(cur_record)
         return num_records

C.

From tiagoantao at gmail.com Tue Jun 2 12:03:35 2009
From: tiagoantao at gmail.com (Tiago Antão)
Date: Tue, 2 Jun 2009 17:03:35 +0100
Subject: [Biopython-dev] Biopython BOF at BOSC
In-Reply-To: <8b34ec180906020859o2468f987lfd1ac00b4ebea898@mail.gmail.com>
References: <6d941f120906011114n13c4936ak15d1859c2294be4@mail.gmail.com>
 <320fb6e00906011219u6a549153o28c0d2b4bb4ad05@mail.gmail.com>
 <20090601221452.GG15913@sobchak.mgh.harvard.edu>
 <6d941f120906020722g7d0b2963ka1d6fe42ac8b097f@mail.gmail.com>
 <320fb6e00906020841p4ecd5ac9t85b004848deddbed@mail.gmail.com>
 <8b34ec180906020859o2468f987lfd1ac00b4ebea898@mail.gmail.com>
Message-ID: <6d941f120906020903j389a046dw6c84f0952fc04279@mail.gmail.com>

On Tue, Jun 2, 2009 at 4:59 PM, Bartek Wilczynski wrote:
>> Friday: Brad (and Bartek?Late)

Tiago late here also. I land in Arlanda at 16.40.

From bartek at rezolwenta.eu.org Tue Jun 2 11:59:01 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Tue, 2 Jun 2009 17:59:01 +0200
Subject: [Biopython-dev] Biopython BOF at BOSC
In-Reply-To: <320fb6e00906020841p4ecd5ac9t85b004848deddbed@mail.gmail.com>
References: <6d941f120906011114n13c4936ak15d1859c2294be4@mail.gmail.com>
 <320fb6e00906011219u6a549153o28c0d2b4bb4ad05@mail.gmail.com>
 <20090601221452.GG15913@sobchak.mgh.harvard.edu>
 <6d941f120906020722g7d0b2963ka1d6fe42ac8b097f@mail.gmail.com>
 <320fb6e00906020841p4ecd5ac9t85b004848deddbed@mail.gmail.com>
Message-ID: <8b34ec180906020859o2468f987lfd1ac00b4ebea898@mail.gmail.com>

2009/6/2 Peter Cock:
> So in summary so far:
>
> Friday: Brad (and Bartek?Late)
> Saturday: Brad, Peter, Tiago and Bartek-Yes
> Sunday: Brad, Peter, Tiago and Bartek-Yes
> Monday: Brad, Peter, Tiago (and Bartek?No)
> Tuesday: Peter (and Bartek?No)
>

I will arrive on Friday evening (landing at 18.30) and leave on Sunday evening
(I should be able to stay at the site till 6pm).

> So BoF sessions on both Saturday and Sunday would be fine :)
>

I should be able to take part in Sat/Sun BOF sessions (but not Monday/Tuesday).

cheers
Bartek

From bugzilla-daemon at portal.open-bio.org Tue Jun 2 13:00:56 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 2 Jun 2009 13:00:56 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id
In-Reply-To:
Message-ID: <200906021700.n52H0uZb029220@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833

------- Comment #19 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-02 13:00 EST -------
(In reply to comment #18)
> (In reply to comment #17)
> > How do you feel about this simplistic solution?: if the rules are present,
> > before loading a new record, do a query to check to make sure there isn't a
> > duplicate already present, and if there is raise an IntegrityError.
>
> Now that's a much better solution than the way I've been trying to go...
>
> This does the trick:
> ...
> +            if self.postgres_rules_present:
> +                self.adaptor.execute("SELECT bioentry_id FROM bioentry "
> +                                     "WHERE identifier = '%s'" %
> cur_record.id)
> +                if self.adaptor.cursor.fetchone():
> +                    raise self.adaptor.conn.IntegrityError("Duplicate record "
> +                                     "detected: record has not been inserted")

While the above code looks sensible, I don't think it covers all the cases yet.
Essentially the two bioentry rules relate to these two uniqueness rules in the
default schema:

UNIQUE ( identifier , biodatabase_id )
UNIQUE ( accession , biodatabase_id , version )

According to rule_bioentry_i1 (or the equivalent rule) we should allow the same
bioentry.identifier to appear in different namespaces (i.e. as long as
bioentry.biodatabase_id differs). i.e. something like this in your code:

"SELECT bioentry_id FROM bioentry WHERE identifier = '%s' AND biodatabase_id =
%s" % (cur_record.id, self.dbid)

Then for rule_bioentry_i2 we also need to check the accession, version and
biodatabase_id have not been used before.

Both checks could probably be done as a single more complex SQL query.

Also, when we check for the rules, do you think we should check for
rule_bioentry_i2 as well as rule_bioentry_i1? In principle they will either
both be there, or neither. What about the other rules - might they also cause
problems in Biopython?

Finally, on a code style thing, I'd make postgres_rules_present private, i.e.
call it _postgres_rules_present instead. Anyway, in principle it looks like
this approach should work :)

Peter
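
[Editorial sketch, not part of the original thread] The two uniqueness
constraints quoted in comment #19 can be tried out with a throwaway table.
The sketch below uses Python's built-in sqlite3 module rather than
PostgreSQL, and a cut-down bioentry table with only the columns named in the
constraints; the row values are made up. It only illustrates what the
constraints permit and reject when no RULES intercept the INSERT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Cut-down table: just the columns involved in the two UNIQUE constraints.
conn.execute(
    """
    CREATE TABLE bioentry (
        bioentry_id INTEGER PRIMARY KEY,
        biodatabase_id INTEGER NOT NULL,
        identifier TEXT,
        accession TEXT NOT NULL,
        version INTEGER NOT NULL,
        UNIQUE (identifier, biodatabase_id),
        UNIQUE (accession, biodatabase_id, version)
    )
    """
)


def insert(biodatabase_id, identifier, accession, version):
    conn.execute(
        "INSERT INTO bioentry (biodatabase_id, identifier, accession, version)"
        " VALUES (?, ?, ?, ?)",
        (biodatabase_id, identifier, accession, version),
    )


def is_rejected(*row):
    """Return True if the database refuses the row with an IntegrityError."""
    try:
        insert(*row)
    except sqlite3.IntegrityError:
        return True
    return False


insert(1, "12345", "ACC001", 1)
# Same identifier and accession, but another namespace: allowed.
other_namespace_rejected = is_rejected(2, "12345", "ACC001", 1)
# Same identifier in the same namespace: refused by the first constraint.
same_identifier_rejected = is_rejected(1, "12345", "ACC999", 1)
# Same accession and version in the same namespace: refused by the second.
same_accession_rejected = is_rejected(1, "99999", "ACC001", 1)
```

With the PostgreSQL RULES in place the last two inserts would instead be
dropped without an error, which is the behaviour the thread is trying to
detect.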

From bugzilla-daemon at portal.open-bio.org Tue Jun 2 13:25:54 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 2 Jun 2009 13:25:54 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id
In-Reply-To:
Message-ID: <200906021725.n52HPsbi031363@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833

------- Comment #20 from andrea at biodec.com 2009-06-02 13:25 EST -------
(In reply to comment #19)
> (In reply to comment #18)
> > (In reply to comment #17)
> > > How do you feel about this simplistic solution?: if the rules are present,
> > > before loading a new record, do a query to check to make sure there isn't a
> > > duplicate already present, and if there is raise an IntegrityError.
> >
> > Now thats a much better solution than the way Ive been trying to go...
> >
> > This does the trick:
> > ...
> > +            if self.postgres_rules_present:
> > +                self.adaptor.execute("SELECT bioentry_id FROM bioentry "
> > +                                     "WHERE identifier = '%s'" %
> > cur_record.id)
> > +                if self.adaptor.cursor.fetchone():
> > +                    raise self.adaptor.conn.IntegrityError("Duplicate record "
> > +                                     "detected: record has not been inserted")
>
> While the above code looks sensible, I don't think it covers all the cases yet.
> Essentially the two bioentry rules relate to these two uniqueness rules in the
> default schema:
>
> UNIQUE ( identifier , biodatabase_id )
> UNIQUE ( accession , biodatabase_id , version )

What I think...
1) The solution is almost correct.
2) We definitely have to consider both rules, because (I tried) they
   work fully independently, so we need to check both rules.
3) Uniqueness is per biodatabase, so I can add two records with an
   identical accession, or identifier, or both, as long as the biodatabase
   differs, and this works perfectly.
4) Finally, I would also like to add a warning, because the presence
   of the rules adds overhead to insertion, since they trigger other
   queries.... 
(and it could be convenient to inform...)

>
> According to rule_bioentry_i1 (or the equivalent rule) we should allow the same
> bioentry.identifier to appear in different namespaces (i.e. as long as
> bioentry.biodatabase_id differs). i.e. something like this in your code:
>
> "SELECT bioentry_id FROM bioentry WHERE identifier = '%s AND biodatabase_id =
> %s' % (cur_record.id, self.dbid)
>
> Then for rule_bioentry_i2 we also need to check the accession, version and
> biodatabase_id have not been used before.

sure

>
> Both checks could probably be done as a single more complex SQL query.

"SELECT bioentry_id FROM bioentry WHERE (identifier = '%s' AND biodatabase_id = %s)
OR (accession = '%s' AND version = '%s' AND biodatabase_id = %s)"

so if either of the two (or both) matches, you get a bioentry_id back and you
could have the problem

>
> Also, when we check for the rules, do you think we should check for
> rule_bioentry_i2 as well as rule_bioentry_i1? In principle they will either
> both be there, or neither. What about the other rules - might they also cause
> problems in Biopython?

both... it's exactly the same. You have the same problem with
accession, version, biodatabase_id

>
> Finally, on a code style thing, I'd make postgres_rules_present private, i.e.
> call it _postgres_rules_present instead. Anyway, in principle it looks like
> this approach should work :)

ok

>
> Peter
>

thanks
andrea
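The combined check discussed above could be sketched roughly as follows. This
is only an illustration, not the code that went into Biopython: it assumes a
cut-down bioentry table in an in-memory SQLite database, the helper name
find_duplicate is hypothetical, and it uses driver placeholders instead of
"%" string formatting so that quoting is left to the database driver.

```python
import sqlite3


def find_duplicate(conn, dbid, identifier, accession, version):
    """Return the bioentry_id of a clashing row, or None (hypothetical helper).

    A row clashes if it breaks either uniqueness rule within the same
    namespace (biodatabase_id).
    """
    cur = conn.execute(
        "SELECT bioentry_id FROM bioentry "
        "WHERE (identifier = ? AND biodatabase_id = ?) "
        "OR (accession = ? AND version = ? AND biodatabase_id = ?)",
        (identifier, dbid, accession, version, dbid),
    )
    row = cur.fetchone()
    return row[0] if row else None


# Minimal stand-in for the BioSQL bioentry table (assumed subset of columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bioentry ("
    " bioentry_id INTEGER PRIMARY KEY,"
    " biodatabase_id INTEGER NOT NULL,"
    " identifier TEXT,"
    " accession TEXT NOT NULL,"
    " version INTEGER NOT NULL,"
    " UNIQUE (identifier, biodatabase_id),"
    " UNIQUE (accession, biodatabase_id, version))"
)
conn.execute(
    "INSERT INTO bioentry (biodatabase_id, identifier, accession, version) "
    "VALUES (1, '12345', 'X55053', 1)"
)
```

With this in place, a loader could call find_duplicate() before each insert
and raise an IntegrityError itself when it returns something other than None.
The same identifier or accession.version in a different biodatabase_id is not
reported, which matches the "different namespace" behaviour described above.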
From bugzilla-daemon at portal.open-bio.org Tue Jun 2 13:58:33 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 2 Jun 2009 13:58:33 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200906021758.n52HwXM7001408@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 ------- Comment #21 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-02 13:58 EST ------- (In reply to comment #20) > > What i think... > 1) the solution is almost correct > 2) but we have for sure to consider both rules because ("i tried") and > they work fully independetly.. so we need to check both rules. It would be odd for someone to delete one rule but not the other. But yes, we should test for both. > 3) the unicity is related to the biodatabase, so i can add 2 record with > identical accession, or identifier or both... but different biodatabase > and this works perfectly. Good. > 3) At the end i would like to add also a warning because the presence > of the rules cause an overhead into insertion because trigger other > queries.... (and it could be convenient to inform...) Yes, having a warning (even if Biopython can be made to cope with the rules) seems sensible. I've just updated CVS to check for either of the bioentry rules and issue a warning (based on Cymon's patch). Adding the work around with the extra query would be the next step (at which point the warning text would need updating). Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Jun 2 14:15:03 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 2 Jun 2009 14:15:03 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200906021815.n52IF3On002717@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 ------- Comment #22 from andrea at biodec.com 2009-06-02 14:15 EST ------- (In reply to comment #21) > (In reply to comment #20) > > > > What i think... > > 1) the solution is almost correct > > 2) but we have for sure to consider both rules because ("i tried") and > > they work fully independetly.. so we need to check both rules. > > It would be odd for someone to delete one rule but not the other. But yes, we > should test for both. > very odd... and fully improbable.. but possible and it could rise "noisy bugs" in future (i will slowly forget many of the things we are speaking about) > > 3) the unicity is related to the biodatabase, so i can add 2 record with > > identical accession, or identifier or both... but different biodatabase > > and this works perfectly. > > Good. > > > 3) At the end i would like to add also a warning because the presence > > of the rules cause an overhead into insertion because trigger other > > queries.... (and it could be convenient to inform...) > > Yes, having a warning (even if Biopython can be made to cope with the rules) > seems sensible. Also, in the worning, telling something about performance issues.... > > I've just updated CVS to check for either of the bioentry rules and issue a > warning (based on Cymon's patch). Adding the work around with the extra query > would be the next step (at which point the warning text would need updating). 
> > Peter > Thanks andrea -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From cy at cymon.org Tue Jun 2 15:39:18 2009 From: cy at cymon.org (Cymon Cox) Date: Tue, 2 Jun 2009 20:39:18 +0100 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: <200906021700.n52H0uZb029220@portal.open-bio.org> References: <200906021700.n52H0uZb029220@portal.open-bio.org> Message-ID: <7265d4f0906021239n2c9b71f9l3a69d362bbd66183@mail.gmail.com> 2009/6/2 > http://bugzilla.open-bio.org/show_bug.cgi?id=2833 > > > > > > ------- Comment #19 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-02 13:00 EST ------- > (In reply to comment #18) > > (In reply to comment #17) > > > How do you feel about this simplistic solution?: if the rules are > present, > > > before loading a new record, do a query to check to make sure there > isn't a > > > duplicate already present, and if there is raise an IntegrityError. > > > > Now thats a much better solution than the way Ive been trying to go... > > > > This does the trick: > > ... > > + if self.postgres_rules_present: > > + self.adaptor.execute("SELECT bioentry_id FROM bioentry " > > + "WHERE identifier = '%s'" % > > cur_record.id) > > + if self.adaptor.cursor.fetchone(): > > + raise self.adaptor.conn.IntegrityError("Duplicate > record " > > + "detected: record has not been inserted") > > While the above code looks sensible, I don't think it covers all the cases > yet. > Essentially the two bioentry rules relate to these two uniqueness rules in > the > default schema: > > UNIQUE ( identifier , biodatabase_id ) > UNIQUE ( accession , biodatabase_id , version ) > > According to rule_bioentry_i1 (or the equivalent rule) we should allow the > same > bioentry.identifier to appear in different namespaces (i.e. as long as > bioentry.biodatabase_id differs). i.e. 
something like this in your code: > > "SELECT bioentry_id FROM bioentry WHERE identifier = '%s AND biodatabase_id > = > %s' % (cur_record.id, self.dbid) > > Then for rule_bioentry_i2 we also need to check the accession, version and > biodatabase_id have not been used before. In principle, we should only have to check for the second case (accession, biodatabase_id, version) because the GenBank "gi numbers" (i.e the identifier number) parallel the accession.version scheme. When a record changes both the gi number changes and the version number is incremented. Hence, and unique accession.version implies a unique identifier. In the schema, the identifier can be NULL, presumably so that non-GenBank data can be stored provided is has a unique accession.version. If we were only to check case 2 (accession, biodatabase_id, version) the only way I can see to trigger the RULES bug would be to manually assign two different accession.version to two records but assign the same (presumably artificial) identifier number to both record.annotations["gi"]. So, how likely is that? Well, not very, but perhaps we need check both ;) Perhaps we need to first define some unittests of all the permutations, because the code I submitted doesnt trigger any errors in the current suite. Cheers, C. 
From cy at cymon.org Tue Jun 2 16:29:06 2009 From: cy at cymon.org (Cymon Cox) Date: Tue, 2 Jun 2009 21:29:06 +0100 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: <7265d4f0906021239n2c9b71f9l3a69d362bbd66183@mail.gmail.com> References: <200906021700.n52H0uZb029220@portal.open-bio.org> <7265d4f0906021239n2c9b71f9l3a69d362bbd66183@mail.gmail.com> Message-ID: <7265d4f0906021329y6b6ca58dv925b7741d1e827d5@mail.gmail.com> 2009/6/2 Cymon Cox > 2009/6/2 > > http://bugzilla.open-bio.org/show_bug.cgi?id=2833 >> >> >> >> >> >> ------- Comment #19 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-02 13:00 EST ------- >> (In reply to comment #18) >> > (In reply to comment #17) >> > > How do you feel about this simplistic solution?: if the rules are >> present, >> > > before loading a new record, do a query to check to make sure there >> isn't a >> > > duplicate already present, and if there is raise an IntegrityError. >> > >> > Now thats a much better solution than the way Ive been trying to go... >> > >> > This does the trick: >> > ... >> > + if self.postgres_rules_present: >> > + self.adaptor.execute("SELECT bioentry_id FROM bioentry >> " >> > + "WHERE identifier = '%s'" % >> > cur_record.id) >> > + if self.adaptor.cursor.fetchone(): >> > + raise self.adaptor.conn.IntegrityError("Duplicate >> record " >> > + "detected: record has not been inserted") >> >> While the above code looks sensible, I don't think it covers all the cases >> yet. >> Essentially the two bioentry rules relate to these two uniqueness rules in >> the >> default schema: >> >> UNIQUE ( identifier , biodatabase_id ) >> UNIQUE ( accession , biodatabase_id , version ) >> >> According to rule_bioentry_i1 (or the equivalent rule) we should allow the >> same >> bioentry.identifier to appear in different namespaces (i.e. as long as >> bioentry.biodatabase_id differs). i.e. 
something like this in your code: >> >> "SELECT bioentry_id FROM bioentry WHERE identifier = '%s AND >> biodatabase_id = >> %s' % (cur_record.id, self.dbid) >> >> Then for rule_bioentry_i2 we also need to check the accession, version and >> biodatabase_id have not been used before. > > > In principle, we should only have to check for the second case (accession, > biodatabase_id, version) because the GenBank "gi numbers" (i.e the > identifier number) parallel the accession.version scheme. When a record > changes both the gi number changes and the version number is incremented. > Hence, and unique accession.version implies a unique identifier. In the > schema, the identifier can be NULL, presumably so that non-GenBank data can > be stored provided is has a unique accession.version. If we were only to > check case 2 (accession, biodatabase_id, version) the only way I can see to > trigger the RULES bug would be to manually assign two different > accession.version to two records but assign the same (presumably artificial) > identifier number to both record.annotations["gi"]. > Whoa, I see now that in Loader._load_bioentry_table that if the rec.annotations["gi"] is missing, it gets filled with the accession.version: if "gi" in record.annotations : identifier = record.annotations["gi"] else : identifier = record.id So biopythons BioSQL identifiers are not equivalent to GenBank identifiers. I wonder why this is done and identifier is not just left NULL, and the unique constraint maintained by accession/version... Cheers, C. 
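To make the consequence of that fallback concrete, here is a small sketch. The
function name choose_identifier and the example values are hypothetical; only
the if/else logic is taken from the snippet quoted above.

```python
def choose_identifier(annotations, record_id):
    """Mirror the quoted fallback: use the GI number if present, else the id."""
    if "gi" in annotations:
        return annotations["gi"]
    return record_id


# A GenBank-style record carries a GI number, so that becomes the identifier.
genbank_identifier = choose_identifier({"gi": "12345"}, "X55053.1")

# A record without a GI number falls back to its id, so the identifier column
# is never NULL and the (identifier, biodatabase_id) rule always applies.
fasta_identifier = choose_identifier({}, "my_fasta_id")
```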
--
From biopython at maubp.freeserve.co.uk Wed Jun 3 08:54:40 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 3 Jun 2009 13:54:40 +0100
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id
In-Reply-To: <7265d4f0906021329y6b6ca58dv925b7741d1e827d5@mail.gmail.com>
References: <200906021700.n52H0uZb029220@portal.open-bio.org>
    <7265d4f0906021239n2c9b71f9l3a69d362bbd66183@mail.gmail.com>
    <7265d4f0906021329y6b6ca58dv925b7741d1e827d5@mail.gmail.com>
Message-ID: <320fb6e00906030554p6daa59a3u54bfaade81c07d5f@mail.gmail.com>

On Tue, Jun 2, 2009 at 9:29 PM, Cymon Cox wrote:
>
> Whoa, I see now that in Loader._load_bioentry_table that if the
> rec.annotations["gi"] is missing, it gets filled with the accession.version:
>
>         if "gi" in record.annotations :
>             identifier = record.annotations["gi"]
>         else :
>             identifier = record.id
>
> So biopythons BioSQL identifiers are not equivalent to GenBank identifiers.
> I wonder why this is done and identifier is not just left NULL, and the
> unique constraint maintained by accession/version...
>

Remember, it isn't just GenBank files that get imported into BioSQL.
While the record.id is the accession.version when loading a GenBank
file, this is not the case in general.

Consulting the CVS log, this was changed in BioSQL/Loader.py revision
1.33 to cope with loading a FASTA file into a BioSQL database (Bug
2425). Presumably I was trying to mimic the BioPerl loading of FASTA
files. Before this change, the bioentry.identifier was taken as the GI
number if available.

i.e. This change wasn't anything directly to do with the uniqueness rules.
Peter From bugzilla-daemon at portal.open-bio.org Wed Jun 3 10:39:19 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jun 2009 10:39:19 -0400 Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from BioSQL, trying to save it to another database fails with loader db_loader.load_seqrecord in _load_reference In-Reply-To: Message-ID: <200906031439.n53EdJon000576@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2840 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-03 10:39 EST ------- (In reply to comment #6) > Note - there are four known failures in test_BioSQL.py right now, a mixed > strand feature in NC_000932.gb (which triggers two failures), the project > cross reference in NC_005816.gb, and a sub-feature location reference in > one_of.gb -- these are all unrelated to this issue (Bug 2840). Just to note these are now fixed in CVS - they where mostly to do with writing GenBank files with Bio.SeqIO (I was testing writing using DBSeqRecord objects pulled out of BioSQL). (In reply to comment #7) > (In reply to comment #6) > > This is now fixed in CVS, plus there are now additional unit tests. For > > the fix, I have used a slight variation of Cymon's patch. Does this look > > sensible Cymon? > > Works for me. > C. Great - marking this bug as fixed. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From cy at cymon.org Wed Jun 3 10:52:16 2009 From: cy at cymon.org (Cymon Cox) Date: Wed, 3 Jun 2009 15:52:16 +0100 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: <320fb6e00906030554p6daa59a3u54bfaade81c07d5f@mail.gmail.com> References: <200906021700.n52H0uZb029220@portal.open-bio.org> <7265d4f0906021239n2c9b71f9l3a69d362bbd66183@mail.gmail.com> <7265d4f0906021329y6b6ca58dv925b7741d1e827d5@mail.gmail.com> <320fb6e00906030554p6daa59a3u54bfaade81c07d5f@mail.gmail.com> Message-ID: <7265d4f0906030752l64e48498v92fd18a80d20658c@mail.gmail.com> 2009/6/3 Peter > On Tue, Jun 2, 2009 at 9:29 PM, Cymon Cox wrote: > > > > Whoa, I see now that in Loader._load_bioentry_table that if the > > rec.annotations["gi"] is missing, it gets filled with the > accession.version: > > > > if "gi" in record.annotations : > > identifier = record.annotations["gi"] > > else : > > identifier = record.id > > > > So biopythons BioSQL identifiers are not equivalent to GenBank > identifiers. > > I wonder why this is done and identifier is not just left NULL, and the > > unique constraint maintained by accession/version... > > > > Remember, it isn't just GenBank files that get imported into BioSQL. > While the record.id is the accession.version when loading a GenBank > file, this is not the case in general. > > Consulting the CVS log, this was changed BioSQL/Loader/py revision > 1.33 to cope with loading a FASTA file into a BioSQL database (Bug > 2425). Presumably I was trying to mimic the BioPerl loading of FASTA > files. Before this change, the bioentry.identifier was taken as the GI > number if available. > > i.e. This change wasn't anything directly to do with the uniqueness rules. Thanks Peter. Yes, it seems to have been done to mimic BioPerl - but I'm still curious as to why it is done at all... Anyway, I seem to be chasing my tail here: http://bugzilla.open-bio.org/show_bug.cgi?id=2681#c5 Cheers, C. 
-- From bugzilla-daemon at portal.open-bio.org Wed Jun 3 12:29:18 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jun 2009 12:29:18 -0400 Subject: [Biopython-dev] [Bug 2806] Possible deadlock (hang) in Bio.Application using subprocess wait() In-Reply-To: Message-ID: <200906031629.n53GTIpD017690@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2806 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-03 12:29 EST ------- Since this bug was filed, the Bio.Application.generic_run function is now used in test_Emboss.py and the new alignment tool wrapper unit tests. None of these have shown a deadlock problem, but I have applied this fix anyway as a precaution. See Bio/Application/__init__.py revision 1.21 in CVS. Marking as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Wed Jun 3 12:36:40 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jun 2009 12:36:40 -0400 Subject: [Biopython-dev] [Bug 2826] SeqRecord dbxrefs not written to GenBank by SeqIO In-Reply-To: Message-ID: <200906031636.n53Gaenk018729@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2826 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-03 12:36 EST ------- As of Bio/SeqIO/InsdcIO.py CVS revision 1.15, a SeqRecord's dbxrefs are recorded under the DBLINK lines when writing a GenBank file with Bio.SeqIO. Note that the code does not (currently) restrict this to the two database cross references the NCBI currently use for this field, e.g. "Project:28471" and "Trace Assembly Archive:123456", anything in the dbxrefs list is recorded. Marking this bug as fixed. Note if you want to test this, a clean CVS/git checkout would be advised (rather than trying to update individual files only). Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Wed Jun 3 12:37:48 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jun 2009 12:37:48 -0400 Subject: [Biopython-dev] [Bug 2817] Meta-bug for cleanup once we drop Python 2.3 support In-Reply-To: Message-ID: <200906031637.n53GbmI2018929@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2817 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-03 12:37 EST ------- I think the obvious things are done in terms of removing Python 2.3 specific code. Marking as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jun 3 16:59:13 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jun 2009 16:59:13 -0400 Subject: [Biopython-dev] [Bug 2848] New: SeqIO fastq routines reject valid quality socres Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2848 Summary: SeqIO fastq routines reject valid quality socres Product: Biopython Version: 1.50 Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: pmmagic at gmail.com The fastq routines in SeqIO.QualityIO reject what I believe are valid quality scores. According to the MAQ website (http://maq.sourceforge.net/fastq.shtml; I don't know if this is definitive), valid quality values in Sanger style FASTQ format are: := [!-~\n]+ This corresponds to Phred quality scores in the range 0-93. The current code in BioPython 1.50 rejects quality scores > 90. 
From bugzilla-daemon at portal.open-bio.org Wed Jun 3 17:39:07 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 3 Jun 2009 17:39:07 -0400
Subject: [Biopython-dev] [Bug 2848] SeqIO fastq routines reject valid quality socres
In-Reply-To:
Message-ID: <200906032139.n53Ld7QK011962@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2848

------- Comment #1 from sbassi at gmail.com 2009-06-03 17:39 EST -------
It seems that is intended; look at the module docs:

The PHRED software reads DNA sequencing trace files, calls bases, and
assigns a quality value between 0 and 90 to each called base using a logged
transformation of the error probability, Q = -10 log10( Pe ), for example::

    Pe = 1.0,         Q =  0
    Pe = 0.1,         Q = 10
    Pe = 0.01,        Q = 20
    ...
    Pe = 0.00000001,  Q = 80
    Pe = 0.000000001, Q = 90
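For reference, the two numbers being argued about can be checked with a short
sketch. It assumes the Sanger-style encoding described on the MAQ page
(printable ASCII from "!" to "~" with an offset of 33); the function names are
made up for this illustration and are not part of Bio.SeqIO.QualityIO.

```python
import math

SANGER_OFFSET = 33  # assumed Sanger FASTQ ASCII offset, so "!" encodes Q = 0


def phred_quality(error_probability):
    """Q = -10 * log10(Pe); Pe must be greater than zero."""
    return -10 * math.log10(error_probability)


def sanger_char_to_phred(char):
    """Decode one Sanger FASTQ quality character to its PHRED score."""
    return ord(char) - SANGER_OFFSET


lowest = sanger_char_to_phred("!")   # 0
highest = sanger_char_to_phred("~")  # 93
```

Under that assumption the character range allows scores 0 to 93, so a hard
limit of 90 would reject the top three characters of the range.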
From bugzilla-daemon at portal.open-bio.org Wed Jun 3 17:51:16 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 3 Jun 2009 17:51:16 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id
In-Reply-To:
Message-ID: <200906032151.n53LpGCT012916@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833

------- Comment #23 from cymon.cox at gmail.com 2009-06-03 17:51 EST -------
Created an attachment (id=1319)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1319&action=view)
PostgreSQL BioSQL Rules workaround

From bugzilla-daemon at portal.open-bio.org Wed Jun 3 18:02:22 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 3 Jun 2009 18:02:22 -0400
Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id
In-Reply-To:
Message-ID: <200906032202.n53M2MMb014177@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2833

------- Comment #24 from cymon.cox at gmail.com 2009-06-03 18:02 EST -------
I've added a patch against biopython on GitHub.

I hope it addresses all the points made so far... it now passes all tests in
test_BioSQL.py (although I've not added more).

One thing we've not yet discussed is the other PostgreSQL driver, PyGresql. It
appears that the project is still active and I was able to apt-get an Ubuntu
package. It failed the tests miserably because it doesn't support autocommit().
Even if it can be made to work it will obviously be prone to the RULES issue.
Presumably, no one is actually using PyGresql (or at least hasn't updated
biopython for some time). I'll open a bug.
Also, Ive added a create_database() in a setUp() to the ClosedLoopTest unittest case because if this suite is called first (as it is for me - what actually governs which unittests are called first?) then if a test database is missing the suite is going to fail. C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jun 3 18:12:31 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jun 2009 18:12:31 -0400 Subject: [Biopython-dev] [Bug 2849] New: PyGresql PostgreSQL driver support for BioSQL is broken Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2849 Summary: PyGresql PostgreSQL driver support for BioSQL is broken Product: Biopython Version: Not Applicable Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: BioSQL AssignedTo: biopython-dev at biopython.org ReportedBy: cymon.cox at gmail.com cymon at gyra:~/git/github-master/Tests$ python test_BioSQL.py GenBank file to BioSQL and back to a GenBank file, NC_000932. ... ERROR GenBank file to BioSQL and back to a GenBank file, NC_005816. ... ERROR GenBank file to BioSQL and back to a GenBank file, NT_019265. ... ERROR GenBank file to BioSQL and back to a GenBank file, arab1. ... ERROR GenBank file to BioSQL and back to a GenBank file, cor6_6. ... ERROR GenBank file to BioSQL and back to a GenBank file, noref. ... ERROR GenBank file to BioSQL and back to a GenBank file, one_of. ... ERROR GenBank file to BioSQL and back to a GenBank file, protein_refseq2. ... ERROR Make sure can't import records with same ID (in one go). ... ERROR Make sure can't import a single record twice (in one go). ... ERROR Make sure can't import a single record twice (in steps). ... ERROR Make sure all records are correctly loaded. ... 
ERROR Make sure can't reimport existing records. ... ERROR Indepth check that SeqFeatures are transmitted through the db. ... ERROR Make sure can load record into another namespace. ... ERROR Load SeqRecord objects into a BioSQL database. ... ERROR Get a list of all items in the database. ... ERROR Test retrieval of items using various ids. ... ERROR Check can add DBSeq objects together. ... ERROR Check can turn a DBSeq object into a Seq or MutableSeq. ... ERROR Make sure Seqs from BioSQL implement the right interface. ... ERROR Check SeqFeatures of a sequence. ... ERROR Make sure SeqRecords from BioSQL implement the right interface. ... ERROR Check that slices of sequences are retrieved properly. ... ERROR GenBank file to BioSQL, then again to a new namespace, NC_000932. ... ERROR GenBank file to BioSQL, then again to a new namespace, NC_005816. ... ERROR GenBank file to BioSQL, then again to a new namespace, NT_019265. ... ERROR GenBank file to BioSQL, then again to a new namespace, arab1. ... ERROR GenBank file to BioSQL, then again to a new namespace, cor6_6. ... ERROR GenBank file to BioSQL, then again to a new namespace, noref. ... ERROR GenBank file to BioSQL, then again to a new namespace, one_of. ... ERROR GenBank file to BioSQL, then again to a new namespace, protein_refseq2. ... ERROR ====================================================================== ERROR: GenBank file to BioSQL and back to a GenBank file, NC_000932. 
---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 409, in test_NC_000932 self.loop(os.path.join(os.getcwd(), "GenBank", "NC_000932.gb"), "gb") File "test_BioSQL.py", line 443, in loop count = db.load(original_records) File "/home/cymon/git/github-master/BioSQL/BioSeqDatabase.py", line 479, in load db_loader.load_seqrecord(cur_record) File "/home/cymon/git/github-master/BioSQL/Loader.py", line 50, in load_seqrecord bioentry_id = self._load_bioentry_table(record) File "/home/cymon/git/github-master/BioSQL/Loader.py", line 559, in _load_bioentry_table bioentry_id = self.adaptor.last_id('bioentry') File "/home/cymon/git/github-master/BioSQL/BioSeqDatabase.py", line 168, in last_id return self.dbutils.last_id(self.cursor, table) File "/home/cymon/git/github-master/BioSQL/DBUtils.py", line 96, in last_id cursor.execute(sql) File "/usr/lib/python2.6/dist-packages/pgdb.py", line 259, in execute self.executemany(operation, (params,)) File "/usr/lib/python2.6/dist-packages/pgdb.py", line 289, in executemany raise DatabaseError("error '%s' in '%s'" % (msg, sql)) DatabaseError: error 'ERROR: currval of sequence "bioentry_pk_seq" is not yet defined in this session ' in 'select currval('bioentry_pk_seq')' ====================================================================== etc, etc, etc... Same error until we get to: ====================================================================== ERROR: Make sure can't import records with same ID (in one go). 
---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 344, in setUp create_database() File "test_BioSQL.py", line 56, in create_database server.adaptor.autocommit() File "/home/cymon/git/github-master/BioSQL/BioSeqDatabase.py", line 172, in autocommit return self.dbutils.autocommit(self.conn, y) File "/home/cymon/git/github-master/BioSQL/DBUtils.py", line 101, in autocommit raise NotImplementedError("pgdb does not support this!") NotImplementedError: pgdb does not support this! ====================================================================== ERROR: Make sure can't import a single record twice (in one go). ---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 344, in setUp create_database() File "test_BioSQL.py", line 56, in create_database server.adaptor.autocommit() File "/home/cymon/git/github-master/BioSQL/BioSeqDatabase.py", line 172, in autocommit return self.dbutils.autocommit(self.conn, y) File "/home/cymon/git/github-master/BioSQL/DBUtils.py", line 101, in autocommit raise NotImplementedError("pgdb does not support this!") NotImplementedError: pgdb does not support this! ====================================================================== -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Jun 4 05:27:09 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 4 Jun 2009 05:27:09 -0400
Subject: [Biopython-dev] [Bug 2848] SeqIO fastq routines reject valid quality socres
In-Reply-To:
Message-ID: <200906040927.n549R9CU030203@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2848

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-04 05:27 EST -------
Rereading that MAQ page, you are probably right about allowing 0-93 rather than
0-90 for PHRED scores.

Could you pull out a few valid FASTQ reads showing this problem as a short
example file we can use for a unit test (and attach it to this bug)?

Also, could you explicitly confirm what type of FASTQ file you think you have,
i.e. Sanger style using PHRED scores and an offset of 33, rather than
Solexa/Illumina style using a different scaling and an offset of 64, or
something else?

Thanks

Peter

From bugzilla-daemon at portal.open-bio.org Thu Jun 4 05:31:18 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 4 Jun 2009 05:31:18 -0400
Subject: [Biopython-dev] [Bug 2848] SeqIO fastq routines reject valid quality socres
In-Reply-To:
Message-ID: <200906040931.n549VIgg030515@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2848

------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-04 05:31 EST -------
P.S. The reason I originally used 0 to 90 was this line in the MAQ page and the
fq_all2std.pl text:

"In the quality string, if you can see a character with its ASCII code higher
than 90, probably your file is in the Solexa/Illumina format."

They do say "probably", so perhaps 91, 92 and 93 can validly occur.
It might help to know where your apparently very high quality FASTQ file came from. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jun 4 06:08:35 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 4 Jun 2009 06:08:35 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200906041008.n54A8Zkh000532@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1319 is|0 |1 obsolete| | ------- Comment #25 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-04 06:08 EST ------- (From update of attachment 1319) (In reply to comment #24) > Ive added a patch against biopython on GitHub. > > I hope it address all the points made so far... it now passes all tests > in test_BioSQL.py (although Ive not added more). It looked sensible to me. It isn't very elegant (maybe we should move this hack into Loader.py?), but I can live with it until Hilmar fixes Bug 2839. Checked in as BioSQL/BioSeqDatabase.py CVS revision 1.23 - thanks! > One thing we've not yet discussed is the other PostgreSQL driver PyGresql. > It appears that the project is still active ... I'll open a bug. Let's discuss that on the new Bug 2849. > Also, Ive added a create_database() in a setUp() to the ClosedLoopTest > unittest case because if this suite is called first (as it is for me - > what actually governs which unittests are called first?) then if a test > database is missing the suite is going to fail. Good point, although I added a create_database() to the module itself instead. 
The unit tests order is from sorting their description (first line of the docstring) alphabetically. See Tests/test_BioSQL.py CVS revision 1.41 We may want to add a few more duplicate tests (using the accession and identifier) before closing this bug... Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jun 4 06:28:32 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 4 Jun 2009 06:28:32 -0400 Subject: [Biopython-dev] [Bug 2849] PyGresql PostgreSQL driver support for BioSQL is broken In-Reply-To: Message-ID: <200906041028.n54ASWGJ002151@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2849 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-04 06:28 EST ------- We don't really need the auto commit statement in the unit test, what happens on your machine if you remove it? RCS file: /home/repository/biopython/biopython/Tests/test_BioSQL.py,v retrieving revision 1.41 diff -r1.41 test_BioSQL.py 54,59d53 < # Auto-commit: postgresql cannot drop database in a transaction < try: < server.adaptor.autocommit() < except AttributeError: < pass < With the above change MySQL is happy. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Jun 4 07:05:11 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 4 Jun 2009 07:05:11 -0400 Subject: [Biopython-dev] [Bug 2849] PyGresql PostgreSQL driver support for BioSQL is broken In-Reply-To: Message-ID: <200906041105.n54B5B4a004928@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2849 ------- Comment #2 from cymon.cox at gmail.com 2009-06-04 07:05 EST ------- (In reply to comment #1) > We don't really need the auto commit statement in the unit test, what happens > on your machine if you remove it? I think we do need it for psycopg/psycopg2, as the comment says, postgresql cannot drop a database inside a transaction: INTERNAL ERROR: DROP DATABASE cannot run inside a transaction block C. > > RCS file: /home/repository/biopython/biopython/Tests/test_BioSQL.py,v > retrieving revision 1.41 > diff -r1.41 test_BioSQL.py > 54,59d53 > < # Auto-commit: postgresql cannot drop database in a transaction > < try: > < server.adaptor.autocommit() > < except AttributeError: > < pass > < > > With the above change MySQL is happy. > -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Jun 4 07:12:55 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 4 Jun 2009 07:12:55 -0400 Subject: [Biopython-dev] [Bug 2849] PyGresql PostgreSQL driver support for BioSQL is broken In-Reply-To: Message-ID: <200906041112.n54BCtTh005499@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2849 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-04 07:12 EST ------- (In reply to comment #2) > (In reply to comment #1) > > We don't really need the auto commit statement in the unit test, what > > happens on your machine if you remove it? > > I think we do need it for psycopg/psycopg2, as the comment says, postgresql > cannot drop a database inside a transaction: > > INTERNAL ERROR: DROP DATABASE cannot run inside a transaction block > > C. Did you try it anyway? http://www.postgresql.org/docs/7.0/interactive/sql-dropdatabase.html http://www.postgresql.org/docs/8.3/interactive/sql-dropdatabase.html This says "DROP DATABASE cannot be executed inside a transaction block.", which is fine - we don't want a transaction for this as we won't ever roll this back. It also says "This command cannot be executed while connected to the target database." which should be fine. [Maybe I need to setup my own machine with PostgreSQL on it for testing...] Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Jun 4 07:18:45 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 4 Jun 2009 07:18:45 -0400 Subject: [Biopython-dev] [Bug 2849] PyGresql PostgreSQL driver support for BioSQL is broken In-Reply-To: Message-ID: <200906041118.n54BIjj0006002@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2849 ------- Comment #4 from cymon.cox at gmail.com 2009-06-04 07:18 EST ------- (In reply to comment #3) > (In reply to comment #2) > > (In reply to comment #1) > > > We don't really need the auto commit statement in the unit test, what > > > happens on your machine if you remove it? > > > > I think we do need it for psycopg/psycopg2, as the comment says, postgresql > > cannot drop a database inside a transaction: > > > > INTERNAL ERROR: DROP DATABASE cannot run inside a transaction block > > > > C. > > Did you try it anyway? Yes. C. > > http://www.postgresql.org/docs/7.0/interactive/sql-dropdatabase.html > http://www.postgresql.org/docs/8.3/interactive/sql-dropdatabase.html > > This says "DROP DATABASE cannot be executed inside a transaction block.", which > is fine - we don't want a transaction for this as we won't ever roll this back. > It also says "This command cannot be executed while connected to the target > database." which should be fine. > > [Maybe I need to setup my own machine with PostgreSQL on it for testing...] > > Peter > -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Jun 4 13:49:13 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 4 Jun 2009 13:49:13 -0400 Subject: [Biopython-dev] [Bug 2848] SeqIO fastq routines reject valid quality socres In-Reply-To: Message-ID: <200906041749.n54HnD2r003507@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2848 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-04 13:49 EST ------- (In reply to comment #3) > P.S. The reason I originally used 0 to 90 was this line in the MAQ page and > the fq_all2std.pl text: > > "In the quality string, if you can see a character with its ASCII code higher > than 90, probably your file is in the Solexa/Illumina format." Ignore that - I was thinking PHRED scores but they are talking ASCII codes. I guess they consider PHRED scores of 57+ to be rare. (In reply to comment #0) > The fastq routines in SeqIO.QualityIO reject what I believe are valid quality > scores. > > According to the MAQ website (http://maq.sourceforge.net/fastq.shtml; I don't > know if this is definitive), valid quality values in Sanger style FASTQ format > are: > > := [!-~\n]+ > > This corresponds to Phred quality scores in the range 0-93. Yes, it does: ord("!")-33 = 0 ord("~")-33 = 93 The maq website isn't definitive, but it was written by people at Sanger where the FASTQ format was invented, and to my knowledge is the closest thing to an official description of the format. (In reply to comment #2) > Rereading that MAQ page, you are probably right about allowing 0-93 rather > than 0-90 for PHRED scores. Fixed in CVS. > Could you pull out a few valid FASTQ read showing this problem as a short > example file we can use for a unit test? (and attach it to this bug)... 
On re-reading your bug report, I'm not sure if you actually have a file where this is a problem, or if you just noticed the minor discrepancy in the threshold?

Thanks,

Peter

From bugzilla-daemon at portal.open-bio.org Thu Jun 4 15:15:35 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 4 Jun 2009 15:15:35 -0400
Subject: [Biopython-dev] [Bug 2849] PyGresql PostgreSQL driver support for BioSQL is broken
In-Reply-To:
Message-ID: <200906041915.n54JFZ57010708@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2849

------- Comment #5 from cymon.cox at gmail.com 2009-06-04 15:15 EST -------
(In reply to comment #3)
> (In reply to comment #2)
> > (In reply to comment #1)
> > > We don't really need the auto commit statement in the unit test, what
> > > happens on your machine if you remove it?
> >
> > I think we do need it for psycopg/psycopg2, as the comment says, postgresql
> > cannot drop a database inside a transaction:
> >
> > INTERNAL ERROR: DROP DATABASE cannot run inside a transaction block
> >
> > C.
>
> Did you try it anyway?
>
> http://www.postgresql.org/docs/7.0/interactive/sql-dropdatabase.html
> http://www.postgresql.org/docs/8.3/interactive/sql-dropdatabase.html
>
> This says "DROP DATABASE cannot be executed inside a transaction block.", which
> is fine - we don't want a transaction for this as we won't ever roll this back.

Psycopg defaults to a "read committed" isolation level that psycopg wraps in a transaction block. So, I think the only way not to be in a transaction block is to autocommit.

C.
From bugzilla-daemon at portal.open-bio.org Thu Jun 4 15:22:55 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 4 Jun 2009 15:22:55 -0400 Subject: [Biopython-dev] [Bug 2848] SeqIO fastq routines reject valid quality socres In-Reply-To: Message-ID: <200906041922.n54JMt7r011320@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2848 ------- Comment #5 from pmmagic at gmail.com 2009-06-04 15:22 EST ------- (In reply to comment #4) > (In reply to comment #3) > > P.S. The reason I originally used 0 to 90 was this line in the MAQ page and > > the fq_all2std.pl text: > > > > "In the quality string, if you can see a character with its ASCII code higher > > than 90, probably your file is in the Solexa/Illumina format." > > Ignore that - I was thinking PHRED scores but they are talking ASCII codes. I > guess they consider PHRED scores of 57+ to be rare. > > (In reply to comment #0) > > The fastq routines in SeqIO.QualityIO reject what I believe are valid quality > > scores. > > > > According to the MAQ website (http://maq.sourceforge.net/fastq.shtml; I don't > > know if this is definitive), valid quality values in Sanger style FASTQ format > > are: > > > > := [!-~\n]+ > > > > This corresponds to Phred quality scores in the range 0-93. > > Yes, it does: > > ord("!")-33 = 0 > ord("~")-33 = 93 > > The maq website isn't definitive, but it was written by people at Sanger where > the FASTQ format was invented, and to my knowledge is the closest thing to an > official description of the format. > > (In reply to comment #2) > > Rereading that MAQ page, you are probably right about allowing 0-93 rather > > than 0-90 for PHRED scores. > > Fixed in CVS. > > > Could you pull out a few valid FASTQ read showing this problem as a short > > example file we can use for a unit test? (and attach it to this bug)... 
>
> On re-reading your bug report, I'm not sure if you actually have a file where
> this is a problem, or if you just noticed the minor discrepancy in the
> threshold?
>

Hi Peter,

The problem arises in parsing the fastq formatted consensus mappings produced by MAQ, so these are "mapping qualities" rather than read qualities directly. These mapping qualities, however, are on the same scale as Phred quality scores (http://maq.sourceforge.net/qual.shtml) and MAQ's fastq output is Sanger style. Since the mapping scores are, in part, a function of read depth, it's not too unusual to get very high quality scores in the MAQ output.

Here's a simple snippet that is valid fastq:

@ref|NC_001133|
nnnnnnnnnnnnnnnacacccacacaccacaccacacaccACACCACACCCACACACACA
CATCCTAACACTACCCTAACACAGCCctaatcyaacCCTGACCAACCTGTCTCTCAACTT
+
!!!!!!!!!!!!!!!@EHHHHHHKKJKKKKNNNBN:NNNNQQQQQABGA?LTTWWWZZZI
HEFBZLZ]]]]]]]]]ZZZZZT@TTQQQT4A]1?cfiloxL{xuuux{]~~~~~Ake~`~

Thanks,
Paul M

From bugzilla-daemon at portal.open-bio.org Thu Jun 4 16:54:59 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 4 Jun 2009 16:54:59 -0400
Subject: [Biopython-dev] [Bug 2848] SeqIO fastq routines reject valid quality socres
In-Reply-To:
Message-ID: <200906042054.n54KsxB2017457@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2848

------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-04 16:54 EST -------
(In reply to comment #5)
> Hi Peter,
>
> The problem arises in parsing the fastq formatted consensus mappings
> produced by MAQ, so these are "mapping qualities" rather than read
> qualities directly.

I see - that does explain why there are sometimes very very good quality scores. Presumably maq limits itself to a maximum PHRED quality of 93?
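
The offset arithmetic discussed on this bug can be sketched in a few lines. This is only an illustration: the names SANGER_OFFSET, MAX_PHRED and decode_sanger_quality are made up here and are not part of Bio.SeqIO.QualityIO, and the quality string assumes that the " at " in the archived snippet stands for an "@" that the list archiver obfuscated.

```python
# Sketch only: decode a Sanger-style FASTQ quality string (ASCII offset 33).
# These names are illustrative and are not Biopython API.

SANGER_OFFSET = 33
MAX_PHRED = 93  # ord("~") - 33


def decode_sanger_quality(quality):
    """Return the PHRED scores encoded by a Sanger FASTQ quality string."""
    scores = [ord(ch) - SANGER_OFFSET for ch in quality]
    for ch, score in zip(quality, scores):
        if not 0 <= score <= MAX_PHRED:
            raise ValueError("character %r is outside the Sanger range" % ch)
    return scores


# The two quality lines from the snippet posted on Bug 2848, joined together.
quality = (
    "!!!!!!!!!!!!!!!@EHHHHHHKKJKKKKNNNBN:NNNNQQQQQABGA?LTTWWWZZZI"
    "HEFBZLZ]]]]]]]]]ZZZZZT@TTQQQT4A]1?cfiloxL{xuuux{]~~~~~Ake~`~"
)
scores = decode_sanger_quality(quality)
print(min(scores), max(scores))  # 0 93
```

Under those assumptions the snippet spans the whole 0-93 range, so a parser that stops at 90 would reject it.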
> Here's a simple snippet that is valid fastq:
>
> @ref|NC_001133|
> nnnnnnnnnnnnnnnacacccacacaccacaccacacaccACACCACACCCACACACACA
> CATCCTAACACTACCCTAACACAGCCctaatcyaacCCTGACCAACCTGTCTCTCAACTT
> +
> !!!!!!!!!!!!!!!@EHHHHHHKKJKKKKNNNBN:NNNNQQQQQABGA?LTTWWWZZZI
> HEFBZLZ]]]]]]]]]ZZZZZT@TTQQQT4A]1?cfiloxL{xuuux{]~~~~~Ake~`~

May we use that for a unit test in Biopython?

Thanks,

Peter

From bugzilla-daemon at portal.open-bio.org Thu Jun 4 17:11:19 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 4 Jun 2009 17:11:19 -0400
Subject: [Biopython-dev] [Bug 2849] PyGresql PostgreSQL driver support for BioSQL is broken
In-Reply-To:
Message-ID: <200906042111.n54LBJnL019044@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2849

------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-04 17:11 EST -------
(In reply to comment #5)
> Psycopg defaults to a "read committed" isolation level that psycopg wraps
> in a transaction block. So, I think the only way not to be in a transaction
> block is to autocommit.

Presumably PyGresql/pgdb is similar in this regard? Given the apparent lack of any online documentation at http://www.pygresql.org/pgdb.html this might be tricky to resolve, but from looking at the CVS repository I don't think they support autocommit.

Maybe we simply can't do a drop database with the pgdb driver for PostgreSQL. Perhaps instead we can just empty all the tables in this case? That might fix (at least part of) test_BioSQL.py

Does test_BioSQL_SeqIO.py work? Does simple use of BioSQL with pgdb work?
From bugzilla-daemon at portal.open-bio.org Thu Jun 4 17:26:51 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 4 Jun 2009 17:26:51 -0400
Subject: [Biopython-dev] [Bug 2849] PyGresql PostgreSQL driver support for BioSQL is broken
In-Reply-To:
Message-ID: <200906042126.n54LQphc020848@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2849

------- Comment #7 from cymon.cox at gmail.com 2009-06-04 17:26 EST -------
(In reply to comment #6)
> (In reply to comment #5)
> > Psycopg defaults to a "read committed" isolation level that psycopg wraps
> > in a transaction block. So, I think the only way not to be in a transaction
> > block is to autocommit.
>
> Presumably PyGresql/pgdb is similar in this regard? Given the apparent lack of
> any documentation on line at http://www.pygresql.org/pgdb.html this might be
> tricky to resolve, but from looking at the CVS repository I don't think they
> support autocommit.

They don't.

>
> Maybe we simply can't do a drop database with the pgdb driver for PostgreSQL.

I had myself convinced this was the case for quite a while, but you can "trick" the cursor with a "COMMIT" and then execute a non-transactional query. I have it working and passing all the tests.

Unfortunately, the driver spews forth "NOTICE"s each time the database is built, i.e. for each CREATE in the schema, which ruins the unittest output. I've yet to find a way to silence this.

I'll submit a patch forthwith...

C.
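
The "COMMIT" trick described in comment #7 could look roughly like the sketch below. It is an illustration under stated assumptions, not the attached patch: RecordingCursor, drop_database and the database name "biosql_test" are hypothetical, and the fake cursor only records the SQL it is given, so none of this has been checked against a real PostgreSQL server or the pgdb driver.

```python
# Sketch only: RecordingCursor and drop_database are hypothetical names,
# not BioSQL or PyGreSQL API. The fake cursor records SQL instead of
# talking to a server, so nothing here is verified against PostgreSQL.


class RecordingCursor:
    """Stand-in for a DB-API cursor that just remembers what it was sent."""

    def __init__(self):
        self.statements = []

    def execute(self, sql):
        self.statements.append(sql)


def drop_database(cursor, name):
    """End the implicit transaction, then drop the named database."""
    if not name.isidentifier():
        # Refuse anything that is not a plain identifier before building SQL.
        raise ValueError("unsafe database name: %r" % name)
    # The driver opens a transaction implicitly; an explicit COMMIT closes it.
    cursor.execute("COMMIT")
    # Sent right after the COMMIT, before any other statement.
    cursor.execute("DROP DATABASE " + name)


cursor = RecordingCursor()
drop_database(cursor, "biosql_test")  # "biosql_test" is a made-up name
print(cursor.statements)
```

The point of the ordering is that the DROP DATABASE statement is the first thing sent after the explicit COMMIT, which is what the thread reports as working with pgdb.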
From bugzilla-daemon at portal.open-bio.org Thu Jun 4 17:59:03 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 4 Jun 2009 17:59:03 -0400 Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from BioSQL, trying to save it to another database fails with loader db_loader.load_seqrecord in _load_reference In-Reply-To: Message-ID: <200906042159.n54Lx3ka023345@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2840 ------- Comment #9 from david.wyllie at ndm.ox.ac.uk 2009-06-04 17:59 EST ------- thank you very much for fixing this. I can confirm that in the current git load (4 june 09) the problem is resolved. David (In reply to comment #6) > Hi David, > > I was able to reproduce this problem. When working on Bug 2838, as my test case > I was using just the file cor6_6.gb which by chance has simple reference > locations - and that worked. I have now tested with GI 28804743. Also, using > some of the other GenBank files in our test suites also shows the reference > location problem from BioSQL/Loader.py function _load_reference: > > ValueError: invalid literal for int() with base 10: 'None' > > This is now fixed in CVS, plus there are now additional unit tests. For the > fix, I have used a slight variation of Cymon's patch. Does this look sensible > Cymon? > > BioSQL/BioSeq.py revision: 1.37 > Tests/test_BioSQL.py revision: 1.39 > Tests/seq_tests_common.py revision: 1.2 > > If you could retest with a clean checkout from CVS/github, to confirm the > problem is fixed, that would be great David. > > Note - currently in BioSQL we only store one reference location, while GenBank > files can have a single reference covering multiple regions of the record. This > is a limitation of the current BioSQL schema (although it would be interesting > to see how BioPerl deals with this). 
> > Note - there are four known failures in test_BioSQL.py right now, a mixed > strand feature in NC_000932.gb (which triggers two failures), the project cross > reference in NC_005816.gb, and a sub-feature location reference in one_of.gb -- > these are all unrelated to this issue (Bug 2840). > > Thanks, > > Peter > -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jun 4 18:05:51 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 4 Jun 2009 18:05:51 -0400 Subject: [Biopython-dev] [Bug 2849] PyGresql PostgreSQL driver support for BioSQL is broken In-Reply-To: Message-ID: <200906042205.n54M5p2w023834@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2849 ------- Comment #8 from cymon.cox at gmail.com 2009-06-04 18:05 EST ------- Created an attachment (id=1320) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1320&action=view) pgdb PyGreSQL Postgres Driver support -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jun 4 18:08:17 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 4 Jun 2009 18:08:17 -0400 Subject: [Biopython-dev] [Bug 2849] PyGresql PostgreSQL driver support for BioSQL is broken In-Reply-To: Message-ID: <200906042208.n54M8H1u024037@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2849 ------- Comment #9 from cymon.cox at gmail.com 2009-06-04 18:08 EST ------- (In reply to comment #7) > Unfortunately, the driver spews forth "NOTICE"'s each time the database is > built, ie for each CREATE in the schema, which ruins the unittest output. 
> I've yet to find a way to silence this.

This isn't an issue for users who run the entire Biopython test suite with run_tests.py, so I'll ignore it.

C.

From sbassi at clubdelarazon.org Thu Jun 4 19:22:27 2009
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Thu, 4 Jun 2009 20:22:27 -0300
Subject: [Biopython-dev] Biopython logo usage
Message-ID: <9e2f512b0906041622p3e2320cdxc735ab6a596d2629@mail.gmail.com>

Hello,

I wonder if the Biopython logo (http://biopython.org/wiki/Logo) has any usage guidelines. I am asking this because I am working on the cover design of my book "Python for Bioinformatics" and I want to include the Biopython logo. The idea is to highlight the fact that Biopython is covered in the book. Do you think it is fair, or should I not include this logo on the cover? Here is a draft of the cover: http://www.dnalinux.com/coverdraft1.pdf
This is the first draft, without the Biopython logo, but there is a "Python" logo as part of a screenshot of the Python installer for Mac.

Best,

--
Sebastián Bassi. Diplomado en Ciencia y Tecnología.

Non standard disclaimer: READ CAREFULLY. By reading this email, you agree, on behalf of your employer, to release me from all obligations and waivers arising from any and all NON-NEGOTIATED agreements, licenses, terms-of-service, shrinkwrap, clickwrap, browsewrap, confidentiality, non-disclosure, non-compete and acceptable use policies ("BOGUS AGREEMENTS") that I have entered into with your employer, its partners, licensors, agents and assigns, in perpetuity, without prejudice to my ongoing rights and privileges. You further represent that you have the authority to release me from any BOGUS AGREEMENTS on behalf of your employer.
From idoerg at gmail.com Thu Jun 4 19:30:01 2009
From: idoerg at gmail.com (Iddo Friedberg)
Date: Thu, 4 Jun 2009 16:30:01 -0700
Subject: [Biopython-dev] Biopython logo usage
In-Reply-To: <9e2f512b0906041622p3e2320cdxc735ab6a596d2629@mail.gmail.com>
References: <9e2f512b0906041622p3e2320cdxc735ab6a596d2629@mail.gmail.com>
Message-ID:

Wow, congratulations! I am so buying this off my startup....

I think the question is best addressed to Thomas Hamelryck. IIRC, his friend designed the logo. There is no license I am aware of, but it is probably a good idea to put a cc-na license on it, like Tux has.

Best,

Iddo

On Thu, Jun 4, 2009 at 4:22 PM, Sebastian Bassi wrote:
> Hello,
>
> I wonder if the Biopython logo (http://biopython.org/wiki/Logo) has
> any usage guidelines.
> I am asking this because I am working on the cover design of my book
> "Python for Bioinformatics" and I want to include the Biopython logo.
> The idea is to highlight the fact that Biopython is covered in the
> book. Do you think it is fair, or should I not include this logo on the
> cover? Here is a draft of the cover:
> http://www.dnalinux.com/coverdraft1.pdf
> This is the first draft, without the Biopython logo, but there is a
> "Python" logo as part of a screenshot of the Python installer for Mac.
> Best,
>
> --
> Sebastián Bassi. Diplomado en Ciencia y Tecnología.
>
> Non standard disclaimer: READ CAREFULLY. By reading this email,
> you agree, on behalf of your employer, to release me from all
> obligations and waivers arising from any and all NON-NEGOTIATED
> agreements, licenses, terms-of-service, shrinkwrap, clickwrap,
> browsewrap, confidentiality, non-disclosure, non-compete and
> acceptable use policies ("BOGUS AGREEMENTS") that I have
> entered into with your employer, its partners, licensors, agents and
> assigns, in perpetuity, without prejudice to my ongoing rights and
> privileges. You further represent that you have the authority to release
> me from any BOGUS AGREEMENTS on behalf of your employer.
>

--
Iddo Friedberg, Ph.D.
Atkinson Hall, mail code 0446
University of California, San Diego
9500 Gilman Drive
La Jolla, CA 92093-0446, USA
T: +1 (858) 534-0570
http://iddo-friedberg.org

From p.j.a.cock at googlemail.com Fri Jun 5 05:16:52 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 5 Jun 2009 10:16:52 +0100
Subject: [Biopython-dev] Biopython logo usage
In-Reply-To: <9e2f512b0906041622p3e2320cdxc735ab6a596d2629@mail.gmail.com>
References: <9e2f512b0906041622p3e2320cdxc735ab6a596d2629@mail.gmail.com>
Message-ID: <320fb6e00906050216m70d1ce90n396a0a36925e6c74@mail.gmail.com>

On Fri, Jun 5, 2009 at 12:22 AM, Sebastian Bassi wrote:
> Hello,
>
> I wonder if the Biopython logo (http://biopython.org/wiki/Logo) has
> any usage guidelines.

I don't know if there was anything formal ever written down. As Iddo said, we should probably clear this with Thomas Hamelryck (BCC'd) (and Henrik Vestergaard) who came up with the logo.
http://www.biopython.org/wiki/Logo

> I am asking this because I am working on the cover design of my book
> "Python for Bioinformatics" and I want to include the Biopython logo.
> The idea is to highlight the fact that Biopython is covered in the
> book. Do you think it is fair, or should I not include this logo on the
> cover?

As you only have a single chapter on Biopython, having our logo too prominent could be misleading. However, I personally like the idea of including the logo on your cover - a bit more promotion of Biopython would be nice.
From a visual layout point of view, I'm not sure what to suggest - the yellow snakes don't go very well with the blue background (although there is yellow in the current python logo which should help balance things). > Here is a draft of the cover: http://www.dnalinux.com/coverdraft1.pdf > This is the first draft, without the Biopython logo, but there is a > "Python" logo as part of a screenshot of the Python installer for Mac. That looks good. In fact, when we did the last release I was thinking about including the Biopython logo on the Windows Installers - it looks pretty easy once we have a bitmap the right size... Peter From bugzilla-daemon at portal.open-bio.org Fri Jun 5 05:28:37 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 5 Jun 2009 05:28:37 -0400 Subject: [Biopython-dev] [Bug 2849] PyGresql PostgreSQL driver support for BioSQL is broken In-Reply-To: Message-ID: <200906050928.n559Sbwe002877@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2849 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1320 is|0 |1 obsolete| | ------- Comment #10 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-05 05:28 EST ------- (From update of attachment 1320) Good work - checked in, thanks. One minor thing - do we have to do this: server.adaptor.cursor.execute("COMMIT") rather than something like: server.adaptor.commit() (I guess you had your reasons, and may have been doing this deliberately.) Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Jun 5 06:08:41 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 5 Jun 2009 06:08:41 -0400 Subject: [Biopython-dev] [Bug 2849] PyGresql PostgreSQL driver support for BioSQL is broken In-Reply-To: Message-ID: <200906051008.n55A8flo005419@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2849 ------- Comment #11 from cymon.cox at gmail.com 2009-06-05 06:08 EST ------- (In reply to comment #10) > (From update of attachment 1320 [details]) > Good work - checked in, thanks. > > One minor thing - do we have to do this: > server.adaptor.cursor.execute("COMMIT") > > rather than something like: > server.adaptor.commit() > > (I guess you had your reasons, and may have been doing this deliberately.) Committing on the adaptor doesnt work: when the code goes to drop the db, it throws a usual "cannot run inside a transaction block". So it appears the commit must be made on the cursor. Cheers, C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jun 5 09:36:21 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 5 Jun 2009 09:36:21 -0400 Subject: [Biopython-dev] [Bug 2848] SeqIO fastq routines reject valid quality socres In-Reply-To: Message-ID: <200906051336.n55DaLWX021023@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2848 ------- Comment #7 from pmmagic at gmail.com 2009-06-05 09:36 EST ------- (In reply to comment #6) > (In reply to comment #5) > > HI Peter, > > > > The problem arises in parsing the fastq formatted consensus mappings > > produced by MAQ, so these are "mapping qualities" rather than read > > qualities directly. 
> > I see - that does explain why there are sometimes very very good quality > scores. Presumably maq limits itself to a maximum PHRED quality of 93? I presume so. If not, that would contradict their own specification. > > > Here's a simple snippet that is valid fastq: > > > > @ref|NC_001133| > > nnnnnnnnnnnnnnnacacccacacaccacaccacacaccACACCACACCCACACACACA > > CATCCTAACACTACCCTAACACAGCCctaatcyaacCCTGACCAACCTGTCTCTCAACTT > > + > > !!!!!!!!!!!!!!!@EHHHHHHKKJKKKKNNNBN:NNNNQQQQQABGA?LTTWWWZZZI > > HEFBZLZ]]]]]]]]]ZZZZZT@TTQQQT4A]1?cfiloxL{xuuux{]~~~~~Ake~`~ > > May we use that for a unit test in Biopython? Absolutely. Cheers, Paul -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From sbassi at clubdelarazon.org Fri Jun 5 10:49:44 2009 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Fri, 5 Jun 2009 11:49:44 -0300 Subject: [Biopython-dev] Biopython logo usage In-Reply-To: <320fb6e00906050216m70d1ce90n396a0a36925e6c74@mail.gmail.com> References: <9e2f512b0906041622p3e2320cdxc735ab6a596d2629@mail.gmail.com> <320fb6e00906050216m70d1ce90n396a0a36925e6c74@mail.gmail.com> Message-ID: <9e2f512b0906050749m19cba875k120967a4614d695b@mail.gmail.com> On Fri, Jun 5, 2009 at 6:16 AM, Peter Cock wrote: > I don't know if there was anything formal ever written down. As Iddo > said, we should probably clear this with Thomas Hamelryck (BCC'd) > (and Henrik Vestergaard) who came up with the logo. > http://www.biopython.org/wiki/Logo OK, I also wrote to him. > As you only have a single chapter on Biopython, having our logo too > prominent could be misleading. However, I personally like the idea of There is one chapter about Biopython and there are code recipes and most of them use Biopython. But clearly it is not a Biopython book. 
I don't suggest making it prominent; I included more screenshots on the cover and planned to include the logo in a corner. > including the logo on your cover - a bit more promotion of Biopython > would be nice. From a visual layout point of view, I'm not sure what I think the same; for most bioinformaticians, Bioperl is the first option when they think of a bioinformatics programming/scripting language. I will wait for Henrik's approval. Best, SB. From anaryin at gmail.com Fri Jun 5 13:48:10 2009 From: anaryin at gmail.com (João Rodrigues) Date: Fri, 5 Jun 2009 19:48:10 +0200 Subject: [Biopython-dev] PolyA Sequence fails to BLAST? Message-ID: Hello all, this is quite a general curiosity. I was trying my application and I was testing the case of a sequence not having matches in BLAST. I chose a long stretch of Alanines, randomly: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA This is the URL that Biopython generated (printed from NCBIWWW.py): http://blast.ncbi.nlm.nih.gov/Blast.cgi?COMPOSITION_BASED_STATISTICS=True&DATABASE=pdb&ENTREZ_QUERY=%28none%29&EXPECT=10&GAPCOSTS=10+1&HITLIST_SIZE=50&MATRIX_NAME=PAM70&PROGRAM=blastp&QUERY=AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA&WORD_SIZE=3&CMD=Put Not finding this odd enough, because it says "Sequence not in FASTA format", I went to the BLAST server page and manually tried to run it. Same error. My question is not related to Biopython, but since I found it with it, I guess I might as well ask: Why does a poly A sequence crash BLAST? :x Regards! João [ .. ] Rodrigues (Blog) http://doeidoei.wordpress.com (MSN) always_asleep_ at hotmail.com (Skype) rodrigues.jglm From sbassi at clubdelarazon.org Fri Jun 5 14:01:42 2009 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Fri, 5 Jun 2009 15:01:42 -0300 Subject: [Biopython-dev] PolyA Sequence fails to BLAST? 
In-Reply-To: References: Message-ID: <9e2f512b0906051101q1be60531r589695d1fd1d0a17@mail.gmail.com> On Fri, Jun 5, 2009 at 2:48 PM, João Rodrigues wrote: > Hello all, this is quite a general curiosity. > I was trying my application and I was testing the case of a sequence not > having matches in BLAST. I chose a long stretch of Alanines, randomly: > AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA > Not finding this odd enough, because it says "Sequence not in FASTA format", > I went to the BLAST server page and manually tried to run it. Same error. That is because this is a "low complexity" region that in most cases is masked (with X or N) before entering into a BLAST search. Look here: "Filter (Low-complexity) Mask off segments of the query sequence that have low compositional complexity, as determined by the SEG program of Wootton & Federhen (Computers and Chemistry, 1993) or, for BLASTN, by the DUST program of Tatusov and Lipman (in preparation). Filtering can eliminate statistically significant but biologically uninteresting reports from the blast output (e.g., hits against common acidic-, basic- or proline-rich regions), leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences. Filtering is only applied to the query sequence (or its translation products), not to database sequences. Default filtering is DUST for BLASTN, SEG for other programs. It is not unusual for nothing at all to be masked by SEG, when applied to sequences in SWISS-PROT, so filtering should not be expected to always yield an effect. Furthermore, in some cases, sequences are masked in their entirety, indicating that the statistical significance of any matches reported against the unfiltered query sequence should be suspect." 
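A rough way to see why a poly-A query counts as "low complexity" is to look at its compositional entropy. The sketch below is only an illustration in modern Python 3; it is not the SEG or DUST algorithm used by BLAST, and the names shannon_entropy and looks_low_complexity, as well as the 1.5 bit threshold, are assumptions made up for this example.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Return the compositional entropy of seq in bits per residue."""
    if not seq:
        return 0.0
    counts = Counter(seq.upper())
    total = len(seq)
    # Sum p * log2(1/p) over the observed residues.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def looks_low_complexity(seq, threshold=1.5):
    """Crude check: flag sequences whose entropy is below threshold.

    The threshold is an arbitrary illustrative value, not a BLAST setting.
    """
    return shannon_entropy(seq) < threshold


if __name__ == "__main__":
    poly_a = "A" * 41
    print(shannon_entropy(poly_a), looks_low_complexity(poly_a))
```

A homopolymer has only one residue type, so its entropy is zero and a filter of this kind would mask the whole query, which is consistent with the "sequences are masked in their entirety" case in the quoted help text.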
From thomas.hamelryck at gmail.com Fri Jun 5 14:05:45 2009 From: thomas.hamelryck at gmail.com (Thomas Hamelryck) Date: Fri, 5 Jun 2009 20:05:45 +0200 Subject: [Biopython-dev] Biopython logo usage In-Reply-To: References: <9e2f512b0906041622p3e2320cdxc735ab6a596d2629@mail.gmail.com> Message-ID: <2d7c25310906051105k36105d51had0593c2903a8464@mail.gmail.com> On Fri, Jun 5, 2009 at 1:30 AM, Iddo Friedberg wrote: > Wow, congratulations! I am so buying this off my startup.... > > I think the question is best addressed to Thomas Hamelryck. IIRC, his friend > designed the logo. There is no license I am aware of, but it is probably a > good idea to put a cc-na license on it, like Tux has. > Should be fine, but will ask to be sure. Cheers, -Thomas From idoerg at gmail.com Sat Jun 6 22:33:47 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Sat, 6 Jun 2009 19:33:47 -0700 Subject: [Biopython-dev] skipping a bad record in SeqIO.parse Message-ID: Suppose SeqIO throws an exception due to a bad record. I want to note that in stderr and move on to the next record. How do I do that? The following eyesore of a code simply leaves me stuck reading the same bad record over and over:

    seq_reader = SeqIO.parse(in_handle, format)
    while True:
        try:
            seq_record = seq_reader.next()
        except StopIteration:
            break
        except:
            if debug:
                sys.stderr.write("Sequence not read: %s%s" % (seq_record.id, os.linesep))
                sys.stderr.flush()
            continue
        if not seq_record:
            break

-- Iddo Friedberg, Ph.D. 
Atkinson Hall, mail code 0446 University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0446, USA T: +1 (858) 534-0570 http://iddo-friedberg.org From mjldehoon at yahoo.com Sun Jun 7 07:38:10 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sun, 7 Jun 2009 04:38:10 -0700 (PDT) Subject: [Biopython-dev] Bio.SeqIO & Bio.SwissProt; comment lines Message-ID: <268230.50854.qm@web62402.mail.re1.yahoo.com> Hi everybody, Comments in SwissProt files such as the following: CC -!- FUNCTION: Core subunit of the mitochondrial membrane respiratory CC chain NADH dehydrogenase (Complex I) that is believed to belong to CC the minimal assembly required for catalysis. Complex I functions CC in the transfer of electrons from NADH to the respiratory chain. CC The immediate electron acceptor for the enzyme is believed to be CC ubiquinone (By similarity). CC -!- CATALYTIC ACTIVITY: NADH + ubiquinone = NAD(+) + ubiquinol. CC -!- SUBCELLULAR LOCATION: Mitochondrion membrane; Multi-pass membrane CC protein (By similarity). CC -!- SIMILARITY: Belongs to the complex I subunit 3 family. CC ----------------------------------------------------------------------- CC Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms CC Distributed under the Creative Commons Attribution-NoDerivs License CC ----------------------------------------------------------------------- are currently being stored differently by Bio.SeqIO and Bio.SwissProt. Bio.SeqIO stores the comments as one string, as follows: >>> record.annotations['comment'] '-!- FUNCTION: Core subunit of the mitochondrial membrane respiratory\n\n chain NADH dehydrogenase (Complex I) that is believed to belong to\n\n the minimal assembly required for catalysis. 
Complex I functions\n\n in the transfer of electrons from NADH to the respiratory chain.\n\n The immediate electron acceptor for the enzyme is believed to be\n\n ubiquinone (By similarity).\n\n-!- CATALYTIC ACTIVITY: NADH + ubiquinone = NAD(+) + ubiquinol.\n\n-!- SUBCELLULAR LOCATION: Mitochondrion membrane; Multi-pass membrane\n\n protein (By similarity).\n\n-!- SIMILARITY: Belongs to the complex I subunit 3 family.\n\n-----------------------------------------------------------------------\n\nCopyrighted by the UniProt Consortium, see http://www.uniprot.org/terms\n\nDistributed under the Creative Commons Attribution-NoDerivs License\n\n-----------------------------------------------------------------------\n' Note that two endlines appear at the end of each line; I don't know why. Bio.SwissProt, on the other hand, stores a list of comments (with single newlines): >>> record.comments [' FUNCTION: Core subunit of the mitochondrial membrane respiratory\n chain NADH dehydrogenase (Complex I) that is believed to belong to\n the minimal assembly required for catalysis. 
Complex I functions\n in the transfer of electrons from NADH to the respiratory chain.\n The immediate electron acceptor for the enzyme is believed to be\n ubiquinone (By similarity).\n', ' CATALYTIC ACTIVITY: NADH + ubiquinone = NAD(+) + ubiquinol.\n', ' SUBCELLULAR LOCATION: Mitochondrion membrane; Multi-pass membrane\n protein (By similarity).\n', ' SIMILARITY: Belongs to the complex I subunit 3 family.\n', '-----------------------------------------------------------------------\nCopyrighted by the UniProt Consortium, see http://www.uniprot.org/terms\nDistributed under the Creative Commons Attribution-NoDerivs License\n-----------------------------------------------------------------------\n'] I think that the approach used by Bio.SwissProt is more reasonable, although I'd prefer to remove the newlines and to skip the copyright statement altogether (since it's the same for all SwissProt records anyway). Can we do the same for Bio.SeqIO? Or is there a need to keep record.annotations['comments'] as a single string? If they are kept as a single string, how about using a single newline between comments, and no newlines within comments? This btw is the last inconsistency between Bio.SeqIO and Bio.SwissProt. By making this consistent, Bio.SeqIO could use Bio.SwissProt as a backend, which is about three times faster than the current parser, and has the added benefit of having to maintain only one SwissProt parser. --Michiel. --Michiel From biopython at maubp.freeserve.co.uk Sun Jun 7 07:52:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 7 Jun 2009 12:52:04 +0100 Subject: [Biopython-dev] [Biopython] skipping a bad record read in SeqIO In-Reply-To: References: Message-ID: <320fb6e00906070452u259f9f3eg4aaed8a4ab673ec4@mail.gmail.com> On Sun, Jun 7, 2009 at 3:36 AM, Iddo Friedberg wrote: > Suppose an iterator based reader throws an exception due to a bad record. I > want to note that in stderr an move on to the next record. How do i do that? 
The short answer is you can't (at least not easily), but the details would depend on which parser you are using (i.e. which file format). Do you have a corrupt file, or do you think you might have found a bug in a parser? More details would help. If you really have to do this, then if the file format is simple I would suggest you manually read the file into chunks and then pass them to SeqIO one by one. Not elegant but it would work. For example with a GenBank file, loop over the file line by line caching the data until you reach a new LOCUS line. Then turn the cached lines into a StringIO handle and give it to Bio.SeqIO.read() to parse that single record (in a try/except). Peter From biopython at maubp.freeserve.co.uk Sun Jun 7 08:11:06 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 7 Jun 2009 13:11:06 +0100 Subject: [Biopython-dev] Bio.SeqIO & Bio.SwissProt; comment lines In-Reply-To: <268230.50854.qm@web62402.mail.re1.yahoo.com> References: <268230.50854.qm@web62402.mail.re1.yahoo.com> Message-ID: <320fb6e00906070511v45b9a6cft2d69518cfaf81a0e@mail.gmail.com> On Sun, Jun 7, 2009 at 12:38 PM, Michiel de Hoon wrote: > > Hi everybody, > > Comments in SwissProt files such as the following: ... > are currently being stored differently by Bio.SeqIO and Bio.SwissProt. > > Bio.SeqIO stores the comments as one string, as follows: ... > Note that two endlines appear at the end of each line; I don't know why. The double new lines sound like a bug to me, we should fix that. > Bio.SwissProt, on the other hand, stores a list of comments (with > single newlines): ... That's just a list containing one string in your example. > I think that the approach used by Bio.SwissProt is more reasonable, > although I'd prefer to remove the newlines and to skip the copyright > statement altogether (since it's the same for all SwissProt records > anyway). 
In the long term, it looks like the new SwissProt comments are structured in a way that would allow automatic parsing to extract the data. > Can we do the same for Bio.SeqIO? Or is there a need to keep > record.annotations['comments'] as a single string? If they are > kept as a single string, how about using a single newline between > comments, and no newlines within comments? I think there are reasons to keep record.annotations['comments'] as a single string. The GenBank SeqRecord parser (called from Bio.SeqIO) also uses a single string for comments (not a list of strings), so the old SwissProt SeqRecord parser (and thus Bio.SeqIO) is consistent with that. I'd also have to check if switching to a list of strings would be OK with the BioSQL code. Finally, such a change would not be backwards compatible and could break existing scripts. > This btw is the last inconsistency between Bio.SeqIO and > Bio.SwissProt. By making this consistent, Bio.SeqIO could > use Bio.SwissProt as a backend, which is about three times > faster than the current parser, and has the added benefit > of having to maintain only one SwissProt parser. Three times faster sounds very good - assuming it can parse all our existing unit tests of course ;) We don't actually need to change the way comments are stored in the SeqRecord for this parser. I understood your plan is to build a new Bio.SeqIO SwissProt parser on top of the new Bio.SwissProt record based parser, by converting the SwissProt records into SeqRecord objects. At this step, simply concatenate the list of comment strings into one string for the SeqRecord. 
Then we can use the new faster Bio.SwissProt parser within Bio.SeqIO, without breaking backwards compatibility, and deprecate the old Bio.SwissProt.SProt parser :) Peter From bugzilla-daemon at portal.open-bio.org Sun Jun 7 12:54:24 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 7 Jun 2009 12:54:24 -0400 Subject: [Biopython-dev] [Bug 2851] New: Psycopg version 1 support for BioSQL Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2851 Summary: Psycopg version 1 support for BioSQL Product: Biopython Version: Not Applicable Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: BioSQL AssignedTo: biopython-dev at biopython.org ReportedBy: cymon.cox at gmail.com The recent additions the BioSQL interface (bug 2833) as a workaround for the RULES in the schema (bug 2839) (see also PyGreSQL support bug 2849) has broken support for Psycopg VERSION 1. The last release of Psycopg1 was version 1.1.21 on 2005-10-01. So it's pretty old and most users will probably have moved to psycopg2. However, it not been deprecated and, without the recent rules workaround code in place, it does pass all the tests (excepts the rules tests obviously). Psycopg1 fails the rules workaround code because in the BioSeqDatabase namespace, the connection object is a list of functions, and not a class that can be inspected for the driver name, and the IntegrityError is in the driver module namespace. One possible solution is to make the check for the RULES when the database is opened, set a module global variable, and later re-import the module to get the IntegrityError. It is a nasty solution, but in its favour it can be easily be removed when the RULES are eventually removed from the schema. Anyway, attached is a patch using this workaround which works for psycopg1, psycopg2, and PyGreSQL (note only for the pgdb driver and not the pg driver). C. 
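The workaround described in this report (note which driver is in use when the database is opened, keep that in a module-level variable, and later import the driver module again to reach its IntegrityError) could be sketched roughly as follows. This is not the attached patch: remember_driver, get_integrity_error and _driver_name are hypothetical names, and sqlite3 stands in for psycopg purely so that the example runs with the standard library on Python 3.

```python
import importlib

# Module-level global holding the driver module name (hypothetical).
_driver_name = None


def remember_driver(driver_name):
    """Record the DB-API driver name at the time the database is opened."""
    global _driver_name
    importlib.import_module(driver_name)  # fail early if it is missing
    _driver_name = driver_name


def get_integrity_error():
    """Re-import the recorded driver and return its IntegrityError class."""
    if _driver_name is None:
        raise RuntimeError("no driver recorded; call remember_driver() first")
    return importlib.import_module(_driver_name).IntegrityError


def demo():
    """Insert a duplicate key and catch the driver-specific error."""
    import sqlite3  # stand-in driver for this sketch only

    remember_driver("sqlite3")
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
    cur.execute("INSERT INTO t (id) VALUES (1)")
    try:
        cur.execute("INSERT INTO t (id) VALUES (1)")
    except get_integrity_error():
        return True
    finally:
        conn.close()
    return False
```

Looking the exception up through the module name avoids relying on the connection object exposing its driver, which is the part that reportedly fails with psycopg version 1.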
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sun Jun 7 12:55:37 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 7 Jun 2009 12:55:37 -0400 Subject: [Biopython-dev] [Bug 2851] Psycopg version 1 support for BioSQL In-Reply-To: Message-ID: <200906071655.n57GtbZN017199@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2851 ------- Comment #1 from cymon.cox at gmail.com 2009-06-07 12:55 EST ------- Created an attachment (id=1321) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1321&action=view) Psycopg 1 RULES workaround -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From idoerg at gmail.com Sun Jun 7 15:10:16 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Sun, 7 Jun 2009 12:10:16 -0700 Subject: [Biopython-dev] [Biopython] skipping a bad record read in SeqIO In-Reply-To: <320fb6e00906070452u259f9f3eg4aaed8a4ab673ec4@mail.gmail.com> References: <320fb6e00906070452u259f9f3eg4aaed8a4ab673ec4@mail.gmail.com> Message-ID: Thanks Peter. OK, it's a genbank file, but the point is not hacking around that problem (which I did), it's more of a biopython policy question. Biopython cannot handle every record format variant (==error) out there, and we should probably have a method for skipping over illegible records. The records skipped should be noted, of course, e.g. by writing to stderr. If the record cannot be read, then the preceding record ID and / or the record serial number should be written. Does that sound like something we should be doing? 
On Sun, Jun 7, 2009 at 4:52 AM, Peter wrote: > On Sun, Jun 7, 2009 at 3:36 AM, Iddo Friedberg wrote: > > Suppose an iterator based reader throws an exception due to a bad record. > I > > want to note that in stderr an move on to the next record. How do i do > that? > > The short answer is you can't (at least not easily), but the details > would depend on which parser you are using (i.e. which file format). > > Do you have a corrupt file, or do you think you might have found a bug > in a parser? More details would help. > > If you really have to do this, then if the file format is simple I > would suggest you manually read the file into chunks and then pass > them to SeqIO one by one. Not elegant but it would work. For example > with a GenBank file, loop over the file line by line caching the data > until you reach a new LOCUS line. Then turn the cached lines into a > StringIO handle and give it to Bio.SeqIO.read() to parse that single > record (in a try/except). > > Peter > -- Iddo Friedberg, Ph.D. Atkinson Hall, mail code 0446 University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0446, USA T: +1 (858) 534-0570 http://iddo-friedberg.org From biopython at maubp.freeserve.co.uk Sun Jun 7 16:10:33 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 7 Jun 2009 21:10:33 +0100 Subject: [Biopython-dev] [Biopython] skipping a bad record read in SeqIO In-Reply-To: References: <320fb6e00906070452u259f9f3eg4aaed8a4ab673ec4@mail.gmail.com> Message-ID: <320fb6e00906071310w20b3ac61n5238093b399fd85a@mail.gmail.com> On 6/7/09, Iddo Friedberg wrote: > Thanks Peter. > > OK, it's a genbank file, but the point is not hacking around that problem > (which I did), it's more of a biopython policy question. Could you report a bug with this particular GenBank file (or at least, the entry). I think Biopython should try and cope with all valid GenBank files. 
It has been a long time since I personally found a GenBank file Biopython couldn't parse - the only cases I can remember recently from the mailing list have been invalid files from 3rd party scripts or tools. Sometimes for out of spec files issuing a warning but continuing may be OK (we do already this on some LOCUS line variants, e.g. some GenBank files output from EMBOSS), but for anything unexpected I think the only safe option is to raise an exception. > Biopython cannot handle every record format variant (==error) out there, > and we should probably have a method for skipping over illegible records. > The records skipped should be noted, of course, e.g. by writing to stderr. > If the record cannot be read, then the preceding record ID and / or the > record serial number should be written. > > Does that sound like something we should be doing? No, not really. I'm not 100% sure this is what you meant, but I would oppose any suggestion that the default behaviour should be to completely skip bad records (with only a warning or output to stderr to signal this). In some cases (e.g. GenBank and SwissProt files) the start and end of records are well defined, so for a corrupt record we may be able to recover by issuing a warning and skipping ahead to the next record boundary. In other file formats this could be impossible (or at least, risky). So as a general policy for Bio.SeqIO, I don't think we can offer any way to skip bad records. Perhaps I am biased as most GenBank files I personally use are single records (i.e. genomes). Peter P.S. I would use the warnings module rather than writing to stderr, as this would allow the user to filter warnings, upgrade them to exceptions etc. 
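The chunking approach suggested earlier in this thread (split a GenBank file at LOCUS lines, wrap each chunk in a StringIO handle, parse it inside a try/except) combined with the warnings module might look something like the sketch below. It is written for modern Python 3 and is not part of Bio.SeqIO; split_genbank_records and parse_tolerantly are hypothetical helper names, and parse_one is any callable taking a handle, for example lambda h: SeqIO.read(h, "genbank") if Biopython is installed.

```python
import io
import warnings


def split_genbank_records(handle):
    """Yield the text of each record, splitting at lines starting with LOCUS."""
    chunk = []
    for line in handle:
        if line.startswith("LOCUS") and chunk:
            text = "".join(chunk)
            if text.strip():  # ignore leading blank lines
                yield text
            chunk = []
        chunk.append(line)
    text = "".join(chunk)
    if text.strip():
        yield text


def parse_tolerantly(handle, parse_one):
    """Yield parsed records; warn about and skip records that fail to parse."""
    for index, text in enumerate(split_genbank_records(handle), start=1):
        try:
            yield parse_one(io.StringIO(text))
        except Exception as err:
            # A warning rather than a write to stderr, so callers can
            # filter it or turn it back into an exception.
            warnings.warn("Skipping record %d: %s" % (index, err))
```

Because the skipped records are reported through warnings.warn, a caller who prefers the strict behaviour can call warnings.simplefilter("error") before parsing, which turns each skipped record back into an exception.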
From idoerg at gmail.com Sun Jun 7 17:14:10 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Sun, 7 Jun 2009 14:14:10 -0700 Subject: [Biopython-dev] skipping a bad record read in SeqIO In-Reply-To: References: <320fb6e00906070452u259f9f3eg4aaed8a4ab673ec4@mail.gmail.com> <320fb6e00906071310w20b3ac61n5238093b399fd85a@mail.gmail.com> Message-ID: On Sun, Jun 7, 2009 at 1:10 PM, Peter wrote: > On 6/7/09, Iddo Friedberg wrote: > > Thanks Peter. > > > > OK, it's a genbank file, but the point is not hacking around that > problem > > (which I did), it's more of a biopython policy question. > > Could you report a bug with this particular GenBank file (or at least, the > entry). I think Biopython should try and cope with all valid GenBank files. > > It has been a long time since I personally found a GenBank file > Biopython couldn't parse - the only cases I can remember recently from > the mailing list have been invalid files from 3rd party scripts or tools. > > Sometimes for out of spec files issuing a warning but continuing may > be OK (we do already this on some LOCUS line variants, e.g. some > GenBank files output from EMBOSS), but for anything unexpected I > think the only safe option is to raise an exception. > > > Biopython cannot handle every record format variant (==error) out there, > > and we should probably have a method for skipping over illegible > records. > > The records skipped should be noted, of course, e.g. by writing to > stderr. > > If the record cannot be read, then the preceding record ID and / or the > > record serial number should be written. > > > > Does that sound like something we should be doing? > > No, not really. > > I'm not 100% sure this is what you meant, but I would oppose any > suggestion that the default behaviour should be to completely skip bad > records (with only a warning or output to stderr to signal this). > > In some cases (e.g. 
GenBank and SwissProt files) the start and end of > records are well defined, so for a corrupt record we may be able to > recover by issuing a warning and skipping ahead to the next record > boundary. In other file formats this could be impossible (or at least, > risky). So as a general policy for Bio.SeqIO, I don't think we can > offer any way to skip bad records. > > Perhaps I am biased as most GenBank files I personally use are single > records (i.e. genomes). No, I am not suggesting that it should be the default behavior, but that an argument (skip_bad_records=True or somesuch) could be passed to the parser to make this possible for users who would like to do that. I work with millions of sequences at a time, and if 5,000 or 50,000 are badly formatted (or problematic due to a parser bug), I would rather make a note of it and move on, coming back later to fix the problem. The alternative would be -- well, and ugly hack, which will cause loss of time and research momentum. Also, I am not suggesting an exact implementation (yet). Warnings do sound better than stderr. There are a few million genbank (the format) files out there that did not originate with NCBI genbank (the database). Mostly in metagenomics. Some are meta-file that contain no sequence but only LOCUS fields. It used to be that any format was strictly adhered to, simply because files in that format would always originate from the same source, and FASTA was the universal format used for exchange, since it is very hard to mess up a fasta format. That is not the case any more. For that reason I think we should consider how to handle unparse-able records. > > > Peter > > P.S. I would use the warnings module rather than writing to stderr, as > this would allow the user to filter warnings, upgrade them to > exceptions etc. > -- Iddo Friedberg, Ph.D. 
Atkinson Hall, mail code 0446 University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0446, USA T: +1 (858) 534-0570 http://iddo-friedberg.org From idoerg at gmail.com Sun Jun 7 17:17:48 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Sun, 7 Jun 2009 14:17:48 -0700 Subject: [Biopython-dev] skipping a bad record read in SeqIO In-Reply-To: References: <320fb6e00906070452u259f9f3eg4aaed8a4ab673ec4@mail.gmail.com> <320fb6e00906071310w20b3ac61n5238093b399fd85a@mail.gmail.com> Message-ID: Here is the stack dump, coming from the file: ftp://ftp.ncbi.nih.gov/genbank/gbcon11.seq.gz

The offender: ACCESSION CH991540 ABGB01000000

Syntax error at or near `Tokens('close_paren')' token
Traceback (most recent call last):
  File "./filter_seqs.py", line 108, in <module>
    matching_seqs, non_matching_seqs = filter_sequences(open(inpath), match_pairs, condition,seq_format)
  File "./filter_seqs.py", line 23, in filter_sequences
    for seq_record in SeqIO.parse(in_handle,format):
  File "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/Scanner.py", line 420, in parse_records
    record = self.parse(handle)
  File "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/Scanner.py", line 403, in parse
    if self.feed(handle, consumer) :
  File "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/Scanner.py", line 381, in feed
    self._feed_misc_lines(consumer, misc_lines)
  File "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/Scanner.py", line 1138, in _feed_misc_lines
    consumer.contig_location(contig_location)
  File "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/__init__.py", line 987, in contig_location
    self.location(content)
  File "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/__init__.py", line 684, in location
    raise LocationParserError(location_line)
Bio.GenBank.LocationParserError:
join(complement(ABGB01000004.1:1..81568),gap(unk100),complement(ABGB01000012.1:1..1260),gap(unk100),ABGB01000013.1:1..1227,gap(unk100),ABGB01000011.1:1..1338,gap(unk100),complement(ABGB01000001.1:1..118303)) On Sun, Jun 7, 2009 at 2:14 PM, Iddo Friedberg wrote: > > > > > > > On Sun, Jun 7, 2009 at 1:10 PM, Peter wrote: > >> On 6/7/09, Iddo Friedberg wrote: >> > Thanks Peter. >> > >> > OK, it's a genbank file, but the point is not hacking around that >> problem >> > (which I did), it's more of a biopython policy question. >> >> Could you report a bug with this particular GenBank file (or at least, the >> entry). I think Biopython should try and cope with all valid GenBank >> files. >> >> It has been a long time since I personally found a GenBank file >> Biopython couldn't parse - the only cases I can remember recently from >> the mailing list have been invalid files from 3rd party scripts or tools. >> >> Sometimes for out of spec files issuing a warning but continuing may >> be OK (we do already this on some LOCUS line variants, e.g. some >> GenBank files output from EMBOSS), but for anything unexpected I >> think the only safe option is to raise an exception. >> >> > Biopython cannot handle every record format variant (==error) out >> there, >> > and we should probably have a method for skipping over illegible >> records. >> > The records skipped should be noted, of course, e.g. by writing to >> stderr. >> > If the record cannot be read, then the preceding record ID and / or the >> > record serial number should be written. >> > >> > Does that sound like something we should be doing? >> >> No, not really. >> >> I'm not 100% sure this is what you meant, but I would oppose any >> suggestion that the default behaviour should be to completely skip bad >> records (with only a warning or output to stderr to signal this). >> >> In some cases (e.g. 
GenBank and SwissProt files) the start and end of >> records are well defined, so for a corrupt record we may be able to >> recover by issuing a warning and skipping ahead to the next record >> boundary. In other file formats this could be impossible (or at least, >> risky). So as a general policy for Bio.SeqIO, I don't think we can >> offer any way to skip bad records. >> >> Perhaps I am biased as most GenBank files I personally use are single >> records (i.e. genomes). > > > No, I am not suggesting that it should be the default behavior, but that an > argument (skip_bad_records=True or somesuch) could be passed to the parser > to make this possible for users who would like to do that. I work with > millions of sequences at a time, and if 5,000 or 50,000 are badly formatted > (or problematic due to a parser bug), I would rather make a note of it and > move on, coming back later to fix the problem. The alternative would be -- > well, and ugly hack, which will cause loss of time and research momentum. > > Also, I am not suggesting an exact implementation (yet). Warnings do sound > better than stderr. > > There are a few million genbank (the format) files out there that did not > originate with NCBI genbank (the database). Mostly in metagenomics. Some are > meta-file that contain no sequence but only LOCUS fields. > > It used to be that any format was strictly adhered to, simply because files > in that format would always originate from the same source, and FASTA was > the universal format used for exchange, since it is very hard to mess up a > fasta format. That is not the case any more. For that reason I think we > should consider how to handle unparse-able records. > > > > > >> >> >> Peter >> >> P.S. I would use the warnings module rather than writing to stderr, as >> this would allow the user to filter warnings, upgrade them to >> exceptions etc. >> > > > > -- > Iddo Friedberg, Ph.D. 
> Atkinson Hall, mail code 0446 > University of California, San Diego > 9500 Gilman Drive > La Jolla, CA 92093-0446, USA > T: +1 (858) 534-0570 > http://iddo-friedberg.org > > -- Iddo Friedberg, Ph.D. Atkinson Hall, mail code 0446 University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0446, USA T: +1 (858) 534-0570 http://iddo-friedberg.org From idoerg at gmail.com Sun Jun 7 17:30:50 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Sun, 7 Jun 2009 14:30:50 -0700 Subject: [Biopython-dev] Fwd: [Biopython] skipping a bad record read in SeqIO In-Reply-To: <320fb6e00906071429u6b1a202di7a32070ec939c267@mail.gmail.com> References: <320fb6e00906070452u259f9f3eg4aaed8a4ab673ec4@mail.gmail.com> <320fb6e00906071310w20b3ac61n5238093b399fd85a@mail.gmail.com> <320fb6e00906071429u6b1a202di7a32070ec939c267@mail.gmail.com> Message-ID: On 6/7/09, Iddo Friedberg wrote: > On Sun, Jun 7, 2009 at 1:10 PM, Peter > wrote: > > > > Could you report a bug with this particular GenBank file (or at least, > > the entry). I think Biopython should try and cope with all valid > > GenBank files. > > > > It has been a long time since I personally found a GenBank file > > Biopython couldn't parse - the only cases I can remember recently from > > the mailing list have been invalid files from 3rd party scripts or tools. > > > > Sometimes for out of spec files issuing a warning but continuing may > > be OK (we do already this on some LOCUS line variants, e.g. some > > GenBank files output from EMBOSS), but for anything unexpected I > > think the only safe option is to raise an exception. > > > > > > > Biopython cannot handle every record format variant (==error) out > > > there, and we should probably have a method for skipping over > > > illegible records.
The records skipped should be noted, of course, > > > e.g. by writing to stderr. If the record cannot be read, then the > > > preceding record ID and / or the record serial number should be > > > written. > > > > > > Does that sound like something we should be doing? > > > > No, not really. > > > > I'm not 100% sure this is what you meant, but I would oppose any > > suggestion that the default behaviour should be to completely skip bad > > records (with only a warning or output to stderr to signal this). > > > > In some cases (e.g. GenBank and SwissProt files) the start and end of > > records are well defined, so for a corrupt record we may be able to > > recover by issuing a warning and skipping ahead to the next record > > boundary. In other file formats this could be impossible (or at least, > > risky). So as a general policy for Bio.SeqIO, I don't think we can > > offer any way to skip bad records. > > > > Perhaps I am biased as most GenBank files I personally use are single > > records (i.e. genomes). > > No, I am not suggesting that it should be the default behavior, OK, good. I was worried there. > but that an argument (skip_bad_records=True or somesuch) could be > passed to the parser to make this possible for users who would like to > do that. I work with millions of sequences at a time, and if 5,000 or > 50,000 are badly formatted (or problematic due to a parser bug), I > would rather make a note of it and move on, coming back later to fix > the problem. The alternative would be -- well, and ugly hack, which > will cause loss of time and research momentum. > > Also, I am not suggesting an exact implementation (yet). Warnings > do sound better than stderr. > > There are a few million genbank (the format) files out there that did not > originate with NCBI genbank (the database). Mostly in metagenomics. > Some are meta-file that contain no sequence but only LOCUS fields. 
> > It used to be that any format was strictly adhered to, simply because > files in that format would always originate from the same source, and > FASTA was the universal format used for exchange, since it is very > hard to mess up a fasta format. That is not the case any more. For > that reason I think we should consider how to handle unparse-able > records. OK, clearly you have a rather different use case from mine: almost all the GenBank files I have used are from the NCBI, and if not are usually single genomes (with draft annotations) where I am prepared to fix any file format errors by hand. If you haven't already done so I would urge you to report bad files to the upstream source, otherwise such errors are only perpetuated and will cause more headaches in future. You haven't convinced me that we need a general mechanism (in Bio.SeqIO) for skipping bad records in any file format (and I remain sceptical that this is even possible in general). However, for your GenBank situation I can understand your motivation now. I think in your situation I'd implement my earlier "hand waving" suggestion of a pre-parser which breaks the big GenBank file up into individual records, and turns each into a StringIO handle passed to Bio.SeqIO inside a try/except. This would make a nice cookbook recipe... I can picture the code in my head and could probably get it working pretty quickly if you'd like to try this. But probably tomorrow not tonight ;) So, perhaps an option (GenBank/EMBL specific initially) could be considered, but so far this seems like a corner use case to me, which we shouldn't complicate the main code base to accommodate. Peter -- Iddo Friedberg, Ph.D.
Atkinson Hall, mail code 0446 University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0446, USA T: +1 (858) 534-0570 http://iddo-friedberg.org From biopython at maubp.freeserve.co.uk Sun Jun 7 17:31:48 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 7 Jun 2009 22:31:48 +0100 Subject: [Biopython-dev] skipping a bad record read in SeqIO In-Reply-To: References: <320fb6e00906070452u259f9f3eg4aaed8a4ab673ec4@mail.gmail.com> <320fb6e00906071310w20b3ac61n5238093b399fd85a@mail.gmail.com> Message-ID: <320fb6e00906071431m36608514t7b754212ba40bd88@mail.gmail.com> On 6/7/09, Iddo Friedberg wrote: > Here is the stack dump, coming from the file: > > ftp://ftp.ncbi.nih.gov/genbank/gbcon11.seq.gz > > The offender: > > ACCESSION CH991540 ABGB01000000 > > Syntax error at or near `Tokens('close_paren')' token > Traceback (most recent call last): > File "./filter_seqs.py", line 108, in > matching_seqs, non_matching_seqs = filter_sequences(open(inpath), > match_pairs, condition,seq_format) > File "./filter_seqs.py", line 23, in filter_sequences > for seq_record in SeqIO.parse(in_handle,format): > File > "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/Scanner.py", > line 420, in parse_records > record = self.parse(handle) > File > "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/Scanner.py", > line 403, in parse > if self.feed(handle, consumer) : > File > "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/Scanner.py", > line 381, in feed > self._feed_misc_lines(consumer, misc_lines) > File > "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/Scanner.py", > line 1138, in _feed_misc_lines > consumer.contig_location(contig_location) > File > "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/__init__.py", > line 987, in contig_location > self.location(content) > File > "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/__init__.py", > line 684, in location > raise LocationParserError(location_line) > Bio.GenBank.LocationParserError: > 
join(complement(ABGB01000004.1:1..81568),gap(unk100),complement(ABGB01000012.1:1..1260),gap(unk100),ABGB01000013.1:1..1227,gap(unk100),ABGB01000011.1:1..1338,gap(unk100),complement(ABGB01000001.1:1..118303)) > That looks like Bug 2745 to me - does the patch on that bug work for you, and would you be happy storing the CONTIG line as a string? Peter From idoerg at gmail.com Sun Jun 7 17:33:18 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Sun, 7 Jun 2009 14:33:18 -0700 Subject: [Biopython-dev] skipping a bad record read in SeqIO In-Reply-To: <320fb6e00906071431m36608514t7b754212ba40bd88@mail.gmail.com> References: <320fb6e00906070452u259f9f3eg4aaed8a4ab673ec4@mail.gmail.com> <320fb6e00906071310w20b3ac61n5238093b399fd85a@mail.gmail.com> <320fb6e00906071431m36608514t7b754212ba40bd88@mail.gmail.com> Message-ID: Hmm, let me look into that... if it is a noted bug, I may wade into it if nobody else has. Thanks, Iddo On Sun, Jun 7, 2009 at 2:31 PM, Peter wrote: > On 6/7/09, Iddo Friedberg wrote: > > Here is the stack dump, coming from the file: > > > > ftp://ftp.ncbi.nih.gov/genbank/gbcon11.seq.gz > > > > The offender: > > > > ACCESSION CH991540 ABGB01000000 > > > > Syntax error at or near `Tokens('close_paren')' token > > Traceback (most recent call last): > > File "./filter_seqs.py", line 108, in > > matching_seqs, non_matching_seqs = filter_sequences(open(inpath), > > match_pairs, condition,seq_format) > > File "./filter_seqs.py", line 23, in filter_sequences > > for seq_record in SeqIO.parse(in_handle,format): > > File > > "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/Scanner.py", > > line 420, in parse_records > > record = self.parse(handle) > > File > > "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/Scanner.py", > > line 403, in parse > > if self.feed(handle, consumer) : > > File > > "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/Scanner.py", > > line 381, in feed > > self._feed_misc_lines(consumer, misc_lines) > > File > >
"/home/idoerg/biopy_cvs/biopython/Bio/GenBank/Scanner.py", > > line 1138, in _feed_misc_lines > > consumer.contig_location(contig_location) > > File > > "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/__init__.py", > > line 987, in contig_location > > self.location(content) > > File > > "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/__init__.py", > > line 684, in location > > raise LocationParserError(location_line) > > Bio.GenBank.LocationParserError: > > > join(complement(ABGB01000004.1:1..81568),gap(unk100),complement(ABGB01000012.1:1..1260),gap(unk100),ABGB01000013.1:1..1227,gap(unk100),ABGB01000011.1:1..1338,gap(unk100),complement(ABGB01000001.1:1..118303)) > > > > That look like Bug 2745 to me - does the patch on that bug work for > you, and would you be happy storing the CONTIG line as string? > > Peter > -- Iddo Friedberg, Ph.D. Atkinson Hall, mail code 0446 University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0446, USA T: +1 (858) 534-0570 http://iddo-friedberg.org From biopython at maubp.freeserve.co.uk Sun Jun 7 17:40:57 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 7 Jun 2009 22:40:57 +0100 Subject: [Biopython-dev] Fwd: [Biopython] skipping a bad record read in SeqIO In-Reply-To: References: <320fb6e00906070452u259f9f3eg4aaed8a4ab673ec4@mail.gmail.com> <320fb6e00906071310w20b3ac61n5238093b399fd85a@mail.gmail.com> <320fb6e00906071429u6b1a202di7a32070ec939c267@mail.gmail.com> Message-ID: <320fb6e00906071440o53bfe8cdh5ac0695ed8e03524@mail.gmail.com> Note - Iddo emailed me off list accidentally, and then forwarded my reply... Peter wrote (forward by Iddo): > On 6/7/09, Iddo Friedberg wrote: > > On Sun, Jun 7, 2009 at 1:10 PM, Peter > > wrote: > > > > > > Could you report a bug with this particular GenBank file (or at > > > least, the entry). I think Biopython should try and cope with all > > > valid GenBank files. 
> > > > > > It has been a long time since I personally found a GenBank file > > > Biopython couldn't parse - the only cases I can remember recently > > > from the mailing list have been invalid files from 3rd party scripts > > > or tools. Or, the CONTIG line problem (Bug 2745) which I'd forgotten about until Iddo's follow up email with the stack trace (I personally don't use that type of GenBank file). These are valid GenBank files from the NCBI that we should be able to parse. Peter From bugzilla-daemon at portal.open-bio.org Mon Jun 8 07:12:07 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 8 Jun 2009 07:12:07 -0400 Subject: [Biopython-dev] [Bug 2851] Psycopg version 1 support for BioSQL In-Reply-To: Message-ID: <200906081112.n58BC7qQ017036@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2851 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-08 07:12 EST ------- Thanks for solving this - patch checked in but with global POSTGRES_RULES_PRESENT renamed to _POSTGRES_RULES_PRESENT (private variable). Marking as fixed. Do you think we should deprecate Biopython support for psycopg version one? Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
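A rough sketch of the pre-parser idea Peter describes in the "skipping a bad record read in SeqIO" thread: split a multi-record GenBank file at the "//" terminator lines, wrap each chunk in a StringIO handle, and parse it inside a try/except, issuing a warning through the warnings module for anything that fails. This is an illustration under stated assumptions, not code from the thread. The names split_genbank_records and parse_skipping_bad are hypothetical, and the per-record parser is passed in as a callable so the example does not depend on any particular Biopython version.

```python
import warnings
from io import StringIO


def split_genbank_records(handle):
    """Yield the text of each record, ending at its '//' terminator line.

    Hypothetical helper: it relies only on the GenBank convention that a
    line consisting of '//' closes a record.
    """
    lines = []
    for line in handle:
        lines.append(line)
        if line.rstrip() == "//":
            yield "".join(lines)
            lines = []
    # Keep any trailing text that lacks a terminator, so nothing is lost silently.
    if "".join(lines).strip():
        yield "".join(lines)


def parse_skipping_bad(handle, parse_one):
    """Yield parsed records, warning about (and skipping) any that fail.

    parse_one is an assumed callable taking a file-like handle that holds
    exactly one record and returning the parsed object.
    """
    for number, text in enumerate(split_genbank_records(handle), start=1):
        try:
            record = parse_one(StringIO(text))
        except Exception as err:
            # Report the serial number of the bad record, then carry on.
            warnings.warn("Skipping record %d: %s" % (number, err))
            continue
        yield record
```

With Biopython installed, parse_one could be something like `lambda handle: SeqIO.read(handle, "genbank")`. Because the skips are reported with the warnings module, calling `warnings.simplefilter("error")` beforehand turns them back into exceptions, which keeps strict parsing available as a choice.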
From bugzilla-daemon at portal.open-bio.org Mon Jun 8 07:38:15 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 8 Jun 2009 07:38:15 -0400 Subject: [Biopython-dev] [Bug 2851] Psycopg version 1 support for BioSQL In-Reply-To: Message-ID: <200906081138.n58BcFB3018676@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2851 ------- Comment #3 from cymon.cox at gmail.com 2009-06-08 07:38 EST ------- (In reply to comment #2) > Do you think we should deprecate Biopython support for psycopg version one? Yes, I'd deprecate it - it's no longer actively developed. Anyone wanting to use Psycopg would surely choose version 2 (version 1 was a pain to build anyway). C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Mon Jun 8 09:00:37 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 8 Jun 2009 14:00:37 +0100 Subject: [Biopython-dev] AUTHORS file on github Message-ID: <320fb6e00906080600r1d35a7e5j7e069ca42f77d1b8@mail.gmail.com> Hi Bartek, I was just looking at github and noticed the AUTHORS file is present: http://github.com/biopython/biopython/tree/master This was deleted in CVS seven years ago (well, renamed to CONTRIB). Is this some subtle side effect of the tag changes?
Peter From barwil at gmail.com Mon Jun 8 14:05:15 2009 From: barwil at gmail.com (Bartek Wilczynski) Date: Mon, 8 Jun 2009 20:05:15 +0200 Subject: [Biopython-dev] AUTHORS file on github In-Reply-To: <320fb6e00906080600r1d35a7e5j7e069ca42f77d1b8@mail.gmail.com> References: <320fb6e00906080600r1d35a7e5j7e069ca42f77d1b8@mail.gmail.com> Message-ID: <8b34ec180906081105oa232f6ak25ef2c8a2cf69ef7@mail.gmail.com> Hi, On Mon, Jun 8, 2009 at 3:00 PM, Peter wrote: > Hi Bartek, > > I was just looking at github and noticed the AUTHORS file is present: > http://github.com/biopython/biopython/tree/master > > This was deleted in CVS seven years ago (well, renamed to CONTRIB). > I'm attending a workshop now, so my web access is limited, but I'll look into that. > Is this some subtle side effect of the tag changes? I remember that I solved the issue with those removed files at the beginning of the transition, so I would guess you are right. cheers Bartek From biopython at maubp.freeserve.co.uk Tue Jun 9 11:31:32 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 9 Jun 2009 16:31:32 +0100 Subject: [Biopython-dev] Installation documentation In-Reply-To: <20090428124119.GV34546@sobchak.mgh.harvard.edu> References: <320fb6e00904270859q23a9abd8rfb21b10ebee30a77@mail.gmail.com> <20090428124119.GV34546@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00906090831x1614fc94m18b0c3e9851272e5@mail.gmail.com> On Tue, Apr 28, 2009 at 1:41 PM, Brad Chapman wrote: > Hi Peter; > >> I've made some updates to Installation.tex, which I think are an >> improvement over the version shipped with Biopython 1.50 and >> currently online. I think we could update these files now: >> >> http://biopython.org/DIST/docs/install/Installation.html >> http://biopython.org/DIST/docs/install/Installation.pdf >> >> Does that seem sensible? Before that, would anyone like to proofread >> the text in CVS, or make further updates? For example, are the bits >> on FreeBSD, Fink and RPMs still valid?
> > The FreeBSD port is out of date now, so I commented that section out > and replaced it with a section on using easy_install... I've just updated Installation.tex to take into account the Biopython 1.50 changes (no Martel, no mxTextTools - better late than never), and put the new HTML and PDF files online. This includes Brad's changes. Peter From bugzilla-daemon at portal.open-bio.org Tue Jun 9 12:02:21 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 9 Jun 2009 12:02:21 -0400 Subject: [Biopython-dev] [Bug 2853] New: Support the "in" keyword with Seq objects / define __contains__ method Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2853
Summary: Support the "in" keyword with Seq objects / define __contains__ method
Product: Biopython
Version: Not Applicable
Platform: PC
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk

Currently the "in" keyword isn't properly supported in the Seq object, meaning instead of this:

>>> if "TAG" in my_seq:
...     print "Found TAG"

you have to do something else like this (using the find method added on Bug 2809):

>>> if my_seq.find("TAG") >= 0:
...     print "Found TAG"

In dealing with Bug 2809 we already have a policy in place for dealing with the alphabet issues, so the code to do this is very simple. Patch to follow.

Because we don't define __contains__ yet, when someone uses "in" at the moment Python falls back to indexing the sequence one letter at a time via our __getitem__ method, which means "in" returns True when used for a single letter (as a string) that is in the sequence, and False otherwise (e.g. a multi-letter string, or a Seq object). i.e.
currently:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna, generic_rna, generic_protein
>>> my_dna = Seq("AAGTGCTAATAGAAAAA", generic_dna)
>>> "N" in my_dna  # works
False
>>> "A" in my_dna  # works
True
>>> Seq("A") in my_dna  # I think this is broken, should be True
False
>>> "TAG" in my_dna  # I think this is broken, should be True
False
>>> "TAG" in my_dna.tostring()
True
>>> "TAG" in str(my_dna)
True

-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jun 9 12:03:55 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 9 Jun 2009 12:03:55 -0400 Subject: [Biopython-dev] [Bug 2853] Support the "in" keyword with Seq objects / define __contains__ method In-Reply-To: Message-ID: <200906091603.n59G3t0j013999@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2853 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-09 12:03 EST ------- Created an attachment (id=1323) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1323&action=view) Add __contains__ to Seq object This includes a doctest -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jun 9 12:05:27 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 9 Jun 2009 12:05:27 -0400 Subject: [Biopython-dev] [Bug 2351] Make Seq more like a string, even subclass string?
In-Reply-To: Message-ID: <200906091605.n59G5RkF014140@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2351 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- BugsThisDependsOn| |2853 ------- Comment #17 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-09 12:05 EST ------- This bug also depends on: Bug 2853 - Support the "in" keyword with Seq objects / define __contains__ method -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jun 9 12:05:41 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 9 Jun 2009 12:05:41 -0400 Subject: [Biopython-dev] [Bug 2853] Support the "in" keyword with Seq objects / define __contains__ method In-Reply-To: Message-ID: <200906091605.n59G5fWA014158@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2853 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- OtherBugsDependingO| |2351 nThis| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
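A small sketch of the behaviour behind Bug 2853: when a class defines __getitem__ but neither __contains__ nor __iter__, Python evaluates "x in obj" by fetching obj[0], obj[1], ... and comparing each item with x, so only single-letter matches can succeed. The toy classes below are hypothetical stand-ins written for this illustration. They are not the Seq class or the attached patch, and they ignore the alphabet checks mentioned in the bug report.

```python
class LetterSeq:
    """Toy sequence with __getitem__ only (hypothetical, not Bio.Seq.Seq)."""

    def __init__(self, data):
        self._data = data

    def __getitem__(self, index):
        # Raises IndexError past the end, which stops the fallback loop.
        return self._data[index]

    def __str__(self):
        return self._data


class SearchableSeq(LetterSeq):
    """Same toy sequence, plus an explicit substring-style __contains__."""

    def __contains__(self, item):
        # Accept plain strings or other toy sequences by comparing as text.
        return str(item) in self._data


plain = LetterSeq("AAGTGCTAATAGAAAAA")
fixed = SearchableSeq("AAGTGCTAATAGAAAAA")

single_letter_plain = "A" in plain      # letter-by-letter fallback finds "A"
multi_letter_plain = "TAG" in plain     # no single item equals "TAG"
multi_letter_fixed = "TAG" in fixed     # substring search via __contains__
object_in_fixed = SearchableSeq("A") in fixed
```

Here single_letter_plain is True and multi_letter_plain is False, mirroring the session in the bug report, while both lookups on the class with __contains__ are True.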
From dalloliogm at gmail.com Tue Jun 9 12:09:38 2009 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Tue, 9 Jun 2009 18:09:38 +0200 Subject: [Biopython-dev] Installation documentation In-Reply-To: <320fb6e00904280640l58f0098fs467d90646a039e29@mail.gmail.com> References: <320fb6e00904270859q23a9abd8rfb21b10ebee30a77@mail.gmail.com> <20090428124119.GV34546@sobchak.mgh.harvard.edu> <320fb6e00904280640l58f0098fs467d90646a039e29@mail.gmail.com> Message-ID: <5aa3b3570906090909s3174f7a8r225a95a48103f251@mail.gmail.com> On Tue, Apr 28, 2009 at 3:40 PM, Peter wrote: > On Tue, Apr 28, 2009 at 1:41 PM, Brad Chapman wrote: > > Well, easy_install isn't (yet) an official python standard so I hadn't > previously worried about it - our wiki Downloads page does mention it. > Frankly the less "official" ways the are to install, the less ways it > can go wrong, and then the less questions need to be asked when it > goes wrong. If I can say mine, pypi and easy_install are very cool! :-) The biopython package on pypi works very well and it is the quickest way to get the latest version of biopython. It is more reliable than the packages in the repositories of many linux distro (some of them are outdated), and with respect to the manual installation, it makes it a lot easier to update biopython and to install all the dependencies. Nor had I worried about how PyPi's listing might need to be updated. > I assumed it was clever enough to scan the http://biopython.org/DIST/ > directory and parse the filenames. Is the real answer you (Brad) kept > it up to date? > http://pypi.python.org/pypi/biopython/ > I saw that some packages, when installed with easy_install, are downloaded from their own project home pages. For example, when you do easy_install numpy, it downloads the egg code from sourceforge. So maybe there is a way to automatically update packages to pypi, but I don't know it.. 
> > > Peter, if you have an account on pypi, let me know your login and I > > can add you as an owner for Biopython. > > I don't have an account on pypi. > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > -- Giovanni Dall'Olio, phd student Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain) My blog on bioinformatics: http://bioinfoblog.it From bugzilla-daemon at portal.open-bio.org Tue Jun 9 12:20:46 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 9 Jun 2009 12:20:46 -0400 Subject: [Biopython-dev] [Bug 2294] Writing GenBank files with Bio.SeqIO In-Reply-To: Message-ID: <200906091620.n59GKkhQ015142@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2294 ------- Comment #16 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-09 12:20 EST ------- The code in CVS should now be writing GenBank files with features properly, and has been tested on complex fuzzy joins and even mixed strand features. Getting feature locations to work properly has been more work than I had expected! (In reply to comment #14) > There is still plenty to do: > * Full testing, both manual and with extended unit test coverage Having a 3rd party test the current code would be very helpful - I may have missed things, as different people will use the code in different ways. > * Wrapping long feature locations Done. > * Writing references Not done yet, but for my personal needs this is low priority. > * Extending to cover writing EMBL files Not done yet, but should be comparatively straightforward. Let's track this possible enhancement on a separate bug. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Tue Jun 9 12:43:01 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 9 Jun 2009 17:43:01 +0100 Subject: [Biopython-dev] Installation documentation In-Reply-To: <5aa3b3570906090909s3174f7a8r225a95a48103f251@mail.gmail.com> References: <320fb6e00904270859q23a9abd8rfb21b10ebee30a77@mail.gmail.com> <20090428124119.GV34546@sobchak.mgh.harvard.edu> <320fb6e00904280640l58f0098fs467d90646a039e29@mail.gmail.com> <5aa3b3570906090909s3174f7a8r225a95a48103f251@mail.gmail.com> Message-ID: <320fb6e00906090943y5b9dfe09wb438033c4400668b@mail.gmail.com> Giovanni Marco Dall'Olio wrote: > >Peter wrote: >> Well, easy_install isn't (yet) an official python standard so I hadn't >> previously worried about it - our wiki Downloads page does mention it. >> Frankly the less "official" ways the are to install, the less ways it >> can go wrong, and then the less questions need to be asked when it >> goes wrong. > > If I can say mine, pypi and easy_install are very cool! :-) > > The biopython package on pypi works very well and it is the quickest > way to get the latest version of biopython. > It is more reliable than the packages in the repositories of many linux > distro (some of them are outdated), and with respect to the manual > installation, it makes it a lot easier to update biopython and to install > all the dependencies. Using the packages from your Linux distribution is probably the easiest and most reliable way to get Biopython on Linux - but these are inevitably a little out of date most of the time. If it works, then yes, easy_install / pypi is nice and easy to use. As long as Brad (or someone) is happy to support this, that's fine with me. However, easy_install isn't perfect. If you browse the NumPy/SciPy mailing lists you'll see plenty of issues with easy_install - they have problems with CPU specific optimised builds and so on which are rather complicated to deal with. 
This is relevant because we would need easy_install to handle NumPy for us. I can certainly see the appeal of easy_install where a tool has lots of dependencies you would otherwise have to manually install. [If you've never tried, install BioPerl from CPAN and try and count how many other perl libraries it depends on - quite an eye opener!] This isn't really the case for Biopython, all we really need are Python and NumPy (and even that can be skipped if you don't want to use Bio.PDB, Bio.Cluster and a few other bits). > I saw that some packages, when installed with easy_install, are > downloaded from their own project home pages. For example, > when you do easy_install numpy, it downloads the egg code from > sourceforge. ... Using easy_install for Biopython should download from biopython.org Peter From bugzilla-daemon at portal.open-bio.org Tue Jun 9 13:19:29 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 9 Jun 2009 13:19:29 -0400 Subject: [Biopython-dev] [Bug 2853] Support the "in" keyword with Seq objects / define __contains__ method In-Reply-To: Message-ID: <200906091719.n59HJTDK019786@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2853 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-09 13:19 EST ------- In fact, given the way the SeqRecord __getitem__ works it might be worth adding a similar __contains__ method to the SeqRecord as well... -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From hlapp at gmx.net Tue Jun 9 18:23:53 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 9 Jun 2009 18:23:53 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: <7265d4f0906030752l64e48498v92fd18a80d20658c@mail.gmail.com> References: <200906021700.n52H0uZb029220@portal.open-bio.org> <7265d4f0906021239n2c9b71f9l3a69d362bbd66183@mail.gmail.com> <7265d4f0906021329y6b6ca58dv925b7741d1e827d5@mail.gmail.com> <320fb6e00906030554p6daa59a3u54bfaade81c07d5f@mail.gmail.com> <7265d4f0906030752l64e48498v92fd18a80d20658c@mail.gmail.com> Message-ID: I actually don't think it mimics BioPerl. The recommended practice should be that if you don't have a value for an optional attribute, leave it undefined ... On Jun 3, 2009, at 10:52 AM, Cymon Cox wrote: > 2009/6/3 Peter > >> On Tue, Jun 2, 2009 at 9:29 PM, Cymon Cox wrote: >>> >>> Whoa, I see now that in Loader._load_bioentry_table that if the >>> rec.annotations["gi"] is missing, it gets filled with the >> accession.version: >>> >>> if "gi" in record.annotations : >>> identifier = record.annotations["gi"] >>> else : >>> identifier = record.id >>> >>> So biopythons BioSQL identifiers are not equivalent to GenBank >> identifiers. >>> I wonder why this is done and identifier is not just left NULL, >>> and the >>> unique constraint maintained by accession/version... >>> >> >> Remember, it isn't just GenBank files that get imported into BioSQL. >> While the record.id is the accession.version when loading a GenBank >> file, this is not the case in general. >> >> Consulting the CVS log, this was changed BioSQL/Loader/py revision >> 1.33 to cope with loading a FASTA file into a BioSQL database (Bug >> 2425). Presumably I was trying to mimic the BioPerl loading of FASTA >> files. Before this change, the bioentry.identifier was taken as the >> GI >> number if available. >> >> i.e. This change wasn't anything directly to do with the uniqueness >> rules. > > > Thanks Peter. 
> > Yes, it seems to have been done to mimic BioPerl - but I'm still > curious as > to why it is done at all... > > Anyway, I seem to be chasing my tail here: > http://bugzilla.open-bio.org/show_bug.cgi?id=2681#c5 > > Cheers, C. > -- > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Tue Jun 9 18:57:31 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 9 Jun 2009 23:57:31 +0100 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: References: <200906021700.n52H0uZb029220@portal.open-bio.org> <7265d4f0906021239n2c9b71f9l3a69d362bbd66183@mail.gmail.com> <7265d4f0906021329y6b6ca58dv925b7741d1e827d5@mail.gmail.com> <320fb6e00906030554p6daa59a3u54bfaade81c07d5f@mail.gmail.com> <7265d4f0906030752l64e48498v92fd18a80d20658c@mail.gmail.com> Message-ID: <320fb6e00906091557v2b0ac4c0y396947dac22d72e1@mail.gmail.com> On 6/9/09, Hilmar Lapp wrote: > > I actually don't think it mimics BioPerl. The recommended practice > should be that if you don't have a value for an optional attribute, > leave it undefined ... I presume you are talking about the bioentry.identifier field, and what if anything should be recorded there (e.g. the NCBI GI number from a GenBank file). I'll have to refresh my memory on how BioPerl stores arbitrary FASTA files in BioSQL where you don't have an NCBI accession & version, or an NCBI gi number - just some identifier string. You're not saying in BioSQL bioentry.identifier should be for an NCBI GI number *only*, are you?
Peter From hlapp at gmx.net Tue Jun 9 19:07:55 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 9 Jun 2009 19:07:55 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: <320fb6e00906091557v2b0ac4c0y396947dac22d72e1@mail.gmail.com> References: <200906021700.n52H0uZb029220@portal.open-bio.org> <7265d4f0906021239n2c9b71f9l3a69d362bbd66183@mail.gmail.com> <7265d4f0906021329y6b6ca58dv925b7741d1e827d5@mail.gmail.com> <320fb6e00906030554p6daa59a3u54bfaade81c07d5f@mail.gmail.com> <7265d4f0906030752l64e48498v92fd18a80d20658c@mail.gmail.com> <320fb6e00906091557v2b0ac4c0y396947dac22d72e1@mail.gmail.com> Message-ID: On Jun 9, 2009, at 6:57 PM, Peter wrote: > You're not saying in BioSQL bioentry.identifier should for an NCBI > GI number *only*, are you? No, absolutely not. It is the "internal database identifier" from where the record came from, if that database assigns - and publishes - such identifiers. For example, it might be the primary key in some database. Just keep in mind that accession is required, whereas identifier is not, and they are not synonymous. So if you only have one identifier for a record, unless you know that it's the GI# and what you have is a GenBank record, the identifier would likely be called the accession, and the identifier column would remain null. 
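Read as code, the rule Hilmar describes above might look like the following minimal sketch. The function name `bioentry_keys` and its arguments are hypothetical and are not part of the Biopython BioSQL loader; the only thing taken from the thread is the rule itself: the accession is always filled in, while the identifier is filled in only when a GI number is known and otherwise stays NULL (`None` here).

```python
def bioentry_keys(record_id, annotations):
    """Return (accession, identifier) for a hypothetical bioentry row.

    Assumption: record_id is the only name we have for the record, so it
    is used as the accession. The identifier is set only when the source
    database published one (here, an NCBI GI number); otherwise it is
    left as None, i.e. NULL in the database.
    """
    accession = record_id
    identifier = annotations.get("gi")  # None when no GI is present
    return accession, identifier


# A GenBank-style record with a GI number keeps both values.
print(bioentry_keys("AB123456.1", {"gi": "12345"}))
# A FASTA-style record with a single name leaves the identifier empty.
print(bioentry_keys("my_sequence", {}))
```

This differs from the loader snippet quoted earlier in the thread, which falls back to `record.id` when no GI number is present.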
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Wed Jun 10 05:11:46 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 10 Jun 2009 10:11:46 +0100 Subject: [Biopython-dev] Installation documentation In-Reply-To: <320fb6e00906090943y5b9dfe09wb438033c4400668b@mail.gmail.com> References: <320fb6e00904270859q23a9abd8rfb21b10ebee30a77@mail.gmail.com> <20090428124119.GV34546@sobchak.mgh.harvard.edu> <320fb6e00904280640l58f0098fs467d90646a039e29@mail.gmail.com> <5aa3b3570906090909s3174f7a8r225a95a48103f251@mail.gmail.com> <320fb6e00906090943y5b9dfe09wb438033c4400668b@mail.gmail.com> Message-ID: <320fb6e00906100211g4bbe15evdd4c0d53e847cdb6@mail.gmail.com> On Tue, Jun 9, 2009 at 5:43 PM, Peter wrote: > Using the packages from your Linux distribution is probably the > easiest and most reliable way to get Biopython on Linux - but > these are inevitably a little out of date most of the time. > > If it works, then yes, easy_install / pypi is nice and easy to use. As > long as Brad (or someone) is happy to support this, that's fine with > me. (But to be clear, I still don't think it should be the recommended "official" way to install Biopython - just an option.) > However, easy_install isn't perfect. If you browse the NumPy/SciPy > mailing lists you'll see plenty of issues with easy_install - they have > problems with CPU specific optimised builds and so on which are > rather complicated to deal with. This is relevant because we would > need easy_install to handle NumPy for us. Also, easy_install doesn't work properly for ReportLab (one of our optional dependencies used only for Bio.Graphics, which includes GenomeDiagram). 
See for example: http://two.pairlist.net/pipermail/reportlab-users/2009-May/008253.html Peter From cy at cymon.org Wed Jun 10 12:03:28 2009 From: cy at cymon.org (Cymon Cox) Date: Wed, 10 Jun 2009 17:03:28 +0100 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: References: <200906021700.n52H0uZb029220@portal.open-bio.org> <7265d4f0906021239n2c9b71f9l3a69d362bbd66183@mail.gmail.com> <7265d4f0906021329y6b6ca58dv925b7741d1e827d5@mail.gmail.com> <320fb6e00906030554p6daa59a3u54bfaade81c07d5f@mail.gmail.com> <7265d4f0906030752l64e48498v92fd18a80d20658c@mail.gmail.com> <320fb6e00906091557v2b0ac4c0y396947dac22d72e1@mail.gmail.com> Message-ID: <7265d4f0906100903q3f2b75f5p2295174e45da3512@mail.gmail.com> 2009/6/10 Hilmar Lapp > > On Jun 9, 2009, at 6:57 PM, Peter wrote: > > You're not saying in BioSQL bioentry.identifier should for an NCBI GI >> number *only*, are you? >> > > > No, absolutely not. It is the "internal database identifier" from where the > record came from, if that database assigns - and publishes - such > identifiers. For example, it might be the primary key in some database. > > Just keep in mind that accession is required, whereas identifier is not, > and they are not synonymous. So if you only have one identifier for a > record, unless you know that it's the GI# and what you have is a GenBank > record, the identifier would likely be called the accession, and the > identifier column would remain null. Thanks Hilmar. If I'm interpreting you correctly, by implication, the only time a value that is not a NCBI GI number gets added to the bioentry.identifier, is when a database (other than NCBI) implements two unique 'identifiers' such that one would be assigned to the accession and one to the identifier fields. What are these databases? It would be useful to check that they are being dealt with correctly. 
In which case biopython should not be assigning bioentry.identifier to record.id when the record.annotations['gi'] is missing. Cheers, C. > > > -hilmar > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > -- ____________________________________________________________________ Cymon J. Cox Centro de Ciencias do Mar Faculdade de Ciencias do Mar e Ambiente (FCMA) Universidade do Algarve Campus de Gambelas 8005-139 Faro Portugal Phone: +0351 289800909 ext 7909 Fax: +0351 289800051 Email: cy at cymon.org, cymon at ualg.pt, cymon.cox at gmail.com HomePage : http://biology.duke.edu/bryology/cymon.html -8.63/-6.77 From bugzilla-daemon at portal.open-bio.org Wed Jun 10 17:51:48 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 10 Jun 2009 17:51:48 -0400 Subject: [Biopython-dev] [Bug 2783] Using alternative start codons in Bio.Seq translate method/function In-Reply-To: Message-ID: <200906102151.n5ALpmNA013225@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2783 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-10 17:51 EST ------- (In reply to comment #5) > > On Bug 2381, comment #51, Leighton wrote: > > In terms of nomenclature: > > > > The default behaviour of translate() as Peter proposed: read through > > in-frame and translate with the appropriate codon table - is fine in > > nearly all circumstances. Most other circumstances are covered by > > stopping at the first in-frame stop codon, which Peter has implemented, > > and is an option we all seem to agree on. 
> > > > Biologically-speaking, this behaviour is not always correct for CDS in > > prokaryotes, where alternative start codons may occur a significant > > minority of the time. These will be mistranslated if no provision is > > made for them. I think a useful biological sequence object should at > > least try to mimic actual biology, so we should provide an option to > > handle this. > > > > We should not assume that a sequence is a CDS unless it is specified by > > the user. It seems reasonable to me that the term 'cds' should occur in > > any such argument from the user. > > > > We have at least two options for how to proceed with a CDS: i) we can > > provide a strict CDS-type translation, which requires confirmation that > > the sequence is, in fact, a CDS; ii) we can provide a weak CDS-type > > translation, which only modifies the way the start codon is translated. > > In both cases, behaviour is specific to CDS, and so having 'cds' in the > > argument name *somewhere* seems obvious, and entirely reasonable. > > Leighton's option (ii) is start codon only modification. This is what > I implemented in the patch on comment 1 (attachment 1259 [details]). > We haven't agreed on a good name for this - which is partly why I went > back to revisit the alternative: > > Leighton's option (i) is strict CDS-type translation. As Leighton suggests, > having "cds" in the argument name here makes sense. ... After some reflection I have decided to check in code doing what Leighton called option (i), strict CDS-type translation (as provided in BioPerl via their "complete" argument). This code was based on the above patch (attachment 1298), but with the check for an extra in frame stop codon (which was missing but described in the docstrings). I also went with the shorter argument name, just "cds" rather than "complete_cds", but (until the next release) I am open to changing this new option name. 
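To make the idea of a strict CDS check concrete, here is a minimal sketch in plain Python. It is not the code that was checked in, and `check_strict_cds` is a hypothetical name; it only illustrates the conditions discussed in this bug: a whole number of codons, a valid start codon, a final stop codon, and no extra in-frame stop codon. The codon sets are assumed to be the start and stop codons of the standard NCBI table 1.

```python
# Start and stop codons of the standard genetic code (NCBI table 1).
START_CODONS = {"ATG", "CTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_strict_cds(seq, start_codons=START_CODONS, stop_codons=STOP_CODONS):
    """Split seq into codons, raising ValueError if it is not a complete CDS."""
    seq = seq.upper()
    if len(seq) % 3 != 0:
        raise ValueError("length is not a multiple of three")
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if len(codons) < 2:
        raise ValueError("too short to hold a start and a stop codon")
    if codons[0] not in start_codons:
        raise ValueError("first codon is not a start codon")
    if codons[-1] not in stop_codons:
        raise ValueError("final codon is not a stop codon")
    for codon in codons[1:-1]:
        if codon in stop_codons:
            raise ValueError("extra in-frame stop codon")
    return codons


print(check_strict_cds("TTGAAATAA"))
```

In a strict CDS translation the first codon would then be translated as methionine even when it is an alternative start codon such as TTG, which is the prokaryotic case Leighton raises.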
Please bring this up on the mailing list if you don't like "cds" or think it is unclear. Thanks.

Marking this bug as fixed.

From bugzilla-daemon at portal.open-bio.org Thu Jun 11 18:35:00 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 11 Jun 2009 18:35:00 -0400
Subject: [Biopython-dev] [Bug 2856] New: Duplicate positions for some restriction enzymes in some sequences
Message-ID: 

http://bugzilla.open-bio.org/show_bug.cgi?id=2856

           Summary: Duplicate positions for some restriction enzymes in some sequences
           Product: Biopython
           Version: 1.50
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: zdmytriv at lbl.gov

Returns 2 identical positions for EcoRI enzyme in this sequence:
gaattccggatgagcattcatcaggcgggcaagaatgtgaataaaggccgga

Run this script test.py:

from Bio import SeqIO
from Bio.Restriction import *
from Bio.Seq import Seq
from Bio.Alphabet.IUPAC import IUPACAmbiguousDNA

if __name__ == "__main__":
    sequence = "gaattccggatgagcattcatcaggcgggcaagaatgtgaataaaggccgga"
    seq = Seq(sequence, IUPACAmbiguousDNA())
    analysis = Analysis([EcoRI], seq, linear=False)
    results = analysis.full()

    for enzyme, positions in results.iteritems():
        if len(positions) == 0: continue

        print enzyme
        for position in positions:
            print position

# returns 2 items 2 and 2
From sohm at inaf.cnrs-gif.fr Fri Jun 12 14:53:07 2009
From: sohm at inaf.cnrs-gif.fr (Frédéric Sohm)
Date: Fri, 12 Jun 2009 20:53:07 +0200
Subject: [Biopython-dev] [Bug 2856] New: Duplicate positions for some restriction enzymes in some sequences
In-Reply-To: 
References: 
Message-ID: <4A32A413.9090100@inaf.cnrs-gif.fr>

Hi everyone,

OK, this is a small mistake in the way restriction objects handle the sequence when searching for sites that span the boundary of a circular sequence. The current code goes one base too far, so the beginning of the sequence is scanned twice and two sites are reported: one at the beginning and one at the end. After the index is corrected, the second site is reported at the same position as the first one (which incidentally is a good thing, since it proves the corrections are handled properly). The final result is a duplicated report for restriction sites starting at the very first base of a circular sequence.

Here is the patch:

======================================================================
--- biopython-1.50-old/Bio/Restriction/Restriction.py 2008-10-22 23:49:06.000000000 +0200
+++ biopython-1.50-new/Bio/Restriction/Restriction.py 2009-06-12 20:28:46.000000000 +0200
@@ -197,7 +197,7 @@
         if self.is_linear() :
             data = self.data
         else :
-            data = self.data + self.data[1:size+1]
+            data = self.data + self.data[1:size]
         return [(i.start(), i.group) for i in re.finditer(pattern, data)]
 
     def __getitem__(self, i) :
=======================================================================

I will try to upload it.
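The off-by-one described above can be reproduced outside Biopython with a small sketch. The helper `circular_sites` is hypothetical and uses plain 0-based Python strings, so its slices are not the same as those in the patch; it also assumes that `size` in the patch corresponds to the length of the recognition site. The point is only that extending a circular sequence by a full site length rescans the first site, while extending by one base less finds wrap-around sites without duplicates.

```python
import re


def circular_sites(seq, site, overshoot=False):
    """Return 0-based start positions of site in the circular sequence seq.

    With overshoot=True the sequence is extended by a full site length,
    which mimics the behaviour before the patch: a site starting at the
    first base is matched a second time in the extension.
    """
    extra = len(site) if overshoot else len(site) - 1
    data = seq + seq[:extra]
    # Map positions found in the extension back onto the circle.
    return [m.start() % len(seq) for m in re.finditer(site, data)]


print(circular_sites("GAATTCAAAA", "GAATTC", overshoot=True))  # duplicated hit
print(circular_sites("GAATTCAAAA", "GAATTC"))                  # single hit
print(circular_sites("TTCAAAAGAA", "GAATTC"))                  # wrap-around site
```

The third call shows why the extension is needed at all: the site starts three bases before the end and finishes at the start of the sequence.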
Best regards Fred bugzilla-daemon at portal.open-bio.org wrote: > http://bugzilla.open-bio.org/show_bug.cgi?id=2856 > > Summary: Duplicate positions for some restriction enzymes in some > sequences > Product: Biopython > Version: 1.50 > Platform: All > OS/Version: All > Status: NEW > Severity: normal > Priority: P2 > Component: Main Distribution > AssignedTo: biopython-dev at biopython.org > ReportedBy: zdmytriv at lbl.gov > > > Returns 2 identical positions for EcoRI enzyme in this sequence: > gaattccggatgagcattcatcaggcgggcaagaatgtgaataaaggccgga > > Run this script test.py: > from Bio import SeqIO > from Bio.Restriction import * > from Bio.Seq import Seq > from Bio.Alphabet.IUPAC import IUPACAmbiguousDNA > > if __name__ == "__main__": > sequence = "gaattccggatgagcattcatcaggcgggcaagaatgtgaataaaggccgga" > seq = Seq(sequence, IUPACAmbiguousDNA()) > analysis = Analysis([EcoRI], seq, linear=False) > results = analysis.full() > > for enzyme, positions in results.iteritems(): > if len(positions) == 0: continue > > print enzyme > for position in positions: > print position > > # returns 2 items 2 and 2 > > From cy at cymon.org Sun Jun 14 11:23:23 2009 From: cy at cymon.org (Cymon Cox) Date: Sun, 14 Jun 2009 16:23:23 +0100 Subject: [Biopython-dev] "Your XML file did not start with Folks, I've been using qblast recently, and got a lot of invalid replies from NCBI of this sort: Traceback (most recent call last): File "test_NCBI_qblast.py", line 71, in record = NCBIXML.read(handle) File "/home/cymon/git/github-master/Bio/Blast/NCBIXML.py", line 564, in read first = iterator.next() File "/home/cymon/git/github-master/Bio/Blast/NCBIXML.py", line 611, in parse % XML_START) ValueError: Your XML file did not start with gi|116660609|gb|EG558220.1|EG558220 CR02019H04 Leaf CR02 cDNA library Catharanthus roseus cDNA clone CR02019H04 5', mRNA 
sequence\nCTCCATTCCCTCTCTATTTTCAGTCTAATCAAATTAGAGCTTAAAAGAATGAGATTTTTAACAAATAAAA\nAAACATAGGGGAGATTTCATAAAAGTTATATTAGTGATTTGAAGAATATTTTAGTCTATTTTTTTTTTTT\nTCTTTTTTTGATGAAGAAAGGGTATATAAAATCAAGAATCTGGGGTGTTTGTGTTGACTTGGGTCGGGTG\nTGTATAATTCTTGATTTTTTCAGGTAGTTGAAAAGGTAGGGAGAAAAGTGGAGAAGCCTAAGCTGATATT\nGAAATTCATATGGATGGAAAAGAACATTGGTTTAGGATTGGATCAAAAAATAGGTGGACATGGAACTGTA\nCCACTACGTCCTTACTATTTTTGGCCGAGGAAAGATGCTTGGGAAGAACTTAAAACAGTTTTAGAAAGCA\nAGCCATGGATTTCTCAGAAGAAAATGATTATACTTCTTAATCAGGCAACTGATATTATCAATTTATGGCA\nGCAGAGTGGTGGCTCCTTGTCCCAGCAGCAGTAATTACTTTTTTTTCTCTTTTTGTTTCCAAATTAAGAA\nACATTAGTATCATATGGCTATTTGCTCAATTGCAGATTTCTTTCTTTTGTGAATG", ...) Status=WAITING Results == '\n\n': continuing... Results == '\n\n': continuing... Results == '\n\n': continuing... Done Anyone else seen this? Am I just unlucky enough to have a flaky internet connection? Cheers, C. -- From biopython at maubp.freeserve.co.uk Sun Jun 14 14:16:08 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 14 Jun 2009 19:16:08 +0100 Subject: [Biopython-dev] "Your XML file did not start with References: <7265d4f0906140823r9979362y7b1633447e13292f@mail.gmail.com> Message-ID: <320fb6e00906141116l4cf9a9d5u733497d3b02e1b6a@mail.gmail.com> On 6/14/09, Cymon Cox wrote: > Folks, > > I've been using qblast recently, and got a lot of invalid replies from NCBI > of this sort: > > Traceback (most recent call last): > ... > ValueError: Your XML file did not start with Which is true: NCBI is returning "\n\n". If you code around this and just > keep going the results eventually arrive: > ... At first glance, something based on your change looks sensible. Next time I spot the unit test failing I'll try and reproduce this. 
Peter From cy at cymon.org Sun Jun 14 14:26:56 2009 From: cy at cymon.org (Cymon Cox) Date: Sun, 14 Jun 2009 19:26:56 +0100 Subject: [Biopython-dev] "Your XML file did not start with References: <7265d4f0906140823r9979362y7b1633447e13292f@mail.gmail.com> <320fb6e00906141116l4cf9a9d5u733497d3b02e1b6a@mail.gmail.com> Message-ID: <7265d4f0906141126l2f00fecehaa28273af9b3681a@mail.gmail.com> 2009/6/14 Peter > On 6/14/09, Cymon Cox wrote: > > Folks, > > > > I've been using qblast recently, and got a lot of invalid replies from > NCBI > > of this sort: > > > > Traceback (most recent call last): > > ... > > ValueError: Your XML file did not start with > I've seen message that from the unit test sometimes, and assumed the > NCBI was returning a temporary HTML error page of some kind - > rerunning our test would normally work. Without checking the > traceback, I would guess this is the same issue you have found. > > > Which is true: NCBI is returning "\n\n". If you code around this and > just > > keep going the results eventually arrive: > > ... > > At first glance, something based on your change looks sensible. Next > time I spot the unit test failing I'll try and reproduce this. It's pretty hit or miss: I would guess once in every 10+ times I ran the test_NCBI_qblast I would encounter the problem. Cheers, C. 
-- From cy at cymon.org Mon Jun 15 05:02:47 2009 From: cy at cymon.org (Cymon Cox) Date: Mon, 15 Jun 2009 10:02:47 +0100 Subject: [Biopython-dev] "Your XML file did not start with References: <7265d4f0906140823r9979362y7b1633447e13292f@mail.gmail.com> <320fb6e00906141116l4cf9a9d5u733497d3b02e1b6a@mail.gmail.com> <7265d4f0906141126l2f00fecehaa28273af9b3681a@mail.gmail.com> Message-ID: <7265d4f0906150202j3daeefa9we304cf29c4f6cd6d@mail.gmail.com> 2009/6/14 Cymon Cox > 2009/6/14 Peter > >> On 6/14/09, Cymon Cox wrote: >> > Folks, >> > >> > I've been using qblast recently, and got a lot of invalid replies from >> NCBI >> > of this sort: >> > >> > Traceback (most recent call last): >> > ... >> > ValueError: Your XML file did not start with > >> I've seen message that from the unit test sometimes, and assumed the >> NCBI was returning a temporary HTML error page of some kind - >> rerunning our test would normally work. Without checking the >> traceback, I would guess this is the same issue you have found. >> >> > Which is true: NCBI is returning "\n\n". If you code around this and >> just >> > keep going the results eventually arrive: >> > ... >> >> At first glance, something based on your change looks sensible. Next >> time I spot the unit test failing I'll try and reproduce this. > > > It's pretty hit or miss: I would guess once in every 10+ times I ran the > test_NCBI_qblast I would encounter the problem. > I can be a little more specific; out 742 calls to qblast, 75 returned the "Your XML" error. (This was with a different ISP.) C. 
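As a stop-gap while the underlying cause is investigated, a caller can simply retry when the parser raises this `ValueError`. The sketch below is generic Python, not Cymon's fix and not part of Biopython; `call_with_retry` and the `fetch` callable are hypothetical names, and `fetch` stands in for whatever function runs the query and parses the reply.

```python
import time


def call_with_retry(fetch, attempts=3, delay=0.0):
    """Call fetch() until it succeeds or attempts run out.

    Only ValueError is retried, since that is the exception reported in
    this thread; any other exception propagates immediately.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return fetch()
        except ValueError as err:
            last_error = err
            time.sleep(delay)  # give the server a moment before retrying
    raise last_error


# Stand-in for a flaky query: fails twice, then succeeds.
calls = []


def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise ValueError("Your XML file did not start with ...")
    return "ok"


print(call_with_retry(flaky))
```

With the failure rate reported above (75 of 742 calls, about 10%), three attempts would all fail roughly 0.1% of the time, assuming the failures are independent, which has not been established here.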
--

From eric.talevich at gmail.com Mon Jun 15 13:04:20 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 15 Jun 2009 13:04:20 -0400
Subject: [Biopython-dev] GSoC Weekly Update: PhyloXML for Biopython
Message-ID: <3f6baf360906151004m4ee3595g7ac007fd218e370f@mail.gmail.com>

Hi all,

Previously (June 8-12) I:

* Finished writing constructors and XML parsers for Tier 0,1,2 elements (everything that appears in the example phyloXML files)
* Added to_seqrecord and from_seqrecord methods to the PhyloXML.Sequence class -- getting Bio.SeqRecord to stand in for PhyloXML.Sequence entirely will require some more thought
* Wrote a unit test for counting clades/branches (topology check)
* Changed the no-op unit tests to count the total number of tags (nodes) in the given phyloXML file, keeping stdout clean
* Miscellaneous code cleanup
* Added a few magic methods to make usage easier: __len__, __iter__, __str__

This week (June 15-19) I will:

* Finish unit tests for parsing and instantiating core elements
* Test and check parser performance versus BioPerl and Archaeopteryx loading times
* Document results of parser testing and performance (on wiki or here)
* Document basic usage of the parser on the Biopython wiki

Thoughts:

* Test-driven development kind of went out the window this week. Implementing each new class is pretty short now that I'm using the from_element class methods consistently, so I just charged ahead rather than write tests for each class first. I checked the more complicated classes in the REPL, but didn't copy that code into the test script... shameful. There are a couple of bugs I know of already, but haven't fixed. So catching up there will be the bulk of the effort this week.
* The unit tests I do have in place give some sense of memory and CPU usage. For the full NCBI taxonomy, memory usage climbs up above 2 GB with the read() function, which isn't a problem on this workstation but could be for others.
* For biopython-dev, a summary of the parsing strategy:

There are two top-level functions, read() and parse(), which behave according to convention. Both use ElementTree's iterparse() function to keep memory usage down (if used properly) and enable streaming data from other sources. The structure of the XML file looks like:

<phyloxml>
    <phylogeny>
        <clade>
            ... (recursive)
        </clade>
        ...
    </phylogeny>
    (can have several trees)
    (optional, arbitrary tags)
    ...
</phyloxml>

The read() function returns all of this as a single Python object, with two attributes: phylogenies[] and other[]. parse() ignores the "other" stuff and just iterates through the "phylogeny" trees, so it should be handy if you're not concerned with the extra arbitrary data that may appear after the trees.

I have two more functions for parsing phylogeny and clade objects that track the current context of the XML parser, and clear elements after they're completed. Then all other tags are dispatched to the corresponding classes, via from_element() methods attached to each class, or else built-in constructors for primitive types like int, float, str. The from_element() class methods take an ElementTree.Element object, deal with it, and pass any child nodes for complex types to the corresponding class's from_element() method. The only recursive element is Clade, which is treated specially, so there's nothing scary going on with the stack.

I'm open to suggestions for reorganizing this to make Nexus/Newick integration more feasible. Optimization strategies are also a good topic this week. A few weeks later in my project plan I'm also scheduled to implement the rest of the magic methods, so we should discuss the appropriate amount and types of magic to add, too -- the showcase for this right now is Tests/test_PhyloXML.
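The streaming approach summarized above can be illustrated with a short standalone sketch. This is not the Bio.PhyloXML parser: the function `count_clades_per_tree` and the sample document are made up for illustration, and the sample omits the XML namespace that real phyloXML files declare (with a namespace, the tag names would carry a `{...}` prefix). It shows only the iterparse() pattern of handling one phylogeny at a time and clearing it once it is complete.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in document: two trees, the first with nested clades.
SAMPLE = b"""<phyloxml>
  <phylogeny>
    <clade><clade/><clade/></clade>
  </phylogeny>
  <phylogeny>
    <clade/>
  </phylogeny>
</phyloxml>"""


def count_clades_per_tree(source):
    """Yield the number of clade elements in each phylogeny, one tree at a time."""
    count = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag == "clade":
            count += 1
        elif event == "end" and elem.tag == "phylogeny":
            yield count
            count = 0
            elem.clear()  # discard the finished tree to keep memory bounded


print(list(count_clades_per_tree(io.BytesIO(SAMPLE))))
```

Because the function is a generator, a caller that stops after the first tree never parses the rest of the file, which mirrors the difference between parse() and read() described above.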
Cheers,
Eric

http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML
https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML

From anaryin at gmail.com Tue Jun 16 04:00:18 2009
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 16 Jun 2009 10:00:18 +0200
Subject: [Biopython-dev] Biopython-dev Digest, Vol 77, Issue 12
In-Reply-To: 
References: 
Message-ID: 

I've been having that "Your XML" problem with qblast for a while now. But I thought it was my connection going nuts and I used just a retry if that happened...

Just to add to you two having that error :)

João [ .. ] Rodrigues
(Blog) http://doeidoei.wordpress.com
(MSN) always_asleep_ at hotmail.com
(Skype) rodrigues.jglm

From biopython at maubp.freeserve.co.uk Tue Jun 16 05:16:46 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 16 Jun 2009 10:16:46 +0100
Subject: [Biopython-dev] Your XML file did not start with

On Tue, Jun 16, 2009 at 9:00 AM, João Rodrigues wrote:
>
> I've been having that problem with the your XML for a while now, with
> qblast. But I thought it was my connection going nuts and I used just a
> retry if that happened...
>
> Just to add to you two having that error :)
>
> João [ .. ] Rodrigues

OK - looks like this is a reasonably common issue. Have you tried Cymon's fix yet? If not, are you happy to update your copy of Biopython to the latest code from CVS/github and test that? If so, I'll check in the fix and you can help test that...

Thanks,

Peter

P.S. If replying to the digest emails, please edit the subject line to match the topic you are replying to.
From bugzilla-daemon at portal.open-bio.org Tue Jun 16 05:22:54 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 16 Jun 2009 05:22:54 -0400 Subject: [Biopython-dev] [Bug 2780] PDB file HETATMs cannot be alternative location of a residue that is an ATOM In-Reply-To: Message-ID: <200906160922.n5G9MsNK023105@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2780 klaus.kopec at tuebingen.mpg.de changed: What |Removed |Added ---------------------------------------------------------------------------- Version|1.49 |1.50 ------- Comment #3 from klaus.kopec at tuebingen.mpg.de 2009-06-16 05:22 EST ------- bug still exists in v1.50
From bugzilla-daemon at portal.open-bio.org Tue Jun 16 05:25:20 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 16 Jun 2009 05:25:20 -0400 Subject: [Biopython-dev] [Bug 2781] Bio.PDB Structure instances cannot be deepcopied In-Reply-To: Message-ID: <200906160925.n5G9PKI3023354@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2781 klaus.kopec at tuebingen.mpg.de changed: What |Removed |Added ---------------------------------------------------------------------------- Version|1.49 |1.50 ------- Comment #1 from klaus.kopec at tuebingen.mpg.de 2009-06-16 05:25 EST ------- bug still exists in v1.50 (still with Python 2.6.1, but now Kubuntu 9.04 64-Bit)
From thomas.hamelryck at gmail.com Tue Jun 16 11:09:55 2009 From: thomas.hamelryck at gmail.com (Thomas Hamelryck) Date: Tue, 16 Jun 2009 17:09:55 +0200 Subject: [Biopython-dev] Biopython logo usage In-Reply-To: <2d7c25310906051105k36105d51had0593c2903a8464@mail.gmail.com> References: <9e2f512b0906041622p3e2320cdxc735ab6a596d2629@mail.gmail.com> <2d7c25310906051105k36105d51had0593c2903a8464@mail.gmail.com> Message-ID: <2d7c25310906160809h61a36e98w5615e1f0b5117d5f@mail.gmail.com> Hi all, I can now confirm that the logo is released in the public domain. I will give it a suitable license later this week. Cheers, /Thomas On Fri, Jun 5, 2009 at 8:05 PM, Thomas Hamelryck wrote: > > > On Fri, Jun 5, 2009 at 1:30 AM, Iddo Friedberg wrote: > >> Wow, congratulations! I am so buying this off my startup.... >> >> I think the question is best addressed to Thomas Hamelryck. IIRC, his >> friend >> designed the logo. There is no license I am aware of, but it is probably a >> good idea to put a cc-na license on it, like Tux has. >> > > Should be fine, but will ask to be sure.
> > Cheers, > > -Thomas > > -- Thomas Hamelryck Group leader Structural Bioinformatics Bioinformatics center Department of Biology University of Copenhagen Ole Maaloes Vej 5 DK-2200 Copenhagen N Denmark http://wiki.binf.ku.dk/User:Thomas_Hamelryck http://www.binf.ku.dk/research/structural_bioinformatics/ From jkhilmer at gmail.com Tue Jun 16 16:16:49 2009 From: jkhilmer at gmail.com (Jonathan Hilmer) Date: Tue, 16 Jun 2009 14:16:49 -0600 Subject: [Biopython-dev] Biopython logo usage In-Reply-To: <2d7c25310906160809h61a36e98w5615e1f0b5117d5f@mail.gmail.com> References: <9e2f512b0906041622p3e2320cdxc735ab6a596d2629@mail.gmail.com> <2d7c25310906051105k36105d51had0593c2903a8464@mail.gmail.com> <2d7c25310906160809h61a36e98w5615e1f0b5117d5f@mail.gmail.com> Message-ID: <81277ce10906161316x6caa73feve86de4d54dec7efd@mail.gmail.com> I greatly appreciate the effort people have contributed to Biopython in general and this logo particularly, but perhaps it would be better to have a logo that scales better, and is in a vector format? The python-DNA theme could be kept, but with simpler text and snakes to remain legible when used as a small logo or as part of a background. I'd be happy to create a vectorized interpretation if people are interested. Jonathan On Tue, Jun 16, 2009 at 9:09 AM, Thomas Hamelryck wrote: > Hi all, > > I can now confirm that the logo is released in the public domain. > I will give it a suitable license later this week. > > Cheers, > > /Thomas > > On Fri, Jun 5, 2009 at 8:05 PM, Thomas Hamelryck > wrote: > >> >> >> On Fri, Jun 5, 2009 at 1:30 AM, Iddo Friedberg wrote: >> >>> Wow, congratulations! I am so buying this off my startup.... >>> >>> I think the question is best addressed to Thomas Hamelryck. IIRC, his >>> friend >>> designed the logo. There is no license I am aware of, but it is probably a >>> good idea to put a cc-na license on it, like Tux has. >>> >> >> Should be fine, but will ask to be sure.
>> >> Cheers, >> >> -Thomas >> >> > > > -- > Thomas Hamelryck > Group leader Structural Bioinformatics > Bioinformatics center > Department of Biology > University of Copenhagen > Ole Maaloes Vej 5 > DK-2200 Copenhagen N > Denmark > http://wiki.binf.ku.dk/User:Thomas_Hamelryck > http://www.binf.ku.dk/research/structural_bioinformatics/ > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From biopython at maubp.freeserve.co.uk Tue Jun 16 17:25:27 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 16 Jun 2009 22:25:27 +0100 Subject: [Biopython-dev] Biopython logo usage In-Reply-To: <2d7c25310906160809h61a36e98w5615e1f0b5117d5f@mail.gmail.com> References: <9e2f512b0906041622p3e2320cdxc735ab6a596d2629@mail.gmail.com> <2d7c25310906051105k36105d51had0593c2903a8464@mail.gmail.com> <2d7c25310906160809h61a36e98w5615e1f0b5117d5f@mail.gmail.com> Message-ID: <320fb6e00906161425u16d6f59ey7c9daf2d3e22be94@mail.gmail.com> On Tue, Jun 16, 2009 at 4:09 PM, Thomas Hamelryck wrote: > > Hi all, > > I can now confirm that the logo is released in the public domain. > I will give it a suitable license later this week. > > Cheers, > > /Thomas Thanks Thomas, Did you remember to ask Henrik Vestergaard if he still had the original files? If there is something like an Adobe Illustrator or Photoshop file, that could be very helpful for generating a vector based version (e.g. PDF, SVG) suitable for big posters etc. Even any larger JPG files are worth having... [This would certainly be easier than Jonathan Hilmer trying to recreate a vector version from scratch - which might be worth trying] Thanks Peter P.S.
See also http://www.biopython.org/wiki/Logo From sbassi at clubdelarazon.org Tue Jun 16 17:30:38 2009 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Tue, 16 Jun 2009 18:30:38 -0300 Subject: [Biopython-dev] Biopython logo usage In-Reply-To: <2d7c25310906160809h61a36e98w5615e1f0b5117d5f@mail.gmail.com> References: <9e2f512b0906041622p3e2320cdxc735ab6a596d2629@mail.gmail.com> <2d7c25310906051105k36105d51had0593c2903a8464@mail.gmail.com> <2d7c25310906160809h61a36e98w5615e1f0b5117d5f@mail.gmail.com> Message-ID: <9e2f512b0906161430s3a57536xa283399a8a3e4c39@mail.gmail.com> On Tue, Jun 16, 2009 at 12:09 PM, Thomas Hamelryck wrote: > I can now confirm that the logo is released in the public domain. > I will give it a suitable license later this week. OK, I'll contact the cover designer to include a small logo in a corner if there is time. Thank you very much. Best, SB. From chapmanb at 50mail.com Wed Jun 17 08:41:01 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 17 Jun 2009 08:41:01 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <3f6baf360906151004m4ee3595g7ac007fd218e370f@mail.gmail.com> References: <3f6baf360906151004m4ee3595g7ac007fd218e370f@mail.gmail.com> Message-ID: <20090617124101.GH44321@sobchak.mgh.harvard.edu> Hi Eric; Nice update and thanks again for copying the Biopython development list on this. > * Added to_seqrecord and from_seqrecord methods to the PhyloXML.Sequence > class > -- getting Bio.SeqRecord to stand in for PhyloXML.Sequence entirely will > require some more thought I'm looking forward to seeing how you decide to go forward with this. For the work I do on a day to day basis, a continual struggle involves establishing relationships between things to retrieve more information. For instance, a pair of nodes on a tree is interesting -- how would I find papers, experiments and other information associated with those sequences? 
It seems like Accession and the ref attribute of Annotation help establish these relationships. > * Test-driven development kind of went out the window this week. Heh. It happens -- sounds sensible to have a clean up and documentation week this week; that will also help others who are interested dig into using it. > * The unit tests I do have in place give some sense of memory and CPU usage. > For the full NCBI taxonomy, memory usage climbs up above 2 GB with the > read() function, which isn't a problem on this workstation but could be for > others. Do you see an opportunity to offer iterating over clades instead of loading them all into memory for these larger trees? This would involve lazily loading subclades on request and would limit some functionality for querying the full tree without loading it all into memory. Another option is to offer some pruning ability as a tree is loading. For instance, if I am loading the whole NCBI taxonomy on a memory limited computer and only need the Angiosperm flowering plant part of the tree. In this case, you'd want to throw away all clades not under the clades of interest. These are probably fringe cases; just brainstorming some ideas. Thanks again, Brad From eric.talevich at gmail.com Wed Jun 17 19:17:41 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 17 Jun 2009 19:17:41 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <20090617124101.GH44321@sobchak.mgh.harvard.edu> References: <3f6baf360906151004m4ee3595g7ac007fd218e370f@mail.gmail.com> <20090617124101.GH44321@sobchak.mgh.harvard.edu> Message-ID: <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> Hi Brad, Here's a mid-week update and partial response to your questions. *SeqRecord transformation* It would be nice if I could round-trip this sequence information perfectly, so that nothing's lost between reading and writing an arbitrary, valid PhyloXML file. 
For that to work, PhyloXML.Sequence.from_seqrec() would need to look at SeqRecord.features and assume that any matching keys have the appropriate PhyloXML meaning. These are the keys that from_seqrec() would look for: location uri annotations domain_architecture Do you see any risk of collision for those names? And for serialization, would it be unwholesome to convert Annotation and DomainArchitecture objects to a GFF-style dict-in-a-string? e.g. annotation="ref=foo;source=bar;..." -- it's another layer of parsing and kind of esoteric, but I can live with it. *Profiling* Christian also suggested an option to parse just the phylogenies with a name or id matching a given string. I like that and I don't see any problem with extending it to clades as well. It seems like a reasonable use case to select a sub-tree from a complete phyloXML document and treat it as a separate phylogeny from then on. This can be supported by various methods for selecting portions of the tree, and a method on Clade for transforming the selection into a new Phylogeny instance (so the original can be safely deleted). I did some profiling with the cProfile module, and it looks like most of the time is being spent instantiating Clade and Taxonomy objects. (Also, pretty_print is hugely inefficient, but that's less important.) I think I can speed up parsing and reduce memory usage by pulling the from_element methods out of each class and using a separate Parser class to do that work. About the 2GB figure I gave earlier for the full NCBI taxonomy -- I was just looking at Ubuntu's system monitor, and Firefox and a few other things were running at the same time, taking up about 800MB already. So the full NCBI taxonomy actually takes up only 1.2GB or so, which isn't such a problem, and I think it will get smaller as I shrink down these PhyloXML classes. Questions: - Do you know of a better way to profile Python code, or visualize it? - Have you used __slots__ to optimize classes? Do you recommend it? 
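A minimal sketch of the kind of cProfile run described above, using only the standard library. The Clade class and build_tree function here are hypothetical stand-ins made up for illustration, not the actual PhyloXML classes; the __slots__ declaration is included only to show what that optimization looks like.

```python
import cProfile
import io
import pstats


class Clade(object):
    """Hypothetical stand-in for a tree node (not the real PhyloXML class)."""

    # __slots__ removes the per-instance __dict__, which can reduce memory
    # use when very many small objects are created.
    __slots__ = ("name", "children")

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def build_tree(depth, fanout=3):
    """Recursively build a toy tree so there is something to measure."""
    if depth == 0:
        return Clade("leaf")
    return Clade("node", [build_tree(depth - 1, fanout) for _ in range(fanout)])


def profile_build(depth=6):
    """Run build_tree under cProfile; return the tree and the report text."""
    profiler = cProfile.Profile()
    profiler.enable()
    tree = build_tree(depth)
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sort by cumulative time and show only the top few entries.
    stats.sort_stats("cumulative").print_stats(5)
    return tree, out.getvalue()


tree, report = profile_build()
print(report)
```

Sorting by cumulative time tends to surface the constructors that dominate a parse, which is the pattern reported for Clade and Taxonomy; whether __slots__ helps in the real classes would need to be measured.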
And a few that don't fit anywhere else: - What sort of whole-tree operations would you want to do with these objects that you can't do with a Nexus or Newick tree? What other formats would you want to convert to? I'm thinking of adding an Export module later if there's time, for lossy conversions like a graph for networkx. - What's the most intuitive way to display a phylogenetic tree you've loaded into Biopython? Serialize as Nexus and open in TreeViewX? Convert to a graph and send to matplotlib? Or, is there a module in Bio.Graphics that can draw trees? (If not, should there be?) Thanks, Eric On Wed, Jun 17, 2009 at 8:41 AM, Brad Chapman wrote: > Hi Eric; > Nice update and thanks again for copying the Biopython development > list on this. > > > * Added to_seqrecord and from_seqrecord methods to the PhyloXML.Sequence > > class > > -- getting Bio.SeqRecord to stand in for PhyloXML.Sequence entirely > will > > require some more thought > > I'm looking forward to seeing how you decide to go forward with > this. For the work I do on a day to day basis, a continual > struggle involves establishing relationships between things to > retrieve more information. For instance, a pair of nodes on a tree > is interesting -- how would I find papers, experiments and other > information associated with those sequences? It seems like Accession > and the ref attribute of Annotation help establish these > relationships. > > > * Test-driven development kind of went out the window this week. > > Heh. It happens -- sounds sensible to have a clean up and > documentation week this week; that will also help others who are > interested dig into using it. > > > * The unit tests I do have in place give some sense of memory and CPU > usage. > > For the full NCBI taxonomy, memory usage climbs up above 2 GB with the > > read() function, which isn't a problem on this workstation but could > be for > > others. 
> > Do you see an opportunity to offer iterating over clades instead of > loading them all into memory for these larger trees? This would > involve lazily loading subclades on request and would limit some > functionality for querying the full tree without loading it all into > memory. > > Another option is to offer some pruning ability as a tree is > loading. For instance, if I am loading the whole NCBI taxonomy on a > memory limited computer and only need the Angiosperm flowering plant > part of the tree. In this case, you'd want to throw away all clades > not under the clades of interest. > > These are probably fringe cases; just brainstorming some ideas. > > Thanks again, > Brad > From biopython at maubp.freeserve.co.uk Thu Jun 18 05:35:24 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jun 2009 10:35:24 +0100 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> References: <3f6baf360906151004m4ee3595g7ac007fd218e370f@mail.gmail.com> <20090617124101.GH44321@sobchak.mgh.harvard.edu> <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> Message-ID: <320fb6e00906180235k236a5538pf47ff0abae669180@mail.gmail.com> On Thu, Jun 18, 2009 at 12:17 AM, Eric Talevich wrote: > Hi Brad, > > Here's a mid-week update and partial response to your questions. > > *SeqRecord transformation* > > It would be nice if I could round-trip this sequence information perfectly, > so that nothing's lost between reading and writing an arbitrary, valid > PhyloXML file. For that to work, PhyloXML.Sequence.from_seqrec() > would need to look at SeqRecord.features and assume that any matching > keys have the appropriate PhyloXML meaning. > > These are the keys that from_seqrec() would look for: > location > uri > annotations > domain_architecture > > Do you see any risk of collision for those names?
And for serialization, > would it be unwholesome to convert Annotation and DomainArchitecture objects > to a GFF-style dict-in-a-string? e.g. annotation="ref=foo;source=bar;..." -- > it's another layer of parsing and kind of esoteric, but I can live with it. If you can show us a sample record, I would be better able to comment on how I would store it in a SeqRecord. Are you fully familiar with the SeqRecord object, its annotations dictionary, and the list of SeqFeature objects (which have locations relative to the parent SeqRecord) which all have their own annotations dictionary (although under the name of qualifiers for some reason). Perhaps you'd like to proof read the new SeqRecord chapter in the tutorial - it is still a work in progress, but should be informative. Peter From chapmanb at 50mail.com Thu Jun 18 08:52:40 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 18 Jun 2009 08:52:40 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> References: <3f6baf360906151004m4ee3595g7ac007fd218e370f@mail.gmail.com> <20090617124101.GH44321@sobchak.mgh.harvard.edu> <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> Message-ID: <20090618125240.GP44321@sobchak.mgh.harvard.edu> Hi Eric; Nice -- thanks much for the summary. > *SeqRecord transformation* > > It would be nice if I could round-trip this sequence information perfectly, > so > that nothing's lost between reading and writing an arbitrary, valid PhyloXML > file. For that to work, PhyloXML.Sequence.from_seqrec() would need to look > at > SeqRecord.features and assume that any matching keys have the appropriate > PhyloXML meaning. > > These are the keys that from_seqrec() would look for: > location > uri > annotations > domain_architecture > > Do you see any risk of collision for those names? 
And for serialization, > would it be unwholesome to convert Annotation and DomainArchitecture objects > to a GFF-style dict-in-a-string? e.g. annotation="ref=foo;source=bar;..." -- > it's another layer of parsing and kind of esoteric, but I can live with it. SeqRecords have two places to store information related to the sequence: annotations -- key/value pairs describing the entire sequence, implemented as a dictionary with lists as values. features -- items with a location that refer to part of the sequence, which can have key/value pairs, here called qualifiers. My sense is that much of the PhyloXML markup will fit into annotations. For instance, your annotation string should really be part of the annotation dictionary: {"ref" : ["foo"], "source" : ["bar"] } as opposed to a string that requires deserializing. The easiest way to discuss this is to take a few real life cases and see how they can fit, as Peter suggested. People here familiar with using SeqRecords can hopefully come to a consensus as the best place to store different items. > *Profiling* > > Christian also suggested an option to parse just the phylogenies with a > name or id matching a given string. I like that and I don't see any problem > with extending it to clades as well. It seems like a reasonable use case to > select a sub-tree from a complete phyloXML document and treat it as a > separate > phylogeny from then on. This can be supported by various methods for > selecting > portions of the tree, and a method on Clade for transforming the selection > into > a new Phylogeny instance (so the original can be safely deleted). [...] > About the 2GB figure I gave earlier for the full NCBI taxonomy -- I was just > looking at Ubuntu's system monitor, and Firefox and a few other things were > running at the same time, taking up about 800MB already. 
So the full NCBI > taxonomy actually takes up only 1.2GB or so, which isn't such a problem, and > I > think it will get smaller as I shrink down these PhyloXML classes. Sounds great. I think you'll be fine with that memory usage and the ability to select subsets based on an identifier. > I did some profiling with the cProfile module, and it looks like most of the > time is being spent instantiating Clade and Taxonomy objects. (Also, > pretty_print is hugely inefficient, but that's less important.) I think I > can > speed up parsing and reduce memory usage by pulling the from_element methods > out of each class and using a separate Parser class to do that work. > > Questions: > - Do you know of a better way to profile Python code, or visualize it? > - Have you used __slots__ to optimize classes? Do you recommend it? I use cProfile and pstats from the standard library, which it sounds like you are on top of. That normally points me in the right place to try optimizations. I haven't used __slots__ but generally try to avoid any python black magic. If people need additional CPU speedups, I'd suggest Psyco. This increases memory usage so it will be a tradeoff for most people. Benchmarks with and without Psyco would give users a guideline if they need to optimize performance. > And a few that don't fit anywhere else: > > - What sort of whole-tree operations would you want to do with these > objects that you can't do with a Nexus or Newick tree? What > other formats would you want to convert to? I'm thinking of adding an > Export module later if there's time, for lossy conversions like a graph for > networkx. This is a good general question for the users. I like the graph conversion idea, as it avoids having to re-invent all of the graph manipulation and query operations already present in networkx. > - What's the most intuitive way to display a phylogenetic tree you've > loaded into Biopython? Serialize as Nexus and open in TreeViewX? 
> Convert > to a graph and send to matplotlib? Or, is there a module in > Bio.Graphics > that can draw trees? (If not, should there be?) A good general way to do this would be welcome. I've used networkx with pygraphviz to draw rough 'n ready trees before. Here is some horribly non-generalized code that does this: http://github.com/chapmanb/bcbb/blob/master/visualize/tax_data_display.py Brad > On Wed, Jun 17, 2009 at 8:41 AM, Brad Chapman wrote: > > > Hi Eric; > > Nice update and thanks again for copying the Biopython development > > list on this. > > > > > * Added to_seqrecord and from_seqrecord methods to the PhyloXML.Sequence > > > class > > > -- getting Bio.SeqRecord to stand in for PhyloXML.Sequence entirely > > will > > > require some more thought > > > > I'm looking forward to seeing how you decide to go forward with > > this. For the work I do on a day to day basis, a continual > > struggle involves establishing relationships between things to > > retrieve more information. For instance, a pair of nodes on a tree > > is interesting -- how would I find papers, experiments and other > > information associated with those sequences? It seems like Accession > > and the ref attribute of Annotation help establish these > > relationships. > > > > > * Test-driven development kind of went out the window this week. > > > > Heh. It happens -- sounds sensible to have a clean up and > > documentation week this week; that will also help others who are > > interested dig into using it. > > > > > * The unit tests I do have in place give some sense of memory and CPU > > usage. > > > For the full NCBI taxonomy, memory usage climbs up above 2 GB with the > > > read() function, which isn't a problem on this workstation but could > > be for > > > others. > > > > Do you see an opportunity to offer iterating over clades instead of > > loading them all into memory for these larger trees? 
This would > > involve lazily loading subclades on request and would limit some > > functionality for querying the full tree without loading it all into > > memory. > > > > Another option is to offer some pruning ability as a tree is > > loading. For instance, if I am loading the whole NCBI taxonomy on a > > memory limited computer and only need the Angiosperm flowering plant > > part of the tree. In this case, you'd want to throw away all clades > > not under the clades of interest. > > > > These are probably fringe cases; just brainstorming some ideas. > > > > Thanks again, > > Brad > > From biopython at maubp.freeserve.co.uk Thu Jun 18 10:10:48 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jun 2009 15:10:48 +0100 Subject: [Biopython-dev] CONTIG records in GenBank files Message-ID: <320fb6e00906180710ncbf3346rc662853c2e09f71e@mail.gmail.com> A couple of weeks ago, we were talking about a problem with CONTIG lines in GenBank files, http://lists.open-bio.org/pipermail/biopython-dev/2009-June/006192.html On Sun, Jun 7, 2009 at 10:31 PM, Peter wrote: > On 6/7/09, Iddo Friedberg wrote: >> Here is the stack dump, coming from the file: >> >> ftp://ftp.ncbi.nih.gov/genbank/gbcon11.seq.gz >> ... >> Traceback (most recent call last): >> ... >> Bio.GenBank.LocationParserError: >> ... > > That looks like Bug 2745 to me - does the patch on that bug work for > you, and would you be happy storing the CONTIG line as string? Iddo, Did you have a chance to try the patch on Bug 2745 yet? http://bugzilla.open-bio.org/show_bug.cgi?id=2745 If you are happy with the proposed solution, I'd like to get that checked in... 
Thanks, Peter From bugzilla-daemon at portal.open-bio.org Thu Jun 18 11:29:32 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 18 Jun 2009 11:29:32 -0400 Subject: [Biopython-dev] [Bug 2738] Speed up GenBank parsing, in particular location parsing In-Reply-To: Message-ID: <200906181529.n5IFTWQQ021054@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2738 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1206 is|0 |1 obsolete| | Attachment #1210 is|0 |1 obsolete| | ------- Comment #11 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-18 11:29 EST ------- Created an attachment (id=1327) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1327&action=view) Patch for Bio/GenBank/__init__.py to handle simple locations with re The old patch wasn't applying cleanly any more. This is the same code as before, but updated for the current CVS. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
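For readers who have not opened the attachment on Bug 2738: the idea of handling simple locations with a regular expression can be sketched roughly as below. This is not the attached patch. The helper name parse_simple_location and the choice to return a 0-based, half-open (start, end) tuple are assumptions made only for this illustration.

```python
import re

# Matches only the plainest GenBank location form, e.g. "123..456".
# Anything else (complement(...), join(...), fuzzy "<1..10") is rejected.
_SIMPLE_LOCATION = re.compile(r"^(\d+)\.\.(\d+)$")


def parse_simple_location(text):
    """Return (start, end) as 0-based half-open ints, or None if not simple.

    Hypothetical helper: a None result means the caller should fall back
    to the full (slower) location parser.
    """
    match = _SIMPLE_LOCATION.match(text.strip())
    if match is None:
        return None
    start, end = match.groups()
    # GenBank positions are 1-based and inclusive; convert to Python slicing.
    return int(start) - 1, int(end)


print(parse_simple_location("123..456"))
print(parse_simple_location("complement(1..10)"))
```

The point of such a fast path is that most features in a typical GenBank file use this plain form, so the general parser is only needed for the unusual cases.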
From eric.talevich at gmail.com Thu Jun 18 16:22:17 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 18 Jun 2009 16:22:17 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <320fb6e00906180235k236a5538pf47ff0abae669180@mail.gmail.com> References: <3f6baf360906151004m4ee3595g7ac007fd218e370f@mail.gmail.com> <20090617124101.GH44321@sobchak.mgh.harvard.edu> <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> <320fb6e00906180235k236a5538pf47ff0abae669180@mail.gmail.com> Message-ID: <3f6baf360906181322n5981a66eobd05b2692965504a@mail.gmail.com> On Thu, Jun 18, 2009 at 5:35 AM, Peter wrote: > If you can show us a sample record, I would be better able to comment > on how I would store it in a SeqRecord. > Here are a couple of examples from files in Test/PhyloXML/. From phyloxml_examples.xml, a contrived demonstration of various features: A E. coli alcohol dehydrogenase 0.99 Caenorhabditis elegans ADHX Q17335 alcohol dehydrogenase (An extra level of context is shown -- information that doesn't fit into a SeqRecord could also be conceivably moved up into the Clade object.) Assuming values of the SeqRecord.attributes dictionary can also be dictionaries, this isn't too hard to convert to primitive types. Another example from apaf.xml, which appears to be real data: CARD NB-ARC WD40 WD40 WD40 WD40 WD40 WD40 WD40 WD40 WD40 The DomainArchitecture element refers to domains in a protein sequence, according to the spec. This could be reasonably represented as a list of SeqFeature objects, I see now. But converting from a SeqRecord back to PhyloXML, not all SeqFeatures would be protein domains... I don't know what to do with that. The new SeqRecord chapter is very informative -- I was originally just looking at the wiki and epydoc pages. Still unclear: why doesn't the SeqRecord constructor take annotations as an optional argument? Should it?
Thanks, Eric From biopython at maubp.freeserve.co.uk Thu Jun 18 16:52:26 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jun 2009 21:52:26 +0100 Subject: [Biopython-dev] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <3f6baf360906181322n5981a66eobd05b2692965504a@mail.gmail.com> References: <3f6baf360906151004m4ee3595g7ac007fd218e370f@mail.gmail.com> <20090617124101.GH44321@sobchak.mgh.harvard.edu> <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> <320fb6e00906180235k236a5538pf47ff0abae669180@mail.gmail.com> <3f6baf360906181322n5981a66eobd05b2692965504a@mail.gmail.com> Message-ID: <320fb6e00906181352o6baf0c92u1d8b1f493cd7e4d7@mail.gmail.com> On Thu, Jun 18, 2009 at 9:22 PM, Eric Talevich wrote: > > The new SeqRecord chapter is very informative -- I was originally just > looking at the wiki and epydoc pages. I can't take all the credit - a good chunk of it was reused from the "Advanced" chapter. Do let me know if you spot any typos. Do you think it makes sense to have this before the SeqIO chapter (as it is now), or afterwards? Right now the SeqRecord chapter uses SeqIO in order to load some complex records to show how they are represented - so you could put them in either order. > Still unclear: why doesn't the SeqRecord constructor take annotations > as an optional argument? Should it? I don't know why it doesn't (it's a historical design choice from before my time), do you think it would actually be more useful? Maybe Brad can comment? P.S.
See also related Bug 2841 http://bugzilla.open-bio.org/show_bug.cgi?id=2841 Peter From eric.talevich at gmail.com Thu Jun 18 17:49:47 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 18 Jun 2009 17:49:47 -0400 Subject: [Biopython-dev] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <320fb6e00906181352o6baf0c92u1d8b1f493cd7e4d7@mail.gmail.com> References: <3f6baf360906151004m4ee3595g7ac007fd218e370f@mail.gmail.com> <20090617124101.GH44321@sobchak.mgh.harvard.edu> <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> <320fb6e00906180235k236a5538pf47ff0abae669180@mail.gmail.com> <3f6baf360906181322n5981a66eobd05b2692965504a@mail.gmail.com> <320fb6e00906181352o6baf0c92u1d8b1f493cd7e4d7@mail.gmail.com> Message-ID: <3f6baf360906181449t1b0b9b85u325b2ac997ed05b8@mail.gmail.com> On Thu, Jun 18, 2009 at 4:52 PM, Peter wrote: > On Thu, Jun 18, 2009 at 9:22 PM, Eric Talevich > wrote: > > > > The new SeqRecord chapter is very informative -- I was originally just > > looking at the wiki and epydoc pages. > > I can't take all the credit - a good chunk of it was reused from the > "Advanced" > chapter. Do let me know if you spot any typos. Did you think it makes sense > to have this before the SeqIO chapter (as it is now), or afterwards? Right > now the SeqRecord chapter uses SeqIO in order to load some complex > records to show how they are represented - so you could put them in either > order. > I didn't notice any typos other than Python being consistently lowercase, which I assume is how the author likes it. The ordering is good -- the SeqIO chapter makes more advanced use of sequences of SeqRecord objects, so it's good to be familiar with the basic objects first. In general, I like the organization of covering fundamental types first, then moving on to larger collections, rather than covering the majority of a big collection in one shot and leaving the tricky parts unaddressed. 
A quick discussion of SeqFeature objects at the end of the SeqRecord chapter instead of in the "Advanced" chapter would be nice, since apparently it's easy to disregard that last section as an appendix of less important material (I guess I did originally). In that final section it seems like SeqFeature is not meant to be handled by mere mortals, mainly because of the fuzzy positions -- maybe integrating it a little more comfortably into other modules like PDB would help with that. I pushed some work to github, involving the PhyloXML<->SeqRecord translation: http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML -Eric From biopython at maubp.freeserve.co.uk Thu Jun 18 18:05:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jun 2009 23:05:29 +0100 Subject: [Biopython-dev] SeqRecord in Tutorial (was GSoC and PhyloXML) Message-ID: <320fb6e00906181505r32d7df31w7287bdb69753c49f@mail.gmail.com> On Thu, Jun 18, 2009 at 10:49 PM, Eric Talevich wrote: > > I didn't notice any typos other than Python being consistently lowercase, > which I assume is how the author likes it. I was aiming for consistency, with no strong preference - at the time there were more uses of "python" than "Python" so I picked that. We can change it easily enough - does anyone care either way? > The ordering is good -- the SeqIO chapter makes more advanced use > of sequences of SeqRecord objects, so it's good to be familiar with the > basic objects first. In general, I like the organization of covering > fundamental types first, then moving on to larger collections, rather > than covering the majority of a big collection in one shot and leaving > the tricky parts unaddressed. There is a case for leaving messy corner cases to the end, as long as the main chapters cover the core. 
> A quick discussion of SeqFeature objects at the end of the SeqRecord > chapter instead of in the "Advanced" chapter would be nice, since > apparently it's easy to disregard that last section as an appendix of > less important material (I guess I did originally). Yes, the SeqFeature stuff did originally risk being ignored when it was just part of the "Advanced" chapter near the end. I think it made sense to move it to the new SeqRecord chapter (and there is still room for improvement - I'm thinking of going over one of the features in the example GenBank file in more detail). > In that final section it seems like SeqFeature is not meant to be > handled by mere mortals, mainly because of the fuzzy positions The fuzzy locations are by their nature really horrible to code with. Doing something with SeqFeature objects and locations has been discussed on the mailing list in the last month or so (in connection with GFF files). I hope to have a good chat about this with Brad in person at the BOSC hackathon. > -- maybe integrating it a little more comfortably into other > modules like PDB would help with that. I don't see how SeqFeature objects and their FeatureLocations relate to PDB. Could you elaborate? Peter From eric.talevich at gmail.com Thu Jun 18 21:57:55 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 18 Jun 2009 21:57:55 -0400 Subject: [Biopython-dev] SeqRecord in Tutorial (was GSoC and PhyloXML) In-Reply-To: <320fb6e00906181505r32d7df31w7287bdb69753c49f@mail.gmail.com> References: <320fb6e00906181505r32d7df31w7287bdb69753c49f@mail.gmail.com> Message-ID: <3f6baf360906181857h43686cfawb850266eb5153af8@mail.gmail.com> On Thu, Jun 18, 2009 at 6:05 PM, Peter wrote: > On Thu, Jun 18, 2009 at 10:49 PM, Eric Talevich > wrote: > > > > I didn't notice any typos other than Python being consistently lowercase, > > which I assume is how the author likes it.
> > I was aiming for consistency, with no strong preference - at the time > there were more uses of "python" than "Python" so I picked that. We > can change it easily enough - does anyone care either way? > Python.org capitalizes it. Shrug. > The ordering is good -- the SeqIO chapter makes more advanced use > > of sequences of SeqRecord objects, so it's good to be familiar with the > > basic objects first. In general, I like the organization of covering > > fundamental types first, then moving on to larger collections, rather > > than covering the majority of a big collection in one shot and leaving > > the tricky parts unaddressed. > > There is a case for leaving messy corner cases to the end, as long as > the main chapters cover the core. > Agreed. In the SeqRecord chapter, I was looking for a paragraph or so on what sort of information goes into a SeqFeature to see whether it would be a suitable stand-in for PhyloXML's DomainArchitecture. From the initial description I wasn't sure if annotations or letter_annotations would be more appropriate, and the other mentions are basically "here be dragons"... which is true, but a quick example would be helpful. The GenBank parsing section would be a good place for that. > -- maybe integrating it a little more comfortably into other > > modules like PDB would help with that. > > I don't see how SeqFeature objects and their FeatureLocations > relate to PDB. Could you elaborate? > If secondary structure or miscellaneous information is listed in the PDB header, then parse_pdb_header could produce SeqFeatures from that. Right now it doesn't build any Biopython objects at all.
-Eric From biopython at maubp.freeserve.co.uk Fri Jun 19 05:18:40 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 19 Jun 2009 10:18:40 +0100 Subject: [Biopython-dev] SeqRecord in Tutorial (was GSoC and PhyloXML) In-Reply-To: <3f6baf360906181857h43686cfawb850266eb5153af8@mail.gmail.com> References: <320fb6e00906181505r32d7df31w7287bdb69753c49f@mail.gmail.com> <3f6baf360906181857h43686cfawb850266eb5153af8@mail.gmail.com> Message-ID: <320fb6e00906190218m4619f624ncfc9d1f8bccbbad2@mail.gmail.com> On Fri, Jun 19, 2009 at 2:57 AM, Eric Talevich wrote: > On Thu, Jun 18, 2009 at 6:05 PM, Peter > wrote: >> >> On Thu, Jun 18, 2009 at 10:49 PM, Eric Talevich >> wrote: >> > >> > I didn't notice any typos other than Python being consistently >> > lowercase, >> > which I assume is how the author likes it. >> >> I was aiming for consistency, with no strong preference - at the time >> there were more uses of "python" than "Python" so I picked that. We >> can change it easily enough - does anyone care either way? > > Python.org capitalizes it. Shrug. Maybe we should use "Python" then. >> > The ordering is good -- the SeqIO chapter makes more advanced use >> > of sequences of SeqRecord objects, so it's good to be familiar with the >> > basic objects first. In general, I like the organization of covering >> > fundamental types first, then moving on to larger collections, rather >> > than covering the majority of a big collection in one shot and leaving >> > the tricky parts unaddressed. >> >> There is a case for leaving messy corner cases to the end, as long as >> the main chapters cover the core. > > Agreed. In the SeqRecord chapter, I was looking for a paragraph or so on > what sort of information goes into a SeqFeature to see whether it would be a > suitable stand-in for PhyloXML's DomainArchitecture. 
From the initial > description I wasn't sure if annotations or letter_annotations would be more > appropriate, and the other mentionings are basically "here be dragons"... > which is true, but a quick example would be helpful. The GenBank parsing > section would be a good place for that. OK - that is useful feedback. I will try and clarify that, but in essence: * letter_annotations - where you have a bit of information for each letter (i.e. amino acid or nucleotide) in the sequence, such as a list of quality scores or secondary structure predictions. * features - where you have annotation associated with a particular region of the sequence (e.g. a gene) * annotations - things that apply to the whole sequence like organism There are some odd cases, like the GenBank source feature, which covers the whole of the sequence but is listed in the feature table just like a gene etc (you'd have to ask the NCBI why they did it this way). In Biopython, these source features get stored as a SeqFeature for consistency with the rest of the GenBank feature table entries. Another odd one is any references, which in GenBank files may apply to a particular region of the sequence (but in normal usage seem to apply to the whole thing). These get stored separately in BioSQL, which to me makes sense. At the moment in the SeqRecord they are stored in the annotations dictionary (as a list of reference objects under the key "references"). I've been thinking about upgrading this to a new SeqRecord property (a list of reference objects) but as I have never actually needed to access this information it hasn't been a high priority. >> > -- maybe integrating it a little more comfortably into other >> > modules like PDB would help with that. >> >> I don't see how SeqFeature objects and their FeatureLocations >> related to PDB. Could you elaborate? > > If secondary structure or miscellaneous information is listed in the > PDB header, then parse_pdb_header could produce SeqFeatures > from that. 
Right now it doesn't build any Biopython objects at all. I see. Yes, the header parsing in Bio.PDB is very limited at the moment, and even sticking to well defined line types (and ignoring many or most of the REMARK lines) there is room for improvement. For the secondary structure, this is given as a string with one letter for each residue - I see this as a more natural match to SeqRecord letter_annotations rather than a SeqFeature, but giving a list of SeqFeatures for the helices, beta sheets, coils etc would also work. Of course, you might also want a Seq object to relate them to (to give the locations meaning). One idea I have toyed with is a Bio.SeqIO parser for PDB files, which would focus on the sequence information in the headers (and probably ignore the ATOM lines completely). I would like to keep the core of Biopython independent of NumPy (and I see Bio.SeqIO as part of the core), so this wouldn't depend on Bio.PDB. I'm not sure this idea would actually be useful so haven't worked on it. Peter From chapmanb at 50mail.com Fri Jun 19 08:25:49 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 19 Jun 2009 08:25:49 -0400 Subject: [Biopython-dev] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <320fb6e00906181352o6baf0c92u1d8b1f493cd7e4d7@mail.gmail.com> References: <3f6baf360906151004m4ee3595g7ac007fd218e370f@mail.gmail.com> <20090617124101.GH44321@sobchak.mgh.harvard.edu> <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> <320fb6e00906180235k236a5538pf47ff0abae669180@mail.gmail.com> <3f6baf360906181322n5981a66eobd05b2692965504a@mail.gmail.com> <320fb6e00906181352o6baf0c92u1d8b1f493cd7e4d7@mail.gmail.com> Message-ID: <20090619122549.GD64233@sobchak.mgh.harvard.edu> Eric and Peter; > > Still unclear: why doesn't the SeqRecord constructor take annotations > > as an optional argument? Should it? > > I don't know why it doesn't (its a historical design choice before by time), > do you think it would actually be more useful? 
Maybe Brad can comment? > > P.S. See also related Bug 2841 > http://bugzilla.open-bio.org/show_bug.cgi?id=2841 My recollection of the history here is hazy, but based on the code comments we were probably running into this problem without realizing it: http://effbot.org/pyfaq/why-are-default-values-shared-between-objects.htm It should be easy enough to allow passing in annotations and letter_annotations by setting the function defaults to None and doing the if annotations is None: annotations = {} trick. My vote is for adding this. Brad From biopython at maubp.freeserve.co.uk Fri Jun 19 08:40:08 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 19 Jun 2009 13:40:08 +0100 Subject: [Biopython-dev] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <20090619122549.GD64233@sobchak.mgh.harvard.edu> References: <3f6baf360906151004m4ee3595g7ac007fd218e370f@mail.gmail.com> <20090617124101.GH44321@sobchak.mgh.harvard.edu> <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> <320fb6e00906180235k236a5538pf47ff0abae669180@mail.gmail.com> <3f6baf360906181322n5981a66eobd05b2692965504a@mail.gmail.com> <320fb6e00906181352o6baf0c92u1d8b1f493cd7e4d7@mail.gmail.com> <20090619122549.GD64233@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00906190540p49b98ad5se162b7be6b600411@mail.gmail.com> On Fri, Jun 19, 2009 at 1:25 PM, Brad Chapman wrote: > Eric and Peter; > >> > Still unclear: why doesn't the SeqRecord constructor take annotations >> > as an optional argument? Should it? >> >> I don't know why it doesn't (its a historical design choice before my time), >> do you think it would actually be more useful? Maybe Brad can comment? >> >> P.S. 
See also related Bug 2841 >> http://bugzilla.open-bio.org/show_bug.cgi?id=2841 > > My recollection of the history here is hazy, but based on the code comments > we were probably running into this problem without realizing it: > > http://effbot.org/pyfaq/why-are-default-values-shared-between-objects.htm > > It should be easy enough to allow passing in annotations and > letter_annotations by setting the function defaults to None and doing the > if annotations is None: annotations = {} trick. > > My vote is for adding this. That thought did occur to me last night - having fallen over the same problem myself once before. Would you like to update the SeqFeature along those lines then? Peter From bugzilla-daemon at portal.open-bio.org Fri Jun 19 08:43:55 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 19 Jun 2009 08:43:55 -0400 Subject: [Biopython-dev] [Bug 2841] SeqFeature constructor ignores qualifiers and sub_features arguments In-Reply-To: Message-ID: <200906191243.n5JChtaM029245@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2841 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-19 08:43 EST ------- See http://lists.open-bio.org/pipermail/biopython-dev/2009-June/006241.html where Brad wrote: > My recollection of the history here is hazy, but based on the > code comments we were probably running into this problem without > realizing it: > > http://effbot.org/pyfaq/why-are-default-values-shared-between-objects.htm > > It should be easy enough to allow passing in annotations and > letter_annotations by setting the function defaults to None > and doing the if annotations is None: annotations = {} trick. > > My vote is for adding this. I agree that would explain the comments, and fix makes sense. 
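For readers who have not met this gotcha, here is a minimal sketch of the problem and of the "None" idiom Brad describes. The two classes are hypothetical stand-ins written for illustration only; they are not the real SeqFeature or SeqRecord code, and the argument name "qualifiers" is simply borrowed from the bug title.

```python
# Hypothetical stand-in classes, NOT the real Bio.SeqFeature.SeqFeature.


class SharedDefaultFeature(object):
    """Buggy pattern: the default dict is built once, when 'def' runs."""

    def __init__(self, qualifiers={}):
        self.qualifiers = qualifiers


class SafeFeature(object):
    """The None idiom: a fresh dict per instance unless one is supplied."""

    def __init__(self, qualifiers=None):
        if qualifiers is None:
            qualifiers = {}
        self.qualifiers = qualifiers


# Two objects built with the mutable default end up sharing one dictionary.
a = SharedDefaultFeature()
b = SharedDefaultFeature()
a.qualifiers["gene"] = "abc"
shared_leak = "gene" in b.qualifiers

# With the None idiom each object gets its own dictionary.
c = SafeFeature()
d = SafeFeature()
c.qualifiers["gene"] = "abc"
safe_leak = "gene" in d.qualifiers

# A caller-supplied value is still honoured rather than ignored.
e = SafeFeature(qualifiers={"note": "supplied"})
```

Here shared_leak comes out True and safe_leak comes out False, which is presumably why the original constructor stopped trusting its arguments.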
Note if we want to allow the letter_annotations to be set, we shouldn't blindly apply the supplied value (which may be a plain dictionary, and contain inappropriate data) but make sure it is turned into a restricted dictionary. Ideally this should be covered with a new unit test... Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jun 19 08:48:39 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 19 Jun 2009 08:48:39 -0400 Subject: [Biopython-dev] [Bug 2860] New: Writing GenBank files should output features in position order Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2860 Summary: Writing GenBank files should output features in position order Product: Biopython Version: 1.50b Platform: All OS/Version: Linux Status: NEW Severity: minor Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: n.j.loman at bham.ac.uk Adding features to a SeqRecord object does not automatically sort them by position. Therefore if you do something like this:

    import sys
    from copy import copy
    from Bio import SeqIO

    for rec in SeqIO.parse(sys.stdin, "genbank"):
        new_features = []
        for feature in rec.features:
            if feature.type == 'CDS':
                # Duplicate each CDS feature as a gene feature
                gene_feature = copy(feature)
                gene_feature.type = 'gene'
                new_features.append(gene_feature)
        # The new gene features are appended after all existing features
        rec.features.extend(new_features)
        SeqIO.write([rec], sys.stdout, "genbank")

You will end up with an incorrectly sorted file with CDS features first, then gene features. You can sort rec.features in-place to correct this:

    from operator import attrgetter
    rec.features.sort(key=attrgetter('location'))

I am not sure of the correct fix in terms of Biopython, whether it should concentrate on changing the behaviour of SeqRecord.features, or the GenBank output code (which I am aware is a work in progress).
I guess the answer to this is should BioPython guarantee Seqrecord.features to be sorted? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jun 19 08:52:15 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 19 Jun 2009 08:52:15 -0400 Subject: [Biopython-dev] [Bug 2841] SeqFeature constructor ignores qualifiers and sub_features arguments In-Reply-To: Message-ID: <200906191252.n5JCqFoX030054@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2841 ------- Comment #3 from n.j.loman at bham.ac.uk 2009-06-19 08:52 EST ------- Good stuff, a useful Python gotcha I had not encountered yet. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jun 19 09:11:17 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 19 Jun 2009 09:11:17 -0400 Subject: [Biopython-dev] [Bug 2860] Writing GenBank files should output features in position order In-Reply-To: Message-ID: <200906191311.n5JDBHgA031607@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2860 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-19 09:11 EST ------- Hi Nick, I understand your request, but I am not sure if it is a bug. Do you know if the GenBank file format say anything about the order of the features? And does this actually matter? I know from experience that the NCBI GenBank files do seem to be sorted by position (and then it seems to be gene before CDS and some other tie break rules which I have not explored). 
Arguably this should be left to the user - a slightly different version of your script could avoid the issue, something like this (untested):

    import sys
    from copy import copy
    from Bio import SeqIO

    for rec in SeqIO.parse(sys.stdin, "genbank"):
        new_features = []
        for feature in rec.features:
            if feature.type == 'CDS':
                # Insert the new gene feature just before its CDS
                gene_feature = copy(feature)
                gene_feature.type = 'gene'
                new_features.append(gene_feature)
            new_features.append(feature)
        rec.features = new_features
        SeqIO.write([rec], sys.stdout, "genbank")

Peter P.S. Your example may produce odd features, as gene features don't normally include a protein id or a translation while a CDS feature may. Again, Biopython doesn't currently try to limit this - or indeed limit the feature types to a white list. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jun 19 09:19:07 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 19 Jun 2009 09:19:07 -0400 Subject: [Biopython-dev] [Bug 2860] Writing GenBank files should output features in position order In-Reply-To: Message-ID: <200906191319.n5JDJ7jv032414@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2860 ------- Comment #2 from n.j.loman at bham.ac.uk 2009-06-19 09:19 EST ------- Well yes, I realise that this is only a standard by convention. My assumption when dealing with such matters is that if NCBI/GenBank does it, it is probably right. My impression is that "source" is always the first qualifier, then it is sorted by location with "gene" features followed by "CDS" features by convention. I guess it is acceptable for the user to deal with the order in the absence of a published standard for GenBank files. But I think it would be equally acceptable to code the GenBank outputter to enforce those rules.
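To make the convention described above concrete, here is a rough sketch of a sort key that orders by start position and then breaks ties by feature type. Everything in it is an assumption for illustration: the Feature namedtuple is a hypothetical stand-in with plain integer positions (no fuzzy locations), and the TYPE_RANK table is only a guess at the tie-break order, since no published rule has been identified in this thread.

```python
from collections import namedtuple

# Hypothetical stand-in for a SeqFeature: a type plus plain integer positions.
Feature = namedtuple("Feature", ["type", "start", "end"])

# Assumed tie-break order for features starting at the same position.
# This is a guess based on observed files, not a documented GenBank rule.
TYPE_RANK = {"source": 0, "gene": 1, "CDS": 2}


def feature_sort_key(feature):
    """Sort by start position, then by the assumed feature type rank."""
    # Unknown types rank after all known ones.
    rank = TYPE_RANK.get(feature.type, len(TYPE_RANK))
    return (feature.start, rank)


features = [
    Feature("CDS", 10, 200),
    Feature("gene", 300, 500),
    Feature("source", 0, 1000),
    Feature("gene", 10, 200),
    Feature("CDS", 300, 500),
]

# sorted() is stable, so features with equal keys keep their original order.
ordered = sorted(features, key=feature_sort_key)
ordered_types = [f.type for f in ordered]
```

With real SeqFeature objects the key would need some agreed way of turning a possibly fuzzy location into a number, which this sketch deliberately leaves out.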
(I know that script fragment would give weird output, it was just illustrative of the position issue.) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jun 19 09:31:10 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 19 Jun 2009 09:31:10 -0400 Subject: [Biopython-dev] [Bug 2860] Writing GenBank files should output features in position order In-Reply-To: Message-ID: <200906191331.n5JDVAY4001158@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2860 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-19 09:31 EST ------- (In reply to comment #2) > Well yes, I realise that this is only a standard by convention. My assumption > when dealing with such matters is that if NCBI/GenBank does it, it is probably > right. On the other hand, the NCBI are generally very good at defining their file formats, so *if* they don't specify an order, presumably anything is OK? > My impression is that "source" is always the first qualifier, then it is > sorted by location with "gene" features followed by "CDS" features by > convention. Since the "source" feature starts from the first base, it will always be one of the first by location. > I guess it is acceptable for the user to deal with the order in the absence > of a published standard for GenBank files. But I think it would be equally > acceptable to code the GenBank outputter to enforce those rules. We would need to know the rules though. For example, which of these locations is first: "10..20" or "<10..20" or ">10..20" or "one-of(10,12)..20" or are they all tied? We would also need to know the tie break rules for the feature type, not just "source" before "gene" before "CDS". What about "tRNA" etc. 
Given we don't currently know the rules, we could only implement a best guess. If the order we write out is just the order of the SeqFeature objects in the list (as now), the behaviour is clearly defined. This is my preference as it gives the user full control (and full responsibility). If we did sort things, there would be no easy way to override this sorting. > (I know that script fragment would give weird output, it was just > illustrative of the position issue.) OK Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jun 19 09:37:29 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 19 Jun 2009 09:37:29 -0400 Subject: [Biopython-dev] [Bug 2860] Writing GenBank files should output features in position order In-Reply-To: Message-ID: <200906191337.n5JDbT19001895@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2860 ------- Comment #4 from n.j.loman at bham.ac.uk 2009-06-19 09:37 EST ------- Fair enough, I guess this is a pretty strong argument to leave the Genbank output as it is. Is there any argument to add a method to SeqRecord like "insert_by_location()". This might make the use of SeqFeature more transparent to new users too who may not be clear that they can write to the features list directly. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Fri Jun 19 09:44:24 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 19 Jun 2009 09:44:24 -0400 Subject: [Biopython-dev] [Bug 2860] Writing GenBank files should output features in position order In-Reply-To: Message-ID: <200906191344.n5JDiOEV002698@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2860 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |WONTFIX ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-19 09:44 EST ------- (In reply to comment #4) > Fair enough, I guess this is a pretty strong argument to leave the Genbank > output as it is. OK, marking this as "Won't Fix" (unless anyone can find a definitive definition of how GenBank features should be sorted). > Is there any argument to add a method to SeqRecord like > "insert_by_location()". This might make the use of SeqFeature more > transparent to new users too who may not be clear that they can write > to the features list directly. I see where you are going with that idea, but for now rather than adding more code I think we should add more documentation. The new SeqRecord (and SeqFeature) chapter in the Tutorial should help here. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Fri Jun 19 10:40:40 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 19 Jun 2009 15:40:40 +0100 Subject: [Biopython-dev] Next release plans? Message-ID: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> Hi all, We've made some good progress since Biopython 1.50, and I think it may already be time to plan another release.
There was some important "housekeeping" which we will need to flag in the release notes (and perhaps email Linux packagers about explicitly), namely we don't support Python 2.3 any more, don't install Martel any more, and don't depend on mxTextTools any more. There is also some new stuff too. I think the following still need more testing, so I'm wondering about another beta release... maybe before BOSC 2009 as we can then use it in the tutorial sessions? (1) Writing GenBank files with features. I think this copes with all the ambiguous and complex locations, which turned out to be a big job, but some independent testing would be wise. (2) Supporting the new Illumina FASTQ file format. This still needs some good tests (ideally with some real data). (3) Application wrappers (especially Cymon's new alignment wrappers). We have basic documentation for these in the Tutorial now, plus a new chapter on the SeqRecord object. GenomeDiagram would also benefit from further documentation. Any thoughts? Does rolling this out early next week sound sensible or simply too ambitious? Peter From tiagoantao at gmail.com Fri Jun 19 10:44:25 2009 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Fri, 19 Jun 2009 15:44:25 +0100 Subject: [Biopython-dev] Next release plans? In-Reply-To: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> Message-ID: <6d941f120906190744y3b260c50mf414311c84ad128d@mail.gmail.com> On Fri, Jun 19, 2009 at 3:40 PM, Peter wrote: > Any thoughts? Does rolling this out early next week sound sensible > or simply too ambitious? I wont have time to roll my genepop code with statistics calculations in that timeframe, no chance. I will leave it for 1.52. From biopython at maubp.freeserve.co.uk Fri Jun 19 10:57:23 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 19 Jun 2009 15:57:23 +0100 Subject: [Biopython-dev] Next release plans? 
In-Reply-To: <6d941f120906190744y3b260c50mf414311c84ad128d@mail.gmail.com> References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> <6d941f120906190744y3b260c50mf414311c84ad128d@mail.gmail.com> Message-ID: <320fb6e00906190757q275b26f4p4bc6933ad99e9a49@mail.gmail.com> 2009/6/19 Tiago Antão: > On Fri, Jun 19, 2009 at 3:40 PM, Peter wrote: >> Any thoughts? Does rolling this out early next week sound sensible >> or simply too ambitious? > > I won't have time to roll my genepop code with statistics calculations > in that timeframe, no chance. I will leave it for 1.52. We don't have to rush this - it just seemed worth having a quite fresh release available for us to work from in the tutorial session at BOSC, and also as a reference point for the coding BoF session. If we do do Biopython 1.51 beta next week, with Biopython 1.51 final in July, then I would provisionally expect to do Biopython 1.52 in the Autumn. This could be brought forward if a large and useful contribution was added. Let's chat about the genepop code at BOSC? Peter From eric.talevich at gmail.com Fri Jun 19 11:15:10 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 19 Jun 2009 11:15:10 -0400 Subject: [Biopython-dev] Next release plans? In-Reply-To: <320fb6e00906190757q275b26f4p4bc6933ad99e9a49@mail.gmail.com> References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> <6d941f120906190744y3b260c50mf414311c84ad128d@mail.gmail.com> <320fb6e00906190757q275b26f4p4bc6933ad99e9a49@mail.gmail.com> Message-ID: <3f6baf360906190815j3c8924f2n73a2e050b6489389@mail.gmail.com> 2009/6/19 Peter > If we do do Biopython 1.51 beta next week, with Biopython 1.51 final in > July, then I would provisionally expect to do Biopython 1.52 in the Autumn. > This could be brought forward if a large and useful contribution was added. > Let's chat about the genepop code at BOSC?
> The Google Summer of Code projects are supposed to be production-ready by August 17, according to the timeline: http://socghop.appspot.com/document/show/program/google/gsoc2009/timeline There's also a mentor summit in October. If the two phylogenetics projects land and another release gets rolled by then, a stable release featuring our new code could be presented at the summit, which would be hot. -Eric From eric.talevich at gmail.com Fri Jun 19 12:03:46 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 19 Jun 2009 12:03:46 -0400 Subject: [Biopython-dev] SeqRecord in Tutorial (was GSoC and PhyloXML) In-Reply-To: <320fb6e00906190218m4619f624ncfc9d1f8bccbbad2@mail.gmail.com> References: <320fb6e00906181505r32d7df31w7287bdb69753c49f@mail.gmail.com> <3f6baf360906181857h43686cfawb850266eb5153af8@mail.gmail.com> <320fb6e00906190218m4619f624ncfc9d1f8bccbbad2@mail.gmail.com> Message-ID: <3f6baf360906190903ke04b158oa2045558ef9fc9f6@mail.gmail.com> On Fri, Jun 19, 2009 at 5:18 AM, Peter wrote: > > > > OK - that is useful feedback. I will try and clarify that, but in essence: > > * letter_annotations - where you have a bit of information for each letter > (i.e. amino acid or nucleotide) in the sequence, such as a list of quality > scores or secondary structure predictions. > * features - where you have annotation associated with a particular > region of the sequence (e.g. a gene) > * annotations - things that apply to the whole sequence like organism Thanks. Another odd one is any references, which in GenBank files may apply > to a particular region of the sequence (but in normal usage seem to > apply to the whole thing). These get stored separately in BioSQL, which > to me makes sense. At the moment in the SeqRecord they are stored > in the annotations dictionary (as a list of reference objects under the > key "references"). 
I've been thinking about upgrading this to a new > SeqRecord property (a list of reference objects) but as I have never > actually needed to access this information it hasn't been a high priority. > Good to know. I'll be careful with SeqRecord.annotations['references'] for now. > > > > If secondary structure or miscellaneous information is listed in the > > PDB header, then parse_pdb_header could produce SeqFeatures > > from that. Right now it doesn't build any Biopython objects at all. > > I see. Yes, the header parsing in Bio.PDB is very limited at the > moment, and even sticking to well defined line types (and ignoring > many or most of the REMARK lines) there is room for improvement. > > For the secondary structure, this is given as a string with one letter > for each residue - I see this as a more natural match to SeqRecord > letter_annotations rather than a SeqFeature, but giving a list of > SeqFeatures for the helices, beta sheets, coils etc would also work. > Of course, you might also want a Seq object to relate them to (to > give the locations meaning). > > One idea I have toyed with is a Bio.SeqIO parser for PDB files, which > would focus on the sequence information in the headers (and probably > ignore the ATOM lines completely). I would like to keep the core of > Biopython independent of NumPy (and I see Bio.SeqIO as part of the > core), so this wouldn't depend on Bio.PDB. I'm not sure this idea > would actually be useful so haven't worked on it. > > I'll have a real use for this in the fall, once GSoC is done. It would be nice to link a set of parsed PDB objects to a multiple alignment of protein sequences, but I think I'd always want to have the 3D structure information close at hand. The other use case I've mentioned before is to verify and fix existing PDB files from Biopython, rather than manually -- 3D coordinates would probably be useful here, too, for checking collisions and such.
Eventually I'll resurrect my pdbtidy branch and make the parser emit a SeqRecord or whatever's most appropriate. -Eric From idoerg at gmail.com Fri Jun 19 12:06:13 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Fri, 19 Jun 2009 09:06:13 -0700 Subject: [Biopython-dev] CONTIG records in GenBank files In-Reply-To: <320fb6e00906180710ncbf3346rc662853c2e09f71e@mail.gmail.com> References: <320fb6e00906180710ncbf3346rc662853c2e09f71e@mail.gmail.com> Message-ID: Not yet, been traveling. Sorry. Will try to get to it next week (plenty of time on the plane to Stockholm) Iddo Friedberg, Ph.D. http://iddo-friedberg.net/contact.html On Jun 18, 2009 7:10 AM, "Peter" wrote: A couple of weeks ago, we were talking about a problem with CONTIG lines in GenBank files, http://lists.open-bio.org/pipermail/biopython-dev/2009-June/006192.html On Sun, Jun 7, 2009 at 10:31 PM, Peter wrote: > On 6/7/09, Iddo Friedberg wrote: >> Here is the stack dump, coming from the file: >> >> ftp://ftp.ncbi.nih.gov/genbank/gbcon11.seq.gz >> ... >> Traceback (most recent call last): >> ... >> Bio.GenBank.LocationParserError: >> ... > > That looks like Bug 2745 to me - does the patch on that bug work for > you, and would you be happy storing the CONTIG line as string? Iddo, Did you have a chance to try the patch on Bug 2745 yet? http://bugzilla.open-bio.org/show_bug.cgi?id=2745 If you are happy with the proposed solution, I'd like to get that checked in... Thanks, Peter From mjldehoon at yahoo.com Fri Jun 19 12:04:46 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 19 Jun 2009 09:04:46 -0700 (PDT) Subject: [Biopython-dev] Next release plans? Message-ID: <870693.80287.qm@web62405.mail.re1.yahoo.com> --- On Fri, 6/19/09, Peter wrote: > Any thoughts? Does rolling this out early next week sound > sensible or simply too ambitious? > Sounds good to me. I may check in some Bio.Blast stuff, but nothing dramatic. 
I also think that we can release 1.51 without a beta release first; see our previous discussion here: http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005825.html --Michiel. From biopython at maubp.freeserve.co.uk Fri Jun 19 12:19:10 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 19 Jun 2009 17:19:10 +0100 Subject: [Biopython-dev] Next release plans? In-Reply-To: <870693.80287.qm@web62405.mail.re1.yahoo.com> References: <870693.80287.qm@web62405.mail.re1.yahoo.com> Message-ID: <320fb6e00906190919t47235db0od69a6b7943d180d@mail.gmail.com> On Fri, Jun 19, 2009 at 5:04 PM, Michiel de Hoon wrote: > > > --- On Fri, 6/19/09, Peter wrote: >> Any thoughts? Does rolling this out early next week sound >> sensible or simply too ambitious? > > Sounds good to me. I may check in some Bio.Blast stuff, but nothing > dramatic. I'll keep that in mind - when were you thinking of doing that? I was thinking of doing the release on Monday or Tuesday (as I travel on the Friday, and have to prepare my slides etc). > I also think that we can release 1.51 without a beta release first; > see our previous discussion here: > > http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005825.html Yeah - but this is a bit of a special case. Given we will be having a tutorial session and a hackathon session at BOSC, having people using a Biopython 1.51 beta release should ensure some useful feedback (even if it is just documentation improvements) which would make Biopython 1.51 that much better. I am also a *little* nervous that there could be something amiss in the new GenBank feature location writing (despite the test coverage), and doing the beta release first would put my mind more at ease on this area. 
Peter From biopython at maubp.freeserve.co.uk Fri Jun 19 12:30:11 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 19 Jun 2009 17:30:11 +0100 Subject: [Biopython-dev] "Your XML file did not start with References: <7265d4f0906140823r9979362y7b1633447e13292f@mail.gmail.com> <320fb6e00906141116l4cf9a9d5u733497d3b02e1b6a@mail.gmail.com> <7265d4f0906141126l2f00fecehaa28273af9b3681a@mail.gmail.com> <7265d4f0906150202j3daeefa9we304cf29c4f6cd6d@mail.gmail.com> Message-ID: <320fb6e00906190930n1f614ee0v4fcce8362ccddde1@mail.gmail.com> On Mon, Jun 15, 2009 at 10:02 AM, Cymon Cox wrote: >>> At first glance, something based on your change looks sensible. >>> Next time I spot the unit test failing I'll try and reproduce this. >> >> It's pretty hit or miss: I would guess once in every 10+ times I ran the >> test_NCBI_qblast I would encounter the problem. > > I can be a little more specific; out 742 calls to qblast, 75 returned the > "Your XML" error. (This was with a different ISP.) I haven't had this fail on me personally, and so have not tested the fix directly. However, on another thread Peter Saffrey reported using your patch was helpful, so I have committed the fix to CVS (without the extra print statements). As usual, let me know if I've mangled something with my editing ;) Thanks, Peter From tiagoantao at gmail.com Fri Jun 19 15:59:26 2009 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Fri, 19 Jun 2009 20:59:26 +0100 Subject: [Biopython-dev] Next release plans? In-Reply-To: <320fb6e00906190757q275b26f4p4bc6933ad99e9a49@mail.gmail.com> References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> <6d941f120906190744y3b260c50mf414311c84ad128d@mail.gmail.com> <320fb6e00906190757q275b26f4p4bc6933ad99e9a49@mail.gmail.com> Message-ID: <6d941f120906191259h62890450o6231bd4362f51ee4@mail.gmail.com> 2009/6/19 Peter : > Let's chat about the genepop code at BOSC? That was actually my plan. 
I am working to take a functional version with me. Mainly for code review and documentation purposes. With some way to compute statistics this will be quite competitive with other Bio projects and I will start to announce this in places like the evoldir mailing list. As finally it can serve a large group of users... -- "A man who dares to waste one hour of time has not discovered the value of life" - Charles Darwin From chapmanb at 50mail.com Fri Jun 19 18:39:51 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 19 Jun 2009 18:39:51 -0400 Subject: [Biopython-dev] Next release plans? In-Reply-To: <320fb6e00906190757q275b26f4p4bc6933ad99e9a49@mail.gmail.com> References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> <6d941f120906190744y3b260c50mf414311c84ad128d@mail.gmail.com> <320fb6e00906190757q275b26f4p4bc6933ad99e9a49@mail.gmail.com> Message-ID: <20090619223951.GA5133@sobchak.mgh.harvard.edu> Hi Peter; > >> Any thoughts? Does rolling this out early next week sound sensible > >> or simply too ambitious? > > > > I wont have time to roll my genepop code with statistics calculations > > in that timeframe, no chance. I will leave it for 1.52. > > We don't have to rush this - it just seemed worth having a quite fresh > release available for us to work from in the tutorial session at BOSC, > and also as a reference point the coding BoF session. > > If we do do Biopython 1.51 beta next week, with Biopython 1.51 final in > July, then I would provisionally expect to do Biopython 1.52 in the Autumn. > This could be brought forward if a large and useful contribution was added. > Let's chat about the genepop code at BOSC? This sounds like a good idea if you have time to push it next week. I'd like to get GFF parsing/writing in but need to do some refactoring before realistically proposing that, and it won't happen until after BOSC. I will do the annotation bit this weekend so it'll be in there. 
1.52 sounds like it'll be a good target for a lot of new functionality. Nice. Thanks, Brad From biopython at maubp.freeserve.co.uk Sat Jun 20 04:53:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 20 Jun 2009 09:53:15 +0100 Subject: [Biopython-dev] Next release plans? In-Reply-To: <20090619223951.GA5133@sobchak.mgh.harvard.edu> References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> <6d941f120906190744y3b260c50mf414311c84ad128d@mail.gmail.com> <320fb6e00906190757q275b26f4p4bc6933ad99e9a49@mail.gmail.com> <20090619223951.GA5133@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00906200153l14d1f398x732cb118eb83603a@mail.gmail.com> On Fri, Jun 19, 2009 at 11:39 PM, Brad Chapman wrote: > Hi Peter; > >> If we do do Biopython 1.51 beta next week, with Biopython 1.51 final in >> July, then I would provisionally expect to do Biopython 1.52 in the Autumn. >> This could be brought forward if a large and useful contribution was added. >> Let's chat about the genepop code at BOSC? > > This sounds like a good idea if you have time to push it next week. > I'd like to get GFF parsing/writing in but need to do some refactoring > before realistically proposing that, and it won't happen until after > BOSC. I will do the annotation bit this weekend so it'll be in > there. Yeah - the GFF parsing and SeqFeature/FeatureLocation stuff would benefit from some in person discussion. This is all very fresh in my mind from the EMBL/GenBank side of things as I have got GenBank feature writing in Bio.SeqIO to work (in CVS), and have been looking at replacing the current GenBank/EMBL location parser with something re based for speed (see Bug 2738 for a proof of concept, I have a more complete version in progress). > 1.52 sounds like it'll be a good target for a lot of new > functionality. Nice. 
It does look like that :) Peter From rozziite at gmail.com Sun Jun 21 10:13:40 2009 From: rozziite at gmail.com (Diana Jaunzeikare) Date: Sun, 21 Jun 2009 10:13:40 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> References: <3f6baf360906151004m4ee3595g7ac007fd218e370f@mail.gmail.com> <20090617124101.GH44321@sobchak.mgh.harvard.edu> <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> Message-ID: <4057d3bf0906210713p331cdfa4n31d76e95ee78ea85@mail.gmail.com> 2009/6/17 Eric Talevich > Hi Brad, > > Here's a mid-week update and partial response to your questions. > > *SeqRecord transformation* > > It would be nice if I could round-trip this sequence information perfectly, > so > that nothing's lost between reading and writing an arbitrary, valid > PhyloXML > file. For that to work, PhyloXML.Sequence.from_seqrec() would need to look > at > SeqRecord.features and assume that any matching keys have the appropriate > PhyloXML meaning. > > These are the keys that from_seqrec() would look for: > location > uri > annotations > domain_architecture > > Do you see any risk of collision for those names? And for serialization, > would it be unwholesome to convert Annotation and DomainArchitecture objects > to a GFF-style dict-in-a-string? e.g. annotation="ref=foo;source=bar;..." -- > it's another layer of parsing and kind of esoteric, but I can live with it. > > > *Profiling* > > Christian also suggested an option to parse just the phylogenies with a > name or id matching a given string. I like that and I don't see any problem > with extending it to clades as well. It seems like a reasonable use case to > select a sub-tree from a complete phyloXML document and treat it as a > separate > phylogeny from then on. 
This can be supported by various methods for > selecting > portions of the tree, and a method on Clade for transforming the selection > into > a new Phylogeny instance (so the original can be safely deleted). > I like this idea. I will do the same for PhyloXML implementation in BioRuby. Diana > > I did some profiling with the cProfile module, and it looks like most of > the > time is being spent instantiating Clade and Taxonomy objects. (Also, > pretty_print is hugely inefficient, but that's less important.) I think I > can > speed up parsing and reduce memory usage by pulling the from_element > methods > out of each class and using a separate Parser class to do that work. > > About the 2GB figure I gave earlier for the full NCBI taxonomy -- I was > just > looking at Ubuntu's system monitor, and Firefox and a few other things were > running at the same time, taking up about 800MB already. So the full NCBI > taxonomy actually takes up only 1.2GB or so, which isn't such a problem, > and I > think it will get smaller as I shrink down these PhyloXML classes. > > Questions: > - Do you know of a better way to profile Python code, or visualize it? > - Have you used __slots__ to optimize classes? Do you recommend it? > > And a few that don't fit anywhere else: > > - What sort of whole-tree operations would you want to do with these > objects that you can't do with a Nexus or Newick tree? What other > formats > would you want to convert to? I'm thinking of adding an Export module > later if there's time, for lossy conversions like a graph for > networkx. > > - What's the most intuitive way to display a phylogenetic tree you've > loaded into Biopython? Serialize as Nexus and open in TreeViewX? > Convert > to a graph and send to matplotlib? Or, is there a module in > Bio.Graphics > that can draw trees? (If not, should there be?) 
> > Thanks, > Eric > > > > On Wed, Jun 17, 2009 at 8:41 AM, Brad Chapman wrote: > >> Hi Eric; >> Nice update and thanks again for copying the Biopython development >> list on this. >> >> > * Added to_seqrecord and from_seqrecord methods to the >> PhyloXML.Sequence >> > class >> > -- getting Bio.SeqRecord to stand in for PhyloXML.Sequence entirely >> will >> > require some more thought >> >> I'm looking forward to seeing how you decide to go forward with >> this. For the work I do on a day to day basis, a continual >> struggle involves establishing relationships between things to >> retrieve more information. For instance, a pair of nodes on a tree >> is interesting -- how would I find papers, experiments and other >> information associated with those sequences? It seems like Accession >> and the ref attribute of Annotation help establish these >> relationships. >> >> > * Test-driven development kind of went out the window this week. >> >> Heh. It happens -- sounds sensible to have a clean up and >> documentation week this week; that will also help others who are >> interested dig into using it. >> >> > * The unit tests I do have in place give some sense of memory and CPU >> usage. >> > For the full NCBI taxonomy, memory usage climbs up above 2 GB with >> the >> > read() function, which isn't a problem on this workstation but could >> be for >> > others. >> >> Do you see an opportunity to offer iterating over clades instead of >> loading them all into memory for these larger trees? This would >> involve lazily loading subclades on request and would limit some >> functionality for querying the full tree without loading it all into >> memory. >> >> Another option is to offer some pruning ability as a tree is >> loading. For instance, if I am loading the whole NCBI taxonomy on a >> memory limited computer and only need the Angiosperm flowering plant >> part of the tree. In this case, you'd want to throw away all clades >> not under the clades of interest. 
>> >> These are probably fringe cases; just brainstorming some ideas. >> >> Thanks again, >> Brad >> > > > _______________________________________________ > Wg-phyloinformatics mailing list > Wg-phyloinformatics at nescent.org > https://lists.nescent.org/mailman/listinfo/wg-phyloinformatics > > From biopython at maubp.freeserve.co.uk Mon Jun 22 07:29:50 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 22 Jun 2009 12:29:50 +0100 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> References: <3f6baf360906151004m4ee3595g7ac007fd218e370f@mail.gmail.com> <20090617124101.GH44321@sobchak.mgh.harvard.edu> <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> Message-ID: <320fb6e00906220429m5c84ab3dxfadf1624b0a3e6e4@mail.gmail.com> On Thu, Jun 18, 2009 at 12:17 AM, Eric Talevich wrote: > > Hi Brad, > > Here's a mid-week update and partial response to your questions. > > *SeqRecord transformation* > I was just having a very brief scan over the commits that my RSS feed had - and noticed this bit: + # Unpack record.features + if record.features: + kwargs['domain_architecture'] = DomainArchitecture( + domains=[ProteinDomain({ + 'from': feat.location.start + 1, + 'to': feat.location.end + 1, + 'confidence': feat.qualifiers.get('confidence') + }, value=feat.id) + for feat in record.features], + length=len(record.seq) + ) I can understand a +/- one to the start location (moving between Python zero based counting and normal one based counts), but why would the end location also change? 
Peter From bugzilla-daemon at portal.open-bio.org Mon Jun 22 09:03:16 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 22 Jun 2009 09:03:16 -0400 Subject: [Biopython-dev] [Bug 2851] Psycopg version 1 support for BioSQL In-Reply-To: Message-ID: <200906221303.n5MD3Ggo014924@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2851 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-22 09:03 EST ------- (In reply to comment #3) > (In reply to comment #2) > > Do you think we should deprecate Biopython support for psycopg version one? > > Yes, I'd deprecate it - its no longer actively developed. Anyone wanting to > use Psycopg would surely choose version 2 (version 1 was a pain to build > anyway). > > C. > OK - I've added a deprecation warning for psycopg (v1) in Biopython 1.51 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From eric.talevich at gmail.com Mon Jun 22 09:08:05 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 22 Jun 2009 09:08:05 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <320fb6e00906220429m5c84ab3dxfadf1624b0a3e6e4@mail.gmail.com> References: <3f6baf360906151004m4ee3595g7ac007fd218e370f@mail.gmail.com> <20090617124101.GH44321@sobchak.mgh.harvard.edu> <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> <320fb6e00906220429m5c84ab3dxfadf1624b0a3e6e4@mail.gmail.com> Message-ID: <3f6baf360906220608h3c0fff98sc06bd01f0afaafef@mail.gmail.com> On Mon, Jun 22, 2009 at 7:29 AM, Peter wrote: > On Thu, Jun 18, 2009 at 12:17 AM, Eric Talevich > wrote:> *SeqRecord transformation* > > > > I was just having a very brief scan over the commits that my RSS > feed had - and noticed this bit: > > + # Unpack record.features > + if record.features: > + kwargs['domain_architecture'] = DomainArchitecture( > + domains=[ProteinDomain({ > + 'from': feat.location.start + 1, > + 'to': feat.location.end + 1, > + 'confidence': > feat.qualifiers.get('confidence') > + }, value=feat.id) > + for feat in record.features], > + length=len(record.seq) > + ) > > I can understand a +/- one to the start location (moving between > Python zero based counting and normal one based counts), but > why would the end location also change? > > Peter > Er, it wouldn't. Oops. On that note, how do you feel about specifying biology-style indexes in PhyloXML.ProteinDomain, and switching to zero-based indexes when converting to SeqFeature? Would it be better to use zero-based indexes in ProteinDomain and subtract 1 from the start position during parsing?
Thanks, Eric From biopython at maubp.freeserve.co.uk Mon Jun 22 09:38:27 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 22 Jun 2009 14:38:27 +0100 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <3f6baf360906220608h3c0fff98sc06bd01f0afaafef@mail.gmail.com> References: <3f6baf360906151004m4ee3595g7ac007fd218e370f@mail.gmail.com> <20090617124101.GH44321@sobchak.mgh.harvard.edu> <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> <320fb6e00906220429m5c84ab3dxfadf1624b0a3e6e4@mail.gmail.com> <3f6baf360906220608h3c0fff98sc06bd01f0afaafef@mail.gmail.com> Message-ID: <320fb6e00906220638u5c668583h7a5b750fce295304@mail.gmail.com> On Mon, Jun 22, 2009 at 2:08 PM, Eric Talevich wrote: >> >> I can understand a +/- one to the start location (moving between >> Python zero based counting and normal one based counts), but >> why would the end location also change? >> >> Peter >> > > Er, it wouldn't. Oops. > > On that note, how do you feel about specifying biology-style indexes in > PhyloXML.ProteinDomain, and switching to zero-based indexes when converting > to SeqFeature? Would it be better to use zero-based indexes in ProteinDomain > and subtract 1 from the start position during parsing? Personally, within any Python object representation I would expect Python style counting to be used (i.e. so the start and end could be used as is for slicing the sequence). This would be consistent with the SeqFeature usage in Biopython. However, if your object is a simple naive representation of the raw data from the file, you might arguably keep it as in the file (one based). Whatever you do, please make it very explicit in the docstring.
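To make the counting convention discussed above concrete, here is a minimal sketch of converting between one-based inclusive coordinates (as written in a file) and Python-style zero-based, half-open coordinates. The helper names `one_based_to_python` and `python_to_one_based` are hypothetical and are not part of Biopython or the PhyloXML branch; the point is only that the start shifts by one while the end stays the same.

```python
def one_based_to_python(start, end):
    """Convert a one-based, inclusive range to zero-based, half-open.

    Hypothetical helper for illustration: only the start shifts by one,
    because an inclusive end N and an exclusive end N cover the same
    last residue.
    """
    return start - 1, end


def python_to_one_based(start, end):
    """Convert a zero-based, half-open range back to one-based, inclusive."""
    return start + 1, end


# Made-up ten-residue sequence; residues 3..6 in one-based counting.
seq = "MKTAYIAKQR"
py_start, py_end = one_based_to_python(3, 6)

# The converted values can be used as-is for slicing.
domain = seq[py_start:py_end]
```

With Python-style values stored on the object, slicing needs no further arithmetic, which is the consistency with SeqFeature that the message above argues for.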
Peter From bugzilla-daemon at portal.open-bio.org Mon Jun 22 11:46:20 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 22 Jun 2009 11:46:20 -0400 Subject: [Biopython-dev] [Bug 2841] SeqFeature constructor ignores qualifiers and sub_features arguments In-Reply-To: Message-ID: <200906221546.n5MFkKCT027601@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2841 chapmanb at 50mail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #4 from chapmanb at 50mail.com 2009-06-22 11:46 EST ------- Nick, thanks for the report. Fixed in revision 1.20 of SeqFeature. Also fixed in revision 1.37 of SeqRecord, and tests added to test_SeqIO_features.py. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From eric.talevich at gmail.com Mon Jun 22 12:14:19 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 22 Jun 2009 12:14:19 -0400 Subject: [Biopython-dev] GSoC Weekly Update: PhyloXML for Biopython Message-ID: <3f6baf360906220914p79553725hacac90459f1566c9@mail.gmail.com> Hi folks, Previously (June 15-19) I: * Wrote a pretty-printer for displaying a summary of the parsed tree structure * Made all existing unit tests pass * Started unit tests for instantiation of each phyloXML object * Profiled the parser and utilities using the cProfile module on the unit test suite. 
Summarized findings on the Biopython mailing list (nothing exciting was discovered) * Used a custom warning type to indicate noncompliance with the PhyloXML spec * Separated parsing code (Parser.py) from the phyloXML class definitions (Tree.py) -- this should make Nexus/Newick compatibility feasible * Improved the conversion from PhyloXML.Sequence to Bio.SeqRecord, making better use of annotations and using SeqFeature objects to represent protein domains This week (June 22-26) I will: Work on the backlog: * Finish unittests for parsing and instantiating core elements * Compare parser performance with Bioperl and Archaeopteryx * Document results of parser testing and performance (on wiki or here) * Document basic usage and performance characteristics of the parser on the Biopython wiki Then, serialize phyloXML trees and write back to file: * Write unit tests for serialization * Write serialization methods for each class * Write a top-level function for triggering serialization of the whole hierarchy Question: Biopython has a couple of core objects that I'm reusing in my project. There was a quirk in these libraries (related to this: http://effbot.org/pyfaq/why-are-default-values-shared-between-objects.htm) that made the objects slightly more awkward to instantiate, but the issues were recently fixed. I'd like to merge these fixes soon. So, GSoC requires a tarball of the code we write at the end of the summer. Merging from upstream would bring code that I didn't write into my development tree -- which I could probably filter out with the right arguments to git-diff, but nonetheless, my project history would no longer be entirely clean. Does Google care about this? Or is it safe to go ahead and pull from the next stable release of Biopython (coming soon)?
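For readers unfamiliar with the Python behaviour behind the linked FAQ entry (default values shared between objects), here is a minimal sketch. The classes `BuggyFeature` and `SaferFeature` are hypothetical stand-ins invented for illustration; they are not the actual SeqFeature or SeqRecord code, and the None-default idiom shown is the common general remedy rather than a description of the fix that was committed.

```python
class BuggyFeature:
    """Hypothetical class showing the pitfall: the default dict is created
    once, when the function is defined, and shared by every instance."""

    def __init__(self, qualifiers={}):
        self.qualifiers = qualifiers


class SaferFeature:
    """Hypothetical class using the usual idiom: default to None and
    build a fresh dict for each instance."""

    def __init__(self, qualifiers=None):
        if qualifiers is None:
            qualifiers = {}
        self.qualifiers = qualifiers


# With the buggy version, editing one instance leaks into the other.
a = BuggyFeature()
b = BuggyFeature()
a.qualifiers["gene"] = "abc"
shared = b.qualifiers is a.qualifiers

# With the safer version, each instance gets its own dict.
c = SaferFeature()
d = SaferFeature()
c.qualifiers["gene"] = "abc"
independent = d.qualifiers is not c.qualifiers
```

A constructor that avoids the pitfall by ignoring the argument altogether leaves callers to set the attribute after construction, which may be the awkwardness referred to above.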
Cheers, Eric http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML From biopython at maubp.freeserve.co.uk Mon Jun 22 12:44:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 22 Jun 2009 17:44:29 +0100 Subject: [Biopython-dev] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <3f6baf360906220914p79553725hacac90459f1566c9@mail.gmail.com> References: <3f6baf360906220914p79553725hacac90459f1566c9@mail.gmail.com> Message-ID: <320fb6e00906220944l161201efn4f64908fde61a8bc@mail.gmail.com> On Mon, Jun 22, 2009 at 5:14 PM, Eric Talevich wrote: > Biopython has a couple of core objects that I'm reusing in my project. There > was a quirk in these libraries (related to this: > http://effbot.org/pyfaq/why-are-default-values-shared-between-objects.htm) > that made the objects slightly more awkward to instantiate, but the issues > were recently fixed. I'd like to merge these fixes soon. Brad just committed a fix for Bug 2841 - is there anything else you still think needs fixing in the latest trunk code in this area? I'm not sure what you meant by "merge" in this context. > So, GSoC requires a tarball of the code we write at the end of the summer. > Merging from upstream would bring code that I didn't write into my > development tree -- which I could probably filter out with the right > arguments to git-diff, but nonetheless, my project history would no longer > be entirely clean. Does Google care about this? Or is it safe to go ahead > and pull from the next stable release of Biopython (coming soon)? This probably depends on the GSoC rules - from the Biopython license point of view I don't see a problem with you providing a complete tarball (i.e. all of Biopython plus your code). Ideally use the latest current stable release for this (and for your timetable that will probably be Biopython 1.51 or 1.52 depending on how things go).
Peter From eric.talevich at gmail.com Mon Jun 22 13:14:49 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 22 Jun 2009 13:14:49 -0400 Subject: [Biopython-dev] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <320fb6e00906220944l161201efn4f64908fde61a8bc@mail.gmail.com> References: <3f6baf360906220914p79553725hacac90459f1566c9@mail.gmail.com> <320fb6e00906220944l161201efn4f64908fde61a8bc@mail.gmail.com> Message-ID: <3f6baf360906221014l2e730d8n960ecc28cfc021bd@mail.gmail.com> On Mon, Jun 22, 2009 at 12:44 PM, Peter wrote: > On Mon, Jun 22, 2009 at 5:14 PM, Eric Talevich > wrote: > > Biopython has a couple of core objects that I'm reusing in my project. > There > > was a quirk in these libraries (related to this: > > > http://effbot.org/pyfaq/why-are-default-values-shared-between-objects.htm) > > that made the objects slightly more awkward to instantiate, but the > issues > > were recently fixed. I'd like to merge these fixes soon. > > Brad just committed a fix for Bug 2841 - is there anything else you still > think needs fixing in the latest trunk code in this area? I'm not sure what > you meant my "merge" in this context. > I meant pull -- merging from biopython/master into my etal/phyloxml branch. > > So, GSoC requires a tarball of the code we write at the end of the > summer. > > Merging from upstream would bring code that I didn't write into my > > development tree -- which I could probably filter out with the right > > arguments to git-diff, but nonetheless, my project history would no > longer > > be entirely clean. Does Google care about this? Or is it safe to go ahead > > and pull from the next stable release of Biopython (coming soon)? > > This probably depends on the GSoC rules - from the Biopython license > point of view I don't see a problem with you providing a complete tarball > (i.e. all of Biopython plus your code). 
Ideally use the latest current > stable > release for this (and for your timetable that will probably be Biopython > 1.51 or 1.52 depending on how things go). > OK, good. The GSoC guidelines from previous years look like they're reasonably flexible, I just want to be sure. Also, here's a rant from Linus Torvalds about how to merge from upstream in Git: http://www.mail-archive.com/dri-devel at lists.sourceforge.net/msg39091.html According to that, I should pull from the Biopython master branch at the 1.51 tag when that happens (no rebasing, since I've pushed my stuff to github already), and if I want to land in time for 1.52, then pull from biopython/master then, fix any merging issues, and immediately submit a pull request on GitHub. Otherwise, pull from the 1.52 tag, rinse, repeat. (By the way, do you think GitMigration will be complete before or after GSoC ends? If we're still rocking CVS then it's less important to keep a clean branch history.) Best, Eric From biopython at maubp.freeserve.co.uk Mon Jun 22 13:30:58 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 22 Jun 2009 18:30:58 +0100 Subject: [Biopython-dev] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <3f6baf360906221014l2e730d8n960ecc28cfc021bd@mail.gmail.com> References: <3f6baf360906220914p79553725hacac90459f1566c9@mail.gmail.com> <320fb6e00906220944l161201efn4f64908fde61a8bc@mail.gmail.com> <3f6baf360906221014l2e730d8n960ecc28cfc021bd@mail.gmail.com> Message-ID: <320fb6e00906221030o31532870qfbddb02172dfec1@mail.gmail.com> On Mon, Jun 22, 2009 at 6:14 PM, Eric Talevich wrote: > > (By the way, do you think GitMigration will be complete before or after GSoC > ends? If we're still rocking CVS then it's less important to keep a clean > branch history.) > I'm sure this will be one of the topics we (Biopython) people will be discussing at BOSC this coming weekend.
http://www.open-bio.org/wiki/BOSC_2009 I'm also *hoping* to talk to one of OBF server administrators at the BOSC/ISMB conference, for a chat about our options (e.g. run git instead of CVS on the OBF servers). Nothing concrete about such a meeting yet though :( Peter From biopython at maubp.freeserve.co.uk Mon Jun 22 13:57:49 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 22 Jun 2009 18:57:49 +0100 Subject: [Biopython-dev] Next release plans? In-Reply-To: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> Message-ID: <320fb6e00906221057n10b556e2l73f645b48601dd36@mail.gmail.com> On Fri, Jun 19, 2009 at 3:40 PM, Peter wrote: > Hi all, > > We've made some good progress since Biopython 1.50, and I think > it may already be time to plan another release. ... Hi all, I've managed to tick off several items on my todo list for the release (e.g. documentation additions), and plan to tackle the release of a Biopython 1.51 beta tomorrow (Tuesday). At the earliest I'll be starting the release process in 14 hours time (9am UK time), so any last minute low risk changes can still go in. As usual I'll send out a CVS freeze email beforehand. Once the beta release is out, we'll resume taking small changes (especially for documentation additions or clarifications) with a view to releasing Biopython 1.51 final in July (probably the second week, after people get back from BOSC/ISMB).
Peter From winda002 at student.otago.ac.nz Mon Jun 22 22:30:30 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Tue, 23 Jun 2009 14:30:30 +1200 Subject: [Biopython-dev] Release announcement for 1.51b In-Reply-To: <320fb6e00906221057n10b556e2l73f645b48601dd36@mail.gmail.com> References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> <320fb6e00906221057n10b556e2l73f645b48601dd36@mail.gmail.com> Message-ID: <1245724230.4a403e467e8c1@www.studentmail.otago.ac.nz> Hi Guys, This is a draft for an announcement once the 1.51 beta is ready to roll. Any comments are welcome, it would probably be useful to have another paragraph with some more details on GenBank feature writing and the application wrappers from people that are more familiar with them than I am. Cheers, David Biopython 1.51 beta available for testing. A beta release for Biopython 1.51 is now available for download and testing. In the two months since Biopython 1.50 was released we have introduced support for writing features in GenBank files, added new application wrappers for alignment programs, extended SeqIO's support for the FASTQ format to include files created by Illumina 1.3+ and made numerous tweaks and bug fixes. All the new features have been tested by the dev team but it's possible there are cases for these tools that we haven't been able to foresee and test, especially for the GenBank feature writer. We are interested in getting feedback on the release as a whole and especially on the new features (and their documentation in the Biopython Tutorial and Cookbook). So, gather your courage, download the release, try it out and let us know what works and what doesn't through the mailing lists.
[then some links to the files to download] From amrita at iisermohali.ac.in Tue Jun 23 00:12:32 2009 From: amrita at iisermohali.ac.in (amrita at iisermohali.ac.in) Date: Tue, 23 Jun 2009 09:42:32 +0530 (IST) Subject: [Biopython-dev] biopython Message-ID: <18357.210.212.36.65.1245730352.squirrel@www.iisermohali.ac.in> Dear all, I want to know whether its possible or not to extract chemical shift information about protein from BMRB (BioMagResBank) or Ref-DB (referenced databank) using biopython programming. Amrita Kumari Research Fellow IISER Mohali Chandigarh INDIA From biopython at maubp.freeserve.co.uk Tue Jun 23 06:14:28 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 23 Jun 2009 11:14:28 +0100 Subject: [Biopython-dev] CVS mirror server issue (and github) Message-ID: <320fb6e00906230314k27a782c3hcfb5b085efccc5c2@mail.gmail.com> Hi all, The main CVS server is up and running (on dev.open-bio.org and possibly other aliases), so there shouldn't be any issues with preparing Biopython 1.51 beta. However, the OBF use a different (virtual) server for: code.open-bio.org cvs.open-bio.org cvs.biopython.org This runs ViewCVS and also hosts what I assume is a read only mirror of CVS (e.g. for anonymous access). Right now this is down. As github hasn't updated recently, I would guess Bartek's machine (which has been doing the CVS to github updates) was pointing at one of the above addresses. The OBF have been alerted to this issue, but as an alternative way to get github back up to date, Bartek might be able to switch to fetching the latest code from dev.open-bio.org instead... I hope we can fix this before BOSC, as I would like to use github branches for the hackathon session.
Peter From biopython at maubp.freeserve.co.uk Tue Jun 23 06:24:07 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 23 Jun 2009 11:24:07 +0100 Subject: [Biopython-dev] Release announcement for 1.51b In-Reply-To: <1245724230.4a403e467e8c1@www.studentmail.otago.ac.nz> References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> <320fb6e00906221057n10b556e2l73f645b48601dd36@mail.gmail.com> <1245724230.4a403e467e8c1@www.studentmail.otago.ac.nz> Message-ID: <320fb6e00906230324x6bf05a5al6d709d17de9f4065@mail.gmail.com> On Tue, Jun 23, 2009 at 3:30 AM, David Winter wrote: > Hi Guys, > > This is a draft for an announcement once the 1.51 beta is ready to roll. > Any comments are welcome, it would probably be useful to have another > paragraph with some more details on GenBank feature writing and the > application wrappers from people that are more familiar with them that I > am. > > Cheers, > David Thanks for that David, I'll use is as the basis of the release notes this afternoon (in my time zone ;) ). Peter From biopython at maubp.freeserve.co.uk Tue Jun 23 06:25:38 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 23 Jun 2009 11:25:38 +0100 Subject: [Biopython-dev] CVS freeze for Biopython 1.51 (beta) Message-ID: <320fb6e00906230325x2c05f32cp1c3910e159c8b104@mail.gmail.com> Hi all, OK, as per my email last night, please consider CVS "frozen" until further notice, while I prepare the Biopython 1.51 beta release. If there are any last minute additions email me ASAP, and I'll see what I can do. 
Thanks Peter From biopython at maubp.freeserve.co.uk Tue Jun 23 06:52:08 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 23 Jun 2009 11:52:08 +0100 Subject: [Biopython-dev] CVS mirror server issue (and github) In-Reply-To: <320fb6e00906230314k27a782c3hcfb5b085efccc5c2@mail.gmail.com> References: <320fb6e00906230314k27a782c3hcfb5b085efccc5c2@mail.gmail.com> Message-ID: <320fb6e00906230352j1735046q1dbd07cebd526313@mail.gmail.com> On Tue, Jun 23, 2009 at 11:14 AM, Peter wrote: > Hi all, > > The main CVS server is up and running (on dev.open-bio.org and > possibly other aliases), so there shouldn't be any issues with > preparing Biopython 1.51 beta. > > However, the OBF use a different (virtual) server for: > code.open-bio.org > cvs.open-bio.org > cvs.biopython.org > > This runs ViewCVS and also hosts what I assume is a read only mirror > of CVS (e.g. for anonymous access). Right now this is down. As github > hasn't updated recently, I would guess Bartek's machine (which has > been doing the CVS to github updates) was pointing at one of the above > addresses. Progress report - the virtual machine is still playing up but a software upgrade is planned. Right now: code.open-bio.org - up cvs.open-bio.org - up cvs.biopython.org - stale DNS entry (at least with my ISP) > The OBF have been altered to this issue, but as an alternative way to > get github back up to date, Bartek might be able to switch to fetching > the latest code from dev.open-bio.org instead... If you are using cvs.biopython.org right now the DNS entry is stale (but I'm sure it will be fixed shortly), so a less disruptive change would be to try pointing at cvs.open-bio.org or code.open-bio.org (i.e. use a different alias for the mirror server, rather than switching to the primary server). 
Peter From bartek at rezolwenta.eu.org Tue Jun 23 06:54:49 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 23 Jun 2009 12:54:49 +0200 Subject: [Biopython-dev] CVS mirror server issue (and github) In-Reply-To: <320fb6e00906230314k27a782c3hcfb5b085efccc5c2@mail.gmail.com> References: <320fb6e00906230314k27a782c3hcfb5b085efccc5c2@mail.gmail.com> Message-ID: <8b34ec180906230354x2baf6ed8q1d490fbe45d3d89d@mail.gmail.com> On Tue, Jun 23, 2009 at 12:14 PM, Peter wrote: > > > This runs ViewCVS and also hosts what I assume is a read only mirror > of CVS (e.g. for anonymous access). Right now this is down. As github > hasn't updated recently, I would guess Bartek's machine (which has > been doing the CVS to github updates) was pointing at one of the above > addresses. > The OBF have been alerted to this issue, but as an alternative way to > get github back up to date, Bartek might be able to switch to fetching > the latest code from dev.open-bio.org instead... > Hi, The github update process was in fact always based on dev.open-bio.org. The problem with updates was caused by my "solution" to the AUTHORS file re-occurrence in github... So it seems that when I removed the AUTHORS file, new updates could not make it to github (there was an extra commit that was not in CVS). Now, as I forced github to accept new updates, we have the AUTHORS file in github again... (even though it's not in CVS and not in my git branch which is the source of updates...). That's weird, and maybe we will be able to research this better during BOSC... To my knowledge there are no other glitches like this one, but if you see something strange, let me know... > I hope we can fix this before BOSC, as I would like to use github > branches for the hackathon session.
> the updates should be working now, the only remaining problem is the AUTHORS file, but we can take care of it during BOSC cheers Bartek From biopython at maubp.freeserve.co.uk Tue Jun 23 06:57:46 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 23 Jun 2009 11:57:46 +0100 Subject: [Biopython-dev] CVS mirror server issue (and github) In-Reply-To: <8b34ec180906230354x2baf6ed8q1d490fbe45d3d89d@mail.gmail.com> References: <320fb6e00906230314k27a782c3hcfb5b085efccc5c2@mail.gmail.com> <8b34ec180906230354x2baf6ed8q1d490fbe45d3d89d@mail.gmail.com> Message-ID: <320fb6e00906230357s61d1aeb8r5efecedc8501eaf7@mail.gmail.com> On Tue, Jun 23, 2009 at 11:54 AM, Bartek Wilczynski wrote: > > Hi, > > The github update process was in fact always based on dev.open-bio.org. > The problem with updates was caused by my "solution" to the AUTHORS > file re-occurrence in github... maybe we will be able to research this better > during BOSC... > > To my knowledge there are no other glitches like this one, but if you see > something strange, let me know... Rather odd. But as you suggest, we can have a more in depth discussion at BOSC. >> I hope we can fix this before BOSC, as I would like to use github >> branches for the hackathon session. > > the updates should be working now, the only remaining problem is the > AUTHORS file, but we can take care of it during BOSC Excellent :) Peter From bugzilla-daemon at portal.open-bio.org Tue Jun 23 07:06:37 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 23 Jun 2009 07:06:37 -0400 Subject: [Biopython-dev] [Bug 2849] PyGresql PostgreSQL driver support for BioSQL is broken In-Reply-To: Message-ID: <200906231106.n5NB6bg2016683@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2849 ------- Comment #12 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-23 07:06 EST ------- Can we close this bug now? 
From bugzilla-daemon at portal.open-bio.org Tue Jun 23 07:09:40 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 23 Jun 2009 07:09:40 -0400 Subject: [Biopython-dev] [Bug 2856] Duplicate positions for some restriction enzymes in some sequences In-Reply-To: Message-ID: <200906231109.n5NB9eIe016872@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2856 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-23 07:09 EST ------- Frédéric Sohm (author of Bio.Restriction) posted a fix on the mailing list: http://lists.open-bio.org/pipermail/biopython-dev/2009-June/006215.html I have just checked this in, marking bug as fixed.
From bugzilla-daemon at portal.open-bio.org Tue Jun 23 07:15:02 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 23 Jun 2009 07:15:02 -0400 Subject: [Biopython-dev] [Bug 2853] Support the "in" keyword with Seq + SeqRecord objects / define __contains__ method In-Reply-To: Message-ID: <200906231115.n5NBF2gf017552@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2853 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Summary|Support the "in" keyword |Support the "in" keyword |with Seq objects / define |with Seq + SeqRecord objects |__contains__ method |/ define __contains__ method ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-23 07:15 EST ------- Patch for Seq object checked in. Leaving bug open for possible similar addition to the SeqRecord object. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jun 23 07:15:39 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 23 Jun 2009 07:15:39 -0400 Subject: [Biopython-dev] [Bug 2853] Support the "in" keyword with Seq + SeqRecord objects / define __contains__ method In-Reply-To: Message-ID: <200906231115.n5NBFd7r017630@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2853 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1323 is|0 |1 obsolete| | ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-23 07:15 EST ------- (From update of attachment 1323) As noted above, this has been checked in. 
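For anyone reading the archive who is unfamiliar with how Python's "in" keyword gets hooked up, here is a minimal sketch of the idea behind this bug: defining a __contains__ method so that substring tests work on a sequence-like object. MiniSeq is a hypothetical stand-in written only for this illustration; it is not Biopython's Seq class and makes no claim about how the checked-in patch is implemented.

```python
class MiniSeq:
    """Hypothetical sequence wrapper, for illustration only (not Bio.Seq.Seq)."""

    def __init__(self, data):
        # Store the letters as a plain Python string.
        self._data = str(data)

    def __str__(self):
        return self._data

    def __contains__(self, item):
        # Python calls this for "item in obj"; delegate to string containment
        # so that both plain strings and other MiniSeq objects can be tested.
        return str(item) in self._data


seq = MiniSeq("GATCGATGGGCC")
has_start = "ATG" in seq            # substring given as a plain string
has_poly_t = MiniSeq("TTTT") in seq  # substring given as another MiniSeq
```

Converting the argument with str() is just one possible design choice; it lets the same method accept either a string or another sequence object.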
From bugzilla-daemon at portal.open-bio.org Tue Jun 23 07:17:11 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 23 Jun 2009 07:17:11 -0400 Subject: [Biopython-dev] [Bug 2849] PyGresql PostgreSQL driver support for BioSQL is broken In-Reply-To: Message-ID: <200906231117.n5NBHBGD017709@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2849 ------- Comment #13 from cymon.cox at gmail.com 2009-06-23 07:17 EST ------- (In reply to comment #12) > Can we close this bug now? Sure. Cheers, C. From bugzilla-daemon at portal.open-bio.org Tue Jun 23 07:17:34 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 23 Jun 2009 07:17:34 -0400 Subject: [Biopython-dev] [Bug 2375] Coalescent support through Simcoal2 In-Reply-To: Message-ID: <200906231117.n5NBHYgq017747@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2375 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Severity|enhancement |normal ------- Comment #26 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-23 07:17 EST ------- Switching this from an enhancement (which is done) to a normal bug for the remaining issue, removing the workaround in Bio/PopGen/SimCoal/__init__.py.
From bugzilla-daemon at portal.open-bio.org Tue Jun 23 07:22:06 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 23 Jun 2009 07:22:06 -0400 Subject: [Biopython-dev] [Bug 2849] PyGresql PostgreSQL driver support for BioSQL is broken In-Reply-To: Message-ID: <200906231122.n5NBM6rL018081@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2849 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #14 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-23 07:22 EST ------- Marking as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Tue Jun 23 09:58:28 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 23 Jun 2009 14:58:28 +0100 Subject: [Biopython-dev] CVS freeze for Biopython 1.51 (beta) In-Reply-To: <320fb6e00906230325x2c05f32cp1c3910e159c8b104@mail.gmail.com> References: <320fb6e00906230325x2c05f32cp1c3910e159c8b104@mail.gmail.com> Message-ID: <320fb6e00906230658g27301e6asca055c54dee45ab1@mail.gmail.com> On Tue, Jun 23, 2009 at 11:25 AM, Peter wrote: > Hi all, > > OK, as per my email last night, please consider CVS "frozen" > until further notice, while I prepare the Biopython 1.51 beta > release. OK, the release is done (except for the announcements) and tagged in CVS. Little low impact things can be added to CVS again now. 
You should be able to download it already - let me know if there are any problems with the links or the files themselves: http://biopython.org/wiki/Download http://biopython.org/DIST/ If we make any notable improvements to the documentation between now and the release of Biopython 1.51 final, we can also update the online version of the tutorial. Peter From biopython at maubp.freeserve.co.uk Tue Jun 23 10:35:09 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 23 Jun 2009 15:35:09 +0100 Subject: [Biopython-dev] SeqRecord in Tutorial (was GSoC and PhyloXML) In-Reply-To: <320fb6e00906230734j3c921fcew41737bf0efc4df61@mail.gmail.com> References: <320fb6e00906181505r32d7df31w7287bdb69753c49f@mail.gmail.com> <3f6baf360906181857h43686cfawb850266eb5153af8@mail.gmail.com> <320fb6e00906190218m4619f624ncfc9d1f8bccbbad2@mail.gmail.com> <320fb6e00906230734j3c921fcew41737bf0efc4df61@mail.gmail.com> Message-ID: <320fb6e00906230735h153fb792w68bff43cf77eec23@mail.gmail.com> On Fri, Jun 19, 2009 at 10:18 AM, Peter wrote: > On Fri, Jun 19, 2009 at 2:57 AM, Eric Talevich wrote: > > On Thu, Jun 18, 2009 at 6:05 PM, Peter wrote: > >> On Thu, Jun 18, 2009 at 10:49 PM, Eric Talevich wrote: > >> > > >> > I didn't notice any typos other than Python being consistently > >> > lowercase, which I assume is how the author likes it. > >> > >> I was aiming for consistency, with no strong preference - at the time > >> there were more uses of "python" than "Python" so I picked that. We > >> can change it easily enough - does anyone care either way? > > > > Python.org capitalizes it. Shrug. > > Maybe we should use "Python" then. During the polishing for Biopython 1.51b I went through and made all the appropriate "python" uses into "Python". There are a couple of special cases like command line snippets where it should be lower case.
Peter From biopython at maubp.freeserve.co.uk Tue Jun 23 10:40:32 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 23 Jun 2009 15:40:32 +0100 Subject: [Biopython-dev] Biopython beta releases on PyPi? Message-ID: <320fb6e00906230740x6e4b76c8v7e6c0a67c662b751@mail.gmail.com> Hi Brad (and others), Do you think we should push beta releases of Biopython via PyPi? http://pypi.python.org/pypi/biopython Peter From biopython at maubp.freeserve.co.uk Tue Jun 23 12:05:53 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 23 Jun 2009 17:05:53 +0100 Subject: [Biopython-dev] Biopython 1.51 beta released Message-ID: <320fb6e00906230905i5b5a2364i7ae9c4c96e4ae50d@mail.gmail.com> Dear all, A beta release for Biopython 1.51 is now available for download and testing. In the two months since Biopython 1.50 was released, we have introduced support for writing features in GenBank files using Bio.SeqIO, extended SeqIO's support for the FASTQ format to include files created by Illumina 1.3+, added a new set of application wrappers for alignment programs, and made numerous tweaks and bug fixes. All the new features have been tested by the dev team but it's possible there are cases that we haven't been able to foresee and test, especially for the GenBank feature writer (as there are just so many possible odd fuzzy feature locations). Note that as previously announced, Biopython no longer supports Python 2.3, and our deprecated parsing infrastructure (Martel and Bio.Mindy) has been removed. Source distributions and Windows installers are available from the downloads page on the Biopython website.
http://biopython.org/wiki/Download We are interested in getting feedback on the beta release as a whole, but especially on the new features and the Biopython Tutorial and Cookbook: http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf So, gather your courage, download the release, try it out and let us know what works and what doesn't through the mailing lists (or bugzilla). -Peter, on behalf of the Biopython developers P.S. This news post is online at http://news.open-bio.org/news/2009/06/biopython-151-beta-released/ You may wish to subscribe to our news feed. For RSS links etc, see: http://biopython.org/wiki/News Biopython news is also on twitter: http://twitter.com/biopython Thanks also to David Winter for coming up with the draft release message. From biopython at maubp.freeserve.co.uk Wed Jun 24 05:43:51 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 24 Jun 2009 10:43:51 +0100 Subject: [Biopython-dev] biopython In-Reply-To: <18357.210.212.36.65.1245730352.squirrel@www.iisermohali.ac.in> References: <18357.210.212.36.65.1245730352.squirrel@www.iisermohali.ac.in> Message-ID: <320fb6e00906240243x34cf22c4y3742c1cee84de6e9@mail.gmail.com> On Tue, Jun 23, 2009 at 5:12 AM, wrote: > > Dear all, > > I want to know whether its possible or not to extract chemical shift > information about protein from BMRB (BioMagResBank) or Ref-DB > (referenced databank) using biopython programming. > > Amrita Kumari I'd replied to Amrita directly, and suggested emailing the discussion list in case anyone had any suggestions. I don't think there is anything already included with Biopython for chemical shifts from BMRB (BioMagResBank) or Ref-DB (referenced databank), but I don't work with NMR or 3D structures. http://www.bmrb.wisc.edu/ - BioMagResBank http://redpoll.pharmacy.ualberta.ca/RefDB/ - Ref-DB Any ideas?
Peter From chapmanb at 50mail.com Wed Jun 24 08:25:17 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 24 Jun 2009 08:25:17 -0400 Subject: [Biopython-dev] Biopython beta releases on PyPi? In-Reply-To: <320fb6e00906230740x6e4b76c8v7e6c0a67c662b751@mail.gmail.com> References: <320fb6e00906230740x6e4b76c8v7e6c0a67c662b751@mail.gmail.com> Message-ID: <20090624122517.GH41327@sobchak.mgh.harvard.edu> Hi Peter; > Do you think we should push beta releases of Biopython via PyPi? > http://pypi.python.org/pypi/biopython I haven't been doing this. Conceptually, PyPi is more for users who want to install the latest thing that just works without too much thought. I suppose beta releases don't quite fall into that category since they are meant for testing, so let's just push final releases there. Brad From bugzilla-daemon at portal.open-bio.org Thu Jun 25 11:39:24 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 25 Jun 2009 11:39:24 -0400 Subject: [Biopython-dev] [Bug 2738] Speed up GenBank parsing, in particular location parsing In-Reply-To: Message-ID: <200906251539.n5PFdOZZ023918@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2738 ------- Comment #12 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-25 11:39 EST ------- Created an attachment (id=1329) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1329&action=view) Patch for Bio/GenBank/__init__.py to handle most locations with re A more complicated version of the previous patch, covering a wider range of features. This is a work in progress but I wanted to stash it somewhere online as a snapshot of my progress - I should try doing this in github in future ;) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From cymon.cox at googlemail.com Sun Jun 28 10:10:14 2009 From: cymon.cox at googlemail.com (Cymon Cox) Date: Sun, 28 Jun 2009 15:10:14 +0100 Subject: [Biopython-dev] Bio.Sequencing Message-ID: <7265d4f0906280710r32561593l2aca26abaa9c135e@mail.gmail.com> Hi Peter, What is the long-term future of Bio.Sequencing? With the (very cool) QualityIO stuff now in SeqIO, the Phd module looks a bit out of place - is there any reason not to move both Ace and Phd code to SeqIO, i.e. in the AceIO and PhdIO interfaces? I ask because I've written a Phd writer class for the SeqIO interface and initially added it to PhdIO. Cheers, C. -- From p.j.a.cock at googlemail.com Mon Jun 29 03:23:06 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 29 Jun 2009 08:23:06 +0100 Subject: [Biopython-dev] Bio.Sequencing In-Reply-To: <7265d4f0906280710r32561593l2aca26abaa9c135e@mail.gmail.com> References: <7265d4f0906280710r32561593l2aca26abaa9c135e@mail.gmail.com> Message-ID: <320fb6e00906290023q57f8ccc3sde16576e23766bc5@mail.gmail.com> On Sun, Jun 28, 2009 at 3:10 PM, Cymon Cox wrote: > Hi Peter, > > What is the long-term future of Bio.Sequencing? With the (very cool) > QualityIO stuff now in SeqIO, the Phd module looks a bit out of place - is > there any reason not to move both Ace and Phd code to SeqIO ie > in the AceIO and PhdIO interfaces? In the case of FASTQ and QUAL files, everything gets stored in the SeqRecord, so I didn't see any reason to have something in Bio.Sequencing (although perhaps things like mapping between the PHRED and Solexa scores could live there, along with the basic parser used internally giving string tuples - does this sound worth doing?). As you know, currently the SeqIO "ace" and "phd" are simply built on top of Bio.Sequencing.Ace and Bio.Sequencing.Phd, and only transform a subset of the data into a SeqRecord object.
This also describes the SwissProt parsing now - the general model is we have a SeqRecord interface (which may not cover all the details), and underlying, more file-format-specific objects used to hold the data. > I ask because Ive written a Phd writer class for the SeqIO interface > and initially added it to PhdIO. Do you want to file an enhancement bug, and then either upload the code to bugzilla, or give a link to a github branch so we can have a look? If your writer takes SeqRecord objects, then I think it would make sense to go in Bio.SeqIO.PhdIO (as I have done for GenBank, although this is in part because I have some intentions to simplify the Bio.GenBank code, and having another writer with another API in there would make this more complicated). It would also make sense to have a writer in Bio.Sequencing.Phd taking its Record objects (and have Bio.SeqIO turn SeqRecord objects into Phd Record objects, and call that). Perhaps this would be a better idea as it is more flexible, but it would be more work, and could be slower ;) Peter From cy at cymon.org Mon Jun 29 03:49:26 2009 From: cy at cymon.org (Cymon Cox) Date: Mon, 29 Jun 2009 08:49:26 +0100 Subject: [Biopython-dev] Bio.Sequencing In-Reply-To: <320fb6e00906290023q57f8ccc3sde16576e23766bc5@mail.gmail.com> References: <7265d4f0906280710r32561593l2aca26abaa9c135e@mail.gmail.com> <320fb6e00906290023q57f8ccc3sde16576e23766bc5@mail.gmail.com> Message-ID: <7265d4f0906290049n65e63227s7537b2e83525ef3d@mail.gmail.com> 2009/6/29 Peter Cock > On Sun, Jun 28, 2009 at 3:10 PM, Cymon Cox > wrote: > > Hi Peter, > > > > What is the long-term future of Bio.Sequencing? With the (very cool) > > QualityIO stuff now in SeqIO, the Phd module looks a bit out of place - is > > there any reason not to move both Ace and Phd code to SeqIO ie > > in the AceIO and PhdIO interfaces?
> > In the case of FASTQ and QUAL files, everything gets stored in > the SeqRecord, so I didn't see any reason to have something in > Bio.Sequencing (although perhaps things like mapping between > the PHRED and Solexa scores could live there, along with the > basic parser used internally giving string tuples - does this sound > worth doing?). > > As you know, currently the SeqIO "ace" and "phd" are simply built > on top of Bio.Sequencing.Ace and Bio.Sequencing.PhD, and only > transforms a subset of the data into a SeqRecord object. Yes, but now that per_letter_annotation's are in SeqRecord there is no reason not to store the Phd 'phred_qualities' and 'peak_locations', so all the Phd file attributes can be stored in a SeqRecord - I altered the parser to do this. > This also > describes the SwissProt parsing now - the general model is we have > a SeqRecord interface (which may not cover all the details), and an > underlying more file format specific objects used to hold the data. > > > I ask because Ive written a Phd writer class for the SeqIO interface > > and initially added it to PhdIO. > > Do you want to file an enhancement bug, and then either upload > the code to bugzilla, or give a link to a github branch to we can > have a look? > > If your writer takes SeqRecord objects, then I think it would make > sense to go in Bio.SeqIO.PhdIO (as I have done for GenBank, > although this is in part because I have some intentions to simplify > the Bio.GenBank code, and having another writer with a another > API in there would make this more complicated). > > It would also make sense to have a writer in Bio.Sequencing.Phd > taking its Record objects (and have Bio.SeqIO turn SeqRecord > objects into PhD Record objects, and call that). Perhaps this would > be a better idea as it is more flexible, but it would be more work, > and could be slower ;) Yes, this was my concern. 
As I have it now, the parser code is in Bio.Sequencing.Phd and is called by the Bio.SeqIO.PhdIO, but the writer code is in PhdIO. I could move the write_record to the Phd module for symmetry, but as all the Phd attributes can be stored in SeqRecord, the Phd parser code could just as rationally be moved to PhdIO. Cheers, C. -- From p.j.a.cock at googlemail.com Mon Jun 29 03:58:00 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 29 Jun 2009 08:58:00 +0100 Subject: [Biopython-dev] Bio.Sequencing In-Reply-To: <7265d4f0906290049n65e63227s7537b2e83525ef3d@mail.gmail.com> References: <7265d4f0906280710r32561593l2aca26abaa9c135e@mail.gmail.com> <320fb6e00906290023q57f8ccc3sde16576e23766bc5@mail.gmail.com> <7265d4f0906290049n65e63227s7537b2e83525ef3d@mail.gmail.com> Message-ID: <320fb6e00906290058s2d8fd93aw435eb803adbc5ac8@mail.gmail.com> >> As you know, currently the SeqIO "ace" and "phd" are simply built >> on top of Bio.Sequencing.Ace and Bio.Sequencing.PhD, and only >> transforms a subset of the data into a SeqRecord object. > > Yes, but now that per_letter_annotation's are in SeqRecord there is no > reason not to store the Phd 'phred_qualities' and 'peak_locations', so all > the Phd file attributes can be stored in a SeqRecord - I altered the parser > to do this. Cool - that sounds like it might be worth including in Biopython 1.51 final (if you think it is ready for prime time). If as you say that your extended Bio.SeqIO.PhdIO parse covers all the data in the PHRED file, then perhaps we could consider deprecating Bio.Sequencing.Phd in the future. >> > I ask because Ive written a Phd writer class for the SeqIO interface >> > and initially added it to PhdIO. >> >> Do you want to file an enhancement bug, and then either upload >> the code to bugzilla, or give a link to a github branch to we can >> have a look? 
>> >> If your writer takes SeqRecord objects, then I think it would make >> sense to go in Bio.SeqIO.PhdIO (as I have done for GenBank, >> although this is in part because I have some intentions to simplify >> the Bio.GenBank code, and having another writer with a another >> API in there would make this more complicated). >> >> It would also make sense to have a writer in Bio.Sequencing.Phd >> taking its Record objects (and have Bio.SeqIO turn SeqRecord >> objects into PhD Record objects, and call that). Perhaps this would >> be a better idea as it is more flexible, but it would be more work, >> and could be slower ;) > > Yes, this was my concern. As I have it now, the parser code is in > Bio.Sequencing.Phd and is called by the Bio.SeqIO.PhdIO, but > the writer code is in PhdIO. I could move the write_record to the > Phd module for symmetry, but as all the Phd attributes can be > stored in SeqRecord, the Phd parser code could just as rationally > be moved to PhdIO. For now, having the writer in Bio.SeqIO.PhdIO seems fine. We could as a second step make the Bio.SeqIO.PhdIO parse self contained, and as a third step, declare Bio.Sequencing.Phd obsolete. Peter From cy at cymon.org Mon Jun 29 04:09:22 2009 From: cy at cymon.org (Cymon Cox) Date: Mon, 29 Jun 2009 09:09:22 +0100 Subject: [Biopython-dev] Bio.Sequencing In-Reply-To: <320fb6e00906290058s2d8fd93aw435eb803adbc5ac8@mail.gmail.com> References: <7265d4f0906280710r32561593l2aca26abaa9c135e@mail.gmail.com> <320fb6e00906290023q57f8ccc3sde16576e23766bc5@mail.gmail.com> <7265d4f0906290049n65e63227s7537b2e83525ef3d@mail.gmail.com> <320fb6e00906290058s2d8fd93aw435eb803adbc5ac8@mail.gmail.com> Message-ID: <7265d4f0906290109l7a2c77a0radc7aad20c4263cc@mail.gmail.com> 2009/6/29 Peter Cock > >> It would also make sense to have a writer in Bio.Sequencing.Phd > >> taking its Record objects (and have Bio.SeqIO turn SeqRecord > >> objects into PhD Record objects, and call that). 
Perhaps this would > >> be a better idea as it is more flexible, but it would be more work, > >> and could be slower ;) > > > > Yes, this was my concern. As I have it now, the parser code is in > > Bio.Sequencing.Phd and is called by the Bio.SeqIO.PhdIO, but > > the writer code is in PhdIO. I could move the write_record to the > > Phd module for symmetry, but as all the Phd attributes can be > > stored in SeqRecord, the Phd parser code could just as rationally > > be moved to PhdIO. > > For now, having the writer in Bio.SeqIO.PhdIO seems fine. We > could as a second step make the Bio.SeqIO.PhdIO parse self > contained, and as a third step, declare Bio.Sequencing.Phd > obsolete. This sounds like a plan. I'll try and get it all together and push it to github sometime today. Cheers, C. -- From bugzilla-daemon at portal.open-bio.org Mon Jun 29 08:08:14 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 29 Jun 2009 08:08:14 -0400 Subject: [Biopython-dev] [Bug 2865] New: Phd writer class for SeqIO Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2865 Summary: Phd writer class for SeqIO Product: Biopython Version: Not Applicable Platform: PC OS/Version: Linux Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: cymon.cox at gmail.com Attached is a patch to add a SeqIO write interface for phd files, plus unittests. Also to be found on http://github.com/cymon/biopython-github-master/tree/assembly C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon Jun 29 08:09:15 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 29 Jun 2009 08:09:15 -0400 Subject: [Biopython-dev] [Bug 2865] Phd writer class for SeqIO In-Reply-To: Message-ID: <200906291209.n5TC9Fj8008875@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2865 ------- Comment #1 from cymon.cox at gmail.com 2009-06-29 08:09 EST ------- Created an attachment (id=1333) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1333&action=view) Phd writer and unittest patch -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 29 08:37:40 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 29 Jun 2009 08:37:40 -0400 Subject: [Biopython-dev] [Bug 2866] New: SQLite support for BioSQL Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2866 Summary: SQLite support for BioSQL Product: Biopython Version: Not Applicable Platform: PC OS/Version: FreeBSD Status: NEW Severity: normal Priority: P2 Component: BioSQL AssignedTo: biopython-dev at biopython.org ReportedBy: chapmanb at 50mail.com Attached is a git patch to add SQLite support to the latest BioSQL. I've tested this with SQLite and MySQL on my FreeBSD machine and both pass the test suite. Cymon, Peter, and anyone else who is interested -- it would be great if you could check on PostgreSQL and the various setups y'all have been using. A few notes: - SQLite does not support FOREIGN KEY constraints so I have dropped those from the creation SQL. - get_subseq_as_string used SUBSTRING, which does not seem to be supported on SQLite. I switched to SUBSTR which I believe should be general. 
- SQLite gives back unicode, which I explicitly convert to strings to be more compatible with what was done previously. If it's easier to check this on GitHub, I can do that when I'm back home. I branched from the main trunk and internet is too slow to learn the right way to do it now. Thanks. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 29 08:42:23 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 29 Jun 2009 08:42:23 -0400 Subject: [Biopython-dev] [Bug 2866] SQLite support for BioSQL In-Reply-To: Message-ID: <200906291242.n5TCgNnd011737@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2866 ------- Comment #1 from chapmanb at 50mail.com 2009-06-29 08:42 EST ------- Created an attachment (id=1334) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1334&action=view) BioSQL SQLite support -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 29 09:50:20 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 29 Jun 2009 09:50:20 -0400 Subject: [Biopython-dev] [Bug 2866] SQLite support for BioSQL In-Reply-To: Message-ID: <200906291350.n5TDoKkl018215@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2866 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-29 09:50 EST ------- Wow - that was quick Brad, we were only chatting about this yesterday! Have you filed an enhancement bug for BioSQL itself for adding this new schema? Hilmar will probably have some feedback on the foreign key stuff. 
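As a quick check of the SUBSTR note in the patch description for Bug 2866, here is a minimal sketch using Python's built-in sqlite3 module. The table name `demo_seq` and the helper `subseq_as_string` are made up for illustration; they are not the BioSQL schema or the code in the attached patch. The only point shown is that SUBSTR takes a 1-based start and a length, so a Python-style half-open slice has to be converted.

```python
import sqlite3


def subseq_as_string(conn, seq_id, start, end):
    """Return seq[start:end] (0-based, end excluded) using SQL SUBSTR.

    Hypothetical helper for illustration only; `demo_seq` is not a BioSQL table.
    """
    length = end - start
    # SUBSTR(text, first, length) counts positions from 1, hence start + 1.
    row = conn.execute(
        "SELECT SUBSTR(seq, ?, ?) FROM demo_seq WHERE id = ?",
        (start + 1, length, seq_id),
    ).fetchone()
    return row[0]


# In-memory database with a single made-up sequence row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_seq (id INTEGER PRIMARY KEY, seq TEXT)")
conn.execute("INSERT INTO demo_seq (id, seq) VALUES (1, 'ACGTACGT')")

# Should match the plain Python slice 'ACGTACGT'[2:6].
print(subseq_as_string(conn, 1, 2, 6))
```

Under Python 3 the value returned by sqlite3 for a TEXT column is already a str, so no explicit conversion step appears in this sketch.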
Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From p.j.a.cock at googlemail.com Mon Jun 29 10:11:15 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 29 Jun 2009 15:11:15 +0100 Subject: [Biopython-dev] Bio.Sequencing In-Reply-To: <7265d4f0906290109l7a2c77a0radc7aad20c4263cc@mail.gmail.com> References: <7265d4f0906280710r32561593l2aca26abaa9c135e@mail.gmail.com> <320fb6e00906290023q57f8ccc3sde16576e23766bc5@mail.gmail.com> <7265d4f0906290049n65e63227s7537b2e83525ef3d@mail.gmail.com> <320fb6e00906290058s2d8fd93aw435eb803adbc5ac8@mail.gmail.com> <7265d4f0906290109l7a2c77a0radc7aad20c4263cc@mail.gmail.com> Message-ID: <320fb6e00906290711o7d503679m593cdb1df0d6c701@mail.gmail.com> Hi Cymon, I've checked in some of your patch on Bug 2865 already, recording the per-letter-annotation which I was planning to do but hadn't got round to yet - thank you: http://bugzilla.open-bio.org/show_bug.cgi?id=2865 This means with the latest code you can now use Biopython to convert a PHD output file into a FASTQ file (or a QUAL file) which could be handy for doing meta assemblies. I did relatively recently update SeqIO for the Ace format to record the qualities - but there is an issue here. Only the nucleotides get given quality scores, but not the insertions (gaps, shown as "*" in the Ace file consensus sequence). Currently the Bio.SeqIO parser gives the gapped sequence. This means to record the quality scores, we need to give some null value to the gap characters (and I used None). What I am wondering about is making the Bio.SeqIO Ace parser just return the ungapped sequence (and the associated PHRED quality scores). This means we could then convert Ace files into FASTQ or QUAL files, and also a simple Ace to FASTA conversion would give something useful for downstream analysis (the ungapped consensus). 
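To make the gap issue concrete, here is a rough sketch in plain Python of what "returning the ungapped sequence and its qualities" would involve. It assumes the gapped consensus is a string using "*" for gaps and the qualities are a list of the same length with None at the gap positions, as described above. The function name `ungap_with_qualities` is hypothetical and is not part of Bio.SeqIO or Bio.Sequencing.

```python
def ungap_with_qualities(gapped_seq, qualities, gap="*"):
    """Return (ungapped sequence, qualities) with the gap positions removed.

    Hypothetical helper for illustration; not a Biopython function.
    """
    if len(gapped_seq) != len(qualities):
        raise ValueError("sequence and quality list differ in length")
    # Keep only the (letter, quality) pairs whose letter is not a gap.
    kept = [(base, qual) for base, qual in zip(gapped_seq, qualities) if base != gap]
    ungapped = "".join(base for base, qual in kept)
    ungapped_quals = [qual for base, qual in kept]
    return ungapped, ungapped_quals


seq, quals = ungap_with_qualities("AC*GT", [30, 31, None, 32, 33])
print(seq, quals)
```

After this step every remaining letter has an integer quality, which is what a FASTQ or QUAL writer would need.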
The gaps *are* important if you want to see how the consensus was built up - in which case it makes sense to think about each Ace contig as a kind of multiple sequence alignment. See this earlier discussion with David Winter: http://lists.open-bio.org/pipermail/biopython/2009-April/005125.html http://lists.open-bio.org/pipermail/biopython/2009-April/005128.html Any thoughts? Peter From bugzilla-daemon at portal.open-bio.org Mon Jun 29 10:30:29 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 29 Jun 2009 10:30:29 -0400 Subject: [Biopython-dev] [Bug 2866] SQLite support for BioSQL In-Reply-To: Message-ID: <200906291430.n5TEUTa2021995@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2866 ------- Comment #3 from cymon.cox at gmail.com 2009-06-29 10:30 EST ------- (In reply to comment #1) > Created an attachment (id=1334) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1334&action=view) [details] > BioSQL SQLite support > This patch works for me with SQLite (Python 2.5.2)(both TESTDBs), Psycopg, Psycopg2, and Pgdb on Ubuntu 9.04 - PostgreSQL 8.3. Cheers, C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From jblanca at btc.upv.es Mon Jun 29 10:25:30 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Mon, 29 Jun 2009 16:25:30 +0200 Subject: [Biopython-dev] [Biopython] Bio.Sequencing.Ace In-Reply-To: <761477.83949.qm@web65501.mail.ac4.yahoo.com> References: <761477.83949.qm@web65501.mail.ac4.yahoo.com> Message-ID: <200906291625.30081.jblanca@btc.upv.es> Hi: I'm doing similar things and I took a slightly different approach. Instead of using the ace parser api I've created a contig class and my parsers return contig objects. 
You can take a look at the code at:
http://bioinf.comav.upv.es/svn/biolib/biolib/src/
(By the way, if you find any code in that library interesting for
Biopython I would be delighted to add it to Biopython.)

In my library parsing an ace or a caf file works like:

>>> fhand = open('example3.ace', 'r')
>>> ace_parser = get_parser(fhand, format='ace')
>>> for contig in ace_parser:
...     print contig

You are also able to get a particular contig by giving its name.

>>> ace_parser.contigs('contig_name')

The contigs are like a list of sequences with a consensus property.

>>> contig[0]          # the first sequence
>>> contig[1]          # the second sequence
>>> contig.consensus   # the consensus

The sequence and quality for every read are also accessible:

>>> read0 = contig[0]
>>> read0.seq
>>> read0.qual

There are in fact two different coordinate systems, the contig one and the
read one (because every read starts in a different place and it can be
reversed). To access the read in its own coordinate system you have to ask
for the sequence property of the read.

In fact the Contig and the LocatableSequence classes are capable of doing
more things. For instance the contig accepts 2-D indexes and returns new
contigs, columns, rows, subcontigs, etc. If you find those classes
interesting take a look at the code and also at the tests. There is not
much documentation, but many tests.

Best regards,

Jose Blanca

On Monday 29 June 2009 12:49:39 Fungazid wrote:
> David hi,
>
> Many many thanks for the diagram.
> I'm not sure I understand the differences between
> contig.af[readn].padded_start, and contig.bs[readn].padded_start, and
> other unknown parameters.
> I'll try to compare to the Ace format
>
> Avi
>
> --- On Mon, 6/29/09, Peter wrote:
> > From: Peter
> > Subject: Re: [Biopython] Bio.Sequencing.Ace
> > To: "David Winter"
> > Cc: biopython at lists.open-bio.org
> > Date: Monday, June 29, 2009, 10:26 AM
> > On Mon, Jun 29, 2009 at 6:19 AM, David Winter wrote:
> > > Quoting Peter :
> > >> The top level properties are simple enough - but I find drilling
> > >> down into the reads a bit more tricky. In general the Ace parser is
> > >> a bit non-obvious without knowing the Ace format. Having some
> > >> __str__ and __repr__ methods defined on the objects returned
> > >> would be very nice - I may get time to work on this later this year.
> > >> Anyone else interested in this drop us an email.
> > >>
> > >> Peter
> > >
> > > I had a scrawled diagram of the contig class next to me when I was using
> > > it more frequently - it was easy enough to reproduce digitally
> > >
> > > http://biopython.org/wiki/Ace_contig_class
> > >
> > > Hopefully it helps make sense of where all the data is. I've added a couple
> > > of very brief examples there for now - will expand it when I get a chance.
> > >
> > > David
> >
> > This could get turned into a docstring/doctest for the Ace
> > parser :)
> >
> > Peter
> > _______________________________________________
> > Biopython mailing list - Biopython at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

-- Jose M.
Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From p.j.a.cock at googlemail.com Mon Jun 29 10:53:12 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 29 Jun 2009 15:53:12 +0100 Subject: [Biopython-dev] [Biopython] Bio.Sequencing.Ace In-Reply-To: <200906291625.30081.jblanca@btc.upv.es> References: <761477.83949.qm@web65501.mail.ac4.yahoo.com> <200906291625.30081.jblanca@btc.upv.es> Message-ID: <320fb6e00906290753w6e2a56d8v5748aa644da051d8@mail.gmail.com> On 6/29/09, Jose Blanca wrote: > Hi: > I'm doing similar things and I took a slightly different approach. Instead > of using the ace parser api I've created a contig class and my parsers > return contig objects. You can take a look at the code at: > http://bioinf.comav.upv.es/svn/biolib/biolib/src/ Hi Jose, Are you using Bio.Sequencing.Ace in your code, or did you write a whole new parser instead? Now that I have been using Ace files in my own work, I've been meaning to look over your stuff. In some ways, a contig class can be seen as a generalisation of a multiple sequence alignment class. Certainly this is something we should improve in Biopython (as you might gather from some of the enhancement bugs on bugzilla, I have lots of ideas for the current alignment class), and I'm sure you have some great ideas too. 
Peter

From jblanca at btc.upv.es Mon Jun 29 11:16:06 2009
From: jblanca at btc.upv.es (Jose Blanca)
Date: Mon, 29 Jun 2009 17:16:06 +0200
Subject: [Biopython-dev] [Biopython] Bio.Sequencing.Ace
In-Reply-To: <320fb6e00906290753w6e2a56d8v5748aa644da051d8@mail.gmail.com>
References: <761477.83949.qm@web65501.mail.ac4.yahoo.com> <200906291625.30081.jblanca@btc.upv.es> <320fb6e00906290753w6e2a56d8v5748aa644da051d8@mail.gmail.com>
Message-ID: <200906291716.06607.jblanca@btc.upv.es>

> Are you using Bio.Sequencing.Ace in your code, or did you write a whole
> new parser instead?

I wrote one, because I wanted to be able to get one particular contig or just
the contig or the read names. But I don't think that is a problem. I guess
that the Biopython parser could be easily adapted to that.

> Now that I have been using Ace files in my own work, I've been meaning
> to look over your stuff. In some ways, a contig class can be seen as a
> generalisation of a multiple sequence alignment class. Certainly this is
> something we should improve in Biopython (as you might gather from
> some of the enhancement bugs on bugzilla, I have lots of ideas for the
> current alignment class), and I'm sure you have some great ideas too.

I think that here is the main deviation from Biopython. The contig class is
similar to an alignment class, in fact my contig classes should be compatible
with your new alignment proposal API.

alignment
seq1 +++++++++>
seq2 +++++++++>
seq3 +++++++++>

contig
seq1 ++++>
seq2     +++++>
seq3         ++++++>

Basically every read has a different coordinate system in the contig case.
What I've done is to create a class named LocatableSequence that is a
container for sequence objects. It works like:

>>> seq1 = 'ATCG'
>>> locseq1 = locate_sequence(seq1, location=10)
>>> locseq1[10] == 'A'

In that way the contig is a list of LocatableSequences and the coordinate
system transformations are done by the LocatableSequences, not by the contig.
The LocatableSequences also allow for masks.
The LocatableSequence works with any sequence like objects, strs, Seq, SeqRecord, lists, etc. There's also a Location class that represents a fragment of a sequence. My Location class is more limited than the one in the Biopython SeqFeature. In my case the start and end should be integers. I use this class to represent the region not masked in the sequence and the Location of the sequence inside the LocatableSequence. Take a look at Contig.py and at LocatableSequence.py, these are the most relevant classes for this. Best regards, -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From cy at cymon.org Mon Jun 29 11:47:36 2009 From: cy at cymon.org (Cymon Cox) Date: Mon, 29 Jun 2009 16:47:36 +0100 Subject: [Biopython-dev] Bio.Sequencing In-Reply-To: <320fb6e00906290711o7d503679m593cdb1df0d6c701@mail.gmail.com> References: <7265d4f0906280710r32561593l2aca26abaa9c135e@mail.gmail.com> <320fb6e00906290023q57f8ccc3sde16576e23766bc5@mail.gmail.com> <7265d4f0906290049n65e63227s7537b2e83525ef3d@mail.gmail.com> <320fb6e00906290058s2d8fd93aw435eb803adbc5ac8@mail.gmail.com> <7265d4f0906290109l7a2c77a0radc7aad20c4263cc@mail.gmail.com> <320fb6e00906290711o7d503679m593cdb1df0d6c701@mail.gmail.com> Message-ID: <7265d4f0906290847i4cfc26c4q676d213a35d73ce3@mail.gmail.com> Hi Peter, 2009/6/29 Peter Cock > Hi Cymon, > > I've checked in some of your patch on Bug 2865 already, > recording the per-letter-annotation which I was planning to > do but hadn't got round to yet - thank you: > http://bugzilla.open-bio.org/show_bug.cgi?id=2865 > > This means with the latest code you can now use Biopython > to convert a PHD output file into a FASTQ file (or a QUAL > file) which could be handy for doing meta assemblies. Yeah, that's nice. 
Conversely, the reason I wrote the Phd writer is that I want to 'fake' some Phd files from FASTA and QUAL files - should/might be possible by using the default headers and equally spaced peak locations. The use-case is to fool Consed into displaying the trace (which it 'fakes') from a 454 Mira assembly ACE file output, but which it will only do if the Phd files are available. So I'm hoping to write the Phd files from the original FASTA/QUAL input files. Not sure if this is going to work, or if its a sensible thing to be trying... > I did relatively recently update SeqIO for the Ace format to > record the qualities - but there is an issue here. Only the > nucleotides get given quality scores, but not the insertions > (gaps, shown as "*" in the Ace file consensus sequence). > Currently the Bio.SeqIO parser gives the gapped sequence. > This means to record the quality scores, we need to give > some null value to the gap characters (and I used None). > > What I am wondering about is making the Bio.SeqIO Ace > parser just return the ungapped sequence (and the > associated PHRED quality scores). This means we could > then convert Ace files into FASTQ or QUAL files, and also > a simple Ace to FASTA conversion would give something > useful for downstream analysis (the ungapped consensus). > > The gaps *are* important if you want to see how the > consensus was built up - in which case it makes sense to > think about each Ace contig as a kind of multiple sequence > alignment. See this earlier discussion with David Winter: > http://lists.open-bio.org/pipermail/biopython/2009-April/005125.html > http://lists.open-bio.org/pipermail/biopython/2009-April/005128.html > > Any thoughts? I think it's probably unwise to return an ungapped sequence/qual by default if the contig in the ACE assembly is gapped. It would be nice if the parser had a switch ungapped=True, but thats not going to work with the SeqIO interface. 
Second best option would be to have an easy way of getting the ungapped SeqRecord from the gapped SeqRecord - a function somewhere in Bio.Sequencing? Anyway, I assume (havent checked) that currently if all the contigs are free of gaps then the SeqIO.AceIO will parse them into an Ungapped alphabet which can then be written to FASTA/QUAL etc. I think this is the right way to go, if the contigs have gaps the user needs to decide how to deal with them explicitly. Cheers, C. -- From eric.talevich at gmail.com Mon Jun 29 12:50:05 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 29 Jun 2009 12:50:05 -0400 Subject: [Biopython-dev] GSoC Weekly Update 6: PhyloXML for Biopython Message-ID: <3f6baf360906290950p4a22ddbeo6abc87769f8d09bf@mail.gmail.com> Hi folks, Previously (June 22--26) I: - Wrote unit tests for: - Instantiation of all implemented elements (properly) - Serialization to an output stream -- reusing the parser tests - Made all the unit tests pass - Tweaked for performance: parsing takes about 1/3 less CPU time now - Started Writer.py, with some imports, a class called Writer, and a top-level function for triggering serialization of the whole hierarchy (just a wrapper for ElementTree.write()) - Added __str__ and __repr__ methods to the base class (used in pretty-printing) - Added the method to_rgb() to class BranchColor. It builds a 24-bit hex string representing the color that can be used from HTML/CSS directly. Just something completely different... 
- Pulled from the biopython trunk

This week (June 29--July 3) I will:

- Write serialization methods for each class, matching Parser
- Catch up on documentation (on the Biopython wiki):
  - Explain use cases
  - Basic usage of the parser
  - Provide guidance on parser performance (parse() is ~4x faster;
    compare to Bioperl and Archaeopteryx)

Performance:
The normal test suite running on apaf.xml, bcl_2.xml,
phyloxml_examples.xml and ncbi_taxonomy_mollusca.xml.zip takes about
5 seconds; adding in ncbi_taxonomy_metazoa.xml.zip and the full
ncbi_taxonomy.xml.zip to the utilities tests requires 256 seconds
(parsing and pretty-printing), and just parsing all six files without
pretty-printing or counting tags takes a total of 186 seconds. The
Python process creeps up to 1.6GB while parsing all six files, but
stays under 40MB during the unit tests on the four more
reasonably-sized files.

Scheduling:
The code for serializing to XML was supposed to be written last week.
It was not, but I do have comprehensive tests written for it (abusing
the unittest framework to re-run the original parser tests) and see no
obstacles to its completion this week.

I didn't completely trust the unit tests earlier last week, so I spent
some time making the pretty-printer work properly, and in the process
added some syntactic sugar that was scheduled for later in the project
plan. I think this follows the current Biopython convention:

    bc = ProteinDomain(start=181, end=503, value='WD40')
    str(bc)   # ProteinDomain WD40
    repr(bc)  # ProteinDomain(start=181, end=503, value=WD40)

My plan is that when a phyloXML tree is exported to networkx for
display and other purposes, the str() result will be the label for
each node.

Pulling from upstream:
I intended to pull the tagged 1.51 beta of biopython from github
and merge it into my own code to take advantage of some recent
improvements. But I don't see the 1.51b tag anywhere. Does
anyone else know what happened to that tag?
I waited a few hours to see if it would be pushed from CVS automatically,
but no luck, so I pulled from a plausible point during the lull after
Peter's CVS-freeze announcement.

Cheers,
Eric

http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML
https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML

From biopython at maubp.freeserve.co.uk Tue Jun 30 03:46:43 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 30 Jun 2009 08:46:43 +0100
Subject: [Biopython-dev] GSoC Weekly Update 6: PhyloXML for Biopython
In-Reply-To: <3f6baf360906290950p4a22ddbeo6abc87769f8d09bf@mail.gmail.com>
References: <3f6baf360906290950p4a22ddbeo6abc87769f8d09bf@mail.gmail.com>
Message-ID: <320fb6e00906300046t75a1f669ne4f1c6af1b4f96e3@mail.gmail.com>

On Mon, Jun 29, 2009 at 5:50 PM, Eric Talevich wrote:
>
> Pulling from upstream:
> I intended to pull the tagged 1.51 beta of biopython from github
> and merge it into my own code to take advantage of some recent
> improvements. But I don't see the 1.51b tag anywhere. Does
> anyone else know what happened to that tag? I waited a few
> hours to see if it would be pushed from CVS automatically, but
> no luck, so I pulled from a plausible point during the lull after
> Peter's CVS-freeze announcement.

If I recall correctly, when pulling from git by default it does not
fetch the tags - you have to explicitly ask for them.
Peter From p.j.a.cock at googlemail.com Tue Jun 30 04:01:28 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 30 Jun 2009 09:01:28 +0100 Subject: [Biopython-dev] [Biopython] Bio.Sequencing.Ace In-Reply-To: <200906291716.06607.jblanca@btc.upv.es> References: <761477.83949.qm@web65501.mail.ac4.yahoo.com> <200906291625.30081.jblanca@btc.upv.es> <320fb6e00906290753w6e2a56d8v5748aa644da051d8@mail.gmail.com> <200906291716.06607.jblanca@btc.upv.es> Message-ID: <320fb6e00906300101r3e3faa37l6a47295bd5e12538@mail.gmail.com> On Mon, Jun 29, 2009 at 4:16 PM, Jose Blanca wrote: >> Are you using Bio.Sequencing.Ace in your code, or did you write a whole >> new parser instead? > I wrote one, because I wanted to be able to get one particular contig or just > the contig or the read names. But I don't think that is a problem. I gues > that the biopyhon parser could be easily adapted to that. I see. This touches on the indexing discussion - the same idea on this thread would probably work on Ace files too: http://lists.open-bio.org/pipermail/biopython/2009-June/005275.html >> Now that I have been using Ace files in my own work, I've been meaning >> to look over your stuff. In some ways, a contig class can be seen as a >> generalisation of a multiple sequence alignment class. Certainly this is >> something we should improve in Biopython (as you might gather from >> some of the enhancement bugs on bugzilla, I have lots of ideas for the >> current alignment class), and I'm sure you have some great ideas too. > > I think that here is the main deviation from Biopython. The contig class is > similar to an alignment class, in fact my contig classes shoud be compatible > with your new alignment proporsal api. That's good. I agree that a specialised contig class that works like the traditional multiple sequence alignment class would be nice. It would then make sense to have Bio.AlignIO handle contigs as well as traditional multiple sequence alignments. > alignment. 
> seq1 +++++++++>
> seq2 +++++++++>
> seq3 +++++++++>
>
> contig
> seq1 ++++>
> seq2     +++++>
> seq3         ++++++>
>
> Basically every read has a different coordinate system in the contig case.
> What I've done is to create a class named LocatableSequence that is a
> container for sequence objects. It works like:
> >>> seq1 = 'ATCG'
> >>> locseq1 = locate_sequence(seq1, location=10)
> >>> locseq1[10] == 'A'
> In that way the contig is a list of LocatableSequences and the coordinate
> system transformations are done by the LocatableSequences, not by the contig.
> The LocatableSequences also allow for masks.
> The LocatableSequence works with any sequence like objects, strs, Seq,
> SeqRecord, lists, etc.
> There's also a Location class that represents a fragment of a sequence. My
> Location class is more limited than the one in the Biopython SeqFeature. In
> my case the start and end should be integers. I use this class to represent
> the region not masked in the sequence and the Location of the sequence inside
> the LocatableSequence.
> Take a look at Contig.py and at LocatableSequence.py, these are the most
> relevant classes for this.
> Best regards,

I'll have to make some time for looking at your code.

What I was thinking of was a contig class as an alignment subclass,
holding a list of SeqRecord objects and offsets. The consensus might
just be one element of this list - but could be handled specially. This
sounds simpler than having to introduce a whole new object system,
related to but different to SeqFeature objects. However, I don't yet
have a sample implementation to demonstrate this.

One important thing I think we should do BEFORE adding any contig
class to Biopython, is get it working with at least one other contig file
format in addition to Ace. I don't want to end up with a class which
is too specialised for how ace contigs work.
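The per-read coordinate idea discussed above (each read stored together with an offset, so that contig positions are translated by the read wrapper rather than by the contig) can be sketched roughly as follows. This is only an illustration under simplifying assumptions: `OffsetRead` and `column` are hypothetical names, not Jose's LocatableSequence or any Biopython class, and reversed reads and masks are left out entirely.

```python
class OffsetRead(object):
    """A read plus the contig position of its first letter (hypothetical)."""

    def __init__(self, sequence, offset=0):
        self.sequence = sequence
        self.offset = offset

    def covers(self, contig_pos):
        """True if this read has a letter at the given contig position."""
        return self.offset <= contig_pos < self.offset + len(self.sequence)

    def __getitem__(self, contig_pos):
        # Translate the contig coordinate into the read's own coordinate.
        if not self.covers(contig_pos):
            raise IndexError("contig position %d is outside this read" % contig_pos)
        return self.sequence[contig_pos - self.offset]


def column(reads, contig_pos):
    """Letters from every read that covers the given contig position."""
    return [read[contig_pos] for read in reads if read.covers(contig_pos)]


# Three made-up reads starting at contig positions 0, 2 and 5.
reads = [OffsetRead("ACGT", 0), OffsetRead("GTTAC", 2), OffsetRead("ACGGTA", 5)]
print(column(reads, 3))
print(OffsetRead("ATCG", 10)[10])
```

With this split the contig itself stays a plain list of reads, and all the coordinate arithmetic lives in one small class.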
Peter From biopython at maubp.freeserve.co.uk Tue Jun 30 04:18:44 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 30 Jun 2009 09:18:44 +0100 Subject: [Biopython-dev] Bio.Sequencing In-Reply-To: <7265d4f0906290847i4cfc26c4q676d213a35d73ce3@mail.gmail.com> References: <7265d4f0906280710r32561593l2aca26abaa9c135e@mail.gmail.com> <320fb6e00906290023q57f8ccc3sde16576e23766bc5@mail.gmail.com> <7265d4f0906290049n65e63227s7537b2e83525ef3d@mail.gmail.com> <320fb6e00906290058s2d8fd93aw435eb803adbc5ac8@mail.gmail.com> <7265d4f0906290109l7a2c77a0radc7aad20c4263cc@mail.gmail.com> <320fb6e00906290711o7d503679m593cdb1df0d6c701@mail.gmail.com> <7265d4f0906290847i4cfc26c4q676d213a35d73ce3@mail.gmail.com> Message-ID: <320fb6e00906300118l78ca2a98kc25278e24ad433a1@mail.gmail.com> On Mon, Jun 29, 2009 at 4:47 PM, Cymon Cox wrote: > Hi Peter, > > 2009/6/29 Peter >> >> Hi Cymon, >> >> I've checked in some of your patch on Bug 2865 already, >> recording the per-letter-annotation which I was planning to >> do but hadn't got round to yet - thank you: >> http://bugzilla.open-bio.org/show_bug.cgi?id=2865 >> >> This means with the latest code you can now use Biopython >> to convert a PHD output file into a FASTQ file (or a QUAL >> file) which could be handy for doing meta assemblies. > > Yeah, that's nice. Conversely, the reason I wrote the Phd writer is that I > want to 'fake' some Phd files from FASTA and QUAL files - should/might be > possible by using the default headers and equally spaced peak locations. The > use-case is to fool Consed into displaying the trace (which it 'fakes') from > a 454 Mira assembly ACE file output, but which it will only do if the Phd > files are available. So I'm hoping to write the Phd files from the original > FASTA/QUAL input files. Not sure if this is going to work, or if its a > sensible thing to be trying... 
That sounds reasonable - as long as you know you are faking it ;)

>> I did relatively recently update SeqIO for the Ace format to
>> record the qualities - but there is an issue here. Only the
>> nucleotides get given quality scores, but not the insertions
>> (gaps, shown as "*" in the Ace file consensus sequence).
>> Currently the Bio.SeqIO parser gives the gapped sequence.
>> This means to record the quality scores, we need to give
>> some null value to the gap characters (and I used None).
>>
>> What I am wondering about is making the Bio.SeqIO Ace
>> parser just return the ungapped sequence (and the
>> associated PHRED quality scores). This means we could
>> then convert Ace files into FASTQ or QUAL files, and also
>> a simple Ace to FASTA conversion would give something
>> useful for downstream analysis (the ungapped consensus).
>>
>> The gaps *are* important if you want to see how the
>> consensus was built up - in which case it makes sense to
>> think about each Ace contig as a kind of multiple sequence
>> alignment. See this earlier discussion with David Winter:
>> http://lists.open-bio.org/pipermail/biopython/2009-April/005125.html
>> http://lists.open-bio.org/pipermail/biopython/2009-April/005128.html
>>
>> Any thoughts?
>
> I think it's probably unwise to return an ungapped sequence/qual by default
> if the contig in the ACE assembly is gapped. It would be nice if the parser
> had a switch ungapped=True, but that's not going to work with the SeqIO
> interface.

We can certainly add an ungapped optional argument to the parser in
Bio.SeqIO.AceIO - that would be a small improvement, meaning the
functionality would be there if you needed it (albeit a bit hidden).

Several of the Bio.SeqIO parsers already have optional arguments.
I have sometimes wondered about letting the SeqIO functions take
a **kwargs argument, and passing these arbitrary options to the
underlying parser.
This would allow for example passing wrap options
to the FASTA writer, or skipping the features when parsing GenBank
and EMBL. On the other hand, it gets very complicated, and detracts
from the current simplicity of Bio.SeqIO (which I like).

> Second best option would be to have an easy way of getting the
> ungapped SeqRecord from the gapped SeqRecord - a function
> somewhere in Bio.Sequencing?

I've already suggested some kind of "ungapped" method for Seq
objects, and yes, having this at the SeqRecord level too would
solve this particular use case. Removing the per-letter-annotations
associated with the gaps would be straightforward. I'm not sure
what we would want to do with any features in the SeqRecord
(perhaps a corner case), but most likely any SeqFeature covering
a region containing a gap would be lost.

> Anyway, I assume (haven't checked) that currently if all the
> contigs are free of gaps then the SeqIO.AceIO will parse
> them into an Ungapped alphabet which can then be written
> to FASTA/QUAL etc. I think this is the right way to go, if
> the contigs have gaps the user needs to decide how to deal
> with them explicitly.

Yes, if the Ace contig has no gaps, it will have a nice integer
PHRED quality for each base, and could be saved as FASTQ
or QUAL (or FASTA).

The thing about "gaps" in contigs is that the consensus is really
the ungapped sequence. I'd have to check but I think Newbler
and CAP3 will output both FASTA and ACE files, and in the
FASTA files there are no insertions/gaps in the contig sequences.

What I am thinking is Bio.SeqIO could return the ungapped
consensus sequences as SeqRecord objects (which can then
be saved as FASTA, FASTQ, QUAL) while Bio.AlignIO could
return contig-alignment objects (with the gaps, like David's
cookbook but in the long run with a contig class). This has
some merit, but breaks my current convention that parsing
an alignment file with SeqIO works by giving each gapped
sequence in each alignment in turn.
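As a rough illustration of the **kwargs idea raised earlier in this message, here is a toy dispatcher that forwards arbitrary keyword options to a format-specific writer. Everything here is hypothetical: `write_text`, `_write_fasta`, the `_WRITERS` table and the `wrap` option are invented for this sketch and do not reflect how Bio.SeqIO is implemented.

```python
def _write_fasta(records, wrap=60):
    """Format (name, sequence) pairs as FASTA text, wrapping at `wrap` letters."""
    lines = []
    for name, seq in records:
        lines.append(">" + name)
        if wrap:
            # Split the sequence into chunks of at most `wrap` letters.
            for i in range(0, len(seq), wrap):
                lines.append(seq[i:i + wrap])
        else:
            lines.append(seq)
    return "\n".join(lines) + "\n"


# Hypothetical registry mapping format names to writer functions.
_WRITERS = {"fasta": _write_fasta}


def write_text(records, fmt, **kwargs):
    """Look up the writer for `fmt` and pass any extra options straight through."""
    try:
        writer = _WRITERS[fmt]
    except KeyError:
        raise ValueError("Unknown format %r" % fmt)
    # An option the writer does not understand raises TypeError here.
    return writer(records, **kwargs)


print(write_text([("read1", "ACGTACGTAC")], "fasta", wrap=4))
```

The complication mentioned above shows up even in this toy: the set of valid options depends on the format, so a mistyped or unsupported option is only caught when the underlying writer is called.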
Peter From jblanca at btc.upv.es Tue Jun 30 04:31:06 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Tue, 30 Jun 2009 10:31:06 +0200 Subject: [Biopython-dev] [Biopython] Bio.Sequencing.Ace In-Reply-To: <320fb6e00906300101r3e3faa37l6a47295bd5e12538@mail.gmail.com> References: <761477.83949.qm@web65501.mail.ac4.yahoo.com> <200906291716.06607.jblanca@btc.upv.es> <320fb6e00906300101r3e3faa37l6a47295bd5e12538@mail.gmail.com> Message-ID: <200906301031.06273.jblanca@btc.upv.es> > What I was thinking of was a contig class as an alignment subclass, > holding a list of SeqRecord objects and offsets. The consensus might > just be one element of this list - but could be handled specially. This > sounds simpler than having to introduce a whole new object system, > related to but different to SeqFeature objects. However, I don't yet > have a sample implementation to demonstrate this. I thought about that implementation and I created some code. The problem I found with that approach is that the contig class code got too messy. Take into account that besides the offset you also need the masks and that some sequences could be reversed. That's why I decided to split the part that calculates the offset and the mask into a separate class. > One important thing I think we should do BEFORE adding any contig > class to Biopython, is get it working with at least one other contig file > format in addition to Ace. I don't want to end up with a class which > is too specialised for how ace contigs work. > > Peter Well, In fact my contig class is modeled after the caf file format. The ace parsing was just an afterthought, my primary interest was the caf format. -- Jose M. 
Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From bartek at rezolwenta.eu.org Tue Jun 30 04:39:01 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 30 Jun 2009 10:39:01 +0200 Subject: [Biopython-dev] GSoC Weekly Update 6: PhyloXML for Biopython In-Reply-To: <320fb6e00906300046t75a1f669ne4f1c6af1b4f96e3@mail.gmail.com> References: <3f6baf360906290950p4a22ddbeo6abc87769f8d09bf@mail.gmail.com> <320fb6e00906300046t75a1f669ne4f1c6af1b4f96e3@mail.gmail.com> Message-ID: <8b34ec180906300139q722fbd7ei241ef51d004ea2@mail.gmail.com> On Tue, Jun 30, 2009 at 9:46 AM, Peter wrote: > On Mon, Jun 29, 2009 at 5:50 PM, Eric Talevich wrote: >> improvements. But I don't see the 1.51b tag anywhere. Does >> anyone else know what happened to that tag? I waited a few >> hours to see if it would be pushed from CVS automatically, but >> no luck, so I pulled from a plausible point during the lull after >> Peter's CVS-freeze announcement. > > If I recall correctly, when pulling from git by default it does not > fetch the tags - you have to explicitly ask for them. Hi, This was actually connected to the issues with moving tags from CVS to GitHub. It's fixed now.
cheers Bartek From cy at cymon.org Tue Jun 30 05:02:04 2009 From: cy at cymon.org (Cymon Cox) Date: Tue, 30 Jun 2009 10:02:04 +0100 Subject: [Biopython-dev] Bio.Sequencing In-Reply-To: <320fb6e00906300118l78ca2a98kc25278e24ad433a1@mail.gmail.com> References: <7265d4f0906280710r32561593l2aca26abaa9c135e@mail.gmail.com> <320fb6e00906290023q57f8ccc3sde16576e23766bc5@mail.gmail.com> <7265d4f0906290049n65e63227s7537b2e83525ef3d@mail.gmail.com> <320fb6e00906290058s2d8fd93aw435eb803adbc5ac8@mail.gmail.com> <7265d4f0906290109l7a2c77a0radc7aad20c4263cc@mail.gmail.com> <320fb6e00906290711o7d503679m593cdb1df0d6c701@mail.gmail.com> <7265d4f0906290847i4cfc26c4q676d213a35d73ce3@mail.gmail.com> <320fb6e00906300118l78ca2a98kc25278e24ad433a1@mail.gmail.com> Message-ID: <7265d4f0906300202n119bac77j52f60e5680db528d@mail.gmail.com> 2009/6/30 Peter > On Mon, Jun 29, 2009 at 4:47 PM, Cymon Cox wrote: > > Hi Peter, > > > > 2009/6/29 Peter > >> > >> Hi Cymon, > >> > [...] > > Several of the Bio.SeqIO parsers already have optional arguments. > I have sometimes wondered about letting the SeqIO functions take > a **kwargs argument, and passing these arbitrary options to the > underlying parser. This would allow for example passing wrap options > to the FASTA writer, or skipping the features when parsing GenBank > and EMBL. On the other hand, it gets very complicated, and detracts > from the current simplicity of Bio.SeqIO (which I like). It's a bit of a slippery slope - but it would have been nice to have a "useDefaults" switch in the PhdWriter. > > Anyway, I assume (haven't checked) that currently if all the > > contigs are free of gaps then the SeqIO.AceIO will parse > > them into an Ungapped alphabet which can then be written > > to FASTA/QUAL etc. I think this is the right way to go, if > > the contigs have gaps the user needs to decide how to deal > > with them explicitly.
> > Yes, if the Ace contig has no gaps, it will have a nice integer > PHRED quality for each base, and could be saved as FASTQ > or QUAL (or FASTA). > > The thing about "gaps" in contigs is that the consensus is > really the ungapped sequence. Yes, but... there is still some ambiguity over the consensus sequence which is lost in the ungapped sequence. OK, so this isn't such a big deal with the massive coverages achieved by 454 tech, but I can imagine cases of hybrid Sanger/454 where this might be an issue (might be scraping the bottom of the barrel a bit here...). > I'd have to check but I think > Newbler and CAP3 will output both FASTA and ACE files, > and in the FASTA files there are no insertions/gaps in the > contig sequences. For comparison, Mira outputs ACE, plus X.gapped.fasta, and X.ungapped.fasta. > What I am thinking is Bio.SeqIO could return the ungapped > consensus sequences as SeqRecord objects (which can then > be saved as FASTA, FASTQ, QUAL) while Bio.AlignIO > could return contig-alignment objects (with the gaps, like > David's cookbook but in the long run with a contig class). Yeah, I like this. Although, I'm not sure how intuitive it is that SeqIO would necessarily return the ungapped rather than gapped sequences - but it kinda makes sense... Cheers, C.
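The **kwargs pass-through idea quoted above could look roughly like the following. This is only a sketch under stated assumptions: write_records, _write_fasta and the wrap option are hypothetical names chosen for illustration, records are simple (name, sequence) tuples, and none of this is how Bio.SeqIO is actually implemented.

```python
import io


def _write_fasta(records, handle, wrap=60):
    """Hypothetical FASTA writer; wrap=None or 0 puts each sequence on one line."""
    count = 0
    for name, sequence in records:
        handle.write(">%s\n" % name)
        if wrap:
            # Emit the sequence in chunks of at most `wrap` letters.
            for start in range(0, len(sequence), wrap):
                handle.write(sequence[start:start + wrap] + "\n")
        else:
            handle.write(sequence + "\n")
        count += 1
    return count


# Format name -> writer function (only one format in this sketch).
_WRITERS = {"fasta": _write_fasta}


def write_records(records, handle, fmt, **kwargs):
    """Top-level function that forwards any extra options to the format writer."""
    try:
        writer = _WRITERS[fmt]
    except KeyError:
        raise ValueError("Unknown format %r" % fmt)
    # An option the chosen writer does not understand raises TypeError here.
    return writer(records, handle, **kwargs)


handle = io.StringIO()
write_records([("contig1", "ACGTACGTAC")], handle, "fasta", wrap=4)
print(handle.getvalue())
```

The drawback mentioned in the quoted text is visible even in this toy version: each format would accept a different set of options, so the simple top-level signature no longer tells the caller what is valid.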
-- From biopython at maubp.freeserve.co.uk Tue Jun 30 05:33:00 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 30 Jun 2009 10:33:00 +0100 Subject: [Biopython-dev] Bio.Sequencing In-Reply-To: <7265d4f0906300202n119bac77j52f60e5680db528d@mail.gmail.com> References: <7265d4f0906280710r32561593l2aca26abaa9c135e@mail.gmail.com> <320fb6e00906290023q57f8ccc3sde16576e23766bc5@mail.gmail.com> <7265d4f0906290049n65e63227s7537b2e83525ef3d@mail.gmail.com> <320fb6e00906290058s2d8fd93aw435eb803adbc5ac8@mail.gmail.com> <7265d4f0906290109l7a2c77a0radc7aad20c4263cc@mail.gmail.com> <320fb6e00906290711o7d503679m593cdb1df0d6c701@mail.gmail.com> <7265d4f0906290847i4cfc26c4q676d213a35d73ce3@mail.gmail.com> <320fb6e00906300118l78ca2a98kc25278e24ad433a1@mail.gmail.com> <7265d4f0906300202n119bac77j52f60e5680db528d@mail.gmail.com> Message-ID: <320fb6e00906300233r60998635lcbfe8788c73ab119@mail.gmail.com> Cymon wrote: > >Peter wrote: >> The thing about "gaps" in contigs is that the consensus is >> really the ungapped sequence. > > Yes, but... there is still some ambiguity over the consensus sequence which > is lost in the ungapped sequence. OK, so this isn't such a big deal with the > massive coverages achieved by 454 tech but I can imagine cases of hybrid > Sanger/454 where this might be an issue (might be scraping the bottom of the > barrel a bit here...). > >Peter wrote: >> I'd have to check but I think >> Newbler and CAP3 will output both FASTA and ACE files, >> and in the FASTA files there are no insertions/gaps in the >> contig sequences. > > For comparison, Mira outputs ACE, plus X.gapped.fasta, and X.ungapped.fasta That is nice and explicit. :) >> What I am thinking is Bio.SeqIO could return the ungapped >> consensus sequences as SeqRecord objects (which can then >> be saved as FASTA, FASTQ, QUAL) while Bio.AlignIO >> could return contig-alignment objects (with the gaps, like >> David's cookbook but in the long run with a contig class). > > Yeah, I like this. Cool.
I will try and look into this later in July. > Although, I'm not sure how intuitive it is that SeqIO would > necessarily return the ungapped rather than gapped > sequences - but it kinda makes sense... Yeah - I'm a bit on the fence myself about Ace to SeqRecord, and whether gapped or ungapped makes most sense. Given that the current Bio.SeqIO behaviour gives the gapped sequence, I guess we should just leave it like that. Peter From p.j.a.cock at googlemail.com Tue Jun 30 05:47:51 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 30 Jun 2009 10:47:51 +0100 Subject: [Biopython-dev] [Biopython] Bio.Sequencing.Ace In-Reply-To: <200906301031.06273.jblanca@btc.upv.es> References: <761477.83949.qm@web65501.mail.ac4.yahoo.com> <200906291716.06607.jblanca@btc.upv.es> <320fb6e00906300101r3e3faa37l6a47295bd5e12538@mail.gmail.com> <200906301031.06273.jblanca@btc.upv.es> Message-ID: <320fb6e00906300247ve2f45eau4ef97d3f65bb50c5@mail.gmail.com> On Tue, Jun 30, 2009 at 9:31 AM, Jose Blanca wrote: >> What I was thinking of was a contig class as an alignment subclass, >> holding a list of SeqRecord objects and offsets. The consensus might >> just be one element of this list - but could be handled specially. This >> sounds simpler than having to introduce a whole new object system, >> related to but different to SeqFeature objects. However, I don't yet >> have a sample implementation to demonstrate this. > > I thought about that implementation and I created some code. The > problem I found with that approach is that the contig class code got > too messy. Take into account that besides the offset you also need > the masks and that some sequences could be reversed. That's why > I decided to split the part that calculates the offset and the mask > into a separate class. A simple masked sequence class would also be useful for Roche SFF files which hold sequencing reads (of about 500bp) with start and end trim points.
This is a use case separate from the location offset in an alignment - so I'm not convinced it makes sense to do both in one class. Perhaps having the contig class hold a list of (masked) SeqRecord objects, their offset, and their direction would work? >> One important thing I think we should do BEFORE adding any contig >> class to Biopython, is get it working with at least one other contig file >> format in addition to Ace. I don't want to end up with a class which >> is too specialised for how ace contigs work. > > Well, In fact my contig class is modeled after the caf file format. > The ace parsing was just an afterthought, my primary interest > was the caf format. Well, as the CAF file format was an extension of the ACE format, perhaps a third contig format would be worth looking at before considering if a contig class would be sufficiently general. Have you got any links to the CAF file format you found useful when writing your parser? In addition to: http://www.sanger.ac.uk/Software/formats/CAF/ http://www.genome.org/cgi/content/full/8/3/260 Thanks, Peter From bugzilla-daemon at portal.open-bio.org Tue Jun 30 13:13:29 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 30 Jun 2009 13:13:29 -0400 Subject: [Biopython-dev] [Bug 2867] New: Bio.PDB.PDBList.update_pdb calls invalid os.cmd Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2867 Summary: Bio.PDB.PDBList.update_pdb calls invalid os.cmd Product: Biopython Version: 1.51b Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: mlrodrigues at igc.gulbenkian.pt As listed in the traceback, this module tries to use an invalid function from the os module. 
Traceback (most recent call last): File "update_pdb.py", line 33, in update(pdblist, my_try) File "update_pdb.py", line 27, in update x.update_pdb() File "/usr/lib/pymodules/python2.5/Bio/PDB/PDBList.py", line 280, in update_pdb os.cmd('mv %s %s'%(old_file,new_file)) AttributeError: 'module' object has no attribute 'cmd' -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From agrobertson at telus.net Tue Jun 30 20:27:21 2009 From: agrobertson at telus.net (Gordon Robertson) Date: Tue, 30 Jun 2009 17:27:21 -0700 Subject: [Biopython-dev] Fwd: ACE files at Biopython-dev References: Message-ID: <9FFF9C34-5253-45F4-B009-6193A5FEEA6B@telus.net> I flagged the current ACE discussion to the Consed author, David Gordon, and am forwarding his response. G Begin forwarded message: > From: David Gordon > Date: June 30, 2009 9:38:10 AM PDT > To: Gordon Robertson > Cc: Yaron Butterfield , David Gordon > > Subject: Re: ACE files at Biopython-dev > > Hi, Gordon, > > Could you add a comment from me to this thread, please? Here is the > text to add: > > > I am the author of Consed and briefly read this thread. > > I have one suggestion on this tool: that it create phd ball files > instead of phd files, particularly if the number of reads is more than > a few hundred. phd files are a leftover from the days of sequencing > when there were only a few thousand reads at most. The linux > operating system and software cannot handle millions of phd files in > the same directory, so Consed now typically uses a small number of phd > balls. Here is an example of a phd ball file that contains 2 reads > (typically a phd ball file will contain up to a million--more than > that becomes difficult to copy). Notice that there is a comment at > the beginning starting with "#" at the beginning of the line. 
Also > notice that the BEGIN_SEQUENCE line is slightly different due to the > "1" at the end--this is the version, which corresponds to the > extension on the end of a phd file name such as > HWI-EAS94_4_1_1_537_446.phd.1 > > Notice also that peak positions (which normally form a 3rd column > after the quality) are now optional, which helps keep the file size > down. For reads that you want to see the traces, you will need to > have peak positions. A 454 example follows: > > > > # solexa file ../solexa_dir/solexa_reads.fastq (beginning) > > BEGIN_SEQUENCE HWI-EAS94_4_1_1_537_446 1 > BEGIN_COMMENT > TIME: Wed Dec 24 11:21:50 2008 > CHEM: solexa > END_COMMENT > BEGIN_DNA > g 30 > c 30 > c 30 > a 30 > a 30 > t 30 > c 30 > a 30 > g 30 > g 30 > t 30 > t 30 > t 30 > c 30 > t 30 > c 30 > t 30 > g 30 > c 30 > a 30 > a 28 > g 23 > c 30 > c 30 > c 30 > c 30 > t 30 > t 30 > t 28 > a 22 > g 8 > c 22 > a 7 > g 15 > c 15 > t 15 > g 10 > a 10 > g 11 > c 15 > END_DNA > END_SEQUENCE > > BEGIN_SEQUENCE HWI-EAS94_4_1_1_602_99 1 > BEGIN_COMMENT > TIME: Wed Dec 24 11:21:50 2008 > CHEM: solexa > END_COMMENT > BEGIN_DNA > g 30 > c 30 > c 30 > a 30 > t 30 > g 30 > g 30 > c 30 > a 30 > c 30 > a 30 > t 30 > a 30 > t 30 > a 30 > t 30 > g 30 > a 30 > a 30 > g 30 > g 30 > t 30 > c 30 > a 30 > g 30 > a 30 > g 16 > g 30 > a 28 > c 22 > a 22 > a 22 > c 14 > t 15 > t 15 > g 5 > c 10 > t 15 > g 10 > t 5 > END_DNA > END_SEQUENCE > > > phd ball files for 454 reads (in which traces are displayed) have more > information. Here is an example: > > BEGIN_SEQUENCE EBE03TV04IHLTF.77-243 1 > > BEGIN_COMMENT > > CHROMAT_FILE: sff:reads.sff:EBE03TV04IHLTF > QUALITY_LEVELS: 99 > TIME: Thu Jul 27 12:33:48 2000 > TRACE_ARRAY_MIN_INDEX: 0 > TRACE_ARRAY_MAX_INDEX: 4723 > CHEM: 454 > > END_COMMENT > > BEGIN_DNA > g 37 91 > g 37 110 > g 37 129 > g 37 148 > a 37 167 > t 37 186 > g 37 205 > a 37 224 > a 37 243 > a 37 262 > g 37 281 > g 37 300 > g 37 319 > . > . > . 
> a 26 4385 > t 26 4404 > c 26 4423 > t 30 4442 > c 33 4461 > g 33 4480 > g 33 4499 > t 33 4518 > g 33 4537 > g 36 4556 > t 36 4575 > a 33 4594 > g 33 4613 > g 33 4632 > t 36 4651 > g 26 4670 > a 22 4689 > END_DNA > > END_SEQUENCE > > (more BEGIN_SEQUENCE/END_SEQUENCE blocks to follow) > > > The line: > CHROMAT_FILE: sff:reads.sff:EBE03TV04IHLTF > indicates both the sff file that the read came from as well as the > read name. > > When creating ace files, BS lines are now optional. BS lines really > only make sense when the assembly is phrap > > > > David Gordon > > > > On Tue, 30 Jun 2009, Gordon Robertson wrote: > >> David >> >> I thought I should flag with you that code for ACE files are being >> discussed now in BioPython. >> >> G >> >> Begin forwarded message: >> >>> From: biopython-dev-request at lists.open-bio.org >>> Date: June 30, 2009 1:39:04 AM PDT >>> To: biopython-dev at lists.open-bio.org >>> Subject: Biopython-dev Digest, Vol 77, Issue 30 >>> Reply-To: biopython-dev at lists.open-bio.org >>> Send Biopython-dev mailing list submissions to >>> biopython-dev at lists.open-bio.org >>> To subscribe or unsubscribe via the World Wide Web, visit >>> http://lists.open-bio.org/mailman/listinfo/biopython-dev >>> or, via email, send a message with subject or body 'help' to >>> biopython-dev-request at lists.open-bio.org >>> You can reach the person managing the list at >>> biopython-dev-owner at lists.open-bio.org >>> When replying, please edit your Subject line so it is more specific >>> than "Re: Contents of Biopython-dev digest..." >>> Today's Topics: >>> >>> 1. GSoC Weekly Update 6: PhyloXML for Biopython (Eric Talevich) >>> 2. Re: GSoC Weekly Update 6: PhyloXML for Biopython (Peter) >>> 3. Re: [Biopython] Bio.Sequencing.Ace (Peter Cock) >>> 4. Re: Bio.Sequencing (Peter) >>> 5. Re: [Biopython] Bio.Sequencing.Ace (Jose Blanca) >>> 6. 
Re: GSoC Weekly Update 6: PhyloXML for Biopython >>> (Bartek Wilczynski) >>> ---------------------------------------------------------------------- >>> >>> ------------------------------ >>> Message: 3 >>> Date: Tue, 30 Jun 2009 09:01:28 +0100 >>> From: Peter Cock >>> Subject: Re: [Biopython-dev] [Biopython] Bio.Sequencing.Ace >>> To: Jose Blanca >>> Cc: biopython-dev at lists.open-bio.org >>> Message-ID: >>> <320fb6e00906300101r3e3faa37l6a47295bd5e12538 at mail.gmail.com> >>> Content-Type: text/plain; charset=ISO-8859-1 >>> On Mon, Jun 29, 2009 at 4:16 PM, Jose Blanca >>> wrote: >>>>> Are you using Bio.Sequencing.Ace in your code, or did you write >>>>> a whole >>>>> new parser instead? >>>> I wrote one, because I wanted to be able to get one particular >>>> contig or just >>>> the contig or the read names. But I don't think that is a >>>> problem. I gues >>>> that the biopyhon parser could be easily adapted to that. >>> I see. This touches on the indexing discussion - the same idea on >>> this thread would probably work on Ace files too: >>> http://lists.open-bio.org/pipermail/biopython/2009-June/005275.html >>>>> Now that I have been using Ace files in my own work, I've been >>>>> meaning >>>>> to look over your stuff. In some ways, a contig class can be >>>>> seen as a >>>>> generalisation of a multiple sequence alignment class. Certainly >>>>> this is >>>>> something we should improve in Biopython (as you might gather from >>>>> some of the enhancement bugs on bugzilla, I have lots of ideas >>>>> for the >>>>> current alignment class), and I'm sure you have some great ideas >>>>> too. >>>> I think that here is the main deviation from Biopython. The >>>> contig class is >>>> similar to an alignment class, in fact my contig classes shoud be >>>> compatible >>>> with your new alignment proporsal api. >>> That's good. I agree that a specialised contig class that works like >>> the traditional multiple sequence alignment class would be nice. 
>>> It would then make sense to have Bio.AlignIO handle contigs as >>> well as traditional multiple sequence alignments. >>>> alignment. >>>> seq1 +++++++++> >>>> seq2 +++++++++> >>>> seq3 +++++++++> >>>> contig >>>> seq1 ++++> >>>> seq2 ? ?+++++> >>>> seq3 ? ? ? ?++++++> >>>> Basically every read has a different coordinate system in the >>>> contig case. >>>> What I've done is to create a class named LocatableSequence that >>>> is a >>>> container for sequence objects. It works like: >>>>>>> seq1 = 'ATCG' >>>>>>> locseq1 = locate_sequence(seq1, location=10) >>>>>>> locseq1[10] == A >>>> In that way the contig is a list of LocatableSequences and the >>>> coordinate >>>> system transformations are done by the LocatableSequences, not by >>>> the contig. >>>> The LocatableSequences also allow for masks. >>>> The LocatableSequence works with any sequence like objects, strs, >>>> Seq, >>>> SeqRecord, lists, etc. >>>> There's also a Location class that represents a fragment of a >>>> sequence. My >>>> Location class is more limited than the one in the Biopython >>>> SeqFeature. In >>>> my case the start and end should be integers. I use this class to >>>> represent >>>> the region not masked in the sequence and the Location of the >>>> sequence inside >>>> the LocatableSequence. >>>> Take a look at Contig.py and at LocatableSequence.py, these are >>>> the most >>>> relevant classes for this. >>>> Best regards, >>> I'll have to make some time for looking at your code. >>> What I was thinking of was a contig class as an alignment subclass, >>> holding a list of SeqRecord objects and offsets. The consensus might >>> just be one element of this list - but could be handled specially. >>> This >>> sounds simpler than having to introduce a whole new object system, >>> related to but different to SeqFeature objects. However, I don't yet >>> have a sample implementation to demonstrate this. 
>>> One important thing I think we should do BEFORE adding any contig >>> class to Biopython, is get it working with at least one other >>> contig file >>> format in addition to Ace. I don't want to end up with a class which >>> is too specialised for how ace contigs work. >>> Peter >>> ------------------------------ >>> Message: 4 >>> Date: Tue, 30 Jun 2009 09:18:44 +0100 >>> From: Peter >>> Subject: Re: [Biopython-dev] Bio.Sequencing >>> To: Cymon Cox >>> Cc: BioPython-Dev Mailing List >>> Message-ID: >>> <320fb6e00906300118l78ca2a98kc25278e24ad433a1 at mail.gmail.com> >>> Content-Type: text/plain; charset=ISO-8859-1 >>> On Mon, Jun 29, 2009 at 4:47 PM, Cymon Cox wrote: >>>> Hi Peter, >>>> 2009/6/29 Peter >>>>> Hi Cymon, >>>>> I've checked in some of your patch on Bug 2865 already, >>>>> recording the per-letter-annotation which I was planning to >>>>> do but hadn't got round to yet - thank you: >>>>> http://bugzilla.open-bio.org/show_bug.cgi?id=2865 >>>>> This means with the latest code you can now use Biopython >>>>> to convert a PHD output file into a FASTQ file (or a QUAL >>>>> file) which could be handy for doing meta assemblies. >>>> Yeah, that's nice. Conversely, the reason I wrote the Phd writer >>>> is that I >>>> want to 'fake' some Phd files from FASTA and QUAL files - should/ >>>> might be >>>> possible by using the default headers and equally spaced peak >>>> locations. The >>>> use-case is to fool Consed into displaying the trace (which it >>>> 'fakes') from >>>> a 454 Mira assembly ACE file output, but which it will only do if >>>> the Phd >>>> files are available. So I'm hoping to write the Phd files from >>>> the original >>>> FASTA/QUAL input files. Not sure if this is going to work, or if >>>> its a >>>> sensible thing to be trying... >>> That sounds reasonable - as long as you know you are faking it ;) >>>>> I did relatively recently update SeqIO for the Ace format to >>>>> record the qualities - but there is an issue here. 
Only the >>>>> nucleotides get given quality scores, but not the insertions >>>>> (gaps, shown as "*" in the Ace file consensus sequence). >>>>> Currently the Bio.SeqIO parser gives the gapped sequence. >>>>> This means to record the quality scores, we need to give >>>>> some null value to the gap characters (and I used None). >>>>> What I am wondering about is making the Bio.SeqIO Ace >>>>> parser just return the ungapped sequence (and the >>>>> associated PHRED quality scores). This means we could >>>>> then convert Ace files into FASTQ or QUAL files, and also >>>>> a simple Ace to FASTA conversion would give something >>>>> useful for downstream analysis (the ungapped consensus). >>>>> The gaps *are* important if you want to see how the >>>>> consensus was built up - in which case it makes sense to >>>>> think about each Ace contig as a kind of multiple sequence >>>>> alignment. See this earlier discussion with David Winter: >>>>> http://lists.open-bio.org/pipermail/biopython/2009-April/005125.html >>>>> http://lists.open-bio.org/pipermail/biopython/2009-April/005128.html >>>>> Any thoughts? >>>> I think it's probably unwise to return an ungapped sequence/qual >>>> by default >>>> if the contig in the ACE assembly is gapped. It would be nice if >>>> the parser >>>> had a switch ungapped=True, but thats not going to work with the >>>> SeqIO >>>> interface. >>> We can certainly add a ungapped optional argument to the parser >>> in Bio.SeqIO.AceIO - that would be a small improvement, meaning >>> the functionality would be there if you needed it (all be it a bit >>> hidden). >>> Several of the Bio.SeqIO parsers already have optional arguments. >>> I have sometimes wondered about letting the SeqIO functions take >>> a **kwargs argument, and passing these arbitrary options to the >>> underlying parser. This would allow for example passing wrap options >>> to the FASTA writer, or skiping the features when parsing GenBank >>> and EBML. 
On the other hand, it gets very complicated, and detracts >>> from the current simplicity of Bio.SeqIO (which I like). >>>> Second best option would be to have an easy way of getting the >>>> ungapped SeqRecord from the gapped SeqRecord - a function >>>> somewhere in Bio.Sequencing? >>> I've already suggested some kind of "ungapped" method for Seq >>> objects, and yes, having this at the SeqRecord level too would >>> solve this particular use case. Removing the per-letter-annotations >>> associated with the gaps would be straight forward. I'm not sure >>> what we would want to do with any features in the SeqRecord >>> (perhaps a corner case), but most likely any SeqFeature covering >>> a region containing a gap would be lost. >>>> Anyway, I assume (havent checked) that currently if all the >>>> contigs are free of gaps then the SeqIO.AceIO will parse >>>> them into an Ungapped alphabet which can then be written >>>> to FASTA/QUAL etc. I think this is the right way to go, if >>>> the contigs have gaps the user needs to decide how to deal >>>> with them explicitly. >>> Yes, if the Ace contig has no gaps, it will have a nice integer >>> PHRED quality for each base, and could be saved as FASTQ >>> or QUAL (or FASTA). >>> The thing about "gaps" in contigs is that the consensus is >>> really the ungapped sequence. I'd have to check but I think >>> Newbler and CAP3 will output both FASTA and ACE files, >>> and in the FASTA files there are no insertions/gaps in the >>> contig sequences. >>> What I am thinking is Bio.SeqIO could return the ungapped >>> consensus sequences as SeqRecord objects (which can then >>> be saved as FASTA, FASTQ, QUAL) while Bio.AlignIO >>> could return contig-alignment objects (with the gaps, like >>> David's cookbook but in the long run with a contig class). >>> This has some merit, but breaks my current convention that >>> parsing an alignment file with SeqIO works by giving each >>> gapped sequence in each alignment in turn. 
>>> Peter >>> ------------------------------ >>> Message: 5 >>> Date: Tue, 30 Jun 2009 10:31:06 +0200 >>> From: Jose Blanca >>> Subject: Re: [Biopython-dev] [Biopython] Bio.Sequencing.Ace >>> To: biopython-dev at lists.open-bio.org >>> Message-ID: <200906301031.06273.jblanca at btc.upv.es> >>> Content-Type: text/plain; charset="iso-8859-1" >>>> What I was thinking of was a contig class as an alignment subclass, >>>> holding a list of SeqRecord objects and offsets. The consensus >>>> might >>>> just be one element of this list - but could be handled >>>> specially. This >>>> sounds simpler than having to introduce a whole new object system, >>>> related to but different to SeqFeature objects. However, I don't >>>> yet >>>> have a sample implementation to demonstrate this. >>> I thought about that implementation and I created some code. The >>> problem I >>> found with that approach is that the contig class code got too >>> messy. Take >>> into account that besides the offset you also need the masks and >>> that some >>> sequences could be reversed. That's why I decided to split the >>> part that >>> calculates the offset and the mask into a separate class. >>>> One important thing I think we should do BEFORE adding any contig >>>> class to Biopython, is get it working with at least one other >>>> contig file >>>> format in addition to Ace. I don't want to end up with a class >>>> which >>>> is too specialised for how ace contigs work. >>>> Peter >>> Well, In fact my contig class is modeled after the caf file >>> format. The ace >>> parsing was just an afterthought, my primary interest was the caf >>> format. >>> -- >>> Jose M. 
Blanca Postigo >>> Instituto Universitario de Conservacion y >>> Mejora de la Agrodiversidad Valenciana (COMAV) >>> Universidad Politecnica de Valencia (UPV) >>> Edificio CPI (Ciudad Politecnica de la Innovacion), 8E >>> 46022 Valencia (SPAIN) >>> Tlf.:+34-96-3877000 (ext 88473) >>> ------------------------------ >>> >>> _______________________________________________ >>> Biopython-dev mailing list >>> Biopython-dev at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython-dev >>> End of Biopython-dev Digest, Vol 77, Issue 30 >>> ********************************************* >> >> -- >> Gordon Robertson >> Canada's Michael Smith Genome Sciences Centre >> Vancouver BC Canada >> >> >> > -- Gordon Robertson Canada's Michael Smith Genome Sciences Centre Vancouver BC Canada From biopython at maubp.freeserve.co.uk Mon Jun 1 10:15:03 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 1 Jun 2009 11:15:03 +0100 Subject: [Biopython-dev] More SwissProt inconsistencies In-Reply-To: <880385.97797.qm@web62401.mail.re1.yahoo.com> References: <880385.97797.qm@web62401.mail.re1.yahoo.com> Message-ID: <320fb6e00906010315k5e6ed8bdm5327e51eec3ac51e@mail.gmail.com> On Sat, May 30, 2009 at 10:37 AM, Michiel de Hoon wrote: > 1) A multi-line author list such as the following: > ... > is stored without newlines by Bio.SeqIO: > ... > but with newlines by Bio.SwissProt: > > To me, the Bio.SeqIO approach seems more reasonable. I think we should > add a space though at places where there is a newline in the file. > > The same happens for multiline RL such as > > RL   (In) Baker M.J., Crush J.R., Humphreys L.R. (eds.); > RL   Proceedings of the XVII international grassland congress, > RL   pp.2:1033-1034, Dunmore Press, Palmerston North (1993). > > and for multiline RT lines such as > > RT   "Genome of the host-cell transforming parasite Theileria annulata > RT   compared with T.
parva."; > > This is stored by Bio.SeqIO as > > '"Genome of the host-cell transforming parasite Theileria annulatacompared with T. parva.";' > > and by Bio.SwissProt as > > '"Genome of the host-cell transforming parasite Theileria annulata\ncompared with T. parva.";' > > whereas I think that both should be stored as > > '"Genome of the host-cell transforming parasite Theileria annulata compared with T. parva.";' I agree with you - the missing spaces when parsed with Bio.SeqIO are a bug and should be fixed. > 2) Comments in a reference such as the following: > RC   STRAIN=cv. VF36; TISSUE=Anther; > are stored as a single string by Bio.SeqIO: > >>> seq_record.annotations['references'][i].comment > 'STRAIN=cv. VF36; TISSUE=Anther;' > but as a list of (key, value) pairs by Bio.SwissProt: > [('STRAIN', 'cv. VF36'), ('TISSUE', 'Anther')] > Whereas I think both are reasonable, Bio.SeqIO drops the space between > two (key, value) pairs if they are on two separate lines: > RC   STRAIN=C57BL/6J; > RC   TISSUE=Bone marrow, Embryo, Kidney, Liver, Thymus, and Visual cortex; > is stored as > >>> seq_record.annotations['references'][i].comment > 'STRAIN=C57BL/6J;TISSUE=Bone marrow, Embryo, Kidney, Liver, Thymus, and Visual cortex;' > I think we should add a space here, or just store these as (key, value) pairs as Bio.SwissProt is doing. > > Any objections or comments? Maybe using a list of (key, value) pairs is more sensible, but it would probably break the BioSQL loader (and be inconsistent with reference objects from the GenBank/EMBL parser). It would be reasonable to add the space. This is a simple change which shouldn't hurt anything.
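The two behaviours discussed above (joining continuation lines with a space, and optionally splitting an RC comment into pairs) can be sketched in plain Python. This is an illustration only, not the Bio.SeqIO or Bio.SwissProt code: both helper names are hypothetical, and the input is assumed to be the text of each line after the two-letter line code and its padding have been stripped.

```python
def join_continuation_lines(lines):
    """Join the text of a multi-line RT/RL/RC field with single spaces."""
    return " ".join(line.strip() for line in lines if line.strip())


def parse_reference_comment(text):
    """Split 'KEY=value; KEY=value;' into a list of (key, value) pairs."""
    pairs = []
    for token in text.split(";"):
        token = token.strip()
        if not token:
            # Skip the empty piece after the final semicolon.
            continue
        key, _, value = token.partition("=")
        pairs.append((key.strip(), value.strip()))
    return pairs


title = join_continuation_lines([
    '"Genome of the host-cell transforming parasite Theileria annulata',
    'compared with T. parva.";',
])
print(title)

comment = join_continuation_lines([
    "STRAIN=C57BL/6J;",
    "TISSUE=Bone marrow, Embryo, Kidney, Liver, Thymus, and Visual cortex;",
])
print(comment)
print(parse_reference_comment(comment))
```

Keeping the comment as one joined string matches the option of simply adding the space, while the second helper shows how the (key, value) form could still be derived from that string on demand. The split assumes that values never contain a semicolon, which this sketch does not check.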
Peter From bugzilla-daemon at portal.open-bio.org Mon Jun 1 10:19:04 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 1 Jun 2009 06:19:04 -0400 Subject: [Biopython-dev] [Bug 2841] SeqFeature constructor ignores qualifiers and sub_features arguments In-Reply-To: Message-ID: <200906011019.n51AJ4vf018935@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2841 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |chapmanb at 50mail.com ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-01 06:19 EST ------- Good point Nick. That's a change Brad Chapman (CC'd) made a long time ago - I'd be impressed if he could recall the details now. How about we make the code issue a deprecation warning if these "dummy" arguments are used? Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 1 15:18:36 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 1 Jun 2009 11:18:36 -0400 Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from BioSQL, trying to save it to another database fails with loader db_loader.load_seqrecord in _load_reference In-Reply-To: Message-ID: <200906011518.n51FIad7009515@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2840 ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-01 11:18 EST ------- Hi David, I was able to reproduce this problem. When working on Bug 2838, as my test case I was using just the file cor6_6.gb which by chance has simple reference locations - and that worked. I have now tested with GI 28804743. 
Also, using some of the other GenBank files in our test suites also shows the reference location problem from BioSQL/Loader.py function _load_reference: ValueError: invalid literal for int() with base 10: 'None' This is now fixed in CVS, plus there are now additional unit tests. For the fix, I have used a slight variation of Cymon's patch. Does this look sensible Cymon? BioSQL/BioSeq.py revision: 1.37 Tests/test_BioSQL.py revision: 1.39 Tests/seq_tests_common.py revision: 1.2 If you could retest with a clean checkout from CVS/github, to confirm the problem is fixed, that would be great David. Note - currently in BioSQL we only store one reference location, while GenBank files can have a single reference covering multiple regions of the record. This is a limitation of the current BioSQL schema (although it would be interesting to see how BioPerl deals with this). Note - there are four known failures in test_BioSQL.py right now, a mixed strand feature in NC_000932.gb (which triggers two failures), the project cross reference in NC_005816.gb, and a sub-feature location reference in one_of.gb -- these are all unrelated to this issue (Bug 2840). Thanks, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From tiagoantao at gmail.com Mon Jun 1 18:14:52 2009 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 1 Jun 2009 19:14:52 +0100 Subject: [Biopython-dev] Biopython BOF at BOSC Message-ID: <6d941f120906011114n13c4936ak15d1859c2294be4@mail.gmail.com> Hi, Is there a final schedule for the Biopython BOF? I am trying to book planes and hotel and would be nice to have a ballpark idea if possible. 
Many thanks, Tiago -- "A man who dares to waste one hour of time has not discovered the value of life" - Charles Darwin From p.j.a.cock at googlemail.com Mon Jun 1 19:19:11 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 1 Jun 2009 20:19:11 +0100 Subject: [Biopython-dev] Biopython BOF at BOSC In-Reply-To: <6d941f120906011114n13c4936ak15d1859c2294be4@mail.gmail.com> References: <6d941f120906011114n13c4936ak15d1859c2294be4@mail.gmail.com> Message-ID: <320fb6e00906011219u6a549153o28c0d2b4bb4ad05@mail.gmail.com> 2009/6/1 Tiago Antão : > Hi, > > Is there a final schedule for the Biopython BOF? I am trying to book > planes and hotel and would be nice to have a ballpark idea if > possible. > > Many thanks, > Tiago According to http://www.open-bio.org/wiki/BOSC_2009 the schedule will be posted today (1 June 2009), but thus far I don't see any timetable with the BOF sessions included. See http://www.open-bio.org/wiki/BOSC_2009_Schedule which does have the talk titles. Based on last year's page http://www.open-bio.org/wiki/BOSC_2008_schedule and my emails with the organisers I expect the BOF sessions to again be on both the Saturday and the Sunday from about 4:30 till about 6:00 (or for as long as we can keep going?). So, if you are going to fly back Sunday night, try and stay till at least 6pm? I have signed up to the "Discover Stockholm Orienteering Event & Icebreaker" at 6pm on the Saturday. With hindsight this might have been a mistake (as it may oblige me to leave a coding session prematurely), but on the other hand, I expect we'll all need a break at that stage anyway.
;) Peter From bugzilla-daemon at portal.open-bio.org Mon Jun 1 19:45:56 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 1 Jun 2009 15:45:56 -0400 Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from BioSQL, trying to save it to another database fails with loader db_loader.load_seqrecord in _load_reference In-Reply-To: Message-ID: <200906011945.n51JjuB3001445@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2840 ------- Comment #7 from cymon.cox at gmail.com 2009-06-01 15:45 EST ------- (In reply to comment #6) > This is now fixed in CVS, plus there are now additional unit tests. For the > fix, I have used a slight variation of Cymon's patch. Does this look sensible > Cymon? Works for me. C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From chapmanb at 50mail.com Mon Jun 1 22:14:52 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 1 Jun 2009 18:14:52 -0400 Subject: [Biopython-dev] Biopython BOF at BOSC In-Reply-To: <320fb6e00906011219u6a549153o28c0d2b4bb4ad05@mail.gmail.com> References: <6d941f120906011114n13c4936ak15d1859c2294be4@mail.gmail.com> <320fb6e00906011219u6a549153o28c0d2b4bb4ad05@mail.gmail.com> Message-ID: <20090601221452.GG15913@sobchak.mgh.harvard.edu> Peter and Tiago; > > Is there a final schedule for the Biopython BOF? > According to http://www.open-bio.org/wiki/BOSC_2009 the schedule will > be posted today (1 June 2009), but thus far I don't see any timetable > with the BOF sessions included. See > http://www.open-bio.org/wiki/BOSC_2009_Schedule which does have the > talk titles. 
> > Based on last year's page > http://www.open-bio.org/wiki/BOSC_2008_schedule and my emails with the > organisers I expect the BOF sessions to again be on both the Saturday > and the Sunday from about 4:30 till about 6:00 (or for as long as we > can keep going?). So, if you are going to fly back Sunday night, try > and stay till at least 6pm? I arrive in Sweden Thursday morning and leave Tuesday morning. Beyond the BOSC talks and BoF sessions Peter mentioned, I am up for some coding any other time people will be around. I suspect the best turnouts will be Saturday and Sunday evenings when people are around and can be kept motivated from a day of discussions. Brad From bugzilla-daemon at portal.open-bio.org Mon Jun 1 22:42:05 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 1 Jun 2009 18:42:05 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200906012242.n51Mg5SO014894@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 ------- Comment #15 from cymon.cox at gmail.com 2009-06-01 18:42 EST ------- Here's an attempt to circumvent the RULES in the BioSQL schema on PostgreSQL; it makes a check for the presence of the RULES, and if they are present insures that the record is injected in to the bioentry table else raises an IntegrityError. A further problem arose with one of the unittest: ====================================================================== FAIL: Make sure can't reimport existing records. 
---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 474, in test_reload err.__class__.__name__ + "\n" + str(err)) AssertionError: OperationalError currval of sequence "bioentry_pk_seq" is not yet defined in this session ---------------------------------------------------------------------- This was seemingly unrelated to the RULES issue and is a consequence of how PostgreSQL handles sequences in sessions: because a new session was started (ie and new suite in the unit test) and the record failed to inject although the RULES were returning a INSERT 0,0 the bioentry_pk_seq was not incremented and when adaptor.last_id was called (actually looks for the curr_val of the sequence) it raised an OperationalError because the next_val() had not been called so far in the session. Adding a manual call to the next_val() in the unittest before trying the load ensures that the unittest fails where expected. (At least I think that is what is happening). 
diff --git a/BioSQL/BioSeqDatabase.py b/BioSQL/BioSeqDatabase.py index 3f58e9c..89d0d99 100644 --- a/BioSQL/BioSeqDatabase.py +++ b/BioSQL/BioSeqDatabase.py @@ -330,6 +332,14 @@ class BioSeqDatabase: self.adaptor = adaptor self.name = name self.dbid = self.adaptor.fetch_dbid_by_dbname(name) + + ##Test for presence of RULES in schema + self.postgres_rules_present= False + if "psycopg" in self.adaptor.conn.__class__.__module__: + sql = r"SELECT ev_class FROM pg_rewrite WHERE rulename='rule_bioentry_i1'" + if self.adaptor.execute_and_fetchall(sql): + self.postgres_rules_present = True + def __repr__(self): return "BioSeqDatabase(%r, %r)" % (self.adaptor, self.name) @@ -439,5 +449,14 @@ class BioSeqDatabase: num_records = 0 for cur_record in record_iterator : num_records += 1 + if self.postgres_rules_present: + self.adaptor.execute("SELECT count(bioentry_id) FROM bioentry") + curr_val = self.adaptor.cursor.fetchone()[0] db_loader.load_seqrecord(cur_record) + if self.postgres_rules_present: + self.adaptor.execute("SELECT count(bioentry_id) FROM bioentry") + after_val = self.adaptor.cursor.fetchone()[0] + if curr_val == after_val: + raise self.adaptor.conn.IntegrityError("Duplicate record " + "detected: record has not been inserted") return num_records diff --git a/Tests/test_BioSQL.py b/Tests/test_BioSQL.py index 334fe52..bf17ba7 --- a/Tests/test_BioSQL.py +++ b/Tests/test_BioSQL.py @@ -581,6 +581,13 @@ class InDepthLoadTest(unittest.TestCase): self.assertEqual(db_record.name, record.name) self.assertEqual(db_record.description, record.description) self.assertEqual(str(db_record.seq), str(record.seq)) + + #We have to manually advance the sequence because when the repeat load + #of the record fails and returns INSERT 0,0 because of the RULES the call + #to get the last_id causes an OperationalError because the curr_val hasnt + #been defined for the session ie. next_val() hasnt been called + self.db.adaptor.execute(r"select nextval('bioentry_pk_seq')") + #Good... 
now try reloading it! try : count = self.db.load([record]) Yeah, its nasty, but I thought I'd put it out there for consideration... cymon at gyra:~/git/github-master/Tests$ python ./test_BioSQL.py GenBank file to BioSQL and back to a GenBank file, NC_000932. ... FAIL GenBank file to BioSQL and back to a GenBank file, NC_005816. ... FAIL GenBank file to BioSQL and back to a GenBank file, NT_019265. ... ok GenBank file to BioSQL and back to a GenBank file, arab1. ... ok GenBank file to BioSQL and back to a GenBank file, cor6_6. ... ok GenBank file to BioSQL and back to a GenBank file, noref. ... ok GenBank file to BioSQL and back to a GenBank file, one_of. ... FAIL GenBank file to BioSQL and back to a GenBank file, protein_refseq2. ... ok Make sure can't import records with same ID (in one go). ... ok Make sure can't import a single record twice (in one go). ... ok Make sure can't import a single record twice (in steps). ... ok Make sure all records are correctly loaded. ... ok Make sure can't reimport existing records. ... ok Indepth check that SeqFeatures are transmitted through the db. ... ok Make sure can load record into another namespace. ... ok Load SeqRecord objects into a BioSQL database. ... ok Get a list of all items in the database. ... ok Test retrieval of items using various ids. ... ok Check can add DBSeq objects together. ... ok Check can turn a DBSeq object into a Seq or MutableSeq. ... ok Make sure Seqs from BioSQL implement the right interface. ... ok Check SeqFeatures of a sequence. ... ok Make sure SeqRecords from BioSQL implement the right interface. ... ok Check that slices of sequences are retrieved properly. ... ok GenBank file to BioSQL, then again to a new namespace, NC_000932. ... FAIL GenBank file to BioSQL, then again to a new namespace, NC_005816. ... ok GenBank file to BioSQL, then again to a new namespace, NT_019265. ... ok GenBank file to BioSQL, then again to a new namespace, arab1. ... 
ok GenBank file to BioSQL, then again to a new namespace, cor6_6. ... ok GenBank file to BioSQL, then again to a new namespace, noref. ... ok GenBank file to BioSQL, then again to a new namespace, one_of. ... ok GenBank file to BioSQL, then again to a new namespace, protein_refseq2. ... ok Cheers, C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 1 23:08:06 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 1 Jun 2009 19:08:06 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200906012308.n51N86DB016766@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 ------- Comment #16 from cymon.cox at gmail.com 2009-06-01 19:08 EST ------- (In reply to comment #15) > @@ -439,5 +449,14 @@ class BioSeqDatabase: > num_records = 0 > for cur_record in record_iterator : > num_records += 1 > + if self.postgres_rules_present: > + self.adaptor.execute("SELECT count(bioentry_id) FROM > bioentry") > + curr_val = self.adaptor.cursor.fetchone()[0] > db_loader.load_seqrecord(cur_record) > + if self.postgres_rules_present: > + self.adaptor.execute("SELECT count(bioentry_id) FROM > bioentry") > + after_val = self.adaptor.cursor.fetchone()[0] > + if curr_val == after_val: > + raise self.adaptor.conn.IntegrityError("Duplicate record " > + "detected: record has not been inserted") > return num_records Actually, I dont think this is going to solve the original problem because the SeqFeatures of the second record will still be inserted over the first record, before the IntegrityError is raised. 
So the check needs to surround line 50 in Loader.py: bioentry_id = self._load_bioentry_table(record) which means passing a postgres_rules_present parameter to Loader.load_seqrecord(). C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jun 2 10:51:34 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 2 Jun 2009 06:51:34 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200906021051.n52ApY7j030845@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 ------- Comment #17 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-02 06:51 EST ------- Cymon (and Andrea), How do you feel about these pragmatic (short term) options: Option (a): At import time, if the rules are present issue a warning, suggesting the user fix their database schema. Option (b): If the user tries to load a record and the rules are present, raise an exception suggesting the user fix their database schema. Either option (a) or (b) would be a problem for anyone trying to use BioPerl and Biopython with a PostgreSQL BioSQL database - but this is still an improvement on the current situation. However, I'm pleased to see you (Cymon) have made some progress towards an ideal situation (until Bug 2839 is fixed), where Biopython could cope with the "evil" rules in the default PostgreSQL schema: Option (c): Even if the rules are present, and a key clash would happen, issue an IntegrityError. The details of how we might do this remain to be resolved... The idea of checking the bioentry table count imposes a performance penalty from this extra query, but also as you note in comment 16 in some ways this is "too late" (a transaction rollback is required).
How do you feel about this simplistic solution?: if the rules are present, before loading a new record, do a query to check to make sure there isn't a duplicate already present, and if there is raise an IntegrityError. There is still a performance penalty from this extra query, but it avoids any issues with having to roll back partial transactions. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From tiagoantao at gmail.com Tue Jun 2 14:22:37 2009 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 2 Jun 2009 15:22:37 +0100 Subject: [Biopython-dev] Biopython BOF at BOSC In-Reply-To: <20090601221452.GG15913@sobchak.mgh.harvard.edu> References: <6d941f120906011114n13c4936ak15d1859c2294be4@mail.gmail.com> <320fb6e00906011219u6a549153o28c0d2b4bb4ad05@mail.gmail.com> <20090601221452.GG15913@sobchak.mgh.harvard.edu> Message-ID: <6d941f120906020722g7d0b2963ka1d6fe42ac8b097f@mail.gmail.com> Hi, On Mon, Jun 1, 2009 at 11:14 PM, Brad Chapman > I arrive in Sweden Thursday morning and leave Tuesday morning. Beyond > the BOSC talks and BoF sessions Peter mentioned, I am up for some coding > any other time people will be around. I suspect the best turnouts > will be Saturday and Sunday evenings when people are around and can Because of a meeting on Thursday (and Friday, which I am skipping) I can only arrive on Friday around 6pm. I am staying at the Rica Talk which seems to be on the same place as the conference. If any of you are around at that evening and want talk/code/have dinner/drink send me an email. I leave on Tuesday. Though I am not on ICMB, I am available to code/discuss/drink/whatever on Monday. 
Regards, Tiago From p.j.a.cock at googlemail.com Tue Jun 2 15:41:35 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 2 Jun 2009 16:41:35 +0100 Subject: [Biopython-dev] Biopython BOF at BOSC In-Reply-To: <6d941f120906020722g7d0b2963ka1d6fe42ac8b097f@mail.gmail.com> References: <6d941f120906011114n13c4936ak15d1859c2294be4@mail.gmail.com> <320fb6e00906011219u6a549153o28c0d2b4bb4ad05@mail.gmail.com> <20090601221452.GG15913@sobchak.mgh.harvard.edu> <6d941f120906020722g7d0b2963ka1d6fe42ac8b097f@mail.gmail.com> Message-ID: <320fb6e00906020841p4ecd5ac9t85b004848deddbed@mail.gmail.com> 2009/6/2 Tiago Ant?o : > Hi, > > On Mon, Jun 1, 2009 at 11:14 PM, Brad Chapman >> I arrive in Sweden Thursday morning and leave Tuesday morning. Beyond >> the BOSC talks and BoF sessions Peter mentioned, I am up for some coding >> any other time people will be around. I suspect the best turnouts >> will be Saturday and Sunday evenings when people are around and can > > Because of a meeting on Thursday (and Friday, which I am skipping) I > can only arrive on Friday around 6pm. I am staying at the Rica Talk > which seems to be on the same place as the conference. If any of you > are around at that evening and want talk/code/have dinner/drink send > me an email. > I leave on Tuesday. Though I am not on ICMB, I am available to > code/discuss/drink/whatever on Monday. So in summary so far: Friday: Brad (and Bartek?) Saturday: Brad, Peter, Tiago and Bartek Sunday: Brad, Peter, Tiago and Bartek Monday: Brad, Peter, Tiago (and Bartek?) Tuesday: Peter (and Bartek?) So BoF sessions on both Saturday and Sunday would be fine :) Also, it looks like we can try and schedule some kind of coding/discussion/dinner session on the Monday as well. There are still some gaps on the ISMB schedule so I'm not quite sure when I might be free during Monday - but the evening should be fine. 
Peter From bugzilla-daemon at portal.open-bio.org Tue Jun 2 15:54:23 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 2 Jun 2009 11:54:23 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200906021554.n52FsMLv023158@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 ------- Comment #18 from cymon.cox at gmail.com 2009-06-02 11:54 EST ------- (In reply to comment #17) > How do you feel about this simplistic solution?: if the rules are present, > before loading a new record, do a query to check to make sure there isn't a > duplicate already present, and if there is raise an IntegrityError. Now thats a much better solution than the way Ive been trying to go... This does the trick: diff --git a/BioSQL/BioSeqDatabase.py b/BioSQL/BioSeqDatabase.py index 3f58e9c..a7a2470 100644 --- a/BioSQL/BioSeqDatabase.py +++ b/BioSQL/BioSeqDatabase.py @@ -330,6 +332,14 @@ class BioSeqDatabase: self.adaptor = adaptor self.name = name self.dbid = self.adaptor.fetch_dbid_by_dbname(name) + + ##Test for presence of RULES in schema + self.postgres_rules_present= False + if "psycopg" in self.adaptor.conn.__class__.__module__: + sql = r"SELECT ev_class FROM pg_rewrite WHERE rulename='rule_bioentry_i1'" + if self.adaptor.execute_and_fetchall(sql): + self.postgres_rules_present = True + def __repr__(self): return "BioSeqDatabase(%r, %r)" % (self.adaptor, self.name) @@ -439,5 +449,11 @@ class BioSeqDatabase: num_records = 0 for cur_record in record_iterator : num_records += 1 + if self.postgres_rules_present: + self.adaptor.execute("SELECT bioentry_id FROM bioentry " + "WHERE identifier = '%s'" % cur_record.id) + if self.adaptor.cursor.fetchone(): + raise self.adaptor.conn.IntegrityError("Duplicate record " + "detected: record has not been inserted") db_loader.load_seqrecord(cur_record) return num_records C. 
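A minimal sketch of the same pre-insert check using bound parameters rather than `%` string interpolation, so that an identifier containing a quote character cannot break the query. This is only an illustration: it uses the standard library `sqlite3` module and a cut-down `bioentry` table so that it runs stand-alone, `check_not_duplicate` is a hypothetical name, and the real code would go through the BioSQL adaptor instead.

```python
import sqlite3


def check_not_duplicate(conn, identifier):
    """Raise IntegrityError if a bioentry with this identifier already exists."""
    row = conn.execute(
        "SELECT bioentry_id FROM bioentry WHERE identifier = ?",
        (identifier,),  # bound parameter, the driver handles the quoting
    ).fetchone()
    if row is not None:
        raise sqlite3.IntegrityError(
            "Duplicate record detected: record has not been inserted"
        )


# Stand-in for the real schema: only the columns this check needs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bioentry (bioentry_id INTEGER PRIMARY KEY, identifier TEXT)"
)

check_not_duplicate(conn, "28804743")  # table is empty, so nothing is raised
conn.execute("INSERT INTO bioentry (identifier) VALUES (?)", ("28804743",))
```

Note that `sqlite3` uses `?` placeholders, whereas psycopg expects `%s` placeholders with the values passed as a separate argument to `execute()`.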
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From tiagoantao at gmail.com Tue Jun 2 16:03:35 2009 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 2 Jun 2009 17:03:35 +0100 Subject: [Biopython-dev] Biopython BOF at BOSC In-Reply-To: <8b34ec180906020859o2468f987lfd1ac00b4ebea898@mail.gmail.com> References: <6d941f120906011114n13c4936ak15d1859c2294be4@mail.gmail.com> <320fb6e00906011219u6a549153o28c0d2b4bb4ad05@mail.gmail.com> <20090601221452.GG15913@sobchak.mgh.harvard.edu> <6d941f120906020722g7d0b2963ka1d6fe42ac8b097f@mail.gmail.com> <320fb6e00906020841p4ecd5ac9t85b004848deddbed@mail.gmail.com> <8b34ec180906020859o2468f987lfd1ac00b4ebea898@mail.gmail.com> Message-ID: <6d941f120906020903j389a046dw6c84f0952fc04279@mail.gmail.com> On Tue, Jun 2, 2009 at 4:59 PM, Bartek Wilczynski wrote: >> Friday: Brad (and Bartek?Late) Tiago late here also. I land in Arlanda at 16.40. 
From bartek at rezolwenta.eu.org Tue Jun 2 15:59:01 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 2 Jun 2009 17:59:01 +0200 Subject: [Biopython-dev] Biopython BOF at BOSC In-Reply-To: <320fb6e00906020841p4ecd5ac9t85b004848deddbed@mail.gmail.com> References: <6d941f120906011114n13c4936ak15d1859c2294be4@mail.gmail.com> <320fb6e00906011219u6a549153o28c0d2b4bb4ad05@mail.gmail.com> <20090601221452.GG15913@sobchak.mgh.harvard.edu> <6d941f120906020722g7d0b2963ka1d6fe42ac8b097f@mail.gmail.com> <320fb6e00906020841p4ecd5ac9t85b004848deddbed@mail.gmail.com> Message-ID: <8b34ec180906020859o2468f987lfd1ac00b4ebea898@mail.gmail.com> 2009/6/2 Peter Cock : > So in summary so far: > > Friday: Brad (and Bartek?Late) > Saturday: Brad, Peter, Tiago and Bartek-Yes > Sunday: Brad, Peter, Tiago and Bartek-Yes > Monday: Brad, Peter, Tiago (and Bartek?No) > Tuesday: Peter (and Bartek?No) > I will arrive on Friday evening (landing on 18.30) and leave on Sunday evening (I should be able to stay at the site till 6pm). > So BoF sessions on both Saturday and Sunday would be fine :) > I should be able to take part in Sat/Sun BOF sessions (but not Monday/Tuesday). cheers Bartek From bugzilla-daemon at portal.open-bio.org Tue Jun 2 17:00:56 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 2 Jun 2009 13:00:56 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200906021700.n52H0uZb029220@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 ------- Comment #19 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-02 13:00 EST ------- (In reply to comment #18) > (In reply to comment #17) > > How do you feel about this simplistic solution?: if the rules are present, > > before loading a new record, do a query to check to make sure there isn't a > > duplicate already present, and if there is raise an IntegrityError. 
> > Now thats a much better solution than the way Ive been trying to go... > > This does the trick: > ... > + if self.postgres_rules_present: > + self.adaptor.execute("SELECT bioentry_id FROM bioentry " > + "WHERE identifier = '%s'" % > cur_record.id) > + if self.adaptor.cursor.fetchone(): > + raise self.adaptor.conn.IntegrityError("Duplicate record " > + "detected: record has not been inserted") While the above code looks sensible, I don't think it covers all the cases yet. Essentially the two bioentry rules relate to these two uniqueness rules in the default schema: UNIQUE ( identifier , biodatabase_id ) UNIQUE ( accession , biodatabase_id , version ) According to rule_bioentry_i1 (or the equivalent rule) we should allow the same bioentry.identifier to appear in different namespaces (i.e. as long as bioentry.biodatabase_id differs). That is, something like this in your code: "SELECT bioentry_id FROM bioentry WHERE identifier = '%s' AND biodatabase_id = %s" % (cur_record.id, self.dbid) Then for rule_bioentry_i2 we also need to check the accession, version and biodatabase_id have not been used before. Both checks could probably be done as a single more complex SQL query. Also, when we check for the rules, do you think we should check for rule_bioentry_i2 as well as rule_bioentry_i1? In principle they will either both be there, or neither. What about the other rules - might they also cause problems in Biopython? Finally, on a code style thing, I'd make postgres_rules_present private, i.e. call it _postgres_rules_present instead. Anyway, in principle it looks like this approach should work :) Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Tue Jun 2 17:25:54 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 2 Jun 2009 13:25:54 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200906021725.n52HPsbi031363@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 ------- Comment #20 from andrea at biodec.com 2009-06-02 13:25 EST ------- (In reply to comment #19) > (In reply to comment #18) > > (In reply to comment #17) > > > How do you feel about this simplistic solution?: if the rules are present, > > > before loading a new record, do a query to check to make sure there isn't a > > > duplicate already present, and if there is raise an IntegrityError. > > > > Now thats a much better solution than the way Ive been trying to go... > > > > This does the trick: > > ... > > + if self.postgres_rules_present: > > + self.adaptor.execute("SELECT bioentry_id FROM bioentry " > > + "WHERE identifier = '%s'" % > > cur_record.id) > > + if self.adaptor.cursor.fetchone(): > > + raise self.adaptor.conn.IntegrityError("Duplicate record " > > + "detected: record has not been inserted") > > While the above code looks sensible, I don't think it covers all the cases yet. > Essentially the two bioentry rules relate to these two uniqueness rules in the > default schema: > > UNIQUE ( identifier , biodatabase_id ) > UNIQUE ( accession , biodatabase_id , version ) What i think... 1) the solution is almost correct 2) but we have for sure to consider both rules because ("i tried") and they work fully independetly.. so we need to check both rules. 3) the unicity is related to the biodatabase, so i can add 2 record with identical accession, or identifier or both... but different biodatabase and this works perfectly. 3) At the end i would like to add also a warning because the presence of the rules cause an overhead into insertion because trigger other queries.... 
(and it could be convenient to inform...) > > According to rule_bioentry_i1 (or the equivalent rule) we should allow the same > bioentry.identifier to appear in different namespaces (i.e. as long as > bioentry.biodatabase_id differs). i.e. something like this in your code: > > "SELECT bioentry_id FROM bioentry WHERE identifier = '%s' AND biodatabase_id = > %s" % (cur_record.id, self.dbid) > > Then for rule_bioentry_i2 we also need to check the accession, version and > biodatabase_id have not been used before. sure > > Both checks could probably be done as a single more complex SQL query. "SELECT bioentry_id FROM bioentry WHERE (identifier = '%s' AND biodatabase_id = %s) OR (accession = '%s' AND version = '%s' AND biodatabase_id = %s)" So if either of the two (or both) matches, you get a bioentry_id back and you would hit the problem. > > Also, when we check for the rules, do you think we should check for > rule_bioentry_i2 as well as rule_bioentry_i1? In principle they will either > both be there, or neither. What about the other rules - might they also cause > problems in Biopython? both... it's fully the same. You have the same problem on accession, version, biodatabase_id > > Finally, on a code style thing, I'd make postgres_rules_present private, i.e. > call it _postgres_rules_present instead. Anyway, in principle it looks like > this approach should work :) ok > > Peter > thanks andrea -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
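The single combined query suggested in comment #20 could be sketched roughly as below, covering both UNIQUE constraints and scoping each to the namespace (biodatabase_id). This is an assumption-laden illustration, not the committed fix: it runs against an in-memory `sqlite3` table with only the relevant columns, `find_clash` is a hypothetical helper, and the identifier and accession values are made up.

```python
import sqlite3

# One query for both constraints; each branch is limited to the same namespace.
DUPLICATE_SQL = (
    "SELECT bioentry_id FROM bioentry "
    "WHERE (identifier = ? AND biodatabase_id = ?) "
    "OR (accession = ? AND version = ? AND biodatabase_id = ?)"
)


def find_clash(conn, dbid, identifier, accession, version):
    """Return the bioentry_id of a clashing row, or None if there is none."""
    row = conn.execute(
        DUPLICATE_SQL, (identifier, dbid, accession, version, dbid)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bioentry (bioentry_id INTEGER PRIMARY KEY, "
    "biodatabase_id INTEGER, identifier TEXT, accession TEXT, version INTEGER)"
)
# Made-up record in namespace 1.
conn.execute(
    "INSERT INTO bioentry (biodatabase_id, identifier, accession, version) "
    "VALUES (1, 'id-1', 'ACC1', 1)"
)

same_identifier = find_clash(conn, 1, "id-1", "ACC9", 1)
same_accession_version = find_clash(conn, 1, "id-9", "ACC1", 1)
other_namespace = find_clash(conn, 2, "id-1", "ACC1", 1)
new_version = find_clash(conn, 1, "id-9", "ACC1", 2)
```

A loader could raise an IntegrityError whenever `find_clash` returns a value, before any SeqFeature rows are written, which avoids the partial-transaction problem raised in comment #16.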
From bugzilla-daemon at portal.open-bio.org Tue Jun 2 17:58:33 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 2 Jun 2009 13:58:33 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200906021758.n52HwXM7001408@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 ------- Comment #21 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-02 13:58 EST ------- (In reply to comment #20) > > What i think... > 1) the solution is almost correct > 2) but we have for sure to consider both rules because ("i tried") and > they work fully independetly.. so we need to check both rules. It would be odd for someone to delete one rule but not the other. But yes, we should test for both. > 3) the unicity is related to the biodatabase, so i can add 2 record with > identical accession, or identifier or both... but different biodatabase > and this works perfectly. Good. > 3) At the end i would like to add also a warning because the presence > of the rules cause an overhead into insertion because trigger other > queries.... (and it could be convenient to inform...) Yes, having a warning (even if Biopython can be made to cope with the rules) seems sensible. I've just updated CVS to check for either of the bioentry rules and issue a warning (based on Cymon's patch). Adding the work around with the extra query would be the next step (at which point the warning text would need updating). Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Jun 2 18:15:03 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 2 Jun 2009 14:15:03 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200906021815.n52IF3On002717@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 ------- Comment #22 from andrea at biodec.com 2009-06-02 14:15 EST ------- (In reply to comment #21) > (In reply to comment #20) > > > > What i think... > > 1) the solution is almost correct > > 2) but we have for sure to consider both rules because ("i tried") and > > they work fully independetly.. so we need to check both rules. > > It would be odd for someone to delete one rule but not the other. But yes, we > should test for both. > very odd... and highly improbable.. but possible, and it could give rise to "noisy bugs" in future (i will slowly forget many of the things we are speaking about) > > 3) the unicity is related to the biodatabase, so i can add 2 record with > > identical accession, or identifier or both... but different biodatabase > > and this works perfectly. > > Good. > > > 3) At the end i would like to add also a warning because the presence > > of the rules cause an overhead into insertion because trigger other > > queries.... (and it could be convenient to inform...) > > Yes, having a warning (even if Biopython can be made to cope with the rules) > seems sensible. Also, the warning should say something about the performance issues.... > > I've just updated CVS to check for either of the bioentry rules and issue a > warning (based on Cymon's patch). Adding the work around with the extra query > would be the next step (at which point the warning text would need updating).
> > Peter > Thanks andrea -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From cy at cymon.org Tue Jun 2 19:39:18 2009 From: cy at cymon.org (Cymon Cox) Date: Tue, 2 Jun 2009 20:39:18 +0100 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: <200906021700.n52H0uZb029220@portal.open-bio.org> References: <200906021700.n52H0uZb029220@portal.open-bio.org> Message-ID: <7265d4f0906021239n2c9b71f9l3a69d362bbd66183@mail.gmail.com> 2009/6/2 > http://bugzilla.open-bio.org/show_bug.cgi?id=2833 > > > > > > ------- Comment #19 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-02 13:00 EST ------- > (In reply to comment #18) > > (In reply to comment #17) > > > How do you feel about this simplistic solution?: if the rules are > present, > > > before loading a new record, do a query to check to make sure there > isn't a > > > duplicate already present, and if there is raise an IntegrityError. > > > > Now thats a much better solution than the way Ive been trying to go... > > > > This does the trick: > > ... > > + if self.postgres_rules_present: > > + self.adaptor.execute("SELECT bioentry_id FROM bioentry " > > + "WHERE identifier = '%s'" % > > cur_record.id) > > + if self.adaptor.cursor.fetchone(): > > + raise self.adaptor.conn.IntegrityError("Duplicate > record " > > + "detected: record has not been inserted") > > While the above code looks sensible, I don't think it covers all the cases > yet. > Essentially the two bioentry rules relate to these two uniqueness rules in > the > default schema: > > UNIQUE ( identifier , biodatabase_id ) > UNIQUE ( accession , biodatabase_id , version ) > > According to rule_bioentry_i1 (or the equivalent rule) we should allow the > same > bioentry.identifier to appear in different namespaces (i.e. as long as > bioentry.biodatabase_id differs). i.e. 
something like this in your code: > > "SELECT bioentry_id FROM bioentry WHERE identifier = '%s' AND biodatabase_id > = > %s" % (cur_record.id, self.dbid) > > Then for rule_bioentry_i2 we also need to check the accession, version and > biodatabase_id have not been used before. In principle, we should only have to check for the second case (accession, biodatabase_id, version) because the GenBank "gi numbers" (i.e. the identifier number) parallel the accession.version scheme. When a record changes, both the gi number changes and the version number is incremented. Hence, a unique accession.version implies a unique identifier. In the schema, the identifier can be NULL, presumably so that non-GenBank data can be stored provided it has a unique accession.version. If we were only to check case 2 (accession, biodatabase_id, version) the only way I can see to trigger the RULES bug would be to manually assign two different accession.version values to two records but assign the same (presumably artificial) identifier number to both via record.annotations["gi"]. So, how likely is that? Well, not very, but perhaps we need to check both ;) Perhaps we need to first define some unit tests covering all the permutations, because the code I submitted doesn't trigger any errors in the current suite. Cheers, C. 
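The two-part duplicate check discussed in this message can be sketched as below. This is only an illustration, not the patch that was attached to the bug: the helper names find_duplicate and make_demo_db are hypothetical, the table is a cut-down stand-in for the BioSQL bioentry table, and an in-memory SQLite database is used purely so the sketch runs without a PostgreSQL server. It uses bound parameters rather than string interpolation, which sidesteps the quoting problem in the SELECT statement quoted above. Note that sqlite3 uses "?" placeholders, whereas psycopg2 would use "%s".

```python
import sqlite3


def find_duplicate(cursor, dbid, identifier, accession, version):
    """Return the bioentry_id of a clashing row, or None if there is none.

    Mirrors the two uniqueness constraints quoted in the thread:
      UNIQUE (identifier, biodatabase_id)
      UNIQUE (accession, biodatabase_id, version)
    """
    # Check 1: same identifier within the same namespace (biodatabase).
    # A NULL identifier never matches here, as with a UNIQUE constraint.
    cursor.execute(
        "SELECT bioentry_id FROM bioentry "
        "WHERE biodatabase_id = ? AND identifier = ?",
        (dbid, identifier),
    )
    row = cursor.fetchone()
    if row is None:
        # Check 2: same accession and version within the same namespace.
        cursor.execute(
            "SELECT bioentry_id FROM bioentry "
            "WHERE biodatabase_id = ? AND accession = ? AND version = ?",
            (dbid, accession, version),
        )
        row = cursor.fetchone()
    return row[0] if row else None


def make_demo_db():
    """Build a tiny in-memory stand-in for the bioentry table (hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bioentry ("
        " bioentry_id INTEGER PRIMARY KEY,"
        " biodatabase_id INTEGER NOT NULL,"
        " identifier TEXT,"
        " accession TEXT NOT NULL,"
        " version INTEGER NOT NULL)"
    )
    conn.execute(
        "INSERT INTO bioentry (biodatabase_id, identifier, accession, version) "
        "VALUES (1, '12345', 'NC_000932', 1)"
    )
    return conn
```

A loader could call find_duplicate() before each insert and raise the driver's IntegrityError when it returns a value, which is the behaviour the thread is aiming for when the PostgreSQL rules suppress the normal constraint error.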
From cy at cymon.org Tue Jun 2 20:29:06 2009 From: cy at cymon.org (Cymon Cox) Date: Tue, 2 Jun 2009 21:29:06 +0100 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: <7265d4f0906021239n2c9b71f9l3a69d362bbd66183@mail.gmail.com> References: <200906021700.n52H0uZb029220@portal.open-bio.org> <7265d4f0906021239n2c9b71f9l3a69d362bbd66183@mail.gmail.com> Message-ID: <7265d4f0906021329y6b6ca58dv925b7741d1e827d5@mail.gmail.com> 2009/6/2 Cymon Cox > 2009/6/2 > > http://bugzilla.open-bio.org/show_bug.cgi?id=2833 >> >> >> >> >> >> ------- Comment #19 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-02 13:00 EST ------- >> (In reply to comment #18) >> > (In reply to comment #17) >> > > How do you feel about this simplistic solution?: if the rules are >> present, >> > > before loading a new record, do a query to check to make sure there >> isn't a >> > > duplicate already present, and if there is raise an IntegrityError. >> > >> > Now thats a much better solution than the way Ive been trying to go... >> > >> > This does the trick: >> > ... >> > + if self.postgres_rules_present: >> > + self.adaptor.execute("SELECT bioentry_id FROM bioentry >> " >> > + "WHERE identifier = '%s'" % >> > cur_record.id) >> > + if self.adaptor.cursor.fetchone(): >> > + raise self.adaptor.conn.IntegrityError("Duplicate >> record " >> > + "detected: record has not been inserted") >> >> While the above code looks sensible, I don't think it covers all the cases >> yet. >> Essentially the two bioentry rules relate to these two uniqueness rules in >> the >> default schema: >> >> UNIQUE ( identifier , biodatabase_id ) >> UNIQUE ( accession , biodatabase_id , version ) >> >> According to rule_bioentry_i1 (or the equivalent rule) we should allow the >> same >> bioentry.identifier to appear in different namespaces (i.e. as long as >> bioentry.biodatabase_id differs). i.e. 
something like this in your code: >> >> "SELECT bioentry_id FROM bioentry WHERE identifier = '%s' AND >> biodatabase_id = >> %s" % (cur_record.id, self.dbid) >> >> Then for rule_bioentry_i2 we also need to check the accession, version and >> biodatabase_id have not been used before. > > > In principle, we should only have to check for the second case (accession, > biodatabase_id, version) because the GenBank "gi numbers" (i.e. the > identifier number) parallel the accession.version scheme. When a record > changes, both the gi number changes and the version number is incremented. > Hence, a unique accession.version implies a unique identifier. In the > schema, the identifier can be NULL, presumably so that non-GenBank data can > be stored provided it has a unique accession.version. If we were only to > check case 2 (accession, biodatabase_id, version) the only way I can see to > trigger the RULES bug would be to manually assign two different > accession.version values to two records but assign the same (presumably artificial) > identifier number to both via record.annotations["gi"]. > Whoa, I see now that in Loader._load_bioentry_table, if record.annotations["gi"] is missing, the identifier gets filled with the accession.version: if "gi" in record.annotations : identifier = record.annotations["gi"] else : identifier = record.id So Biopython's BioSQL identifiers are not equivalent to GenBank identifiers. I wonder why this is done and the identifier is not just left NULL, with the unique constraint maintained by accession/version... Cheers, C. 
-- From biopython at maubp.freeserve.co.uk Wed Jun 3 12:54:40 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 3 Jun 2009 13:54:40 +0100 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: <7265d4f0906021329y6b6ca58dv925b7741d1e827d5@mail.gmail.com> References: <200906021700.n52H0uZb029220@portal.open-bio.org> <7265d4f0906021239n2c9b71f9l3a69d362bbd66183@mail.gmail.com> <7265d4f0906021329y6b6ca58dv925b7741d1e827d5@mail.gmail.com> Message-ID: <320fb6e00906030554p6daa59a3u54bfaade81c07d5f@mail.gmail.com> On Tue, Jun 2, 2009 at 9:29 PM, Cymon Cox wrote: > > Whoa, I see now that in Loader._load_bioentry_table that if the > rec.annotations["gi"] is missing, it gets filled with the accession.version: > >     if "gi" in record.annotations : >         identifier = record.annotations["gi"] >     else : >         identifier = record.id > > So biopythons BioSQL identifiers are not equivalent to GenBank identifiers. > I wonder why this is done and identifier is not just left NULL, and the > unique constraint maintained by accession/version... > Remember, it isn't just GenBank files that get imported into BioSQL. While the record.id is the accession.version when loading a GenBank file, this is not the case in general. Consulting the CVS log, this was changed in BioSQL/Loader.py revision 1.33 to cope with loading a FASTA file into a BioSQL database (Bug 2425). Presumably I was trying to mimic the BioPerl loading of FASTA files. Before this change, the bioentry.identifier was taken as the GI number if available. i.e. This change wasn't anything directly to do with the uniqueness rules. 
Peter From bugzilla-daemon at portal.open-bio.org Wed Jun 3 14:39:19 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jun 2009 10:39:19 -0400 Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from BioSQL, trying to save it to another database fails with loader db_loader.load_seqrecord in _load_reference In-Reply-To: Message-ID: <200906031439.n53EdJon000576@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2840 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-03 10:39 EST ------- (In reply to comment #6) > Note - there are four known failures in test_BioSQL.py right now, a mixed > strand feature in NC_000932.gb (which triggers two failures), the project > cross reference in NC_005816.gb, and a sub-feature location reference in > one_of.gb -- these are all unrelated to this issue (Bug 2840). Just to note these are now fixed in CVS - they were mostly to do with writing GenBank files with Bio.SeqIO (I was testing writing using DBSeqRecord objects pulled out of BioSQL). (In reply to comment #7) > (In reply to comment #6) > > This is now fixed in CVS, plus there are now additional unit tests. For > > the fix, I have used a slight variation of Cymon's patch. Does this look > > sensible Cymon? > > Works for me. > C. Great - marking this bug as fixed. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From cy at cymon.org Wed Jun 3 14:52:16 2009 From: cy at cymon.org (Cymon Cox) Date: Wed, 3 Jun 2009 15:52:16 +0100 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: <320fb6e00906030554p6daa59a3u54bfaade81c07d5f@mail.gmail.com> References: <200906021700.n52H0uZb029220@portal.open-bio.org> <7265d4f0906021239n2c9b71f9l3a69d362bbd66183@mail.gmail.com> <7265d4f0906021329y6b6ca58dv925b7741d1e827d5@mail.gmail.com> <320fb6e00906030554p6daa59a3u54bfaade81c07d5f@mail.gmail.com> Message-ID: <7265d4f0906030752l64e48498v92fd18a80d20658c@mail.gmail.com> 2009/6/3 Peter > On Tue, Jun 2, 2009 at 9:29 PM, Cymon Cox wrote: > > > > Whoa, I see now that in Loader._load_bioentry_table that if the > > rec.annotations["gi"] is missing, it gets filled with the > accession.version: > > > > if "gi" in record.annotations : > > identifier = record.annotations["gi"] > > else : > > identifier = record.id > > > > So biopythons BioSQL identifiers are not equivalent to GenBank > identifiers. > > I wonder why this is done and identifier is not just left NULL, and the > > unique constraint maintained by accession/version... > > > > Remember, it isn't just GenBank files that get imported into BioSQL. > While the record.id is the accession.version when loading a GenBank > file, this is not the case in general. > > Consulting the CVS log, this was changed in BioSQL/Loader.py revision > 1.33 to cope with loading a FASTA file into a BioSQL database (Bug > 2425). Presumably I was trying to mimic the BioPerl loading of FASTA > files. Before this change, the bioentry.identifier was taken as the GI > number if available. > > i.e. This change wasn't anything directly to do with the uniqueness rules. Thanks Peter. Yes, it seems to have been done to mimic BioPerl - but I'm still curious as to why it is done at all... Anyway, I seem to be chasing my tail here: http://bugzilla.open-bio.org/show_bug.cgi?id=2681#c5 Cheers, C. 
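The identifier fallback quoted in this exchange can be restated as a small standalone function. This is a sketch of the logic as quoted from Loader._load_bioentry_table, not the loader code itself: the function name choose_identifier is hypothetical, and it takes a plain annotations dictionary and a record id so that it runs without Biopython installed.

```python
def choose_identifier(annotations, record_id):
    """Pick the value stored in bioentry.identifier for a record.

    Follows the logic quoted in the thread: use the GenBank GI number
    when the record carries one, otherwise fall back to the record id
    (for example the id parsed from a FASTA header).
    """
    if "gi" in annotations:
        return annotations["gi"]
    return record_id
```

For a GenBank record with a GI number the identifier is that number; for a FASTA record with no "gi" annotation the identifier is whatever record.id holds, which is why the identifier column is not guaranteed to be a GenBank identifier.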
-- From bugzilla-daemon at portal.open-bio.org Wed Jun 3 16:29:18 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jun 2009 12:29:18 -0400 Subject: [Biopython-dev] [Bug 2806] Possible deadlock (hang) in Bio.Application using subprocess wait() In-Reply-To: Message-ID: <200906031629.n53GTIpD017690@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2806 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-03 12:29 EST ------- Since this bug was filed, the Bio.Application.generic_run function is now used in test_Emboss.py and the new alignment tool wrapper unit tests. None of these have shown a deadlock problem, but I have applied this fix anyway as a precaution. See Bio/Application/__init__.py revision 1.21 in CVS. Marking as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Wed Jun 3 16:36:40 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jun 2009 12:36:40 -0400 Subject: [Biopython-dev] [Bug 2826] SeqRecord dbxrefs not written to GenBank by SeqIO In-Reply-To: Message-ID: <200906031636.n53Gaenk018729@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2826 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-03 12:36 EST ------- As of Bio/SeqIO/InsdcIO.py CVS revision 1.15, a SeqRecord's dbxrefs are recorded under the DBLINK lines when writing a GenBank file with Bio.SeqIO. Note that the code does not (currently) restrict this to the two database cross references the NCBI currently use for this field, e.g. "Project:28471" and "Trace Assembly Archive:123456", anything in the dbxrefs list is recorded. Marking this bug as fixed. Note if you want to test this, a clean CVS/git checkout would be advised (rather than trying to update individual files only). Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Wed Jun 3 16:37:48 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jun 2009 12:37:48 -0400 Subject: [Biopython-dev] [Bug 2817] Meta-bug for cleanup once we drop Python 2.3 support In-Reply-To: Message-ID: <200906031637.n53GbmI2018929@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2817 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-03 12:37 EST ------- I think the obvious things are done in terms of removing Python 2.3 specific code. Marking as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jun 3 20:59:13 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jun 2009 16:59:13 -0400 Subject: [Biopython-dev] [Bug 2848] New: SeqIO fastq routines reject valid quality socres Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2848 Summary: SeqIO fastq routines reject valid quality socres Product: Biopython Version: 1.50 Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: pmmagic at gmail.com The fastq routines in SeqIO.QualityIO reject what I believe are valid quality scores. According to the MAQ website (http://maq.sourceforge.net/fastq.shtml; I don't know if this is definitive), valid quality values in Sanger style FASTQ format are: <qual> := [!-~\n]+ This corresponds to Phred quality scores in the range 0-93. The current code in Biopython 1.50 rejects quality scores > 90. 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jun 3 21:39:07 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jun 2009 17:39:07 -0400 Subject: [Biopython-dev] [Bug 2848] SeqIO fastq routines reject valid quality socres In-Reply-To: Message-ID: <200906032139.n53Ld7QK011962@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2848 ------- Comment #1 from sbassi at gmail.com 2009-06-03 17:39 EST ------- It seems that this is intended; look at the module docs: The PHRED software reads DNA sequencing trace files, calls bases, and assigns a quality value between 0 and 90 to each called base using a logged transformation of the error probability, Q = -10 log10( Pe ), for example:: Pe = 1.0, Q = 0 Pe = 0.1, Q = 10 Pe = 0.01, Q = 20 ... Pe = 0.00000001, Q = 80 Pe = 0.000000001, Q = 90 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
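The relationship between error probability, PHRED score and the Sanger FASTQ quality character can be worked through in a few lines. This is a minimal sketch, not the Bio.SeqIO.QualityIO code: the function names and the SANGER_OFFSET constant are hypothetical, the ASCII offset of 33 is the usual Sanger convention, and the 0 to 93 range is the one the bug report derives from the printable characters "!" to "~".

```python
import math

# Assumed Sanger-style encoding: quality character = chr(Q + 33).
SANGER_OFFSET = 33
MAX_SANGER_PHRED = ord("~") - SANGER_OFFSET  # highest printable character


def phred_from_error(pe):
    """Apply Q = -10 * log10(Pe) for an error probability 0 < Pe <= 1."""
    return -10.0 * math.log10(pe)


def sanger_char_to_phred(char):
    """Decode one Sanger FASTQ quality character to an integer PHRED score."""
    q = ord(char) - SANGER_OFFSET
    if not 0 <= q <= MAX_SANGER_PHRED:
        raise ValueError("Quality character %r is outside the range '!' to '~'" % char)
    return q
```

With these definitions an error probability of 1.0 maps to Q = 0 and 0.1 maps to Q = 10, while the characters "!" and "~" decode to the two ends of the range. A parser that caps scores at 90 would reject the characters "|", "}" and "~".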
From bugzilla-daemon at portal.open-bio.org Wed Jun 3 21:51:16 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jun 2009 17:51:16 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200906032151.n53LpGCT012916@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 ------- Comment #23 from cymon.cox at gmail.com 2009-06-03 17:51 EST ------- Created an attachment (id=1319) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1319&action=view) PostgreSQL BioSQL Rules workaround -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jun 3 22:02:22 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jun 2009 18:02:22 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200906032202.n53M2MMb014177@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 ------- Comment #24 from cymon.cox at gmail.com 2009-06-03 18:02 EST ------- I've added a patch against biopython on GitHub. I hope it addresses all the points made so far... it now passes all tests in test_BioSQL.py (although I've not added more). One thing we've not yet discussed is the other PostgreSQL driver PyGresql. It appears that the project is still active and I was able to apt-get an Ubuntu package. It failed the tests miserably because it doesn't support autocommit(). Even if it can be made to work it will obviously be prone to the RULES issue. Presumably, no one is actually using PyGresql (or at least hasn't updated biopython for some time). I'll open a bug. 
Also, Ive added a create_database() in a setUp() to the ClosedLoopTest unittest case because if this suite is called first (as it is for me - what actually governs which unittests are called first?) then if a test database is missing the suite is going to fail. C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jun 3 22:12:31 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jun 2009 18:12:31 -0400 Subject: [Biopython-dev] [Bug 2849] New: PyGresql PostgreSQL driver support for BioSQL is broken Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2849 Summary: PyGresql PostgreSQL driver support for BioSQL is broken Product: Biopython Version: Not Applicable Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: BioSQL AssignedTo: biopython-dev at biopython.org ReportedBy: cymon.cox at gmail.com cymon at gyra:~/git/github-master/Tests$ python test_BioSQL.py GenBank file to BioSQL and back to a GenBank file, NC_000932. ... ERROR GenBank file to BioSQL and back to a GenBank file, NC_005816. ... ERROR GenBank file to BioSQL and back to a GenBank file, NT_019265. ... ERROR GenBank file to BioSQL and back to a GenBank file, arab1. ... ERROR GenBank file to BioSQL and back to a GenBank file, cor6_6. ... ERROR GenBank file to BioSQL and back to a GenBank file, noref. ... ERROR GenBank file to BioSQL and back to a GenBank file, one_of. ... ERROR GenBank file to BioSQL and back to a GenBank file, protein_refseq2. ... ERROR Make sure can't import records with same ID (in one go). ... ERROR Make sure can't import a single record twice (in one go). ... ERROR Make sure can't import a single record twice (in steps). ... ERROR Make sure all records are correctly loaded. ... 
ERROR Make sure can't reimport existing records. ... ERROR Indepth check that SeqFeatures are transmitted through the db. ... ERROR Make sure can load record into another namespace. ... ERROR Load SeqRecord objects into a BioSQL database. ... ERROR Get a list of all items in the database. ... ERROR Test retrieval of items using various ids. ... ERROR Check can add DBSeq objects together. ... ERROR Check can turn a DBSeq object into a Seq or MutableSeq. ... ERROR Make sure Seqs from BioSQL implement the right interface. ... ERROR Check SeqFeatures of a sequence. ... ERROR Make sure SeqRecords from BioSQL implement the right interface. ... ERROR Check that slices of sequences are retrieved properly. ... ERROR GenBank file to BioSQL, then again to a new namespace, NC_000932. ... ERROR GenBank file to BioSQL, then again to a new namespace, NC_005816. ... ERROR GenBank file to BioSQL, then again to a new namespace, NT_019265. ... ERROR GenBank file to BioSQL, then again to a new namespace, arab1. ... ERROR GenBank file to BioSQL, then again to a new namespace, cor6_6. ... ERROR GenBank file to BioSQL, then again to a new namespace, noref. ... ERROR GenBank file to BioSQL, then again to a new namespace, one_of. ... ERROR GenBank file to BioSQL, then again to a new namespace, protein_refseq2. ... ERROR ====================================================================== ERROR: GenBank file to BioSQL and back to a GenBank file, NC_000932. 
---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 409, in test_NC_000932 self.loop(os.path.join(os.getcwd(), "GenBank", "NC_000932.gb"), "gb") File "test_BioSQL.py", line 443, in loop count = db.load(original_records) File "/home/cymon/git/github-master/BioSQL/BioSeqDatabase.py", line 479, in load db_loader.load_seqrecord(cur_record) File "/home/cymon/git/github-master/BioSQL/Loader.py", line 50, in load_seqrecord bioentry_id = self._load_bioentry_table(record) File "/home/cymon/git/github-master/BioSQL/Loader.py", line 559, in _load_bioentry_table bioentry_id = self.adaptor.last_id('bioentry') File "/home/cymon/git/github-master/BioSQL/BioSeqDatabase.py", line 168, in last_id return self.dbutils.last_id(self.cursor, table) File "/home/cymon/git/github-master/BioSQL/DBUtils.py", line 96, in last_id cursor.execute(sql) File "/usr/lib/python2.6/dist-packages/pgdb.py", line 259, in execute self.executemany(operation, (params,)) File "/usr/lib/python2.6/dist-packages/pgdb.py", line 289, in executemany raise DatabaseError("error '%s' in '%s'" % (msg, sql)) DatabaseError: error 'ERROR: currval of sequence "bioentry_pk_seq" is not yet defined in this session ' in 'select currval('bioentry_pk_seq')' ====================================================================== etc, etc, etc... Same error until we get to: ====================================================================== ERROR: Make sure can't import records with same ID (in one go). 
---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 344, in setUp create_database() File "test_BioSQL.py", line 56, in create_database server.adaptor.autocommit() File "/home/cymon/git/github-master/BioSQL/BioSeqDatabase.py", line 172, in autocommit return self.dbutils.autocommit(self.conn, y) File "/home/cymon/git/github-master/BioSQL/DBUtils.py", line 101, in autocommit raise NotImplementedError("pgdb does not support this!") NotImplementedError: pgdb does not support this! ====================================================================== ERROR: Make sure can't import a single record twice (in one go). ---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 344, in setUp create_database() File "test_BioSQL.py", line 56, in create_database server.adaptor.autocommit() File "/home/cymon/git/github-master/BioSQL/BioSeqDatabase.py", line 172, in autocommit return self.dbutils.autocommit(self.conn, y) File "/home/cymon/git/github-master/BioSQL/DBUtils.py", line 101, in autocommit raise NotImplementedError("pgdb does not support this!") NotImplementedError: pgdb does not support this! ====================================================================== -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Jun 4 09:27:09 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 4 Jun 2009 05:27:09 -0400 Subject: [Biopython-dev] [Bug 2848] SeqIO fastq routines reject valid quality socres In-Reply-To: Message-ID: <200906040927.n549R9CU030203@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2848 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-04 05:27 EST ------- Rereading that MAQ page, you are probably right about allowing 0-93 rather than 0-90 for PHRED scores. Could you pull out a few valid FASTQ reads showing this problem as a short example file we can use for a unit test (and attach it to this bug)? Also, could you explicitly confirm what type of FASTQ file you think you have, i.e. Sanger style using PHRED scores and an offset of 33, rather than Solexa/Illumina style using a different scaling and an offset of 64, or something else. Thanks Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jun 4 09:31:18 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 4 Jun 2009 05:31:18 -0400 Subject: [Biopython-dev] [Bug 2848] SeqIO fastq routines reject valid quality socres In-Reply-To: Message-ID: <200906040931.n549VIgg030515@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2848 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-04 05:31 EST ------- P.S. The reason I originally used 0 to 90 was this line in the MAQ page and the fq_all2std.pl text: "In the quality string, if you can see a character with its ASCII code higher than 90, probably your file is in the Solexa/Illumina format." They do say "probably", so perhaps 91, 92 and 93 can validly occur. 
It might help to know where your apparently very high quality FASTQ file came from. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jun 4 10:08:35 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 4 Jun 2009 06:08:35 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: Message-ID: <200906041008.n54A8Zkh000532@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2833 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1319 is|0 |1 obsolete| | ------- Comment #25 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-04 06:08 EST ------- (From update of attachment 1319) (In reply to comment #24) > Ive added a patch against biopython on GitHub. > > I hope it address all the points made so far... it now passes all tests > in test_BioSQL.py (although Ive not added more). It looked sensible to me. It isn't very elegant (maybe we should move this hack into Loader.py?), but I can live with it until Hilmar fixes Bug 2839. Checked in as BioSQL/BioSeqDatabase.py CVS revision 1.23 - thanks! > One thing we've not yet discussed is the other PostgreSQL driver PyGresql. > It appears that the project is still active ... I'll open a bug. Let's discuss that on the new Bug 2849. > Also, Ive added a create_database() in a setUp() to the ClosedLoopTest > unittest case because if this suite is called first (as it is for me - > what actually governs which unittests are called first?) then if a test > database is missing the suite is going to fail. Good point, although I added a create_database() to the module itself instead. 
The unit tests order is from sorting their description (first line of the docstring) alphabetically. See Tests/test_BioSQL.py CVS revision 1.41 We may want to add a few more duplicate tests (using the accession and identifier) before closing this bug... Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jun 4 10:28:32 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 4 Jun 2009 06:28:32 -0400 Subject: [Biopython-dev] [Bug 2849] PyGresql PostgreSQL driver support for BioSQL is broken In-Reply-To: Message-ID: <200906041028.n54ASWGJ002151@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2849 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-04 06:28 EST ------- We don't really need the auto commit statement in the unit test, what happens on your machine if you remove it? RCS file: /home/repository/biopython/biopython/Tests/test_BioSQL.py,v retrieving revision 1.41 diff -r1.41 test_BioSQL.py 54,59d53 < # Auto-commit: postgresql cannot drop database in a transaction < try: < server.adaptor.autocommit() < except AttributeError: < pass < With the above change MySQL is happy. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Jun 4 11:05:11 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 4 Jun 2009 07:05:11 -0400 Subject: [Biopython-dev] [Bug 2849] PyGresql PostgreSQL driver support for BioSQL is broken In-Reply-To: Message-ID: <200906041105.n54B5B4a004928@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2849 ------- Comment #2 from cymon.cox at gmail.com 2009-06-04 07:05 EST ------- (In reply to comment #1) > We don't really need the auto commit statement in the unit test, what happens > on your machine if you remove it? I think we do need it for psycopg/psycopg2, as the comment says, postgresql cannot drop a database inside a transaction: INTERNAL ERROR: DROP DATABASE cannot run inside a transaction block C. > > RCS file: /home/repository/biopython/biopython/Tests/test_BioSQL.py,v > retrieving revision 1.41 > diff -r1.41 test_BioSQL.py > 54,59d53 > < # Auto-commit: postgresql cannot drop database in a transaction > < try: > < server.adaptor.autocommit() > < except AttributeError: > < pass > < > > With the above change MySQL is happy. > -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Jun 4 11:12:55 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 4 Jun 2009 07:12:55 -0400 Subject: [Biopython-dev] [Bug 2849] PyGresql PostgreSQL driver support for BioSQL is broken In-Reply-To: Message-ID: <200906041112.n54BCtTh005499@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2849 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-04 07:12 EST ------- (In reply to comment #2) > (In reply to comment #1) > > We don't really need the auto commit statement in the unit test, what > > happens on your machine if you remove it? > > I think we do need it for psycopg/psycopg2, as the comment says, postgresql > cannot drop a database inside a transaction: > > INTERNAL ERROR: DROP DATABASE cannot run inside a transaction block > > C. Did you try it anyway? http://www.postgresql.org/docs/7.0/interactive/sql-dropdatabase.html http://www.postgresql.org/docs/8.3/interactive/sql-dropdatabase.html This says "DROP DATABASE cannot be executed inside a transaction block.", which is fine - we don't want a transaction for this as we won't ever roll this back. It also says "This command cannot be executed while connected to the target database." which should be fine. [Maybe I need to setup my own machine with PostgreSQL on it for testing...] Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Jun 4 11:18:45 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 4 Jun 2009 07:18:45 -0400 Subject: [Biopython-dev] [Bug 2849] PyGresql PostgreSQL driver support for BioSQL is broken In-Reply-To: Message-ID: <200906041118.n54BIjj0006002@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2849 ------- Comment #4 from cymon.cox at gmail.com 2009-06-04 07:18 EST ------- (In reply to comment #3) > (In reply to comment #2) > > (In reply to comment #1) > > > We don't really need the auto commit statement in the unit test, what > > > happens on your machine if you remove it? > > > > I think we do need it for psycopg/psycopg2, as the comment says, postgresql > > cannot drop a database inside a transaction: > > > > INTERNAL ERROR: DROP DATABASE cannot run inside a transaction block > > > > C. > > Did you try it anyway? Yes. C. > > http://www.postgresql.org/docs/7.0/interactive/sql-dropdatabase.html > http://www.postgresql.org/docs/8.3/interactive/sql-dropdatabase.html > > This says "DROP DATABASE cannot be executed inside a transaction block.", which > is fine - we don't want a transaction for this as we won't ever roll this back. > It also says "This command cannot be executed while connected to the target > database." which should be fine. > > [Maybe I need to setup my own machine with PostgreSQL on it for testing...] > > Peter > -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Jun 4 17:49:13 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 4 Jun 2009 13:49:13 -0400 Subject: [Biopython-dev] [Bug 2848] SeqIO fastq routines reject valid quality socres In-Reply-To: Message-ID: <200906041749.n54HnD2r003507@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2848 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-04 13:49 EST ------- (In reply to comment #3) > P.S. The reason I originally used 0 to 90 was this line in the MAQ page and > the fq_all2std.pl text: > > "In the quality string, if you can see a character with its ASCII code higher > than 90, probably your file is in the Solexa/Illumina format." Ignore that - I was thinking PHRED scores but they are talking ASCII codes. I guess they consider PHRED scores of 57+ to be rare. (In reply to comment #0) > The fastq routines in SeqIO.QualityIO reject what I believe are valid quality > scores. > > According to the MAQ website (http://maq.sourceforge.net/fastq.shtml; I don't > know if this is definitive), valid quality values in Sanger style FASTQ format > are: > > := [!-~\n]+ > > This corresponds to Phred quality scores in the range 0-93. Yes, it does: ord("!")-33 = 0 ord("~")-33 = 93 The maq website isn't definitive, but it was written by people at Sanger where the FASTQ format was invented, and to my knowledge is the closest thing to an official description of the format. (In reply to comment #2) > Rereading that MAQ page, you are probably right about allowing 0-93 rather > than 0-90 for PHRED scores. Fixed in CVS. > Could you pull out a few valid FASTQ read showing this problem as a short > example file we can use for a unit test? (and attach it to this bug)... 
On re-reading your bug report, I'm not sure if you actually have a file where
this is a problem, or if you just noticed the minor discrepancy in the
threshold?

Thanks,

Peter

From bugzilla-daemon at portal.open-bio.org Thu Jun 4 19:15:35 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 4 Jun 2009 15:15:35 -0400
Subject: [Biopython-dev] [Bug 2849] PyGresql PostgreSQL driver support for BioSQL is broken
In-Reply-To:
Message-ID: <200906041915.n54JFZ57010708@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2849

------- Comment #5 from cymon.cox at gmail.com 2009-06-04 15:15 EST -------
(In reply to comment #3)
> (In reply to comment #2)
> > (In reply to comment #1)
> > > We don't really need the auto commit statement in the unit test, what
> > > happens on your machine if you remove it?
> >
> > I think we do need it for psycopg/psycopg2, as the comment says, postgresql
> > cannot drop a database inside a transaction:
> >
> > INTERNAL ERROR: DROP DATABASE cannot run inside a transaction block
> >
> > C.
>
> Did you try it anyway?
>
> http://www.postgresql.org/docs/7.0/interactive/sql-dropdatabase.html
> http://www.postgresql.org/docs/8.3/interactive/sql-dropdatabase.html
>
> This says "DROP DATABASE cannot be executed inside a transaction block.", which
> is fine - we don't want a transaction for this as we won't ever roll this back.

Psycopg defaults to a "read committed" isolation level that psycopg wraps
in a transaction block. So, I think the only way not to be in a transaction
block is to autocommit.

C.
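The guarded autocommit call from the test_BioSQL.py diff quoted earlier in this thread can be sketched in isolation. This is only an illustration of the pattern, not the BioSQL code: `AutocommitAdaptor`, `PlainAdaptor` and `enable_autocommit_if_supported` are hypothetical stand-ins invented here, and no real database driver is involved.

```python
class AutocommitAdaptor:
    """Hypothetical stand-in for an adaptor whose driver supports autocommit."""

    def __init__(self):
        self.autocommit_enabled = False

    def autocommit(self):
        self.autocommit_enabled = True


class PlainAdaptor:
    """Hypothetical stand-in for an adaptor with no autocommit() method."""


def enable_autocommit_if_supported(adaptor):
    """Switch on autocommit where available; return True if it was enabled."""
    try:
        adaptor.autocommit()
    except AttributeError:
        # No autocommit support: the caller needs some other way to get
        # outside the driver's implicit transaction block.
        return False
    return True


print(enable_autocommit_if_supported(AutocommitAdaptor()))
print(enable_autocommit_if_supported(PlainAdaptor()))
```

One design caveat of this pattern: an AttributeError raised inside a driver's own autocommit() would be swallowed as well, so it trades some error visibility for brevity.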
From bugzilla-daemon at portal.open-bio.org Thu Jun 4 19:22:55 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 4 Jun 2009 15:22:55 -0400 Subject: [Biopython-dev] [Bug 2848] SeqIO fastq routines reject valid quality socres In-Reply-To: Message-ID: <200906041922.n54JMt7r011320@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2848 ------- Comment #5 from pmmagic at gmail.com 2009-06-04 15:22 EST ------- (In reply to comment #4) > (In reply to comment #3) > > P.S. The reason I originally used 0 to 90 was this line in the MAQ page and > > the fq_all2std.pl text: > > > > "In the quality string, if you can see a character with its ASCII code higher > > than 90, probably your file is in the Solexa/Illumina format." > > Ignore that - I was thinking PHRED scores but they are talking ASCII codes. I > guess they consider PHRED scores of 57+ to be rare. > > (In reply to comment #0) > > The fastq routines in SeqIO.QualityIO reject what I believe are valid quality > > scores. > > > > According to the MAQ website (http://maq.sourceforge.net/fastq.shtml; I don't > > know if this is definitive), valid quality values in Sanger style FASTQ format > > are: > > > > := [!-~\n]+ > > > > This corresponds to Phred quality scores in the range 0-93. > > Yes, it does: > > ord("!")-33 = 0 > ord("~")-33 = 93 > > The maq website isn't definitive, but it was written by people at Sanger where > the FASTQ format was invented, and to my knowledge is the closest thing to an > official description of the format. > > (In reply to comment #2) > > Rereading that MAQ page, you are probably right about allowing 0-93 rather > > than 0-90 for PHRED scores. > > Fixed in CVS. > > > Could you pull out a few valid FASTQ read showing this problem as a short > > example file we can use for a unit test? (and attach it to this bug)... 
>
> On re-reading your bug report, I'm not sure if you actually have a file where
> this is a problem, or if you just noticed the minor discrepancy in the
> threshold?
>

Hi Peter,

The problem arises in parsing the fastq formatted consensus mappings
produced by MAQ, so these are "mapping qualities" rather than read
qualities directly. These mapping qualities, however, are on the same scale
as Phred quality scores (http://maq.sourceforge.net/qual.shtml) and MAQ's
fastq output is Sanger style. Since the mapping scores are, in part, a
function of read depth it's not too unusual to get very high quality scores
in the MAQ output.

Here's a simple snippet that is valid fastq:

@ref|NC_001133|
nnnnnnnnnnnnnnnacacccacacaccacaccacacaccACACCACACCCACACACACA
CATCCTAACACTACCCTAACACAGCCctaatcyaacCCTGACCAACCTGTCTCTCAACTT
+
!!!!!!!!!!!!!!!@EHHHHHHKKJKKKKNNNBN:NNNNQQQQQABGA?LTTWWWZZZI
HEFBZLZ]]]]]]]]]ZZZZZT@TTQQQT4A]1?cfiloxL{xuuux{]~~~~~Ake~`~

Thanks,
Paul M

From bugzilla-daemon at portal.open-bio.org Thu Jun 4 20:54:59 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 4 Jun 2009 16:54:59 -0400
Subject: [Biopython-dev] [Bug 2848] SeqIO fastq routines reject valid quality socres
In-Reply-To:
Message-ID: <200906042054.n54KsxB2017457@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2848

------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-04 16:54 EST -------
(In reply to comment #5)
> HI Peter,
>
> The problem arises in parsing the fastq formatted consensus mappings
> produced by MAQ, so these are "mapping qualities" rather than read
> qualities directly.

I see - that does explain why there are sometimes very very good quality
scores. Presumably maq limits itself to a maximum PHRED quality of 93?
> Here's a simple snippet that is valid fastq:
>
> @ref|NC_001133|
> nnnnnnnnnnnnnnnacacccacacaccacaccacacaccACACCACACCCACACACACA
> CATCCTAACACTACCCTAACACAGCCctaatcyaacCCTGACCAACCTGTCTCTCAACTT
> +
> !!!!!!!!!!!!!!!@EHHHHHHKKJKKKKNNNBN:NNNNQQQQQABGA?LTTWWWZZZI
> HEFBZLZ]]]]]]]]]ZZZZZT@TTQQQT4A]1?cfiloxL{xuuux{]~~~~~Ake~`~

May we use that for a unit test in Biopython?

Thanks,

Peter

From bugzilla-daemon at portal.open-bio.org Thu Jun 4 21:11:19 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 4 Jun 2009 17:11:19 -0400
Subject: [Biopython-dev] [Bug 2849] PyGresql PostgreSQL driver support for BioSQL is broken
In-Reply-To:
Message-ID: <200906042111.n54LBJnL019044@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2849

------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-04 17:11 EST -------
(In reply to comment #5)
> Psycopg defaults to a "read committed" isolation level that psycopg wraps
> in a transaction block. So, I think the only way not to be in a transaction
> block is to autocommit.

Presumably PyGresql/pgdb is similar in this regard? Given the apparent lack of
any documentation online at http://www.pygresql.org/pgdb.html this might be
tricky to resolve, but from looking at the CVS repository I don't think they
support autocommit.

Maybe we simply can't do a drop database with the pgdb driver for PostgreSQL.
Perhaps instead we can just empty all the tables in this case? That might fix
(at least part of) test_BioSQL.py

Does test_BioSQL_SeqIO.py work? Does simple use of BioSQL with pgdb work?
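Returning to Bug 2848 above: the arithmetic behind the 0-93 range (ord("!")-33 = 0, ord("~")-33 = 93) can be written out as a short conversion routine. This is a minimal sketch in present-day Python, not the Bio.SeqIO.QualityIO implementation; the function name and the constant are made up for illustration.

```python
SANGER_OFFSET = 33  # ASCII offset used by Sanger style FASTQ
MAX_PHRED = 93      # ord("~") - 33


def sanger_quality_to_phred(quality):
    """Convert a Sanger FASTQ quality string to a list of PHRED scores."""
    scores = []
    for char in quality:
        score = ord(char) - SANGER_OFFSET
        if not 0 <= score <= MAX_PHRED:
            raise ValueError("Quality character %r is outside '!' to '~'" % char)
        scores.append(score)
    return scores


# "!" is the lowest valid character, "~" the highest; "Z" has ASCII code 90,
# which is PHRED 57 (the "ASCII code higher than 90" remark on the MAQ page).
print(sanger_quality_to_phred("!Z~"))  # [0, 57, 93]
```

Under this reading, a limit of 90 applied to the PHRED score rather than the ASCII code would reject the characters "|", "}" and "~", which is the discrepancy reported in the bug.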
From bugzilla-daemon at portal.open-bio.org Thu Jun 4 21:26:51 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 4 Jun 2009 17:26:51 -0400 Subject: [Biopython-dev] [Bug 2849] PyGresql PostgreSQL driver support for BioSQL is broken In-Reply-To: Message-ID: <200906042126.n54LQphc020848@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2849 ------- Comment #7 from cymon.cox at gmail.com 2009-06-04 17:26 EST ------- (In reply to comment #6) > (In reply to comment #5) > > Psycopg defaults to a "read committed" isolation level that psycopg wraps > > in a translation block. So, I think the only way not to be in a translation > > block is to autocommit. > > Presumably PyGresql/pgdb is similar in this regard? Given the apparent lack of > any documentation on line at http://www.pygresql.org/pgdb.html this might be > tricky to resolve, but from looking at the CVS repository I don't think they > support autocommit. They dont. > > Maybe we simply can't do a drop database with the pygb driver for PostgreSQL. I had myself convinced this was the case for quite a while, but you can "trick" the cursor with a "COMMIT" and then execute a non-transactional query. I have it working and passing all the tests. Unfortunately, the driver spews forth "NOTICE"'s each time the database is built, ie for each CREATE in the schema, which ruins the unittest output. Ive yet to find a way to silence this. I'll submit a patch forthwith... C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
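The "COMMIT" trick described in comment #7 can be sketched as follows. This is an assumption-laden illustration, not the attached patch: `RecordingCursor` is a hypothetical stand-in that only records SQL, `drop_database_outside_transaction` is a name invented here, and whether a real driver accepts this sequence depends on the driver (the thread reports it working with pgdb).

```python
class RecordingCursor:
    """Hypothetical stand-in for a DB-API cursor; it only records the SQL."""

    def __init__(self):
        self.statements = []

    def execute(self, sql):
        self.statements.append(sql)


def drop_database_outside_transaction(cursor, db_name):
    """Issue COMMIT first so that DROP DATABASE is not in a transaction block.

    db_name is assumed to be a trusted identifier (e.g. a test database
    name); it is interpolated directly into the SQL.
    """
    # End the transaction that the driver opened implicitly...
    cursor.execute("COMMIT")
    # ...then run the statement that PostgreSQL refuses inside a transaction.
    cursor.execute("DROP DATABASE " + db_name)


cursor = RecordingCursor()
drop_database_outside_transaction(cursor, "biosql_test")
print(cursor.statements)
```

With a real connection the same two execute() calls would be sent to the server; the stand-in just makes the ordering visible without needing PostgreSQL.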
From bugzilla-daemon at portal.open-bio.org Thu Jun 4 21:59:03 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 4 Jun 2009 17:59:03 -0400 Subject: [Biopython-dev] [Bug 2840] When a record has been loaded from BioSQL, trying to save it to another database fails with loader db_loader.load_seqrecord in _load_reference In-Reply-To: Message-ID: <200906042159.n54Lx3ka023345@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2840 ------- Comment #9 from david.wyllie at ndm.ox.ac.uk 2009-06-04 17:59 EST ------- thank you very much for fixing this. I can confirm that in the current git load (4 june 09) the problem is resolved. David (In reply to comment #6) > Hi David, > > I was able to reproduce this problem. When working on Bug 2838, as my test case > I was using just the file cor6_6.gb which by chance has simple reference > locations - and that worked. I have now tested with GI 28804743. Also, using > some of the other GenBank files in our test suites also shows the reference > location problem from BioSQL/Loader.py function _load_reference: > > ValueError: invalid literal for int() with base 10: 'None' > > This is now fixed in CVS, plus there are now additional unit tests. For the > fix, I have used a slight variation of Cymon's patch. Does this look sensible > Cymon? > > BioSQL/BioSeq.py revision: 1.37 > Tests/test_BioSQL.py revision: 1.39 > Tests/seq_tests_common.py revision: 1.2 > > If you could retest with a clean checkout from CVS/github, to confirm the > problem is fixed, that would be great David. > > Note - currently in BioSQL we only store one reference location, while GenBank > files can have a single reference covering multiple regions of the record. This > is a limitation of the current BioSQL schema (although it would be interesting > to see how BioPerl deals with this). 
> > Note - there are four known failures in test_BioSQL.py right now, a mixed > strand feature in NC_000932.gb (which triggers two failures), the project cross > reference in NC_005816.gb, and a sub-feature location reference in one_of.gb -- > these are all unrelated to this issue (Bug 2840). > > Thanks, > > Peter > -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jun 4 22:05:51 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 4 Jun 2009 18:05:51 -0400 Subject: [Biopython-dev] [Bug 2849] PyGresql PostgreSQL driver support for BioSQL is broken In-Reply-To: Message-ID: <200906042205.n54M5p2w023834@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2849 ------- Comment #8 from cymon.cox at gmail.com 2009-06-04 18:05 EST ------- Created an attachment (id=1320) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1320&action=view) pgdb PyGreSQL Postgres Driver support -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jun 4 22:08:17 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 4 Jun 2009 18:08:17 -0400 Subject: [Biopython-dev] [Bug 2849] PyGresql PostgreSQL driver support for BioSQL is broken In-Reply-To: Message-ID: <200906042208.n54M8H1u024037@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2849 ------- Comment #9 from cymon.cox at gmail.com 2009-06-04 18:08 EST ------- (In reply to comment #7) > Unfortunately, the driver spews forth "NOTICE"'s each time the database is > built, ie for each CREATE in the schema, which ruins the unittest output. 
Ive
> yet to find a way to silence this.

This isn't an issue for users you run the entire biopython test suite with
run_tests.py, so I'll ignore it.

C.

From sbassi at clubdelarazon.org Thu Jun 4 23:22:27 2009
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Thu, 4 Jun 2009 20:22:27 -0300
Subject: [Biopython-dev] Biopython logo usage
Message-ID: <9e2f512b0906041622p3e2320cdxc735ab6a596d2629@mail.gmail.com>

Hello,

I wonder if the Biopython logo (http://biopython.org/wiki/Logo) has
any usage guidelines.
I am asking this because I am working on the cover design of my book
"Python for Bioinformatics" and I want to include the Biopython logo.
The idea is to highlight the fact that Biopython is covered in the
book. Do you think it is fair, or should I not include this logo on the
cover? Here is a draft of the cover:
http://www.dnalinux.com/coverdraft1.pdf
This is the first draft, without the Biopython logo, but there is a
"Python" logo as part of a screenshot of the Python installer for Mac.
Best,

--
Sebastián Bassi. Diplomado en Ciencia y Tecnología.

Non standard disclaimer: READ CAREFULLY. By reading this email,
you agree, on behalf of your employer, to release me from all
obligations and waivers arising from any and all NON-NEGOTIATED
agreements, licenses, terms-of-service, shrinkwrap, clickwrap,
browsewrap, confidentiality, non-disclosure, non-compete and
acceptable use policies ("BOGUS AGREEMENTS") that I have
entered into with your employer, its partners, licensors, agents and
assigns, in perpetuity, without prejudice to my ongoing rights and
privileges. You further represent that you have the authority to release
me from any BOGUS AGREEMENTS on behalf of your employer.
From idoerg at gmail.com Thu Jun 4 23:30:01 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Thu, 4 Jun 2009 16:30:01 -0700 Subject: [Biopython-dev] Biopython logo usage In-Reply-To: <9e2f512b0906041622p3e2320cdxc735ab6a596d2629@mail.gmail.com> References: <9e2f512b0906041622p3e2320cdxc735ab6a596d2629@mail.gmail.com> Message-ID: Wow, congratulations! I am so buying this off my startup.... I think the question is best addressed to Thomas Haelryck. IIRC, his friend designed the logo. There is no license I am aware of, but it is probably a good idea to put a cc-na license on it, like Tux has. Best, Iddo On Thu, Jun 4, 2009 at 4:22 PM, Sebastian Bassi wrote: > Hello, > > I wonder if the Biopython logo (http://biopython.org/wiki/Logo) has > any usage guidelines. > I am asking this because I am working on the cover design of my book > "Python for Bioinformatics" and I want to include the Biopython logo. > The idea is to highlight the fact that Biopython is covered in the > book. Do you think is fair or I should not include this logo on the > cover? Here is a draft of the cover: > http://www.dnalinux.com/coverdraft1.pdf > This is the first draft, without the Biopython logo, but there is a > "Python" logo as part of a screenshot of the Python installer for Mac. > Best, > > -- > Sebasti?n Bassi. Diplomado en Ciencia y Tecnolog?a. > > Non standard disclaimer: READ CAREFULLY. By reading this email, > you agree, on behalf of your employer, to release me from all > obligations and waivers arising from any and all NON-NEGOTIATED > agreements, licenses, terms-of-service, shrinkwrap, clickwrap, > browsewrap, confidentiality, non-disclosure, non-compete and > acceptable use policies ("BOGUS AGREEMENTS") that I have > entered into with your employer, its partners, licensors, agents and > assigns, in perpetuity, without prejudice to my ongoing rights and > privileges. 
You further represent that you have the authority to release > me from any BOGUS AGREEMENTS on behalf of your employer. > > ?? ???? ?????? ????? ????? ??? ???? ??? ????? ?? ?????? ????????.....???? > ????? > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > -- Iddo Friedberg, Ph.D. Atkinson Hall, mail code 0446 University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0446, USA T: +1 (858) 534-0570 http://iddo-friedberg.org From p.j.a.cock at googlemail.com Fri Jun 5 09:16:52 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 5 Jun 2009 10:16:52 +0100 Subject: [Biopython-dev] Biopython logo usage In-Reply-To: <9e2f512b0906041622p3e2320cdxc735ab6a596d2629@mail.gmail.com> References: <9e2f512b0906041622p3e2320cdxc735ab6a596d2629@mail.gmail.com> Message-ID: <320fb6e00906050216m70d1ce90n396a0a36925e6c74@mail.gmail.com> On Fri, Jun 5, 2009 at 12:22 AM, Sebastian Bassi wrote: > Hello, > > I wonder if the Biopython logo (http://biopython.org/wiki/Logo) has > any usage guidelines. I don't know if there was anything formal ever written down. As Iddo said, we should probably clear this with Thomas Hamelryck (BCC'd) (and Henrik Vestergaard) who came up with the logo. http://www.biopython.org/wiki/Logo > I am asking this because I am working on the cover design of my book > "Python for Bioinformatics" and I want to include the Biopython logo. > The idea is to highlight the fact that Biopython is covered in the > book. Do you think is fair or I should not include this logo on the > cover? As you only have a single chapter on Biopython, having our logo too prominent could be misleading. However, I personally like the idea of including the logo on your cover - a bit more promotion of Biopython would be nice. 
From a visual layout point of view, I'm not sure what to suggest - the yellow snakes don't go very well with the blue background (although there is yellow in the current python logo which should help balance things). > Here is a draft of the cover: http://www.dnalinux.com/coverdraft1.pdf > This is the first draft, without the Biopython logo, but there is a > "Python" logo as part of a screenshot of the Python installer for Mac. That looks good. In fact, when we did the last release I was thinking about including the Biopython logo on the Windows Installers - it looks pretty easy once we have a bitmap the right size... Peter From bugzilla-daemon at portal.open-bio.org Fri Jun 5 09:28:37 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 5 Jun 2009 05:28:37 -0400 Subject: [Biopython-dev] [Bug 2849] PyGresql PostgreSQL driver support for BioSQL is broken In-Reply-To: Message-ID: <200906050928.n559Sbwe002877@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2849 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1320 is|0 |1 obsolete| | ------- Comment #10 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-05 05:28 EST ------- (From update of attachment 1320) Good work - checked in, thanks. One minor thing - do we have to do this: server.adaptor.cursor.execute("COMMIT") rather than something like: server.adaptor.commit() (I guess you had your reasons, and may have been doing this deliberately.) Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Fri Jun 5 10:08:41 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 5 Jun 2009 06:08:41 -0400 Subject: [Biopython-dev] [Bug 2849] PyGresql PostgreSQL driver support for BioSQL is broken In-Reply-To: Message-ID: <200906051008.n55A8flo005419@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2849 ------- Comment #11 from cymon.cox at gmail.com 2009-06-05 06:08 EST ------- (In reply to comment #10) > (From update of attachment 1320 [details]) > Good work - checked in, thanks. > > One minor thing - do we have to do this: > server.adaptor.cursor.execute("COMMIT") > > rather than something like: > server.adaptor.commit() > > (I guess you had your reasons, and may have been doing this deliberately.) Committing on the adaptor doesnt work: when the code goes to drop the db, it throws a usual "cannot run inside a transaction block". So it appears the commit must be made on the cursor. Cheers, C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jun 5 13:36:21 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 5 Jun 2009 09:36:21 -0400 Subject: [Biopython-dev] [Bug 2848] SeqIO fastq routines reject valid quality socres In-Reply-To: Message-ID: <200906051336.n55DaLWX021023@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2848 ------- Comment #7 from pmmagic at gmail.com 2009-06-05 09:36 EST ------- (In reply to comment #6) > (In reply to comment #5) > > HI Peter, > > > > The problem arises in parsing the fastq formatted consensus mappings > > produced by MAQ, so these are "mapping qualities" rather than read > > qualities directly. 
>
> I see - that does explain why there are sometimes very very good quality
> scores. Presumably maq limits itself to a maximum PHRED quality of 93?

I presume so. If not that would contradict their own specification.

>
> > Here's a simple snippet that is valid fastq:
> >
> > @ref|NC_001133|
> > nnnnnnnnnnnnnnnacacccacacaccacaccacacaccACACCACACCCACACACACA
> > CATCCTAACACTACCCTAACACAGCCctaatcyaacCCTGACCAACCTGTCTCTCAACTT
> > +
> > !!!!!!!!!!!!!!!@EHHHHHHKKJKKKKNNNBN:NNNNQQQQQABGA?LTTWWWZZZI
> > HEFBZLZ]]]]]]]]]ZZZZZT@TTQQQT4A]1?cfiloxL{xuuux{]~~~~~Ake~`~
>
> May we use that for a unit test in Biopython?

Absolutely.

Cheers,
Paul

From sbassi at clubdelarazon.org Fri Jun 5 14:49:44 2009
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Fri, 5 Jun 2009 11:49:44 -0300
Subject: [Biopython-dev] Biopython logo usage
In-Reply-To: <320fb6e00906050216m70d1ce90n396a0a36925e6c74@mail.gmail.com>
References: <9e2f512b0906041622p3e2320cdxc735ab6a596d2629@mail.gmail.com> <320fb6e00906050216m70d1ce90n396a0a36925e6c74@mail.gmail.com>
Message-ID: <9e2f512b0906050749m19cba875k120967a4614d695b@mail.gmail.com>

On Fri, Jun 5, 2009 at 6:16 AM, Peter Cock wrote:
> I don't know if there was anything formal ever written down. As Iddo
> said, we should probably clear this with Thomas Hamelryck (BCC'd)
> (and Henrik Vestergaard) who came up with the logo.
> http://www.biopython.org/wiki/Logo

OK, I also wrote to him.

> As you only have a single chapter on Biopython, having our logo too
> prominent could be misleading. However, I personally like the idea of

There is one chapter about Biopython and there are code recipes and
most of them use Biopython. But clearly it is not a Biopython book.
I don't suggest making it prominent; I included more screenshots on the
cover and planned to include the logo in a corner.

> including the logo on your cover - a bit more promotion of Biopython
> would be nice. From a visual layout point of view, I'm not sure what

I think the same; for most bioinformaticians, Bioperl is the first
option when they think of a bioinformatics programming/scripting
language.
I will wait for Henrik's approval.
Best,
SB.

From anaryin at gmail.com Fri Jun 5 17:48:10 2009
From: anaryin at gmail.com (João Rodrigues)
Date: Fri, 5 Jun 2009 19:48:10 +0200
Subject: [Biopython-dev] PolyA Sequence fails to BLAST?
Message-ID:

Hello all, this is quite a general curiosity.

I was trying my application and I was testing the case of a sequence not
having matches in BLAST. I chose a long stretch of Alanines, randomly:

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

This is the URL that Biopython generated (printed from NCBIWWW.py):

http://blast.ncbi.nlm.nih.gov/Blast.cgi?COMPOSITION_BASED_STATISTICS=True&DATABASE=pdb&ENTREZ_QUERY=%28none%29&EXPECT=10&GAPCOSTS=10+1&HITLIST_SIZE=50&MATRIX_NAME=PAM70&PROGRAM=blastp&QUERY=AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA&WORD_SIZE=3&CMD=Put

Not finding this odd enough, because it says "Sequence not in FASTA format",
I went to the BLAST server page and manually tried to run it. Same error.

My question is not related to BioPython, but since I found it with it, I
guess I might as well ask:

Why does a poly A sequence crash BLAST? :x

Regards!

João [ .. ] Rodrigues
(Blog) http://doeidoei.wordpress.com
(MSN) always_asleep_ at hotmail.com
(Skype) rodrigues.jglm

From sbassi at clubdelarazon.org Fri Jun 5 18:01:42 2009
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Fri, 5 Jun 2009 15:01:42 -0300
Subject: [Biopython-dev] PolyA Sequence fails to BLAST?
In-Reply-To:
References:
Message-ID: <9e2f512b0906051101q1be60531r589695d1fd1d0a17@mail.gmail.com>

On Fri, Jun 5, 2009 at 2:48 PM, João Rodrigues wrote:
> Hello all, this is quite a general curiosity.
> I was trying my application and I was testing the case of a sequence not
> having matches in BLAST. I chose a long stretch of Alanines, randomly:
> AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
> Not finding this odd enough, because it says "Sequence not in FASTA format",
> I went to the BLAST server page and manually tried to run it. Same error.

That is because this is a "low complexity" region that in most cases
is masked (with X or N) before entering into a BLAST search.
Look here:

"Filter (Low-complexity)
Mask off segments of the query sequence that have low
compositional complexity, as determined by the SEG program of Wootton
& Federhen (Computers and Chemistry, 1993) or, for BLASTN, by the DUST
program of Tatusov and Lipman (in preparation). Filtering can
eliminate statistically significant but biologically uninteresting
reports from the blast output (e.g., hits against common acidic-,
basic- or proline-rich regions), leaving the more biologically
interesting regions of the query sequence available for specific
matching against database sequences.

Filtering is only applied to the query sequence (or its
translation products), not to database sequences. Default filtering is
DUST for BLASTN, SEG for other programs.

It is not unusual for nothing at all to be masked by SEG, when
applied to sequences in SWISS-PROT, so filtering should not be
expected to always yield an effect. Furthermore, in some cases,
sequences are masked in their entirety, indicating that the
statistical significance of any matches reported against the
unfiltered query sequence should be suspect."
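To make "low compositional complexity" concrete, here is a minimal sketch that scores a sequence by the Shannon entropy of its residue composition. It is not the SEG or DUST algorithm and implies no particular masking threshold; the function name is invented for this illustration, and it is written for present-day Python 3.

```python
import math
from collections import Counter


def shannon_entropy(sequence):
    """Return the Shannon entropy (in bits) of the residue composition."""
    if not sequence:
        raise ValueError("empty sequence")
    total = len(sequence)
    counts = Counter(sequence)
    # Sum p * log2(1/p) over the residues that actually occur.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


poly_a = "A" * 41  # a poly-alanine query like the one in this thread
print(shannon_entropy(poly_a))  # a single residue type carries no information
print(shannon_entropy("ACGT"))  # four equally frequent letters
```

A homopolymer scores zero bits, the lowest possible value, which is the kind of query a low-complexity filter can mask in its entirety.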
From thomas.hamelryck at gmail.com Fri Jun 5 18:05:45 2009
From: thomas.hamelryck at gmail.com (Thomas Hamelryck)
Date: Fri, 5 Jun 2009 20:05:45 +0200
Subject: [Biopython-dev] Biopython logo usage
In-Reply-To:
References: <9e2f512b0906041622p3e2320cdxc735ab6a596d2629@mail.gmail.com>
Message-ID: <2d7c25310906051105k36105d51had0593c2903a8464@mail.gmail.com>

On Fri, Jun 5, 2009 at 1:30 AM, Iddo Friedberg wrote:
> Wow, congratulations! I am so buying this off my startup....
>
> I think the question is best addressed to Thomas Hamelryck. IIRC, his friend
> designed the logo. There is no license I am aware of, but it is probably a
> good idea to put a cc-na license on it, like Tux has.
>

Should be fine, but will ask to be sure.

Cheers,

-Thomas

From idoerg at gmail.com Sun Jun 7 02:33:47 2009
From: idoerg at gmail.com (Iddo Friedberg)
Date: Sat, 6 Jun 2009 19:33:47 -0700
Subject: [Biopython-dev] skipping a bad record in SeqIO.parse
Message-ID:

Suppose SeqIO throws an exception due to a bad record. I want to note that in stderr and move on to the next record. How do I do that?

The following eyesore of a code simply leaves me stuck reading the same bad record over and over:

    seq_reader = SeqIO.parse(in_handle, format)
    while True:
        try:
            seq_record = seq_reader.next()
        except StopIteration:
            break
        except:
            if debug:
                sys.stderr.write("Sequence not read: %s%s" % (seq_record.id, os.linesep))
                sys.stderr.flush()
            continue
        if not seq_record:
            break

--
Iddo Friedberg, Ph.D.
Atkinson Hall, mail code 0446 University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0446, USA T: +1 (858) 534-0570 http://iddo-friedberg.org From mjldehoon at yahoo.com Sun Jun 7 11:38:10 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sun, 7 Jun 2009 04:38:10 -0700 (PDT) Subject: [Biopython-dev] Bio.SeqIO & Bio.SwissProt; comment lines Message-ID: <268230.50854.qm@web62402.mail.re1.yahoo.com> Hi everybody, Comments in SwissProt files such as the following: CC -!- FUNCTION: Core subunit of the mitochondrial membrane respiratory CC chain NADH dehydrogenase (Complex I) that is believed to belong to CC the minimal assembly required for catalysis. Complex I functions CC in the transfer of electrons from NADH to the respiratory chain. CC The immediate electron acceptor for the enzyme is believed to be CC ubiquinone (By similarity). CC -!- CATALYTIC ACTIVITY: NADH + ubiquinone = NAD(+) + ubiquinol. CC -!- SUBCELLULAR LOCATION: Mitochondrion membrane; Multi-pass membrane CC protein (By similarity). CC -!- SIMILARITY: Belongs to the complex I subunit 3 family. CC ----------------------------------------------------------------------- CC Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms CC Distributed under the Creative Commons Attribution-NoDerivs License CC ----------------------------------------------------------------------- are currently being stored differently by Bio.SeqIO and Bio.SwissProt. Bio.SeqIO stores the comments as one string, as follows: >>> record.annotations['comment'] '-!- FUNCTION: Core subunit of the mitochondrial membrane respiratory\n\n chain NADH dehydrogenase (Complex I) that is believed to belong to\n\n the minimal assembly required for catalysis. 
Complex I functions\n\n in the transfer of electrons from NADH to the respiratory chain.\n\n The immediate electron acceptor for the enzyme is believed to be\n\n ubiquinone (By similarity).\n\n-!- CATALYTIC ACTIVITY: NADH + ubiquinone = NAD(+) + ubiquinol.\n\n-!- SUBCELLULAR LOCATION: Mitochondrion membrane; Multi-pass membrane\n\n protein (By similarity).\n\n-!- SIMILARITY: Belongs to the complex I subunit 3 family.\n\n-----------------------------------------------------------------------\n\nCopyrighted by the UniProt Consortium, see http://www.uniprot.org/terms\n\nDistributed under the Creative Commons Attribution-NoDerivs License\n\n-----------------------------------------------------------------------\n' Note that two endlines appear at the end of each line; I don't know why. Bio.SwissProt, on the other hand, stores a list of comments (with single newlines): >>> record.comments [' FUNCTION: Core subunit of the mitochondrial membrane respiratory\n chain NADH dehydrogenase (Complex I) that is believed to belong to\n the minimal assembly required for catalysis. 
Complex I functions\n in the transfer of electrons from NADH to the respiratory chain.\n The immediate electron acceptor for the enzyme is believed to be\n ubiquinone (By similarity).\n', ' CATALYTIC ACTIVITY: NADH + ubiquinone = NAD(+) + ubiquinol.\n', ' SUBCELLULAR LOCATION: Mitochondrion membrane; Multi-pass membrane\n protein (By similarity).\n', ' SIMILARITY: Belongs to the complex I subunit 3 family.\n', '-----------------------------------------------------------------------\nCopyrighted by the UniProt Consortium, see http://www.uniprot.org/terms\nDistributed under the Creative Commons Attribution-NoDerivs License\n-----------------------------------------------------------------------\n'] I think that the approach used by Bio.SwissProt is more reasonable, although I'd prefer to remove the newlines and to skip the copyright statement altogether (since it's the same for all SwissProt records anyway). Can we do the same for Bio.SeqIO? Or is there a need to keep record.annotations['comments'] as a single string? If they are kept as a single string, how about using a single newline between comments, and no newlines within comments? This btw is the last inconsistency between Bio.SeqIO and Bio.SwissProt. By making this consistent, Bio.SeqIO could use Bio.SwissProt as a backend, which is about three times faster than the current parser, and has the added benefit of having to maintain only one SwissProt parser. --Michiel. --Michiel From biopython at maubp.freeserve.co.uk Sun Jun 7 11:52:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 7 Jun 2009 12:52:04 +0100 Subject: [Biopython-dev] [Biopython] skipping a bad record read in SeqIO In-Reply-To: References: Message-ID: <320fb6e00906070452u259f9f3eg4aaed8a4ab673ec4@mail.gmail.com> On Sun, Jun 7, 2009 at 3:36 AM, Iddo Friedberg wrote: > Suppose an iterator based reader throws an exception due to a bad record. I > want to note that in stderr an move on to the next record. How do i do that? 
The short answer is you can't (at least not easily), but the details would depend on which parser you are using (i.e. which file format). Do you have a corrupt file, or do you think you might have found a bug in a parser? More details would help. If you really have to do this, then if the file format is simple I would suggest you manually read the file into chunks and then pass them to SeqIO one by one. Not elegant but it would work. For example with a GenBank file, loop over the file line by line caching the data until you reach a new LOCUS line. Then turn the cached lines into a StringIO handle and give it to Bio.SeqIO.read() to parse that single record (in a try/except). Peter From biopython at maubp.freeserve.co.uk Sun Jun 7 12:11:06 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 7 Jun 2009 13:11:06 +0100 Subject: [Biopython-dev] Bio.SeqIO & Bio.SwissProt; comment lines In-Reply-To: <268230.50854.qm@web62402.mail.re1.yahoo.com> References: <268230.50854.qm@web62402.mail.re1.yahoo.com> Message-ID: <320fb6e00906070511v45b9a6cft2d69518cfaf81a0e@mail.gmail.com> On Sun, Jun 7, 2009 at 12:38 PM, Michiel de Hoon wrote: > > Hi everybody, > > Comments in SwissProt files such as the following: ... > are currently being stored differently by Bio.SeqIO and Bio.SwissProt. > > Bio.SeqIO stores the comments as one string, as follows: ... > Note that two endlines appear at the end of each line; I don't know why. The double new lines sound like a bug to me, we should fix that. > Bio.SwissProt, on the other hand, stores a list of comments (with > single newlines): ... That's just a list containing one string in your example. > I think that the approach used by Bio.SwissProt is more reasonable, > although I'd prefer to remove the newlines and to skip the copyright > statement altogether (since it's the same for all SwissProt records > anyway). 
In the long term, it looks like the new SwissProt comments are structured in a way that would allow automatic parsing to extract the data. > Can we do the same for Bio.SeqIO? Or is there a need to keep > record.annotations['comments'] as a single string? If they are > kept as a single string, how about using a single newline between > comments, and no newlines within comments? I think there are reasons to keep record.annotations['comments'] as a single string. The GenBank SeqRecord parser (called from Bio.SeqIO) also uses a single string for comments (not a list of strings), so the old SwissProt SeqRecord parser (and thus Bio.SeqIO) is consistent with that. I'd also have to check if switching to a list of strings would be OK with the BioSQL code. Finally, such a change would not be backwards compatible and could break existing scripts. > This btw is the last inconsistency between Bio.SeqIO and > Bio.SwissProt. By making this consistent, Bio.SeqIO could > use Bio.SwissProt as a backend, which is about three times > faster than the current parser, and has the added benefit > of having to maintain only one SwissProt parser. Three times faster sounds very good - assuming it can parse all our existing unit tests of course ;) We don't actually need to change the way comments are stored in the SeqRecord for this parser. I understood your plan is to build a new Bio.SeqIO SwissProt parser on top of the new Bio.SwissProt record based parser, by converting the SwissProt records into SeqRecord objects. At this step, simply concatenate the list of comment strings into one string for the SeqRecord. 
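
The concatenation step could be as small as the following sketch. It assumes the convention floated earlier in this thread (one newline between comments, no newlines inside a comment); the helper name is hypothetical and this is not the actual Bio.SeqIO code.

```python
def comments_to_single_string(comments):
    """Join a list of comment strings into one string.

    Whitespace inside each comment (including embedded newlines) is
    collapsed to single spaces, and comments are separated by one newline.
    """
    flattened = [" ".join(comment.split()) for comment in comments]
    return "\n".join(flattened)


# Two comments in the list-of-strings style shown for Bio.SwissProt.
comments = [
    " CATALYTIC ACTIVITY: NADH + ubiquinone = NAD(+) + ubiquinol.\n",
    " SUBCELLULAR LOCATION: Mitochondrion membrane; Multi-pass membrane\n protein (By similarity).\n",
]
print(comments_to_single_string(comments))
```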
Then we can use the new faster Bio.SwissProt parser within Bio.SeqIO, without breaking backwards compatibility, and deprecate the old Bio.SwissProt.SProt parser :)

Peter

From bugzilla-daemon at portal.open-bio.org Sun Jun 7 16:54:24 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 7 Jun 2009 12:54:24 -0400
Subject: [Biopython-dev] [Bug 2851] New: Psycopg version 1 support for BioSQL
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2851

Summary: Psycopg version 1 support for BioSQL
Product: Biopython
Version: Not Applicable
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: BioSQL
AssignedTo: biopython-dev at biopython.org
ReportedBy: cymon.cox at gmail.com

The recent additions to the BioSQL interface (bug 2833) as a workaround for the RULES in the schema (bug 2839) (see also PyGreSQL support bug 2849) have broken support for Psycopg VERSION 1.

The last release of Psycopg1 was version 1.1.21 on 2005-10-01. So it's pretty old and most users will probably have moved to psycopg2. However, it has not been deprecated and, without the recent rules workaround code in place, it does pass all the tests (except the rules tests obviously).

Psycopg1 fails the rules workaround code because in the BioSeqDatabase namespace, the connection object is a list of functions, and not a class that can be inspected for the driver name, and the IntegrityError is in the driver module namespace.

One possible solution is to make the check for the RULES when the database is opened, set a module global variable, and later re-import the module to get the IntegrityError. It is a nasty solution, but in its favour it can easily be removed when the RULES are eventually removed from the schema.

Anyway, attached is a patch using this workaround which works for psycopg1, psycopg2, and PyGreSQL (note only for the pgdb driver and not the pg driver).

C.
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sun Jun 7 16:55:37 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 7 Jun 2009 12:55:37 -0400 Subject: [Biopython-dev] [Bug 2851] Psycopg version 1 support for BioSQL In-Reply-To: Message-ID: <200906071655.n57GtbZN017199@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2851 ------- Comment #1 from cymon.cox at gmail.com 2009-06-07 12:55 EST ------- Created an attachment (id=1321) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1321&action=view) Psycopg 1 RULES workaround -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From idoerg at gmail.com Sun Jun 7 19:10:16 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Sun, 7 Jun 2009 12:10:16 -0700 Subject: [Biopython-dev] [Biopython] skipping a bad record read in SeqIO In-Reply-To: <320fb6e00906070452u259f9f3eg4aaed8a4ab673ec4@mail.gmail.com> References: <320fb6e00906070452u259f9f3eg4aaed8a4ab673ec4@mail.gmail.com> Message-ID: Thanks Peter. OK, it's a genbank file, but the point is not hacking around that problem (which I did), it's more of a biopython policy question. Biopython cannot handle every record format variant (==error) out there, and we should probably have a method for skipping over illegible records. The records skipped should be noted, of course, e.g. by writing to stderr. If the record cannot be read, then the preceding record ID and / or the record serial number should be written. Does that sound like something we should be doing? 
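
For reference, the workaround suggested earlier in this thread (cache lines until the next LOCUS line, then parse each record on its own inside a try/except) might look roughly like the sketch below. It is written against a generic `parse_one` callable so that it runs without Biopython; both function names are hypothetical, and the record serial number is reported through the warnings module.

```python
import warnings
from io import StringIO


def iter_genbank_chunks(handle):
    """Yield the text of each record, splitting at lines starting with LOCUS.

    Anything before the first LOCUS line (e.g. a file header) is dropped.
    """
    lines = []
    for line in handle:
        if line.startswith("LOCUS"):
            if lines:
                yield "".join(lines)
            lines = [line]
        elif lines:
            lines.append(line)
    if lines:
        yield "".join(lines)


def parse_skipping_bad(handle, parse_one):
    """Parse records one at a time, warning about and skipping failures.

    parse_one is any callable that takes a handle holding a single record.
    """
    for number, chunk in enumerate(iter_genbank_chunks(handle), start=1):
        try:
            yield parse_one(StringIO(chunk))
        except Exception as err:
            # Report the serial number of the record that could not be read.
            warnings.warn("Skipping record %i: %s" % (number, err))
```

With Biopython installed, `parse_one` could be something like `lambda handle: SeqIO.read(handle, "genbank")`, so that each chunk is parsed as a single record.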
On Sun, Jun 7, 2009 at 4:52 AM, Peter wrote: > On Sun, Jun 7, 2009 at 3:36 AM, Iddo Friedberg wrote: > > Suppose an iterator based reader throws an exception due to a bad record. > I > > want to note that in stderr an move on to the next record. How do i do > that? > > The short answer is you can't (at least not easily), but the details > would depend on which parser you are using (i.e. which file format). > > Do you have a corrupt file, or do you think you might have found a bug > in a parser? More details would help. > > If you really have to do this, then if the file format is simple I > would suggest you manually read the file into chunks and then pass > them to SeqIO one by one. Not elegant but it would work. For example > with a GenBank file, loop over the file line by line caching the data > until you reach a new LOCUS line. Then turn the cached lines into a > StringIO handle and give it to Bio.SeqIO.read() to parse that single > record (in a try/except). > > Peter > -- Iddo Friedberg, Ph.D. Atkinson Hall, mail code 0446 University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0446, USA T: +1 (858) 534-0570 http://iddo-friedberg.org From biopython at maubp.freeserve.co.uk Sun Jun 7 20:10:33 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 7 Jun 2009 21:10:33 +0100 Subject: [Biopython-dev] [Biopython] skipping a bad record read in SeqIO In-Reply-To: References: <320fb6e00906070452u259f9f3eg4aaed8a4ab673ec4@mail.gmail.com> Message-ID: <320fb6e00906071310w20b3ac61n5238093b399fd85a@mail.gmail.com> On 6/7/09, Iddo Friedberg wrote: > Thanks Peter. > > OK, it's a genbank file, but the point is not hacking around that problem > (which I did), it's more of a biopython policy question. Could you report a bug with this particular GenBank file (or at least, the entry). I think Biopython should try and cope with all valid GenBank files. 
It has been a long time since I personally found a GenBank file Biopython couldn't parse - the only cases I can remember recently from the mailing list have been invalid files from 3rd party scripts or tools.

Sometimes for out of spec files issuing a warning but continuing may be OK (we already do this on some LOCUS line variants, e.g. some GenBank files output from EMBOSS), but for anything unexpected I think the only safe option is to raise an exception.

> Biopython cannot handle every record format variant (==error) out there,
> and we should probably have a method for skipping over illegible records.
> The records skipped should be noted, of course, e.g. by writing to stderr.
> If the record cannot be read, then the preceding record ID and / or the
> record serial number should be written.
>
> Does that sound like something we should be doing?

No, not really.

I'm not 100% sure this is what you meant, but I would oppose any suggestion that the default behaviour should be to completely skip bad records (with only a warning or output to stderr to signal this).

In some cases (e.g. GenBank and SwissProt files) the start and end of records are well defined, so for a corrupt record we may be able to recover by issuing a warning and skipping ahead to the next record boundary. In other file formats this could be impossible (or at least, risky). So as a general policy for Bio.SeqIO, I don't think we can offer any way to skip bad records.

Perhaps I am biased as most GenBank files I personally use are single records (i.e. genomes).

Peter

P.S. I would use the warnings module rather than writing to stderr, as this would allow the user to filter warnings, upgrade them to exceptions etc.
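
To make the P.S. concrete, here is a minimal sketch of what the standard library warnings module gives the caller compared with writing to stderr. The warning class and the `note_skipped` helper are hypothetical names invented for this example; nothing like them is claimed to exist in Biopython.

```python
import warnings


class BadRecordWarning(UserWarning):
    """Hypothetical warning category for a record that was skipped."""


def note_skipped(record_number):
    """Report a skipped record through the warnings machinery."""
    warnings.warn("Skipped unparseable record %i" % record_number,
                  BadRecordWarning)


# The caller can silence this category entirely...
with warnings.catch_warnings():
    warnings.simplefilter("ignore", BadRecordWarning)
    note_skipped(1)

# ...or upgrade it to an exception, which restores strict behaviour.
with warnings.catch_warnings():
    warnings.simplefilter("error", BadRecordWarning)
    try:
        note_skipped(2)
    except BadRecordWarning as err:
        print("Raised as exception:", err)
```

The design point is that the library only decides what to report, while the user decides, per warning category, whether a skipped record is ignored, printed, or fatal.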
From idoerg at gmail.com Sun Jun 7 21:14:10 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Sun, 7 Jun 2009 14:14:10 -0700 Subject: [Biopython-dev] skipping a bad record read in SeqIO In-Reply-To: References: <320fb6e00906070452u259f9f3eg4aaed8a4ab673ec4@mail.gmail.com> <320fb6e00906071310w20b3ac61n5238093b399fd85a@mail.gmail.com> Message-ID: On Sun, Jun 7, 2009 at 1:10 PM, Peter wrote: > On 6/7/09, Iddo Friedberg wrote: > > Thanks Peter. > > > > OK, it's a genbank file, but the point is not hacking around that > problem > > (which I did), it's more of a biopython policy question. > > Could you report a bug with this particular GenBank file (or at least, the > entry). I think Biopython should try and cope with all valid GenBank files. > > It has been a long time since I personally found a GenBank file > Biopython couldn't parse - the only cases I can remember recently from > the mailing list have been invalid files from 3rd party scripts or tools. > > Sometimes for out of spec files issuing a warning but continuing may > be OK (we do already this on some LOCUS line variants, e.g. some > GenBank files output from EMBOSS), but for anything unexpected I > think the only safe option is to raise an exception. > > > Biopython cannot handle every record format variant (==error) out there, > > and we should probably have a method for skipping over illegible > records. > > The records skipped should be noted, of course, e.g. by writing to > stderr. > > If the record cannot be read, then the preceding record ID and / or the > > record serial number should be written. > > > > Does that sound like something we should be doing? > > No, not really. > > I'm not 100% sure this is what you meant, but I would oppose any > suggestion that the default behaviour should be to completely skip bad > records (with only a warning or output to stderr to signal this). > > In some cases (e.g. 
GenBank and SwissProt files) the start and end of > records are well defined, so for a corrupt record we may be able to > recover by issuing a warning and skipping ahead to the next record > boundary. In other file formats this could be impossible (or at least, > risky). So as a general policy for Bio.SeqIO, I don't think we can > offer any way to skip bad records. > > Perhaps I am biased as most GenBank files I personally use are single > records (i.e. genomes). No, I am not suggesting that it should be the default behavior, but that an argument (skip_bad_records=True or somesuch) could be passed to the parser to make this possible for users who would like to do that. I work with millions of sequences at a time, and if 5,000 or 50,000 are badly formatted (or problematic due to a parser bug), I would rather make a note of it and move on, coming back later to fix the problem. The alternative would be -- well, and ugly hack, which will cause loss of time and research momentum. Also, I am not suggesting an exact implementation (yet). Warnings do sound better than stderr. There are a few million genbank (the format) files out there that did not originate with NCBI genbank (the database). Mostly in metagenomics. Some are meta-file that contain no sequence but only LOCUS fields. It used to be that any format was strictly adhered to, simply because files in that format would always originate from the same source, and FASTA was the universal format used for exchange, since it is very hard to mess up a fasta format. That is not the case any more. For that reason I think we should consider how to handle unparse-able records. > > > Peter > > P.S. I would use the warnings module rather than writing to stderr, as > this would allow the user to filter warnings, upgrade them to > exceptions etc. > -- Iddo Friedberg, Ph.D. 
Atkinson Hall, mail code 0446
University of California, San Diego
9500 Gilman Drive
La Jolla, CA 92093-0446, USA
T: +1 (858) 534-0570
http://iddo-friedberg.org

From idoerg at gmail.com Sun Jun 7 21:17:48 2009
From: idoerg at gmail.com (Iddo Friedberg)
Date: Sun, 7 Jun 2009 14:17:48 -0700
Subject: [Biopython-dev] skipping a bad record read in SeqIO
In-Reply-To:
References: <320fb6e00906070452u259f9f3eg4aaed8a4ab673ec4@mail.gmail.com> <320fb6e00906071310w20b3ac61n5238093b399fd85a@mail.gmail.com>
Message-ID:

Here is the stack dump, coming from the file:

ftp://ftp.ncbi.nih.gov/genbank/gbcon11.seq.gz

The offender:

ACCESSION CH991540 ABGB01000000

Syntax error at or near `Tokens('close_paren')' token
Traceback (most recent call last):
  File "./filter_seqs.py", line 108, in
    matching_seqs, non_matching_seqs = filter_sequences(open(inpath), match_pairs, condition,seq_format)
  File "./filter_seqs.py", line 23, in filter_sequences
    for seq_record in SeqIO.parse(in_handle,format):
  File "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/Scanner.py", line 420, in parse_records
    record = self.parse(handle)
  File "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/Scanner.py", line 403, in parse
    if self.feed(handle, consumer) :
  File "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/Scanner.py", line 381, in feed
    self._feed_misc_lines(consumer, misc_lines)
  File "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/Scanner.py", line 1138, in _feed_misc_lines
    consumer.contig_location(contig_location)
  File "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/__init__.py", line 987, in contig_location
    self.location(content)
  File "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/__init__.py", line 684, in location
    raise LocationParserError(location_line)
Bio.GenBank.LocationParserError:
join(complement(ABGB01000004.1:1..81568),gap(unk100),complement(ABGB01000012.1:1..1260),gap(unk100),ABGB01000013.1:1..1227,gap(unk100),ABGB01000011.1:1..1338,gap(unk100),complement(ABGB01000001.1:1..118303))

--
Iddo Friedberg, Ph.D.
Atkinson Hall, mail code 0446
University of California, San Diego
9500 Gilman Drive
La Jolla, CA 92093-0446, USA
T: +1 (858) 534-0570
http://iddo-friedberg.org

From idoerg at gmail.com Sun Jun 7 21:30:50 2009
From: idoerg at gmail.com (Iddo Friedberg)
Date: Sun, 7 Jun 2009 14:30:50 -0700
Subject: [Biopython-dev] Fwd: [Biopython] skipping a bad record read in SeqIO
In-Reply-To: <320fb6e00906071429u6b1a202di7a32070ec939c267@mail.gmail.com>
References: <320fb6e00906070452u259f9f3eg4aaed8a4ab673ec4@mail.gmail.com> <320fb6e00906071310w20b3ac61n5238093b399fd85a@mail.gmail.com> <320fb6e00906071429u6b1a202di7a32070ec939c267@mail.gmail.com>
Message-ID:

On 6/7/09, Iddo Friedberg wrote:
> On Sun, Jun 7, 2009 at 1:10 PM, Peter
> wrote:
> >
> > Could you report a bug with this particular GenBank file (or at least,
> > the entry). I think Biopython should try and cope with all valid
> > GenBank files.
> >
> > It has been a long time since I personally found a GenBank file
> > Biopython couldn't parse - the only cases I can remember recently from
> > the mailing list have been invalid files from 3rd party scripts or tools.
> >
> > Sometimes for out of spec files issuing a warning but continuing may
> > be OK (we do already this on some LOCUS line variants, e.g. some
> > GenBank files output from EMBOSS), but for anything unexpected I
> > think the only safe option is to raise an exception.
> >
> >
> > > Biopython cannot handle every record format variant (==error) out
> > > there, and we should probably have a method for skipping over
> > > illegible records.
The records skipped should be noted, of course, > > > e.g. by writing to stderr. If the record cannot be read, then the > > > preceding record ID and / or the record serial number should be > > > written. > > > > > > Does that sound like something we should be doing? > > > > No, not really. > > > > I'm not 100% sure this is what you meant, but I would oppose any > > suggestion that the default behaviour should be to completely skip bad > > records (with only a warning or output to stderr to signal this). > > > > In some cases (e.g. GenBank and SwissProt files) the start and end of > > records are well defined, so for a corrupt record we may be able to > > recover by issuing a warning and skipping ahead to the next record > > boundary. In other file formats this could be impossible (or at least, > > risky). So as a general policy for Bio.SeqIO, I don't think we can > > offer any way to skip bad records. > > > > Perhaps I am biased as most GenBank files I personally use are single > > records (i.e. genomes). > > No, I am not suggesting that it should be the default behavior, OK, good. I was worried there. > but that an argument (skip_bad_records=True or somesuch) could be > passed to the parser to make this possible for users who would like to > do that. I work with millions of sequences at a time, and if 5,000 or > 50,000 are badly formatted (or problematic due to a parser bug), I > would rather make a note of it and move on, coming back later to fix > the problem. The alternative would be -- well, and ugly hack, which > will cause loss of time and research momentum. > > Also, I am not suggesting an exact implementation (yet). Warnings > do sound better than stderr. > > There are a few million genbank (the format) files out there that did not > originate with NCBI genbank (the database). Mostly in metagenomics. > Some are meta-file that contain no sequence but only LOCUS fields. 
>
> It used to be that any format was strictly adhered to, simply because
> files in that format would always originate from the same source, and
> FASTA was the universal format used for exchange, since it is very
> hard to mess up a fasta format. That is not the case any more. For
> that reason I think we should consider how to handle unparse-able
> records.

OK, clearly you have a rather different use case to me, where almost all the GenBank files I have used are from the NCBI, and if not are usually single genomes (with draft annotations) where I am prepared to fix any file format errors by hand. If you haven't already done so I would urge you to report bad files to the upstream source, otherwise such errors are only perpetuated and will cause more headaches in future.

You haven't convinced me that we need a general mechanism (in Bio.SeqIO) for skipping bad records in any file format (and I remain sceptical that this is even possible in general). However, for your GenBank situation I can understand your motivation now.

I think in your situation I'd implement my earlier "hand waving" suggestion of a pre-parser which breaks the big GenBank file up into individual records, and turn each into a StringIO handle passed to Bio.SeqIO inside a try/except. This would make a nice cookbook recipe... I can picture the code in my head and could probably get it working pretty quickly if you'd like to try this. But probably tomorrow not tonight ;)

So, perhaps an option (GenBank/EMBL specific initially) could be considered, but so far this seems like a corner use case to me, which we shouldn't complicate the main code base to accommodate.

Peter

--
Iddo Friedberg, Ph.D.
Atkinson Hall, mail code 0446 University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0446, USA T: +1 (858) 534-0570 http://iddo-friedberg.org From biopython at maubp.freeserve.co.uk Sun Jun 7 21:31:48 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 7 Jun 2009 22:31:48 +0100 Subject: [Biopython-dev] skipping a bad record read in SeqIO In-Reply-To: References: <320fb6e00906070452u259f9f3eg4aaed8a4ab673ec4@mail.gmail.com> <320fb6e00906071310w20b3ac61n5238093b399fd85a@mail.gmail.com> Message-ID: <320fb6e00906071431m36608514t7b754212ba40bd88@mail.gmail.com> On 6/7/09, Iddo Friedberg wrote: > Here is the stack dump, coming from the file: > > ftp://ftp.ncbi.nih.gov/genbank/gbcon11.seq.gz > > The offender: > > ACCESSION CH991540 ABGB01000000 > > Syntax error at or near `Tokens('close_paren')' token > Traceback (most recent call last): > File "./filter_seqs.py", line 108, in > matching_seqs, non_matching_seqs = filter_sequences(open(inpath), > match_pairs, condition,seq_format) > File "./filter_seqs.py", line 23, in filter_sequences > for seq_record in SeqIO.parse(in_handle,format): > File > "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/Scanner.py", > line 420, in parse_records > record = self.parse(handle) > File > "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/Scanner.py", > line 403, in parse > if self.feed(handle, consumer) : > File > "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/Scanner.py", > line 381, in feed > self._feed_misc_lines(consumer, misc_lines) > File > "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/Scanner.py", > line 1138, in _feed_misc_lines > consumer.contig_location(contig_location) > File > "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/__init__.py", > line 987, in contig_location > self.location(content) > File > "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/__init__.py", > line 684, in location > raise LocationParserError(location_line) > Bio.GenBank.LocationParserError: > 
join(complement(ABGB01000004.1:1..81568),gap(unk100),complement(ABGB01000012.1:1..1260),gap(unk100),ABGB01000013.1:1..1227,gap(unk100),ABGB01000011.1:1..1338,gap(unk100),complement(ABGB01000001.1:1..118303)) > That look like Bug 2745 to me - does the patch on that bug work for you, and would you be happy storing the CONTIG line as string? Peter From idoerg at gmail.com Sun Jun 7 21:33:18 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Sun, 7 Jun 2009 14:33:18 -0700 Subject: [Biopython-dev] skipping a bad record read in SeqIO In-Reply-To: <320fb6e00906071431m36608514t7b754212ba40bd88@mail.gmail.com> References: <320fb6e00906070452u259f9f3eg4aaed8a4ab673ec4@mail.gmail.com> <320fb6e00906071310w20b3ac61n5238093b399fd85a@mail.gmail.com> <320fb6e00906071431m36608514t7b754212ba40bd88@mail.gmail.com> Message-ID: hmm let me look into that... it it is a noted bug, I may wade into it if nobody else had. Thanks, Iddo On Sun, Jun 7, 2009 at 2:31 PM, Peter wrote: > On 6/7/09, Iddo Friedberg wrote: > > Here is the stack dump, coming from the file: > > > > ftp://ftp.ncbi.nih.gov/genbank/gbcon11.seq.gz > > > > The offender: > > > > ACCESSION CH991540 ABGB01000000 > > > > Syntax error at or near `Tokens('close_paren')' token > > Traceback (most recent call last): > > File "./filter_seqs.py", line 108, in > > matching_seqs, non_matching_seqs = filter_sequences(open(inpath), > > match_pairs, condition,seq_format) > > File "./filter_seqs.py", line 23, in filter_sequences > > for seq_record in SeqIO.parse(in_handle,format): > > File > > "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/Scanner.py", > > line 420, in parse_records > > record = self.parse(handle) > > File > > "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/Scanner.py", > > line 403, in parse > > if self.feed(handle, consumer) : > > File > > "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/Scanner.py", > > line 381, in feed > > self._feed_misc_lines(consumer, misc_lines) > > File > > 
"/home/idoerg/biopy_cvs/biopython/Bio/GenBank/Scanner.py", > > line 1138, in _feed_misc_lines > > consumer.contig_location(contig_location) > > File > > "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/__init__.py", > > line 987, in contig_location > > self.location(content) > > File > > "/home/idoerg/biopy_cvs/biopython/Bio/GenBank/__init__.py", > > line 684, in location > > raise LocationParserError(location_line) > > Bio.GenBank.LocationParserError: > > > join(complement(ABGB01000004.1:1..81568),gap(unk100),complement(ABGB01000012.1:1..1260),gap(unk100),ABGB01000013.1:1..1227,gap(unk100),ABGB01000011.1:1..1338,gap(unk100),complement(ABGB01000001.1:1..118303)) > > > > That look like Bug 2745 to me - does the patch on that bug work for > you, and would you be happy storing the CONTIG line as string? > > Peter > -- Iddo Friedberg, Ph.D. Atkinson Hall, mail code 0446 University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0446, USA T: +1 (858) 534-0570 http://iddo-friedberg.org From biopython at maubp.freeserve.co.uk Sun Jun 7 21:40:57 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 7 Jun 2009 22:40:57 +0100 Subject: [Biopython-dev] Fwd: [Biopython] skipping a bad record read in SeqIO In-Reply-To: References: <320fb6e00906070452u259f9f3eg4aaed8a4ab673ec4@mail.gmail.com> <320fb6e00906071310w20b3ac61n5238093b399fd85a@mail.gmail.com> <320fb6e00906071429u6b1a202di7a32070ec939c267@mail.gmail.com> Message-ID: <320fb6e00906071440o53bfe8cdh5ac0695ed8e03524@mail.gmail.com> Note - Iddo emailed me off list accidentally, and then forwarded my reply... Peter wrote (forward by Iddo): > On 6/7/09, Iddo Friedberg wrote: > > On Sun, Jun 7, 2009 at 1:10 PM, Peter > > wrote: > > > > > > Could you report a bug with this particular GenBank file (or at > > > least, the entry). I think Biopython should try and cope with all > > > valid GenBank files. 
> > > > > > It has been a long time since I personally found a GenBank file > > > Biopython couldn't parse - the only cases I can remember recently > > > from the mailing list have been invalid files from 3rd party scripts > > > or tools. Or, the CONTIG line problem (Bug 2745) which I'd forgotten about until Iddo's follow up email with the stack trace (I personally don't use that type of GenBank file). These are valid GenBank files from the NCBI that we should be able to parse. Peter From bugzilla-daemon at portal.open-bio.org Mon Jun 8 11:12:07 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 8 Jun 2009 07:12:07 -0400 Subject: [Biopython-dev] [Bug 2851] Psycopg version 1 support for BioSQL In-Reply-To: Message-ID: <200906081112.n58BC7qQ017036@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2851 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-08 07:12 EST ------- Thanks for solving this - patch checked in but with global POSTGRES_RULES_PRESENT renamed to _POSTGRES_RULES_PRESENT (private variable). Marking as fixed. Do you think we should deprecate Biopython support for psycopg version one? Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
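[Editorial sketch, not part of the original messages.] The pre-parser idea from the "skipping a bad record read in SeqIO" thread above (split a large GenBank file into single records, wrap each in a StringIO handle, and parse inside a try/except) could look roughly like the following. This is only a minimal sketch under stated assumptions: it assumes every record ends with a "//" line, the names split_genbank_records, parse_skipping_bad and toy_parser are hypothetical, and toy_parser merely stands in for a real parser. With Biopython installed, something like lambda h: SeqIO.read(h, "genbank") could presumably be passed as parse_one instead.

```python
import warnings
from io import StringIO


def split_genbank_records(handle):
    """Yield each record as one string, splitting on the '//' terminator line."""
    lines = []
    for line in handle:
        lines.append(line)
        if line.rstrip() == "//":
            yield "".join(lines)
            lines = []
    # Any trailing text without a terminator is passed on so the parser can reject it.
    if "".join(lines).strip():
        yield "".join(lines)


def parse_skipping_bad(handle, parse_one):
    """Yield parse_one(record_handle) per record; warn about and skip failures."""
    for index, text in enumerate(split_genbank_records(handle), start=1):
        try:
            record = parse_one(StringIO(text))
        except Exception as err:  # deliberately broad: any parser failure skips the record
            first_line = text.lstrip().splitlines()[0]
            warnings.warn("Skipping record %i (%s): %s" % (index, first_line, err))
        else:
            yield record


def toy_parser(record_handle):
    """Hypothetical stand-in parser: returns the LOCUS name, chokes on CONTIG lines."""
    text = record_handle.read()
    if "CONTIG" in text:
        raise ValueError("cannot parse CONTIG line")
    return text.split()[1]


example = StringIO("LOCUS A\n//\nLOCUS B\nCONTIG join(...)\n//\nLOCUS C\n//\n")
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    good = list(parse_skipping_bad(example, toy_parser))
print(good, len(caught))
```

Keeping the splitting outside the parser means the main parsing code stays untouched, which matches the concern raised in the thread about not complicating the main code base for a corner use case.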
From bugzilla-daemon at portal.open-bio.org Mon Jun 8 11:38:15 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 8 Jun 2009 07:38:15 -0400 Subject: [Biopython-dev] [Bug 2851] Psycopg version 1 support for BioSQL In-Reply-To: Message-ID: <200906081138.n58BcFB3018676@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2851 ------- Comment #3 from cymon.cox at gmail.com 2009-06-08 07:38 EST ------- (In reply to comment #2) > Do you think we should deprecate Biopython support for psycopg version one? Yes, I'd deprecate it - its no longer actively developed. Anyone wanting to use Psycopg would surely choose version 2 (version 1 was a pain to build anyway). C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Mon Jun 8 13:00:37 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 8 Jun 2009 14:00:37 +0100 Subject: [Biopython-dev] AUTHORS file on github Message-ID: <320fb6e00906080600r1d35a7e5j7e069ca42f77d1b8@mail.gmail.com> Hi Bartek, I was just looking at github and noticed the AUTHORS file is present: http://github.com/biopython/biopython/tree/master This was deleted in CVS seven years ago (well, renamed to CONTRIB). Is this some subtle side effect of the tag changes? 
Peter From barwil at gmail.com Mon Jun 8 18:05:15 2009 From: barwil at gmail.com (Bartek Wilczynski) Date: Mon, 8 Jun 2009 20:05:15 +0200 Subject: [Biopython-dev] AUTHORS file on github In-Reply-To: <320fb6e00906080600r1d35a7e5j7e069ca42f77d1b8@mail.gmail.com> References: <320fb6e00906080600r1d35a7e5j7e069ca42f77d1b8@mail.gmail.com> Message-ID: <8b34ec180906081105oa232f6ak25ef2c8a2cf69ef7@mail.gmail.com> Hi, On Mon, Jun 8, 2009 at 3:00 PM, Peter wrote: > Hi Bartek, > > I was just looking at github and noticed the AUTHORS file is present: > http://github.com/biopython/biopython/tree/master > > This was deleted in CVS seven years ago (well, renamed to CONTRIB). > I'm attending a workshop now, so my web access is limited, but I'll look into that. > Is this some subtle side effect of the tag changes? I remember that I solved the issue with those removed files at the beginning of the transition, so I would guess you are right. cheers Bartek From biopython at maubp.freeserve.co.uk Tue Jun 9 15:31:32 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 9 Jun 2009 16:31:32 +0100 Subject: [Biopython-dev] Installation documentation In-Reply-To: <20090428124119.GV34546@sobchak.mgh.harvard.edu> References: <320fb6e00904270859q23a9abd8rfb21b10ebee30a77@mail.gmail.com> <20090428124119.GV34546@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00906090831x1614fc94m18b0c3e9851272e5@mail.gmail.com> On Tue, Apr 28, 2009 at 1:41 PM, Brad Chapman wrote: > Hi Peter; > >> I've made some updates to Installation.tex, which I think are an >> improvement over the version shipped with Biopython 1.50 and >> currently online. ?I think we could update these files now: >> >> http://biopython.org/DIST/docs/install/Installation.html >> http://biopython.org/DIST/docs/install/Installation.pdf >> >> Does that seem sensible? ?Before that, would anyone like to proof read >> the text in CVS, or make further updates? ?For example, are the bits >> on ?FreeBSD, Fink and RPMs still valid? 
> > The FreeBSD port is out of date now, so I commented that section out > and replaced it with a section on using easy_install... I've just updated Installation.tex to take into account the Biopython 1.50 changes (no Martel, no mxTextTools - better late than never), and put the new HTML and PDF files online. This includes Brad's changes. Peter From bugzilla-daemon at portal.open-bio.org Tue Jun 9 16:02:21 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 9 Jun 2009 12:02:21 -0400 Subject: [Biopython-dev] [Bug 2853] New: Support the "in" keyword with Seq objects / define __contains__ method Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2853 Summary: Support the "in" keyword with Seq objects / define __contains__ method Product: Biopython Version: Not Applicable Platform: PC OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk Currently the "in" keyword isn't properly supported in the Seq object, meaning instead of this: >>> if "TAG" in my_seq: >>> print "Found TAG" you have to do something else like this (using the find method added on Bug 2809): >>> if my_seq.find("TAG") >= 0 : >>> print "Found TAG" In dealing with Bug 2809 we already have a policy in place for dealing with the alphabet issues, so the code to do this is very simple. Patch to follow. Because we don't define __contains__ yet, when someone uses "in" at the moment Python does something indirectly via our __getitem__ method, which means "in" returns True when used for a single letter (as a string) that is in the sequence, and False otherwise (e.g. a multi-letter string, or a Seq object). i.e. 
currently: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import generic_dna, generic_rna, generic_protein >>> my_dna=Seq("AAGTGCTAATAGAAAAA", generic_dna) >>> "N" in my_dna #works False >>> "A" in my_dna #works True >>> Seq("A") in my_dna #I think this is broken, should be True False >>> "TAG" in my_dna #I think this is broken, should be True False >>> "TAG" in my_dna.tostring() True >>> "TAG" in str(my_dna) True -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jun 9 16:03:55 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 9 Jun 2009 12:03:55 -0400 Subject: [Biopython-dev] [Bug 2853] Support the "in" keyword with Seq objects / define __contains__ method In-Reply-To: Message-ID: <200906091603.n59G3t0j013999@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2853 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-09 12:03 EST ------- Created an attachment (id=1323) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1323&action=view) Add __contains__ to Seq object This includes a doctest -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jun 9 16:05:27 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 9 Jun 2009 12:05:27 -0400 Subject: [Biopython-dev] [Bug 2351] Make Seq more like a string, even subclass string? 
In-Reply-To: Message-ID: <200906091605.n59G5RkF014140@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2351 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- BugsThisDependsOn| |2853 ------- Comment #17 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-09 12:05 EST ------- This bug also depends on: Bug 2853 - Support the "in" keyword with Seq objects / define __contains__ method -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jun 9 16:05:41 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 9 Jun 2009 12:05:41 -0400 Subject: [Biopython-dev] [Bug 2853] Support the "in" keyword with Seq objects / define __contains__ method In-Reply-To: Message-ID: <200906091605.n59G5fWA014158@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2853 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- OtherBugsDependingO| |2351 nThis| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
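[Editorial sketch, not part of the original messages.] The behaviour described in Bug 2853 can be reproduced without Biopython. The two toy classes below are hypothetical stand-ins for the Seq object, not the attached patch: LegacySeq defines only __getitem__, so Python's "in" falls back to comparing the query against one letter at a time, while MiniSeq adds a __contains__ method that does a substring search. The alphabet checks mentioned in the bug report are left out of this sketch.

```python
class LegacySeq:
    """Toy sequence with __getitem__ only (no __contains__)."""

    def __init__(self, data):
        self._data = data

    def __str__(self):
        return self._data

    def __getitem__(self, index):
        return self._data[index]


class MiniSeq(LegacySeq):
    """Same toy sequence, plus substring-style support for 'in'."""

    def __contains__(self, item):
        # str() lets both plain strings and other sequence objects be searched for.
        return str(item) in self._data


old = LegacySeq("AAGTGCTAATAGAAAAA")
new = MiniSeq("AAGTGCTAATAGAAAAA")

print("A" in old)    # single letters work via the __getitem__ fallback
print("TAG" in old)  # multi-letter queries never equal a single letter
print("TAG" in new)  # substring search via __contains__
print(MiniSeq("A") in new)
```

The fallback exists because Python iterates over indices 0, 1, 2, ... through __getitem__ until IndexError when no __contains__ or __iter__ is defined, which is why only single-letter string queries can ever match.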
From dalloliogm at gmail.com Tue Jun 9 16:09:38 2009 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Tue, 9 Jun 2009 18:09:38 +0200 Subject: [Biopython-dev] Installation documentation In-Reply-To: <320fb6e00904280640l58f0098fs467d90646a039e29@mail.gmail.com> References: <320fb6e00904270859q23a9abd8rfb21b10ebee30a77@mail.gmail.com> <20090428124119.GV34546@sobchak.mgh.harvard.edu> <320fb6e00904280640l58f0098fs467d90646a039e29@mail.gmail.com> Message-ID: <5aa3b3570906090909s3174f7a8r225a95a48103f251@mail.gmail.com> On Tue, Apr 28, 2009 at 3:40 PM, Peter wrote: > On Tue, Apr 28, 2009 at 1:41 PM, Brad Chapman wrote: > > Well, easy_install isn't (yet) an official python standard so I hadn't > previously worried about it - our wiki Downloads page does mention it. > Frankly the less "official" ways the are to install, the less ways it > can go wrong, and then the less questions need to be asked when it > goes wrong. If I can say mine, pypi and easy_install are very cool! :-) The biopython package on pypi works very well and it is the quickest way to get the latest version of biopython. It is more reliable than the packages in the repositories of many linux distro (some of them are outdated), and with respect to the manual installation, it makes it a lot easier to update biopython and to install all the dependencies. Nor had I worried about how PyPi's listing might need to be updated. > I assumed it was clever enough to scan the http://biopython.org/DIST/ > directory and parse the filenames. Is the real answer you (Brad) kept > it up to date? > http://pypi.python.org/pypi/biopython/ > I saw that some packages, when installed with easy_install, are downloaded from their own project home pages. For example, when you do easy_install numpy, it downloads the egg code from sourceforge. So maybe there is a way to automatically update packages to pypi, but I don't know it.. 
> > > Peter, if you have an account on pypi, let me know your login and I > > can add you as an owner for Biopython. > > I don't have an account on pypi. > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > -- Giovanni Dall'Olio, phd student Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain) My blog on bioinformatics: http://bioinfoblog.it From bugzilla-daemon at portal.open-bio.org Tue Jun 9 16:20:46 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 9 Jun 2009 12:20:46 -0400 Subject: [Biopython-dev] [Bug 2294] Writing GenBank files with Bio.SeqIO In-Reply-To: Message-ID: <200906091620.n59GKkhQ015142@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2294 ------- Comment #16 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-09 12:20 EST ------- The code in CVS should now be writing GenBank files with features properly, and has been tested on complex fuzzy joins and even mixed strand features. Getting feature locations to work properly has been more work than I had expected! (In reply to comment #14) > There is still plenty to do: > * Full testing, both manual and with extended unit test coverage Having a 3rd party test the current code would be very helpful - I may have missed things, as different people will use the code in different ways. > * Wrapping long feature locations Done. > * Writing references Not done yet, but for my personal needs this is low priority. > * Extending to cover writing EMBL files Not done yet, but should be comparatively straightforward. Let's track this possible enhancement on a separate bug. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Tue Jun 9 16:43:01 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 9 Jun 2009 17:43:01 +0100 Subject: [Biopython-dev] Installation documentation In-Reply-To: <5aa3b3570906090909s3174f7a8r225a95a48103f251@mail.gmail.com> References: <320fb6e00904270859q23a9abd8rfb21b10ebee30a77@mail.gmail.com> <20090428124119.GV34546@sobchak.mgh.harvard.edu> <320fb6e00904280640l58f0098fs467d90646a039e29@mail.gmail.com> <5aa3b3570906090909s3174f7a8r225a95a48103f251@mail.gmail.com> Message-ID: <320fb6e00906090943y5b9dfe09wb438033c4400668b@mail.gmail.com> Giovanni Marco Dall'Olio wrote: > >Peter wrote: >> Well, easy_install isn't (yet) an official python standard so I hadn't >> previously worried about it - our wiki Downloads page does mention it. >> Frankly the less "official" ways the are to install, the less ways it >> can go wrong, and then the less questions need to be asked when it >> goes wrong. > > If I can say mine, pypi and easy_install are very cool! :-) > > The biopython package on pypi works very well and it is the quickest > way to get the latest version of biopython. > It is more reliable than the packages in the repositories of many linux > distro (some of them are outdated), and with respect to the manual > installation, it makes it a lot easier to update biopython and to install > all the dependencies. Using the packages from your Linux distribution is probably the easiest and most reliable way to get Biopython on Linux - but these are inevitably a little out of date most of the time. If it works, then yes, easy_install / pypi is nice and easy to use. As long as Brad (or someone) is happy to support this, that's fine with me. However, easy_install isn't perfect. If you browse the NumPy/SciPy mailing lists you'll see plenty of issues with easy_install - they have problems with CPU specific optimised builds and so on which are rather complicated to deal with. 
This is relevant because we would need easy_install to handle NumPy for us. I can certainly see the appeal of easy_install where a tool has lots of dependencies you would otherwise have to manually install. [If you've never tried, install BioPerl from CPAN and try and count how many other perl libraries it depends on - quite an eye opener!] This isn't really the case for Biopython, all we really need are Python and NumPy (and even that can be skipped if you don't want to use Bio.PDB, Bio.Cluster and a few other bits). > I saw that some packages, when installed with easy_install, are > downloaded from their own project home pages. For example, > when you do easy_install numpy, it downloads the egg code from > sourceforge. ... Using easy_install for Biopython should download from biopython.org Peter From bugzilla-daemon at portal.open-bio.org Tue Jun 9 17:19:29 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 9 Jun 2009 13:19:29 -0400 Subject: [Biopython-dev] [Bug 2853] Support the "in" keyword with Seq objects / define __contains__ method In-Reply-To: Message-ID: <200906091719.n59HJTDK019786@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2853 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-09 13:19 EST ------- In fact, given the way the SeqRecord __getitem__ works it might be worth adding a similar __contains__ method to the SeqRecord as well... -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From hlapp at gmx.net Tue Jun 9 22:23:53 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 9 Jun 2009 18:23:53 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: <7265d4f0906030752l64e48498v92fd18a80d20658c@mail.gmail.com> References: <200906021700.n52H0uZb029220@portal.open-bio.org> <7265d4f0906021239n2c9b71f9l3a69d362bbd66183@mail.gmail.com> <7265d4f0906021329y6b6ca58dv925b7741d1e827d5@mail.gmail.com> <320fb6e00906030554p6daa59a3u54bfaade81c07d5f@mail.gmail.com> <7265d4f0906030752l64e48498v92fd18a80d20658c@mail.gmail.com> Message-ID: I actually don't think it mimics BioPerl. The recommended practice should be that if you don't have a value for an optional attribute, leave it undefined ... On Jun 3, 2009, at 10:52 AM, Cymon Cox wrote: > 2009/6/3 Peter > >> On Tue, Jun 2, 2009 at 9:29 PM, Cymon Cox wrote: >>> >>> Whoa, I see now that in Loader._load_bioentry_table that if the >>> rec.annotations["gi"] is missing, it gets filled with the >> accession.version: >>> >>> if "gi" in record.annotations : >>> identifier = record.annotations["gi"] >>> else : >>> identifier = record.id >>> >>> So biopythons BioSQL identifiers are not equivalent to GenBank >> identifiers. >>> I wonder why this is done and identifier is not just left NULL, >>> and the >>> unique constraint maintained by accession/version... >>> >> >> Remember, it isn't just GenBank files that get imported into BioSQL. >> While the record.id is the accession.version when loading a GenBank >> file, this is not the case in general. >> >> Consulting the CVS log, this was changed BioSQL/Loader/py revision >> 1.33 to cope with loading a FASTA file into a BioSQL database (Bug >> 2425). Presumably I was trying to mimic the BioPerl loading of FASTA >> files. Before this change, the bioentry.identifier was taken as the >> GI >> number if available. >> >> i.e. This change wasn't anything directly to do with the uniqueness >> rules. > > > Thanks Peter. 
> > Yes, it seems to have been done to mimic BioPerl - but I'm still > curious as > to why it is done at all... > > Anyway, I seem to be chasing my tail here: > http://bugzilla.open-bio.org/show_bug.cgi?id=2681#c5 > > Cheers, C. > -- > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Tue Jun 9 22:57:31 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 9 Jun 2009 23:57:31 +0100 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: References: <200906021700.n52H0uZb029220@portal.open-bio.org> <7265d4f0906021239n2c9b71f9l3a69d362bbd66183@mail.gmail.com> <7265d4f0906021329y6b6ca58dv925b7741d1e827d5@mail.gmail.com> <320fb6e00906030554p6daa59a3u54bfaade81c07d5f@mail.gmail.com> <7265d4f0906030752l64e48498v92fd18a80d20658c@mail.gmail.com> Message-ID: <320fb6e00906091557v2b0ac4c0y396947dac22d72e1@mail.gmail.com> On 6/9/09, Hilmar Lapp wrote: > > I actually don't think it mimics BioPerl. The recommended practice > should be that if you don't have a value for an optional attribute, > leave it undefined ... I presume you are talking about the bioentry.identifier field, and what if anything should be recorded there (e.g. the NCBI GI number from a GenBank file). I'll have to refresh my mind on how BioPerl stores arbitrary FASTA files in BioSQL where you don't have an NCBI accession & version, or an NCBI gi number - just some identifier string. You're not saying in BioSQL bioentry.identifier should be for an NCBI GI number *only*, are you?
Peter From hlapp at gmx.net Tue Jun 9 23:07:55 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 9 Jun 2009 19:07:55 -0400 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: <320fb6e00906091557v2b0ac4c0y396947dac22d72e1@mail.gmail.com> References: <200906021700.n52H0uZb029220@portal.open-bio.org> <7265d4f0906021239n2c9b71f9l3a69d362bbd66183@mail.gmail.com> <7265d4f0906021329y6b6ca58dv925b7741d1e827d5@mail.gmail.com> <320fb6e00906030554p6daa59a3u54bfaade81c07d5f@mail.gmail.com> <7265d4f0906030752l64e48498v92fd18a80d20658c@mail.gmail.com> <320fb6e00906091557v2b0ac4c0y396947dac22d72e1@mail.gmail.com> Message-ID: On Jun 9, 2009, at 6:57 PM, Peter wrote: > You're not saying in BioSQL bioentry.identifier should for an NCBI > GI number *only*, are you? No, absolutely not. It is the "internal database identifier" from where the record came from, if that database assigns - and publishes - such identifiers. For example, it might be the primary key in some database. Just keep in mind that accession is required, whereas identifier is not, and they are not synonymous. So if you only have one identifier for a record, unless you know that it's the GI# and what you have is a GenBank record, the identifier would likely be called the accession, and the identifier column would remain null. 
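[Editorial sketch, not part of the original messages.] The difference between the loader behaviour quoted earlier in this thread (fall back to record.id when there is no GI number) and the practice recommended here (leave the optional identifier NULL) can be shown with two small functions. Both function names are hypothetical, plain dictionaries stand in for record.annotations, and None is assumed to map to SQL NULL.

```python
def identifier_with_fallback(record_id, annotations):
    """Behaviour quoted from the loader: reuse record.id when no GI is present."""
    if "gi" in annotations:
        return annotations["gi"]
    return record_id


def identifier_or_null(record_id, annotations):
    """Recommended practice: only store an identifier the source database assigned."""
    # record_id is deliberately unused; it belongs in the required accession column.
    return annotations.get("gi")  # None when absent, assumed to be stored as NULL


fasta_like = {}                  # e.g. a FASTA record with a single ID string
genbank_like = {"gi": "123456"}  # made-up GI number, for illustration only

print(identifier_with_fallback("my_seq_1", fasta_like))
print(identifier_or_null("my_seq_1", fasta_like))
print(identifier_or_null("AB000001.1", genbank_like))
```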
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Wed Jun 10 09:11:46 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 10 Jun 2009 10:11:46 +0100 Subject: [Biopython-dev] Installation documentation In-Reply-To: <320fb6e00906090943y5b9dfe09wb438033c4400668b@mail.gmail.com> References: <320fb6e00904270859q23a9abd8rfb21b10ebee30a77@mail.gmail.com> <20090428124119.GV34546@sobchak.mgh.harvard.edu> <320fb6e00904280640l58f0098fs467d90646a039e29@mail.gmail.com> <5aa3b3570906090909s3174f7a8r225a95a48103f251@mail.gmail.com> <320fb6e00906090943y5b9dfe09wb438033c4400668b@mail.gmail.com> Message-ID: <320fb6e00906100211g4bbe15evdd4c0d53e847cdb6@mail.gmail.com> On Tue, Jun 9, 2009 at 5:43 PM, Peter wrote: > Using the packages from your Linux distribution is probably the > easiest and most reliable way to get Biopython on Linux - but > these are inevitably a little out of date most of the time. > > If it works, then yes, easy_install / pypi is nice and easy to use. As > long as Brad (or someone) is happy to support this, that's fine with > me. (But to be clear, I still don't think it should be the recommended "official" way to install Biopython - just an option.) > However, easy_install isn't perfect. If you browse the NumPy/SciPy > mailing lists you'll see plenty of issues with easy_install - they have > problems with CPU specific optimised builds and so on which are > rather complicated to deal with. This is relevant because we would > need easy_install to handle NumPy for us. Also, easy_install doesn't work properly for ReportLab (one of our optional dependencies used only for Bio.Graphics, which includes GenomeDiagram). 
See for example: http://two.pairlist.net/pipermail/reportlab-users/2009-May/008253.html Peter From cy at cymon.org Wed Jun 10 16:03:28 2009 From: cy at cymon.org (Cymon Cox) Date: Wed, 10 Jun 2009 17:03:28 +0100 Subject: [Biopython-dev] [Bug 2833] Features insertion on previous bioentry_id In-Reply-To: References: <200906021700.n52H0uZb029220@portal.open-bio.org> <7265d4f0906021239n2c9b71f9l3a69d362bbd66183@mail.gmail.com> <7265d4f0906021329y6b6ca58dv925b7741d1e827d5@mail.gmail.com> <320fb6e00906030554p6daa59a3u54bfaade81c07d5f@mail.gmail.com> <7265d4f0906030752l64e48498v92fd18a80d20658c@mail.gmail.com> <320fb6e00906091557v2b0ac4c0y396947dac22d72e1@mail.gmail.com> Message-ID: <7265d4f0906100903q3f2b75f5p2295174e45da3512@mail.gmail.com> 2009/6/10 Hilmar Lapp > > On Jun 9, 2009, at 6:57 PM, Peter wrote: > > You're not saying in BioSQL bioentry.identifier should for an NCBI GI >> number *only*, are you? >> > > > No, absolutely not. It is the "internal database identifier" from where the > record came from, if that database assigns - and publishes - such > identifiers. For example, it might be the primary key in some database. > > Just keep in mind that accession is required, whereas identifier is not, > and they are not synonymous. So if you only have one identifier for a > record, unless you know that it's the GI# and what you have is a GenBank > record, the identifier would likely be called the accession, and the > identifier column would remain null. Thanks Hilmar. If I'm interpreting you correctly, by implication, the only time a value that is not a NCBI GI number gets added to the bioentry.identifier, is when a database (other than NCBI) implements two unique 'identifiers' such that one would be assigned to the accession and one to the identifier fields. What are these databases? It would be useful to check that they are being dealt with correctly. 
In which case biopython should not be assigning bioentry.identifier to record.id when the record.annotations['gi'] is missing. Cheers, C. > > > -hilmar > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > -- ____________________________________________________________________ Cymon J. Cox Centro de Ciencias do Mar Faculdade de Ciencias do Mar e Ambiente (FCMA) Universidade do Algarve Campus de Gambelas 8005-139 Faro Portugal Phone: +0351 289800909 ext 7909 Fax: +0351 289800051 Email: cy at cymon.org, cymon at ualg.pt, cymon.cox at gmail.com HomePage : http://biology.duke.edu/bryology/cymon.html -8.63/-6.77 From bugzilla-daemon at portal.open-bio.org Wed Jun 10 21:51:48 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 10 Jun 2009 17:51:48 -0400 Subject: [Biopython-dev] [Bug 2783] Using alternative start codons in Bio.Seq translate method/function In-Reply-To: Message-ID: <200906102151.n5ALpmNA013225@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2783 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-10 17:51 EST ------- (In reply to comment #5) > > On Bug 2381, comment #51, Leighton wrote: > > In terms of nomenclature: > > > > The default behaviour of translate() as Peter proposed: read through > > in-frame and translate with the appropriate codon table - is fine in > > nearly all circumstances. Most other circumstances are covered by > > stopping at the first in-frame stop codon, which Peter has implemented, > > and is an option we all seem to agree on. 
> > > > Biologically-speaking, this behaviour is not always correct for CDS in > > prokaryotes, where alternative start codons may occur a significant > > minority of the time. These will be mistranslated if no provision is > > made for them. I think a useful biological sequence object should at > > least try to mimic actual biology, so we should provide an option to > > handle this. > > > > We should not assume that a sequence is a CDS unless it is specified by > > the user. It seems reasonable to me that the term 'cds' should occur in > > any such argument from the user. > > > > We have at least two options for how to proceed with a CDS: i) we can > > provide a strict CDS-type translation, which requires confirmation that > > the sequence is, in fact, a CDS; ii) we can provide a weak CDS-type > > translation, which only modifies the way the start codon is translated. > > In both cases, behaviour is specific to CDS, and so having 'cds' in the > > argument name *somewhere* seems obvious, and entirely reasonable. > > Leighton's option (ii) is start codon only modification. This is what > I implemented in the patch on comment 1 (attachment 1259 [details]). > We haven't agreed on a good name for this - which is partly why I went > back to revisit the alternative: > > Leighton's option (i) is strict CDS-type translation. As Leighton suggests, > having "cds" in the argument name here makes sense. ... After some reflection I have decided to check in code doing what Leighton called option (i), strict CDS-type translation (as provided in BioPerl via their "complete" argument). This code was based on the above patch (attachment 1298), but with the check for an extra in frame stop codon (which was missing but described in the docstrings). I also went with the shorter argument name, just "cds" rather than "complete_cds", but (until the next release) I am open to changing this new option name. 
Please bring this up on the mailing list if you don't like "cds" or think
it is unclear. Thanks.

Marking this bug as fixed.

From bugzilla-daemon at portal.open-bio.org Thu Jun 11 22:35:00 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 11 Jun 2009 18:35:00 -0400
Subject: [Biopython-dev] [Bug 2856] New: Duplicate positions for some restriction enzymes in some sequences
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2856

           Summary: Duplicate positions for some restriction enzymes in some
                    sequences
           Product: Biopython
           Version: 1.50
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: zdmytriv at lbl.gov

Returns 2 identical positions for EcoRI enzyme in this sequence:
gaattccggatgagcattcatcaggcgggcaagaatgtgaataaaggccgga

Run this script test.py:

from Bio import SeqIO
from Bio.Restriction import *
from Bio.Seq import Seq
from Bio.Alphabet.IUPAC import IUPACAmbiguousDNA

if __name__ == "__main__":
    sequence = "gaattccggatgagcattcatcaggcgggcaagaatgtgaataaaggccgga"
    seq = Seq(sequence, IUPACAmbiguousDNA())
    analysis = Analysis([EcoRI], seq, linear=False)
    results = analysis.full()

    for enzyme, positions in results.iteritems():
        if len(positions) == 0: continue

        print enzyme
        for position in positions:
            print position

# returns 2 items 2 and 2
From sohm at inaf.cnrs-gif.fr Fri Jun 12 18:53:07 2009
From: sohm at inaf.cnrs-gif.fr (Frédéric Sohm)
Date: Fri, 12 Jun 2009 20:53:07 +0200
Subject: [Biopython-dev] [Bug 2856] New: Duplicate positions for some restriction enzymes in some sequences
In-Reply-To:
References:
Message-ID: <4A32A413.9090100@inaf.cnrs-gif.fr>

Hi everyone,

OK, this is a small mistake in the way restriction objects handle the
sequence when searching for sites that span the boundary of a circular
sequence. The current code goes one base too far, so the beginning of the
sequence is scanned twice. Two sites are reported: one at the beginning and
one at the end. After correction of the index, the second site is reported
at the same position as the first one (which incidentally is a good thing,
since it proves the corrections are properly handled). The final result is a
duplicated report for restriction sites starting at the very first base of a
circular sequence.

Here is the patch:

======================================================================
--- biopython-1.50-old/Bio/Restriction/Restriction.py   2008-10-22 23:49:06.000000000 +0200
+++ biopython-1.50-new/Bio/Restriction/Restriction.py   2009-06-12 20:28:46.000000000 +0200
@@ -197,7 +197,7 @@
         if self.is_linear() :
             data = self.data
         else :
-            data = self.data + self.data[1:size+1]
+            data = self.data + self.data[1:size]
         return [(i.start(), i.group) for i in re.finditer(pattern, data)]
 
     def __getitem__(self, i) :
=======================================================================

I will try to upload it.
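
The off-by-one can be reproduced outside of Biopython. What follows is a
minimal sketch, not the actual Restriction.py code: find_sites and its
"fixed" flag are hypothetical names, and it assumes (as the slice starting
at index 1 suggests) that the stored data carries a one-character pad at
index 0 so that positions are 1-based.

```python
import re


def find_sites(seq, site, fixed=True):
    """Return 1-based start positions of `site` in a circular sequence."""
    data = " " + seq.upper()       # pad index 0 so that positions are 1-based
    size = len(site)
    if fixed:
        extra = data[1:size]       # first size-1 bases: enough to span the join
    else:
        extra = data[1:size + 1]   # first size bases: rescans the whole first site
    hits = [m.start() for m in re.finditer(site, data + extra)]
    # Map positions that fall in the appended copy back onto the circle.
    return [(pos - 1) % len(seq) + 1 for pos in hits]


if __name__ == "__main__":
    seq = "gaattccggatgagcattcatcaggcgggcaagaatgtgaataaaggccgga"
    print(find_sites(seq, "GAATTC", fixed=False))  # [1, 1]: the duplicate
    print(find_sites(seq, "GAATTC", fixed=True))   # [1]
    # A site that really spans the join is still found after the fix.
    print(find_sites("TTCAAAAGAA", "GAATTC"))      # [8]
```

Appending only size-1 bases is enough because any site that begins inside
the appended copy would also begin inside the original sequence.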
Best regards Fred bugzilla-daemon at portal.open-bio.org wrote: > http://bugzilla.open-bio.org/show_bug.cgi?id=2856 > > Summary: Duplicate positions for some restriction enzymes in some > sequences > Product: Biopython > Version: 1.50 > Platform: All > OS/Version: All > Status: NEW > Severity: normal > Priority: P2 > Component: Main Distribution > AssignedTo: biopython-dev at biopython.org > ReportedBy: zdmytriv at lbl.gov > > > Returns 2 identical positions for EcoRI enzyme in this sequence: > gaattccggatgagcattcatcaggcgggcaagaatgtgaataaaggccgga > > Run this script test.py: > from Bio import SeqIO > from Bio.Restriction import * > from Bio.Seq import Seq > from Bio.Alphabet.IUPAC import IUPACAmbiguousDNA > > if __name__ == "__main__": > sequence = "gaattccggatgagcattcatcaggcgggcaagaatgtgaataaaggccgga" > seq = Seq(sequence, IUPACAmbiguousDNA()) > analysis = Analysis([EcoRI], seq, linear=False) > results = analysis.full() > > for enzyme, positions in results.iteritems(): > if len(positions) == 0: continue > > print enzyme > for position in positions: > print position > > # returns 2 items 2 and 2 > > From cy at cymon.org Sun Jun 14 15:23:23 2009 From: cy at cymon.org (Cymon Cox) Date: Sun, 14 Jun 2009 16:23:23 +0100 Subject: [Biopython-dev] "Your XML file did not start with Folks, I've been using qblast recently, and got a lot of invalid replies from NCBI of this sort: Traceback (most recent call last): File "test_NCBI_qblast.py", line 71, in record = NCBIXML.read(handle) File "/home/cymon/git/github-master/Bio/Blast/NCBIXML.py", line 564, in read first = iterator.next() File "/home/cymon/git/github-master/Bio/Blast/NCBIXML.py", line 611, in parse % XML_START) ValueError: Your XML file did not start with gi|116660609|gb|EG558220.1|EG558220 CR02019H04 Leaf CR02 cDNA library Catharanthus roseus cDNA clone CR02019H04 5', mRNA 
sequence\nCTCCATTCCCTCTCTATTTTCAGTCTAATCAAATTAGAGCTTAAAAGAATGAGATTTTTAACAAATAAAA\nAAACATAGGGGAGATTTCATAAAAGTTATATTAGTGATTTGAAGAATATTTTAGTCTATTTTTTTTTTTT\nTCTTTTTTTGATGAAGAAAGGGTATATAAAATCAAGAATCTGGGGTGTTTGTGTTGACTTGGGTCGGGTG\nTGTATAATTCTTGATTTTTTCAGGTAGTTGAAAAGGTAGGGAGAAAAGTGGAGAAGCCTAAGCTGATATT\nGAAATTCATATGGATGGAAAAGAACATTGGTTTAGGATTGGATCAAAAAATAGGTGGACATGGAACTGTA\nCCACTACGTCCTTACTATTTTTGGCCGAGGAAAGATGCTTGGGAAGAACTTAAAACAGTTTTAGAAAGCA\nAGCCATGGATTTCTCAGAAGAAAATGATTATACTTCTTAATCAGGCAACTGATATTATCAATTTATGGCA\nGCAGAGTGGTGGCTCCTTGTCCCAGCAGCAGTAATTACTTTTTTTTCTCTTTTTGTTTCCAAATTAAGAA\nACATTAGTATCATATGGCTATTTGCTCAATTGCAGATTTCTTTCTTTTGTGAATG", ...) Status=WAITING Results == '\n\n': continuing... Results == '\n\n': continuing... Results == '\n\n': continuing... Done Anyone else seen this? Am I just unlucky enough to have a flaky internet connection? Cheers, C. -- From biopython at maubp.freeserve.co.uk Sun Jun 14 18:16:08 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 14 Jun 2009 19:16:08 +0100 Subject: [Biopython-dev] "Your XML file did not start with References: <7265d4f0906140823r9979362y7b1633447e13292f@mail.gmail.com> Message-ID: <320fb6e00906141116l4cf9a9d5u733497d3b02e1b6a@mail.gmail.com> On 6/14/09, Cymon Cox wrote: > Folks, > > I've been using qblast recently, and got a lot of invalid replies from NCBI > of this sort: > > Traceback (most recent call last): > ... > ValueError: Your XML file did not start with Which is true: NCBI is returning "\n\n". If you code around this and just > keep going the results eventually arrive: > ... At first glance, something based on your change looks sensible. Next time I spot the unit test failing I'll try and reproduce this. 
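
The workaround described in the quoted report (treat a blank reply as "not
ready yet" and keep polling) might look roughly like the sketch below. It is
an illustration under assumptions, not the change made to Biopython:
poll_until_ready, fetch and the retry parameters are hypothetical names, and
fetch stands in for whatever call retrieves the current reply text from the
server.

```python
import time


def poll_until_ready(fetch, max_tries=20, delay=3.0, sleep=time.sleep):
    """Call fetch() until it returns a non-blank reply, then return it."""
    for _ in range(max_tries):
        reply = fetch()
        if reply.strip():
            # Anything other than whitespace is handed back to the caller,
            # who can then check that it really is XML.
            return reply
        # A reply such as "\n\n" carries no results yet: wait and ask again.
        sleep(delay)
    raise RuntimeError("no usable reply after %d attempts" % max_tries)


if __name__ == "__main__":
    # Fake server: two blank replies, then a stand-in XML document.
    replies = iter(["\n\n", "\n\n", '<?xml version="1.0"?><BlastOutput/>'])
    result = poll_until_ready(lambda: next(replies), delay=0.0)
    print(result.startswith("<?xml"))
```

Capping the number of attempts keeps a permanently empty reply from turning
into an endless loop.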
Peter From cy at cymon.org Sun Jun 14 18:26:56 2009 From: cy at cymon.org (Cymon Cox) Date: Sun, 14 Jun 2009 19:26:56 +0100 Subject: [Biopython-dev] "Your XML file did not start with References: <7265d4f0906140823r9979362y7b1633447e13292f@mail.gmail.com> <320fb6e00906141116l4cf9a9d5u733497d3b02e1b6a@mail.gmail.com> Message-ID: <7265d4f0906141126l2f00fecehaa28273af9b3681a@mail.gmail.com> 2009/6/14 Peter > On 6/14/09, Cymon Cox wrote: > > Folks, > > > > I've been using qblast recently, and got a lot of invalid replies from > NCBI > > of this sort: > > > > Traceback (most recent call last): > > ... > > ValueError: Your XML file did not start with > I've seen message that from the unit test sometimes, and assumed the > NCBI was returning a temporary HTML error page of some kind - > rerunning our test would normally work. Without checking the > traceback, I would guess this is the same issue you have found. > > > Which is true: NCBI is returning "\n\n". If you code around this and > just > > keep going the results eventually arrive: > > ... > > At first glance, something based on your change looks sensible. Next > time I spot the unit test failing I'll try and reproduce this. It's pretty hit or miss: I would guess once in every 10+ times I ran the test_NCBI_qblast I would encounter the problem. Cheers, C. 
--
From cy at cymon.org Mon Jun 15 09:02:47 2009
From: cy at cymon.org (Cymon Cox)
Date: Mon, 15 Jun 2009 10:02:47 +0100
Subject: [Biopython-dev] "Your XML file did not start with
References: <7265d4f0906140823r9979362y7b1633447e13292f@mail.gmail.com>
	<320fb6e00906141116l4cf9a9d5u733497d3b02e1b6a@mail.gmail.com>
	<7265d4f0906141126l2f00fecehaa28273af9b3681a@mail.gmail.com>
Message-ID: <7265d4f0906150202j3daeefa9we304cf29c4f6cd6d@mail.gmail.com>

2009/6/14 Cymon Cox

> 2009/6/14 Peter
>
>> On 6/14/09, Cymon Cox wrote:
>> > Folks,
>> >
>> > I've been using qblast recently, and got a lot of invalid replies from
>> > NCBI of this sort:
>> >
>> > Traceback (most recent call last):
>> > ...
>> > ValueError: Your XML file did not start with
>>
>> I've seen that message from the unit test sometimes, and assumed the
>> NCBI was returning a temporary HTML error page of some kind -
>> rerunning our test would normally work. Without checking the
>> traceback, I would guess this is the same issue you have found.
>>
>> > Which is true: NCBI is returning "\n\n". If you code around this and
>> > just keep going the results eventually arrive:
>> > ...
>>
>> At first glance, something based on your change looks sensible. Next
>> time I spot the unit test failing I'll try and reproduce this.
>
> It's pretty hit or miss: I would guess once in every 10+ times I ran the
> test_NCBI_qblast I would encounter the problem.

I can be a little more specific: out of 742 calls to qblast, 75 returned the
"Your XML" error. (This was with a different ISP.)

C.
--
From eric.talevich at gmail.com Mon Jun 15 17:04:20 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 15 Jun 2009 13:04:20 -0400
Subject: [Biopython-dev] GSoC Weekly Update: PhyloXML for Biopython
Message-ID: <3f6baf360906151004m4ee3595g7ac007fd218e370f@mail.gmail.com>

Hi all,

Previously (June 8-12) I:

* Finished writing constructors and XML parsers for Tier 0,1,2 elements
  (everything that appears in the example phyloXML files)
* Added to_seqrecord and from_seqrecord methods to the PhyloXML.Sequence
  class -- getting Bio.SeqRecord to stand in for PhyloXML.Sequence entirely
  will require some more thought
* Wrote a unit test for counting clades/branches (topology check)
* Changed the no-op unit tests to count the total number of tags (nodes) in
  the given phyloXML file, keeping stdout clean
* Miscellaneous code cleanup
* Added a few magic methods to make usage easier: __len__, __iter__, __str__

This week (June 15-19) I will:

* Finish unit tests for parsing and instantiating core elements
* Test and check parser performance versus BioPerl and Archaeopteryx loading
  times
* Document results of parser testing and performance (on wiki or here)
* Document basic usage of the parser on the Biopython wiki

Thoughts:

* Test-driven development kind of went out the window this week.
  Implementing each new class is pretty short now that I'm using the
  from_element class methods consistently, so I just charged ahead rather
  than write tests for each class first. I checked the more complicated
  classes in the REPL, but didn't copy that code into the test script...
  shameful. There are a couple of bugs I know of already, but haven't fixed.
  So catching up there will be the bulk of the effort this week.
* The unit tests I do have in place give some sense of memory and CPU usage.
  For the full NCBI taxonomy, memory usage climbs up above 2 GB with the
  read() function, which isn't a problem on this workstation but could be for
  others.
* For biopython-dev, a summary of the parsing strategy: There are two top-level functions, read() and parse(), which behave according to convention. Both use ElementTree's iterparse() function to keep memory usage down (if used properly) and enable streaming data from other sources. The structure of the XML file looks like: ... (recursive) ... (can have several trees) (optional, arbitrary tags) ... The read() function returns all of this as a single Python object, with two attributes: phylogenies[] and other[]. parse() ignores the "other" stuff and just iterates through the "phylogeny" trees, so it should be handy if you're not concerned with the extra arbitrary data that may appear after the trees. I have two more functions for parsing phylogeny and clade objects that track the current context of the XML parser, and clear elements after they're completed. Then all other tags are dispatched to the corresponding classes, via from_element() methods attached to each class, or else built-in constructors for primitive types like int, float, str. The from_element() class methods take an ElementTree.Element object, deal with it, and pass any child nodes for complex types to the corresponding class's from_element() method. The only recursive element is Clade, which is treated specially, so there's nothing scary going on with the stack. I'm open to suggestions for reorganizing this to make Nexus/Newick integration more feasible. Optimization strategies are also a good topic this week. A few weeks later in my project plan I'm also scheduled to implement the rest of the magic methods, so we should discuss the appropriate amount and types of magic to add, too -- the showcase for this right now is Tests/test_PhyloXML. 
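
The streaming approach described above can be sketched with the standard
library alone. This is a simplified illustration, not the Bio.PhyloXML code:
iter_phylogenies is a hypothetical name, the sample document is made up and
omits the phyloXML namespace, and each tree is reduced to its name and clade
count rather than built into objects.

```python
import io
import xml.etree.ElementTree as ET

SAMPLE = """<phyloxml>
  <phylogeny><name>first</name>
    <clade><clade><name>A</name></clade><clade><name>B</name></clade></clade>
  </phylogeny>
  <phylogeny><name>second</name>
    <clade><name>C</name></clade>
  </phylogeny>
</phyloxml>"""


def iter_phylogenies(source):
    """Yield (name, clade_count) for each phylogeny, one tree at a time."""
    path = []                  # tags of the currently open elements
    name, clades = None, 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
            if elem.tag == "phylogeny":
                name, clades = None, 0
            continue
        path.pop()             # after this, path[-1] is the parent's tag
        if elem.tag == "clade":
            clades += 1
            elem.clear()       # finished clade: drop its children and text
        elif elem.tag == "name" and path and path[-1] == "phylogeny":
            name = elem.text
        elif elem.tag == "phylogeny":
            yield name, clades
            elem.clear()       # finished tree: free it before the next one


if __name__ == "__main__":
    for tree in iter_phylogenies(io.BytesIO(SAMPLE.encode("utf-8"))):
        print(tree)
```

Tracking the open tags in a list is what lets the loop tell a phylogeny's
name apart from a clade's name without recursion.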
Cheers,
Eric

http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML
https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML

From anaryin at gmail.com Tue Jun 16 08:00:18 2009
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 16 Jun 2009 10:00:18 +0200
Subject: [Biopython-dev] Biopython-dev Digest, Vol 77, Issue 12
In-Reply-To:
References:
Message-ID:

I've been having that problem with the "Your XML" error for a while now,
with qblast. But I thought it was my connection going nuts and I just used a
retry if that happened...

Just to add to the two of you having that error :)

João [ .. ] Rodrigues

(Blog) http://doeidoei.wordpress.com
(MSN) always_asleep_ at hotmail.com
(Skype) rodrigues.jglm

From biopython at maubp.freeserve.co.uk Tue Jun 16 09:16:46 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 16 Jun 2009 10:16:46 +0100
Subject: [Biopython-dev] Your XML file did not start with

On Tue, Jun 16, 2009 at 9:00 AM, João Rodrigues wrote:
>
> I've been having that problem with the "Your XML" error for a while now,
> with qblast. But I thought it was my connection going nuts and I just used
> a retry if that happened...
>
> Just to add to the two of you having that error :)
>
> João [ .. ] Rodrigues

OK - looks like this is a reasonably common issue. Have you tried
Cymon's fix yet? If not, are you happy to update your copy of Biopython
to the latest code from CVS/github and test that? If so, I'll check
in the fix and you can help test that...

Thanks,

Peter

P.S. If replying to the digest emails, please edit the subject line to
match the topic you are replying to.
From bugzilla-daemon at portal.open-bio.org Tue Jun 16 09:22:54 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Jun 2009 05:22:54 -0400
Subject: [Biopython-dev] [Bug 2780] PDB file HETATMs cannot be alternative location of a residue that is an ATOM
In-Reply-To:
Message-ID: <200906160922.n5G9MsNK023105@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2780

klaus.kopec at tuebingen.mpg.de changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|1.49                        |1.50

------- Comment #3 from klaus.kopec at tuebingen.mpg.de  2009-06-16 05:22 EST -------
bug still exists in v1.50

From bugzilla-daemon at portal.open-bio.org Tue Jun 16 09:25:20 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Jun 2009 05:25:20 -0400
Subject: [Biopython-dev] [Bug 2781] Bio.PDB Structure instances cannot be deepcopied
In-Reply-To:
Message-ID: <200906160925.n5G9PKI3023354@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2781

klaus.kopec at tuebingen.mpg.de changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|1.49                        |1.50

------- Comment #1 from klaus.kopec at tuebingen.mpg.de  2009-06-16 05:25 EST -------
bug still exists in v1.50 (still with Python 2.6.1, but now Kubuntu 9.04 64-Bit)
From thomas.hamelryck at gmail.com Tue Jun 16 15:09:55 2009
From: thomas.hamelryck at gmail.com (Thomas Hamelryck)
Date: Tue, 16 Jun 2009 17:09:55 +0200
Subject: [Biopython-dev] Biopython logo usage
In-Reply-To: <2d7c25310906051105k36105d51had0593c2903a8464@mail.gmail.com>
References: <9e2f512b0906041622p3e2320cdxc735ab6a596d2629@mail.gmail.com>
	<2d7c25310906051105k36105d51had0593c2903a8464@mail.gmail.com>
Message-ID: <2d7c25310906160809h61a36e98w5615e1f0b5117d5f@mail.gmail.com>

Hi all,

I can now confirm that the logo is released in the public domain.
I will give it a suitable license later this week.

Cheers,

/Thomas

On Fri, Jun 5, 2009 at 8:05 PM, Thomas Hamelryck wrote:

>
>
> On Fri, Jun 5, 2009 at 1:30 AM, Iddo Friedberg wrote:
>
>> Wow, congratulations! I am so buying this off my startup....
>>
>> I think the question is best addressed to Thomas Hamelryck. IIRC, his
>> friend designed the logo. There is no license I am aware of, but it is
>> probably a good idea to put a cc-na license on it, like Tux has.
>>
>
> Should be fine, but will ask to be sure.
>
> Cheers,
>
> -Thomas
>
>

--
Thomas Hamelryck
Group leader Structural Bioinformatics
Bioinformatics center
Department of Biology
University of Copenhagen
Ole Maaloes Vej 5
DK-2200 Copenhagen N
Denmark
http://wiki.binf.ku.dk/User:Thomas_Hamelryck
http://www.binf.ku.dk/research/structural_bioinformatics/

From jkhilmer at gmail.com Tue Jun 16 20:16:49 2009
From: jkhilmer at gmail.com (Jonathan Hilmer)
Date: Tue, 16 Jun 2009 14:16:49 -0600
Subject: [Biopython-dev] Biopython logo usage
In-Reply-To: <2d7c25310906160809h61a36e98w5615e1f0b5117d5f@mail.gmail.com>
References: <9e2f512b0906041622p3e2320cdxc735ab6a596d2629@mail.gmail.com>
	<2d7c25310906051105k36105d51had0593c2903a8464@mail.gmail.com>
	<2d7c25310906160809h61a36e98w5615e1f0b5117d5f@mail.gmail.com>
Message-ID: <81277ce10906161316x6caa73feve86de4d54dec7efd@mail.gmail.com>

I greatly appreciate the effort people have contributed to Biopython
in general and this logo particularly, but perhaps it would be better
to have a logo that scales better, and is in a vector format? The
python-DNA theme could be kept, but with simpler text and snakes to
remain legible when used as a small logo or as part of a background.

I'd be happy to create a vectorized interpretation if people are interested.

Jonathan

On Tue, Jun 16, 2009 at 9:09 AM, Thomas Hamelryck wrote:
> Hi all,
>
> I can now confirm that the logo is released in the public domain.
> I will give it a suitable license later this week.
>
> Cheers,
>
> /Thomas
>
> On Fri, Jun 5, 2009 at 8:05 PM, Thomas Hamelryck wrote:
>
>>
>>
>> On Fri, Jun 5, 2009 at 1:30 AM, Iddo Friedberg wrote:
>>
>>> Wow, congratulations! I am so buying this off my startup....
>>>
>>> I think the question is best addressed to Thomas Hamelryck. IIRC, his
>>> friend designed the logo. There is no license I am aware of, but it is
>>> probably a good idea to put a cc-na license on it, like Tux has.
>>>
>>
>> Should be fine, but will ask to be sure.
>>
>> Cheers,
>>
>> -Thomas
>>
>>
>
>
> --
> Thomas Hamelryck
> Group leader Structural Bioinformatics
> Bioinformatics center
> Department of Biology
> University of Copenhagen
> Ole Maaloes Vej 5
> DK-2200 Copenhagen N
> Denmark
> http://wiki.binf.ku.dk/User:Thomas_Hamelryck
> http://www.binf.ku.dk/research/structural_bioinformatics/
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

From biopython at maubp.freeserve.co.uk Tue Jun 16 21:25:27 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 16 Jun 2009 22:25:27 +0100
Subject: [Biopython-dev] Biopython logo usage
In-Reply-To: <2d7c25310906160809h61a36e98w5615e1f0b5117d5f@mail.gmail.com>
References: <9e2f512b0906041622p3e2320cdxc735ab6a596d2629@mail.gmail.com>
	<2d7c25310906051105k36105d51had0593c2903a8464@mail.gmail.com>
	<2d7c25310906160809h61a36e98w5615e1f0b5117d5f@mail.gmail.com>
Message-ID: <320fb6e00906161425u16d6f59ey7c9daf2d3e22be94@mail.gmail.com>

On Tue, Jun 16, 2009 at 4:09 PM, Thomas Hamelryck wrote:
>
> Hi all,
>
> I can now confirm that the logo is released in the public domain.
> I will give it a suitable license later this week.
>
> Cheers,
>
> /Thomas

Thanks Thomas,

Did you remember to ask Henrik Vestergaard if he had the original
files still? If there is something like an Adobe Illustrator or
Photoshop file, that could be very helpful to generate a vector based
version (e.g. PDF, SVG) suitable for big posters etc. Even any
larger JPG files are worth having...

[This would certainly be easier than Jonathan Hilmer trying to
recreate a vector version from scratch - which might be worth
trying]

Thanks

Peter

P.S. See also http://www.biopython.org/wiki/Logo

From sbassi at clubdelarazon.org Tue Jun 16 21:30:38 2009
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Tue, 16 Jun 2009 18:30:38 -0300
Subject: [Biopython-dev] Biopython logo usage
In-Reply-To: <2d7c25310906160809h61a36e98w5615e1f0b5117d5f@mail.gmail.com>
References: <9e2f512b0906041622p3e2320cdxc735ab6a596d2629@mail.gmail.com>
	<2d7c25310906051105k36105d51had0593c2903a8464@mail.gmail.com>
	<2d7c25310906160809h61a36e98w5615e1f0b5117d5f@mail.gmail.com>
Message-ID: <9e2f512b0906161430s3a57536xa283399a8a3e4c39@mail.gmail.com>

On Tue, Jun 16, 2009 at 12:09 PM, Thomas Hamelryck wrote:
> I can now confirm that the logo is released in the public domain.
> I will give it a suitable license later this week.

OK, I'll contact the cover designer to include a small logo in a
corner if there is time.
Thank you very much.

Best,
SB.

From chapmanb at 50mail.com Wed Jun 17 12:41:01 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 17 Jun 2009 08:41:01 -0400
Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update: PhyloXML for Biopython
In-Reply-To: <3f6baf360906151004m4ee3595g7ac007fd218e370f@mail.gmail.com>
References: <3f6baf360906151004m4ee3595g7ac007fd218e370f@mail.gmail.com>
Message-ID: <20090617124101.GH44321@sobchak.mgh.harvard.edu>

Hi Eric;
Nice update and thanks again for copying the Biopython development
list on this.

> * Added to_seqrecord and from_seqrecord methods to the PhyloXML.Sequence
> class
> -- getting Bio.SeqRecord to stand in for PhyloXML.Sequence entirely will
> require some more thought

I'm looking forward to seeing how you decide to go forward with
this. For the work I do on a day to day basis, a continual struggle
involves establishing relationships between things to retrieve more
information. For instance, a pair of nodes on a tree is interesting
-- how would I find papers, experiments and other information
associated with those sequences?
It seems like Accession and the ref attribute of Annotation help establish these relationships. > * Test-driven development kind of went out the window this week. Heh. It happens -- sounds sensible to have a clean up and documentation week this week; that will also help others who are interested dig into using it. > * The unit tests I do have in place give some sense of memory and CPU usage. > For the full NCBI taxonomy, memory usage climbs up above 2 GB with the > read() function, which isn't a problem on this workstation but could be for > others. Do you see an opportunity to offer iterating over clades instead of loading them all into memory for these larger trees? This would involve lazily loading subclades on request and would limit some functionality for querying the full tree without loading it all into memory. Another option is to offer some pruning ability as a tree is loading. For instance, if I am loading the whole NCBI taxonomy on a memory limited computer and only need the Angiosperm flowering plant part of the tree. In this case, you'd want to throw away all clades not under the clades of interest. These are probably fringe cases; just brainstorming some ideas. Thanks again, Brad From eric.talevich at gmail.com Wed Jun 17 23:17:41 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 17 Jun 2009 19:17:41 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <20090617124101.GH44321@sobchak.mgh.harvard.edu> References: <3f6baf360906151004m4ee3595g7ac007fd218e370f@mail.gmail.com> <20090617124101.GH44321@sobchak.mgh.harvard.edu> Message-ID: <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> Hi Brad, Here's a mid-week update and partial response to your questions. *SeqRecord transformation* It would be nice if I could round-trip this sequence information perfectly, so that nothing's lost between reading and writing an arbitrary, valid PhyloXML file. 
For that to work, PhyloXML.Sequence.from_seqrec() would need to look at SeqRecord.features and assume that any matching keys have the appropriate PhyloXML meaning. These are the keys that from_seqrec() would look for: location uri annotations domain_architecture Do you see any risk of collision for those names? And for serialization, would it be unwholesome to convert Annotation and DomainArchitecture objects to a GFF-style dict-in-a-string? e.g. annotation="ref=foo;source=bar;..." -- it's another layer of parsing and kind of esoteric, but I can live with it. *Profiling* Christian also suggested an option to parse just the phylogenies with a name or id matching a given string. I like that and I don't see any problem with extending it to clades as well. It seems like a reasonable use case to select a sub-tree from a complete phyloXML document and treat it as a separate phylogeny from then on. This can be supported by various methods for selecting portions of the tree, and a method on Clade for transforming the selection into a new Phylogeny instance (so the original can be safely deleted). I did some profiling with the cProfile module, and it looks like most of the time is being spent instantiating Clade and Taxonomy objects. (Also, pretty_print is hugely inefficient, but that's less important.) I think I can speed up parsing and reduce memory usage by pulling the from_element methods out of each class and using a separate Parser class to do that work. About the 2GB figure I gave earlier for the full NCBI taxonomy -- I was just looking at Ubuntu's system monitor, and Firefox and a few other things were running at the same time, taking up about 800MB already. So the full NCBI taxonomy actually takes up only 1.2GB or so, which isn't such a problem, and I think it will get smaller as I shrink down these PhyloXML classes. Questions: - Do you know of a better way to profile Python code, or visualize it? - Have you used __slots__ to optimize classes? Do you recommend it? 
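For anyone who wants to repeat the kind of profiling run described above, here is a minimal sketch using only cProfile and pstats from the standard library. Node and build_tree are hypothetical stand-ins invented for this example (they are not the PhyloXML Clade or Taxonomy classes), so the numbers it prints say nothing about the real parser; it only shows the mechanics of collecting stats and sorting them by cumulative time.

```python
import cProfile
import io
import pstats


class Node:
    """Hypothetical stand-in for a Clade-like object (not the PhyloXML class)."""

    def __init__(self, name):
        self.name = name
        self.children = []


def build_tree(depth, fanout=3):
    """Build a small balanced tree so there is something to profile."""
    root = Node("node")
    if depth > 0:
        root.children = [build_tree(depth - 1, fanout) for _ in range(fanout)]
    return root


def profile_build(depth=6):
    """Profile build_tree() and return the pstats report as a string."""
    profiler = cProfile.Profile()
    profiler.enable()
    build_tree(depth)
    profiler.disable()

    # Send the report to a string buffer instead of stdout.
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and show only the five most expensive entries.
    stats.sort_stats("cumulative").print_stats(5)
    return stream.getvalue()


if __name__ == "__main__":
    print(profile_build())
```

Sorting by cumulative time tends to surface constructors that are called very many times, which is the pattern reported above for object instantiation.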
And a few that don't fit anywhere else: - What sort of whole-tree operations would you want to do with these objects that you can't do with a Nexus or Newick tree? What other formats would you want to convert to? I'm thinking of adding an Export module later if there's time, for lossy conversions like a graph for networkx. - What's the most intuitive way to display a phylogenetic tree you've loaded into Biopython? Serialize as Nexus and open in TreeViewX? Convert to a graph and send to matplotlib? Or, is there a module in Bio.Graphics that can draw trees? (If not, should there be?) Thanks, Eric On Wed, Jun 17, 2009 at 8:41 AM, Brad Chapman wrote: > Hi Eric; > Nice update and thanks again for copying the Biopython development > list on this. > > > * Added to_seqrecord and from_seqrecord methods to the PhyloXML.Sequence > > class > > -- getting Bio.SeqRecord to stand in for PhyloXML.Sequence entirely > will > > require some more thought > > I'm looking forward to seeing how you decide to go forward with > this. For the work I do on a day to day basis, a continual > struggle involves establishing relationships between things to > retrieve more information. For instance, a pair of nodes on a tree > is interesting -- how would I find papers, experiments and other > information associated with those sequences? It seems like Accession > and the ref attribute of Annotation help establish these > relationships. > > > * Test-driven development kind of went out the window this week. > > Heh. It happens -- sounds sensible to have a clean up and > documentation week this week; that will also help others who are > interested dig into using it. > > > * The unit tests I do have in place give some sense of memory and CPU > usage. > > For the full NCBI taxonomy, memory usage climbs up above 2 GB with the > > read() function, which isn't a problem on this workstation but could > be for > > others. 
> > Do you see an opportunity to offer iterating over clades instead of > loading them all into memory for these larger trees? This would > involve lazily loading subclades on request and would limit some > functionality for querying the full tree without loading it all into > memory. > > Another option is to offer some pruning ability as a tree is > loading. For instance, if I am loading the whole NCBI taxonomy on a > memory limited computer and only need the Angiosperm flowering plant > part of the tree. In this case, you'd want to throw away all clades > not under the clades of interest. > > These are probably fringe cases; just brainstorming some ideas. > > Thanks again, > Brad > From biopython at maubp.freeserve.co.uk Thu Jun 18 09:35:24 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jun 2009 10:35:24 +0100 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> References: <3f6baf360906151004m4ee3595g7ac007fd218e370f@mail.gmail.com> <20090617124101.GH44321@sobchak.mgh.harvard.edu> <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> Message-ID: <320fb6e00906180235k236a5538pf47ff0abae669180@mail.gmail.com> On Thu, Jun 18, 2009 at 12:17 AM, Eric Talevich wrote: > Hi Brad, > > Here's a mid-week update and partial response to your questions. > > *SeqRecord transformation* > > It would be nice if I could round-trip this sequence information perfectly, > so that nothing's lost between reading and writing an arbitrary, valid > PhyloXML file. For that to work, PhyloXML.Sequence.from_seqrec() > would need to look at SeqRecord.features and assume that any matching > keys have the appropriate PhyloXML meaning. > > These are the keys that from_seqrec() would look for: > location > uri > annotations > domain_architecture > > Do you see any risk of collision for those names?
And for serialization, > would it be unwholesome to convert Annotation and DomainArchitecture objects > to a GFF-style dict-in-a-string? e.g. annotation="ref=foo;source=bar;..." -- > it's another layer of parsing and kind of esoteric, but I can live with it. If you can show us a sample record, I would be better able to comment on how I would store it in a SeqRecord. Are you fully familiar with the SeqRecord object, its annotations dictionary, and the list of SeqFeature objects (which have locations relative to the parent SeqRecord) which all have their own annotations dictionary (although under the name of qualifiers for some reason). Perhaps you'd like to proof read the new SeqRecord chapter in the tutorial - it is still a work in progress, but should be informative. Peter From chapmanb at 50mail.com Thu Jun 18 12:52:40 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 18 Jun 2009 08:52:40 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> References: <3f6baf360906151004m4ee3595g7ac007fd218e370f@mail.gmail.com> <20090617124101.GH44321@sobchak.mgh.harvard.edu> <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> Message-ID: <20090618125240.GP44321@sobchak.mgh.harvard.edu> Hi Eric; Nice -- thanks much for the summary. > *SeqRecord transformation* > > It would be nice if I could round-trip this sequence information perfectly, > so > that nothing's lost between reading and writing an arbitrary, valid PhyloXML > file. For that to work, PhyloXML.Sequence.from_seqrec() would need to look > at > SeqRecord.features and assume that any matching keys have the appropriate > PhyloXML meaning. > > These are the keys that from_seqrec() would look for: > location > uri > annotations > domain_architecture > > Do you see any risk of collision for those names? 
And for serialization, > would it be unwholesome to convert Annotation and DomainArchitecture objects > to a GFF-style dict-in-a-string? e.g. annotation="ref=foo;source=bar;..." -- > it's another layer of parsing and kind of esoteric, but I can live with it. SeqRecords have two places to store information related to the sequence: annotations -- key/value pairs describing the entire sequence, implemented as a dictionary with lists as values. features -- items with a location that refer to part of the sequence, which can have key/value pairs, here called qualifiers. My sense is that much of the PhyloXML markup will fit into annotations. For instance, your annotation string should really be part of the annotation dictionary: {"ref" : ["foo"], "source" : ["bar"] } as opposed to a string that requires deserializing. The easiest way to discuss this is to take a few real life cases and see how they can fit, as Peter suggested. People here familiar with using SeqRecords can hopefully come to a consensus as the best place to store different items. > *Profiling* > > Christian also suggested an option to parse just the phylogenies with a > name or id matching a given string. I like that and I don't see any problem > with extending it to clades as well. It seems like a reasonable use case to > select a sub-tree from a complete phyloXML document and treat it as a > separate > phylogeny from then on. This can be supported by various methods for > selecting > portions of the tree, and a method on Clade for transforming the selection > into > a new Phylogeny instance (so the original can be safely deleted). [...] > About the 2GB figure I gave earlier for the full NCBI taxonomy -- I was just > looking at Ubuntu's system monitor, and Firefox and a few other things were > running at the same time, taking up about 800MB already. 
So the full NCBI > taxonomy actually takes up only 1.2GB or so, which isn't such a problem, and > I > think it will get smaller as I shrink down these PhyloXML classes. Sounds great. I think you'll be fine with that memory usage and the ability to select subsets based on an identifier. > I did some profiling with the cProfile module, and it looks like most of the > time is being spent instantiating Clade and Taxonomy objects. (Also, > pretty_print is hugely inefficient, but that's less important.) I think I > can > speed up parsing and reduce memory usage by pulling the from_element methods > out of each class and using a separate Parser class to do that work. > > Questions: > - Do you know of a better way to profile Python code, or visualize it? > - Have you used __slots__ to optimize classes? Do you recommend it? I use cProfile and pstats from the standard library, which it sounds like you are on top of. That normally points me in the right place to try optimizations. I haven't used __slots__ but generally try to avoid any python black magic. If people need additional CPU speedups, I'd suggest Psyco. This increases memory usage so it will be a tradeoff for most people. Benchmarks with and without Psyco would give users a guideline if they need to optimize performance. > And a few that don't fit anywhere else: > > - What sort of whole-tree operations would you want to do with these > objects that you can't do with a Nexus or Newick tree? What > other formats would you want to convert to? I'm thinking of adding an > Export module later if there's time, for lossy conversions like a graph for > networkx. This is a good general question for the users. I like the graph conversion idea, as it avoids having to re-invent all of the graph manipulation and query operations already present in networkx. > - What's the most intuitive way to display a phylogenetic tree you've > loaded into Biopython? Serialize as Nexus and open in TreeViewX? 
> Convert > to a graph and send to matplotlib? Or, is there a module in > Bio.Graphics > that can draw trees? (If not, should there be?) A good general way to do this would be welcome. I've used networkx with pygraphviz to draw rough 'n ready trees before. Here is some horribly non-generalized code that does this: http://github.com/chapmanb/bcbb/blob/master/visualize/tax_data_display.py Brad > On Wed, Jun 17, 2009 at 8:41 AM, Brad Chapman wrote: > > > Hi Eric; > > Nice update and thanks again for copying the Biopython development > > list on this. > > > > > * Added to_seqrecord and from_seqrecord methods to the PhyloXML.Sequence > > > class > > > -- getting Bio.SeqRecord to stand in for PhyloXML.Sequence entirely > > will > > > require some more thought > > > > I'm looking forward to seeing how you decide to go forward with > > this. For the work I do on a day to day basis, a continual > > struggle involves establishing relationships between things to > > retrieve more information. For instance, a pair of nodes on a tree > > is interesting -- how would I find papers, experiments and other > > information associated with those sequences? It seems like Accession > > and the ref attribute of Annotation help establish these > > relationships. > > > > > * Test-driven development kind of went out the window this week. > > > > Heh. It happens -- sounds sensible to have a clean up and > > documentation week this week; that will also help others who are > > interested dig into using it. > > > > > * The unit tests I do have in place give some sense of memory and CPU > > usage. > > > For the full NCBI taxonomy, memory usage climbs up above 2 GB with the > > > read() function, which isn't a problem on this workstation but could > > be for > > > others. > > > > Do you see an opportunity to offer iterating over clades instead of > > loading them all into memory for these larger trees? 
This would > > involve lazily loading subclades on request and would limit some > > functionality for querying the full tree without loading it all into > > memory. > > > > Another option is to offer some pruning ability as a tree is > > loading. For instance, if I am loading the whole NCBI taxonomy on a > > memory limited computer and only need the Angiosperm flowering plant > > part of the tree. In this case, you'd want to throw away all clades > > not under the clades of interest. > > > > These are probably fringe cases; just brainstorming some ideas. > > > > Thanks again, > > Brad > > From biopython at maubp.freeserve.co.uk Thu Jun 18 14:10:48 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jun 2009 15:10:48 +0100 Subject: [Biopython-dev] CONTIG records in GenBank files Message-ID: <320fb6e00906180710ncbf3346rc662853c2e09f71e@mail.gmail.com> A couple of weeks ago, we were talking about a problem with CONTIG lines in GenBank files, http://lists.open-bio.org/pipermail/biopython-dev/2009-June/006192.html On Sun, Jun 7, 2009 at 10:31 PM, Peter wrote: > On 6/7/09, Iddo Friedberg wrote: >> Here is the stack dump, coming from the file: >> >> ftp://ftp.ncbi.nih.gov/genbank/gbcon11.seq.gz >> ... >> Traceback (most recent call last): >> ... >> Bio.GenBank.LocationParserError: >> ... > > That looks like Bug 2745 to me - does the patch on that bug work for > you, and would you be happy storing the CONTIG line as string? Iddo, Did you have a chance to try the patch on Bug 2745 yet? http://bugzilla.open-bio.org/show_bug.cgi?id=2745 If you are happy with the proposed solution, I'd like to get that checked in... 
Thanks, Peter From bugzilla-daemon at portal.open-bio.org Thu Jun 18 15:29:32 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 18 Jun 2009 11:29:32 -0400 Subject: [Biopython-dev] [Bug 2738] Speed up GenBank parsing, in particular location parsing In-Reply-To: Message-ID: <200906181529.n5IFTWQQ021054@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2738 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1206 is|0 |1 obsolete| | Attachment #1210 is|0 |1 obsolete| | ------- Comment #11 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-18 11:29 EST ------- Created an attachment (id=1327) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1327&action=view) Patch for Bio/GenBank/__init__.py to handle simple locations with re The old patch wasn't applying cleanly any more. This is the same code as before, but updated for the current CVS. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
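As an illustration of the idea named in the attachment description above (handling simple locations with re), here is a minimal sketch. It is not the attached patch: the function name parse_simple_location and both patterns are assumptions made up for this example. It only recognises the two plainest location forms and returns None for everything else, which is the point where a caller would hand the string to the full location parser. Positions are returned exactly as written in the file, with no coordinate conversion.

```python
import re

# Hypothetical fast-path patterns; only the two plainest location forms match.
_SIMPLE = re.compile(r"^(\d+)\.\.(\d+)$")
_SIMPLE_COMPLEMENT = re.compile(r"^complement\((\d+)\.\.(\d+)\)$")


def parse_simple_location(text):
    """Return (start, end, strand) for a simple location string, else None.

    start and end are the integers as written in the feature table;
    strand is +1 for a plain range and -1 for complement(...).
    None means "not simple", so the caller should fall back to a full parser.
    """
    text = text.strip()
    match = _SIMPLE.match(text)
    if match:
        return int(match.group(1)), int(match.group(2)), 1
    match = _SIMPLE_COMPLEMENT.match(text)
    if match:
        return int(match.group(1)), int(match.group(2)), -1
    return None


if __name__ == "__main__":
    for example in ["123..456", "complement(10..20)", "join(1..5,10..20)", "<1..200"]:
        print(example, parse_simple_location(example))
```

Anything with join(...), fuzzy ends such as < or >, or references to other records is deliberately left to the slower general-purpose code path.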
From eric.talevich at gmail.com Thu Jun 18 20:22:17 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 18 Jun 2009 16:22:17 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <320fb6e00906180235k236a5538pf47ff0abae669180@mail.gmail.com> References: <3f6baf360906151004m4ee3595g7ac007fd218e370f@mail.gmail.com> <20090617124101.GH44321@sobchak.mgh.harvard.edu> <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> <320fb6e00906180235k236a5538pf47ff0abae669180@mail.gmail.com> Message-ID: <3f6baf360906181322n5981a66eobd05b2692965504a@mail.gmail.com> On Thu, Jun 18, 2009 at 5:35 AM, Peter wrote: > If you can show us a sample record, I would be better able to comment > on how I would store it in a SeqRecord. > Here are a couple of examples from files in Test/PhyloXML/. From phyloxml_examples.xml, a contrived demonstration of various features: A E. coli alcohol dehydrogenase 0.99 Caenorhabditis elegans ADHX Q17335 alcohol dehydrogenase (An extra level of context is shown -- information that doesn't fit into a SeqRecord could also be conceivably moved up into the Clade object.) Assuming values of the SeqRecord.attributes dictionary can also be dictionaries, this isn't too hard to convert to primitive types. Another example from apaf.xml, which appears to be real data: CARD NB-ARC WD40 WD40 WD40 WD40 WD40 WD40 WD40 WD40 WD40 The DomainArchitecture element refers to domains in a protein sequence, according to the spec. This could be reasonably represented as a list of SeqFeature objects, I see now. But converting from a SeqRecord back to PhyloXML, not all SeqFeatures would be protein domains... I don't know what to do with that. The new SeqRecord chapter is very informative -- I was originally just looking at the wiki and epydoc pages. Still unclear: why doesn't the SeqRecord constructor take annotations as an optional argument? Should it?
Thanks, Eric From biopython at maubp.freeserve.co.uk Thu Jun 18 20:52:26 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jun 2009 21:52:26 +0100 Subject: [Biopython-dev] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <3f6baf360906181322n5981a66eobd05b2692965504a@mail.gmail.com> References: <3f6baf360906151004m4ee3595g7ac007fd218e370f@mail.gmail.com> <20090617124101.GH44321@sobchak.mgh.harvard.edu> <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> <320fb6e00906180235k236a5538pf47ff0abae669180@mail.gmail.com> <3f6baf360906181322n5981a66eobd05b2692965504a@mail.gmail.com> Message-ID: <320fb6e00906181352o6baf0c92u1d8b1f493cd7e4d7@mail.gmail.com> On Thu, Jun 18, 2009 at 9:22 PM, Eric Talevich wrote: > > The new SeqRecord chapter is very informative -- I was originally just > looking at the wiki and epydoc pages. I can't take all the credit - a good chunk of it was reused from the "Advanced" chapter. Do let me know if you spot any typos. Did you think it makes sense to have this before the SeqIO chapter (as it is now), or afterwards? Right now the SeqRecord chapter uses SeqIO in order to load some complex records to show how they are represented - so you could put them in either order. > Still unclear: why doesn't the SeqRecord constructor take annotations > as an optional argument? Should it? I don't know why it doesn't (it's a historical design choice before my time), do you think it would actually be more useful? Maybe Brad can comment? P.S.
See also related Bug 2841 http://bugzilla.open-bio.org/show_bug.cgi?id=2841 Peter From eric.talevich at gmail.com Thu Jun 18 21:49:47 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 18 Jun 2009 17:49:47 -0400 Subject: [Biopython-dev] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <320fb6e00906181352o6baf0c92u1d8b1f493cd7e4d7@mail.gmail.com> References: <3f6baf360906151004m4ee3595g7ac007fd218e370f@mail.gmail.com> <20090617124101.GH44321@sobchak.mgh.harvard.edu> <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> <320fb6e00906180235k236a5538pf47ff0abae669180@mail.gmail.com> <3f6baf360906181322n5981a66eobd05b2692965504a@mail.gmail.com> <320fb6e00906181352o6baf0c92u1d8b1f493cd7e4d7@mail.gmail.com> Message-ID: <3f6baf360906181449t1b0b9b85u325b2ac997ed05b8@mail.gmail.com> On Thu, Jun 18, 2009 at 4:52 PM, Peter wrote: > On Thu, Jun 18, 2009 at 9:22 PM, Eric Talevich > wrote: > > > > The new SeqRecord chapter is very informative -- I was originally just > > looking at the wiki and epydoc pages. > > I can't take all the credit - a good chunk of it was reused from the > "Advanced" > chapter. Do let me know if you spot any typos. Did you think it makes sense > to have this before the SeqIO chapter (as it is now), or afterwards? Right > now the SeqRecord chapter uses SeqIO in order to load some complex > records to show how they are represented - so you could put them in either > order. > I didn't notice any typos other than Python being consistently lowercase, which I assume is how the author likes it. The ordering is good -- the SeqIO chapter makes more advanced use of sequences of SeqRecord objects, so it's good to be familiar with the basic objects first. In general, I like the organization of covering fundamental types first, then moving on to larger collections, rather than covering the majority of a big collection in one shot and leaving the tricky parts unaddressed. 
A quick discussion of SeqFeature objects at the end of the SeqRecord chapter instead of in the "Advanced" chapter would be nice, since apparently it's easy to disregard that last section as an appendix of less important material (I guess I did originally). In that final section it seems like SeqFeature is not meant to be handled by mere mortals, mainly because of the fuzzy positions -- maybe integrating it a little more comfortably into other modules like PDB would help with that. I pushed some work to github, involving the PhyloXML<->SeqRecord translation: http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML -Eric From biopython at maubp.freeserve.co.uk Thu Jun 18 22:05:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jun 2009 23:05:29 +0100 Subject: [Biopython-dev] SeqRecord in Tutorial (was GSoC and PhyloXML) Message-ID: <320fb6e00906181505r32d7df31w7287bdb69753c49f@mail.gmail.com> On Thu, Jun 18, 2009 at 10:49 PM, Eric Talevich wrote: > > I didn't notice any typos other than Python being consistently lowercase, > which I assume is how the author likes it. I was aiming for consistency, with no strong preference - at the time there were more uses of "python" than "Python" so I picked that. We can change it easily enough - does anyone care either way? > The ordering is good -- the SeqIO chapter makes more advanced use > of sequences of SeqRecord objects, so it's good to be familiar with the > basic objects first. In general, I like the organization of covering > fundamental types first, then moving on to larger collections, rather > than covering the majority of a big collection in one shot and leaving > the tricky parts unaddressed. There is a case for leaving messy corner cases to the end, as long as the main chapters cover the core.
> A quick discussion of SeqFeature objects at the end of the SeqRecord > chapter instead of in the "Advanced" chapter would be nice, since > apparently it's easy to disregard that last section as an appendix of > less important material (I guess I did originally). Yes, the SeqFeature stuff did originally risk being ignored when it was just part of the "Advanced" chapter near the end. I think it made sense to move it to the new SeqRecord chapter (and there is still room for improvement - I'm thinking of going over one of the features in the example GenBank file in more detail). > In that final section it seems like SeqFeature is not meant to be > handled by mere mortals, mainly because of the fuzzy positions The fuzzy locations are by their nature really horrible to code with. Doing something with SeqFeature objects and locations has been discussed on the mailing list in the last month or so (in connection with GFF files). I hope to have a good chat about this with Brad in person at the BOSC hackathon. > -- maybe integrating it a little more comfortably into other > modules like PDB would help with that. I don't see how SeqFeature objects and their FeatureLocations related to PDB. Could you elaborate? Peter From eric.talevich at gmail.com Fri Jun 19 01:57:55 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 18 Jun 2009 21:57:55 -0400 Subject: [Biopython-dev] SeqRecord in Tutorial (was GSoC and PhyloXML) In-Reply-To: <320fb6e00906181505r32d7df31w7287bdb69753c49f@mail.gmail.com> References: <320fb6e00906181505r32d7df31w7287bdb69753c49f@mail.gmail.com> Message-ID: <3f6baf360906181857h43686cfawb850266eb5153af8@mail.gmail.com> On Thu, Jun 18, 2009 at 6:05 PM, Peter wrote: > On Thu, Jun 18, 2009 at 10:49 PM, Eric Talevich > wrote: > > > > I didn't notice any typos other than Python being consistently lowercase, > > which I assume is how the author likes it. 
> > I was aiming for consistency, with no strong preference - at the time > there were more uses of "python" than "Python" so I picked that. We > can change it easily enough - does anyone care either way? > Python.org capitalizes it. Shrug. > The ordering is good -- the SeqIO chapter makes more advanced use > > of sequences of SeqRecord objects, so it's good to be familiar with the > > basic objects first. In general, I like the organization of covering > > fundamental types first, then moving on to larger collections, rather > > than covering the majority of a big collection in one shot and leaving > > the tricky parts unaddressed. > > There is a case for leaving messy corner cases to the end, as long as > the main chapters cover the core. > Agreed. In the SeqRecord chapter, I was looking for a paragraph or so on what sort of information goes into a SeqFeature to see whether it would be a suitable stand-in for PhyloXML's DomainArchitecture. From the initial description I wasn't sure if annotations or letter_annotations would be more appropriate, and the other mentionings are basically "here be dragons"... which is true, but a quick example would be helpful. The GenBank parsing section would be a good place for that. > -- maybe integrating it a little more comfortably into other > > modules like PDB would help with that. > > I don't see how SeqFeature objects and their FeatureLocations > related to PDB. Could you elaborate? > If secondary structure or miscellaneous information is listed in the PDB header, then parse_pdb_header could produce SeqFeatures from that. Right now it doesn't build any Biopython objects at all. 
-Eric From biopython at maubp.freeserve.co.uk Fri Jun 19 09:18:40 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 19 Jun 2009 10:18:40 +0100 Subject: [Biopython-dev] SeqRecord in Tutorial (was GSoC and PhyloXML) In-Reply-To: <3f6baf360906181857h43686cfawb850266eb5153af8@mail.gmail.com> References: <320fb6e00906181505r32d7df31w7287bdb69753c49f@mail.gmail.com> <3f6baf360906181857h43686cfawb850266eb5153af8@mail.gmail.com> Message-ID: <320fb6e00906190218m4619f624ncfc9d1f8bccbbad2@mail.gmail.com> On Fri, Jun 19, 2009 at 2:57 AM, Eric Talevich wrote: > On Thu, Jun 18, 2009 at 6:05 PM, Peter > wrote: >> >> On Thu, Jun 18, 2009 at 10:49 PM, Eric Talevich >> wrote: >> > >> > I didn't notice any typos other than Python being consistently >> > lowercase, >> > which I assume is how the author likes it. >> >> I was aiming for consistency, with no strong preference - at the time >> there were more uses of "python" than "Python" so I picked that. We >> can change it easily enough - does anyone care either way? > > Python.org capitalizes it. Shrug. Maybe we should use "Python" then. >> > The ordering is good -- the SeqIO chapter makes more advanced use >> > of sequences of SeqRecord objects, so it's good to be familiar with the >> > basic objects first. In general, I like the organization of covering >> > fundamental types first, then moving on to larger collections, rather >> > than covering the majority of a big collection in one shot and leaving >> > the tricky parts unaddressed. >> >> There is a case for leaving messy corner cases to the end, as long as >> the main chapters cover the core. > > Agreed. In the SeqRecord chapter, I was looking for a paragraph or so on > what sort of information goes into a SeqFeature to see whether it would be a > suitable stand-in for PhyloXML's DomainArchitecture. 
From the initial > description I wasn't sure if annotations or letter_annotations would be more > appropriate, and the other mentionings are basically "here be dragons"... > which is true, but a quick example would be helpful. The GenBank parsing > section would be a good place for that. OK - that is useful feedback. I will try and clarify that, but in essence: * letter_annotations - where you have a bit of information for each letter (i.e. amino acid or nucleotide) in the sequence, such as a list of quality scores or secondary structure predictions. * features - where you have annotation associated with a particular region of the sequence (e.g. a gene) * annotations - things that apply to the whole sequence like organism There are some odd cases, like the GenBank source feature, which covers the whole of the sequence but is listed in the feature table just like a gene etc (you'd have to ask the NCBI why they did it this way). In Biopython, these source features get stored as a SeqFeature for consistency with the rest of the GenBank feature table entries. Another odd one is any references, which in GenBank files may apply to a particular region of the sequence (but in normal usage seem to apply to the whole thing). These get stored separately in BioSQL, which to me makes sense. At the moment in the SeqRecord they are stored in the annotations dictionary (as a list of reference objects under the key "references"). I've been thinking about upgrading this to a new SeqRecord property (a list of reference objects) but as I have never actually needed to access this information it hasn't been a high priority. >> > -- maybe integrating it a little more comfortably into other >> > modules like PDB would help with that. >> >> I don't see how SeqFeature objects and their FeatureLocations >> related to PDB. Could you elaborate? > > If secondary structure or miscellaneous information is listed in the > PDB header, then parse_pdb_header could produce SeqFeatures > from that. 
Right now it doesn't build any Biopython objects at all. I see. Yes, the header parsing in Bio.PDB is very limited at the moment, and even sticking to well defined line types (and ignoring many or most of the REMARK lines) there is room for improvement. For the secondary structure, this is given as a string with one letter for each residue - I see this as a more natural match to SeqRecord letter_annotations rather than a SeqFeature, but giving a list of SeqFeatures for the helices, beta sheets, coils etc would also work. Of course, you might also want a Seq object to relate them to (to give the locations meaning). One idea I have toyed with is a Bio.SeqIO parser for PDB files, which would focus on the sequence information in the headers (and probably ignore the ATOM lines completely). I would like to keep the core of Biopython independent of NumPy (and I see Bio.SeqIO as part of the core), so this wouldn't depend on Bio.PDB. I'm not sure this idea would actually be useful so haven't worked on it. Peter From chapmanb at 50mail.com Fri Jun 19 12:25:49 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 19 Jun 2009 08:25:49 -0400 Subject: [Biopython-dev] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <320fb6e00906181352o6baf0c92u1d8b1f493cd7e4d7@mail.gmail.com> References: <3f6baf360906151004m4ee3595g7ac007fd218e370f@mail.gmail.com> <20090617124101.GH44321@sobchak.mgh.harvard.edu> <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> <320fb6e00906180235k236a5538pf47ff0abae669180@mail.gmail.com> <3f6baf360906181322n5981a66eobd05b2692965504a@mail.gmail.com> <320fb6e00906181352o6baf0c92u1d8b1f493cd7e4d7@mail.gmail.com> Message-ID: <20090619122549.GD64233@sobchak.mgh.harvard.edu> Eric and Peter; > > Still unclear: why doesn't the SeqRecord constructor take annotations > > as an optional argument? Should it? > > I don't know why it doesn't (its a historical design choice before by time), > do you think it would actually be more useful? 
Maybe Brad can comment? > > P.S. See also related Bug 2841 > http://bugzilla.open-bio.org/show_bug.cgi?id=2841 My recollection of the history here is hazy, but based on the code comments we were probably running into this problem without realizing it: http://effbot.org/pyfaq/why-are-default-values-shared-between-objects.htm It should be easy enough to allow passing in annotations and letter_annotations by setting the function defaults to None and doing the if annotations is None: annotations = {} trick. My vote is for adding this. Brad From biopython at maubp.freeserve.co.uk Fri Jun 19 12:40:08 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 19 Jun 2009 13:40:08 +0100 Subject: [Biopython-dev] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <20090619122549.GD64233@sobchak.mgh.harvard.edu> References: <3f6baf360906151004m4ee3595g7ac007fd218e370f@mail.gmail.com> <20090617124101.GH44321@sobchak.mgh.harvard.edu> <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> <320fb6e00906180235k236a5538pf47ff0abae669180@mail.gmail.com> <3f6baf360906181322n5981a66eobd05b2692965504a@mail.gmail.com> <320fb6e00906181352o6baf0c92u1d8b1f493cd7e4d7@mail.gmail.com> <20090619122549.GD64233@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00906190540p49b98ad5se162b7be6b600411@mail.gmail.com> On Fri, Jun 19, 2009 at 1:25 PM, Brad Chapman wrote: > Eric and Peter; > >> > Still unclear: why doesn't the SeqRecord constructor take annotations >> > as an optional argument? Should it? >> >> I don't know why it doesn't (its a historical design choice before my time), >> do you think it would actually be more useful? Maybe Brad can comment? >> >> P.S. 
See also related Bug 2841 >> http://bugzilla.open-bio.org/show_bug.cgi?id=2841 > > My recollection of the history here is hazy, but based on the code comments > we were probably running into this problem without realizing it: > > http://effbot.org/pyfaq/why-are-default-values-shared-between-objects.htm > > It should be easy enough to allow passing in annotations and > letter_annotations by setting the function defaults to None and doing the > if annotations is None: annotations = {} trick. > > My vote is for adding this. That thought did occur to me last night - having fallen over the same problem myself once before. Would you like to update the SeqFeature along those lines then? Peter From bugzilla-daemon at portal.open-bio.org Fri Jun 19 12:43:55 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 19 Jun 2009 08:43:55 -0400 Subject: [Biopython-dev] [Bug 2841] SeqFeature constructor ignores qualifiers and sub_features arguments In-Reply-To: Message-ID: <200906191243.n5JChtaM029245@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2841 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-19 08:43 EST ------- See http://lists.open-bio.org/pipermail/biopython-dev/2009-June/006241.html where Brad wrote: > My recollection of the history here is hazy, but based on the > code comments we were probably running into this problem without > realizing it: > > http://effbot.org/pyfaq/why-are-default-values-shared-between-objects.htm > > It should be easy enough to allow passing in annotations and > letter_annotations by setting the function defaults to None > and doing the if annotations is None: annotations = {} trick. > > My vote is for adding this. I agree that would explain the comments, and fix makes sense. 
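For anyone who has not met the gotcha Brad links to, here is a minimal sketch in plain Python. Both classes are hypothetical stand-ins, not the real SeqRecord or SeqFeature. They only illustrate why a mutable default value ends up shared between instances, and how the "default to None" trick avoids that.

```python
class SharedDefaultRecord:
    """Hypothetical stand-in (not SeqRecord): mutable default argument."""

    def __init__(self, seq, annotations={}):
        # The {} above is created once, when the function is defined,
        # so every call that omits 'annotations' receives the same dict.
        self.seq = seq
        self.annotations = annotations


class SafeDefaultRecord:
    """Hypothetical stand-in (not SeqRecord): the 'is None' trick."""

    def __init__(self, seq, annotations=None):
        if annotations is None:
            annotations = {}  # a fresh dict for each instance
        self.seq = seq
        self.annotations = annotations


# With the mutable default, annotating one record leaks into the other.
a = SharedDefaultRecord("ACGT")
b = SharedDefaultRecord("TTGA")
a.annotations["organism"] = "E. coli"
shared_leak = "organism" in b.annotations

# With the None default, each record keeps its own dictionary.
c = SafeDefaultRecord("ACGT")
d = SafeDefaultRecord("TTGA")
c.annotations["organism"] = "E. coli"
safe_leak = "organism" in d.annotations

print(shared_leak, safe_leak)
```

Presumably the old code ignored the arguments altogether to sidestep this sharing. The None default allows the arguments to be honoured without reintroducing it.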
Note that if we want to allow letter_annotations to be set, we shouldn't blindly apply the supplied value (which may be a plain dictionary, and contain inappropriate data) but make sure it is turned into a restricted dictionary. Ideally this should be covered with a new unit test...

Peter

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Fri Jun 19 12:48:39 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 19 Jun 2009 08:48:39 -0400
Subject: [Biopython-dev] [Bug 2860] New: Writing GenBank files should output features in position order
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2860

Summary: Writing GenBank files should output features in position order
Product: Biopython
Version: 1.50b
Platform: All
OS/Version: Linux
Status: NEW
Severity: minor
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: n.j.loman at bham.ac.uk

Adding features to a SeqRecord object does not automatically sort them by position. Therefore if you do something like this:

    import sys
    from copy import copy
    from operator import attrgetter  # needed for the sort shown further down

    from Bio import SeqIO

    for rec in SeqIO.parse(sys.stdin, "genbank"):
        new_features = []
        for feature in rec.features:
            if feature.type == 'CDS':
                # make a 'gene' feature from a copy of each CDS
                gene_feature = copy(feature)
                gene_feature.type = 'gene'
                new_features.append(gene_feature)
        # the new gene features all land after the existing features
        rec.features.extend(new_features)
        SeqIO.write([rec], sys.stdout, "genbank")

You will end up with an incorrectly sorted file with CDS features first, then gene features. You can sort rec.features in-place to correct this:

    rec.features.sort(key=attrgetter('location'))

I am not sure what the correct fix is in terms of Biopython: whether it should concentrate on changing the behaviour of SeqRecord.features, or on the GenBank output code (which I am aware is a work in progress).
I guess the answer to this is should BioPython guarantee Seqrecord.features to be sorted? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jun 19 12:52:15 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 19 Jun 2009 08:52:15 -0400 Subject: [Biopython-dev] [Bug 2841] SeqFeature constructor ignores qualifiers and sub_features arguments In-Reply-To: Message-ID: <200906191252.n5JCqFoX030054@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2841 ------- Comment #3 from n.j.loman at bham.ac.uk 2009-06-19 08:52 EST ------- Good stuff, a useful Python gotcha I had not encountered yet. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jun 19 13:11:17 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 19 Jun 2009 09:11:17 -0400 Subject: [Biopython-dev] [Bug 2860] Writing GenBank files should output features in position order In-Reply-To: Message-ID: <200906191311.n5JDBHgA031607@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2860 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-19 09:11 EST ------- Hi Nick, I understand your request, but I am not sure if it is a bug. Do you know if the GenBank file format say anything about the order of the features? And does this actually matter? I know from experience that the NCBI GenBank files do seem to be sorted by position (and then it seems to be gene before CDS and some other tie break rules which I have not explored). 
Arguably this should be left to the user - a slightly different version of your script could avoid the issue, something like this (untested):

    import sys
    from copy import copy

    from Bio import SeqIO

    for rec in SeqIO.parse(sys.stdin, "genbank"):
        new_features = []
        for feature in rec.features:
            if feature.type == 'CDS':
                # add the new 'gene' feature just before its CDS
                gene_feature = copy(feature)
                gene_feature.type = 'gene'
                new_features.append(gene_feature)
            # every original feature is kept, in its original order
            new_features.append(feature)
        rec.features = new_features
        SeqIO.write([rec], sys.stdout, "genbank")

Peter

P.S. Your example may produce odd features, as gene features don't normally include a protein id or a translation while a CDS feature may. Again, Biopython doesn't currently try to limit this - or indeed limit the feature types to a white list.

From bugzilla-daemon at portal.open-bio.org Fri Jun 19 13:19:07 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 19 Jun 2009 09:19:07 -0400
Subject: [Biopython-dev] [Bug 2860] Writing GenBank files should output features in position order
In-Reply-To:
Message-ID: <200906191319.n5JDJ7jv032414@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2860

------- Comment #2 from n.j.loman at bham.ac.uk 2009-06-19 09:19 EST -------
Well yes, I realise that this is only a standard by convention. My assumption when dealing with such matters is that if NCBI/GenBank does it, it is probably right.

My impression is that "source" is always the first qualifier, then it is sorted by location with "gene" features followed by "CDS" features by convention.

I guess it is acceptable for the user to deal with the order in the absence of a published standard for GenBank files. But I think it would be equally acceptable to code the GenBank outputter to enforce those rules.
(I know that script fragment would give weird output, it was just illustrative of the position issue.) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jun 19 13:31:10 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 19 Jun 2009 09:31:10 -0400 Subject: [Biopython-dev] [Bug 2860] Writing GenBank files should output features in position order In-Reply-To: Message-ID: <200906191331.n5JDVAY4001158@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2860 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-19 09:31 EST ------- (In reply to comment #2) > Well yes, I realise that this is only a standard by convention. My assumption > when dealing with such matters is that if NCBI/GenBank does it, it is probably > right. On the other hand, the NCBI are generally very good at defining their file formats, so *if* they don't specify an order, presumably anything is OK? > My impression is that "source" is always the first qualifier, then it is > sorted by location with "gene" features followed by "CDS" features by > convention. Since the "source" feature starts from the first base, it will always be one of the first by location. > I guess it is acceptable for the user to deal with the order in the absence > of a published standard for GenBank files. But I think it would be equally > acceptable to code the GenBank outputter to enforce those rules. We would need to know the rules though. For example, which of these locations is first: "10..20" or "<10..20" or ">10..20" or "one-of(10,12)..20" or are they all tied? We would also need to know the tie break rules for the feature type, not just "source" before "gene" before "CDS". What about "tRNA" etc. 
Given we don't currently know the rules, we could only implement a best guess. If the order we write out is very clear is just the order of the SeqFeature objects in the list (as now) the behaviour is clearly defined. This is my preference as it gives the user full control (and full responsibility). If we did sort things, there would be no easy way to override this sorting. > (I know that script fragment would give weird output, it was just > illustrative of the position issue.) OK Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jun 19 13:37:29 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 19 Jun 2009 09:37:29 -0400 Subject: [Biopython-dev] [Bug 2860] Writing GenBank files should output features in position order In-Reply-To: Message-ID: <200906191337.n5JDbT19001895@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2860 ------- Comment #4 from n.j.loman at bham.ac.uk 2009-06-19 09:37 EST ------- Fair enough, I guess this is a pretty strong argument to leave the Genbank output as it is. Is there any argument to add a method to SeqRecord like "insert_by_location()". This might make the use of SeqFeature more transparent to new users too who may not be clear that they can write to the features list directly. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
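The insert_by_location() idea floated above can be sketched on the user side without touching SeqRecord. This is only an illustration: the Feature tuple is a hypothetical stand-in for SeqFeature, the helper is not an existing Biopython method, and the tie-break (a new feature goes after existing features with the same start and end) is an arbitrary choice, since the thread notes that the real GenBank ordering rules are not known.

```python
import bisect
from collections import namedtuple

# Hypothetical stand-in for a SeqFeature with plain integer positions.
Feature = namedtuple("Feature", ["type", "start", "end"])


def insert_by_location(features, new_feature):
    """Insert new_feature into a list already sorted by (start, end).

    Ties are placed after the existing entries; this is an assumption,
    not a GenBank rule.
    """
    keys = [(f.start, f.end) for f in features]
    index = bisect.bisect_right(keys, (new_feature.start, new_feature.end))
    features.insert(index, new_feature)


features = [
    Feature("source", 0, 1000),
    Feature("CDS", 100, 400),
    Feature("CDS", 500, 900),
]
insert_by_location(features, Feature("gene", 500, 900))
insert_by_location(features, Feature("tRNA", 450, 480))

order = [(f.type, f.start) for f in features]
print(order)
```

Note that the "gene" at 500 ends up after the "CDS" at 500, the opposite of what NCBI files appear to do. Any such helper has to pick a feature-type tie-break, and fuzzy locations such as "<10..20" would need a rule of their own.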
From bugzilla-daemon at portal.open-bio.org Fri Jun 19 13:44:24 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 19 Jun 2009 09:44:24 -0400 Subject: [Biopython-dev] [Bug 2860] Writing GenBank files should output features in position order In-Reply-To: Message-ID: <200906191344.n5JDiOEV002698@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2860 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |WONTFIX ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-19 09:44 EST ------- (In reply to comment #4) > Fair enough, I guess this is a pretty strong argument to leave the Genbank > output as it is. OK, marking this as "Won't Fix" (unless anyone can find a definitive definition of how GenBank features should be sorted). > Is there any argument to add a method to SeqRecord like > "insert_by_location()". This might make the use of SeqFeature more > transparent to new users too who may not be clear that they can write > to the features list directly. I see where you are going with that idea, but for now rather than adding more code I think we should add more documentation. The new SeqRecord (and SeqFeature) chapter in the Tutorial should help here. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Fri Jun 19 14:40:40 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 19 Jun 2009 15:40:40 +0100 Subject: [Biopython-dev] Next release plans? Message-ID: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> Hi all, We're made some good progress since Biopython 1.50, and I think it may already be time to plan another release. 
There was some important "housekeeping" which we will need to flag in the release notes (and perhaps email Linux packagers about explicitly), namely we don't support Python 2.3 any more, don't install Martel any more, and don't depend on mxTextTools any more.

There is also some new stuff. I think the following still need more testing, so I'm wondering about another beta release... maybe before BOSC 2009 as we can then use it in the tutorial sessions?

(1) Writing GenBank files with features. I think this copes with all the ambiguous and complex locations, which turned out to be a big job, but some independent testing would be wise.

(2) Supporting the new Illumina FASTQ file format. This still needs some good tests (ideally with some real data).

(3) Application wrappers (especially Cymon's new alignment wrappers).

We have basic documentation for these in the Tutorial now, plus a new chapter on the SeqRecord object. GenomeDiagram would also benefit from further documentation.

Any thoughts? Does rolling this out early next week sound sensible or simply too ambitious?

Peter

From tiagoantao at gmail.com Fri Jun 19 14:44:25 2009
From: tiagoantao at gmail.com (Tiago Antão)
Date: Fri, 19 Jun 2009 15:44:25 +0100
Subject: [Biopython-dev] Next release plans?
In-Reply-To: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com>
References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com>
Message-ID: <6d941f120906190744y3b260c50mf414311c84ad128d@mail.gmail.com>

On Fri, Jun 19, 2009 at 3:40 PM, Peter wrote:
> Any thoughts? Does rolling this out early next week sound sensible
> or simply too ambitious?

I won't have time to roll my genepop code with statistics calculations in that timeframe, no chance. I will leave it for 1.52.

From biopython at maubp.freeserve.co.uk Fri Jun 19 14:57:23 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 19 Jun 2009 15:57:23 +0100
Subject: [Biopython-dev] Next release plans?
In-Reply-To: <6d941f120906190744y3b260c50mf414311c84ad128d@mail.gmail.com>
References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> <6d941f120906190744y3b260c50mf414311c84ad128d@mail.gmail.com>
Message-ID: <320fb6e00906190757q275b26f4p4bc6933ad99e9a49@mail.gmail.com>

2009/6/19 Tiago Antão:
> On Fri, Jun 19, 2009 at 3:40 PM, Peter wrote:
>> Any thoughts? Does rolling this out early next week sound sensible
>> or simply too ambitious?
>
> I wont have time to roll my genepop code with statistics calculations
> in that timeframe, no chance. I will leave it for 1.52.

We don't have to rush this - it just seemed worth having a quite fresh release available for us to work from in the tutorial session at BOSC, and also as a reference point for the coding BoF session.

If we do do Biopython 1.51 beta next week, with Biopython 1.51 final in July, then I would provisionally expect to do Biopython 1.52 in the Autumn. This could be brought forward if a large and useful contribution was added. Let's chat about the genepop code at BOSC?

Peter

From eric.talevich at gmail.com Fri Jun 19 15:15:10 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 19 Jun 2009 11:15:10 -0400
Subject: [Biopython-dev] Next release plans?
In-Reply-To: <320fb6e00906190757q275b26f4p4bc6933ad99e9a49@mail.gmail.com>
References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> <6d941f120906190744y3b260c50mf414311c84ad128d@mail.gmail.com> <320fb6e00906190757q275b26f4p4bc6933ad99e9a49@mail.gmail.com>
Message-ID: <3f6baf360906190815j3c8924f2n73a2e050b6489389@mail.gmail.com>

2009/6/19 Peter
> If we do do Biopython 1.51 beta next week, with Biopython 1.51 final in
> July, then I would provisionally expect to do Biopython 1.52 in the Autumn.
> This could be brought forward if a large and useful contribution was added.
> Let's chat about the genepop code at BOSC?
> The Google Summer of Code projects are supposed to be production-ready by August 17, according to the timeline: http://socghop.appspot.com/document/show/program/google/gsoc2009/timeline There's also a mentor summit in October. If the two phylogenetics projects land and another release gets rolled by then, a stable release featuring our new code could be presented at the summit, which would be hot. -Eric From eric.talevich at gmail.com Fri Jun 19 16:03:46 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 19 Jun 2009 12:03:46 -0400 Subject: [Biopython-dev] SeqRecord in Tutorial (was GSoC and PhyloXML) In-Reply-To: <320fb6e00906190218m4619f624ncfc9d1f8bccbbad2@mail.gmail.com> References: <320fb6e00906181505r32d7df31w7287bdb69753c49f@mail.gmail.com> <3f6baf360906181857h43686cfawb850266eb5153af8@mail.gmail.com> <320fb6e00906190218m4619f624ncfc9d1f8bccbbad2@mail.gmail.com> Message-ID: <3f6baf360906190903ke04b158oa2045558ef9fc9f6@mail.gmail.com> On Fri, Jun 19, 2009 at 5:18 AM, Peter wrote: > > > > OK - that is useful feedback. I will try and clarify that, but in essence: > > * letter_annotations - where you have a bit of information for each letter > (i.e. amino acid or nucleotide) in the sequence, such as a list of quality > scores or secondary structure predictions. > * features - where you have annotation associated with a particular > region of the sequence (e.g. a gene) > * annotations - things that apply to the whole sequence like organism Thanks. Another odd one is any references, which in GenBank files may apply > to a particular region of the sequence (but in normal usage seem to > apply to the whole thing). These get stored separately in BioSQL, which > to me makes sense. At the moment in the SeqRecord they are stored > in the annotations dictionary (as a list of reference objects under the > key "references"). 
I've been thinking about upgrading this to a new > SeqRecord property (a list of reference objects) but as I have never > actually needed to access this information it hasn't been a high priority. > Good to know. I'll be careful with SeqRecord.features['references'] for now. > > > > If secondary structure or miscellaneous information is listed in the > > PDB header, then parse_pdb_header could produce SeqFeatures > > from that. Right now it doesn't build any Biopython objects at all. > > I see. Yes, the header parsing in Bio.PDB is very limited at the > moment, and even sticking to well defined line types (and ignoring > many or most of the REMARK lines) there is room for improvement. > > For the secondary structure, this is given as a string with one letter > for each residue - I see this as a more natural match to SeqRecord > letter_annotations rather than a SeqFeature, but giving a list of > SeqFeatures for the helices, beta sheets, coils etc would also work. > Of course, you might also want a Seq object to relate them to (to > give the locations meaning). > > One idea I have toyed with is a Bio.SeqIO parser for PDB files, which > would focus on the sequence information in the headers (and probably > ignore the ATOM lines completely). I would like to keep the core of > Biopython independent of NumPy (and I see Bio.SeqIO as part of the > core), so this wouldn't depend on Bio.PDB. I'm not sure this idea > would actually be useful so haven't worked on it. > > I'll have a real use for this in the fall, once GSoC is done. It would be nice to link a set of parsed PDB objects to a multiple alignment of protein sequences, but I think I'd always want to have the 3D structure information close at hand. The other use case I've mentioned before is to verify and fix existing PDB files from Biopython, rather than manually -- 3D coordinates would probably be useful here, too, for checking collisions and such. 
Eventually I'll resurrect my pdbtidy branch and make the parser emit a SeqRecord or whatever's most appropriate. -Eric From idoerg at gmail.com Fri Jun 19 16:06:13 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Fri, 19 Jun 2009 09:06:13 -0700 Subject: [Biopython-dev] CONTIG records in GenBank files In-Reply-To: <320fb6e00906180710ncbf3346rc662853c2e09f71e@mail.gmail.com> References: <320fb6e00906180710ncbf3346rc662853c2e09f71e@mail.gmail.com> Message-ID: Not yet, been traveling. Sorry. Will try to get to it next week (plenty of time on the plane to Stockholm) Iddo Friedberg, Ph.D. http://iddo-friedberg.net/contact.html On Jun 18, 2009 7:10 AM, "Peter" wrote: A couple of weeks ago, we were talking about a problem with CONTIG lines in GenBank files, http://lists.open-bio.org/pipermail/biopython-dev/2009-June/006192.html On Sun, Jun 7, 2009 at 10:31 PM, Peter wrote: > On 6/7/09, Iddo Friedberg wrote: >> Here is the stack dump, coming from the file: >> >> ftp://ftp.ncbi.nih.gov/genbank/gbcon11.seq.gz >> ... >> Traceback (most recent call last): >> ... >> Bio.GenBank.LocationParserError: >> ... > > That looks like Bug 2745 to me - does the patch on that bug work for > you, and would you be happy storing the CONTIG line as string? Iddo, Did you have a chance to try the patch on Bug 2745 yet? http://bugzilla.open-bio.org/show_bug.cgi?id=2745 If you are happy with the proposed solution, I'd like to get that checked in... Thanks, Peter From mjldehoon at yahoo.com Fri Jun 19 16:04:46 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 19 Jun 2009 09:04:46 -0700 (PDT) Subject: [Biopython-dev] Next release plans? Message-ID: <870693.80287.qm@web62405.mail.re1.yahoo.com> --- On Fri, 6/19/09, Peter wrote: > Any thoughts? Does rolling this out early next week sound > sensible or simply too ambitious? > Sounds good to me. I may check in some Bio.Blast stuff, but nothing dramatic. 
I also think that we can release 1.51 without a beta release first; see our previous discussion here: http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005825.html --Michiel. From biopython at maubp.freeserve.co.uk Fri Jun 19 16:19:10 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 19 Jun 2009 17:19:10 +0100 Subject: [Biopython-dev] Next release plans? In-Reply-To: <870693.80287.qm@web62405.mail.re1.yahoo.com> References: <870693.80287.qm@web62405.mail.re1.yahoo.com> Message-ID: <320fb6e00906190919t47235db0od69a6b7943d180d@mail.gmail.com> On Fri, Jun 19, 2009 at 5:04 PM, Michiel de Hoon wrote: > > > --- On Fri, 6/19/09, Peter wrote: >> Any thoughts? Does rolling this out early next week sound >> sensible or simply too ambitious? > > Sounds good to me. I may check in some Bio.Blast stuff, but nothing > dramatic. I'll keep that in mind - when were you thinking of doing that? I was thinking of doing the release on Monday or Tuesday (as I travel on the Friday, and have to prepare my slides etc). > I also think that we can release 1.51 without a beta release first; > see our previous discussion here: > > http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005825.html Yeah - but this is a bit of a special case. Given we will be having a tutorial session and a hackathon session at BOSC, having people using a Biopython 1.51 beta release should ensure some useful feedback (even if it is just documentation improvements) which would make Biopython 1.51 that much better. I am also a *little* nervous that there could be something amiss in the new GenBank feature location writing (despite the test coverage), and doing the beta release first would put my mind more at ease on this area. 
Peter From biopython at maubp.freeserve.co.uk Fri Jun 19 16:30:11 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 19 Jun 2009 17:30:11 +0100 Subject: [Biopython-dev] "Your XML file did not start with References: <7265d4f0906140823r9979362y7b1633447e13292f@mail.gmail.com> <320fb6e00906141116l4cf9a9d5u733497d3b02e1b6a@mail.gmail.com> <7265d4f0906141126l2f00fecehaa28273af9b3681a@mail.gmail.com> <7265d4f0906150202j3daeefa9we304cf29c4f6cd6d@mail.gmail.com> Message-ID: <320fb6e00906190930n1f614ee0v4fcce8362ccddde1@mail.gmail.com> On Mon, Jun 15, 2009 at 10:02 AM, Cymon Cox wrote: >>> At first glance, something based on your change looks sensible. >>> Next time I spot the unit test failing I'll try and reproduce this. >> >> It's pretty hit or miss: I would guess once in every 10+ times I ran the >> test_NCBI_qblast I would encounter the problem. > > I can be a little more specific; out 742 calls to qblast, 75 returned the > "Your XML" error. (This was with a different ISP.) I haven't had this fail on me personally, and so have not tested the fix directly. However, on another thread Peter Saffrey reported using your patch was helpful, so I have committed the fix to CVS (without the extra print statements). As usual, let me know if I've mangled something with my editing ;) Thanks, Peter From tiagoantao at gmail.com Fri Jun 19 19:59:26 2009 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Fri, 19 Jun 2009 20:59:26 +0100 Subject: [Biopython-dev] Next release plans? In-Reply-To: <320fb6e00906190757q275b26f4p4bc6933ad99e9a49@mail.gmail.com> References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> <6d941f120906190744y3b260c50mf414311c84ad128d@mail.gmail.com> <320fb6e00906190757q275b26f4p4bc6933ad99e9a49@mail.gmail.com> Message-ID: <6d941f120906191259h62890450o6231bd4362f51ee4@mail.gmail.com> 2009/6/19 Peter : > Let's chat about the genepop code at BOSC? That was actually my plan. 
I am working to take a functional version with me. Mainly for code review and documentation purposes. With some way to compute statistics this will be quite competitive with other Bio projects and I will start to announce this in places like the evoldir mailing list. As finally it can serve a large group of users... -- "A man who dares to waste one hour of time has not discovered the value of life" - Charles Darwin From chapmanb at 50mail.com Fri Jun 19 22:39:51 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 19 Jun 2009 18:39:51 -0400 Subject: [Biopython-dev] Next release plans? In-Reply-To: <320fb6e00906190757q275b26f4p4bc6933ad99e9a49@mail.gmail.com> References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> <6d941f120906190744y3b260c50mf414311c84ad128d@mail.gmail.com> <320fb6e00906190757q275b26f4p4bc6933ad99e9a49@mail.gmail.com> Message-ID: <20090619223951.GA5133@sobchak.mgh.harvard.edu> Hi Peter; > >> Any thoughts? Does rolling this out early next week sound sensible > >> or simply too ambitious? > > > > I wont have time to roll my genepop code with statistics calculations > > in that timeframe, no chance. I will leave it for 1.52. > > We don't have to rush this - it just seemed worth having a quite fresh > release available for us to work from in the tutorial session at BOSC, > and also as a reference point the coding BoF session. > > If we do do Biopython 1.51 beta next week, with Biopython 1.51 final in > July, then I would provisionally expect to do Biopython 1.52 in the Autumn. > This could be brought forward if a large and useful contribution was added. > Let's chat about the genepop code at BOSC? This sounds like a good idea if you have time to push it next week. I'd like to get GFF parsing/writing in but need to do some refactoring before realistically proposing that, and it won't happen until after BOSC. I will do the annotation bit this weekend so it'll be in there. 
1.52 sounds like it'll be a good target for a lot of new functionality. Nice. Thanks, Brad From biopython at maubp.freeserve.co.uk Sat Jun 20 08:53:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 20 Jun 2009 09:53:15 +0100 Subject: [Biopython-dev] Next release plans? In-Reply-To: <20090619223951.GA5133@sobchak.mgh.harvard.edu> References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> <6d941f120906190744y3b260c50mf414311c84ad128d@mail.gmail.com> <320fb6e00906190757q275b26f4p4bc6933ad99e9a49@mail.gmail.com> <20090619223951.GA5133@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00906200153l14d1f398x732cb118eb83603a@mail.gmail.com> On Fri, Jun 19, 2009 at 11:39 PM, Brad Chapman wrote: > Hi Peter; > >> If we do do Biopython 1.51 beta next week, with Biopython 1.51 final in >> July, then I would provisionally expect to do Biopython 1.52 in the Autumn. >> This could be brought forward if a large and useful contribution was added. >> Let's chat about the genepop code at BOSC? > > This sounds like a good idea if you have time to push it next week. > I'd like to get GFF parsing/writing in but need to do some refactoring > before realistically proposing that, and it won't happen until after > BOSC. I will do the annotation bit this weekend so it'll be in > there. Yeah - the GFF parsing and SeqFeature/FeatureLocation stuff would benefit from some in person discussion. This is all very fresh in my mind from the EMBL/GenBank side of things as I have got GenBank feature writing in Bio.SeqIO to work (in CVS), and have been looking at replacing the current GenBank/EMBL location parser with something re based for speed (see Bug 2738 for a proof of concept, I have a more complete version in progress). > 1.52 sounds like it'll be a good target for a lot of new > functionality. Nice. 
It does look like that :) Peter From rozziite at gmail.com Sun Jun 21 14:13:40 2009 From: rozziite at gmail.com (Diana Jaunzeikare) Date: Sun, 21 Jun 2009 10:13:40 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> References: <3f6baf360906151004m4ee3595g7ac007fd218e370f@mail.gmail.com> <20090617124101.GH44321@sobchak.mgh.harvard.edu> <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> Message-ID: <4057d3bf0906210713p331cdfa4n31d76e95ee78ea85@mail.gmail.com> 2009/6/17 Eric Talevich > Hi Brad, > > Here's a mid-week update and partial response to your questions. > > *SeqRecord transformation* > > It would be nice if I could round-trip this sequence information perfectly, > so > that nothing's lost between reading and writing an arbitrary, valid > PhyloXML > file. For that to work, PhyloXML.Sequence.from_seqrec() would need to look > at > SeqRecord.features and assume that any matching keys have the appropriate > PhyloXML meaning. > > These are the keys that from_seqrec() would look for: > location > uri > annotations > domain_architecture > > Do you see any risk of collision for those names? And for serialization, > would it be unwholesome to convert Annotation and DomainArchitecture objects > to a GFF-style dict-in-a-string? e.g. annotation="ref=foo;source=bar;..." -- > it's another layer of parsing and kind of esoteric, but I can live with it. > > > *Profiling* > > Christian also suggested an option to parse just the phylogenies with a > name or id matching a given string. I like that and I don't see any problem > with extending it to clades as well. It seems like a reasonable use case to > select a sub-tree from a complete phyloXML document and treat it as a > separate > phylogeny from then on. 
This can be supported by various methods for > selecting > portions of the tree, and a method on Clade for transforming the selection > into > a new Phylogeny instance (so the original can be safely deleted). > I like this idea. I will do the same for PhyloXML implementation in BioRuby. Diana > > I did some profiling with the cProfile module, and it looks like most of > the > time is being spent instantiating Clade and Taxonomy objects. (Also, > pretty_print is hugely inefficient, but that's less important.) I think I > can > speed up parsing and reduce memory usage by pulling the from_element > methods > out of each class and using a separate Parser class to do that work. > > About the 2GB figure I gave earlier for the full NCBI taxonomy -- I was > just > looking at Ubuntu's system monitor, and Firefox and a few other things were > running at the same time, taking up about 800MB already. So the full NCBI > taxonomy actually takes up only 1.2GB or so, which isn't such a problem, > and I > think it will get smaller as I shrink down these PhyloXML classes. > > Questions: > - Do you know of a better way to profile Python code, or visualize it? > - Have you used __slots__ to optimize classes? Do you recommend it? > > And a few that don't fit anywhere else: > > - What sort of whole-tree operations would you want to do with these > objects that you can't do with a Nexus or Newick tree? What other > formats > would you want to convert to? I'm thinking of adding an Export module > later if there's time, for lossy conversions like a graph for > networkx. > > - What's the most intuitive way to display a phylogenetic tree you've > loaded into Biopython? Serialize as Nexus and open in TreeViewX? > Convert > to a graph and send to matplotlib? Or, is there a module in > Bio.Graphics > that can draw trees? (If not, should there be?) 
> > Thanks, > Eric > > > > On Wed, Jun 17, 2009 at 8:41 AM, Brad Chapman wrote: > >> Hi Eric; >> Nice update and thanks again for copying the Biopython development >> list on this. >> >> > * Added to_seqrecord and from_seqrecord methods to the >> PhyloXML.Sequence >> > class >> > -- getting Bio.SeqRecord to stand in for PhyloXML.Sequence entirely >> will >> > require some more thought >> >> I'm looking forward to seeing how you decide to go forward with >> this. For the work I do on a day to day basis, a continual >> struggle involves establishing relationships between things to >> retrieve more information. For instance, a pair of nodes on a tree >> is interesting -- how would I find papers, experiments and other >> information associated with those sequences? It seems like Accession >> and the ref attribute of Annotation help establish these >> relationships. >> >> > * Test-driven development kind of went out the window this week. >> >> Heh. It happens -- sounds sensible to have a clean up and >> documentation week this week; that will also help others who are >> interested dig into using it. >> >> > * The unit tests I do have in place give some sense of memory and CPU >> usage. >> > For the full NCBI taxonomy, memory usage climbs up above 2 GB with >> the >> > read() function, which isn't a problem on this workstation but could >> be for >> > others. >> >> Do you see an opportunity to offer iterating over clades instead of >> loading them all into memory for these larger trees? This would >> involve lazily loading subclades on request and would limit some >> functionality for querying the full tree without loading it all into >> memory. >> >> Another option is to offer some pruning ability as a tree is >> loading. For instance, if I am loading the whole NCBI taxonomy on a >> memory limited computer and only need the Angiosperm flowering plant >> part of the tree. In this case, you'd want to throw away all clades >> not under the clades of interest. 
>> >> These are probably fringe cases; just brainstorming some ideas. >> >> Thanks again, >> Brad >> > > > _______________________________________________ > Wg-phyloinformatics mailing list > Wg-phyloinformatics at nescent.org > https://lists.nescent.org/mailman/listinfo/wg-phyloinformatics > > From biopython at maubp.freeserve.co.uk Mon Jun 22 11:29:50 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 22 Jun 2009 12:29:50 +0100 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> References: <3f6baf360906151004m4ee3595g7ac007fd218e370f@mail.gmail.com> <20090617124101.GH44321@sobchak.mgh.harvard.edu> <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> Message-ID: <320fb6e00906220429m5c84ab3dxfadf1624b0a3e6e4@mail.gmail.com> On Thu, Jun 18, 2009 at 12:17 AM, Eric Talevich wrote: > > Hi Brad, > > Here's a mid-week update and partial response to your questions. > > *SeqRecord transformation* > I was just having a very brief scan over the commits that my RSS feed had - and noticed this bit: + # Unpack record.features + if record.features: + kwargs['domain_architecture'] = DomainArchitecture( + domains=[ProteinDomain({ + 'from': feat.location.start + 1, + 'to': feat.location.end + 1, + 'confidence': feat.qualifiers.get('confidence') + }, value=feat.id) + for feat in record.features], + length=len(record.seq) + ) I can understand a +/- one to the start location (moving between Python zero based counting and normal one based counts), but why would the end location also change? 
Peter From bugzilla-daemon at portal.open-bio.org Mon Jun 22 13:03:16 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 22 Jun 2009 09:03:16 -0400 Subject: [Biopython-dev] [Bug 2851] Psycopg version 1 support for BioSQL In-Reply-To: Message-ID: <200906221303.n5MD3Ggo014924@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2851 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-22 09:03 EST ------- (In reply to comment #3) > (In reply to comment #2) > > Do you think we should deprecate Biopython support for psycopg version one? > > Yes, I'd deprecate it - its no longer actively developed. Anyone wanting to > use Psycopg would surely choose version 2 (version 1 was a pain to build > anyway). > > C. > OK - I've added a deprecation warning for psycopg (v1) in Biopython 1.51 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From eric.talevich at gmail.com Mon Jun 22 13:08:05 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 22 Jun 2009 09:08:05 -0400 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <320fb6e00906220429m5c84ab3dxfadf1624b0a3e6e4@mail.gmail.com> References: <3f6baf360906151004m4ee3595g7ac007fd218e370f@mail.gmail.com> <20090617124101.GH44321@sobchak.mgh.harvard.edu> <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> <320fb6e00906220429m5c84ab3dxfadf1624b0a3e6e4@mail.gmail.com> Message-ID: <3f6baf360906220608h3c0fff98sc06bd01f0afaafef@mail.gmail.com> On Mon, Jun 22, 2009 at 7:29 AM, Peter wrote: > On Thu, Jun 18, 2009 at 12:17 AM, Eric Talevich > wrote:> *SeqRecord transformation* > > > > I was just having a very brief scan over the commits that my RSS > feed had - and noticed this bit: > > + # Unpack record.features > + if record.features: > + kwargs['domain_architecture'] = DomainArchitecture( > + domains=[ProteinDomain({ > + 'from': feat.location.start + 1, > + 'to': feat.location.end + 1, > + 'confidence': > feat.qualifiers.get('confidence') > + }, value=feat.id) > + for feat in record.features], > + length=len(record.seq) > + ) > > I can understand a +/- one to the start location (moving between > Python zero based counting and normal one based counts), but > why would the end location also change? > > Peter > Er, it wouldn't. Oops. On that note, how do you feel about specifying biology-style indexes in PhyloXML.ProteinDomain, and switching to zero-based indexes when converting to SeqFeature? Would it be better to use zero-based indexes in ProteinDomain and subtract 1 from the start position during parsing?
Thanks, Eric From biopython at maubp.freeserve.co.uk Mon Jun 22 13:38:27 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 22 Jun 2009 14:38:27 +0100 Subject: [Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <3f6baf360906220608h3c0fff98sc06bd01f0afaafef@mail.gmail.com> References: <3f6baf360906151004m4ee3595g7ac007fd218e370f@mail.gmail.com> <20090617124101.GH44321@sobchak.mgh.harvard.edu> <3f6baf360906171617q3b98f674lef307351e79c7793@mail.gmail.com> <320fb6e00906220429m5c84ab3dxfadf1624b0a3e6e4@mail.gmail.com> <3f6baf360906220608h3c0fff98sc06bd01f0afaafef@mail.gmail.com> Message-ID: <320fb6e00906220638u5c668583h7a5b750fce295304@mail.gmail.com> On Mon, Jun 22, 2009 at 2:08 PM, Eric Talevich wrote: >> >> I can understand a +/- one to the start location (moving between >> Python zero based counting and normal one based counts), but >> why would the end location also change? >> >> Peter >> > > Er, it wouldn't. Oops. > > On that note, how do you feel about specifying biology-style indexes in > PhyloXML.ProteinDomain, and switching to zero-based indexes when converting > to SeqFeature? Would it be better to use zero-based indexes in ProteinDomain > and subtract 1 from the start position during parsing? Personally, within any Python object representation I would expect Python style counting to be used (i.e. so the start and end could be used as is for slicing the sequence). This would be consistent with the SeqFeature usage in Biopython. However, if your object is a simple naive representation of the raw data from the file, you might arguably keep it as in the file (one based). Whatever you do, please make it very explicit in the docstring.
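To illustrate the counting convention discussed above, here is a minimal sketch in plain Python. It is not the PhyloXML code, and the two helper names are hypothetical. It assumes the file stores one-based, inclusive (from, to) positions, in which case only the start shifts by one when moving to Python slice bounds, and the end stays as it is.

```python
def one_based_to_slice(start, end):
    """Convert a one-based, inclusive (start, end) pair to Python slice bounds.

    Hypothetical helper for illustration: the start moves down by one,
    the end is unchanged because Python slices exclude the end index.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end, got %r, %r" % (start, end))
    return start - 1, end


def slice_to_one_based(start, end):
    """Convert Python slice bounds back to a one-based, inclusive pair."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end, got %r, %r" % (start, end))
    return start + 1, end


# A domain annotated as residues 3..6 (one-based, inclusive) in the file.
seq = "MKTAYIAKQR"
py_start, py_end = one_based_to_slice(3, 6)
domain = seq[py_start:py_end]  # slicing works directly with the converted bounds
```

Whichever convention the object stores, stating it in the docstring (as above) saves the reader from guessing.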
Peter From bugzilla-daemon at portal.open-bio.org Mon Jun 22 15:46:20 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 22 Jun 2009 11:46:20 -0400 Subject: [Biopython-dev] [Bug 2841] SeqFeature constructor ignores qualifiers and sub_features arguments In-Reply-To: Message-ID: <200906221546.n5MFkKCT027601@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2841 chapmanb at 50mail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #4 from chapmanb at 50mail.com 2009-06-22 11:46 EST ------- Nick, thanks for the report. Fixed in revision 1.20 of SeqFeature. Also fixed in revision 1.37 of SeqRecord, and tests added to test_SeqIO_features.py. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From eric.talevich at gmail.com Mon Jun 22 16:14:19 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 22 Jun 2009 12:14:19 -0400 Subject: [Biopython-dev] GSoC Weekly Update: PhyloXML for Biopython Message-ID: <3f6baf360906220914p79553725hacac90459f1566c9@mail.gmail.com> Hi folks, Previously (June 15-19) I: * Wrote a pretty-printer for displaying a summary of the parsed tree structure * Made all existing unit tests pass * Started unit tests for instantiation of each phyloXML object * Profiled the parser and utilities using the cProfile module on the unit test suite. 
Summarized findings on the Biopython mailing list (nothing exciting was discovered) * Used a custom warning type to indicate noncompliance with the PhyloXML spec * Separated parsing code (Parser.py) from the phyloXML class definitions (Tree.py) -- this should make Nexus/Newick compatibility feasible * Improved the conversion from PhyloXML.Sequence to Bio.SeqRecord, making better use of annotations and using SeqFeature objects to represent protein domains This week (June 22-26) I will: Work on the backlog: * Finish unittests for parsing and instantiating core elements * Compare parser performance with Bioperl and Archaeopteryx * Document results of parser testing and performance (on wiki or here) * Document basic usage and performance characteristics of the parser on the Biopython wiki Then, serialize phyloXML trees and write back to file: * Write unit tests for serialization * Write serialization methods for each class * Write a top-level function for triggering serialization of the whole hierarchy Question: Biopython has a couple of core objects that I'm reusing in my project. There was a quirk in these libraries (related to this: http://effbot.org/pyfaq/why-are-default-values-shared-between-objects.htm) that made the objects slightly more awkward to instantiate, but the issues were recently fixed. I'd like to merge these fixes soon. So, GSoC requires a tarball of the code we write at the end of the summer. Merging from upstream would bring code that I didn't write into my development tree -- which I could probably filter out with the right arguments to git-diff, but nonetheless, my project history would no longer be entirely clean. Does Google care about this? Or is it safe to go ahead and pull from the next stable release of Biopython (coming soon)?
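For anyone unfamiliar with the default-values quirk linked above, here is a minimal sketch. The class names are hypothetical and this is not the actual Biopython code; it only shows the general Python behaviour, where a mutable default argument is created once and then shared by every instance that relies on it, and the usual None-sentinel workaround.

```python
class SharedDefault:
    """Hypothetical class showing the pitfall: one dict shared by all instances."""

    def __init__(self, qualifiers={}):
        # The default dict is created once, when the function is defined.
        self.qualifiers = qualifiers


class FreshDefault:
    """Hypothetical class showing the usual fix: use None as a sentinel."""

    def __init__(self, qualifiers=None):
        if qualifiers is None:
            qualifiers = {}  # a new dict for every instance
        self.qualifiers = qualifiers


a = SharedDefault()
b = SharedDefault()
a.qualifiers["gene"] = "abc"
shared_leak = "gene" in b.qualifiers  # b sees the key added through a

c = FreshDefault()
d = FreshDefault()
c.qualifiers["gene"] = "abc"
fresh_leak = "gene" in d.qualifiers  # d keeps its own empty dict
```

With the sentinel pattern the constructor can still accept a caller-supplied dict, so nothing has to be assigned after instantiation.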
Cheers, Eric http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML From biopython at maubp.freeserve.co.uk Mon Jun 22 16:44:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 22 Jun 2009 17:44:29 +0100 Subject: [Biopython-dev] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <3f6baf360906220914p79553725hacac90459f1566c9@mail.gmail.com> References: <3f6baf360906220914p79553725hacac90459f1566c9@mail.gmail.com> Message-ID: <320fb6e00906220944l161201efn4f64908fde61a8bc@mail.gmail.com> On Mon, Jun 22, 2009 at 5:14 PM, Eric Talevich wrote: > Biopython has a couple of core objects that I'm reusing in my project. There > was a quirk in these libraries (related to this: > http://effbot.org/pyfaq/why-are-default-values-shared-between-objects.htm) > that made the objects slightly more awkward to instantiate, but the issues > were recently fixed. I'd like to merge these fixes soon. Brad just committed a fix for Bug 2841 - is there anything else you still think needs fixing in the latest trunk code in this area? I'm not sure what you meant by "merge" in this context. > So, GSoC requires a tarball of the code we write at the end of the summer. > Merging from upstream would bring code that I didn't write into my > development tree -- which I could probably filter out with the right > arguments to git-diff, but nonetheless, my project history would no longer > be entirely clean. Does Google care about this? Or is it safe to go ahead > and pull from the next stable release of Biopython (coming soon)? This probably depends on the GSoC rules - from the Biopython license point of view I don't see a problem with you providing a complete tarball (i.e. all of Biopython plus your code). Ideally use the latest stable release for this (and for your timetable that will probably be Biopython 1.51 or 1.52 depending on how things go).
Peter From eric.talevich at gmail.com Mon Jun 22 17:14:49 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 22 Jun 2009 13:14:49 -0400 Subject: [Biopython-dev] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <320fb6e00906220944l161201efn4f64908fde61a8bc@mail.gmail.com> References: <3f6baf360906220914p79553725hacac90459f1566c9@mail.gmail.com> <320fb6e00906220944l161201efn4f64908fde61a8bc@mail.gmail.com> Message-ID: <3f6baf360906221014l2e730d8n960ecc28cfc021bd@mail.gmail.com> On Mon, Jun 22, 2009 at 12:44 PM, Peter wrote: > On Mon, Jun 22, 2009 at 5:14 PM, Eric Talevich > wrote: > > Biopython has a couple of core objects that I'm reusing in my project. > There > > was a quirk in these libraries (related to this: > > > http://effbot.org/pyfaq/why-are-default-values-shared-between-objects.htm) > > that made the objects slightly more awkward to instantiate, but the > issues > > were recently fixed. I'd like to merge these fixes soon. > > Brad just committed a fix for Bug 2841 - is there anything else you still > think needs fixing in the latest trunk code in this area? I'm not sure what > you meant my "merge" in this context. > I meant pull -- merging from biopython/master into my etal/phyloxml branch. > > So, GSoC requires a tarball of the code we write at the end of the > summer. > > Merging from upstream would bring code that I didn't write into my > > development tree -- which I could probably filter out with the right > > arguments to git-diff, but nonetheless, my project history would no > longer > > be entirely clean. Does Google care about this? Or is it safe to go ahead > > and pull from the next stable release of Biopython (coming soon)? > > This probably depends on the GSoC rules - from the Biopython license > point of view I don't see a problem with you providing a complete tarball > (i.e. all of Biopython plus your code). 
Ideally use the latest > stable > release for this (and for your timetable that will probably be Biopython > 1.51 or 1.52 depending on how things go). > OK, good. The GSoC guidelines from previous years look like they're reasonably flexible, I just want to be sure. Also, here's a rant from Linus Torvalds about how to merge from upstream in Git: http://www.mail-archive.com/dri-devel at lists.sourceforge.net/msg39091.html According to that, I should pull from the Biopython master branch at the 1.51 tag when that happens (no rebasing, since I've pushed my stuff to github already), and if I want to land in time for 1.52, then pull from biopython/master then, fix any merging issues, and immediately submit a pull request on GitHub. Otherwise, pull from the 1.52 tag, rinse, repeat. (By the way, do you think GitMigration will be complete before or after GSoC ends? If we're still rocking CVS then it's less important to keep a clean branch history.) Best, Eric From biopython at maubp.freeserve.co.uk Mon Jun 22 17:30:58 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 22 Jun 2009 18:30:58 +0100 Subject: [Biopython-dev] GSoC Weekly Update: PhyloXML for Biopython In-Reply-To: <3f6baf360906221014l2e730d8n960ecc28cfc021bd@mail.gmail.com> References: <3f6baf360906220914p79553725hacac90459f1566c9@mail.gmail.com> <320fb6e00906220944l161201efn4f64908fde61a8bc@mail.gmail.com> <3f6baf360906221014l2e730d8n960ecc28cfc021bd@mail.gmail.com> Message-ID: <320fb6e00906221030o31532870qfbddb02172dfec1@mail.gmail.com> On Mon, Jun 22, 2009 at 6:14 PM, Eric Talevich wrote: > > (By the way, do you think GitMigration will be complete before or after GSoC > ends? If we're still rocking CVS then it's less important to keep a clean > branch history.) > I'm sure this will be one of the topics we (Biopython) people will be discussing at BOSC this coming weekend.
http://www.open-bio.org/wiki/BOSC_2009 I'm also *hoping* to talk to one of the OBF server administrators at the BOSC/ISMB conference, for a chat about our options (e.g. run git instead of CVS on the OBF servers). Nothing concrete about such a meeting yet though :( Peter From biopython at maubp.freeserve.co.uk Mon Jun 22 17:57:49 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 22 Jun 2009 18:57:49 +0100 Subject: [Biopython-dev] Next release plans? In-Reply-To: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> Message-ID: <320fb6e00906221057n10b556e2l73f645b48601dd36@mail.gmail.com> On Fri, Jun 19, 2009 at 3:40 PM, Peter wrote: > Hi all, > > We've made some good progress since Biopython 1.50, and I think > it may already be time to plan another release. ... Hi all, I've managed to tick off several items on my todo list for the release (e.g. documentation additions), and plan to tackle the release of a Biopython 1.51 beta tomorrow (Tuesday). At the earliest I'll start the release process in 14 hours time (9am UK time), so any last minute low risk changes can still go in. As usual I'll send out a CVS freeze email beforehand. Once the beta release is out, we'll resume taking small changes (especially for documentation additions or clarifications) with a view to releasing Biopython 1.51 final in July (probably the second week, after people get back from BOSC/ISMB).
Peter From winda002 at student.otago.ac.nz Tue Jun 23 02:30:30 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Tue, 23 Jun 2009 14:30:30 +1200 Subject: [Biopython-dev] Release announcement for 1.51b In-Reply-To: <320fb6e00906221057n10b556e2l73f645b48601dd36@mail.gmail.com> References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> <320fb6e00906221057n10b556e2l73f645b48601dd36@mail.gmail.com> Message-ID: <1245724230.4a403e467e8c1@www.studentmail.otago.ac.nz> Hi Guys, This is a draft for an announcement once the 1.51 beta is ready to roll. Any comments are welcome, it would probably be useful to have another paragraph with some more details on GenBank feature writing and the application wrappers from people that are more familiar with them than I am. Cheers, David Biopython 1.51 beta available for testing. A beta release for Biopython 1.51 is now available for download and testing. In the two months since Biopython 1.50 was released we have introduced support for writing features in GenBank files, added new application wrappers for alignment programs, extended SeqIO's support for the FASTQ format to include files created by Illumina 1.3+ and made numerous tweaks and bug fixes. All the new features have been tested by the dev team but it's possible there are cases for these tools that we haven't been able to foresee and test, especially for the GenBank feature writer. We are interested in getting feedback on the release as a whole and especially on the new features (and their documentation in the Biopython Tutorial and Cookbook). So, gather your courage, download the release, try it out and let us know what works and what doesn't through the mailing lists.
[then some links to the files to download] From amrita at iisermohali.ac.in Tue Jun 23 04:12:32 2009 From: amrita at iisermohali.ac.in (amrita at iisermohali.ac.in) Date: Tue, 23 Jun 2009 09:42:32 +0530 (IST) Subject: [Biopython-dev] biopython Message-ID: <18357.210.212.36.65.1245730352.squirrel@www.iisermohali.ac.in> Dear all, I want to know whether it is possible or not to extract chemical shift information about protein from BMRB (BioMagResBank) or Ref-DB (referenced databank) using biopython programming. Amrita Kumari Research Fellow IISER Mohali Chandigarh INDIA From biopython at maubp.freeserve.co.uk Tue Jun 23 10:14:28 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 23 Jun 2009 11:14:28 +0100 Subject: [Biopython-dev] CVS mirror server issue (and github) Message-ID: <320fb6e00906230314k27a782c3hcfb5b085efccc5c2@mail.gmail.com> Hi all, The main CVS server is up and running (on dev.open-bio.org and possibly other aliases), so there shouldn't be any issues with preparing Biopython 1.51 beta. However, the OBF use a different (virtual) server for: code.open-bio.org cvs.open-bio.org cvs.biopython.org This runs ViewCVS and also hosts what I assume is a read only mirror of CVS (e.g. for anonymous access). Right now this is down. As github hasn't updated recently, I would guess Bartek's machine (which has been doing the CVS to github updates) was pointing at one of the above addresses. The OBF have been alerted to this issue, but as an alternative way to get github back up to date, Bartek might be able to switch to fetching the latest code from dev.open-bio.org instead... I hope we can fix this before BOSC, as I would like to use github branches for the hackathon session.
Peter From biopython at maubp.freeserve.co.uk Tue Jun 23 10:24:07 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 23 Jun 2009 11:24:07 +0100 Subject: [Biopython-dev] Release announcement for 1.51b In-Reply-To: <1245724230.4a403e467e8c1@www.studentmail.otago.ac.nz> References: <320fb6e00906190740k6a65e029sa4d4d0171b3985bd@mail.gmail.com> <320fb6e00906221057n10b556e2l73f645b48601dd36@mail.gmail.com> <1245724230.4a403e467e8c1@www.studentmail.otago.ac.nz> Message-ID: <320fb6e00906230324x6bf05a5al6d709d17de9f4065@mail.gmail.com> On Tue, Jun 23, 2009 at 3:30 AM, David Winter wrote: > Hi Guys, > > This is a draft for an announcement once the 1.51 beta is ready to roll. > Any comments are welcome, it would probably be useful to have another > paragraph with some more details on GenBank feature writing and the > application wrappers from people that are more familiar with them than I > am. > > Cheers, > David Thanks for that David, I'll use it as the basis of the release notes this afternoon (in my time zone ;) ). Peter From biopython at maubp.freeserve.co.uk Tue Jun 23 10:25:38 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 23 Jun 2009 11:25:38 +0100 Subject: [Biopython-dev] CVS freeze for Biopython 1.51 (beta) Message-ID: <320fb6e00906230325x2c05f32cp1c3910e159c8b104@mail.gmail.com> Hi all, OK, as per my email last night, please consider CVS "frozen" until further notice, while I prepare the Biopython 1.51 beta release. If there are any last minute additions email me ASAP, and I'll see what I can do.
Thanks Peter From biopython at maubp.freeserve.co.uk Tue Jun 23 10:52:08 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 23 Jun 2009 11:52:08 +0100 Subject: [Biopython-dev] CVS mirror server issue (and github) In-Reply-To: <320fb6e00906230314k27a782c3hcfb5b085efccc5c2@mail.gmail.com> References: <320fb6e00906230314k27a782c3hcfb5b085efccc5c2@mail.gmail.com> Message-ID: <320fb6e00906230352j1735046q1dbd07cebd526313@mail.gmail.com> On Tue, Jun 23, 2009 at 11:14 AM, Peter wrote: > Hi all, > > The main CVS server is up and running (on dev.open-bio.org and > possibly other aliases), so there shouldn't be any issues with > preparing Biopython 1.51 beta. > > However, the OBF use a different (virtual) server for: > code.open-bio.org > cvs.open-bio.org > cvs.biopython.org > > This runs ViewCVS and also hosts what I assume is a read only mirror > of CVS (e.g. for anonymous access). Right now this is down. As github > hasn't updated recently, I would guess Bartek's machine (which has > been doing the CVS to github updates) was pointing at one of the above > addresses. Progress report - the virtual machine is still playing up but a software upgrade is planned. Right now: code.open-bio.org - up cvs.open-bio.org - up cvs.biopython.org - stale DNS entry (at least with my ISP) > The OBF have been altered to this issue, but as an alternative way to > get github back up to date, Bartek might be able to switch to fetching > the latest code from dev.open-bio.org instead... If you are using cvs.biopython.org right now the DNS entry is stale (but I'm sure it will be fixed shortly), so a less disruptive change would be to try pointing at cvs.open-bio.org or code.open-bio.org (i.e. use a different alias for the mirror server, rather than switching to the primary server). 
Peter From bartek at rezolwenta.eu.org Tue Jun 23 10:54:49 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 23 Jun 2009 12:54:49 +0200 Subject: [Biopython-dev] CVS mirror server issue (and github) In-Reply-To: <320fb6e00906230314k27a782c3hcfb5b085efccc5c2@mail.gmail.com> References: <320fb6e00906230314k27a782c3hcfb5b085efccc5c2@mail.gmail.com> Message-ID: <8b34ec180906230354x2baf6ed8q1d490fbe45d3d89d@mail.gmail.com> On Tue, Jun 23, 2009 at 12:14 PM, Peter wrote: > > > This runs ViewCVS and also hosts what I assume is a read only mirror > of CVS (e.g. for anonymous access). Right now this is down. As github > hasn't updated recently, I would guess Bartek's machine (which has > been doing the CVS to github updates) was pointing at one of the above > addresses. > The OBF have been alerted to this issue, but as an alternative way to > get github back up to date, Bartek might be able to switch to fetching > the latest code from dev.open-bio.org instead... > Hi, The github update process was in fact always based on dev.open-bio.org. The problem with updates was caused by my "solution" to the AUTHORS file re-occurrence in github... So it seems that when I removed the AUTHORS file, new updates could not make it to github (there was an extra commit, that was not in CVS). Now, as I forced github to accept new updates, we have AUTHORS file in github again... (even though it's not in CVS and not in my git branch which is the source of updates...). That's weird and maybe we will be able to research this better during BOSC... To my knowledge there are no other glitches like this one, but if you see something strange, let me know... > I hope we can fix this before BOSC, as I would like to use github > branches for the hackathon session.
> the updates should be working now, the only remaining problem is the AUTHORS file, but we can take care of it during BOSC cheers Bartek From biopython at maubp.freeserve.co.uk Tue Jun 23 10:57:46 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 23 Jun 2009 11:57:46 +0100 Subject: [Biopython-dev] CVS mirror server issue (and github) In-Reply-To: <8b34ec180906230354x2baf6ed8q1d490fbe45d3d89d@mail.gmail.com> References: <320fb6e00906230314k27a782c3hcfb5b085efccc5c2@mail.gmail.com> <8b34ec180906230354x2baf6ed8q1d490fbe45d3d89d@mail.gmail.com> Message-ID: <320fb6e00906230357s61d1aeb8r5efecedc8501eaf7@mail.gmail.com> On Tue, Jun 23, 2009 at 11:54 AM, Bartek Wilczynski wrote: > > Hi, > > The github update process was in fact always based on dev.open-bio.org. > The problem with updates was caused by my "solution" to the AUTHORS > file re-occurrence in github... maybe we will be able to research this better > during BOSC... > > To my knowledge there are no other glitches like this one, but if you see > something strange, let me know... Rather odd. But as you suggest, we can have a more in depth discussion at BOSC. >> I hope we can fix this before BOSC, as I would like to use github >> branches for the hackathon session. > > the updates should be working now, the only remaining problem is the > AUTHORS file, but we can take care of it during BOSC Excellent :) Peter From bugzilla-daemon at portal.open-bio.org Tue Jun 23 11:06:37 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 23 Jun 2009 07:06:37 -0400 Subject: [Biopython-dev] [Bug 2849] PyGresql PostgreSQL driver support for BioSQL is broken In-Reply-To: Message-ID: <200906231106.n5NB6bg2016683@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2849 ------- Comment #12 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-23 07:06 EST ------- Can we close this bug now? 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jun 23 11:09:40 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 23 Jun 2009 07:09:40 -0400 Subject: [Biopython-dev] [Bug 2856] Duplicate positions for some restriction enzymes in some sequences In-Reply-To: Message-ID: <200906231109.n5NB9eIe016872@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2856 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-23 07:09 EST ------- Frédéric Sohm (author of Bio.Restriction) posted a fix on the mailing list: http://lists.open-bio.org/pipermail/biopython-dev/2009-June/006215.html I have just checked this in, marking bug as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Tue Jun 23 11:15:02 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 23 Jun 2009 07:15:02 -0400 Subject: [Biopython-dev] [Bug 2853] Support the "in" keyword with Seq + SeqRecord objects / define __contains__ method In-Reply-To: Message-ID: <200906231115.n5NBF2gf017552@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2853 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Summary|Support the "in" keyword |Support the "in" keyword |with Seq objects / define |with Seq + SeqRecord objects |__contains__ method |/ define __contains__ method ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-23 07:15 EST ------- Patch for Seq object checked in. Leaving bug open for possible similar addition to the SeqRecord object. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jun 23 11:15:39 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 23 Jun 2009 07:15:39 -0400 Subject: [Biopython-dev] [Bug 2853] Support the "in" keyword with Seq + SeqRecord objects / define __contains__ method In-Reply-To: Message-ID: <200906231115.n5NBFd7r017630@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2853 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1323 is|0 |1 obsolete| | ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-23 07:15 EST ------- (From update of attachment 1323) As noted above, this has been checked in. 
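For readers following Bug 2853, here is a minimal sketch of how defining a __contains__ method makes the "in" keyword work on a sequence-like object. The SimpleSeq class is hypothetical and exists only for this illustration; it is not the Biopython Seq class and not the patch that was checked in.

```python
class SimpleSeq(object):
    """Hypothetical stand-in for a sequence object (not Bio.Seq.Seq)."""

    def __init__(self, data):
        self._data = str(data)

    def __str__(self):
        return self._data

    def __contains__(self, item):
        # Compare on the string form, so both plain strings and other
        # SimpleSeq objects can be used on the left of "in".
        return str(item) in self._data


seq = SimpleSeq("ACGTAC")
print("GTA" in seq)             # substring search via __contains__
print(SimpleSeq("TTT") in seq)  # also accepts another SimpleSeq
```

Without __contains__, Python would fall back to iterating over the object (if it supports that) and comparing element by element, which only works for single letters; the explicit method allows a substring search.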
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jun 23 11:17:11 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 23 Jun 2009 07:17:11 -0400 Subject: [Biopython-dev] [Bug 2849] PyGresql PostgreSQL driver support for BioSQL is broken In-Reply-To: Message-ID: <200906231117.n5NBHBGD017709@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2849 ------- Comment #13 from cymon.cox at gmail.com 2009-06-23 07:17 EST ------- (In reply to comment #12) > Can we close this bug now? Sure. Cheers, C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jun 23 11:17:34 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 23 Jun 2009 07:17:34 -0400 Subject: [Biopython-dev] [Bug 2375] Coalescent support through Simcoal2 In-Reply-To: Message-ID: <200906231117.n5NBHYgq017747@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2375 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Severity|enhancement |normal ------- Comment #26 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-23 07:17 EST ------- Switching this from an enhancement (which is done) to a normal bug for the remaining issue, removing the workaround in Bio/PopGen/SimCoal/__init__.py. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Jun 23 11:22:06 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 23 Jun 2009 07:22:06 -0400 Subject: [Biopython-dev] [Bug 2849] PyGresql PostgreSQL driver support for BioSQL is broken In-Reply-To: Message-ID: <200906231122.n5NBM6rL018081@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2849 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #14 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-23 07:22 EST ------- Marking as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Tue Jun 23 13:58:28 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 23 Jun 2009 14:58:28 +0100 Subject: [Biopython-dev] CVS freeze for Biopython 1.51 (beta) In-Reply-To: <320fb6e00906230325x2c05f32cp1c3910e159c8b104@mail.gmail.com> References: <320fb6e00906230325x2c05f32cp1c3910e159c8b104@mail.gmail.com> Message-ID: <320fb6e00906230658g27301e6asca055c54dee45ab1@mail.gmail.com> On Tue, Jun 23, 2009 at 11:25 AM, Peter wrote: > Hi all, > > OK, as per my email last night, please consider CVS "frozen" > until further notice, while I prepare the Biopython 1.51 beta > release. OK, the release is done (except for the announcements) and tagged in CVS. Little low impact things can be added to CVS again now. 
You should be able to download it already - let me know if there are any problems with the links or the files themselves: http://biopython.org/wiki/Download http://biopython.org/DIST/ If we make any notable improvements to the documentation between now and the release of Biopython 1.51 final, we can also update the online version of the tutorial. Peter From biopython at maubp.freeserve.co.uk Tue Jun 23 14:35:09 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 23 Jun 2009 15:35:09 +0100 Subject: [Biopython-dev] SeqRecord in Tutorial (was GSoC and PhyloXML) In-Reply-To: <320fb6e00906230734j3c921fcew41737bf0efc4df61@mail.gmail.com> References: <320fb6e00906181505r32d7df31w7287bdb69753c49f@mail.gmail.com> <3f6baf360906181857h43686cfawb850266eb5153af8@mail.gmail.com> <320fb6e00906190218m4619f624ncfc9d1f8bccbbad2@mail.gmail.com> <320fb6e00906230734j3c921fcew41737bf0efc4df61@mail.gmail.com> Message-ID: <320fb6e00906230735h153fb792w68bff43cf77eec23@mail.gmail.com> On Fri, Jun 19, 2009 at 10:18 AM, Peter wrote: > On Fri, Jun 19, 2009 at 2:57 AM, Eric Talevich wrote: > > On Thu, Jun 18, 2009 at 6:05 PM, Peter wrote: > >> On Thu, Jun 18, 2009 at 10:49 PM, Eric Talevich wrote: > >> > > >> > I didn't notice any typos other than Python being consistently > >> > lowercase, which I assume is how the author likes it. > >> > >> I was aiming for consistency, with no strong preference - at the time > >> there were more uses of "python" than "Python" so I picked that. We > >> can change it easily enough - does anyone care either way? > > > > Python.org capitalizes it. Shrug. > > Maybe we should use "Python" then. During the polishing for Biopython 1.51b I went through and made all the appropriate "python" uses into "Python". There are a couple of special cases like command line snippets where it should be lower case.
Peter From biopython at maubp.freeserve.co.uk Tue Jun 23 14:40:32 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 23 Jun 2009 15:40:32 +0100 Subject: [Biopython-dev] Biopython beta releases on PyPi? Message-ID: <320fb6e00906230740x6e4b76c8v7e6c0a67c662b751@mail.gmail.com> Hi Brad (and others), Do you think we should push beta releases of Biopython via PyPi? http://pypi.python.org/pypi/biopython Peter From biopython at maubp.freeserve.co.uk Tue Jun 23 16:05:53 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 23 Jun 2009 17:05:53 +0100 Subject: [Biopython-dev] Biopython 1.51 beta released Message-ID: <320fb6e00906230905i5b5a2364i7ae9c4c96e4ae50d@mail.gmail.com> Dear all, A beta release for Biopython 1.51 is now available for download and testing. In the two months since Biopython 1.50 was released, we have introduced support for writing features in GenBank files using Bio.SeqIO, extended SeqIO's support for the FASTQ format to include files created by Illumina 1.3+, and added a new set of application wrappers for alignment programs, and made numerous tweaks and bug fixes. All the new features have been tested by the dev team but it's possible there are cases that we haven't been able to foresee and test, especially for the GenBank feature writer (as there are just so many possible odd fuzzy feature locations). Note that as previously announced, Biopython no longer supports Python 2.3, and our deprecated parsing infrastructure (Martel and Bio.Mindy) has been removed. Source distributions and Windows installers are available from the downloads page on the Biopython website.
http://biopython.org/wiki/Download We are interested in getting feedback on the beta release as a whole, but especially on the new features and the Biopython Tutorial and Cookbook: http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf So, gather your courage, download the release, try it out and let us know what works and what doesn't through the mailing lists (or bugzilla). -Peter, on behalf of the Biopython developers P.S. This news post is online at http://news.open-bio.org/news/2009/06/biopython-151-beta-released/ You may wish to subscribe to our news feed. For RSS links etc, see: http://biopython.org/wiki/News Biopython news is also on twitter: http://twitter.com/biopython Thanks also to David Winter for coming up with the draft release message. From biopython at maubp.freeserve.co.uk Wed Jun 24 09:43:51 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 24 Jun 2009 10:43:51 +0100 Subject: [Biopython-dev] biopython In-Reply-To: <18357.210.212.36.65.1245730352.squirrel@www.iisermohali.ac.in> References: <18357.210.212.36.65.1245730352.squirrel@www.iisermohali.ac.in> Message-ID: <320fb6e00906240243x34cf22c4y3742c1cee84de6e9@mail.gmail.com> On Tue, Jun 23, 2009 at 5:12 AM, wrote: > > Dear all, > > I want to know whether it's possible or not to extract chemical shift > information about protein from BMRB (BioMagResBank) or Ref-DB > (referenced databank) using biopython programming. > > Amrita Kumari I'd replied to Amrita directly, and suggested he email the discussion list in case anyone had any suggestions. I don't think there is anything already included with Biopython for chemical shifts from BMRB (BioMagResBank) or Ref-DB (referenced databank), but I don't work with NMR or 3D structures. http://www.bmrb.wisc.edu/ - BioMagResBank http://redpoll.pharmacy.ualberta.ca/RefDB/ - Ref-DB Any ideas?
Peter From chapmanb at 50mail.com Wed Jun 24 12:25:17 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 24 Jun 2009 08:25:17 -0400 Subject: [Biopython-dev] Biopython beta releases on PyPi? In-Reply-To: <320fb6e00906230740x6e4b76c8v7e6c0a67c662b751@mail.gmail.com> References: <320fb6e00906230740x6e4b76c8v7e6c0a67c662b751@mail.gmail.com> Message-ID: <20090624122517.GH41327@sobchak.mgh.harvard.edu> Hi Peter; > Do you think we should push beta releases of Biopython via PyPi? > http://pypi.python.org/pypi/biopython I haven't been doing this. Conceptually, PyPi is more for users who want to install the latest thing that just works without too much thought. I suppose beta releases don't quite fall into that category since they are meant for testing, so let's just push final releases there. Brad From bugzilla-daemon at portal.open-bio.org Thu Jun 25 15:39:24 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 25 Jun 2009 11:39:24 -0400 Subject: [Biopython-dev] [Bug 2738] Speed up GenBank parsing, in particular location parsing In-Reply-To: Message-ID: <200906251539.n5PFdOZZ023918@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2738 ------- Comment #12 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-25 11:39 EST ------- Created an attachment (id=1329) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1329&action=view) Patch for Bio/GenBank/__init__.py to handle most locations with re A more complicated version of the previous patch, covering a wider range of features. This is a work in progress but I wanted to stash it somewhere online as a snapshot of my progress - I should try doing this in github in future ;) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
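To illustrate the idea behind the Bug 2738 patch (handling the common location strings with a regular expression and leaving the rest to the full parser), here is a minimal sketch. The patterns and the parse_simple_location function are assumptions made for this example; they are not taken from the attached patch or from Bio/GenBank/__init__.py.

```python
import re

# Patterns for two common location forms. These are assumptions for
# this sketch, not the expressions used in the attached patch.
_SIMPLE = re.compile(r"^(\d+)\.\.(\d+)$")
_COMPLEMENT = re.compile(r"^complement\((\d+)\.\.(\d+)\)$")


def parse_simple_location(text):
    """Return (start, end, strand), or None if the location is not simple.

    start is converted to zero-based (Python slice style) and end is
    kept as in the file, so seq[start:end] covers the feature.
    """
    for pattern, strand in ((_SIMPLE, 1), (_COMPLEMENT, -1)):
        match = pattern.match(text)
        if match:
            start, end = int(match.group(1)), int(match.group(2))
            return (start - 1, end, strand)
    # Fuzzy positions, joins and so on: the caller would fall back to
    # the full (slower) location parser here.
    return None


print(parse_simple_location("123..456"))
print(parse_simple_location("complement(123..456)"))
print(parse_simple_location("join(1..10,20..30)"))
```

The speed-up in this kind of approach comes from the fact that most features in a typical GenBank file use one of the simple forms, so the expensive general parser is only needed for the odd cases.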
From cymon.cox at googlemail.com Sun Jun 28 14:10:14 2009 From: cymon.cox at googlemail.com (Cymon Cox) Date: Sun, 28 Jun 2009 15:10:14 +0100 Subject: [Biopython-dev] Bio.Sequencing Message-ID: <7265d4f0906280710r32561593l2aca26abaa9c135e@mail.gmail.com> Hi Peter, What is the long-term future of Bio.Sequencing? With the (very cool) QualityIO stuff now in SeqIO, the Phd module looks a bit out of place - is there any reason not to move both Ace and Phd code to SeqIO i.e. in the AceIO and PhdIO interfaces? I ask because I've written a Phd writer class for the SeqIO interface and initially added it to PhdIO. Cheers, C. -- From p.j.a.cock at googlemail.com Mon Jun 29 07:23:06 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 29 Jun 2009 08:23:06 +0100 Subject: [Biopython-dev] Bio.Sequencing In-Reply-To: <7265d4f0906280710r32561593l2aca26abaa9c135e@mail.gmail.com> References: <7265d4f0906280710r32561593l2aca26abaa9c135e@mail.gmail.com> Message-ID: <320fb6e00906290023q57f8ccc3sde16576e23766bc5@mail.gmail.com> On Sun, Jun 28, 2009 at 3:10 PM, Cymon Cox wrote: > Hi Peter, > > What is the long-term future of Bio.Sequencing? With the (very cool) > QualityIO stuff now in SeqIO, the Phd module looks a bit out of place - is > there any reason not to move both Ace and Phd code to SeqIO i.e. > in the AceIO and PhdIO interfaces? In the case of FASTQ and QUAL files, everything gets stored in the SeqRecord, so I didn't see any reason to have something in Bio.Sequencing (although perhaps things like mapping between the PHRED and Solexa scores could live there, along with the basic parser used internally giving string tuples - does this sound worth doing?). As you know, currently the SeqIO "ace" and "phd" are simply built on top of Bio.Sequencing.Ace and Bio.Sequencing.Phd, and only transform a subset of the data into a SeqRecord object.
This also describes the SwissProt parsing now - the general model is we have a SeqRecord interface (which may not cover all the details), and underlying, more file-format-specific objects used to hold the data. > I ask because I've written a Phd writer class for the SeqIO interface > and initially added it to PhdIO. Do you want to file an enhancement bug, and then either upload the code to bugzilla, or give a link to a github branch so we can have a look? If your writer takes SeqRecord objects, then I think it would make sense to go in Bio.SeqIO.PhdIO (as I have done for GenBank, although this is in part because I have some intentions to simplify the Bio.GenBank code, and having another writer with another API in there would make this more complicated). It would also make sense to have a writer in Bio.Sequencing.Phd taking its Record objects (and have Bio.SeqIO turn SeqRecord objects into Phd Record objects, and call that). Perhaps this would be a better idea as it is more flexible, but it would be more work, and could be slower ;) Peter From cy at cymon.org Mon Jun 29 07:49:26 2009 From: cy at cymon.org (Cymon Cox) Date: Mon, 29 Jun 2009 08:49:26 +0100 Subject: [Biopython-dev] Bio.Sequencing In-Reply-To: <320fb6e00906290023q57f8ccc3sde16576e23766bc5@mail.gmail.com> References: <7265d4f0906280710r32561593l2aca26abaa9c135e@mail.gmail.com> <320fb6e00906290023q57f8ccc3sde16576e23766bc5@mail.gmail.com> Message-ID: <7265d4f0906290049n65e63227s7537b2e83525ef3d@mail.gmail.com> 2009/6/29 Peter Cock > On Sun, Jun 28, 2009 at 3:10 PM, Cymon Cox > wrote: > > Hi Peter, > > > > What is the long-term future of Bio.Sequencing? With the (very cool) > > QualityIO stuff now in SeqIO, the Phd module looks a bit out of place - > is > > there any reason not to move both Ace and Phd code to SeqIO i.e. > > in the AceIO and PhdIO interfaces?
> > In the case of FASTQ and QUAL files, everything gets stored in > the SeqRecord, so I didn't see any reason to have something in > Bio.Sequencing (although perhaps things like mapping between > the PHRED and Solexa scores could live there, along with the > basic parser used internally giving string tuples - does this sound > worth doing?). > > As you know, currently the SeqIO "ace" and "phd" are simply built > on top of Bio.Sequencing.Ace and Bio.Sequencing.Phd, and only > transform a subset of the data into a SeqRecord object. Yes, but now that per-letter annotations are in SeqRecord there is no reason not to store the Phd 'phred_qualities' and 'peak_locations', so all the Phd file attributes can be stored in a SeqRecord - I altered the parser to do this. > This also > describes the SwissProt parsing now - the general model is we have > a SeqRecord interface (which may not cover all the details), and > underlying, more file-format-specific objects used to hold the data. > > > I ask because I've written a Phd writer class for the SeqIO interface > > and initially added it to PhdIO. > > Do you want to file an enhancement bug, and then either upload > the code to bugzilla, or give a link to a github branch so we can > have a look? > > If your writer takes SeqRecord objects, then I think it would make > sense to go in Bio.SeqIO.PhdIO (as I have done for GenBank, > although this is in part because I have some intentions to simplify > the Bio.GenBank code, and having another writer with another > API in there would make this more complicated). > > It would also make sense to have a writer in Bio.Sequencing.Phd > taking its Record objects (and have Bio.SeqIO turn SeqRecord > objects into Phd Record objects, and call that). Perhaps this would > be a better idea as it is more flexible, but it would be more work, > and could be slower ;) Yes, this was my concern.
As I have it now, the parser code is in Bio.Sequencing.Phd and is called by the Bio.SeqIO.PhdIO, but the writer code is in PhdIO. I could move the write_record to the Phd module for symmetry, but as all the Phd attributes can be stored in SeqRecord, the Phd parser code could just as rationally be moved to PhdIO. Cheers, C. -- From p.j.a.cock at googlemail.com Mon Jun 29 07:58:00 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 29 Jun 2009 08:58:00 +0100 Subject: [Biopython-dev] Bio.Sequencing In-Reply-To: <7265d4f0906290049n65e63227s7537b2e83525ef3d@mail.gmail.com> References: <7265d4f0906280710r32561593l2aca26abaa9c135e@mail.gmail.com> <320fb6e00906290023q57f8ccc3sde16576e23766bc5@mail.gmail.com> <7265d4f0906290049n65e63227s7537b2e83525ef3d@mail.gmail.com> Message-ID: <320fb6e00906290058s2d8fd93aw435eb803adbc5ac8@mail.gmail.com> >> As you know, currently the SeqIO "ace" and "phd" are simply built >> on top of Bio.Sequencing.Ace and Bio.Sequencing.Phd, and only >> transform a subset of the data into a SeqRecord object. > > Yes, but now that per-letter annotations are in SeqRecord there is no > reason not to store the Phd 'phred_qualities' and 'peak_locations', so all > the Phd file attributes can be stored in a SeqRecord - I altered the parser > to do this. Cool - that sounds like it might be worth including in Biopython 1.51 final (if you think it is ready for prime time). If, as you say, your extended Bio.SeqIO.PhdIO parser covers all the data in the PHRED file, then perhaps we could consider deprecating Bio.Sequencing.Phd in the future. >> > I ask because I've written a Phd writer class for the SeqIO interface >> > and initially added it to PhdIO. >> >> Do you want to file an enhancement bug, and then either upload >> the code to bugzilla, or give a link to a github branch so we can >> have a look?
>> >> If your writer takes SeqRecord objects, then I think it would make >> sense to go in Bio.SeqIO.PhdIO (as I have done for GenBank, >> although this is in part because I have some intentions to simplify >> the Bio.GenBank code, and having another writer with another >> API in there would make this more complicated). >> >> It would also make sense to have a writer in Bio.Sequencing.Phd >> taking its Record objects (and have Bio.SeqIO turn SeqRecord >> objects into Phd Record objects, and call that). Perhaps this would >> be a better idea as it is more flexible, but it would be more work, >> and could be slower ;) > > Yes, this was my concern. As I have it now, the parser code is in > Bio.Sequencing.Phd and is called by the Bio.SeqIO.PhdIO, but > the writer code is in PhdIO. I could move the write_record to the > Phd module for symmetry, but as all the Phd attributes can be > stored in SeqRecord, the Phd parser code could just as rationally > be moved to PhdIO. For now, having the writer in Bio.SeqIO.PhdIO seems fine. We could as a second step make the Bio.SeqIO.PhdIO parser self-contained, and as a third step, declare Bio.Sequencing.Phd obsolete. Peter From cy at cymon.org Mon Jun 29 08:09:22 2009 From: cy at cymon.org (Cymon Cox) Date: Mon, 29 Jun 2009 09:09:22 +0100 Subject: [Biopython-dev] Bio.Sequencing In-Reply-To: <320fb6e00906290058s2d8fd93aw435eb803adbc5ac8@mail.gmail.com> References: <7265d4f0906280710r32561593l2aca26abaa9c135e@mail.gmail.com> <320fb6e00906290023q57f8ccc3sde16576e23766bc5@mail.gmail.com> <7265d4f0906290049n65e63227s7537b2e83525ef3d@mail.gmail.com> <320fb6e00906290058s2d8fd93aw435eb803adbc5ac8@mail.gmail.com> Message-ID: <7265d4f0906290109l7a2c77a0radc7aad20c4263cc@mail.gmail.com> 2009/6/29 Peter Cock > >> It would also make sense to have a writer in Bio.Sequencing.Phd > >> taking its Record objects (and have Bio.SeqIO turn SeqRecord > >> objects into Phd Record objects, and call that).
Perhaps this would > >> be a better idea as it is more flexible, but it would be more work, > >> and could be slower ;) > > > > Yes, this was my concern. As I have it now, the parser code is in > > Bio.Sequencing.Phd and is called by the Bio.SeqIO.PhdIO, but > > the writer code is in PhdIO. I could move the write_record to the > > Phd module for symmetry, but as all the Phd attributes can be > > stored in SeqRecord, the Phd parser code could just as rationally > > be moved to PhdIO. > > For now, having the writer in Bio.SeqIO.PhdIO seems fine. We > could as a second step make the Bio.SeqIO.PhdIO parser self > contained, and as a third step, declare Bio.Sequencing.Phd > obsolete. This sounds like a plan. I'll try and get it all together and push it to github sometime today. Cheers, C. -- From bugzilla-daemon at portal.open-bio.org Mon Jun 29 12:08:14 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 29 Jun 2009 08:08:14 -0400 Subject: [Biopython-dev] [Bug 2865] New: Phd writer class for SeqIO Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2865 Summary: Phd writer class for SeqIO Product: Biopython Version: Not Applicable Platform: PC OS/Version: Linux Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: cymon.cox at gmail.com Attached is a patch to add a SeqIO write interface for phd files, plus unittests. Also to be found on http://github.com/cymon/biopython-github-master/tree/assembly C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
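As a rough picture of what a PHD writer has to emit once the sequence, the PHRED qualities and the peak locations are all available per letter, here is a minimal sketch in plain Python. The write_phd_record function and its arguments are hypothetical; this is not the code attached to Bug 2865 and it does not use the Bio.SeqIO API. The comment block is left empty, whereas a real PHD file carries fields such as CHROMAT_FILE and TIME there.

```python
from io import StringIO


def write_phd_record(handle, name, bases, qualities, peaks):
    """Write one read in a simplified PHD layout (hypothetical helper)."""
    if not (len(bases) == len(qualities) == len(peaks)):
        raise ValueError("bases, qualities and peaks must be the same length")
    handle.write("BEGIN_SEQUENCE %s\n\n" % name)
    # Left empty to keep the sketch short; real files list
    # CHROMAT_FILE, TIME and similar fields in this block.
    handle.write("BEGIN_COMMENT\n\nEND_COMMENT\n\n")
    handle.write("BEGIN_DNA\n")
    # One line per base: letter, quality score, peak location.
    for base, qual, peak in zip(bases, qualities, peaks):
        handle.write("%s %i %i\n" % (base, qual, peak))
    handle.write("END_DNA\n\nEND_SEQUENCE\n")


out = StringIO()
write_phd_record(out, "read1", "acg", [9, 20, 35], [6, 18, 30])
print(out.getvalue())
```

Taking the three per-letter lists as separate arguments mirrors the point made in the thread: once they can all be stored alongside the sequence, a writer needs nothing beyond what a single record holds.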
From bugzilla-daemon at portal.open-bio.org Mon Jun 29 12:09:15 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 29 Jun 2009 08:09:15 -0400 Subject: [Biopython-dev] [Bug 2865] Phd writer class for SeqIO In-Reply-To: Message-ID: <200906291209.n5TC9Fj8008875@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2865 ------- Comment #1 from cymon.cox at gmail.com 2009-06-29 08:09 EST ------- Created an attachment (id=1333) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1333&action=view) Phd writer and unittest patch -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 29 12:37:40 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 29 Jun 2009 08:37:40 -0400 Subject: [Biopython-dev] [Bug 2866] New: SQLite support for BioSQL Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2866 Summary: SQLite support for BioSQL Product: Biopython Version: Not Applicable Platform: PC OS/Version: FreeBSD Status: NEW Severity: normal Priority: P2 Component: BioSQL AssignedTo: biopython-dev at biopython.org ReportedBy: chapmanb at 50mail.com Attached is a git patch to add SQLite support to the latest BioSQL. I've tested this with SQLite and MySQL on my FreeBSD machine and both pass the test suite. Cymon, Peter, and anyone else who is interested -- it would be great if you could check on PostgreSQL and the various setups y'all have been using. A few notes: - SQLite does not support FOREIGN KEY constraints so I have dropped those from the creation SQL. - get_subseq_as_string used SUBSTRING, which does not seem to be supported on SQLite. I switched to SUBSTR which I believe should be general. 
- SQLite gives back unicode, which I explicitly convert to strings to be more compatible with what was done previously. If it's easier to check this on GitHub, I can do that when I'm back home. I branched from the main trunk and internet is too slow to learn the right way to do it now. Thanks. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 29 12:42:23 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 29 Jun 2009 08:42:23 -0400 Subject: [Biopython-dev] [Bug 2866] SQLite support for BioSQL In-Reply-To: Message-ID: <200906291242.n5TCgNnd011737@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2866 ------- Comment #1 from chapmanb at 50mail.com 2009-06-29 08:42 EST ------- Created an attachment (id=1334) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1334&action=view) BioSQL SQLite support -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jun 29 13:50:20 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 29 Jun 2009 09:50:20 -0400 Subject: [Biopython-dev] [Bug 2866] SQLite support for BioSQL In-Reply-To: Message-ID: <200906291350.n5TDoKkl018215@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2866 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-06-29 09:50 EST ------- Wow - that was quick Brad, we were only chatting about this yesterday! Have you filed an enhancement bug for BioSQL itself for adding this new schema? Hilmar will probably have some feedback on the foreign key stuff. 
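The SUBSTR point from the notes on Bug 2866 can be tried out with the sqlite3 module from the Python standard library. This is a minimal sketch only: the one-table schema is a cut-down stand-in for the BioSQL biosequence table, and get_subseq is a hypothetical helper, not the get_subseq_as_string code from the patch.

```python
import sqlite3

# In-memory database with a cut-down stand-in for the biosequence table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE biosequence (bioentry_id INTEGER, seq TEXT)")
conn.execute("INSERT INTO biosequence VALUES (1, 'ACGTACGT')")


def get_subseq(conn, bioentry_id, start, end):
    """Fetch seq[start:end] (Python style, zero-based) using SUBSTR."""
    length = end - start
    # SUBSTR counts from one, so shift the Python start index by one.
    row = conn.execute(
        "SELECT SUBSTR(seq, ?, ?) FROM biosequence WHERE bioentry_id = ?",
        (start + 1, length, bioentry_id),
    ).fetchone()
    # Normalise whatever text type the driver hands back to a plain string.
    return str(row[0])


print(get_subseq(conn, 1, 2, 6))
```

The final str() call corresponds to the third note above about converting the values SQLite returns into ordinary strings.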
Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From p.j.a.cock at googlemail.com Mon Jun 29 14:11:15 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 29 Jun 2009 15:11:15 +0100 Subject: [Biopython-dev] Bio.Sequencing In-Reply-To: <7265d4f0906290109l7a2c77a0radc7aad20c4263cc@mail.gmail.com> References: <7265d4f0906280710r32561593l2aca26abaa9c135e@mail.gmail.com> <320fb6e00906290023q57f8ccc3sde16576e23766bc5@mail.gmail.com> <7265d4f0906290049n65e63227s7537b2e83525ef3d@mail.gmail.com> <320fb6e00906290058s2d8fd93aw435eb803adbc5ac8@mail.gmail.com> <7265d4f0906290109l7a2c77a0radc7aad20c4263cc@mail.gmail.com> Message-ID: <320fb6e00906290711o7d503679m593cdb1df0d6c701@mail.gmail.com> Hi Cymon, I've checked in some of your patch on Bug 2865 already, recording the per-letter-annotation which I was planning to do but hadn't got round to yet - thank you: http://bugzilla.open-bio.org/show_bug.cgi?id=2865 This means with the latest code you can now use Biopython to convert a PHD output file into a FASTQ file (or a QUAL file) which could be handy for doing meta assemblies. I did relatively recently update SeqIO for the Ace format to record the qualities - but there is an issue here. Only the nucleotides get given quality scores, but not the insertions (gaps, shown as "*" in the Ace file consensus sequence). Currently the Bio.SeqIO parser gives the gapped sequence. This means to record the quality scores, we need to give some null value to the gap characters (and I used None). What I am wondering about is making the Bio.SeqIO Ace parser just return the ungapped sequence (and the associated PHRED quality scores). This means we could then convert Ace files into FASTQ or QUAL files, and also a simple Ace to FASTA conversion would give something useful for downstream analysis (the ungapped consensus). 
The gaps *are* important if you want to see how the consensus was built up - in which case it makes sense to think about each Ace contig as a kind of multiple sequence alignment. See this earlier discussion with David Winter: http://lists.open-bio.org/pipermail/biopython/2009-April/005125.html http://lists.open-bio.org/pipermail/biopython/2009-April/005128.html Any thoughts? Peter From bugzilla-daemon at portal.open-bio.org Mon Jun 29 14:30:29 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 29 Jun 2009 10:30:29 -0400 Subject: [Biopython-dev] [Bug 2866] SQLite support for BioSQL In-Reply-To: Message-ID: <200906291430.n5TEUTa2021995@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2866 ------- Comment #3 from cymon.cox at gmail.com 2009-06-29 10:30 EST ------- (In reply to comment #1) > Created an attachment (id=1334) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1334&action=view) > BioSQL SQLite support > This patch works for me with SQLite (Python 2.5.2)(both TESTDBs), Psycopg, Psycopg2, and Pgdb on Ubuntu 9.04 - PostgreSQL 8.3. Cheers, C. From jblanca at btc.upv.es Mon Jun 29 14:25:30 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Mon, 29 Jun 2009 16:25:30 +0200 Subject: [Biopython-dev] [Biopython] Bio.Sequencing.Ace In-Reply-To: <761477.83949.qm@web65501.mail.ac4.yahoo.com> References: <761477.83949.qm@web65501.mail.ac4.yahoo.com> Message-ID: <200906291625.30081.jblanca@btc.upv.es> Hi: I'm doing similar things and I took a slightly different approach. Instead of using the ace parser api I've created a contig class and my parsers return contig objects.
You can take a look at the code at: http://bioinf.comav.upv.es/svn/biolib/biolib/src/ (By the way, if you find any code in that library interesting for Biopython I would be delighted to add it to Biopython). In my library parsing an ace or a caf file works like:
>>> fhand = open('example3.ace', 'r')
>>> ace_parser = get_parser(fhand, format='ace')
>>> for contig in ace_parser:
...     print contig
You are also able to get a particular contig by giving its name.
>>> ace_parser.contigs('contig_name')
The contigs are like a list of sequences with a consensus property.
>>> contig[0] #the first sequence
>>> contig[1] #the second sequence
>>> contig.consensus #the consensus
The sequence and quality for every read are also accessible
>>> read0 = contig[0]
>>> read0.seq
>>> read0.qual
There are in fact two different coordinate systems, the contig one and the read one (because every read starts in a different place and it can be reversed). To access the read in its own coordinate system you have to ask for the sequence property of the read. In fact the Contig and the LocatableSequence classes are capable of doing more things. For instance the contig accepts 2-D indexes and returns new contigs, columns, rows, subcontigs, etc. If you find those classes interesting take a look at the code and also at the tests. There is not much documentation, but many tests. Best regards, Jose Blanca On Monday 29 June 2009 12:49:39 Fungazid wrote: > David hi, > > Many many thanks for the diagram. > I'm not sure I understand the differences between > contig.af[readn].padded_start, and contig.bs[readn].padded_start, and > other unknown parameters.
I'll try to compare to the Ace format > > Avi > > --- On Mon, 6/29/09, Peter wrote: > > From: Peter > > Subject: Re: [Biopython] Bio.Sequencing.Ace > > To: "David Winter" > > Cc: biopython at lists.open-bio.org > > Date: Monday, June 29, 2009, 10:26 AM > > On Mon, Jun 29, 2009 at 6:19 AM, David Winter wrote: > > > Quoting Peter : > > >> The top level properties are simple enough - but I find drilling > > >> down into the reads a bit more tricky. In general the Ace parser is > > >> a bit non-obvious without knowing the Ace format. Having some > > >> __str__ and __repr__ methods defined on the objects returned > > >> would be very nice - I may get time to work on this later this year. > > >> Anyone else interested in this drop us an email. > > >> > > >> Peter > > > > > > I had a scrawled diagram of the contig class next to me when I was using > > > it more frequently - it was easy enough to reproduce digitally > > > http://biopython.org/wiki/Ace_contig_class > > > > > > Hopefully it helps make sense of where all the data is. I've added a couple > > > of very brief examples there for now - will expand it when I get a chance. > > > David > > > > This could get turned into a docstring/doctest for the Ace parser :) > > > > Peter -- Jose M.
Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From p.j.a.cock at googlemail.com Mon Jun 29 14:53:12 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 29 Jun 2009 15:53:12 +0100 Subject: [Biopython-dev] [Biopython] Bio.Sequencing.Ace In-Reply-To: <200906291625.30081.jblanca@btc.upv.es> References: <761477.83949.qm@web65501.mail.ac4.yahoo.com> <200906291625.30081.jblanca@btc.upv.es> Message-ID: <320fb6e00906290753w6e2a56d8v5748aa644da051d8@mail.gmail.com> On 6/29/09, Jose Blanca wrote: > Hi: > I'm doing similar things and I took a slightly different approach. Instead > of using the ace parser api I've created a contig class and my parsers > return contig objects. You can take a look at the code at: > http://bioinf.comav.upv.es/svn/biolib/biolib/src/ Hi Jose, Are you using Bio.Sequencing.Ace in your code, or did you write a whole new parser instead? Now that I have been using Ace files in my own work, I've been meaning to look over your stuff. In some ways, a contig class can be seen as a generalisation of a multiple sequence alignment class. Certainly this is something we should improve in Biopython (as you might gather from some of the enhancement bugs on bugzilla, I have lots of ideas for the current alignment class), and I'm sure you have some great ideas too. 
Peter From jblanca at btc.upv.es Mon Jun 29 15:16:06 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Mon, 29 Jun 2009 17:16:06 +0200 Subject: [Biopython-dev] [Biopython] Bio.Sequencing.Ace In-Reply-To: <320fb6e00906290753w6e2a56d8v5748aa644da051d8@mail.gmail.com> References: <761477.83949.qm@web65501.mail.ac4.yahoo.com> <200906291625.30081.jblanca@btc.upv.es> <320fb6e00906290753w6e2a56d8v5748aa644da051d8@mail.gmail.com> Message-ID: <200906291716.06607.jblanca@btc.upv.es> > Are you using Bio.Sequencing.Ace in your code, or did you write a whole > new parser instead? I wrote one, because I wanted to be able to get one particular contig or just the contig or the read names. But I don't think that is a problem. I guess that the Biopython parser could be easily adapted to that. > Now that I have been using Ace files in my own work, I've been meaning > to look over your stuff. In some ways, a contig class can be seen as a > generalisation of a multiple sequence alignment class. Certainly this is > something we should improve in Biopython (as you might gather from > some of the enhancement bugs on bugzilla, I have lots of ideas for the > current alignment class), and I'm sure you have some great ideas too. I think that here is the main deviation from Biopython. The contig class is similar to an alignment class, in fact my contig classes should be compatible with your new alignment proposal API.
alignment
seq1 +++++++++>
seq2 +++++++++>
seq3 +++++++++>

contig
seq1 ++++>
seq2     +++++>
seq3         ++++++>
Basically every read has a different coordinate system in the contig case. What I've done is to create a class named LocatableSequence that is a container for sequence objects. It works like:
>>> seq1 = 'ATCG'
>>> locseq1 = locate_sequence(seq1, location=10)
>>> locseq1[10] == 'A'
In that way the contig is a list of LocatableSequences and the coordinate system transformations are done by the LocatableSequences, not by the contig. The LocatableSequences also allow for masks.
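To make the coordinate translation in the locate_sequence example concrete, here is a toy stand-in. It is only a sketch under stated assumptions and is not the code from the biolib repository: it handles integer indexes only and ignores masks and reversed reads, and the class body below is invented for illustration.

```python
class LocatableSequence(object):
    """Toy stand-in for the class described above (not the biolib code)."""

    def __init__(self, sequence, location=0):
        self.sequence = sequence  # any indexable sequence: str, list, ...
        self.location = location  # contig coordinate of the first letter

    def __getitem__(self, contig_index):
        # Translate from contig coordinates to the read's own coordinates.
        read_index = contig_index - self.location
        if not 0 <= read_index < len(self.sequence):
            raise IndexError("Position %i is outside this read" % contig_index)
        return self.sequence[read_index]


def locate_sequence(sequence, location=0):
    """Wrap a sequence so that it can be indexed in contig coordinates."""
    return LocatableSequence(sequence, location=location)


locseq1 = locate_sequence("ATCG", location=10)
first_letter = locseq1[10]  # contig position 10 is read position 0
last_letter = locseq1[13]   # contig position 13 is read position 3
```

Keeping the offset arithmetic inside the wrapper means a contig can stay a plain list of such objects, which is the design choice argued for above.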
The LocatableSequence works with any sequence like objects, strs, Seq, SeqRecord, lists, etc. There's also a Location class that represents a fragment of a sequence. My Location class is more limited than the one in the Biopython SeqFeature. In my case the start and end should be integers. I use this class to represent the region not masked in the sequence and the Location of the sequence inside the LocatableSequence. Take a look at Contig.py and at LocatableSequence.py, these are the most relevant classes for this. Best regards, -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From cy at cymon.org Mon Jun 29 15:47:36 2009 From: cy at cymon.org (Cymon Cox) Date: Mon, 29 Jun 2009 16:47:36 +0100 Subject: [Biopython-dev] Bio.Sequencing In-Reply-To: <320fb6e00906290711o7d503679m593cdb1df0d6c701@mail.gmail.com> References: <7265d4f0906280710r32561593l2aca26abaa9c135e@mail.gmail.com> <320fb6e00906290023q57f8ccc3sde16576e23766bc5@mail.gmail.com> <7265d4f0906290049n65e63227s7537b2e83525ef3d@mail.gmail.com> <320fb6e00906290058s2d8fd93aw435eb803adbc5ac8@mail.gmail.com> <7265d4f0906290109l7a2c77a0radc7aad20c4263cc@mail.gmail.com> <320fb6e00906290711o7d503679m593cdb1df0d6c701@mail.gmail.com> Message-ID: <7265d4f0906290847i4cfc26c4q676d213a35d73ce3@mail.gmail.com> Hi Peter, 2009/6/29 Peter Cock > Hi Cymon, > > I've checked in some of your patch on Bug 2865 already, > recording the per-letter-annotation which I was planning to > do but hadn't got round to yet - thank you: > http://bugzilla.open-bio.org/show_bug.cgi?id=2865 > > This means with the latest code you can now use Biopython > to convert a PHD output file into a FASTQ file (or a QUAL > file) which could be handy for doing meta assemblies. Yeah, that's nice. 
Conversely, the reason I wrote the Phd writer is that I want to 'fake' some Phd files from FASTA and QUAL files - should/might be possible by using the default headers and equally spaced peak locations. The use-case is to fool Consed into displaying the trace (which it 'fakes') from a 454 Mira assembly ACE file output, but which it will only do if the Phd files are available. So I'm hoping to write the Phd files from the original FASTA/QUAL input files. Not sure if this is going to work, or if it's a sensible thing to be trying... > I did relatively recently update SeqIO for the Ace format to > record the qualities - but there is an issue here. Only the > nucleotides get given quality scores, but not the insertions > (gaps, shown as "*" in the Ace file consensus sequence). > Currently the Bio.SeqIO parser gives the gapped sequence. > This means to record the quality scores, we need to give > some null value to the gap characters (and I used None). > > What I am wondering about is making the Bio.SeqIO Ace > parser just return the ungapped sequence (and the > associated PHRED quality scores). This means we could > then convert Ace files into FASTQ or QUAL files, and also > a simple Ace to FASTA conversion would give something > useful for downstream analysis (the ungapped consensus). > > The gaps *are* important if you want to see how the > consensus was built up - in which case it makes sense to > think about each Ace contig as a kind of multiple sequence > alignment. See this earlier discussion with David Winter: > http://lists.open-bio.org/pipermail/biopython/2009-April/005125.html > http://lists.open-bio.org/pipermail/biopython/2009-April/005128.html > > Any thoughts? I think it's probably unwise to return an ungapped sequence/qual by default if the contig in the ACE assembly is gapped. It would be nice if the parser had a switch ungapped=True, but that's not going to work with the SeqIO interface.
Second best option would be to have an easy way of getting the ungapped SeqRecord from the gapped SeqRecord - a function somewhere in Bio.Sequencing? Anyway, I assume (haven't checked) that currently if all the contigs are free of gaps then the SeqIO.AceIO will parse them into an Ungapped alphabet which can then be written to FASTA/QUAL etc. I think this is the right way to go, if the contigs have gaps the user needs to decide how to deal with them explicitly. Cheers, C. -- From eric.talevich at gmail.com Mon Jun 29 16:50:05 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 29 Jun 2009 12:50:05 -0400 Subject: [Biopython-dev] GSoC Weekly Update 6: PhyloXML for Biopython Message-ID: <3f6baf360906290950p4a22ddbeo6abc87769f8d09bf@mail.gmail.com> Hi folks, Previously (June 22--26) I: - Wrote unit tests for: - Instantiation of all implemented elements (properly) - Serialization to an output stream -- reusing the parser tests - Made all the unit tests pass - Tweaked for performance: parsing takes about 1/3 less CPU time now - Started Writer.py, with some imports, a class called Writer, and a top-level function for triggering serialization of the whole hierarchy (just a wrapper for ElementTree.write()) - Added __str__ and __repr__ methods to the base class (used in pretty-printing) - Added the method to_rgb() to class BranchColor. It builds a 24-bit hex string representing the color that can be used from HTML/CSS directly. Just something completely different...
- Pulled from the biopython trunk This week (June 29--July 3) I will: - Write serialization methods for each class, matching Parser - Catch up on documentation (on the Biopython wiki): - Explain use cases - Basic usage of the parser - Provide guidance on parser performance (parse() is ~4x faster; compare to Bioperl and Archaopterix) Performance: The normal test suite running on apaf.xml, bcl_2.xml, phyloxml_examples.xml and ncbi_taxonomy_mollusca.xml.zip takes about 5 seconds; adding in ncbi_taxonomy_metazoa.xml.zip and the full ncbi_taxonomy.xml.zip to the utilities tests requires 256 seconds (parsing and pretty-printing), and just parsing all six files without pretty-printing or counting tags takes a total of 186 seconds. The python process creeps up to 1.6GB while parsing all six files, but stays under 40MB during the unit tests on the four more reasonably-sized files. Scheduling: The code for serializing to XML was supposed to be written last week. It was not, but I do have comprehensive tests written for it (abusing the unittest framework to re-run the original parser tests) and see no obstacles to its completion this week. I didn't completely trust the unit tests earlier last week, so I spent some time making the pretty-printer work properly, and in the process added some syntactic sugar that was scheduled for later in the project plan. I think this follows the current Biopython convention: bc = ProteinDomain(start=181, end=503, value='WD40') str(bc) # ProteinDomain WD40 repr(bc) # ProteinDomain(start=181, end=503, value=WD40) My plan is that when a phyloXML tree is exported to networkx for display and other purposes, the str() result will be the label for each node. Pulling from upstream: I intended to pull the tagged 1.51 beta of biopython from github and merge it into my own code to take advantage of some recent improvements. But I don't see the 1.51b tag anywhere. Does anyone else know what happened to that tag? 
I waited a few hours to see if it would be pushed from CVS automatically, but no luck, so I pulled from a plausible point during the lull after Peter's CVS-freeze announcement. Cheers, Eric http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML From biopython at maubp.freeserve.co.uk Tue Jun 30 07:46:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 30 Jun 2009 08:46:43 +0100 Subject: [Biopython-dev] GSoC Weekly Update 6: PhyloXML for Biopython In-Reply-To: <3f6baf360906290950p4a22ddbeo6abc87769f8d09bf@mail.gmail.com> References: <3f6baf360906290950p4a22ddbeo6abc87769f8d09bf@mail.gmail.com> Message-ID: <320fb6e00906300046t75a1f669ne4f1c6af1b4f96e3@mail.gmail.com> On Mon, Jun 29, 2009 at 5:50 PM, Eric Talevich wrote: > > Pulling from upstream: > I intended to pull the tagged 1.51 beta of biopython from github > and merge it into my own code to take advantage of some recent > improvements. But I don't see the 1.51b tag anywhere. Does > anyone else know what happened to that tag? I waited a few > hours to see if it would be pushed from CVS automatically, but > no luck, so I pulled from a plausible point during the lull after > Peter's CVS-freeze announcement. If I recall correctly, when pulling from git by default it does not fetch the tags - you have to explicitly ask for them.
Peter From p.j.a.cock at googlemail.com Tue Jun 30 08:01:28 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 30 Jun 2009 09:01:28 +0100 Subject: [Biopython-dev] [Biopython] Bio.Sequencing.Ace In-Reply-To: <200906291716.06607.jblanca@btc.upv.es> References: <761477.83949.qm@web65501.mail.ac4.yahoo.com> <200906291625.30081.jblanca@btc.upv.es> <320fb6e00906290753w6e2a56d8v5748aa644da051d8@mail.gmail.com> <200906291716.06607.jblanca@btc.upv.es> Message-ID: <320fb6e00906300101r3e3faa37l6a47295bd5e12538@mail.gmail.com> On Mon, Jun 29, 2009 at 4:16 PM, Jose Blanca wrote: >> Are you using Bio.Sequencing.Ace in your code, or did you write a whole >> new parser instead? > I wrote one, because I wanted to be able to get one particular contig or just > the contig or the read names. But I don't think that is a problem. I gues > that the biopyhon parser could be easily adapted to that. I see. This touches on the indexing discussion - the same idea on this thread would probably work on Ace files too: http://lists.open-bio.org/pipermail/biopython/2009-June/005275.html >> Now that I have been using Ace files in my own work, I've been meaning >> to look over your stuff. In some ways, a contig class can be seen as a >> generalisation of a multiple sequence alignment class. Certainly this is >> something we should improve in Biopython (as you might gather from >> some of the enhancement bugs on bugzilla, I have lots of ideas for the >> current alignment class), and I'm sure you have some great ideas too. > > I think that here is the main deviation from Biopython. The contig class is > similar to an alignment class, in fact my contig classes shoud be compatible > with your new alignment proporsal api. That's good. I agree that a specialised contig class that works like the traditional multiple sequence alignment class would be nice. It would then make sense to have Bio.AlignIO handle contigs as well as traditional multiple sequence alignments. > alignment. 
> seq1 +++++++++>
> seq2 +++++++++>
> seq3 +++++++++>
>
> contig
> seq1 ++++>
> seq2     +++++>
> seq3         ++++++>
>
> Basically every read has a different coordinate system in the contig case. > What I've done is to create a class named LocatableSequence that is a > container for sequence objects. It works like: > >>> seq1 = 'ATCG' > >>> locseq1 = locate_sequence(seq1, location=10) > >>> locseq1[10] == 'A' > In that way the contig is a list of LocatableSequences and the coordinate > system transformations are done by the LocatableSequences, not by the contig. > The LocatableSequences also allow for masks. > The LocatableSequence works with any sequence like objects, strs, Seq, > SeqRecord, lists, etc. > There's also a Location class that represents a fragment of a sequence. My > Location class is more limited than the one in the Biopython SeqFeature. In > my case the start and end should be integers. I use this class to represent > the region not masked in the sequence and the Location of the sequence inside > the LocatableSequence. > Take a look at Contig.py and at LocatableSequence.py, these are the most > relevant classes for this. > Best regards, I'll have to make some time for looking at your code. What I was thinking of was a contig class as an alignment subclass, holding a list of SeqRecord objects and offsets. The consensus might just be one element of this list - but could be handled specially. This sounds simpler than having to introduce a whole new object system, related to but different to SeqFeature objects. However, I don't yet have a sample implementation to demonstrate this. One important thing I think we should do BEFORE adding any contig class to Biopython, is get it working with at least one other contig file format in addition to Ace. I don't want to end up with a class which is too specialised for how ace contigs work.
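The alternative floated above, a contig that simply holds a list of sequences plus offsets, might look roughly like the sketch below. This is not a proposed Biopython API: the class name SimpleContig and its methods are invented for illustration, plain strings stand in for SeqRecord objects, and reversed reads, masks and the consensus are all left out.

```python
class SimpleContig(object):
    """Toy contig: a list of (offset, sequence) pairs. Hypothetical sketch."""

    def __init__(self):
        self._reads = []  # one (offset, sequence) tuple per read

    def add_read(self, sequence, offset):
        """Record a read starting at the given contig coordinate."""
        self._reads.append((offset, sequence))

    def column(self, position, pad="-"):
        """Return one letter per read at a contig position.

        Reads that do not cover the position contribute the pad character.
        """
        letters = []
        for offset, sequence in self._reads:
            index = position - offset
            if 0 <= index < len(sequence):
                letters.append(sequence[index])
            else:
                letters.append(pad)
        return "".join(letters)


contig = SimpleContig()
contig.add_read("ACGT", offset=0)
contig.add_read("GTTA", offset=2)
overlap_column = contig.column(2)  # both reads cover contig position 2
```

Here the offset arithmetic lives in the contig rather than in each read, which is the trade-off between the two designs discussed in this thread.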
Peter From biopython at maubp.freeserve.co.uk Tue Jun 30 08:18:44 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 30 Jun 2009 09:18:44 +0100 Subject: [Biopython-dev] Bio.Sequencing In-Reply-To: <7265d4f0906290847i4cfc26c4q676d213a35d73ce3@mail.gmail.com> References: <7265d4f0906280710r32561593l2aca26abaa9c135e@mail.gmail.com> <320fb6e00906290023q57f8ccc3sde16576e23766bc5@mail.gmail.com> <7265d4f0906290049n65e63227s7537b2e83525ef3d@mail.gmail.com> <320fb6e00906290058s2d8fd93aw435eb803adbc5ac8@mail.gmail.com> <7265d4f0906290109l7a2c77a0radc7aad20c4263cc@mail.gmail.com> <320fb6e00906290711o7d503679m593cdb1df0d6c701@mail.gmail.com> <7265d4f0906290847i4cfc26c4q676d213a35d73ce3@mail.gmail.com> Message-ID: <320fb6e00906300118l78ca2a98kc25278e24ad433a1@mail.gmail.com> On Mon, Jun 29, 2009 at 4:47 PM, Cymon Cox wrote: > Hi Peter, > > 2009/6/29 Peter >> >> Hi Cymon, >> >> I've checked in some of your patch on Bug 2865 already, >> recording the per-letter-annotation which I was planning to >> do but hadn't got round to yet - thank you: >> http://bugzilla.open-bio.org/show_bug.cgi?id=2865 >> >> This means with the latest code you can now use Biopython >> to convert a PHD output file into a FASTQ file (or a QUAL >> file) which could be handy for doing meta assemblies. > > Yeah, that's nice. Conversely, the reason I wrote the Phd writer is that I > want to 'fake' some Phd files from FASTA and QUAL files - should/might be > possible by using the default headers and equally spaced peak locations. The > use-case is to fool Consed into displaying the trace (which it 'fakes') from > a 454 Mira assembly ACE file output, but which it will only do if the Phd > files are available. So I'm hoping to write the Phd files from the original > FASTA/QUAL input files. Not sure if this is going to work, or if its a > sensible thing to be trying... 
That sounds reasonable - as long as you know you are faking it ;) >> I did relatively recently update SeqIO for the Ace format to >> record the qualities - but there is an issue here. Only the >> nucleotides get given quality scores, but not the insertions >> (gaps, shown as "*" in the Ace file consensus sequence). >> Currently the Bio.SeqIO parser gives the gapped sequence. >> This means to record the quality scores, we need to give >> some null value to the gap characters (and I used None). >> >> What I am wondering about is making the Bio.SeqIO Ace >> parser just return the ungapped sequence (and the >> associated PHRED quality scores). This means we could >> then convert Ace files into FASTQ or QUAL files, and also >> a simple Ace to FASTA conversion would give something >> useful for downstream analysis (the ungapped consensus). >> >> The gaps *are* important if you want to see how the >> consensus was built up - in which case it makes sense to >> think about each Ace contig as a kind of multiple sequence >> alignment. See this earlier discussion with David Winter: >> http://lists.open-bio.org/pipermail/biopython/2009-April/005125.html >> http://lists.open-bio.org/pipermail/biopython/2009-April/005128.html >> >> Any thoughts? > > I think it's probably unwise to return an ungapped sequence/qual by default > if the contig in the ACE assembly is gapped. It would be nice if the parser > had a switch ungapped=True, but that's not going to work with the SeqIO > interface. We can certainly add an ungapped optional argument to the parser in Bio.SeqIO.AceIO - that would be a small improvement, meaning the functionality would be there if you needed it (albeit a bit hidden). Several of the Bio.SeqIO parsers already have optional arguments. I have sometimes wondered about letting the SeqIO functions take a **kwargs argument, and passing these arbitrary options to the underlying parser.
This would allow for example passing wrap options to the FASTA writer, or skipping the features when parsing GenBank and EMBL. On the other hand, it gets very complicated, and detracts from the current simplicity of Bio.SeqIO (which I like). > Second best option would be to have an easy way of getting the > ungapped SeqRecord from the gapped SeqRecord - a function > somewhere in Bio.Sequencing? I've already suggested some kind of "ungapped" method for Seq objects, and yes, having this at the SeqRecord level too would solve this particular use case. Removing the per-letter-annotations associated with the gaps would be straightforward. I'm not sure what we would want to do with any features in the SeqRecord (perhaps a corner case), but most likely any SeqFeature covering a region containing a gap would be lost. > Anyway, I assume (haven't checked) that currently if all the > contigs are free of gaps then the SeqIO.AceIO will parse > them into an Ungapped alphabet which can then be written > to FASTA/QUAL etc. I think this is the right way to go, if > the contigs have gaps the user needs to decide how to deal > with them explicitly. Yes, if the Ace contig has no gaps, it will have a nice integer PHRED quality for each base, and could be saved as FASTQ or QUAL (or FASTA). The thing about "gaps" in contigs is that the consensus is really the ungapped sequence. I'd have to check but I think Newbler and CAP3 will output both FASTA and ACE files, and in the FASTA files there are no insertions/gaps in the contig sequences. What I am thinking is Bio.SeqIO could return the ungapped consensus sequences as SeqRecord objects (which can then be saved as FASTA, FASTQ, QUAL) while Bio.AlignIO could return contig-alignment objects (with the gaps, like David's cookbook but in the long run with a contig class). This has some merit, but breaks my current convention that parsing an alignment file with SeqIO works by giving each gapped sequence in each alignment in turn.
Peter From jblanca at btc.upv.es Tue Jun 30 08:31:06 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Tue, 30 Jun 2009 10:31:06 +0200 Subject: [Biopython-dev] [Biopython] Bio.Sequencing.Ace In-Reply-To: <320fb6e00906300101r3e3faa37l6a47295bd5e12538@mail.gmail.com> References: <761477.83949.qm@web65501.mail.ac4.yahoo.com> <200906291716.06607.jblanca@btc.upv.es> <320fb6e00906300101r3e3faa37l6a47295bd5e12538@mail.gmail.com> Message-ID: <200906301031.06273.jblanca@btc.upv.es> > What I was thinking of was a contig class as an alignment subclass, > holding a list of SeqRecord objects and offsets. The consensus might > just be one element of this list - but could be handled specially. This > sounds simpler than having to introduce a whole new object system, > related to but different to SeqFeature objects. However, I don't yet > have a sample implementation to demonstrate this. I thought about that implementation and I created some code. The problem I found with that approach is that the contig class code got too messy. Take into account that besides the offset you also need the masks and that some sequences could be reversed. That's why I decided to split the part that calculates the offset and the mask into a separate class. > One important thing I think we should do BEFORE adding any contig > class to Biopython, is get it working with at least one other contig file > format in addition to Ace. I don't want to end up with a class which > is too specialised for how ace contigs work. > > Peter Well, In fact my contig class is modeled after the caf file format. The ace parsing was just an afterthought, my primary interest was the caf format. -- Jose M. 
Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From bartek at rezolwenta.eu.org Tue Jun 30 08:39:01 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 30 Jun 2009 10:39:01 +0200 Subject: [Biopython-dev] GSoC Weekly Update 6: PhyloXML for Biopython In-Reply-To: <320fb6e00906300046t75a1f669ne4f1c6af1b4f96e3@mail.gmail.com> References: <3f6baf360906290950p4a22ddbeo6abc87769f8d09bf@mail.gmail.com> <320fb6e00906300046t75a1f669ne4f1c6af1b4f96e3@mail.gmail.com> Message-ID: <8b34ec180906300139q722fbd7ei241ef51d004ea2@mail.gmail.com> On Tue, Jun 30, 2009 at 9:46 AM, Peter wrote: > On Mon, Jun 29, 2009 at 5:50 PM, Eric Talevich wrote: >> improvements. But I don't see the 1.51b tag anywhere. Does >> anyone else know what happened to that tag? I waited a few >> hours to see if it would be pushed from CVS automatically, but >> no luck, so I pulled from a plausible point during the lull after >> Peter's CVS-freeze announcement. > > If I recall correctly, when pulling from git by default it does not > fetch the tags - you have to explicitly ask for them. Hi, This was actually connected to the issues with moving tags from CVS to GitHub. It's fixed now.
cheers Bartek From cy at cymon.org Tue Jun 30 09:02:04 2009 From: cy at cymon.org (Cymon Cox) Date: Tue, 30 Jun 2009 10:02:04 +0100 Subject: [Biopython-dev] Bio.Sequencing In-Reply-To: <320fb6e00906300118l78ca2a98kc25278e24ad433a1@mail.gmail.com> References: <7265d4f0906280710r32561593l2aca26abaa9c135e@mail.gmail.com> <320fb6e00906290023q57f8ccc3sde16576e23766bc5@mail.gmail.com> <7265d4f0906290049n65e63227s7537b2e83525ef3d@mail.gmail.com> <320fb6e00906290058s2d8fd93aw435eb803adbc5ac8@mail.gmail.com> <7265d4f0906290109l7a2c77a0radc7aad20c4263cc@mail.gmail.com> <320fb6e00906290711o7d503679m593cdb1df0d6c701@mail.gmail.com> <7265d4f0906290847i4cfc26c4q676d213a35d73ce3@mail.gmail.com> <320fb6e00906300118l78ca2a98kc25278e24ad433a1@mail.gmail.com> Message-ID: <7265d4f0906300202n119bac77j52f60e5680db528d@mail.gmail.com> 2009/6/30 Peter > On Mon, Jun 29, 2009 at 4:47 PM, Cymon Cox wrote: > > Hi Peter, > > > > 2009/6/29 Peter > >> > >> Hi Cymon, > >> > [...] > > Several of the Bio.SeqIO parsers already have optional arguments. > I have sometimes wondered about letting the SeqIO functions take > a **kwargs argument, and passing these arbitrary options to the > underlying parser. This would allow for example passing wrap options > to the FASTA writer, or skipping the features when parsing GenBank > and EMBL. On the other hand, it gets very complicated, and detracts > from the current simplicity of Bio.SeqIO (which I like). It's a bit of a slippery slope - but it would have been nice to have a "useDefaults" switch in the PhdWriter. > > Anyway, I assume (haven't checked) that currently if all the > > contigs are free of gaps then the SeqIO.AceIO will parse > > them into an Ungapped alphabet which can then be written > > to FASTA/QUAL etc. I think this is the right way to go, if > > the contigs have gaps the user needs to decide how to deal > > with them explicitly.
>
> Yes, if the Ace contig has no gaps, it will have a nice integer
> PHRED quality for each base, and could be saved as FASTQ
> or QUAL (or FASTA).
>
> The thing about "gaps" in contigs is that the consensus is
> really the ungapped sequence.

Yes, but... there is still some ambiguity over the consensus sequence which
is lost in the ungapped sequence. OK, so this isn't such a big deal with the
massive coverages achieved by 454 tech but I can imagine cases of hybrid
Sanger/454 where this might be an issue (might be scraping the bottom of the
barrel a bit here...).

> I'd have to check but I think
> Newbler and CAP3 will output both FASTA and ACE files,
> and in the FASTA files there are no insertions/gaps in the
> contig sequences.

For comparison, Mira outputs ACE, plus X.gapped.fasta, and X.ungapped.fasta

> What I am thinking is Bio.SeqIO could return the ungapped
> consensus sequences as SeqRecord objects (which can then
> be saved as FASTA, FASTQ, QUAL) while Bio.AlignIO
> could return contig-alignment objects (with the gaps, like
> David's cookbook but in the long run with a contig class).

Yeah, I like this. Although, I'm not sure how intuitive it is that SeqIO
would necessarily return the ungapped rather than gapped sequences - but it
kinda makes sense...

Cheers, C.
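
The gapped versus ungapped question in this exchange comes down to dropping the gap positions from a consensus. A minimal sketch of that step in plain Python follows. It assumes the gapped consensus is a string using "-" for gaps and that there is one quality value per position, with a placeholder value on the gap positions. `ungap_consensus` is a hypothetical helper written for illustration, not a Biopython function.

```python
def ungap_consensus(gapped_seq, qualities, gap="-"):
    """Return (ungapped_seq, ungapped_qualities) with gap positions removed.

    Hypothetical helper: assumes len(qualities) == len(gapped_seq) and that
    gap positions carry a placeholder quality, which is simply dropped.
    """
    if len(gapped_seq) != len(qualities):
        raise ValueError("Need exactly one quality value per position")
    # Keep only the (base, quality) pairs that are not gap characters.
    kept = [(base, qual) for base, qual in zip(gapped_seq, qualities)
            if base != gap]
    ungapped_seq = "".join(base for base, _ in kept)
    ungapped_quals = [qual for _, qual in kept]
    return ungapped_seq, ungapped_quals


# Example: one gap column in a five-column consensus.
seq, quals = ungap_consensus("AC-GT", [30, 31, 0, 32, 33])
print(seq, quals)
```

The ungapped result can be written to formats that need one quality per base, but as noted above, whatever ambiguity a gap column expressed is lost once it is removed.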

From biopython at maubp.freeserve.co.uk Tue Jun 30 09:33:00 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 30 Jun 2009 10:33:00 +0100
Subject: [Biopython-dev] Bio.Sequencing
In-Reply-To: <7265d4f0906300202n119bac77j52f60e5680db528d@mail.gmail.com>
References: <7265d4f0906280710r32561593l2aca26abaa9c135e@mail.gmail.com>
 <320fb6e00906290023q57f8ccc3sde16576e23766bc5@mail.gmail.com>
 <7265d4f0906290049n65e63227s7537b2e83525ef3d@mail.gmail.com>
 <320fb6e00906290058s2d8fd93aw435eb803adbc5ac8@mail.gmail.com>
 <7265d4f0906290109l7a2c77a0radc7aad20c4263cc@mail.gmail.com>
 <320fb6e00906290711o7d503679m593cdb1df0d6c701@mail.gmail.com>
 <7265d4f0906290847i4cfc26c4q676d213a35d73ce3@mail.gmail.com>
 <320fb6e00906300118l78ca2a98kc25278e24ad433a1@mail.gmail.com>
 <7265d4f0906300202n119bac77j52f60e5680db528d@mail.gmail.com>
Message-ID: <320fb6e00906300233r60998635lcbfe8788c73ab119@mail.gmail.com>

Cymon wrote:
>
>Peter wrote:
>> The thing about "gaps" in contigs is that the consensus is
>> really the ungapped sequence.
>
> Yes, but... there is still some ambiguity over the consensus sequence which
> is lost in the ungapped sequence. OK, so this isn't such a big deal with the
> massive coverages achieved by 454 tech but I can imagine cases of hybrid
> Sanger/454 where this might be an issue (might be scraping the bottom of the
> barrel a bit here...).
>
>Peter wrote:
>> I'd have to check but I think
>> Newbler and CAP3 will output both FASTA and ACE files,
>> and in the FASTA files there are no insertions/gaps in the
>> contig sequences.
>
> For comparison, Mira outputs ACE, plus X.gapped.fasta, and X.ungapped.fasta

That is nice and explicit. :)

>> What I am thinking is Bio.SeqIO could return the ungapped
>> consensus sequences as SeqRecord objects (which can then
>> be saved as FASTA, FASTQ, QUAL) while Bio.AlignIO
>> could return contig-alignment objects (with the gaps, like
>> David's cookbook but in the long run with a contig class).
>
> Yeah, I like this.

Cool.
I will try and look into this later in July.

> Although, I'm not sure how intuitive it is that SeqIO would
> necessarily return the ungapped rather than gapped
> sequences - but it kinda makes sense...

Yeah - I'm a bit on the fence myself about Ace to SeqRecord,
and whether gapped or ungapped makes most sense. Given
that the current Bio.SeqIO behaviour gives the gapped sequence,
I guess we should just leave it like that.

Peter

From p.j.a.cock at googlemail.com Tue Jun 30 09:47:51 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 30 Jun 2009 10:47:51 +0100
Subject: [Biopython-dev] [Biopython] Bio.Sequencing.Ace
In-Reply-To: <200906301031.06273.jblanca@btc.upv.es>
References: <761477.83949.qm@web65501.mail.ac4.yahoo.com>
 <200906291716.06607.jblanca@btc.upv.es>
 <320fb6e00906300101r3e3faa37l6a47295bd5e12538@mail.gmail.com>
 <200906301031.06273.jblanca@btc.upv.es>
Message-ID: <320fb6e00906300247ve2f45eau4ef97d3f65bb50c5@mail.gmail.com>

On Tue, Jun 30, 2009 at 9:31 AM, Jose Blanca wrote:
>> What I was thinking of was a contig class as an alignment subclass,
>> holding a list of SeqRecord objects and offsets. The consensus might
>> just be one element of this list - but could be handled specially. This
>> sounds simpler than having to introduce a whole new object system,
>> related to but different to SeqFeature objects. However, I don't yet
>> have a sample implementation to demonstrate this.
>
> I thought about that implementation and I created some code. The
> problem I found with that approach is that the contig class code got
> too messy. Take into account that besides the offset you also need
> the masks and that some sequences could be reversed. That's why
> I decided to split the part that calculates the offset and the mask
> into a separate class.

A simple masked sequence class would also be useful for Roche
SFF files which hold sequencing reads (of about 500bp) with start
and end trim points.
This is a use case separate from the location offset in an alignment - so
I'm not convinced it makes sense to do both in one class. Perhaps having
the contig class hold a list of (masked) SeqRecord objects, their offset,
and their direction would work?

>> One important thing I think we should do BEFORE adding any contig
>> class to Biopython, is get it working with at least one other contig file
>> format in addition to Ace. I don't want to end up with a class which
>> is too specialised for how ace contigs work.
>
> Well, in fact my contig class is modeled after the caf file format.
> The ace parsing was just an afterthought, my primary interest
> was the caf format.

Well, as the CAF file format was an extension of the ACE format,
perhaps a third contig format would be worth looking at before
considering if a contig class would be sufficiently general.

Have you got any links to the CAF file format you found useful
when writing your parser? In addition to:
http://www.sanger.ac.uk/Software/formats/CAF/
http://www.genome.org/cgi/content/full/8/3/260

Thanks,

Peter

From bugzilla-daemon at portal.open-bio.org Tue Jun 30 17:13:29 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 30 Jun 2009 13:13:29 -0400
Subject: [Biopython-dev] [Bug 2867] New: Bio.PDB.PDBList.update_pdb calls invalid os.cmd
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2867

           Summary: Bio.PDB.PDBList.update_pdb calls invalid os.cmd
           Product: Biopython
           Version: 1.51b
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: mlrodrigues at igc.gulbenkian.pt

As listed in the traceback, this module tries to use an invalid function
from the os module.

Traceback (most recent call last):
  File "update_pdb.py", line 33, in <module>
    update(pdblist, my_try)
  File "update_pdb.py", line 27, in update
    x.update_pdb()
  File "/usr/lib/pymodules/python2.5/Bio/PDB/PDBList.py", line 280, in update_pdb
    os.cmd('mv %s %s'%(old_file,new_file))
AttributeError: 'module' object has no attribute 'cmd'

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
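
The failing call in the traceback above is `os.cmd('mv %s %s'%(old_file,new_file))`; the `os` module has no `cmd` function. One possible replacement, sketched here under the assumption that the intent is simply to move a file, is the standard library's `shutil.move`, which avoids shelling out to `mv` altogether. The `move_file` wrapper and the file names are hypothetical and are not taken from `PDBList.py`.

```python
import os
import shutil
import tempfile


def move_file(old_file, new_file):
    """Move old_file to new_file without calling an external 'mv'.

    Hypothetical wrapper for illustration: shutil.move is portable and
    needs no shell quoting of the two paths.
    """
    shutil.move(old_file, new_file)


# Small self-check in a throwaway directory (file names are made up).
workdir = tempfile.mkdtemp()
old_path = os.path.join(workdir, "pdb1abc.ent")
new_path = os.path.join(workdir, "obsolete_pdb1abc.ent")
with open(old_path, "w") as handle:
    handle.write("HEADER    EXAMPLE\n")
move_file(old_path, new_path)
print(os.path.exists(old_path), os.path.exists(new_path))
shutil.rmtree(workdir)
```

A direct `os.system('mv ...')` call would also run on Linux, but it depends on an external command and on the paths being safe to pass through a shell.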