From chapmanb at uga.edu Mon Feb 2 22:54:52 2004
From: chapmanb at uga.edu (Brad Chapman)
Date: Sat Mar 5 14:43:30 2005
Subject: [Biopython-dev] hmmpfam parser
Message-ID: <20040203035452.GA67076@evostick.agtec.uga.edu>

Hi Wagied;

> I have some code which is able to parse hmmer output,
> as well as code donated by Joanne Adamkewicz from Exilexis.
>
> If you guys/gals find it useful, updates and modification will be done!

Thanks for sending this -- hmmpfam parsing code in Biopython is
definitely something we need. A few notes on what you sent:

1. I'm guessing that PfamParser.py and ExPFam.py are completely
separate pieces of code (except for both dealing with parsing Pfam).
For Biopython, PfamParser.py is the more generally useful piece of
code since it provides an interface to parse a hmmpfam result into a
record-like object. So I'll probably restrict my comments to that code.

2. Is there a methodology that you use to iterate over a file full of
hmmpfam results? Most parsers in Biopython include a parser for
individual records and then an iterator so that you can apply the
parser to a file full of results.

3. Some of the code does not follow the naming conventions that we
normally use in Biopython. Specifically:

a. Functions should be lowercase_separated_by_underscores style.

b. Variables should be lowercase_underscores style or alltogether
style. One of the things which was confusing to me in your code is
that you alternate between the lowercase_underscores style and
ALL_UPPERCASE style. At least in my experience ALL_UPPERCASE is
normally reserved for "constants."

c. You provide a lot of accessor methods for class variables (i.e.
getAccession for self.accession). Normally in python you just access
the variable directly (or preface it with an underscore like
self._internal if the variable is for internal class use) -- the
getWhatever functions are more java-like.

d. There are lots of unnecessary semicolons in the code.
They don't hurt anything, but again make the code look more Java-like than python-like. e. On the class __init__()'s you have code that looks like: def __init__(self, variable = None): if variable is not(None): # do something with variable else: # raise an error You can eliminate all of this by just requiring the variable in the initializer: def __init__(self, variable): # do something with variable And let python take care of the error checking that something was passed. Generally, the documentation on contributing to Biopython talks more about style issues we try to stick to; so that a heterogeneous project such as this can be as uniform as possible: http://biopython.org/docs/developer/contrib.html Hopefully all that is helpful -- we'd be very happy to accept the code with some modifications along the lines of what I've mentioned above, so I'm definitely not trying to be discouraging by enumerating those points above. We just want to make sure the code that gets in is as easy to understand and maintain as possible. Thanks again for the mail and please don't hesitate to ask any other questions! Brad From chapmanb at uga.edu Mon Feb 2 23:01:34 2004 From: chapmanb at uga.edu (Brad Chapman) Date: Sat Mar 5 14:43:30 2005 Subject: [Biopython-dev] Explanation for late responses to development list mails Message-ID: <20040203040134.GC45748@evostick.agtec.uga.edu> Hey all; I just realized (with the help of good Jeff as always) that my mails to the development list have been getting discarded for the past month or so. Apparently I write a lot like automated spammers -- man, am I feelin' some self-confidence now :-). But the point is that I'm going to try and dig through my sent box and forward on some mails which never saw the light of day. So, if some responses I send now seem especially non-timely, I blame it entirely on the e-mail system and not my slacking. 
Sorry about this -- if anything you wrote to the development list got
no attention, please send me a mail so that I know about it.

Thanks!
Brad

From chapmanb at uga.edu Mon Feb 2 23:02:08 2004
From: chapmanb at uga.edu (Brad Chapman)
Date: Sat Mar 5 14:43:30 2005
Subject: [Biopython-dev] Bio.Wise checked in
Message-ID: <20040203040208.GB67076@evostick.agtec.uga.edu>

Hi Michael;

> I have checked in Bio.Wise, which contains modules for running and
> processing the output of some of the models in the Wise package
> available from
> :
>
> Bio.Wise.psw for protein Smith-Waterman alignments
> Bio.Wise.dnal for Smith-Waterman DNA alignments

Great! Thanks for doing this!

> There are also appropriate unit tests which will not be checked if
> dnal is not in your path.

Right now I don't have wise installed, and the test is failing
instead of being skipped:

7:38pm Tests> python run_tests.py test_Wise.py
test_Wise ... dnal: not found
FAIL
======================================================================
FAIL: test_Wise
----------------------------------------------------------------------
Traceback (most recent call last):
  File "run_tests.py", line 148, in runTest
    self.runSafeTest()
  File "run_tests.py", line 185, in runSafeTest
    expected_handle)
  File "run_tests.py", line 285, in compare_output
    assert expected_line == output_line, \
AssertionError:
Output  : 'test_dnal (test_Wise.TestWiseDryRun) ... FAIL\n'
Expected: 'test_dnal (test_Wise.TestWiseDryRun) ... ok\n'
----------------------------------------------------------------------
Ran 1 tests in 0.075s

It looks like commands.getoutput returns something different on my
machine than on yours:

>>> import commands
>>> commands.getoutput("dnal")
'dnal: not found'

I changed requires_wise.py a bit so it takes care of this case. So
just a minor thing.

Thanks again for this!
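
As a rough illustration of the "skip when the tool is missing" idea, here is a minimal sketch in modern Python. It assumes shutil.which and subprocess.run, both of which postdate this thread; the helper names tool_available and run_tool are hypothetical and are not what requires_wise.py actually contains.

```python
import shutil
import subprocess


def tool_available(name):
    """Return True if an executable called `name` is found on the PATH."""
    return shutil.which(name) is not None


def run_tool(name, *args):
    """Run the tool and return its stdout, or None if it is not installed.

    Hypothetical helper for illustration only; not part of Biopython.
    """
    if not tool_available(name):
        # Caller can treat None as "skip this test" rather than "fail".
        return None
    result = subprocess.run([name, *args], capture_output=True, text=True)
    return result.stdout
```

Checking the PATH up front avoids depending on the exact wording of the shell's "not found" message, which is what differed between the two machines here.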
Brad From chapmanb at uga.edu Mon Feb 2 23:04:01 2004 From: chapmanb at uga.edu (Brad Chapman) Date: Sat Mar 5 14:43:30 2005 Subject: [Biopython-dev] Contribution -- NMR xpk files Message-ID: <20040203040401.GC67076@evostick.agtec.uga.edu> Hi Bob and all; > I have contributed some code to biopython for working with NMR data to be > included in the CVS, probably in the NMR package. Along with the two > modules (xpktools.py and NOEtools.py) is an example script > (simplepredict.py) and an input file (noed.xpk). I think you will find > the example script to be well documented and readable. Great! Thanks for sending this my way. I've checked the modules into Bio.NMR and the example code and input file now live in Doc/examples/nmr. Everything seems to work on my machine (well, at least it runs without any errors -- without any NMR knowledge I'm not so good at interpreting the output :-), but if you could check and be sure I didn't mess anything that would be great. It looks like everything has already migrated over to anonymous CVS. > This new functionality will enable biopython users to perform analysis > and data extraction of NMR data whether in the form of data tables or > directly from .xpk peaklist files. Again, we really appreciate this. As with everything in Biopython it takes the specific knowledge about an area to have code that handles the bioinformatics challenges well. Go NMR, go. Well-at-least-I-know-what-NMR-stands-for-ly yr's, Brad From idoerg at burnham.org Mon Feb 2 23:13:55 2004 From: idoerg at burnham.org (Iddo Friedberg) Date: Sat Mar 5 14:43:30 2005 Subject: [Biopython-dev] Explanation for late responses to development list mails In-Reply-To: <20040203040134.GC45748@evostick.agtec.uga.edu> Message-ID: On Mon, 2 Feb 2004, Brad Chapman wrote: > Hey all; > I just realized (with the help of good Jeff as always) that my mails > to the development list have been getting discarded for the past month or > so. 
> Apparently I write a lot like automated spammers -- man, am I
> feelin' some self-confidence now :-).

I told you to change the name for the new V1agraPr0nStar class...

(Let's see if this makes it to the list)

false-negatively-y'rs,
Iddo

From hoffman at ebi.ac.uk Tue Feb 3 04:14:24 2004
From: hoffman at ebi.ac.uk (Michael Hoffman)
Date: Sat Mar 5 14:43:30 2005
Subject: [Biopython-dev] Re: Bio.Wise checked in
In-Reply-To: <20040203040208.GB67076@evostick.agtec.uga.edu>
References: <20040203040208.GB67076@evostick.agtec.uga.edu>
Message-ID:

On Mon, 2 Feb 2004, Brad Chapman wrote:

> >>> commands.getoutput("dnal")
> 'dnal: not found'
>
> I changed requires_wise.py a bit so it takes care of this case. So
> just a minor thing.

Thanks!
--
Michael Hoffman
European Bioinformatics Institute

From chapmanb at uga.edu Wed Feb 4 19:04:53 2004
From: chapmanb at uga.edu (Brad Chapman)
Date: Sat Mar 5 14:43:30 2005
Subject: [Biopython-dev] hmmpfam parser
In-Reply-To: <401F8BDD.73417BCD@ebc.uu.se>
References: <401AC30A.E639BD2E@ebc.uu.se> <20040203014909.GD17947@evostick.agtec.uga.edu> <401F8BDD.73417BCD@ebc.uu.se>
Message-ID: <20040205000453.GJ907@evostick.agtec.uga.edu>

Hi Wagied;

> ExPfam is code donated by Joanne Adamkewicz at Exelixis. I guess
> they use it at the "module" level rather than at the class level.

Okay. That makes more sense now. This would need a bit of work to be
"Biopython-like." It does do a good job of solving a particular
problem, but normally we try to focus the code that gets in on being
as broadly applicable as possible (hence the emphasis on parsers and
iterators and the like).

> The record/entry structurs are ideally collected in a hash...let me
> check the code..will write than in!

I'm not sure what exactly you mean here. What does this refer to?

> I can write in an Iterator object to traverse the records, if that is
> necessary.
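
For readers following along, a record iterator of the kind being discussed might look roughly like the sketch below. It is written as a modern Python generator rather than in the Biopython iterator style of the time, and it assumes (as the attached parser does) that each record starts with a "Query sequence:" line and ends with a line starting with "//". The function name iter_hmmpfam_records and the sample text are made up for illustration.

```python
import io


def iter_hmmpfam_records(handle):
    """Yield one raw hmmpfam record (as a string) at a time.

    Assumes a record starts at a 'Query sequence:' line and ends at a
    line starting with '//'. An unterminated trailing record is dropped.
    """
    lines = []
    in_record = False
    for line in handle:
        if not in_record:
            if line.startswith("Query sequence:"):
                in_record = True
                lines = [line.rstrip()]
            continue
        lines.append(line.rstrip())
        if line.startswith("//"):
            yield "\n".join(lines)
            in_record = False


# Made-up input, only to show the calling convention.
sample = io.StringIO(
    "Query sequence: seq1\nAccession: A1\n//\n"
    "Query sequence: seq2\n//\n"
)
for record_text in iter_hmmpfam_records(sample):
    print(record_text.splitlines()[0])
```

Each yielded string can then be handed to a record parser, which keeps "find the record boundaries" and "parse one record" as separate jobs.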
It would be really nice to have -- I think the most common use case
(at least in my experience with hmmpfam) is to have a file full of
searches that need to be parsed out. It would really expand the
usability of the code.

> Mmhhh you noticed.-..the Java-like coding....I generally try to prevent
> direct access of class instance variables, rubbed off from Java.

Yes, this really isn't necessary in Biopython. It isn't terrible to
have (in addition to direct attribute access), but does require
extra work coding all those functions up. Generally, the Python
viewpoint on this is that you "trust" users of your code to access
the variables correctly. Compared to Java coders we tend to be a bit
more easygoing about that sort of thing.

> Made the releveant changes you suggested.
> Here it is again:
[...]
> Please don't hesitate to ask any other questions!

Thanks -- this does clear up a few things. Not to be a pain, but
there are a few things I mentioned before that are still present in
the code you sent:

1. Function names are still in thisStyle instead of this_style. This
is the most serious problem, as it really helps to have consistent
naming conventions throughout the Biopython codebase -- as much as
possible.

2. An iterator -- as I mentioned before, this would really improve
the usability of the code.

3. All the ';'s in the code. This is a more minor gripe, but poor
python programmers aren't used to looking at those. A search/replace
on them will likely get rid of them all and make it look much nicer.

I'd be very happy to check it in with those changes.

> I was thinking of becoming a developer, will need to go thru biopython's
> coding
> guidelines. If I could be added to the developers list, would be great!

Definitely. The participants page is at:

http://www.biopython.org/participants/

This is editable on the web (with username 'biopython' and password
'user').
I'd encourage you to enter your information there (just sign in, click 'edit this page' then 'add new') and be included. We are all about giving credit to the people that contribute (as these are the people that make it happen :-). Thanks again for the continued work! Brad From hoffman at ebi.ac.uk Thu Feb 5 04:20:36 2004 From: hoffman at ebi.ac.uk (Michael Hoffman) Date: Sat Mar 5 14:43:30 2005 Subject: [Biopython-dev] hmmpfam parser In-Reply-To: <20040205000453.GJ907@evostick.agtec.uga.edu> References: <401AC30A.E639BD2E@ebc.uu.se> <20040203014909.GD17947@evostick.agtec.uga.edu> <401F8BDD.73417BCD@ebc.uu.se> <20040205000453.GJ907@evostick.agtec.uga.edu> Message-ID: On Wed, 4 Feb 2004, Brad Chapman wrote: > > Mmhhh you noticed.-..the Java-like coding....I generally try to prevent > > direct access of class instance variables, rubbed off from Java. > > Yes, this really isn't necessary in Biopython. It isn't terrible to > have (in addition to direct attribute access), but does require > extra work coding all those functions up. Generally, the Python > viewpoint on this is that you "trust" users of your code to access > the variables correctly. Compared to Java coders we tend to be a bit > more easygoing about that sort of thing. It is also worth noting that, unlike in Java, if you need to add accessor methods in later to do some processing you can always do so by changing the attribute into a property. http://www.python.org/2.2.1/descrintro.html#property This should get you all of the benefits of encapsulation later if you want it without having to deal with all of the cruft. Until then, YAGNI (you aren't gonna need it). 
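
To make the property point concrete, here is a small sketch. It uses the decorator syntax from later Python versions (the 2.2-era spelling was property(fget, fset)), and the Record class, its attribute, and the sample values are invented for illustration.

```python
class Record:
    """Toy record; the class and attribute names are illustrative only."""

    def __init__(self, accession):
        # Plain attribute syntax; this goes through the setter below.
        self.accession = accession

    @property
    def accession(self):
        return self._accession

    @accession.setter
    def accession(self, value):
        # Processing added later, without changing how callers
        # read or assign the attribute.
        self._accession = value.strip().upper()


record = Record(" pf00017 ")
print(record.accession)
record.accession = "pf00018"
print(record.accession)
```

Callers keep writing record.accession throughout, so a class can start with a bare attribute and only grow a property if some processing turns out to be needed.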
-- Michael Hoffman European Bioinformatics Institute From chapmanb at uga.edu Fri Feb 6 11:29:35 2004 From: chapmanb at uga.edu (Brad Chapman) Date: Sat Mar 5 14:43:31 2005 Subject: [Biopython-dev] Re: biopython database connectivity error In-Reply-To: <1076029748.4022e934509fb@webmail.njit.edu> References: <033e01c3c654$549354c0$2b113b86@christen2002> <20040105223401.GC9588@evostick.agtec.uga.edu> <1076029748.4022e934509fb@webmail.njit.edu> Message-ID: <20040206162935.GC31847@evostick.agtec.uga.edu> Hello Chidambaram; > i have insattled pyhton biopython and all the necessary modules but i > cannot connect to the database iam getting errors i dont know why [...] > ython 2.2.3 (#42, May 30 2003, 18:12:08) [MSC 32 bit (Intel)] on win32 > Type "help", "copyright", "credits" or "license" for more information. > >>> from BioSQL import BioSeqDatabase > >>> server = BioSeqDatabase.open_database(driver ="MySQLdb", user = "chapmanb", > ... passwd = "biopython", host = "localhost", db= "bioseqdb") > Traceback (most recent call last): > File "", line 2, in ? > File "C:\PROGRA~1\Lib\site-packages\BioSQL\BioSeqDatabase.py", line 51, in > ope > n_database > conn = connect(**kw) > File "C:\PROGRA~1\Lib\site-packages\MySQLdb\__init__.py", line 63, in Connect > return apply(Connection, args, kwargs) > File "C:\PROGRA~1\Lib\site-packages\MySQLdb\connections.py", line 58, in > __ini > t__ > self._db = apply(connect, args, kwargs2) > _mysql_exceptions.OperationalError: (2003, "Can't connect to MySQL server > on 'localhost' (10061)") The problem is not a biopython one, but rather a MySQL one. The relevant error is the last line: "Can't connect to MySQL server on 'localhost' (10061)" Based on the lines you pasted, you are using my code directly from an example file or documentation. 
If you look at this line: server = BioSeqDatabase.open_database(driver ="MySQLdb", user = "chapmanb", passwd = "biopython", host = "localhost", db= "bioseqdb") this indicates you are trying to connect to the database with the username "chapmanb" (my normal username :-) and the password "biopython." For your use you'll have to use your actual username and password to connect to your local MySQL installation. You will also have to have created the "bioseqdb" database, and populated it with the BioSQL schema. Hope this helps some! Brad From a.cavallo at reading.ac.uk Wed Feb 11 08:02:24 2004 From: a.cavallo at reading.ac.uk (Antonio Cavallo) Date: Sat Mar 5 14:43:31 2005 Subject: [Biopython-dev] (no subject) Message-ID: Hy, there is my problem. I would like to retrieve some accessions from embl data source, and I've read the tutorial so: ================================================================================== >>> from Bio import db >>> sp = db["embl"] >>> record_handle = sp['AA054823'] Traceback (most recent call last): File "", line 1, in ? File "/home/users/antonio/usr/encap/biopython-1.23/lib/python/Bio/config/DBRegistry.py", line 152, in __getitem__ data = self._run_serial(key) File "/home/users/antonio/usr/encap/biopython-1.23/lib/python/Bio/config/DBRegistry.py", line 219, in _run_serial raise KeyError, "I could not get any results." KeyError: 'I could not get any results.' ================================================================================== This error seems strange because that entry does exist! Using other sources: ================================================================================== >>> sp = db['embl-dbfetch-cgi'] >>> record_handle = sp['AA054823'] Traceback (most recent call last): File "", line 1, in ? 
File "/home/users/antonio/usr/encap/biopython-1.23/lib/python/Bio/config/DBRegistry.py", line 89, in __getitem__ return self._get(key) File "/home/users/antonio/usr/encap/biopython-1.23/lib/python/Bio/config/_support.py", line 109, in __call__ return self.fn(*args, **keywds) File "/home/users/antonio/usr/encap/biopython-1.23/lib/python/Bio/config/DBRegistry.py", line 267, in _get handle = self._cgiopen(key) File "/home/users/antonio/usr/encap/biopython-1.23/lib/python/Bio/config/DBRegistry.py", line 274, in _cgiopen options = _my_urlencode(params) File "/home/users/antonio/usr/encap/biopython-1.23/lib/python/Bio/config/DBRegistry.py", line 561, in _my_urlencode params = params.items() AttributeError: 'list' object has no attribute 'items' ================================================================================== And more: ================================================================================== >>> sp = db['embl-fast'] >>> record_handle = sp['AA054823'] Traceback (most recent call last): File "", line 1, in ? File "/home/users/antonio/usr/encap/biopython-1.23/lib/python/Bio/config/DBRegistry.py", line 150, in __getitem__ data = self._run_concurrent(key) File "/home/users/antonio/usr/encap/biopython-1.23/lib/python/Bio/config/DBRegistry.py", line 202, in _run_concurrent raise KeyError, "I could not get any results." KeyError: 'I could not get any results.' >>> ================================================================================== What's wrong? After installing the biopython-1.23 there is something else I have to do in order to get access to the embl database? Sorry but I'm totally new to biopython. 
Thank you in advance, antonio From a.cavallo at reading.ac.uk Wed Feb 11 12:20:04 2004 From: a.cavallo at reading.ac.uk (Antonio Cavallo) Date: Sat Mar 5 14:43:31 2005 Subject: [Biopython-dev] (no subject) In-Reply-To: <045A60AE-5CB4-11D8-AD2C-000A956845CE@stanfordalumni.org> References: <045A60AE-5CB4-11D8-AD2C-000A956845CE@stanfordalumni.org> Message-ID: On Wed, 11 Feb 2004, Jeffrey Chang wrote: Now it seems ok: in effect I'm running on an un-usual layout (but very updated so when do you need a beta tester I'm here). thank you very much, antonio > Hi Antonio, > > I think I see what's going on. Are you using Python 2.3? It looks > like Python 2.3 has changed the behavior of operator.isMappingType with > respect to lists. In Python2.2, it returns 0, and in Python 2.3, it > returns true. > > The code in Bio/config/DBRegistry.py expects lists to not be a mapping > type, which causes problems. The fix is to change the following code > in that file: > if operator.isMappingType(params): > params = params.items() > to: > if operator.isMappingType(params) and hasattr(params, "items"): > params = params.items() > > I've made this change, and your code is working again. > > I've updated this in the CVS, and it will propogate to the anonymous > CVS in a few hours. Please let me know if there are further problems. > > Jeff > > > > > On Feb 11, 2004, at 8:02 AM, Antonio Cavallo wrote: > > > > > Hy, > > > > there is my problem. > > I would like to retrieve some accessions from embl data source, and > > I've > > read the tutorial so: > > > > ======================================================================= > > =========== > >>>> from Bio import db > >>>> sp = db["embl"] > >>>> record_handle = sp['AA054823'] > > Traceback (most recent call last): > > File "", line 1, in ? 
> > File > > "/home/users/antonio/usr/encap/biopython-1.23/lib/python/Bio/config/ > > DBRegistry.py", > > line 152, in __getitem__ > > data = self._run_serial(key) > > File > > "/home/users/antonio/usr/encap/biopython-1.23/lib/python/Bio/config/ > > DBRegistry.py", > > line 219, in _run_serial > > raise KeyError, "I could not get any results." > > KeyError: 'I could not get any results.' > > ======================================================================= > > =========== > > > > > > This error seems strange because that entry does exist! > > Using other sources: > > > > > > ======================================================================= > > =========== > >>>> sp = db['embl-dbfetch-cgi'] > >>>> record_handle = sp['AA054823'] > > Traceback (most recent call last): > > File "", line 1, in ? > > File > > "/home/users/antonio/usr/encap/biopython-1.23/lib/python/Bio/config/ > > DBRegistry.py", > > line 89, in __getitem__ > > return self._get(key) > > File > > "/home/users/antonio/usr/encap/biopython-1.23/lib/python/Bio/config/ > > _support.py", > > line 109, in __call__ > > return self.fn(*args, **keywds) > > File > > "/home/users/antonio/usr/encap/biopython-1.23/lib/python/Bio/config/ > > DBRegistry.py", > > line 267, in _get > > handle = self._cgiopen(key) > > File > > "/home/users/antonio/usr/encap/biopython-1.23/lib/python/Bio/config/ > > DBRegistry.py", > > line 274, in _cgiopen > > options = _my_urlencode(params) > > File > > "/home/users/antonio/usr/encap/biopython-1.23/lib/python/Bio/config/ > > DBRegistry.py", > > line 561, in _my_urlencode > > params = params.items() > > AttributeError: 'list' object has no attribute 'items' > > ======================================================================= > > =========== > > > > And more: > > > > ======================================================================= > > =========== > >>>> sp = db['embl-fast'] > >>>> record_handle = sp['AA054823'] > > Traceback (most recent call last): > > File "", 
line 1, in ? > > File > > "/home/users/antonio/usr/encap/biopython-1.23/lib/python/Bio/config/ > > DBRegistry.py", > > line 150, in __getitem__ > > data = self._run_concurrent(key) > > File > > "/home/users/antonio/usr/encap/biopython-1.23/lib/python/Bio/config/ > > DBRegistry.py", > > line 202, in _run_concurrent > > raise KeyError, "I could not get any results." > > KeyError: 'I could not get any results.' > >>>> > > ======================================================================= > > =========== > > > > > > > > > > What's wrong? After installing the biopython-1.23 there is something > > else > > I have to do in order to get access to the embl database? > > Sorry but I'm totally new to biopython. > > Thank you in advance, > > antonio > > _______________________________________________ > > Biopython-dev mailing list > > Biopython-dev@biopython.org > > http://biopython.org/mailman/listinfo/biopython-dev > > From hoffman at ebi.ac.uk Wed Feb 11 12:20:38 2004 From: hoffman at ebi.ac.uk (Michael Hoffman) Date: Sat Mar 5 14:43:31 2005 Subject: [Biopython-dev] operator.isMappingType In-Reply-To: References: <045A60AE-5CB4-11D8-AD2C-000A956845CE@stanfordalumni.org> Message-ID: On Wed, 11 Feb 2004, Antonio Cavallo wrote: > On Wed, 11 Feb 2004, Jeffrey Chang wrote: > > > I think I see what's going on. Are you using Python 2.3? It looks > > like Python 2.3 has changed the behavior of operator.isMappingType with > > respect to lists. In Python2.2, it returns 0, and in Python 2.3, it > > returns true. 
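
The guard Jeff describes, checking for an items() method instead of relying on operator.isMappingType, can be sketched in modern Python as below. This is an illustration, not the Biopython code: encode_params is a hypothetical stand-in for _my_urlencode, and the parameter names in the example are assumptions. Note that operator.isMappingType no longer exists in Python 3.

```python
from urllib.parse import urlencode


def encode_params(params):
    """URL-encode params given as a mapping or a list of (key, value) pairs.

    Hypothetical stand-in for illustration; not Biopython's _my_urlencode.
    """
    # Duck typing: only objects that really have items() are treated as
    # mappings, so a plain list of pairs passes through untouched.
    if hasattr(params, "items"):
        params = list(params.items())
    return urlencode(params)


print(encode_params({"id": "AA054823"}))
print(encode_params([("db", "embl"), ("id", "AA054823")]))
```

An isinstance check against collections.abc.Mapping would be a stricter alternative to the hasattr test.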
I think operator.isMappingType is destined to be removed:

http://mail.python.org/pipermail/python-list/2003-November/192444.html
http://mail.python.org/pipermail/python-dev/2003-November/040307.html
--
Michael Hoffman
European Bioinformatics Institute

From mcolosimo at mitre.org Thu Feb 12 10:09:09 2004
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Sat Mar 5 14:43:31 2005
Subject: [Biopython-dev] mxTextTools link
Message-ID: <685E5F67-5D6D-11D8-9030-000A95A5D8B2@mitre.org>

In the setup.py file, the link for mxTextTools is outdated (this is
from the current cvs files). It is listed as:

You can find mxTextTools at
http://www.lemburg.com/files/python/mxExtensions.html.

And should be:

http://www.egenix.com/files/python/eGenix-mx-Extensions.html

Marc

From chapmanb at uga.edu Thu Feb 12 18:54:19 2004
From: chapmanb at uga.edu (Brad Chapman)
Date: Sat Mar 5 14:43:31 2005
Subject: [Biopython-dev] mxTextTools link
In-Reply-To: <685E5F67-5D6D-11D8-9030-000A95A5D8B2@mitre.org>
References: <685E5F67-5D6D-11D8-9030-000A95A5D8B2@mitre.org>
Message-ID: <20040212235419.GC2841@evostick.agtec.uga.edu>

Hi Marc;

> In the setup.py file, the link for mxTextTools is out dated (this is
> from the current cvs files). It is listed as:
>
> You can find mxTextTools at
> http://www.lemburg.com/files/python/mxExtensions.html.
>
> And should be:
>
> http://www.egenix.com/files/python/eGenix-mx-Extensions.html

Thanks much for the heads up -- fixed in CVS.

Brad

From chapmanb at uga.edu Thu Feb 12 18:52:32 2004
From: chapmanb at uga.edu (Brad Chapman)
Date: Sat Mar 5 14:43:31 2005
Subject: [Biopython-dev] hmmpfam parser
In-Reply-To: <40226C51.F70315B7@ebc.uu.se>
References: <401AC30A.E639BD2E@ebc.uu.se> <20040203014909.GD17947@evostick.agtec.uga.edu> <401F8BDD.73417BCD@ebc.uu.se> <20040205000453.GJ907@evostick.agtec.uga.edu> <40226C51.F70315B7@ebc.uu.se>
Message-ID: <20040212235232.GB2841@evostick.agtec.uga.edu>

Hi Wagied;

> Hopefully this meets the guidelines.
[...New version of the parser deleted...] I've had a good hard look at this and done a number of fix-ups and things to try and make things more understandable to myself, easier to maintain, and more conformant to the normal "Biopython way" of doing things. I've attached a new version to this mail which cleans up a number of things: 1. Iterator and parser are separate entities, and now behave like standard Biopython iterators and parsers. 2. Cleaned up functions -- some functions (specifically parse()) ran over multiple pages with heavy indenting. This is really hard to follow and maintain -- I split things into separate internal functions. 3. Got rid of lots of use of constants, for standard things like newlines and other things. Again, these make the code harder to follow. 4. Removed some of the unnecessary variables which look like they are left over from coding the parser. 5. Wrapped lines to 80 characters or less. This leaves things much better and gives a better idea of where the parser is at. If you like these changes and want to keep working on it, I'd suggest that a couple of things are still missing which could use coding. 1. The domains and families can be extracted as XML, but not accessed through a class. The HmmpfamRecord class really needs to have family_scores and parsed_domains be lists of objects which have all of the elements (model, description, e-value...) as attributes of these classes. An excellent example of how this is done is the BLAST Record class in Bio/Blast/Record.py, which is also documented at: http://biopython.org/docs/tutorial/Tutorial004.html#toc10 2. There needs to be some similar way to access the alignments, so that they are also parsed into classes. I think things are really coming along well -- let me know what you think about expanding the Record class to include the families, domains and alignments and we can get this finished and all checked in. Please also let me know if I messed any of your code during my work on it. 
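
As one possible shape for the "lists of objects with attributes" idea, here is a minimal sketch in modern Python using a dataclass. It is not the Bio.Blast.Record design; the DomainHit class and parse_domain_line function are hypothetical names, the field positions follow the indices used in the attached get_parsed_domains_ml (where fields 4 and 7 are the markers that method skips), and the sample line is made up.

```python
from dataclasses import dataclass


@dataclass
class DomainHit:
    """One row of the 'Parsed for domains' table (illustrative only)."""
    model: str
    domain: str
    seq_from: int
    seq_to: int
    hmm_from: int
    hmm_to: int
    score: float
    e_value: float


def parse_domain_line(line):
    """Turn one whitespace-separated domain row into a DomainHit."""
    fields = line.split()
    # fields[4] and fields[7] are the two marker columns, ignored here.
    return DomainHit(
        model=fields[0],
        domain=fields[1],
        seq_from=int(fields[2]),
        seq_to=int(fields[3]),
        hmm_from=int(fields[5]),
        hmm_to=int(fields[6]),
        score=float(fields[8]),
        e_value=float(fields[9]),
    )


# Made-up row, only to show the resulting attribute access.
hit = parse_domain_line("SH2 1/1 10 92 .. 1 79 [] 120.5 1.2e-33")
print(hit.model, hit.seq_from, hit.seq_to, hit.e_value)
```

A record could then hold parsed_domains as a list of such objects, so callers read hit.e_value directly instead of re-parsing marked-up text.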
Hope this helps -- glad it's coming along! Brad -------------- next part -------------- ####################################################################### */ # COPYRIGHT INFORMATION # Pfam DOMAIN RESULTS PARSER # @AUTHOR: Wagied Davids # @DATE: 22.01.2004 # @COPYRIGHT: Wagied Davids, 2004 ####################################################################### */ import sys import string import re import time from types import FileType class HmmpfamRecord: ''' Prototype class Entry structure @author: Wagied Davids @date: 22.01.2004 @copyright: Wagied Davids, 2004 ''' # STATIC DATA FAMILY_CLASSIFICATION_HEADER = 'Scores for sequence family ' \ 'classification (score includes all domains):\nModel ' \ 'Description Score E-value N\n' \ '-------- ----------- ' \ '----- ------- ---' PARSED_DOMAIN_HEADER = 'Parsed for domains:\nModel ' \ 'Domain seq-f seq-t hmm-f hmm-t score E-value\n' \ '-------- ------- ----- ----- ----- ----- ----- -------' NO_HITS= '[no hits above thresholds]' # STATIC REGEX OBJECTS REGEX_FAMILY_SCORES= re.compile( r'((\S.*?)\s+(\S.*?)\s+((-| )\S.*?)\s+(\S.*?)\s+(\d+))', re.MULTILINE | re.DOTALL ) def __init__( self, query, accession= None, description= None, family_scores= [], parsed_domains= [], alignments= [] ): ''' Constructor for Pfam Entry structure @param ( query, accession= None, description= None, family_scores= [], parsed_domains= [], alignments= [] ) @return (None) ''' self.query= query self.accession= accession self.description= description self.family_scores= family_scores # FAMILY SCORES HITLIST FOR SCORE ENTRIES self.family_scores_hitlist= [] self.parsed_domains= parsed_domains self.alignments= alignments def __str__( self ): ''' Retrieves a string representation of parser entry class @param (None) @return (String: representation of HmmpfamRecord class) ''' strBuffer= '' strBuffer= strBuffer + "\n" strBuffer= strBuffer + "\t%s\n" % (self.get_query()) strBuffer= strBuffer + "\t%s\n" \ % (self.get_accession()) strBuffer= strBuffer + 
"\t%s\n" \ % (self.get_description()) strBuffer= strBuffer + "\t%s" % (self.get_family_scores_ml()) strBuffer= strBuffer + "\t%s" % (self.get_parsed_domains_ml()) strBuffer= strBuffer + "\t%s\n" \ % (self.get_alignments()) strBuffer= strBuffer + "" return strBuffer def get_query( self ): ''' Retrieves the QUERY @param (None) @return (String: QUERY ) ''' return self.query def get_accession( self ): ''' Retrieves the ACCESSION @param (None) @return (String: ACCESSION) ''' return self.accession def get_description( self ): ''' Retrieves the DESCRIPTION @param (None) @return (String: DESCRIPTION) ''' return self.description def get_family_scores_raw(self): ''' Retrieves a list of FAMILY SCORES @param (None) @return (List: FAMILY SCORES) ''' return self.family_scores def get_no_of_family_entries(self): ''' Retrieves the number of hits per query @param (None) @return (Integer: number of hits per query) ''' return len(self.family_scores) def get_family_scores_ml( self ): ''' FINE-GRAINED CONTROL OVER FAMILY CLASSIFICATION AND SCORE RESULTS @param (None) @return (String: Marked-Up text format of Family Classificatio Scores) ''' # BEGIN FAMILY_SCORE_LIST TAG family_scores= "\n" family_model= '' family_description= '' family_score_value= '' family_e_value= '' family_n_value= '' family_scores_counter= 1 for score_entry in self.get_family_scores_raw(): MatchScoreEntry= HmmpfamRecord.REGEX_FAMILY_SCORES.search(score_entry) if MatchScoreEntry != None: # BEGIN FAMILY_SCORE_HIT TAG family_scores= family_scores + \ "\t\t\n" \ % ( family_scores_counter ) # EXTRACT INFORMATION FROM MATCH_SCORE_ENTRY # MatchScoreEntry.group( 1 ) equals WHOLE ENTRY family_model= MatchScoreEntry.group( 2 ) family_description= MatchScoreEntry.group( 3 ) family_score_value= MatchScoreEntry.group( 4 ) # MatchScoreEntry.group( 5 ) equals '-' IF PRESENT family_e_value= MatchScoreEntry.group( 6 ) family_n_value= MatchScoreEntry.group( 7 ) for data, tag in [(family_model, "FAMILY_SCORE_MODEL"), 
(family_description, "FAMILY_DESCRIPTION"), (family_score_value, "FAMILY_SCORE_VALUE"), (family_e_value, "FAMILY_E_VALUE"), (family_n_value, "FAMILY_N_VALUE")]: family_scores += "\t\t\t<%s>%s\n" % \ (tag, data, tag) # COMPLETE FAMILY_SCORE_HIT TAG family_scores= family_scores + "\t\t\n" # INCREMENT family_scores_counter family_scores_counter= family_scores_counter + 1 # COMPLETE FAMILY_SCORE_LIST TAG family_scores= family_scores + "\t\n" return family_scores def get_parsed_domains_raw(self): ''' Retrieves a list of PARSED DOMAINS @param (None) @return (List: PARSED DOMAINS) ''' return self.parsed_domains def get_no_of_parsed_domains( self ): ''' Retrieves the number of parsed hits per query @param (None) @return (Integer: number of parsed hits per query) ''' return len( self.parsed_domains ) def get_parsed_domains_ml( self ): ''' FINE-GRAINED CONTROL OVER PARSED DOMAINS AND SCORE RESULTS @param (None) @return (String: Marked-Up text format of Parsed Domain section) ''' parsed_domain_list= [] parsed_model= '' parsed_domain_number= '' parsed_domain_seq_f= '' parsed_domain_seq_t= '' parsed_domain_hmm_f= '' parsed_domain_hmm_t= '' parsed_domain_2_dots= '' parsed_domain_brackets= '' parsed_domain_score= '' parsed_domain_e_value= '' parsed_domains_counter= 1 # BEGIN PARSED_DOMAINS_LIST TAG parsed_domains= '\n' for domain in self.get_parsed_domains_raw(): # IF NO_HITS NOT FOUND, THEN EXTRACT DATA if string.find( domain, HmmpfamRecord.NO_HITS ) < 0: parsed_domain_list= string.split( domain ) parsed_model= parsed_domain_list[0] parsed_domain_number= parsed_domain_list[1] parsed_domain_seq_f= parsed_domain_list[2] parsed_domain_seq_t= parsed_domain_list[3] #parsed_domain_2_dots= parsed_domain_list[4] parsed_domain_hmm_f= parsed_domain_list[5] parsed_domain_hmm_t= parsed_domain_list[6] #parsed_domain_brackets= parsed_domain_list[7] parsed_domain_score= parsed_domain_list[8] parsed_domain_e_value= parsed_domain_list[9] # BEGIN PARSED_DOMAIN_HIT TAG parsed_domains= 
                    parsed_domains + \
                    "\t\t\n" % (parsed_domains_counter)
                # FORMAT ENTRY TAGS
                for data, tag in [(parsed_model, "PARSED_MODEL"),
                                  (parsed_domain_number, "PARSED_DOMAIN_NUMBER"),
                                  (parsed_domain_seq_f, "PARSED_DOMAIN_SEQ_F"),
                                  (parsed_domain_seq_t, "PARSED_DOMAIN_SEQ_T"),
                                  (parsed_domain_hmm_f, "PARSED_DOMAIN_HMM_F"),
                                  (parsed_domain_hmm_t, "PARSED_DOMAIN_HMM_T"),
                                  (parsed_domain_score, "PARSED_DOMAIN_SCORE"),
                                  (parsed_domain_e_value, "PARSED_DOMAIN_E_VALUE")]:
                    # three arguments need three placeholders: open tag, data, close tag
                    parsed_domains += "\t\t\t<%s>%s</%s>\n" % \
                                      (tag, data, tag)
                # COMPLETE PARSED_DOMAIN_HIT TAG
                parsed_domains= parsed_domains + "\t\t\n"
                # INCREMENT parsed_domains_counter
                parsed_domains_counter= parsed_domains_counter + 1
            else:
                # NO_HITS FOUND
                return domain
        # COMPLETE PARSED_DOMAINS_LIST TAG
        parsed_domains= parsed_domains + '\n'
        return parsed_domains

    def get_alignments( self ):
        '''
        Retrieves a list of TOP SCORING ALIGNMENTS
        @param (None)
        @return (List: TOP SCORING ALIGNMENTS)
        '''
        return self.alignments

    def get_regex_family_scores( self ):
        '''
        Retrieves the Regex object for Pfam family scores
        @param (None)
        @return (Regex: Regex object for Pfam family scores)
        '''
        return HmmpfamRecord.REGEX_FAMILY_SCORES

class Iterator:
    """Iterate over a hmmpfam result file one record at a time.
    """
    def __init__(self, handle, parser = None):
        """Initialize with a handle to the hmmpfam output and optional parser.
        """
        if type(handle) is not FileType and type(handle) is not InstanceType:
            raise ValueError, "I expected a file handle or file-like object"
        self._handle = handle
        self._parser = parser

    def __iter__(self):
        return iter(self.next, None)

    def next(self):
        """Return the next hmmpfam output record, parsed if appropriate.
        """
        lines = []
        while 1:
            line= self._handle.readline()
            if not line:
                break
            # Pfam ENTRY DETECTED
            if line.find('Query sequence:') == 0:
                lines.append(line.rstrip())
                while 1:
                    line= self._handle.readline()
                    if not line:
                        break
                    lines.append(line.rstrip())
                    if line.find("//") == 0:
                        break
                # stop after one record so each call returns a single entry
                break
        if len(lines) == 0: # nothing left
            return None
        else:
            if self._parser:
                data = "\n".join(lines)
                return self._parser.parse(data)
            else:
                return "\n".join(lines)

class RecordParser:
    '''
    Prototype class for parsing hmmpfam output
    @author: Wagied Davids
    @date: 22.01.2004
    @copyright: Wagied Davids, 2004
    '''
    # STATIC REGEX OBJECTS
    REGEX_HMM_ENTRY= re.compile( r'(Query sequence:\s+\S.*\s+//)',
                                 re.MULTILINE | re.DOTALL )
    REGEX_HMM_QUERY= re.compile( r'Query sequence:\s+(\S.*?)\s+Accession',
                                 re.MULTILINE | re.DOTALL )
    REGEX_HMM_ACC= re.compile( r'Accession:\s+(\S.*?)\s+Description',
                               re.MULTILINE | re.DOTALL )
    REGEX_HMM_DESCRIPTION= re.compile( r'Description:\s+(\S.*?)\s+Scores',
                                       re.MULTILINE | re.DOTALL )
    REGEX_HMM_SEQ_FAMILY_SCORES= re.compile( r'(Scores\s+\S.*)\s+Parsed',
                                             re.MULTILINE | re.DOTALL )
    REGEX_HMM_PARSED_DOMAINS= re.compile(
        r'(Parsed for domains:\s+\S.*)\s+Alignments',
        re.MULTILINE | re.DOTALL )
    REGEX_HMM_ALIGNMENTS= re.compile(
        r'(Alignments of top-scoring domains:\s+\S.*)\s+//',
        re.MULTILINE | re.DOTALL )

    def __init__(self):
        '''
        Constructor for RecordParser
        @param (None)
        @return (None)
        '''
        self.debug= 0

    def set_debug( self, debug= 0 ):
        '''
        Sets the debug level when parsing
        debug= 0 No debug information
        debug= 1 Pfam Entry level debug information
        debug= 2 Regex level debug information
        debug= 3 Incoming data
        @param (Integer representing the verbosity/ debug level)
        @return (None)
        '''
        self.debug= debug

    def _print_debug(self, level, info):
        """Simple method to print out debug info if it matches a given level.
        """
        if level == self.debug:
            sys.stdout.write(info + "\n")

    def parse(self, data_entry):
        """Parse a single hmmpfam record.
        Returns the record parsed into an HmmpfamRecord class.
        """
        if self.debug == 3:
            print data_entry
        # MATCH ENTRY STRUCTURE
        match_hmm_entry= RecordParser.REGEX_HMM_ENTRY.search(data_entry)
        if self.debug == 2:
            # read the pattern from the compiled regex; the match may be None
            print "%s: %s" % (match_hmm_entry,
                              RecordParser.REGEX_HMM_ENTRY.pattern)
        if match_hmm_entry is not None:
            entry = match_hmm_entry.group(1)
            query, accession, description = self._parse_query_info(entry)
            family_scores_list = self._parse_family_scores(entry)
            parsed_domains_list = self._parse_domains(entry)
            domain_alignments_list = self._parse_alignments(entry)
            # Construct Pfam Entry structure
            record = HmmpfamRecord(query, accession, description,
                                   family_scores_list, parsed_domains_list,
                                   domain_alignments_list )
            if self.debug == 1:
                print "%s => %s" % ( record.get_query(),
                                     record.get_description() )
                print record.get_family_scores_ml()
                print record.get_parsed_domains_ml()
            return record
        # no complete entry found in the data
        return None

    def _parse_query_info(self, entry):
        """Retrieve the query name, accession and description.
        """
        hmm_query, hmm_accession, hmm_description = ('', '', '')
        # MATCH QUERY SEQUENCE
        match_hmm_query= RecordParser.REGEX_HMM_QUERY.search(entry)
        if self.debug == 2:
            print "%s: %s" % ( match_hmm_query,
                               RecordParser.REGEX_HMM_QUERY.pattern )
        if match_hmm_query is not None:
            hmm_query= match_hmm_query.group(1)
        # MATCH ACCESSION
        match_hmm_accession= RecordParser.REGEX_HMM_ACC.search( entry )
        if self.debug == 2:
            print "%s: %s" % ( match_hmm_accession,
                               RecordParser.REGEX_HMM_ACC.pattern )
        if match_hmm_accession is not None:
            hmm_accession= match_hmm_accession.group( 1 )
        # MATCH DESCRIPTION
        match_hmm_description= RecordParser.REGEX_HMM_DESCRIPTION.search( entry )
        if self.debug == 2:
            print "%s: %s" % ( match_hmm_description,
                               RecordParser.REGEX_HMM_DESCRIPTION.pattern )
        if match_hmm_description is not None:
            hmm_description= match_hmm_description.group(1)
        return hmm_query, hmm_accession, hmm_description

    def _parse_family_scores(self, entry):
        """Retrieve the family scores from the hmmpfam search.
        """
        match_hmm_scores= RecordParser.REGEX_HMM_SEQ_FAMILY_SCORES.search(entry)
        if self.debug == 2:
            # read the pattern from the compiled regex; the match may be None
            print "%s: %s" % (match_hmm_scores,
                              RecordParser.REGEX_HMM_SEQ_FAMILY_SCORES.pattern)
        family_scores_list = []
        if match_hmm_scores != None:
            hmm_scores= match_hmm_scores.group( 1 )
            family_scores_info_list= string.split( hmm_scores, "\n")
            # NOTE: LAST ELEMENT = EMPTY SPACE
            family_scores_list = family_scores_info_list[ 3: -1 ]
        return family_scores_list

    def _parse_domains(self, entry):
        """Parse domain information from the hmmpfam output.
        """
        match_hmm_parsed_domains= RecordParser.REGEX_HMM_PARSED_DOMAINS.search( entry )
        if self.debug == 2:
            print "%s: %s" % ( match_hmm_parsed_domains,
                               RecordParser.REGEX_HMM_PARSED_DOMAINS.pattern )
        parsed_domains_list = []
        if match_hmm_parsed_domains != None:
            hmm_domains= match_hmm_parsed_domains.group(1)
            parsed_domains_info_list= string.split(hmm_domains, "\n")
            # NOTE: LAST ELEMENT = EMPTY SPACE
            parsed_domains_list= parsed_domains_info_list[3: -1]
        return parsed_domains_list

    def _parse_alignments(self, entry):
        """Parse out alignment information from the hmmpfam output.
        """
        match_hmm_alignments= RecordParser.REGEX_HMM_ALIGNMENTS.search( entry )
        if self.debug == 2:
            # read the pattern from the compiled regex; the match may be None
            print "%s: %s" % (match_hmm_alignments,
                              RecordParser.REGEX_HMM_ALIGNMENTS.pattern)
        # default to an empty list so the return works when nothing matches
        domain_alignments_list = []
        if match_hmm_alignments is not None:
            hmm_alignments= match_hmm_alignments.group(1)
            domain_alignments_info_list= string.split(hmm_alignments, "\n")
            domain_alignments_list= domain_alignments_info_list[3:-2]
        return domain_alignments_list

    def get_regex_hmm_entry( self ):
        '''
        Retrieves the Regex object for REGEX_HMM_ENTRY
        @param (None)
        @return (Regex: HMM_ENTRY)
        '''
        return RecordParser.REGEX_HMM_ENTRY

    def get_regex_query( self ):
        '''
        Retrieves the Regex object for REGEX_HMM_QUERY
        @param (None)
        @return (Regex: REGEX_HMM_QUERY)
        '''
        return RecordParser.REGEX_HMM_QUERY

    def get_regex_accession(self):
        '''
        Retrieves the Regex object for REGEX_HMM_ACC
        @param (None)
        @return (Regex: REGEX_HMM_ACC)
        '''
        return RecordParser.REGEX_HMM_ACC

    def get_regex_description( self ):
        '''
        Retrieves the Regex object for REGEX_HMM_DESCRIPTION
        @param (None)
        @return (Regex: REGEX_HMM_DESCRIPTION)
        '''
        return RecordParser.REGEX_HMM_DESCRIPTION

    def get_regex_family_scores( self ):
        '''
        Retrieves the Regex object for REGEX_HMM_SEQ_FAMILY_SCORES
        @param (None)
        @return (Regex: REGEX_HMM_SEQ_FAMILY_SCORES)
        '''
        return RecordParser.REGEX_HMM_SEQ_FAMILY_SCORES

    def get_regex_parsed_domains( self ):
        '''
        Retrieves the Regex object for REGEX_HMM_PARSED_DOMAINS
        @param (None)
        @return (Regex: REGEX_HMM_PARSED_DOMAINS)
        '''
        return RecordParser.REGEX_HMM_PARSED_DOMAINS

    def get_regex_alignments( self ):
        '''
        Retrieves the Regex object for REGEX_HMM_ALIGNMENTS
        @param (None)
        @return (Regex: REGEX_HMM_ALIGNMENTS)
        '''
        return RecordParser.REGEX_HMM_ALIGNMENTS

    def __str__( self ):
        '''
        Retrieves a string representation of parser class
        @param (None)
        @return (String: Retrieves a string representation of parser class)
        '''
        strBuffer= 'ParserType: RecordParser'
        return strBuffer

# __END__
-------------- next part --------------
#!/usr/bin/env python
###################################################################### */
# COPYRIGHT INFORMATION
# Test program for Pfam domain results parser
# @AUTHOR: Wagied Davids
# @DATE: 22.01.2004
# @COPYRIGHT: Wagied Davids, 2004
###################################################################### */
import Hmmpfam # Module level re-name

# DATA LOCATION
filename= 'hmmpfam_output.example'
handle = open(filename, "r")

# INSTANTIATE Parser with debugging info
parser= Hmmpfam.RecordParser()
# parser.set_debug(1)
iterator = Hmmpfam.Iterator(handle, parser)
for rec in iter(iterator):
    print "--> %s : %s : %s" % (rec.query, rec.accession, rec.description)
    print rec.get_family_scores_ml()
    print rec.get_parsed_domains_ml()

From mcolosimo at mitre.org Fri Feb 27 16:25:20 2004
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Sat Mar 5 14:43:31 2005
Subject: [Biopython-dev] GenBank bug, oriT feature missing
Message-ID: <723CEEEE-696B-11D8-ABCF-000A95A5D8B2@mitre.org>

Hi,

I've just spent a good part of a day trying to understand what was going
wrong and I think I finally know. Here is the problem:

I was getting this exception when reading in a GenBank file (from GenBank):

"/usr/lib/python2.2/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
    raise exception
Martel.Parser.ParserPositionException: error parsing at or beyond character 1981

After digging into the GenBank code (__init__.py) and then into Martel's
code, I found I could turn on debugging:

GenBank.FeatureParser(debug_level=2)

I finally see where things die (and what character 1981 means).

For AE000070 there is a feature tag "oriT", which seems to be missing
from genbank_record.py and __init__.py

     oriT            81..92
                     /note="region including origin of transfer (oriT) almost
                     identical to oriT regions of plasmids from the 'Q-group'"
                     /evidence=not_experimental

This really isn't a pretty way of dealing with unknown features. Is
there a way to get this to just pass unknown features?

Thanks,

Marc

From idoerg at burnham.org Fri Feb 27 18:55:37 2004
From: idoerg at burnham.org (Iddo Friedberg)
Date: Sat Mar 5 14:43:31 2005
Subject: [Biopython-dev] GenBank bug, oriT feature missing
In-Reply-To: <723CEEEE-696B-11D8-ABCF-000A95A5D8B2@mitre.org>
References: <723CEEEE-696B-11D8-ABCF-000A95A5D8B2@mitre.org>
Message-ID: <403FD8F9.8000908@burnham.org>

I agree that these things should be handled better. How about raising
an UnknownFeature exception, which is not silenced by default? The user
can then decide whether the parser should trap and silence such an
exception when it occurs.

./I

Marc Colosimo wrote:

> Hi,
>
> I've just spent a good part of a day trying to understand what was going
> wrong and I think I finally know. Here is the problem:
>
> I was getting this exception when reading in a GenBank file (from GenBank):
>
> "/usr/lib/python2.2/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
>     raise exception
> Martel.Parser.ParserPositionException: error parsing at or beyond character 1981
>
> After digging into the GenBank code (__init__.py) and then into Martel's
> code, I found I could turn on debugging:
>
> GenBank.FeatureParser(debug_level=2)
>
> I finally see where things die (and what character 1981 means).
>
> For AE000070 there is a feature tag "oriT", which seems to be missing
> from genbank_record.py and __init__.py
>
>      oriT            81..92
>                      /note="region including origin of transfer (oriT) almost
>                      identical to oriT regions of plasmids from the 'Q-group'"
>                      /evidence=not_experimental
>
> This really isn't a pretty way of dealing with unknown features. Is
> there a way to get this to just pass unknown features?
>
> Thanks,
>
> Marc
>

--
Iddo Friedberg, Ph.D.
The Burnham Institute
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
Tel: +1 (858) 646 3100 x3516
Fax: +1 (858) 713 9930
http://ffas.ljcrf.edu/~iddo

From mcolosimo at mitre.org Sat Feb 28 22:03:08 2004
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Sat Mar 5 14:43:31 2005
Subject: [Biopython-dev] GenBank bug, oriT feature missing
In-Reply-To: <200402281154.38515.Peter.Bienstman@UGent.be>
References: <723CEEEE-696B-11D8-ABCF-000A95A5D8B2@mitre.org> <403FD8F9.8000908@burnham.org> <200402281154.38515.Peter.Bienstman@UGent.be>
Message-ID: <4041566C.1060305@mitre.org>

Thanks, it works on that case now. I'll look to see where you added that
so that if I run into another unknown tag I can add it.

Raising an UnknownFeature exception would be nice. But from what little
I know about how it parses, how could you re-enter parsing? Maybe by
creating a different FeatureParser to handle unknown features
(WeakFeatureParser, maybe?)

Marc

Peter Bienstman wrote:

>-----BEGIN PGP SIGNED MESSAGE-----
>Hash: SHA1
>
>That would be a good solution. As a short term fix however, I've added the
>oriT tag to genbank_format.py in CVS.
>
>Peter
>
>On Saturday 28 February 2004 00:55, Iddo Friedberg wrote:
>
>>I agree that these things should be handled better. How about raising
>>an UnknownFeature exception, which is not silenced by default? The user
>>can then decide whether the parser should trap and silence such an
>>exception when it occurs.
>>
>>./I
>>
>>Marc Colosimo wrote:
>>
>>>Hi,
>>>
>>>I've just spent a good part of a day trying to understand what was going
>>>wrong and I think I finally know. Here is the problem:
>>>
>>>I was getting this exception when reading in a GenBank file (from
>>>GenBank):
>>>
>>>"/usr/lib/python2.2/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
>>>    raise exception
>>>Martel.Parser.ParserPositionException: error parsing at or beyond character 1981
>>>
>>>After digging into the GenBank code (__init__.py) and then into Martel's
>>>code,
>>>I found I could turn on debugging:
>>>
>>>GenBank.FeatureParser(debug_level=2)
>>>
>>>I finally see where things die (and what character 1981 means).
>>>
>>>For AE000070 there is a feature tag "oriT", which seems to be missing
>>>from genbank_record.py and __init__.py
>>>
>>>     oriT            81..92
>>>                     /note="region including origin of transfer (oriT) almost
>>>                     identical to oriT regions of plasmids from the 'Q-group'"
>>>                     /evidence=not_experimental
>>>
>>>This really isn't a pretty way of dealing with unknown features. Is
>>>there a way to get this to just pass unknown features?
>>>
>>>Thanks,
>>>
>>>Marc
>>>

From chapmanb at uga.edu Sun Feb 29 17:17:58 2004
From: chapmanb at uga.edu (Brad Chapman)
Date: Sat Mar 5 14:43:31 2005
Subject: [Biopython-dev] GenBank bug, oriT feature missing
In-Reply-To: <4041566C.1060305@mitre.org>
References: <723CEEEE-696B-11D8-ABCF-000A95A5D8B2@mitre.org> <403FD8F9.8000908@burnham.org> <200402281154.38515.Peter.Bienstman@UGent.be> <4041566C.1060305@mitre.org>
Message-ID: <20040229221758.GH24150@evostick.agtec.uga.edu>

Hey guys;

[Marc reports yet another new feature tag added to GenBank files]

> Martel.Parser.ParserPositionException: error parsing at or beyond
> character 1981
>
> After digging into the GenBank code (__init__.py) and then into Martel's
> code, I found I could turn on debugging:
>
> GenBank.FeatureParser(debug_level=2)
>
> I finally see where things die (and what character 1981 means).
>
> For AE000070 there is a feature tag "oriT", which seems to be missing
> from genbank_record.py and __init__.py

[And makes a useful suggestion that others second (and third...)]

> This really isn't a pretty way of dealing with unknown features. Is
> there a way to get this to just pass unknown features?

Yes, I completely agree that this is a pain. The problem is an
unfortunate design decision where the format used to parse the files
uses a hard-coded list of tags.
This made sense when it was originally designed, since there is
supposed to be a restricted set of feature and qualifier key names that
can be used. Unfortunately, it's turned into a headache for everyone
since NCBI keeps adding tags.

I've decided to get rid of this and just checked in a series of changes
to CVS that update the GenBank format. The new format uses a general
regular expression (basically \w, plus some additional characters that
get used, like ' and -), so it shouldn't run into this problem any
longer.

In the process of making these changes I've also done a general cleanup
of the format file and merged it with the old format in
Bio.expressions.genbank (old, but still with plenty of useful bits of
code). I've moved Bio/GenBank/genbank_format.py to
Bio/expressions/genbank.py -- so for those of you who look at it or
change it (thanks Peter!), you now need to look there.

So, long story short -- I hope I fixed this problem for the future.
Please do give the new version in CVS a go and let me know if it has any
problems on your files.

Sorry about the pain and thanks for the report!

Brad
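
[Illustration of the "general regular expression" idea described in the
message above. This is only a rough sketch using Python's standard re
module and current Python syntax; it is not the Martel expression checked
into Bio/expressions/genbank.py. The names FEATURE_LINE and
parse_feature_line, and the assumption that a feature line is five spaces,
a key, padding and a location, are hypothetical choices made for the
example.]

```python
import re

# Assumed layout (hypothetical, for illustration only): five leading spaces,
# a feature key made of word characters plus ' and -, padding, a location.
# Any key of that shape is accepted, so no hard-coded list of tags is needed.
FEATURE_LINE = re.compile(r"^ {5}(?P<key>[\w'\-]+) +(?P<location>\S+)\s*$")


def parse_feature_line(line):
    """Return (key, location) for a feature line, or None if it is not one."""
    match = FEATURE_LINE.match(line)
    if match is None:
        return None
    return match.group("key"), match.group("location")


if __name__ == "__main__":
    for text in ["     oriT            81..92",
                 "     3'UTR           100..200",
                 "                     /evidence=not_experimental"]:
        print(parse_feature_line(text))
```

[With a pattern like this, a previously unseen key such as oriT is picked
up like any other key, while qualifier lines (which start further to the
right) are left for a separate rule.]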