From chapmanb at uga.edu  Mon Feb  2 22:54:52 2004
From: chapmanb at uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:30 2005
Subject: [Biopython-dev] hmmpfam parser
Message-ID: <20040203035452.GA67076@evostick.agtec.uga.edu>

Hi Wagied;

> I have some code which is able to parse hmmer output,
> as well as code donated by Joanne Adamkewicz from Exelixis.
> 
> If you guys/gals find it useful, updates and modification will be done!

Thanks for sending this -- hmmpfam parsing code in Biopython is
definitely something we need. A few notes on what you sent:

1. I'm guessing that PfamParser.py and ExPFam.py are completely
separate pieces of code (except for both dealing with parsing Pfam).
For Biopython, the PfamParser.py is the more generally useful piece
of code since it provides an interface to parse a hmmpfam result
into a record-like object. So I'll probably restrict my comments to
that code.

2. Is there a methodology that you use to iterate over a file full
of hmmpfam results? Normally most parsers in Biopython include a
parser for individual records and then an iterator so that you can
apply the parser to a file full of results.

3. Some of the code does not follow the naming conventions that we
normally use in Biopython. Specifically:

a. Functions should be lowercase_separated_by_underscores style.

b. Variables should be lowercase_underscores style or alltogether
style. One of the things which was confusing to me in your code is
that you alternate between the lowercase_underscores style and
ALL_UPPERCASE style. At least in my experience ALL_UPPERCASE is
normally reserved for "constants."

c. You provide a lot of accessor methods for class variables (i.e.
getAccession for self.accession). Normally in Python you just access
the variable directly (or prefix it with an underscore, like
self._internal, if the variable is for internal class use) --
getWhatever functions are more Java-like.

d. There are lots of unnecessary semi-colons in the code. They don't
hurt anything, but again make the code look more Java-like than
python-like.

e. On the class __init__()'s you have code that looks like:

def __init__(self, variable = None):
    if variable is not(None):
        # do something with variable
    else:
        # raise an error

You can eliminate all of this by just requiring the variable in the
initializer:

def __init__(self, variable):
    # do something with variable

And let python take care of the error checking that something was
passed.
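
As a minimal sketch of that last point (the class name PfamEntry and its
attribute are made up for illustration, they are not taken from
PfamParser.py): when the argument has no default, Python itself raises a
TypeError if the caller leaves it out.

```python
class PfamEntry:
    """Hypothetical record class used only for illustration."""

    def __init__(self, accession):
        # No default value, so Python refuses to build the object
        # unless the caller supplies an accession.
        self.accession = accession


def try_without_argument():
    """Return the name of the exception raised when the argument is missing."""
    try:
        PfamEntry()
    except TypeError as error:
        return type(error).__name__
    return None


entry = PfamEntry("PF00017")
print(entry.accession)
print(try_without_argument())
```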

More generally, the documentation on contributing to Biopython covers
the style conventions we try to stick to, so that a heterogeneous
project such as this can be as uniform as possible:

http://biopython.org/docs/developer/contrib.html

Hopefully all that is helpful -- we'd be very happy to accept the
code with some modifications along the lines of what I've mentioned
above, so I'm definitely not trying to be discouraging by
enumerating those points above. We just want to make sure the code
that gets in is as easy to understand and maintain as possible.

Thanks again for the mail and please don't hesitate to ask any other
questions!
Brad

From chapmanb at uga.edu  Mon Feb  2 23:01:34 2004
From: chapmanb at uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:30 2005
Subject: [Biopython-dev] Explanation for late responses to development list
	mails
Message-ID: <20040203040134.GC45748@evostick.agtec.uga.edu>

Hey all;
I just realized (with the help of good Jeff as always) that my mails
to the development list have been getting discarded for the past month or 
so. Apparently I write a lot like automated spammers -- man, am I
feelin' some self-confidence now :-).

But the point is that I'm going to try and dig through my sent box
and forward on some mails which never saw the light of day. So, if
some responses I send now seem especially non-timely, I blame it
entirely on the e-mail system and not my slacking.

Sorry about this -- if anything you wrote to the development list
got no attention, please send me a mail so that I know about it.
Thanks!

Brad

From chapmanb at uga.edu  Mon Feb  2 23:02:08 2004
From: chapmanb at uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:30 2005
Subject: [Biopython-dev] Bio.Wise checked in
Message-ID: <20040203040208.GB67076@evostick.agtec.uga.edu>

Hi Michael;

> I have checked in Bio.Wise, which contains modules for running and
> processing the output of some of the models in the Wise package
> available from
> <ftp://ftp.ebi.ac.uk/pub/software/unix/wise2/wise2.2.0.tar.gz>:
> 
> Bio.Wise.psw for protein Smith-Waterman alignments
> Bio.Wise.dnal for Smith-Waterman DNA alignments

Great! Thanks for doing this!

> There are also appropriate unit tests which will not be checked if
> dnal is not in your path.

Right now I don't have wise installed, and the test fails instead
of being skipped:

7:38pm Tests> python run_tests.py test_Wise.py
test_Wise ... dnal: not found
FAIL

======================================================================
FAIL: test_Wise
----------------------------------------------------------------------
Traceback (most recent call last):
  File "run_tests.py", line 148, in runTest
    self.runSafeTest()
  File "run_tests.py", line 185, in runSafeTest
    expected_handle)
  File "run_tests.py", line 285, in compare_output
    assert expected_line == output_line, \
AssertionError: 
Output  : 'test_dnal (test_Wise.TestWiseDryRun) ... FAIL\n'
Expected: 'test_dnal (test_Wise.TestWiseDryRun) ... ok\n'
----------------------------------------------------------------------
Ran 1 tests in 0.075s

It looks like my commands execution returns something different than
on your machine:

>>> import commands
>>> commands.getoutput("dnal")
'dnal: not found'

I changed requires_wise.py a bit so it takes care of this case. So
just a minor thing.
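
For reference, a check of this kind can be sketched as follows. This is
only an illustration in present-day Python, not the actual contents of
requires_wise.py; the helper name program_available is made up, and
shutil.which stands in for inspecting the text returned by
commands.getoutput, which differs between shells.

```python
import shutil


def program_available(name):
    """Return True if an executable called `name` is found on the PATH."""
    # shutil.which returns the full path to the program, or None when the
    # program cannot be found, so no shell error text has to be parsed.
    return shutil.which(name) is not None


if not program_available("dnal"):
    print("dnal not found on PATH; the Wise tests should be skipped")
```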

Thanks again for this!
Brad

From chapmanb at uga.edu  Mon Feb  2 23:04:01 2004
From: chapmanb at uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:30 2005
Subject: [Biopython-dev] Contribution -- NMR xpk files
Message-ID: <20040203040401.GC67076@evostick.agtec.uga.edu>

Hi Bob and all;

> I have contributed some code to biopython for working with NMR data to be 
> included in the CVS, probably in the NMR package.  Along with the two 
> modules (xpktools.py and NOEtools.py) is an example script 
> (simplepredict.py) and an input file (noed.xpk).  I think you will find 
> the example script to be well documented and readable.

Great! Thanks for sending this my way. I've checked the modules into
Bio.NMR and the example code and input file now live in
Doc/examples/nmr. Everything seems to work on my machine (well, at
least it runs without any errors -- without any NMR knowledge I'm
not so good at interpreting the output :-), but if you could check
and be sure I didn't mess anything up, that would be great. It looks
like everything has already migrated over to anonymous CVS.

> This new functionality will enable biopython users to perform analysis 
> and data extraction of NMR data whether in the form of data tables or 
> directly from .xpk peaklist files.

Again, we really appreciate this. As with everything in Biopython it
takes the specific knowledge about an area to have code that handles
the bioinformatics challenges well. Go NMR, go.

Well-at-least-I-know-what-NMR-stands-for-ly yr's,
Brad

From idoerg at burnham.org  Mon Feb  2 23:13:55 2004
From: idoerg at burnham.org (Iddo Friedberg)
Date: Sat Mar  5 14:43:30 2005
Subject: [Biopython-dev] Explanation for late responses to development
	list mails
In-Reply-To: <20040203040134.GC45748@evostick.agtec.uga.edu>
Message-ID: <Pine.SGI.4.10.10402022010590.15817945-100000@pines2.ljcrf.edu>


On Mon, 2 Feb 2004, Brad Chapman wrote:

> Hey all;
> I just realized (with the help of good Jeff as always) that my mails
> to the development list have been getting discarded for the past month or 
> so. Apparently I write a lot like automated spammers -- man, am I
> feelin' some self-confidence now :-).

I told you to change the name for the new V1agraPr0nStar class...

(Let's see if this makes it to the list)

false-negatively-y'rs,

Iddo


From hoffman at ebi.ac.uk  Tue Feb  3 04:14:24 2004
From: hoffman at ebi.ac.uk (Michael Hoffman)
Date: Sat Mar  5 14:43:30 2005
Subject: [Biopython-dev] Re: Bio.Wise checked in
In-Reply-To: <20040203040208.GB67076@evostick.agtec.uga.edu>
References: <20040203040208.GB67076@evostick.agtec.uga.edu>
Message-ID: <Pine.LNX.4.58.0402030913470.629@qnzvnan.rov.np.hx>

On Mon, 2 Feb 2004, Brad Chapman wrote:

> >>> commands.getoutput("dnal")
> 'dnal: not found'
>
> I changed requires_wise.py a bit so it takes care of this case. So
> just a minor thing.

Thanks!
-- 
Michael Hoffman <hoffman@ebi.ac.uk>
European Bioinformatics Institute

From chapmanb at uga.edu  Wed Feb  4 19:04:53 2004
From: chapmanb at uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:30 2005
Subject: [Biopython-dev] hmmpfam parser
In-Reply-To: <401F8BDD.73417BCD@ebc.uu.se>
References: <401AC30A.E639BD2E@ebc.uu.se>
	<20040203014909.GD17947@evostick.agtec.uga.edu>
	<401F8BDD.73417BCD@ebc.uu.se>
Message-ID: <20040205000453.GJ907@evostick.agtec.uga.edu>

Hi Wagied;

> ExPfam is code donated by Joanne Adamkewicz at Exelixis. I guess
> they use it at the "module" level rather than at the class level.

Okay, thanks -- that makes more sense now. This would need a bit of work to
be "Biopython-like." It does do a good job of solving a particular
problem, but normally we try to focus the code that gets in on being
as broadly applicable as possible (hence the emphasis on parsers and
iterators and the like).

> The record/entry structures are ideally collected in a hash...let me
> check the code..will write that in!

I'm not sure what exactly you mean here. What does this refer to?

> I can write in an Iterator object to traverse the records, if that is
> necessary.

It would be really nice to have -- I think the most common use case
(at least in my experience with hmmpfam) is to have a file full of
searches that need to be parsed out. It would really expand the
usability of the code.
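
To make the request concrete, here is a rough sketch of such an
iterator in current Python. It assumes that each report in the file ends
with a line containing only "//"; the function name iterate_records and
the sample text are invented for the example, and a real version would
hand each chunk on to the record parser.

```python
import io


def iterate_records(handle, separator="//"):
    """Yield the text of one report at a time from an open file handle."""
    lines = []
    for line in handle:
        if line.strip() == separator:
            # End of one report: hand back what was collected so far.
            yield "".join(lines)
            lines = []
        else:
            lines.append(line)
    # Keep a trailing report that has no closing separator.
    if "".join(lines).strip():
        yield "".join(lines)


# Invented two-report input, only to show the splitting behaviour.
sample = io.StringIO(
    "Query sequence: seq1\n[no hits above thresholds]\n//\n"
    "Query sequence: seq2\n[no hits above thresholds]\n//\n"
)
records = list(iterate_records(sample))
print(len(records))
```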

> Mmhhh you noticed.-..the Java-like coding....I generally try to prevent 
> direct access of class instance variables, rubbed off from Java.

Yes, this really isn't necessary in Biopython. It isn't terrible to
have (in addition to direct attribute access), but does require extra 
work coding all those functions up. Generally, the Python viewpoint
on this is that you "trust" users of your code to access the
variables correctly. Compared to Java coders we tend to be a bit
more easygoing about that sort of thing.

> Made the relevant changes you suggested. 
> Here it is again:
[...]
> Please don't hesitate to ask any other questions!

Thanks -- this does clear up a few things. Not to be a pain, but
there are a few things I mentioned before that are still present in
the code you sent:

1. Function names are still in thisStyle instead of this_style. This
is the most serious problem, as it really helps to have consistent
naming conventions throughout the Biopython codebase -- as much as
possible.

2. An iterator -- as I mentioned before, this would really improve
the usability of the code.

3. All the ';'s in the code. This is a more minor gripe, but poor
python programmers aren't used to looking at those. A search/replace
on them will likely get rid of them all and make it look much nicer.

I'd be very happy to check it in with those changes.

> I was thinking of becoming a developer, will need to go thru biopython's
> coding guidelines. If I could be added to the developers list, would be
> great!

Definitely. The participants page is at:

http://www.biopython.org/participants/

This is editable on the web (with username 'biopython' and password
'user'). I'd encourage you to enter your information there (just
sign in, click 'edit this page' then 'add new') and be included.
We are all about giving credit to the people that contribute (as
these are the people that make it happen :-).

Thanks again for the continued work!
Brad

From hoffman at ebi.ac.uk  Thu Feb  5 04:20:36 2004
From: hoffman at ebi.ac.uk (Michael Hoffman)
Date: Sat Mar  5 14:43:30 2005
Subject: [Biopython-dev] hmmpfam parser
In-Reply-To: <20040205000453.GJ907@evostick.agtec.uga.edu>
References: <401AC30A.E639BD2E@ebc.uu.se>
	<20040203014909.GD17947@evostick.agtec.uga.edu>
	<401F8BDD.73417BCD@ebc.uu.se>
	<20040205000453.GJ907@evostick.agtec.uga.edu>
Message-ID: <Pine.LNX.4.58.0402050914260.6639@qnzvnan.rov.np.hx>

On Wed, 4 Feb 2004, Brad Chapman wrote:

> > Mmhhh you noticed.-..the Java-like coding....I generally try to prevent
> > direct access of class instance variables, rubbed off from Java.
>
> Yes, this really isn't necessary in Biopython. It isn't terrible to
> have (in addition to direct attribute access), but does require
> extra work coding all those functions up. Generally, the Python
> viewpoint on this is that you "trust" users of your code to access
> the variables correctly. Compared to Java coders we tend to be a bit
> more easygoing about that sort of thing.

It is also worth noting that, unlike in Java, if you need to add
accessor methods in later to do some processing you can always do so
by changing the attribute into a property.

http://www.python.org/2.2.1/descrintro.html#property

This should get you all of the benefits of encapsulation later if you
want it without having to deal with all of the cruft. Until then,
YAGNI (you aren't gonna need it).
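
A small sketch of that idea, with a made-up Record class: the attribute
starts out as plain data and is later turned into a property, while
calling code keeps using record.accession unchanged. The decorator
spelling below is from later Python versions; in Python 2.2 the same
thing is written as accession = property(getter, setter).

```python
class Record:
    """Hypothetical class; `accession` began life as a plain attribute."""

    def __init__(self, accession):
        # This assignment goes through the property setter below.
        self.accession = accession

    @property
    def accession(self):
        return self._accession

    @accession.setter
    def accession(self, value):
        # Processing added later without changing how callers use the class.
        self._accession = value.strip().upper()


record = Record(" pf00017 ")
print(record.accession)
record.accession = "pf00018"
print(record.accession)
```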
-- 
Michael Hoffman <hoffman@ebi.ac.uk>
European Bioinformatics Institute

From chapmanb at uga.edu  Fri Feb  6 11:29:35 2004
From: chapmanb at uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:31 2005
Subject: [Biopython-dev] Re: biopython database connectivity error
In-Reply-To: <1076029748.4022e934509fb@webmail.njit.edu>
References: <033e01c3c654$549354c0$2b113b86@christen2002>
	<20040105223401.GC9588@evostick.agtec.uga.edu>
	<1076029748.4022e934509fb@webmail.njit.edu>
Message-ID: <20040206162935.GC31847@evostick.agtec.uga.edu>

Hello Chidambaram;

> I have installed Python, Biopython and all the necessary modules, but I
> cannot connect to the database. I am getting errors and I don't know why.
[...]
> Python 2.2.3 (#42, May 30 2003, 18:12:08) [MSC 32 bit (Intel)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from BioSQL import BioSeqDatabase
> >>> server = BioSeqDatabase.open_database(driver ="MySQLdb", user = "chapmanb",
> ...             passwd = "biopython", host = "localhost", db= "bioseqdb")
> Traceback (most recent call last):
>   File "<stdin>", line 2, in ?
>   File "C:\PROGRA~1\Lib\site-packages\BioSQL\BioSeqDatabase.py", line 51, in open_database
>     conn = connect(**kw)
>   File "C:\PROGRA~1\Lib\site-packages\MySQLdb\__init__.py", line 63, in Connect
>     return apply(Connection, args, kwargs)
>   File "C:\PROGRA~1\Lib\site-packages\MySQLdb\connections.py", line 58, in __init__
>     self._db = apply(connect, args, kwargs2)
> _mysql_exceptions.OperationalError: (2003, "Can't connect to MySQL server on 'localhost' (10061)")

The problem is not a biopython one, but rather a MySQL one. The
relevant error is the last line:

"Can't connect to MySQL server on 'localhost' (10061)"

Based on the lines you pasted, you are using my code directly from
an example file or documentation. If you look at this line:

server = BioSeqDatabase.open_database(driver ="MySQLdb", user = "chapmanb",
            passwd = "biopython", host = "localhost", db= "bioseqdb")

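this indicates you are trying to connect to the database with the
username "chapmanb" (my normal username :-) and the password
"biopython." For your use you'll have to use your actual username and
password to connect to your local MySQL installation. You will also
need to have created the "bioseqdb" database and populated it with
the BioSQL schema. It is also worth checking that the MySQL server is
actually running on your machine: error 2003 with code 10061 usually
means the connection was refused, which happens before any username
or password is checked.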
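
As a rough sketch of what that can look like, the connection details
can be kept out of the script and read from the environment. The
variable names BIOSQL_USER, BIOSQL_PASSWORD, BIOSQL_HOST and BIOSQL_DB,
and both helper functions, are invented for this example; only the
open_database call with its keyword arguments comes from the code
quoted above.

```python
import os


def connection_settings(env=None):
    """Collect open_database keyword arguments from environment variables."""
    if env is None:
        env = os.environ
    return {
        "driver": "MySQLdb",
        "user": env["BIOSQL_USER"],          # your own MySQL account
        "passwd": env["BIOSQL_PASSWORD"],
        "host": env.get("BIOSQL_HOST", "localhost"),
        "db": env.get("BIOSQL_DB", "bioseqdb"),
    }


def open_server(env=None):
    """Connect to BioSQL; needs Biopython, MySQLdb and a running server."""
    from BioSQL import BioSeqDatabase  # imported lazily on purpose
    return BioSeqDatabase.open_database(**connection_settings(env))
```

Calling open_server() still fails with the same 2003 error if no MySQL
server is listening on the given host.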

Hope this helps some!
Brad

From a.cavallo at reading.ac.uk  Wed Feb 11 08:02:24 2004
From: a.cavallo at reading.ac.uk (Antonio Cavallo)
Date: Sat Mar  5 14:43:31 2005
Subject: [Biopython-dev] (no subject)
Message-ID: <Pine.LNX.4.58.0402111257240.3796@laptop0.home>


Hi,

here is my problem.
I would like to retrieve some accessions from the EMBL data source, and
following the tutorial I tried:

==================================================================================
>>> from Bio import db  
>>> sp = db["embl"]
>>> record_handle = sp['AA054823']
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/home/users/antonio/usr/encap/biopython-1.23/lib/python/Bio/config/DBRegistry.py", line 152, in __getitem__
    data = self._run_serial(key)
  File "/home/users/antonio/usr/encap/biopython-1.23/lib/python/Bio/config/DBRegistry.py", line 219, in _run_serial
    raise KeyError, "I could not get any results."
KeyError: 'I could not get any results.'
==================================================================================


This error seems strange because that entry does exist!
Using other sources:


==================================================================================
>>> sp = db['embl-dbfetch-cgi']
>>> record_handle = sp['AA054823']
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/home/users/antonio/usr/encap/biopython-1.23/lib/python/Bio/config/DBRegistry.py", line 89, in __getitem__
    return self._get(key)
  File "/home/users/antonio/usr/encap/biopython-1.23/lib/python/Bio/config/_support.py", line 109, in __call__
    return self.fn(*args, **keywds)
  File "/home/users/antonio/usr/encap/biopython-1.23/lib/python/Bio/config/DBRegistry.py", line 267, in _get
    handle = self._cgiopen(key)
  File "/home/users/antonio/usr/encap/biopython-1.23/lib/python/Bio/config/DBRegistry.py", line 274, in _cgiopen
    options = _my_urlencode(params)
  File "/home/users/antonio/usr/encap/biopython-1.23/lib/python/Bio/config/DBRegistry.py", line 561, in _my_urlencode
    params = params.items()
AttributeError: 'list' object has no attribute 'items'
==================================================================================

And more:

==================================================================================
>>> sp = db['embl-fast']
>>> record_handle = sp['AA054823']
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/home/users/antonio/usr/encap/biopython-1.23/lib/python/Bio/config/DBRegistry.py", line 150, in __getitem__
    data = self._run_concurrent(key)
  File "/home/users/antonio/usr/encap/biopython-1.23/lib/python/Bio/config/DBRegistry.py", line 202, in _run_concurrent
    raise KeyError, "I could not get any results."
KeyError: 'I could not get any results.'
>>> 
==================================================================================


What's wrong? After installing biopython-1.23, is there something else
I have to do in order to get access to the EMBL database?
Sorry, but I'm totally new to Biopython.
Thank you in advance,
antonio

From a.cavallo at reading.ac.uk  Wed Feb 11 12:20:04 2004
From: a.cavallo at reading.ac.uk (Antonio Cavallo)
Date: Sat Mar  5 14:43:31 2005
Subject: [Biopython-dev] (no subject)
In-Reply-To: <045A60AE-5CB4-11D8-AD2C-000A956845CE@stanfordalumni.org>
References: <Pine.LNX.4.58.0402111257240.3796@laptop0.home>
	<045A60AE-5CB4-11D8-AD2C-000A956845CE@stanfordalumni.org>
Message-ID: <Pine.LNX.4.58.0402111715260.4843@laptop0.home>


On Wed, 11 Feb 2004, Jeffrey Chang wrote:

It seems OK now. I am in fact running an unusual setup (but a very
up-to-date one, so whenever you need a beta tester, I'm here).
thank you very much,
antonio


> Hi Antonio,
> 
> I think I see what's going on.  Are you using Python 2.3?  It looks  
> like Python 2.3 has changed the behavior of operator.isMappingType with  
> respect to lists.  In Python2.2, it returns 0, and in Python 2.3, it  
> returns true.
> 
> The code in Bio/config/DBRegistry.py expects lists to not be a mapping  
> type, which causes problems.  The fix is to change the following code  
> in that file:
>      if operator.isMappingType(params):
>          params = params.items()
> to:
>      if operator.isMappingType(params) and hasattr(params, "items"):
>          params = params.items()
> 
> I've made this change, and your code is working again.
> 
> I've updated this in the CVS, and it will propagate to the anonymous  
> CVS in a few hours.  Please let me know if there are further problems.
> 
> Jeff

From hoffman at ebi.ac.uk  Wed Feb 11 12:20:38 2004
From: hoffman at ebi.ac.uk (Michael Hoffman)
Date: Sat Mar  5 14:43:31 2005
Subject: [Biopython-dev] operator.isMappingType
In-Reply-To: <Pine.LNX.4.58.0402111715260.4843@laptop0.home>
References: <Pine.LNX.4.58.0402111257240.3796@laptop0.home>
	<045A60AE-5CB4-11D8-AD2C-000A956845CE@stanfordalumni.org>
	<Pine.LNX.4.58.0402111715260.4843@laptop0.home>
Message-ID: <Pine.LNX.4.58.0402111717560.6369@qnzvnan.rov.np.hx>

On Wed, 11 Feb 2004, Antonio Cavallo wrote:

> On Wed, 11 Feb 2004, Jeffrey Chang wrote:
>
> > I think I see what's going on.  Are you using Python 2.3?  It looks
> > like Python 2.3 has changed the behavior of operator.isMappingType with
> > respect to lists.  In Python2.2, it returns 0, and in Python 2.3, it
> > returns true.

I think operator.isMappingType is destined to be removed:

http://mail.python.org/pipermail/python-list/2003-November/192444.html
http://mail.python.org/pipermail/python-dev/2003-November/040307.html
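
In the meantime, here is a sketch of a check that does not rely on
operator.isMappingType at all, in the spirit of Jeff's fix: look only
for the items method that the code actually needs. The function name
as_pairs is invented here and is not the name used in DBRegistry.py, and
the import path is the one used by current Python.

```python
from urllib.parse import urlencode


def as_pairs(params):
    """Return params as a list of (key, value) pairs."""
    # Duck typing: anything offering .items() is treated as a mapping,
    # anything else is assumed to already be a sequence of pairs.
    if hasattr(params, "items"):
        return list(params.items())
    return list(params)


print(urlencode(as_pairs({"id": "AA054823"})))
print(urlencode(as_pairs([("db", "embl"), ("id", "AA054823")])))
```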
-- 
Michael Hoffman <hoffman@ebi.ac.uk>
European Bioinformatics Institute

From mcolosimo at mitre.org  Thu Feb 12 10:09:09 2004
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Sat Mar  5 14:43:31 2005
Subject: [Biopython-dev] mxTextTools link
Message-ID: <685E5F67-5D6D-11D8-9030-000A95A5D8B2@mitre.org>

In the setup.py file, the link for mxTextTools is outdated (this is 
from the current cvs files). It is listed as:

You can find mxTextTools at 
http://www.lemburg.com/files/python/mxExtensions.html.

And should be:

http://www.egenix.com/files/python/eGenix-mx-Extensions.html

Marc


From chapmanb at uga.edu  Thu Feb 12 18:54:19 2004
From: chapmanb at uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:31 2005
Subject: [Biopython-dev] mxTextTools link
In-Reply-To: <685E5F67-5D6D-11D8-9030-000A95A5D8B2@mitre.org>
References: <685E5F67-5D6D-11D8-9030-000A95A5D8B2@mitre.org>
Message-ID: <20040212235419.GC2841@evostick.agtec.uga.edu>

Hi Marc;

> In the setup.py file, the link for mxTextTools is outdated (this is 
> from the current cvs files). It is listed as:
> 
> You can find mxTextTools at 
> http://www.lemburg.com/files/python/mxExtensions.html.
> 
> And should be:
> 
> http://www.egenix.com/files/python/eGenix-mx-Extensions.html

Thanks much for the heads up -- fixed in CVS.
Brad

From chapmanb at uga.edu  Thu Feb 12 18:52:32 2004
From: chapmanb at uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:31 2005
Subject: [Biopython-dev] hmmpfam parser
In-Reply-To: <40226C51.F70315B7@ebc.uu.se>
References: <401AC30A.E639BD2E@ebc.uu.se>
	<20040203014909.GD17947@evostick.agtec.uga.edu>
	<401F8BDD.73417BCD@ebc.uu.se>
	<20040205000453.GJ907@evostick.agtec.uga.edu>
	<40226C51.F70315B7@ebc.uu.se>
Message-ID: <20040212235232.GB2841@evostick.agtec.uga.edu>

Hi Wagied;

> Hopefully this meets the guidelines.
[...New version of the parser deleted...]

I've had a good hard look at this and done a number of fix-ups and
things to try and make things more understandable to myself, easier
to maintain, and more conformant to the normal "Biopython way" of
doing things.

I've attached a new version to this mail which cleans up a number of
things:

1. Iterator and parser are separate entities, and now behave like
standard Biopython iterators and parsers.

2. Cleaned up functions -- some functions (specifically parse()) ran
over multiple pages with heavy indenting. This is really hard to
follow and maintain -- I split things into separate internal
functions.

3. Got rid of many of the constants used for standard things like
newlines. Again, these make the code harder to follow.

4. Removed some of the unnecessary variables which look like they
are left over from coding the parser.

5. Wrapped lines to 80 characters or less.

This leaves things much better and gives a better idea of where the
parser is at. If you like these changes and want to keep working on
it, I'd suggest that a couple of things are still missing which
could use coding.

1. The domains and families can be extracted as XML, but not
accessed through a class. The HmmpfamRecord class really needs to
have family_scores and parsed_domains be lists of objects which have
all of the elements (model, description, e-value...) as attributes
of these classes. An excellent example of how this is done is the
BLAST Record class in Bio/Blast/Record.py, which is also documented
at:

http://biopython.org/docs/tutorial/Tutorial004.html#toc10

2. There needs to be some similar way to access the alignments, so
that they are also parsed into classes.
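
As a starting point for discussion, here is a rough sketch of what one
such class could look like. The names FamilyScore and
parse_family_line, the whitespace-splitting approach and the sample
line are all illustrative assumptions; this is neither the attached
parser's code nor the Bio.Blast.Record design.

```python
class FamilyScore:
    """One row of the family classification table (hypothetical class)."""

    def __init__(self, model, description, score, e_value, n):
        self.model = model
        self.description = description
        self.score = score
        self.e_value = e_value
        self.n = n


def parse_family_line(line):
    """Turn one table row into a FamilyScore object."""
    # The model name is the first column and contains no spaces.
    model, rest = line.strip().split(None, 1)
    # The description may contain spaces, so peel the three numeric
    # columns off the right-hand side instead.
    description, score, e_value, n = rest.rsplit(None, 3)
    return FamilyScore(model, description, float(score),
                       float(e_value), int(n))


# Made-up row in the layout of the family classification header.
hit = parse_family_line("SH2      Src homology domain 2      139.5   1.6e-39   1")
print(hit.model, hit.description, hit.score, hit.e_value, hit.n)
```

With objects like this in family_scores, callers can write
record.family_scores[0].e_value instead of pulling values out of XML.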

I think things are really coming along well -- let me know what you
think about expanding the Record class to include the families,
domains and alignments and we can get this finished and all checked
in. Please also let me know if I messed up any of your code during my
work on it.

Hope this helps -- glad it's coming along!
Brad
-------------- next part --------------
#######################################################################
# COPYRIGHT INFORMATION
# Pfam DOMAIN RESULTS PARSER
# @AUTHOR: Wagied Davids
# @DATE: 22.01.2004
# @COPYRIGHT: Wagied Davids, 2004
#######################################################################

import sys
import string
import re
import time
from types import FileType

class HmmpfamRecord:
    '''
    Prototype class Entry structure
    @author: Wagied Davids
    @date: 22.01.2004
    @copyright: Wagied Davids, 2004
    '''
    # STATIC DATA
    FAMILY_CLASSIFICATION_HEADER = 'Scores for sequence family ' \
     'classification (score includes all domains):\nModel           ' \
     'Description                             Score    E-value  N\n' \
     '--------        -----------                             ' \
     '-----    ------- ---'
    PARSED_DOMAIN_HEADER = 'Parsed for domains:\nModel           ' \
     'Domain  seq-f seq-t    hmm-f hmm-t      score  E-value\n' \
     '--------        ------- ----- -----    ----- -----      -----  -------'
    NO_HITS= '[no hits above thresholds]'

    # STATIC REGEX OBJECTS
    REGEX_FAMILY_SCORES= re.compile(
            r'((\S.*?)\s+(\S.*?)\s+((-| )\S.*?)\s+(\S.*?)\s+(\d+))',
            re.MULTILINE | re.DOTALL )

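    def __init__( self, query, accession= None, description= None,
            family_scores= None, parsed_domains= None, alignments= None ):
        '''
        Constructor for Pfam Entry structure
        @param ( query, accession= None, description= None, 
        family_scores= None, parsed_domains= None, alignments= None )
        @return (None)
        '''
        self.query= query
        self.accession= accession
        self.description= description
        # Build fresh lists per instance; a mutable default such as [] would
        # be shared between every record created without that argument.
        if family_scores is None:
            family_scores= []
        if parsed_domains is None:
            parsed_domains= []
        if alignments is None:
            alignments= []
        self.family_scores= family_scores
        # FAMILY SCORES HITLIST FOR SCORE ENTRIES
        self.family_scores_hitlist= []
        self.parsed_domains= parsed_domains
        self.alignments= alignments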
    
    def __str__( self ):
        '''
        Retrieves a string representation of parser entry class
        @param (None)
        @return (String:  representation of HmmpfamRecord class)
        '''
        strBuffer= '' 
        strBuffer= strBuffer + "<HMMER>\n" 
        strBuffer= strBuffer + "\t<QUERY>%s</QUERY>\n" % (self.get_query())
        strBuffer= strBuffer + "\t<ACCESSION>%s</ACCESSION>\n" \
                % (self.get_accession())
        strBuffer= strBuffer + "\t<DESCRIPTION>%s</DESCRIPTION>\n" \
                % (self.get_description())
        strBuffer= strBuffer + "\t%s" % (self.get_family_scores_ml())
        strBuffer= strBuffer + "\t%s" % (self.get_parsed_domains_ml())
        strBuffer= strBuffer + "\t<ALIGNMENTS>%s</ALIGNMENTS>\n" \
                % (self.get_alignments())
        strBuffer= strBuffer + "</HMMER>" 
        
        return strBuffer
      
    def get_query( self ):
        '''
        Retrieves the QUERY
        @param (None)
        @return (String: QUERY )
        '''
        return self.query

    def get_accession( self ):
        '''
        Retrieves the ACCESSION
        @param (None)
        @return (String: ACCESSION)
        '''
        return self.accession

    def get_description( self ):
        '''
        Retrieves the DESCRIPTION
        @param (None)
        @return (String: DESCRIPTION)
        '''
        return self.description

    def get_family_scores_raw(self):
        '''
        Retrieves a list of FAMILY SCORES
        @param (None)
        @return (List: FAMILY SCORES)
        '''
        return self.family_scores

    def get_no_of_family_entries(self):
        '''
        Retrieves the number of hits per query
        @param (None)
        @return (Integer: number of hits per query)
        '''
        return len(self.family_scores)

    def get_family_scores_ml( self ):
        '''
        FINE-GRAINED CONTROL OVER FAMILY CLASSIFICATION AND SCORE RESULTS
        @param (None)
        @return (String: Marked-Up text format of Family Classification Scores)
        '''
        # BEGIN FAMILY_SCORE_LIST TAG
        family_scores= "<FAMILY_SCORES_HITLIST>\n" 
        family_model= '' 
        family_description= '' 
        family_score_value= '' 
        family_e_value= '' 
        family_n_value= '' 
        family_scores_counter= 1 
        
        for score_entry in self.get_family_scores_raw():
            MatchScoreEntry= HmmpfamRecord.REGEX_FAMILY_SCORES.search(score_entry)
            if MatchScoreEntry != None:
                # BEGIN FAMILY_SCORE_HIT TAG
                family_scores= family_scores + \
                        "\t\t<FAMILY_SCORE_HIT= %d>\n" \
                        % ( family_scores_counter )
                # EXTRACT INFORMATION FROM MATCH_SCORE_ENTRY
                # MatchScoreEntry.group( 1 ) equals WHOLE ENTRY
                family_model= MatchScoreEntry.group( 2 )
                family_description= MatchScoreEntry.group( 3 )
                family_score_value= MatchScoreEntry.group( 4 )
                # MatchScoreEntry.group( 5 ) equals '-' IF PRESENT
                family_e_value= MatchScoreEntry.group( 6 )
                family_n_value= MatchScoreEntry.group( 7 )

                for data, tag in [(family_model, "FAMILY_SCORE_MODEL"),
                     (family_description, "FAMILY_DESCRIPTION"),
                     (family_score_value, "FAMILY_SCORE_VALUE"),
                     (family_e_value, "FAMILY_E_VALUE"),
                     (family_n_value, "FAMILY_N_VALUE")]:
                    family_scores += "\t\t\t<%s>%s</%s>\n" % \
                            (tag, data, tag)

                # COMPLETE FAMILY_SCORE_HIT TAG
                family_scores= family_scores + "\t\t</FAMILY_SCORE_HIT>\n" 

                # INCREMENT family_scores_counter
                family_scores_counter= family_scores_counter + 1 

        # COMPLETE FAMILY_SCORE_LIST TAG
        family_scores= family_scores + "\t</FAMILY_SCORES_HITLIST>\n" 
        return family_scores
    
    def get_parsed_domains_raw(self):
        '''
        Retrieves a list of PARSED DOMAINS
        @param (None)
        @return (List: PARSED DOMAINS)
        '''
        return self.parsed_domains

    def get_no_of_parsed_domains( self ):
        '''
        Retrieves the number of parsed hits per query
        @param (None)
        @return (Integer: number of parsed hits per query)
        '''
        return len( self.parsed_domains )

    def get_parsed_domains_ml( self ):
        '''
        FINE-GRAINED CONTROL OVER PARSED DOMAINS AND SCORE RESULTS
        @param (None)
        @return (String: Marked-Up text format of Parsed Domain section)
        '''
        parsed_domain_list= []
        parsed_model= '' 
        parsed_domain_number= '' 
        parsed_domain_seq_f= '' 
        parsed_domain_seq_t= '' 
        parsed_domain_hmm_f= '' 
        parsed_domain_hmm_t= '' 
        parsed_domain_2_dots= '' 
        parsed_domain_brackets= '' 
        parsed_domain_score= '' 
        parsed_domain_e_value= '' 
        parsed_domains_counter= 1 


        # BEGIN PARSED_DOMAINS_LIST TAG
        parsed_domains= '<PARSED_DOMAINS_HITLIST>\n' 
    
        for domain in self.get_parsed_domains_raw():
            # IF NO_HITS NOT FOUND, THEN EXTRACT DATA
            if string.find( domain, HmmpfamRecord.NO_HITS ) < 0:
                parsed_domain_list= string.split( domain )
                parsed_model= parsed_domain_list[0]
                parsed_domain_number= parsed_domain_list[1]
                parsed_domain_seq_f= parsed_domain_list[2]
                parsed_domain_seq_t= parsed_domain_list[3]
                #parsed_domain_2_dots= parsed_domain_list[4]            
                parsed_domain_hmm_f= parsed_domain_list[5]
                parsed_domain_hmm_t= parsed_domain_list[6]
                #parsed_domain_brackets= parsed_domain_list[7] 
                parsed_domain_score= parsed_domain_list[8]
                parsed_domain_e_value= parsed_domain_list[9]

                # BEGIN PARSED_DOMAIN_HIT TAG
                parsed_domains= parsed_domains + \
                  "\t\t<PARSED_DOMAIN_HIT= %d>\n" % (parsed_domains_counter)

                # FORMAT ENTRY TAGS
                for data, tag in [(parsed_model, "PARSED_MODEL"),
                            (parsed_domain_number, "PARSED_DOMAIN_NUMBER"),
                            (parsed_domain_seq_f, "PARSED_DOMAIN_SEQ_F"),
                            (parsed_domain_seq_t, "PARSED_DOMAIN_SEQ_T"),
                            (parsed_domain_hmm_f, "PARSED_DOMAIN_HMM_F"),
                            (parsed_domain_hmm_t, "PARSED_DOMAIN_HMM_T"),
                            (parsed_domain_score, "PARSED_DOMAIN_SCORE"),
                            (parsed_domain_e_value, "PARSED_DOMAIN_E_VALUE")]:
                    parsed_domains += "\t\t\t<%s>%s</%s>\n" % \
                            (tag, data, tag)

                # COMPLETE PARSED_DOMAIN_HIT TAG
                parsed_domains= parsed_domains + "\t\t</PARSED_DOMAIN_HIT>\n" 

                # INCREMENT parsed_domains_counter
                parsed_domains_counter= parsed_domains_counter + 1 

            else:
                # NO_HITS FOUND
                return domain

        # COMPLETE PARSED_DOMAINS_LIST TAG
        parsed_domains= parsed_domains + '</PARSED_DOMAINS_HITLIST>\n' 
        return parsed_domains
        
    def get_alignments( self ):
        '''
        Retrieves a list of TOP SCORING ALIGNMENTS
        @param (None)
        @return (List: TOP SCORING ALIGNMENTS)
        '''
        return self.alignments

    def get_regex_family_scores( self ):
        '''
        Retrieves the Regex object for Pfam family scores
        @param (None)
        @return (Regex: Regex object for Pfam family scores)
        '''
        return HmmpfamRecord.REGEX_FAMILY_SCORES

class Iterator:
    """Iterate over a hmmpfam result file one record at a time.
    """
    def __init__(self, handle, parser = None):
        """Initialize with a handle to the hmmpfam output and optional parser.
        """
        if type(handle) is not FileType and type(handle) is not InstanceType:
            raise ValueError, "I expected a file handle or file-like object"
        self._handle = handle
        self._parser = parser

    def __iter__(self):
        return iter(self.next, None)

    def next(self):
        """Return the next hmmpfam output record, parsed if appropriate.
        """
        lines = []
        while 1:
            line= self._handle.readline()
            if not line:
                break
            # Pfam ENTRY DETECTED
            if line.find('Query sequence:') == 0:
                lines.append(line.rstrip())
                while 1:
                    line= self._handle.readline()
                    if not line:
                        break 
                    lines.append(line.rstrip())
                    if line.find("//") == 0:
                        break
                # STOP AFTER ONE RECORD SO THE NEXT CALL STARTS AT THE NEXT ENTRY
                break
        
        if len(lines) == 0: # nothing left
            return None
        else:
            if self._parser:
                data = "\n".join(lines)
                return self._parser.parse(data)
            else:
                return "\n".join(lines)

class RecordParser:
    '''
    Prototype class for parsing hmmpfam output
    @author: Wagied Davids
    @date: 22.01.2004
    @copyright: Wagied Davids, 2004
    '''
    # STATIC REGEX OBJECTS
    REGEX_HMM_ENTRY= re.compile( r'(Query sequence:\s+\S.*\s+//)',
            re.MULTILINE | re.DOTALL )
    REGEX_HMM_QUERY= re.compile( r'Query sequence:\s+(\S.*?)\s+Accession',
            re.MULTILINE | re.DOTALL )
    REGEX_HMM_ACC= re.compile( r'Accession:\s+(\S.*?)\s+Description',
            re.MULTILINE | re.DOTALL )
    REGEX_HMM_DESCRIPTION= re.compile( r'Description:\s+(\S.*?)\s+Scores',
            re.MULTILINE | re.DOTALL )
    REGEX_HMM_SEQ_FAMILY_SCORES= re.compile( r'(Scores\s+\S.*)\s+Parsed',
            re.MULTILINE | re.DOTALL )
    REGEX_HMM_PARSED_DOMAINS= re.compile(
            r'(Parsed for domains:\s+\S.*)\s+Alignments',
            re.MULTILINE | re.DOTALL )
    REGEX_HMM_ALIGNMENTS= re.compile(
            r'(Alignments of top-scoring domains:\s+\S.*)\s+//',
            re.MULTILINE | re.DOTALL )

    def __init__(self):
        '''
        Constructor for RecordParser
        @param (None)
        @return (None)
        '''
        self.debug= 0 
     
    def set_debug( self, debug= 0 ):
        '''
        Sets the debug level when parsing
        debug= 0 No debug information
        debug= 1 Pfam Entry level debug information  
        debug= 2 Regex level debug information
        debug= 3 Incoming data 
        @param (Integer representing the verbosity/ debug level)
        @return (None)
        '''
        self.debug= debug

    def _print_debug(self, level, info):
        """Simple method to print out debug info if it matches a given level.
        """
        if level == self.debug:
            sys.stdout.write(info + "\n")

    def parse(self, data_entry):
        """Parse a single hmmpfam record.

        Returns the record parsed into an HmmpfamRecord class.
        Raises ValueError if no hmmpfam entry is found in the data.
        """
        if self.debug == 3:
            print data_entry
        
        # MATCH ENTRY STRUCTURE
        match_hmm_entry= RecordParser.REGEX_HMM_ENTRY.search(data_entry)
        if self.debug == 2:
            # USE THE CLASS REGEX: THE MATCH OBJECT IS None WHEN NOTHING MATCHES
            print "%s: %s" % (match_hmm_entry,
                    RecordParser.REGEX_HMM_ENTRY.pattern)
        if match_hmm_entry is None:
            # FAIL CLEARLY INSTEAD OF RETURNING AN UNBOUND 'record'
            raise ValueError, "No hmmpfam entry found in the data"

        entry = match_hmm_entry.group(1)
        query, accession, description = self._parse_query_info(entry)
        family_scores_list = self._parse_family_scores(entry)
        parsed_domains_list = self._parse_domains(entry)
        domain_alignments_list = self._parse_alignments(entry)

        # Construct Pfam Entry structure
        record = HmmpfamRecord(query, accession, description,
                family_scores_list, parsed_domains_list,
                domain_alignments_list )

        if self.debug == 1:
            print "%s => %s" % ( record.get_query(), record.get_description() )
            print record.get_family_scores_ml()
            print record.get_parsed_domains_ml()

        return record

    def _parse_query_info(self, entry):
        """Retrieve the query name, accession and description.
        """
        hmm_query, hmm_accession, hmm_description = ('', '', '')
        # MATCH QUERY SEQUENCE
        match_hmm_query= RecordParser.REGEX_HMM_QUERY.search(entry)
        if self.debug == 2:
            # USE THE CLASS REGEX: THE MATCH OBJECT IS None WHEN NOTHING MATCHES
            print "%s: %s" % ( match_hmm_query,
                    RecordParser.REGEX_HMM_QUERY.pattern )
        if match_hmm_query is not None:
            hmm_query= match_hmm_query.group(1)

        # MATCH ACCESSION
        match_hmm_accession= RecordParser.REGEX_HMM_ACC.search( entry ) 
        if self.debug == 2:
            print "%s: %s" % ( match_hmm_accession,
                    RecordParser.REGEX_HMM_ACC.pattern )
        if match_hmm_accession is not None:
            hmm_accession= match_hmm_accession.group( 1 )

        # MATCH DESCRIPTION
        match_hmm_description= RecordParser.REGEX_HMM_DESCRIPTION.search( entry )
        if self.debug == 2:
            print "%s: %s" % ( match_hmm_description,
                    RecordParser.REGEX_HMM_DESCRIPTION.pattern )
        if match_hmm_description is not None:
            hmm_description= match_hmm_description.group(1)

        return hmm_query, hmm_accession, hmm_description

    def _parse_family_scores(self, entry):
        """Retrieve the family scores from the hmmpfam search.
        """
        match_hmm_scores= RecordParser.REGEX_HMM_SEQ_FAMILY_SCORES.search(entry)
        if self.debug == 2:
            # USE THE CLASS REGEX: THE MATCH OBJECT IS None WHEN NOTHING MATCHES
            print "%s: %s" % (match_hmm_scores,
                    RecordParser.REGEX_HMM_SEQ_FAMILY_SCORES.pattern)
        
        family_scores_list = []
        if match_hmm_scores != None:
            hmm_scores= match_hmm_scores.group( 1 )
            family_scores_info_list= string.split(
                    hmm_scores, "\n")
            
            # NOTE: LAST ELEMENT = EMPTY SPACE
            family_scores_list = family_scores_info_list[ 3: -1 ]
        return family_scores_list

    def _parse_domains(self, entry):
        """Parse domain information from the hmmpfam output.
        """
        match_hmm_parsed_domains= RecordParser.REGEX_HMM_PARSED_DOMAINS.search( entry )

        if self.debug == 2:
            # USE THE CLASS REGEX: THE MATCH OBJECT IS None WHEN NOTHING MATCHES
            print "%s: %s" % ( match_hmm_parsed_domains,
                    RecordParser.REGEX_HMM_PARSED_DOMAINS.pattern )
           
        parsed_domains_list = []
        if match_hmm_parsed_domains != None:
            hmm_domains= match_hmm_parsed_domains.group(1)
            parsed_domains_info_list= string.split(hmm_domains, "\n")

            # NOTE: LAST ELEMENT = EMPTY SPACE
            parsed_domains_list= parsed_domains_info_list[3: -1]
        return parsed_domains_list

    def _parse_alignments(self, entry):
        """Parse out alignment information from the hmmpfam output.
        """
        match_hmm_alignments= RecordParser.REGEX_HMM_ALIGNMENTS.search(  entry )
        if self.debug == 2:
            # USE THE CLASS REGEX: THE MATCH OBJECT IS None WHEN NOTHING MATCHES
            print "%s: %s" % (match_hmm_alignments,
                    RecordParser.REGEX_HMM_ALIGNMENTS.pattern)
            
        # DEFAULT TO AN EMPTY LIST SO THE RETURN BELOW IS ALWAYS DEFINED
        domain_alignments_list = []
        if match_hmm_alignments is not None:
            hmm_alignments= match_hmm_alignments.group(1)
            domain_alignments_info_list= string.split(hmm_alignments, "\n")
            domain_alignments_list= domain_alignments_info_list[3:-2]

        return domain_alignments_list

    def get_regex_hmm_entry( self ):
        '''
        Retrieves the Regex object for REGEX_HMM_ENTRY
        @param (None)
        @return (Regex: HMM_ENTRY)
        '''
        return RecordParser.REGEX_HMM_ENTRY

    def get_regex_query( self ):
        '''
        Retrieves the Regex object for REGEX_HMM_QUERY
        @param (None)
        @return (Regex: REGEX_HMM_QUERY)
        '''
        return RecordParser.REGEX_HMM_QUERY
    
    def get_regex_accession(self):
        '''
        Retrieves the Regex object for REGEX_HMM_ACC
        @param (None)
        @return (Regex: REGEX_HMM_ACC)
        '''
        return RecordParser.REGEX_HMM_ACC
        
    def get_regex_description( self ):
        '''
        Retrieves the Regex object for REGEX_HMM_DESCRIPTION
        @param (None)
        @return (Regex: REGEX_HMM_DESCRIPTION)
        '''
        return RecordParser.REGEX_HMM_DESCRIPTION

    def get_regex_family_scores( self ):
        '''
        Retrieves the Regex object for REGEX_HMM_SEQ_FAMILY_SCORES
        @param (None)
        @return (Regex: REGEX_HMM_SEQ_FAMILY_SCORES)
        '''
        return RecordParser.REGEX_HMM_SEQ_FAMILY_SCORES

    def get_regex_parsed_domains( self ):
        '''
        Retrieves the Regex object for REGEX_HMM_PARSED_DOMAINS
        @param (None)
        @return (Regex: REGEX_HMM_PARSED_DOMAINS)
        '''
        return RecordParser.REGEX_HMM_PARSED_DOMAINS

    def get_regex_alignments( self ):
        '''
        Retrieves the Regex object for REGEX_HMM_ALIGNMENTS
        @param (None)
        @return (Regex: REGEX_HMM_ALIGNMENTS)
        '''
        return RecordParser.REGEX_HMM_ALIGNMENTS
    
    def __str__( self ):
        '''
        Retrieves a string representation of parser class
        @param (None)
        @return (String: Retrieves a string representation of parser class)
        '''
        strBuffer= 'ParserType: RecordParser' 
        return strBuffer
        
# __END__
-------------- next part --------------
#!/usr/bin/env python

###################################################################### */
# COPYRIGHT INFORMATION
# Test program for Pfam domain results parser
# @AUTHOR: Wagied Davids
# @DATE: 22.01.2004
# @COPYRIGHT: Wagied Davids, 2004
###################################################################### */

import Hmmpfam

# Module level re-name

# DATA LOCATION
filename= 'hmmpfam_output.example'
handle = open(filename, "r")

# INSTANTIATE Parser with debugging info
parser= Hmmpfam.RecordParser()
# parser.set_debug(1)

iterator = Hmmpfam.Iterator(handle, parser)
for rec in iter(iterator):
    print "--> %s : %s : %s" % (rec.query, rec.accession, rec.description)
    print rec.get_family_scores_ml()
    print rec.get_parsed_domains_ml()
From mcolosimo at mitre.org  Fri Feb 27 16:25:20 2004
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Sat Mar  5 14:43:31 2005
Subject: [Biopython-dev] GenBank bug, oriT feature missing
Message-ID: <723CEEEE-696B-11D8-ABCF-000A95A5D8B2@mitre.org>

Hi,

I've just spent a good part of a day trying to understand what was 
going wrong and I think I finally know. Here is the problem:

I was getting this exception for reading in a GenBank file (from 
genbank):

"/usr/lib/python2.2/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
     raise exception
Martel.Parser.ParserPositionException: error parsing at or beyond character 1981

After digging into the GenBank code (__init__.py) and then into 
Martel's code, I found I could turn on debugging:

GenBank.FeatureParser(debug_level=2)

I finally see where things die (and what character 1981 means).

For AE000070 there is a feature tag "oriT", which seems to be missing 
from genbank_record.py and __init__.py:

      oriT            81..92
                      /note="region including origin of transfer (oriT) almost
                      identical to oriT regions of plasmids from the 'Q-group'"
                      /evidence=not_experimental

This really isn't a pretty way of dealing with unknown features. Is 
there a way to get this to just pass unknown features?

Thanks,

Marc


From idoerg at burnham.org  Fri Feb 27 18:55:37 2004
From: idoerg at burnham.org (Iddo Friedberg)
Date: Sat Mar  5 14:43:31 2005
Subject: [Biopython-dev] GenBank bug, oriT feature missing
In-Reply-To: <723CEEEE-696B-11D8-ABCF-000A95A5D8B2@mitre.org>
References: <723CEEEE-696B-11D8-ABCF-000A95A5D8B2@mitre.org>
Message-ID: <403FD8F9.8000908@burnham.org>

I agree that these things should be handled better. How about raising 
an UnknownFeature exception that is not silenced by default? The user 
can then decide whether the parser should trap & silence such an 
exception when it occurs.
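
A rough sketch of what I mean, with made-up names throughout 
(UnknownFeature, KNOWN_FEATURE_KEYS and check_feature_key are 
hypothetical, not existing Biopython code), and with the key list cut 
down to a few entries for illustration:

```python
class UnknownFeature(Exception):
    """Raised when a feature key is not in the known list (hypothetical)."""

    def __init__(self, key):
        Exception.__init__(self, "unknown feature key: %r" % key)
        self.key = key


# Illustrative subset only; the real list lives in the GenBank format code.
KNOWN_FEATURE_KEYS = frozenset(["source", "gene", "CDS"])


def check_feature_key(key, strict=True, skipped=None):
    """Return True if the key is known.

    With strict=True (the default) an unknown key raises UnknownFeature.
    With strict=False the key is appended to `skipped` (if given) and
    False is returned, so the caller can carry on.
    """
    if key in KNOWN_FEATURE_KEYS:
        return True
    if strict:
        raise UnknownFeature(key)
    if skipped is not None:
        skipped.append(key)
    return False
```

The default stays loud, so nobody loses a feature without noticing; 
silencing has to be asked for explicitly.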

./I

Marc Colosimo wrote:
> Hi,
> 
> I've just spent a good part of a day trying to understand what was going 
> wrong and I think I finally know. Here is the problem:
> 
> I was getting this exception for reading in a GenBank file (from genbank):
> 
> "/usr/lib/python2.2/site-packages/_xmlplus/sax/handler.py", line 38, in 
> fatalError
>     raise exception
> Martel.Parser.ParserPositionException: error parsing at or beyond 
> character 1981
> 
> After digging into the GenBank code (__init.py__) and then into Martel's 
> code. I found I could turn on debugging:
> 
> GenBank.FeatureParser(debug_level=2)
> 
> I finally see where things die (and what character 1981 means).
> 
> for AE000070 there is a  feature tag "oriT", which seems to be missing 
> from genbank_record.py and __init__.py
> 
>      oriT            81..92
>                      /note="region including origin of transfer (oriT) 
> almost
>                      identical to oriT regions of plasmids from the 
> 'Q-group'"
>                      /evidence=not_experimental
> 
> This really isn't a pretty way of dealing with unknown features. Is 
> there a way to get this to just pass unknown features?
> 
> Thanks,
> 
> Marc
> 
> 

-- 
Iddo Friedberg, Ph.D.
The Burnham Institute
10901 N. Torrey Pines Rd.
La Jolla, CA 92037
USA
Tel: +1 (858) 646 3100 x3516
Fax: +1 (858) 713 9930
http://ffas.ljcrf.edu/~iddo


From mcolosimo at mitre.org  Sat Feb 28 22:03:08 2004
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Sat Mar  5 14:43:31 2005
Subject: [Biopython-dev] GenBank bug, oriT feature missing
In-Reply-To: <200402281154.38515.Peter.Bienstman@UGent.be>
References: <723CEEEE-696B-11D8-ABCF-000A95A5D8B2@mitre.org>
	<403FD8F9.8000908@burnham.org>
	<200402281154.38515.Peter.Bienstman@UGent.be>
Message-ID: <4041566C.1060305@mitre.org>

Thanks, it works on that case now. I'll look to see where you added that 
so that if I run into another unknown tag I can add it.

Raising an UnknownFeature exception would be nice. But from what little 
I know about how it parses, how could you re-enter parsing afterwards? 
Maybe by creating a different FeatureParser to handle unknown features 
(WeakFeatureParser, maybe?)
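
Something like this toy sketch is what I have in mind. It assumes the 
features have already been split into (key, location) pairs without 
consulting the key list, which may well not match how Martel actually 
works; ToyFeatureParser, handle_unknown and the short KNOWN tuple are 
invented for illustration and are not the real GenBank classes:

```python
class UnknownFeature(Exception):
    """Hypothetical exception for a feature key missing from the list."""


KNOWN = ("source", "gene", "CDS")  # illustrative subset only


class ToyFeatureParser(object):
    """Strict stand-in: stops at the first unknown feature key."""

    def handle_unknown(self, key, location):
        raise UnknownFeature(key)

    def parse(self, features):
        # `features` is a list of (key, location) pairs produced by an
        # earlier pass that does not care what the key is.
        parsed = []
        for key, location in features:
            if key not in KNOWN:
                self.handle_unknown(key, location)
                continue
            parsed.append((key, location))
        return parsed


class WeakFeatureParser(ToyFeatureParser):
    """Lenient variant: remembers unknown features and keeps going."""

    def __init__(self):
        self.unknown = []

    def handle_unknown(self, key, location):
        self.unknown.append((key, location))
```

Because the unknown key is handled per feature inside the loop, there is 
no need to re-enter the parse after an exception; the weak parser simply 
never raises.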

Marc

Peter Bienstman wrote:

>-----BEGIN PGP SIGNED MESSAGE-----
>Hash: SHA1
>
>That would be a good solution. As a short term fix however, I've added the 
>oriT tag to genbank_format.py in CVS.
>
>Peter
>


From chapmanb at uga.edu  Sun Feb 29 17:17:58 2004
From: chapmanb at uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:31 2005
Subject: [Biopython-dev] GenBank bug, oriT feature missing
In-Reply-To: <4041566C.1060305@mitre.org>
References: <723CEEEE-696B-11D8-ABCF-000A95A5D8B2@mitre.org>
	<403FD8F9.8000908@burnham.org>
	<200402281154.38515.Peter.Bienstman@UGent.be>
	<4041566C.1060305@mitre.org>
Message-ID: <20040229221758.GH24150@evostick.agtec.uga.edu>

Hey guys;

[Marc reports yet another new feature tag added to GenBank files]
> Martel.Parser.ParserPositionException: error parsing at or beyond
> character 1981
> 
> After digging into the GenBank code (__init.py__) and then into Martel's
> code. I found I could turn on debugging:
> 
> GenBank.FeatureParser(debug_level=2)
> 
> I finally see where things die (and what character 1981 means).
> 
> for AE000070 there is a  feature tag "oriT", which seems to be missing
> from genbank_record.py and __init__.py

[And makes a useful suggestion that others second (and third...)]
> This really isn't a pretty way of dealing with unknown features. Is
> there a way to get this to just pass unknown features?

Yes, I completely agree that this is a pain. The problem is an
unfortunate design decision where the format used to parse the files
uses a hard-coded list of tags. This made sense when it was
originally designed since there are supposed to be a restricted set
of feature and qualifier key names that can be used. Unfortunately,
it's turned into a headache for everyone since NCBI keeps adding
tags.

I've decided to get rid of this and just checked in a series of
changes to CVS that update the genbank format. The new format
matches names with a general regular expression (basically \w, plus
some additional characters that get used like ' and - ) instead of
a fixed list, so new tags should no longer break the parser.
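
To give a flavour of the idea, here is a small sketch using plain re
rather than the actual Martel expression in CVS. The pattern, the name
FEATURE_KEY_LINE and the assumption that a key line starts with
exactly five spaces are mine, for illustration only:

```python
import re

# Hypothetical pattern: five leading spaces, a key made of word
# characters plus ' and -, some whitespace, then the location.
FEATURE_KEY_LINE = re.compile(r"^ {5}([\w'-]+)\s+(\S+)")


def split_feature_line(line):
    """Return (key, location) for a feature key line, else None."""
    match = FEATURE_KEY_LINE.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(split_feature_line("     oriT            81..92"))
```

A key like oriT gets through without anyone having to add it to a
list, while more deeply indented qualifier lines such as /note= do not
match and are left for the qualifier handling.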

In the process of making these changes I've also done a general
cleanup of the format file and merged it with the old (but still
with plenty of useful bits of code) format in
Bio.expressions.genbank. I've moved Bio/GenBank/genbank_format.py to
Bio/expressions/genbank.py -- so for those of you who look at it or
change it (thanks Peter!), you now need to look there.

So, long story short -- I hope I fixed this problem for the future.
Please do give the new version in CVS a go and let me know if it has
any problems on your files. Sorry about the pain and thanks for 
the report!

Brad