[Bioperl-l] Parsing fuzzy locations

Mark Dalphin mdalphin@amgen.com
Thu, 14 Dec 2000 16:05:32 -0800


"Strassel, Chris" wrote:

> I am about to embark on trying to write a parser for fuzzy feature
> locations, and was hoping to gather some advice before starting.
>
> Does anyone have any experience that they would be willing to share?
> Pitfalls? Is this already done somewhere? Is it totally hopeless?

I have recently written a Genbank LOCATION parser (just the location part, not
the whole Feature Table; I do that elsewhere in Perl). It parses the fuzzy
locations without a problem.  I wrote it in flex and bison, creating a function
in C, which I then wrapped in a Perl XS subroutine.  It is quick and stable
enough for our purposes.

Here are some of the problems I encountered.  First, the grammar specified at
NCBI for the Feature Table locations:
http://www.ncbi.nlm.nih.gov/collab/FT/index.html lists a Backus-Naur
representation of the grammar which is not completely correct.  For example, it
does not specify parens anywhere except in functions like "join()".  Yet
Genbank releases locations like "(9.10)..(20.22)".  By the specified grammar,
that should be "9.10..20.22".  There were some minor programming problems I
encountered along the way as well, but as this was my first attempt at
flex/bison, I can't really complain. It was also less than pretty for me to
return an array from C into Perl via XS.  That too was due to inexperience.
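
To make the parenthesis mismatch concrete, here is a minimal Python sketch (not
my flex/bison code; the helper names strip_parens and split_range are
hypothetical) that accepts both the form Genbank actually releases and the form
the published grammar implies, and reduces them to the same pair of endpoints:

```python
def strip_parens(pos):
    """Return pos without one enclosing pair of parentheses, if present."""
    if pos.startswith("(") and pos.endswith(")"):
        return pos[1:-1]
    return pos


def split_range(text):
    """Split 'begin..end' into its two endpoints, with parentheses removed."""
    # partition() splits at the first "..": a single "." inside "9.10"
    # is never mistaken for the range operator.
    begin, sep, end = text.partition("..")
    if not sep:
        raise ValueError("not a range: %r" % text)
    return strip_parens(begin), strip_parens(end)


# Both spellings reduce to the same endpoints.
print(split_range("(9.10)..(20.22)"))
print(split_range("9.10..20.22"))
```

A real tokenizer would do this at the grammar level rather than with string
slicing; the point is only that both spellings have to be accepted.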

I did look at writing the parser in pure Perl, using "Parse::RecDescent" (see
D.Conway, "The man(1) of descent", The Perl Journal, 12:46-58, winter 1998).  I
suspect the grammar I developed (modified from the Genbank B-N form) would
almost work for Parse::RecDescent, but some of the recursions might need to be
re-ordered.  I went the flex/bison route as we have other programmers who wanted
a parser that could be accessed via C and C++.
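
For anyone weighing the recursive-descent route, the nesting of operators is the
part that recurses.  The following is a hand-written Python sketch of that idea
only; it is not Parse::RecDescent and not my bison grammar, and the function
names and the tuple output format are assumptions made for illustration:

```python
def split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(text[start:i])
            start = i + 1
    parts.append(text[start:])
    return parts


def parse_expr(text):
    """Recurse through complement()/join()/order(); leaves stay as strings."""
    for op in ("complement", "join", "order"):
        prefix = op + "("
        if text.startswith(prefix) and text.endswith(")"):
            inner = text[len(prefix):-1]
            return (op, [parse_expr(part) for part in split_top_level(inner)])
    # Anything else is treated as a single span such as "10..20".
    return ("span", text)


tree = parse_expr("complement(join(1..5,(9.10)..20))")
print(tree)
```

The sketch does no validation of the leaf spans and assumes balanced
parentheses; a generated parser gives you both of those checks for free.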

I have not looked at giving it to BioPerl in part because linking in the C
function would be a problem for most users. I also am not sure how BioPerl plans
to carry fuzzy locations; I have my parser return a ref to an array of refs to
arrays. The final level of arrays contains:
    [0]=AccNum [1]=isComplement(0 or 1)
    [2]=Beg-Position   [3]=Beg-Fuzzy-Type  [4]=Beg-Fuzzy-Amount
    [5]=End-Position   [6]=End-Fuzzy-Type  [7]=End-Fuzzy-Amount

Examples:
    10..20  -->  [0]='', [1]=0, [2]=10, [3]=undef, [4]=undef,
                 [5]=20, [6]=undef, [7]=undef
    (9.10)..20  -->  as above, except: [2]=9, [3]='dot', [4]=1
    <9..20  -->  as above, except: [2]=9, [3]='<', [4]=undef
    complement(AC000134:10..50)  -->  [0]='AC000134', [1]=1, [2]=10, [3]=undef,
                 [4]=undef, [5]=50, [6]=undef, [7]=undef
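
The record layout can be sketched in Python as follows.  This is an
illustration of the eight-slot format only, not the C/XS parser itself: the
function names are hypothetical, Python's None stands in for Perl's undef, and
computing the 'dot' amount as the difference between the two bounds is an
assumption based on the single "(9.10)" example above.  It handles one simple
span and no join():

```python
import re

# One endpoint: "(9.10)", "9.10", "<9", ">9" or "9".
_POS = re.compile(r"^(?:\((\d+)\.(\d+)\)|(\d+)\.(\d+)|([<>]?)(\d+))$")


def parse_position(text):
    """Return (position, fuzzy_type, fuzzy_amount) for one endpoint."""
    m = _POS.match(text)
    if m is None:
        raise ValueError("bad position: %r" % text)
    plo, phi, lo, hi, sign, exact = m.groups()
    if exact is not None:
        return int(exact), (sign or None), None
    low = int(plo if plo is not None else lo)
    high = int(phi if phi is not None else hi)
    return low, "dot", high - low  # amount: assumed to be high - low


def parse_location(text):
    """Return a list holding one 8-element record for a simple span."""
    complement = 0
    if text.startswith("complement(") and text.endswith(")"):
        complement = 1
        text = text[len("complement("):-1]
    acc, _, rest = text.rpartition(":")  # acc is '' when no accession given
    begin, sep, end = rest.partition("..")
    if not sep:
        raise ValueError("not a range: %r" % text)
    b, e = parse_position(begin), parse_position(end)
    return [[acc, complement, b[0], b[1], b[2], e[0], e[1], e[2]]]


print(parse_location("complement(AC000134:10..50)"))
```

The outer list mirrors the array of array refs; a location with several spans
would simply carry one record per span.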

Let me know if you wish to see any of this: the tokenizer input for flex, the
grammar for bison, the wrapper for Perl XS. I even have a makefile for both the
SGI running Irix and our DEC Alphas (running a different version of Perl; this
makes the XS output incompatible! Yuck!).

Cheers,
Mark

PS I still think that I should learn ASN.1 and use the NCBI parser directly from
the NCBI toolkit.  Avoid all the trouble of re-inventing the wheel.  I advocate
this, despite the fact that I don't enjoy the NCBI code.  In my case, I am
actually getting the "locations" from a non-Genbank source as well as Genbank,
so I needed to create this parser.

--
Mark Dalphin                          email: mdalphin@amgen.com
Mail Stop: 29-2-A                     phone: +1-805-447-4951 (work)
One Amgen Center Drive                       +1-805-375-0680 (home)
Thousand Oaks, CA 91320                 fax: +1-805-499-9955 (work)