[Bioperl-l] location parsing refactored

Hilmar Lapp hlapp@gnf.org
Mon, 12 Aug 2002 02:03:01 -0700


I just finished adapting FTHelper and the SeqIO parsers for genbank, 
embl, and swissprot to use the new LocationFactoryI framework. You 
can now pass in your own location string parser if you want by 
calling $seqio_stream->location_factory($mylocationfactory). The 
default instance parses Genbank-compatible location strings.

All tests pass. This doesn't necessarily mean that all 
Genbank/EMBL/SwissProt locations would be properly parsed. It'd be 
great if someone can try it on a large-scale, e.g. entire Swissprot 
or Genbank, and collect the trouble-makers. (Ideally, grab the 
location string that's causing trouble and add it to the test cases 
in t/LocationFactory.t.)

There are more consequences from the refactoring: as decided, the 
root split location now doesn't have a strand anymore. If you set 
the strand of the root split location, it propagates to all 
sublocations that are not remote. $splitloc->strand() returning 
undef may pose a problem for feature renderers: you can't treat 
features with split locations the same anymore as those with other 
location types. Instead, you need to inspect the sublocations of 
$feat->location().

Is this really the behaviour we want? We could add convenience code 
that returns a strand if all non-remote sublocations have the same. 
Any votes on that?

Note also that I simplified the location of the last feature in 
testfuzzy.genbank: complement(join(...,complement(...))) is not 
really what you'd see in reality I'd think (complement of complement 
is plus strand, right? However $loc->strand(-1) doesn't invert the 
sign.).

	-hilmar

On Sunday, August 11, 2002, at 09:53  PM, Hilmar Lapp wrote:

> I introduced Bio::Factory::LocationFactoryI with currently one 
> method: from_string(), to return a Bio::LocationI object.
>
> Also, I rewrote the feature table location string parsing (the old 
> code in FTHelper.pm is so hideous that understanding it would have 
> taken the same time as rewriting) in the new module 
> Bio::Factory::FTLocationFactory, which implements the above 
> interface. I also added extensive testing in t/LocationFactory.t, 
> which basically goes through all examples of the Genbank FT 
> documentation. Tests include a test whether rendered location 
> string is equal to original location string.
>
> All those tests pass. To achieve that I had to fix a couple of 
> problems in the LocationI implementations, too.
>
> Note that a join() on the complementary strand will always be 
> rendered as join(complement(...),complement(...),...), instead of 
> the equivalent complement(join(...)).
>
> My rewrite does not contain 'short cuts' to object initialization 
> anymore that directly mess with the hash ref. If someone needs a 
> speed-up I'm sure there are ways to achieve this other than to 
> create time bombs.
>
> Next thing to do is integrate this new framework with 
> SeqIO/FTHelper. Once this is done, people need to bang on it 
> extensively to see where it breaks (I'm sure I haven't covered 
> every weird location string in Genbank and swissprot). If you have 
> a trouble-maker at hand, just add it to the tests in 
> t/LocationFactory.t (it's near the top, shouldn't be hard to figure 
> out; otherwise just send it to me).
>
> 	-hilmar
>
> On Friday, August 9, 2002, at 10:54  AM, Jason Stajich wrote:
>
>>
>> To push this onto the subfeatures as you suggest is going to take 
>> a fair
>> amount of refactoring in the current parsing code but would 
>> probably be
>> the best idea.
>>
>> No one has been brave enough to go in there and mess with things very
>> much.  A full refactor is okay with me but we sort of already did that
>> when we moved to locations/seqfeature separation.  We also have the
>> problem that we support hierarchical features AND hierarchical 
>> locations.
>> At the BoF we discussed describing coding conventions to insure that
>> people follow a convention that works.  I'm still unclear what the 
>> right
>> path ahead should be.
>>
>>
>> I'd suggest that 1st we derive a set of test cases which break the
>> expected semantics, put these in a new test file or as part of 
>> t/SeqIO.t
>> show that the parser currently does the wrong thing and then set about
>> trying to fix it.  This should also test that the expected DNA is 
>> returned
>> from all of these cases as well.  If we have a test system in 
>> place that
>> does this properly we'll have a much better time tracking down 
>> errors and
>> being consistent.
>>
>> I think cases like:
>>
>> complement(join(1..200,205..300),complement(500..600))
>> join(complement(1..200),205..300,complement(500..600))
>>
>> need to be properly tested
>>
>>
>>
>> On Thu, 8 Aug 2002, Hilmar Lapp wrote:
>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: Ewan Birney [mailto:birney@ebi.ac.uk]
>>>> Sent: Thursday, August 08, 2002 8:30 AM
>>>> To: Chris Mungall
>>>> Cc: Hilmar Lapp; Elia Stupka; Jason Stajich; bioperl-l@bioperl.org
>>>> Subject: Re: [Bioperl-l] *major* error in genbank parser or am i 
>>>> just
>>>> insane?
>>>>
>>> [...]
>>>>
>>>> I do prefer chris' semantics to having to hold onto the
>>>> difference between
>>>> a parent complement and a child complement - ie, I think we should
>>>> implicitly only allow the complement to happen on simple sequence
>>>> locations and never splits, and genbank with an outer complement 
>>>> is an
>>>> implicit distributive complement and reverse of its components.
>>>>
>>>
>>> OK. So this is a vote for sublocs on strand -1 and splitloc 
>>> strandless, right?
>>>
>>> Even though this differs from the present implementation, I 
>>> actually think too this is saner. So my vote goes here too. 
>>> Jason? (I know we decided otherwise in Canada :o)
>>>
>>> 	-hilmar
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l@bioperl.org
>>> http://bioperl.org/mailman/listinfo/bioperl-l
>>>
>>
>> --
>> Jason Stajich
>> Duke University
>> jason at cgt.mc.duke.edu
>>
>>
>>
>>
> --
> -------------------------------------------------------------
> Hilmar Lapp                            email: lapp at gnf.org
> GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
> -------------------------------------------------------------
>
>
--
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------