[Bioperl-l] Bio::Location::Split

Chris Fields cjfields at uiuc.edu
Mon Oct 16 20:51:08 UTC 2006


...
> I downloaded candida glabrata chromosome B from EBI:
> http://www.ebi.ac.uk/genomes/eukaryota.html, CR380948
> 
> testportal>perl location.pl new_glabrata_B.embl > bio
> testportal>perl -wlne 'print $1 if /^FT\s+CDS\s+(.*)/'
> new_glabrata_B.embl > nonbio
> testportal>wc bio nonbio
>  217  217 4537 bio
>  217  217 4549 nonbio
>  434  434 9086 total
> testportal>diff bio nonbio
> 4c4
> < complement(join(10632..11157,10347..10372))
> ---
> > join(complement(10632..11157),complement(10347..10372))
> 
> Just one example here, but see below.
> 
> > As for the lack of catching this, the particular types of
> > locations that
> > cause the issue are quite rare.
> 
> Really? I guess our definitions of rare depend on which sequences we're
> working with. I'm doing fungal genomes, and here's a grep for a few
> species' entire genomes:
> 
> testportal>foreach i ( *.embl )
> foreach? echo $i
> foreach? grep CDS $i | grep join | grep -c complement
> foreach? end
> glabrata_orf.embl
> 29
> hansenii_orf.embl
> 151
> lactis_orf.embl
> 70
> lipolytica_orf.embl
> 337
> pombe_orf.embl
> 1137
> 
> You might like to use pombe as a test case, as it has lots of these
> complement joins, including ones with multiple introns.

I'll use those.  I'll see if an analogous GenBank file exists as well.  

I can probably make a preliminary fix for FT_string() so that it arranges
the sublocations correctly, but I think the best way to go is to have
FTLocationFactory not modify the various sublocations to start with, which
it currently does when it sets strand() (strand() propagates the strand info
to sublocations). 

> Anyway, I'd question the "rare" designation. It seems to me like any
> species that has introns will have situations like this in their CDSs.
> Not to mention any other sequence that uses Bio::Location::Split. (Since
> I'm not a Real Biologist, I can't think up mor examples here, but I'm
> sure they exist.)

I think that additional tests are definitely needed for pulling out
sequences.  

What I mean by 'rare' is that the majority of sequences do not have
problems.  Also, this seems to be a 'silent' bug since the error shows up in
to_FTstring() but the object sublocations seem to beprocessed correctly when
using the location object directly (such as via SeqFeatureI).  

Round-tripping the sequence should pick it up though.  Since
complement(join(10632..11157,10347..10372)) is not the same as
join(complement(10632..11157),complement(10347..10372)).  

That is essentially what you are doing, correct? i.e. getting the sequences
using Bioperl, saving them (which passes them through SeqIO), reading them
again (back through SeqIO with the malformed location string).

> Or are you saying it's rare to use join (complement(C..D),
> complement(A..B)) instead of complement(join(A..B, C..D)). In that case,
> I guess I just got really unlucky in that five fungal genomes I was
> using decided to use the "rare" syntax.

Location::Split is supposed to handle all variations, but apparently it
isn't.  

> > Note that there are two bugs
> > for that bug
> > report.  The first (and more serious) is still unresolved.  The second
> > (where remote locations are treated differently in
> > Location::Split, which
> > caused more problems than it was worth) had a fix committed
> > about a month
> > ago.
> 
> Sadly, it's the first (and in my case, more common (I have no remote
> locations.)) bug for me.
> 
> > Any fixes I have made for the first bug invariably break several other
> > methods, which use the current Location::Split object logic
> > for retrieving
> > sequences, building feature strings, etc.  Since a new RC is
> > imminent and
> > the bug only affects a small number of locations, I have held
> > off until
> > after a final release is made (the last thing I want to do is
> > fix something
> > that breaks ~6-8 other methods), but I'll try looking at it
> > again this week.
> 
> IMO this is a pretty serious bug (if these kinds of sequences aren't
> that rare as I've shown above), because you're outputting sequence
> descriptions that are just plain wrong. Anyone who uses
> FTLocationFactory to read these output description will have incorrect
> sequence, incorrect translated proteins, etc. And it's even more serious
> if other methods are depending on it.
> 
> I know I can't dictate your time, and should be volunteering to work on
> fixing it. But if it affects other modules, then I will no doubt break
> things even more than you have in your attempts.
> 
> -Amir

I'll give it a look over the next week.  Like I mentioned above, I may be
able to fix it in Split::to_FTstring() w/o breaking other tests (in which
case I'll commit it for the 1.5.2 release), but it would be a temporary hack
until I can work out why other tests are failing.

Chris




More information about the Bioperl-l mailing list