[Bioperl-l] Fix for Bug #3376 broke somewhere else

Thu Feb 28 15:36:34 UTC 2013

Hi,
I was re-checking Bug #3302 using the Bio::SearchIO modules from the
repository and found that they can no longer parse a Hmmer2 file that
previously parsed fine. After tracking down the problem, I discovered that a
regular expression changed to fix another bug is what broke the parsing.

The fix for Bug #3376 consisted of adding an extra condition to skip
lines where the end-of-domain indicator is split across lines
(https://redmine.open-bio.org/issues/3376):
TEST: domain 1 of 1, from 8 to 97: score 184.7, E = 2.5e-56
                   *->svfqqqqssksttgstvtAiAiAigYRYRYRAvtWnsGsLssGvnDn
                      sv+qqqq+  +    +vtAiAiAigYRYRYRAv Wn GsLs G nDn
        Test     8    SVYQQQQGGSA----MVTAIAIAIGYRYRYRAVVWNKGSLSTGTNDN 50   

                   DnDqqsdgLYtiYYsvtvpssslpsqtviHHHaHkasstkiiikiePr<-
                   DnDq +d LYtiYYsvtv +ss+p q+v+HHHaH+asstkiiiki P   
        Test    51 DNDQAAD-LYTIYYSVTVSASSWPGQSVTHHHAHPASSTKIIIKIAPS   97   

                   *

        Test     -   -
This case is characterized by the two dashes in the last line.

So this expression was added to 'next_result' in hmmer2.pm
(https://github.com/bioperl/bioperl-live/commit/142e5d79e3a6593db32bf0af99048f47d01bd3f2):
                        elsif (CORE::length($_) == 0
                            || ( $count != 1 && /^\s+$/o )
                            || /^\s+\-?\*\s*$/
                            || /^.+\-\s+\-\s*$/ ) ### <--- This regex was designed for bug 3376
                        {
                            next;
                        }

But the expression is too broad: the "^.+" just before the two dashes can
match anything, including other dashes, so it also skips alignment lines
that consist entirely of dashes. That breaks the parsing of cases like this:
                   KyACrqCdtiVQAPaPakpIErGiptaGLLArvlVSKyaEHlPLYRQsEI

  lcl|gi|340     - -------------------------------------------------- -    

                   yaRqGVeiaRstLadWVgrtgarLaPLvdALaeyVLkeGklHADeTPVqV
                         +i  s L   V++ + r                           
  lcl|gi|340 60938 ------AIMISGLIHGVSARCLRF-------------------------- 60955

I think a reasonable fix, one that still fixes the original bug and restores
parsing for this case, is to add an extra \s+ to the regex just before the
first dash. That way the expression makes sure that the first dash is the
one that comes AFTER the description (replacing the usual coordinate
number), and not the last dash of an alignment or of a series of dashes like
the one above:
                        elsif (CORE::length($_) == 0
                            || ( $count != 1 && /^\s+$/o )
                            || /^\s+\-?\*\s*$/
                            || /^.+\s+\-\s+\-\s*$/ ) ### <--- Tweaked regex
                        {
                            next;
                        }
I tested it and it works fine. I hope you find the fix acceptable.
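
For a quick check that does not need the full parser, the two patterns can
be compared side by side. This is a minimal sketch, not the BioPerl test
suite: the sample lines are rebuilt by hand from the excerpts in this
message (dash runs regenerated with the "x" operator), and the labels and
variable names are hypothetical.

```perl
use strict;
use warnings;

my $old_re = qr/^.+\-\s+\-\s*$/;       # condition added for bug 3376
my $new_re = qr/^.+\s+\-\s+\-\s*$/;    # proposed tweak: extra \s+ before first dash

# Hand-rebuilt sample lines (assumption: 50-column alignment block).
my @samples = (
    [ 'split end-of-domain line', '        Test     -   -' ],
    [ 'all-dash alignment line',  '  lcl|gi|340     - ' . ( '-' x 50 ) . ' -    ' ],
    [ 'partial alignment line',
      '  lcl|gi|340 60938 ------AIMISGLIHGVSARCLRF' . ( '-' x 26 ) . ' 60955' ],
);

my %skipped;    # label => [ skipped by old regex, skipped by new regex ]
for my $sample (@samples) {
    my ( $label, $line ) = @$sample;
    my $old = ( $line =~ $old_re ) ? 1 : 0;
    my $new = ( $line =~ $new_re ) ? 1 : 0;
    $skipped{$label} = [ $old, $new ];
    printf "%-26s old: %-4s  new: %s\n",
        $label, $old ? 'skip' : 'keep', $new ? 'skip' : 'keep';
}
```

The only line both patterns should skip is the split end-of-domain line.
The all-dash alignment line is skipped by the old pattern but kept by the
tweaked one, because the character before its last alignment dash is
another dash rather than whitespace.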

Cheers,

--
Francisco J. Ossandon
Bioinformatician.
Ph.D. Candidate, University Andres Bello.
Center for Bioinformatics and Genome Biology,
Fundacion Ciencia para la Vida.
Santiago, Chile.
www.cienciavida.cl/CBGB.htm