[Bioperl-l] Re: [Bioperl-guts-l] Notification: incoming/1065

Heikki Lehvaslaiho heikki@ebi.ac.uk
Mon, 21 Jan 2002 15:43:36 +0000


Steve,

Let's move this into bioperl-l where it belongs...


OK, I think I've got phylip.pm to work properly now. The last problems all
came from $name =~ /.{10}/ not matching anything when the sequence name was
shorter than 10 characters. We did not have any such names in our test
suite. Thanks for spotting this.
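
To illustrate the failure mode, here is a minimal sketch in Python (not the
actual Perl in phylip.pm; the function names and the 10-column constant are
assumptions made for this example). A pattern that demands exactly ten
characters finds nothing in a shorter name, so the name is lost; truncating
and then padding handles both long and short names.

```python
import re

# Assumption for this sketch: strict PHYLIP names occupy exactly 10 columns.
PHYLIP_NAME_WIDTH = 10

def naive_name(name):
    """Mimic the failing approach: grab the first 10 characters with a regex."""
    match = re.match(r".{10}", name)
    # Names shorter than 10 characters do not match at all and come back empty.
    return match.group(0) if match else ""

def phylip_name(name):
    """Truncate long names and pad short ones with spaces to 10 columns."""
    return name[:PHYLIP_NAME_WIDTH].ljust(PHYLIP_NAME_WIDTH)

if __name__ == "__main__":
    for name in ("MtTC4140912341234", "MtTC28"):
        print(repr(naive_name(name)), repr(phylip_name(name)))
```

With the two names from Steve's test file, the naive version returns an empty
string for MtTC28, while the padded version keeps it.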

In summary: If you need phylip format output that works with Joe
Felsenstein's PHYLIP programs (a reasonable request!), you need the latest
phylip.pm (v.1.7) from cvs. Also, I forgot to tell Steve that the spurious
warnings printed when importing gapped sequences are generated in
Bio::LocatableSeq. Those warnings are now silenced by default.
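
The idea behind the silenced warning can be sketched as follows. This is a
Python illustration only, not the Bio::LocatableSeq code: the function names,
the set of gap characters, and the rule end = start + residues - 1 are all
assumptions made for the example. The end position is always corrected from
the residue count, but the message is printed only when verbose is above 0.

```python
import io
import sys

# Assumption for this sketch: these characters mark gaps in an alignment row.
GAP_CHARS = "-."

def residue_count(seq):
    """Count the characters in seq that are not gap symbols."""
    return sum(1 for char in seq if char not in GAP_CHARS)

def resolve_end(name, seq, start, declared_end, verbose=0, log=sys.stderr):
    """Return the end implied by the residue count; report only if verbose > 0."""
    computed_end = start + residue_count(seq) - 1
    if computed_end != declared_end and verbose > 0:
        print(
            f"In sequence {name} residue count gives value {computed_end}. "
            f"Overriding value [{declared_end}] with value {computed_end}.",
            file=log,
        )
    return computed_end

if __name__ == "__main__":
    quiet = io.StringIO()
    # "demo" and "ac-gt" are made-up inputs, not data from the bug report.
    resolve_end("demo", "ac-gt", 1, 5, verbose=0, log=quiet)
    print(repr(quiet.getvalue()))  # nothing is logged at the default verbosity
```

Passing verbose=1 writes the message to the log while returning the same
corrected end.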

	-heikki



Steven Cannon wrote:
> 
> On Friday, January 18, 2002, at 03:30 AM, Heikki Lehvaslaiho wrote:
> 
> > Steve,
> >
> > Thanks for the bug report.
> >
> >> First, phylip.pm is placing three line returns between sequence blocks.
> >> Felsenstein's programs in the Phylip suite can't deal with this --
> >> they require two returns between blocks (that is, one blank line rather
> >> than two; illustrated below).
> >
> > That is easy enough to change.
> >
> >> Second (just an annoyance), when converting from, say, fasta to phylip
> >> format, any dashes in the fasta-format alignment generate STDIO
> >> warnings -- one warning per sequence (annoying, since any decent
> >> alignment will have gaps, usually indicated by dashes). Typical warning:
> >>
> >> -------------------- WARNING ---------------------
> >> MSG: In sequence MtTC36450 residue count gives value 64.
> >> Overriding value [65] with value 64 for Bio::LocatableSeq::end().
> >> ---------------------------------------------------
> >
> > Hmm. That warning proved useful when debugging parsers, so let's turn it
> > into a proper debugging statement. From now on the warning will be
> > printed only if $locatableseq->verbose > 0.
> >
> >> Third (just an annoyance), some garbage is inserted into the
> >> phylip-formatted sequence names, in the form of truncated "start-end
> >> position" numbers. For example, if the original sequence name has the
> >> 7 characters 'ABCDEFG', three characters indicating the start position
> >> of the sequence will be added to the name, bringing the name to the
> >> allowed 10-character phylip name length: 'ABCDEFG/1-'. This added
> >> information is never useful in the 10-character names, and will usually
> >> have to be subsequently stripped out.
> >
> > You are right. Ten characters is too short to hold "start-end
> > position", so let's dump them.
> >
> > All these changes are in the phylip.pm file, so once I've updated the cvs
> > repository, you can go into WebCVS and copy the file over the old one.
> > If you do that, could you let me know whether everything works?
> >
> > Yours,
> >       -Heikki
> 
> Heikki -
> 
> I did a test, using phylip.pm from CVS, and it looks like the fixes
> introduced some new problems. Here are my test file and output:
> 
>  >H122_HMM
> tyvklatlavfmltqflivqtknveagqcpragracsqaesnacgdieecicvsegshydggick
>  >MtNP212753
> tyvklatlavfmltqflivqtknveegqcpfagrvcsqyesnacgdseecicvsewshydggick
>  >MtTC30424
> -----------------------iearecpsfgtvcsilrsnscgniieyiciphwih--ggick
>  >MtTC4140912341234
> tyvklailavlhltiflifqtknveaascpnvgavcspfetkpcgnvkdcrclpwglff--gtc-
>  >MtTC28
> tyvklitlalflvttllmfqtknveaefcssvgsfcspfntnpcgylgncrcvpy--ylyggtce
> 
>   5 65
>                 tyvklitlal flvttllmfq tknveaefcs svgsfcspfn tnpcgylgnc
> MtNP212753     tyvklatlav fmltqflivq tknveegqcp fagrvcsqye snacgdseec
>                 tyvklitlal flvttllmfq tknveaefcs svgsfcspfn tnpcgylgnc
> MtTC414091     tyvklailav lhltiflifq tknveaascp nvgavcspfe tkpcgnvkdc
>                 tyvklitlal flvttllmfq tknveaefcs svgsfcspfn tnpcgylgnc
> 
>                 rcvpy--yly ggtce
>                 icvsewshyd ggick
>                 rcvpy--yly ggtce
>                 rclpwglff- -gtc-
>                 rcvpy--yly ggtce
> 
> So, the line return problem is fixed (3 -> 2 between interleaved
> blocks), and start-end information is being omitted from the shortened
> name (good), but sequence names shorter than 10 characters are being
> dropped (bad!).
> 
> I'm also still getting, e.g.,
> -------------------- WARNING ---------------------
> MSG: In sequence MtTC36450 residue count gives value 64.
> Overriding value [65] with value 64 for Bio::LocatableSeq::end().
> ---------------------------------------------------
> 
> I don't know if you were suggesting that I set
> $locatableseq->verbose > 0?
> 
> Steve
> 

-- 
______ _/      _/_____________________________________________________
      _/      _/                      http://www.ebi.ac.uk/mutations/
     _/  _/  _/  Heikki Lehvaslaiho          heikki@ebi.ac.uk
    _/_/_/_/_/  EMBL Outstation, European Bioinformatics Institute
   _/  _/  _/  Wellcome Trust Genome Campus, Hinxton
  _/  _/  _/  Cambs. CB10 1SD, United Kingdom
     _/      Phone: +44 (0)1223 494 644   FAX: +44 (0)1223 494 468
___ _/_/_/_/_/________________________________________________________