[Bioperl-l] extending the PHYLIP format

Heikki Lehvaslaiho heikki at sanbi.ac.za
Thu May 29 10:06:12 UTC 2008


I committed my first take on long IDs in the phylip format.

It turned out to be quite simple to add an option 'longid=>1' to it.
I had to change interleaved() to return the value just set (it was returning
the previous value). That might have been something I did a long time ago when
I wrote the first version of the module! Strangely, it did not have any
knock-on effects.
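
As a rough illustration of the interleaved() change, here is a stand-alone
sketch of a combined getter/setter. It is not the actual code in
Bio::AlignIO::phylip; the package name Demo::Writer, the hash key and the
default value are made up for the example.

```perl
use strict;
use warnings;

package Demo::Writer;    # hypothetical stand-in, not Bio::AlignIO::phylip

sub new {
    my $class = shift;
    return bless { interleaved => 1 }, $class;
}

# Old behaviour: remember the previous value and return it, so a caller
# that sets a new value gets the stale one back.
sub interleaved_old {
    my ($self, $value) = @_;
    my $previous = $self->{interleaved};
    $self->{interleaved} = $value if defined $value;
    return $previous;
}

# Fixed behaviour: store first, then return whatever is now set.
sub interleaved {
    my ($self, $value) = @_;
    $self->{interleaved} = $value if defined $value;
    return $self->{interleaved};
}

package main;

my $w = Demo::Writer->new;
print "old: ", $w->interleaved_old(0), "\n";    # previous value
print "new: ", $w->interleaved(0), "\n";        # value just set
```

Callers that only ever read the flag see no difference between the two
versions, which may be why nothing else broke.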

The longid version of the phylip format seems to work fine with phyml as long
as there are no spaces in the IDs (but that is said clearly in the phyml
docs). This still needs careful testing with other programs.
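
The ID rules quoted further down in this message (optional leading spaces, up
to 50 characters, single quotes around IDs with spaces, two spaces before the
sequence) can be sketched in plain Perl. The helpers format_line and
parse_line are hypothetical and are not part of Bio::AlignIO::phylip; how to
escape a single quote inside an ID is not described anywhere in the thread, so
the sketch simply rejects such IDs.

```perl
use strict;
use warnings;

# Hypothetical helpers sketching the relaxed ID rules discussed in this
# thread; they are not part of Bio::AlignIO::phylip.
my $MAX_ID = 50;    # upper limit on ID length reported in the thread

# Build one "ID  sequence" line: quote IDs containing whitespace and
# separate ID and sequence with two spaces.
sub format_line {
    my ($id, $seq) = @_;
    die "ID longer than $MAX_ID characters: $id\n" if length($id) > $MAX_ID;
    # Assumption: embedded quotes are rejected, since no escaping rule is known.
    die "ID contains a single quote: $id\n" if $id =~ /'/;
    $id = "'$id'" if $id =~ /\s/;
    return "$id  $seq";
}

# Split a line back into ID and sequence: skip leading spaces, accept an
# optionally quoted ID, and strip the quotes.
sub parse_line {
    my ($line) = @_;
    if (   $line =~ /^\s*'([^']+)'\s+(\S.*)$/
        or $line =~ /^\s*(\S+)\s+(\S.*)$/ )
    {
        return ($1, $2);
    }
    die "Cannot parse line: $line\n";
}

# Made-up IDs, one without and one with spaces.
for my $id ('a_rather_long_sequence_name_01', 'sample 12 env') {
    my $line = format_line($id, 'ACGTACGT');
    my ($parsed_id, $seq) = parse_line($line);
    print "$line  =>  [$parsed_id] [$seq]\n";
}
```

For phyml, only the unquoted form without spaces should be expected to work.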

Please try it out with your favorite phylo programs!
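
A minimal sketch for trying it out, assuming the option is passed in the usual
BioPerl named-argument style as '-longid' and that you have a ClustalW
alignment at hand. The file names are placeholders.

```perl
use strict;
use warnings;
use Bio::AlignIO;

# Read a ClustalW alignment and write it as PHYLIP with long IDs.
# Assumptions: 'input.aln' and 'output.phy' are placeholder file names,
# and the new option is spelled '-longid' on the constructor.
my $in = Bio::AlignIO->new(
    -file   => 'input.aln',
    -format => 'clustalw',
);
my $out = Bio::AlignIO->new(
    -file   => '>output.phy',
    -format => 'phylip',
    -longid => 1,
);

while ( my $aln = $in->next_aln ) {
    $out->write_aln($aln);
}
```

The resulting output.phy can then be fed to the phylogeny program you want to
test.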

   -Heikki

On Wednesday 28 May 2008 20:01:55 Weigang Qiu wrote:
> Let me summarize a few things I implemented during the 07 Nescent
> Hackathon (with a lot of help from Sandu, Aaron, and Jason):
>
> 1. A "longname.aln" is included in bioperl-live. It turns out to
> be the same file as "pep-266.aln" (one of them should be removed). This
> file is in clustalw format. Also, it doesn't contain the tough cases,
> such as names 50 chars long, with spaces and single quotes.
>
> 2. The solution to this ugly restriction that has been implemented
> consists of the following pair of SimpleAlignI methods:
>  set_displayname_safe
>
>       Title     : set_displayname_safe
>       Usage     : ($new_aln, $ref_name)=$ali->set_displayname_safe(4)
>       Function  : Assign machine-generated serial names to sequences in
>                   input order. Designed to protect names during PHYLIP
>                   runs. Assign 10-char string in the form of
>                   "S000000001" to "S999999999". Restore the original
>                   names using "restore_displayname".
>       Returns   : 1. a new $aln with system names;
>                   2. a hash ref for restoring names
>       Argument  : Number for id length (default 10)
>
>  restore_displayname
>
>       Title     : restore_displayname
>       Usage     : $aln_name_restored=$ali->restore_displayname($hash_ref)
>       Function  : Restore original sequence names (after running
>                   $ali->set_displayname_safe)
>       Returns   : a new $aln with names restored.
>       Argument  : a hash reference of names from "set_displayname_safe".
>
> 3. Added the following tests in "SimpleAlign.t":
> # test set_displayname_safe & restore_displayname:
> $str = Bio::AlignIO->new(
>     -file => Bio::Root::IO->catfile("t","data","pep-266.aln"));
> $aln = $str->next_aln();
> is $aln->get_seq_by_pos(3)->display_id, 'Smik_Contig1103.1',
>     'initial display id ok';
> my ($new_aln, $ref) = $aln->set_displayname_safe();
> is $new_aln->get_seq_by_pos(3)->display_id, 'S000000003',
>     'safe display id ok';
> my $restored_aln = $new_aln->restore_displayname($ref);
> is $restored_aln->get_seq_by_pos(3)->display_id, 'Smik_Contig1103.1',
>     'restored display id ok';
>
> I would be happy to contribute more if additional work or design is needed.
>
> ps. We developed a module for graphic annotation of alignments using GD
> (modeled after Bio::Graphics). This should be useful for people who are
> annotating alignments manually (such as highlighting alignment positions,
> labeling domains, etc). It would be great if someone could help me deposit
> it in bioperl-live through subversion (I was told that my cvs developer's
> account is no longer usable).
>
> Jason Stajich wrote:
> > Should also ask Weigang what the status is, I think he implemented a
> > lot of it.
> >
> > -jason
> >
> > On May 28, 2008, at 6:51 AM, Chris Fields wrote:
> >> Could you post a few example phylip sequences with long names to svn
> >> and add a ticket to bugzilla?  I would consider this a somewhat
> >> high-priority enhancement.
> >>
> >> I think keeping this in a single phylip module would be best, but
> >> we'll have to see how feasible it is.  I think it is possible to do so,
> >> however, and still retain some backwards compatibility (I may even
> >> have an idea how, just need to test it out).
> >>
> >> chris
> >>
> >> On May 28, 2008, at 3:23 AM, Heikki Lehvaslaiho wrote:
> >>> I just learned that a number of phylogenetics packages (PAUP, PHYML,
> >>> MrBayes at least) now allow longer than 10 character IDs in PHYLIP
> >>> format. The documentation is scarce but the rules seem to be:
> >>>
> >>> 1. There can be spaces before the ID.
> >>> 2. The ID can be up to 50 characters long.
> >>> 3. The ID can contain any characters. If you are using spaces within
> >>> the ID, you have to put the whole ID in single quotes ('). Single
> >>> quotes can be used for all IDs and are removed when parsing in.
> >>> 4. It is customary to have two spaces between the ID and the sequence.
> >>>
> >>> This custom seems to have come into PHYLIP format from Nexus.
> >>> Note that this allows sequences in a file to start at different
> >>> columns.
> >>>
> >>> Can anyone shed more light on the matter?
> >>>
> >>>
> >>> I need to get this into bioperl as the names in HIV sequences that I
> >>> work with are very long and cannot be sensibly truncated.
> >>>
> >>> What would be the best way to do this?
> >>> 1. Add more options to the already heavily
> >>>   hacked Bio::AlignIO::phylip.pm
> >>> 2. Create a Bio::AlignIO::phyliplong.pm
> >>>
> >>> Do those ugly hacks for supporting fixed length long IDs really
> >>> really belong in the vanilla phylip.pm file?
> >>>
> >>> Opinions?
> >>>
> >>>     -Heikki
> >>>
> >>> --
> >>> ______ _/      _/_____________________________________________________
> >>>      _/      _/
> >>>     _/  _/  _/  Heikki Lehvaslaiho    heikki at_sanbi _ac _za
> >>>    _/_/_/_/_/  Senior Scientist    skype: heikki_lehvaslaiho
> >>>   _/  _/  _/  SANBI, South African National Bioinformatics Institute
> >>>  _/  _/  _/  University of Western Cape, South Africa
> >>>     _/      Phone: +27 21 959 2096   FAX: +27 21 959 2512
> >>> ___ _/_/_/_/_/________________________________________________________
> >>> _______________________________________________
> >>> Bioperl-l mailing list
> >>> Bioperl-l at lists.open-bio.org
> >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>
> >> Christopher Fields
> >> Postdoctoral Researcher
> >> Lab of Dr. Marie-Claude Hofmann
> >> College of Veterinary Medicine
> >> University of Illinois Urbana-Champaign



-- 
______ _/      _/_____________________________________________________
      _/      _/
     _/  _/  _/  Heikki Lehvaslaiho    heikki at_sanbi _ac _za
    _/_/_/_/_/  Senior Scientist    skype: heikki_lehvaslaiho
   _/  _/  _/  SANBI, South African National Bioinformatics Institute
  _/  _/  _/  University of Western Cape, South Africa
     _/      Phone: +27 21 959 2096   FAX: +27 21 959 2512
___ _/_/_/_/_/________________________________________________________


