[Bioperl-l] GCG MSF format alignments

David J. Evans David.Evans@vir.gla.ac.uk
Fri, 9 Mar 2001 17:33:50 -0000


Bioperlers
My recent attempts to use AlignIO (in v.0.7.0) have choked on GCG format MSF
files. These have the header info, a // separator and the aligned sequences,
but also have a number line. Something like this :

[snip]
 Name: cavy             Len:   377  Check: 7038  Weight:  1.00
 Name: rat              Len:   377  Check: 6296  Weight:  1.00

//

         1                                                   50
  human  DCGLPPDVPN AQP..ALEGR TSFPEDTVIT YKCEESFVKI PGEKDSVICL
  chimp  DCGLPPEVPN AQP..ALEGR TSFPEDTVIT YKCEESFVKI PGEKDS~~~~
[snip]

GCG v.10 also uses ~ to pad out the end of sequences, as I've shown in the
chimp line above.  The .msf files in the current distribution aren't in this
format ... they lack the number line and don't pad the ends of sequences.
I've looked at msf.pm in /AlignIO and it looks as though it all starts going
wrong at about line 102 :

if( ! exists $hash{$name} ) {

which throws an exception ... changing 'throw' to 'warn' and adding a 'next'
on the following line gets round this, and it creates the SimpleAlign object
correctly (except for those ~'s ... see below).  Is there a definition of MSF
format files somewhere ?  Do I need to pre-process my MSF format alignments
generated with GCG (as I've done with earlier versions of Bioperl) ?
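
To make the 'warn' plus 'next' workaround concrete, here is a minimal
stand-alone sketch. It is not the actual msf.pm code: parse_msf_lines is a
hypothetical helper, and the regular expressions are assumptions about how
the header and body lines might be matched. It only shows how the number
line ends up looking like a sequence called "1" and how skipping unknown
names avoids the exception.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT the real Bio::AlignIO::msf code: collect the
# aligned residues per sequence name from the lines of an MSF file.
sub parse_msf_lines {
    my @lines = @_;
    my (%hash, @order);
    my $in_body = 0;
    for my $line (@lines) {
        if (!$in_body) {
            # Header entries look like " Name: cavy   Len: 377 ..."
            if ($line =~ /^\s*Name:\s+(\S+)/) {
                $hash{$1} = '';
                push @order, $1;
            }
            # The "//" line separates the header from the alignment.
            $in_body = 1 if $line =~ m{^//};
            next;
        }
        # Body lines: a name followed by blocks of residues.
        next unless $line =~ /^\s*(\S+)\s+(.*\S)\s*$/;
        my ($name, $str) = ($1, $2);
        if (!exists $hash{$name}) {
            # The number line ("1 ... 50") lands here with name "1":
            # warn and skip instead of throwing.
            warn "skipping line for unknown name '$name'\n";
            next;
        }
        $str =~ s/\s+//g;        # drop the spaces between 10-residue blocks
        $hash{$name} .= $str;
    }
    return (\%hash, \@order);
}
```

The trade-off is that a genuinely misnamed sequence line would also be
skipped with only a warning, so a stricter variant might skip just the lines
whose fields are all digits.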

Regarding the ~ characters ... my primitive understanding suggests this
throws an exception in Bio::PrimarySeq::seq with a message that '[some
sequence ~~~] does not look healthy'.  Is there an acceptable way to allow ~
characters to be included, even if only to use map_chars() on the object at
a later stage to get rid of them ?  Should bioperl read the standard GCG
format MSF files directly, or should they always need pre-processing ?
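
If pre-processing does turn out to be necessary, the ~ padding could be
handled with something like the sketch below. detilde_msf is a hypothetical
helper (not part of Bioperl), and it assumes that '.' is an acceptable
replacement because the alignment above already uses it as the gap
character. The Check: values in the header are not recomputed, so a reader
that verifies checksums may still complain.

```perl
use strict;
use warnings;

# Hypothetical pre-processing helper: turn GCG v.10 '~' padding into '.'
# gap characters, but only in the alignment body (after the "//" line),
# so any '~' in the header text is left untouched.
sub detilde_msf {
    my @lines = @_;              # work on a copy of the caller's lines
    my $in_body = 0;
    for my $line (@lines) {
        if ($in_body) {
            $line =~ tr/~/./;    # replace every '~' on this body line
        }
        elsif ($line =~ m{^//}) {
            $in_body = 1;        # everything after this is alignment
        }
    }
    return @lines;
}
```

Used as a filter, the lines of the GCG file would be passed through
detilde_msf and written to a new file before handing that file to AlignIO.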

Finally, I get an error when I try to generate MSF format output from my
SimpleAlign object, which looks like this :

Can't locate object method "GCG_checksum" via package "Bio::LocatableSeq" at
D:/Computing/DJEcustomLib/bioperl-0.07.0/Bio/SimpleAlign.pm line 1594,
<GEN0> line 119.

which has me stumped, because I can only find the method in
Bio/SeqIO/gcg.pm.  It generates FASTA format output perfectly ...

Are these bugs, features or 'user errors' ?  All the above is on Win2k (a
user error in itself perhaps), but I don't think that has anything to do
with the problems.

Looking forward to (Hoping for) some clarification.
Regards
David




=====
David J. Evans
Institute of Virology  |  David.Evans@vir.gla.ac.uk
University of Glasgow  |  Tel/Fax : +44 0141 330 6249
Church Street          |  SMS/Mobile : +44 07940 592768
GLASGOW G11 5JR        |  http://www.polio.vir.gla.ac.uk/