[Bioperl-l] simplealign, prediction parselist

Wed, 18 Sep 2002 18:22:58 +0100

Hi Heikki,
Here is code for functional Simple::Align purge method.  It has the same
algorithm
as the original and is largely based on that code.. but it works ( at least for
me).
In the code that follows it is implemented as a normal subroutine that returns a
reference
to a purgedalignment in the main test script. This can obviously be altered when
it is put in
Bio::SimpleAlign.

I make no claims for the beauty of the program but hope it meets your
requirements.
Best wishes

Richard

#!/bin/perl -w
use strict;

use Bio::AlignIO;
use Purge2 qw(purge2);

my $Alnobj = Bio::AlignIO->new (-file => 'myalignmentfilewithduplicates');
my $aln = $Alnobj->next_aln;
 print "there are", $aln->no_sequences, " sequences in original alignment\n";

my $purged_aln = purge2 ($aln, 0.7);

 print "after purging, purgedalign_ obj has  ", $purged_aln->no_sequences, "
sequences\n";
################ end of test script#######################

#!/usr/bin/perl -w
##############this package has a sub which identifies and removes
######    sequences from an alignment above a given threshold of similarity
######     uses remove_seq method of Bio::SimpleAlign.
######    purge2 should be able to directly replace Bio::Simplealign purge.
package Purge2;
use Exporter;
our @ISA  = qw(Exporter);
our @EXPORT_OK = qw(purge2);

sub purge2 {

 my ($self,$perc) = @_;         #takes alignment ref and threshold for
similarity as parameters, like the original.
  my (@seqs,$seq,$i,$j,$count,@one,@two,$seq2,$k,$res,$ratio, %duplicate,
@dups);

  @seqs = $self->each_seq();

  for ($i=0;$i< @seqs - 1;$i++ ) {         #for each seq in alignment
      $seq = $seqs[$i];
  #skip if already in duplicate hash
  next if exists $duplicate{$seq->display_id} ;
  my $one = $seq->seq();
      @one = split '', $one;#split to get 1aa per array element

      for($j=$i+1;$j < @seqs;$j++) {
          $seq2 = $seqs[$j];

    #skip if already in duplicate hash
    next if exists $duplicate{$seq2->display_id} ;

    my $two = $seq2->seq();
         @two = split '', $two;
    #print "\@two has ", scalar @two, "elements\n";
          $count = 0;
          $res = 0;

          for($k=0;$k<@one;$k++) {
              if( $one[$k] ne '.' && $one[$k] ne '-' && defined($two[$k]) &&
$one[$k] eq $two[$k]) {
                  $count++;
               }
              if( $one[$k] ne '.' && $one[$k] ne '-' && defined($two[$k]) &&
$two[$k] ne '.' && $two[$k] ne '-' ) {
                  $res++;
               }
           }

          if( $res == 0 ) {
              $ratio = 0;
             }
    else {
              $ratio = $count/$res;
           }
   # if above threshold put in duplicate hash and push onto
   # duplicate array for returning to get_unique
          if( $ratio > $perc ) {
    print "duplicate!", $seq2->display_id, "\n";
    $duplicate{$seq2->display_id} = 1;
    push @dups, $seq2;
    }
  }
 }
 foreach my $seq(@dups) {
 $self->remove_seq($seq);
}
return $self;
}

Heikki Lehvaslaiho wrote:

> Rickhard,
>
> The recommended way to contribute is to first post descriptions of
> modules, names and methods, for other people to comment on. Then in
> addition to writing the actual code, you are expected to write test
> suite (into t directory) with necessary data files (into t/data) which
> test at least all public methods in the modules. The actual commit can
> be handled by any of the core developer or anyone with a CVS account. If
> your contributions to bioperl are more than one off, we'd eventually
> like to give you a CVS account, so that you can maintain and add to the
> code base yourself.
>
> To start with, it is best that you post everything to the list.
>
>         -Heikki
>
> On Tue, 2002-09-17 at 20:57, Richard Adams wrote:
> > Heikki,
> > I'll tidy up the Simple::Align->purge code and send it to you asap.
> > Re parsers for netphos, psort etc. Basically  I'm working with about 30-40
> > genes on chromosome 4
> > in a region associated with bipolar disorder. I'm trying to build up a
> > database of possibly important
> > functional residues/regions of the proteins for prioritising the
> > experimental analysis of SNPs within
> > the coding region of the genes. Also I'm hoping to bolster the prediction
> > accuracy of these
> > programs by looking for conservation of prediction between sequence
> > homologues in a Clustal alignment.
> >
> > So that's why I'm developing the parsers. At present they are not in the
> > BioPerl 'format' at all and are standalone
> > modules/subroutines. I've not contributed to BioPerl before - is there a
> > standard way that you want code contributed in
> > or is some assistance likely to be available from someone more experienced
> > with BioPerl guts than myself?
> >
> > Richard
> >
> > Heikki Lehvaslaiho wrote:
> >
> > > Richard,
> > >
> > > Bio::SimpleAlign::purge() has not been functional as far as I know. We
> > > do not even have tests for it. Please send me your fixes and add them in
> > > and add you as an contributor to the module.
> > > It would be even better of you have written tests into t/SimpleAlign.t,
> > > too!
> > >
> > > It has been a while since you posted your message and there has been no
> > > answers to it. I think it can be safely said that no one else is working
> > > on Psort, TargetP or netphos parsers. It would be great to have them.
> > > Let us know what your plans are.
> > >
> > >         -Heikki
> > >
> > > On Tue, 2002-08-27 at 11:51, Richard Adams wrote:
> > > >     Hello,
> > > > Please ignore this if it's already been fixed but there seems to be 2
> > > > problems in 1.0.2
> > > > with Bio::SimpleAlign purge function.
> > > >
> > > > 1.  the get_nse call produces the error
> > > >     Use of uninitialized value in numeric eq (==) at
> > > > /packages/perl/lib/site_perl/5.6.0/Bio/SimpleAlign.pm line 373, <GEN0>
> > > > line 242.
> > > > 2.
> > > >     the lines
> > > > @one = $seq->seq();
> > > >    @two = $seq2->seq();
> > > >     I guess are supposed to be split so that each element is an amino
> > > > acid or - or .
> > > >     But the way the code works there will only be one element - the
> > > > whole sequence string.
> > > >
> > > > I've fixed these problems for my own use so if my code would be any use
> > > > I'd gladly submit it.
> > > >
> > > >
> > > >
> > > > I'm writing some parsers for various prediction servers, such as
> > > > Expasy's Psort, TargetP, netphos and ultimately plan to develop parsers
> > > > for some secondary
> > > > structure prediction servers.  Are people already working on this sort
> > > > of thing or would these be useful additions?
> > > >
> > > > Richard
> > > >
> > > > Dr Richard Adams
> > > > Molecular Medicine Centre
> > > > University of Edinburgh UK
> > > >
> > > >
> > > > _______________________________________________
> > > > Bioperl-l mailing list
> > > > Bioperl-l@bioperl.org
> > > > http://bioperl.org/mailman/listinfo/bioperl-l
> > > --
> > > ______ _/      _/_____________________________________________________
> > >       _/      _/                      http://www.ebi.ac.uk/mutations/
> > >      _/  _/  _/  Heikki Lehvaslaiho          heikki@ebi.ac.uk
> > >     _/_/_/_/_/  EMBL Outstation, European Bioinformatics Institute
> > >    _/  _/  _/  Wellcome Trust Genome Campus, Hinxton
> > >   _/  _/  _/  Cambs. CB10 1SD, United Kingdom
> > >      _/      Phone: +44 (0)1223 494 644   FAX: +44 (0)1223 494 468
> > > ___ _/_/_/_/_/________________________________________________________
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l@bioperl.org
> > http://bioperl.org/mailman/listinfo/bioperl-l
> --
> ______ _/      _/_____________________________________________________
>       _/      _/                      http://www.ebi.ac.uk/mutations/
>      _/  _/  _/  Heikki Lehvaslaiho          heikki@ebi.ac.uk
>     _/_/_/_/_/  EMBL Outstation, European Bioinformatics Institute
>    _/  _/  _/  Wellcome Trust Genome Campus, Hinxton
>   _/  _/  _/  Cambs. CB10 1SD, United Kingdom
>      _/      Phone: +44 (0)1223 494 644   FAX: +44 (0)1223 494 468
> ___ _/_/_/_/_/________________________________________________________