[Bioperl-l] simplealign, prediction parselist

Heikki Lehvaslaiho heikki@ebi.ac.uk
19 Sep 2002 11:00:33 +0100


Thanks Rickhard!

I've placed the method into the module. It's great! I modified the code
slightly to declare the variables only when needed; that is safer and
helped me to understand the code (I know the older code did the same).
Also, the method now returns the array of removed sequences rather than
$self.

Many thanks,

	-Heikki


On Wed, 2002-09-18 at 18:22, Richard Adams wrote:
> Hi Heikki,
> Here is code for functional Simple::Align purge method.  It has the same
> algorithm
> as the original and is largely based on that code.. but it works ( at least for
> me).
> In the code that follows it is implemented as a normal subroutine that returns a
> reference
> to a purgedalignment in the main test script. This can obviously be altered when
> it is put in
> Bio::SimpleAlign.
> 
> I make no claims for the beauty of the program but hope it meets your
> requirements.
> Best wishes
> 
> Richard
> 
> 
> #!/bin/perl -w
> use strict;
> 
> use Bio::AlignIO;
> use Purge2 qw(purge2);
> 
> my $Alnobj = Bio::AlignIO->new (-file => 'myalignmentfilewithduplicates');
> my $aln = $Alnobj->next_aln;
>  print "there are", $aln->no_sequences, " sequences in original alignment\n";
> 
> my $purged_aln = purge2 ($aln, 0.7);
> 
>  print "after purging, purgedalign_ obj has  ", $purged_aln->no_sequences, "
> sequences\n";
> ################ end of test script#######################
> 
> #!/usr/bin/perl -w
> ##############this package has a sub which identifies and removes
> ######    sequences from an alignment above a given threshold of similarity
> ######     uses remove_seq method of Bio::SimpleAlign.
> ######    purge2 should be able to directly replace Bio::Simplealign purge.
> package Purge2;
> use Exporter;
> our @ISA  = qw(Exporter);
> our @EXPORT_OK = qw(purge2);
> 
> 
> sub purge2 {
> 
>  my ($self,$perc) = @_;         #takes alignment ref and threshold for
> similarity as parameters, like the original.
>   my (@seqs,$seq,$i,$j,$count,@one,@two,$seq2,$k,$res,$ratio, %duplicate,
> @dups);
> 
>   @seqs = $self->each_seq();
> 
>   for ($i=0;$i< @seqs - 1;$i++ ) {         #for each seq in alignment
>       $seq = $seqs[$i];
>   #skip if already in duplicate hash
>   next if exists $duplicate{$seq->display_id} ;
>   my $one = $seq->seq();
>       @one = split '', $one;#split to get 1aa per array element
> 
>       for($j=$i+1;$j < @seqs;$j++) {
>           $seq2 = $seqs[$j];
> 
>     #skip if already in duplicate hash
>     next if exists $duplicate{$seq2->display_id} ;
> 
>     my $two = $seq2->seq();
>          @two = split '', $two;
>     #print "\@two has ", scalar @two, "elements\n";
>           $count = 0;
>           $res = 0;
> 
>           for($k=0;$k<@one;$k++) {
>               if( $one[$k] ne '.' && $one[$k] ne '-' && defined($two[$k]) &&
> $one[$k] eq $two[$k]) {
>                   $count++;
>                }
>               if( $one[$k] ne '.' && $one[$k] ne '-' && defined($two[$k]) &&
> $two[$k] ne '.' && $two[$k] ne '-' ) {
>                   $res++;
>                }
>            }
> 
>           if( $res == 0 ) {
>               $ratio = 0;
>              }
>     else {
>               $ratio = $count/$res;
>            }
>    # if above threshold put in duplicate hash and push onto
>    # duplicate array for returning to get_unique
>           if( $ratio > $perc ) {
>     print "duplicate!", $seq2->display_id, "\n";
>     $duplicate{$seq2->display_id} = 1;
>     push @dups, $seq2;
>     }
>   }
>  }
>  foreach my $seq(@dups) {
>  $self->remove_seq($seq);
> }
> return $self;
> }
> 
> 
> 
> 
> 
> 
> 
> Heikki Lehvaslaiho wrote:
> 
> > Rickhard,
> >
> > The recommended way to contribute is to first post descriptions of
> > modules, names and methods, for other people to comment on. Then in
> > addition to writing the actual code, you are expected to write test
> > suite (into t directory) with necessary data files (into t/data) which
> > test at least all public methods in the modules. The actual commit can
> > be handled by any of the core developer or anyone with a CVS account. If
> > your contributions to bioperl are more than one off, we'd eventually
> > like to give you a CVS account, so that you can maintain and add to the
> > code base yourself.
> >
> > To start with, it is best that you post everything to the list.
> >
> >         -Heikki
> >
> > On Tue, 2002-09-17 at 20:57, Richard Adams wrote:
> > > Heikki,
> > > I'll tidy up the Simple::Align->purge code and send it to you asap.
> > > Re parsers for netphos, psort etc. Basically  I'm working with about 30-40
> > > genes on chromosome 4
> > > in a region associated with bipolar disorder. I'm trying to build up a
> > > database of possibly important
> > > functional residues/regions of the proteins for prioritising the
> > > experimental analysis of SNPs within
> > > the coding region of the genes. Also I'm hoping to bolster the prediction
> > > accuracy of these
> > > programs by looking for conservation of prediction between sequence
> > > homologues in a Clustal alignment.
> > >
> > > So that's why I'm developing the parsers. At present they are not in the
> > > BioPerl 'format' at all and are standalone
> > > modules/subroutines. I've not contributed to BioPerl before - is there a
> > > standard way that you want code contributed in
> > > or is some assistance likely to be available from someone more experienced
> > > with BioPerl guts than myself?
> > >
> > > Richard
> > >
> > > Heikki Lehvaslaiho wrote:
> > >
> > > > Richard,
> > > >
> > > > Bio::SimpleAlign::purge() has not been functional as far as I know. We
> > > > do not even have tests for it. Please send me your fixes and add them in
> > > > and add you as an contributor to the module.
> > > > It would be even better of you have written tests into t/SimpleAlign.t,
> > > > too!
> > > >
> > > > It has been a while since you posted your message and there has been no
> > > > answers to it. I think it can be safely said that no one else is working
> > > > on Psort, TargetP or netphos parsers. It would be great to have them.
> > > > Let us know what your plans are.
> > > >
> > > >         -Heikki
> > > >
> > > > On Tue, 2002-08-27 at 11:51, Richard Adams wrote:
> > > > >     Hello,
> > > > > Please ignore this if it's already been fixed but there seems to be 2
> > > > > problems in 1.0.2
> > > > > with Bio::SimpleAlign purge function.
> > > > >
> > > > > 1.  the get_nse call produces the error
> > > > >     Use of uninitialized value in numeric eq (==) at
> > > > > /packages/perl/lib/site_perl/5.6.0/Bio/SimpleAlign.pm line 373, <GEN0>
> > > > > line 242.
> > > > > 2.
> > > > >     the lines
> > > > > @one = $seq->seq();
> > > > >    @two = $seq2->seq();
> > > > >     I guess are supposed to be split so that each element is an amino
> > > > > acid or - or .
> > > > >     But the way the code works there will only be one element - the
> > > > > whole sequence string.
> > > > >
> > > > > I've fixed these problems for my own use so if my code would be any use
> > > > > I'd gladly submit it.
> > > > >
> > > > >
> > > > >
> > > > > I'm writing some parsers for various prediction servers, such as
> > > > > Expasy's Psort, TargetP, netphos and ultimately plan to develop parsers
> > > > > for some secondary
> > > > > structure prediction servers.  Are people already working on this sort
> > > > > of thing or would these be useful additions?
> > > > >
> > > > > Richard
> > > > >
> > > > > Dr Richard Adams
> > > > > Molecular Medicine Centre
> > > > > University of Edinburgh UK
> > > > >
> > > > >
> > > > > _______________________________________________
> > > > > Bioperl-l mailing list
> > > > > Bioperl-l@bioperl.org
> > > > > http://bioperl.org/mailman/listinfo/bioperl-l
> > > > --
> > > > ______ _/      _/_____________________________________________________
> > > >       _/      _/                      http://www.ebi.ac.uk/mutations/
> > > >      _/  _/  _/  Heikki Lehvaslaiho          heikki@ebi.ac.uk
> > > >     _/_/_/_/_/  EMBL Outstation, European Bioinformatics Institute
> > > >    _/  _/  _/  Wellcome Trust Genome Campus, Hinxton
> > > >   _/  _/  _/  Cambs. CB10 1SD, United Kingdom
> > > >      _/      Phone: +44 (0)1223 494 644   FAX: +44 (0)1223 494 468
> > > > ___ _/_/_/_/_/________________________________________________________
> > >
> > > _______________________________________________
> > > Bioperl-l mailing list
> > > Bioperl-l@bioperl.org
> > > http://bioperl.org/mailman/listinfo/bioperl-l
> > --
> > ______ _/      _/_____________________________________________________
> >       _/      _/                      http://www.ebi.ac.uk/mutations/
> >      _/  _/  _/  Heikki Lehvaslaiho          heikki@ebi.ac.uk
> >     _/_/_/_/_/  EMBL Outstation, European Bioinformatics Institute
> >    _/  _/  _/  Wellcome Trust Genome Campus, Hinxton
> >   _/  _/  _/  Cambs. CB10 1SD, United Kingdom
> >      _/      Phone: +44 (0)1223 494 644   FAX: +44 (0)1223 494 468
> > ___ _/_/_/_/_/________________________________________________________
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
-- 
______ _/      _/_____________________________________________________
      _/      _/                      http://www.ebi.ac.uk/mutations/
     _/  _/  _/  Heikki Lehvaslaiho          heikki@ebi.ac.uk
    _/_/_/_/_/  EMBL Outstation, European Bioinformatics Institute
   _/  _/  _/  Wellcome Trust Genome Campus, Hinxton
  _/  _/  _/  Cambs. CB10 1SD, United Kingdom
     _/      Phone: +44 (0)1223 494 644   FAX: +44 (0)1223 494 468
___ _/_/_/_/_/________________________________________________________