[Bioperl-l] sequence comparison

Mon, 11 Mar 2002 16:45:52 -0500 (EST)

On Mon, 11 Mar 2002, Phillip Lord wrote:

> I'm trying to compare two sequences at the moment. I want to end up
> with a simple metric of how similar the sequences are. My thought was
> just to align the sequences and come up with a percent ID value.
>
> I've tried using the pSW class to do this. It seems to work but it
> appears to be producing alignments only over a very small section of
> the two proteins
>
Smith-Waterman finds the best local alignments between two sequences so it
is only going to show the alignment sections where the score was good
enough.  You might want to consider a global alignment such as
Needleman-Wunsch if you are interested in the whole length of the
sequence.  This can produce somewhat misleading information, however, if
the two sequences you are comparing are of vastly different lengths
because of the number of gaps that will be introduced.

Don't use bl2seq it is not very sensitive and it too produces local
alignments not global alignments.

Needleman-Wunsch is accessible through the EMBOSS program 'needle' and can
be parsed with Bio::AlignIO(-format=> 'emboss').  You can get a
percent_identity from the SimpleAlign object that is returned.

Are you interested in an evolutionary distance between two sequences?
This can be generated using protdist from Phylip(for protein, dnadist for
dna seqs) or distmat from EMBOSS, generally this is most useful when
looking at multi-sequence data not just pairwise.

You can dump the SimpleAlign object to an alignment file with AlignIO and
output it as 'msf' and then run the distmat program on this alignment file
to get the evolutionary distance (scores will be dependent on the
substitution matrix you use).

Might be good to go back to the question of what do you want to do with
the whole 'how similar are they' answer?  Programs like prss can tell you
how similar two protein sequences are by comparing the similarity score to
those of random sequences generated with the same amino-acid composition.
>
> So for instance
>
> (where $swissprot->{inx} is a Bio::Index::Swissprot object)
>
> my $seq1 = $swissprot->{inx}->fetch( "KV1K_HUMAN" );
> my $seq2 = $swissprot->{inx}->fetch( "IF1Y_HUMAN" );
>
>
> my $factory = new Bio::Tools::pSW( '-matrix' => `blosum62.bla,
>                                   '-gap' => 12,
>                                   '-ext' => 2, );
>
> print "Seq1 is " . $seq1->seq . "\n";
> print "Seq2 is " . $seq2->seq . "\n";
>
> $aln->write_fasta( \*STDOUT);
>
>
> gives me...
>
> Seq1 is DIQMTQSPSTLSVSVGDRVTITCEASQTVLSYLNWYQQKPGKAPKLLIYAASSLETGV
>         PSRFSGQGSGTBFTFTISSVZPZBFATYYCQZYLDLPRTFGQGTKVDLKR
> Seq2 is PKNKGKGGKNRRRGKNENESEKRELVFKEDGQEYAQVIKMLGNGRLEALCFDGVKRLC
>         HIRGKLRKKVWINTSDIILVGLRDYQDNKADVILKYNADEARSLKAYGGLPEHAKINE
>         TDTFGPGDDDEIQFDDIGDDDEDIDDI
>
> >KV1K_HUMAN/89-101
> QZYLDLPRTFGQG
> >IF1Y_HUMAN/32-44
> QEYAQVIKMLGNG
> % ID is 30.7692307692308
>
>
> I'm not sure whether or not this is the expected behaviour of the pSW
> class. I'm guessing its just returning the maximal scoring fragment
> between the two sequences.
>
> My question is this though. What's the best way that I can extract
> some measure of sequence similarity from two sequences using bioperl?
> Will the bl2seq module work better for me?
>
> Thanks in advance
>
> Phil
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
>

-- 
Jason Stajich
Duke University
jason@cgt.mc.duke.edu