[Bioperl-l] Help on a basic EST-genomic alignment script

Tue Jul 6 21:38:43 EDT 2004

Hi,

> 
> my $cdsseq = $seq->trunc($hsp->query->start, $hsp->query->end);
> 
> Make a hash of all these seqs
> $cdsseqs{$hsp->query->seq_id} = $cdsseq;
> 
> You'll also need to get the CDS region from the subject (Mus CDS)- I'm
> assuming you built your protein set from just Mus CDS and not cDNA -
> otherwise if these are ensembl peps you can get just the CDS which codes
> for each protein accession from EnsMart.  Or you can just get the CDS
> clipped out from the genbank file as you would have already done.
> 
> The start/end of the alignment in subject nt coords will be
> my ($hstart,$hend) = ( ($hsp->hit->start -1) * 3 + 1,
>                      ($hsp->hit->end -1) * 3 + 1);
> 
> So do the same thing as before and grab the sub-sequence from the Mus
> cDNA and add it to the %cdssseq hash.
> 
>  Now you still have to contend with frameshifts - you're going to have to
>  figure where they are coded as '/' and '\' in the query string
>  ($hsp->query_string, $hsp->hit_string) and either insert or delete an
>  appropriate base (insert an N if it is a missing base, remove the extra
>  base if it is there).
> 

I followed mostly what you said, also adding "---" in the nuc seq when
there's a "-" in the peptide. Can you elaborate on what the "/" and
"\" are? I couldn't find any documentation on them. Do I have to
figure out how many nucleotides its shifted? for '/' I have it remove
the base, for '\' I have it remove 2 bases and add '------', and it
seems to work for my test cases. But it looks like I'm getting there.
Printing out both sequences gives a pretty good match.

> For good measure you might take your cdsseqs, translate them back to
> protein, and align with pSW or needle/water with EMBOSS and check that
> you don't have any stop codons (in case your fixing of frameshifts didn't
> work or to detect if you are accidently clipping out the wrong piece of
> sequence).  Given this alignment - $proteinln you can use the following to
> align the cds sequences using the protein aln as a template.
> 
> use Bio::Align::Utils qw(aa_to_dna_aln);
> my $cdsaaln = &aa_to_dna_aln($proteinaln,%cdsseqs);

OK, I aligned the proteins with pSW and they look great, do I need to
do anything about the stop codons at the end? I'm also having some
trouble making the hash for this. It keeps saying I can't use "mus"
(or anything I put in, including $est->id or $hsp->query->seq_id) when
use strict refs are on, but I remember making hashes before. All
syntaxes, my %cdsseqs = ('mus', $est) or $cdsseqs{'mus'} = $est, etc
do this. Probably something wrong in my overall code, I think I'll
figure it out eventually. I still don't understand, howerver, what
exactly does aa_to_dna_aln do?

> 
> Then pass the $cdsaln object to the PAML Runner
> (Bio::Tools::Run::Phylo::PAML:: Codeml or Yn00).

The sample script to do this looks very complicated :). I'll try it
out though. The one in Bio::Align::Utilities looks easier to use, but
I'm having problems making a simplealn object. I'm pretty sure my
est/pero are aligned, so I tried just adding them to the simplealn
object, but I needed to convert them into LocatableSeq objects.
Anything special I need for the start/end values, or should I just use
the values for the shorter sequence?

Thank you so much for all the help!

-Ed

-- 
Edward Chuong
http://iacs5.ucsd.edu/~echuong