[Bioperl-l] Help on a basic EST-genomic alignment script

Tue Jun 29 20:40:03 EDT 2004

On Tue, 29 Jun 2004, Edward Chuong wrote:

> Hi,
>
> I tried using FASTX, which looks like it gives similar alignment
> results as estwise. Can bioperl take in the fastx output? I always
> thought "fasta" output was just the simple >header followed by
> sequence, but the output by fastx has far more information. However,
> even fastx output is missing the original nucleotide sequence.. which
> wouldn't be a problem if it would included the nucleotide locations,
> but it doesn't as far as I can see.

FASTA has more than one meaning in bioinformatics

FASTA/Pearson sequence format
>ID DESC
SEQUENCE
[parse with Bio::SeqIO and get sequences]

FASTA multiple sequence format
>ID DESC
SEQ-WITH-GAPS
[parse with Bio::AlignIO and get an alignment]

FASTA programs (FASTA, FASTX,FASTY,TFASTA,TFASTX....)
produces a FASTA report
[parse with Bio::SearchIO and get Search Report objects]

See the SearchIO HOWTO on the bioperl website.

>
> Ultimatley I need something that will align an EST (from a file) to a
> mus CDS (retrieved from genbank) and get an aln object I can use to
> find dn/ds (which means the alignment must be in the correct protein
> coding frame) in bioperl..
>
> So estwise and fastx/y align the both translated sequences
> beautifully, but it seems like I can't parse them in bioperl, and they
> don't return as nucleotide anyway.
>
Ah but you HAVE the est sequence in your query file you used to run FASTX.
FASTX/Y gives you nt coordinates for your query sequence just like BLASTX
does for translated searches.

So $hsp->query->start and $hsp->query->end are the start and stop of the
alignment.

You can just get the NT sequence by reading in your EST sequence with
SeqIO (or getting it from a sequence database like Bio::Index::Fasta) and
then call

my $cdsseq = $seq->trunc($hsp->query->start, $hsp->query->end);

Make a hash of all these seqs
$cdsseqs{$hsp->query->seq_id} = $cdsseq;

You'll also need to get the CDS region from the subject (Mus CDS)- I'm
assuming you built your protein set from just Mus CDS and not cDNA -
otherwise if these are ensembl peps you can get just the CDS which codes
for each protein accession from EnsMart.  Or you can just get the CDS
clipped out from the genbank file as you would have already done.

The start/end of the alignment in subject nt coords will be
my ($hstart,$hend) = ( ($hsp->hit->start -1) * 3 + 1,
                     ($hsp->hit->end -1) * 3 + 1);

So do the same thing as before and grab the sub-sequence from the Mus
cDNA and add it to the %cdssseq hash.

 Now you still have to contend with frameshifts - you're going to have to
 figure where they are coded as '/' and '\' in the query string
 ($hsp->query_string, $hsp->hit_string) and either insert or delete an
 appropriate base (insert an N if it is a missing base, remove the extra
 base if it is there).

 I'm not really sure how to code this up best so you may at first just bin
 all the ESTs which have likely frameshifts and work with them later by
 hand.  Rather than trying to write this whole algorithm at one time I
 would get the easier things working first.  Other people may have done
 this too and have better suggestions than me.

For good measure you might take your cdsseqs, translate them back to
protein, and align with pSW or needle/water with EMBOSS and check that
you don't have any stop codons (in case your fixing of frameshifts didn't
work or to detect if you are accidently clipping out the wrong piece of
sequence).  Given this alignment - $proteinln you can use the following to
align the cds sequences using the protein aln as a template.

use Bio::Align::Utils qw(aa_to_dna_aln);
my $cdsaaln = &aa_to_dna_aln($proteinaln,%cdsseqs);

Then pass the $cdsaln object to the PAML Runner
(Bio::Tools::Run::Phylo::PAML:: Codeml or Yn00).

May seem complicated - but I don't know how else you do it....  When it's
done, we should add it as a script to bioperl.

> The closest thing I can find is est2genome from EMBOSS. It aligns the
> EST to the mus cDNA nucleotide sequences great--but that alignment
> isn't in any particular coding frame, which would cause problems if I
> stuck it in a dn/ds module. Also, est2genome doesn't appear to output
> in standard EMBOSS format.
>
Try sim4 if you are doing est to genome alignments - but not sure how well
this will work cross-species.  The FASTX/Y approach is probably going to
give you a better

> Anyone with experience doing something similar with dn/ds care to
> share how you got a proper alignment object?
>
> Thanks!
>
> -Ed
>
> On Mon, 28 Jun 2004 19:16:11 -0400 (EDT), Jason Stajich
> <jason at cgt.duhs.duke.edu> wrote:
> >
> > Hmm - I guess estwise doesn't provide a machine parseable output as I
> > would have thought.  What does one do ewan?  No one has written a
> > wise prettyblock alignment parser yet sadly.
> >
> > -jason
> >
> > On Mon, 28 Jun 2004, Edward Chuong wrote:
> >
> > > > Just use estwise as a standalone program not from within perl.
> > > >  % estwise protein est
> > > >
> > > > estwise is pretty slow so I wouldn't embark on this route unless you know
> > > > what you are doing.  Try a BLAST or FASTA route first to get likely
> > > > homologs.
> > > >
> > >
> > > Hey,
> > >
> > > I'm using blast already to find the likely homologs, which is working
> > > fine, and I get the homolog CDS/protein sequence by querying genbank
> > > with the accessionID from blast.
> > >
> > > ESTwise seems fast enough for my very small ESTs. I'm not sure what
> > > fastx is used for. I need a library..?
> > >
> > > How can I automate running estwise? Should i just use some sort of shell script?
> > >
> > > Is there any way to get the alignment I get from estwise into one of
> > > the dn/ds modules in bioperl?
> > >
> > > Thanks so much for helping!
> > >
> > > -Ed
> > >
> >
> > --
> > Jason Stajich
> > Duke University
> > jason at cgt.mc.duke.edu
> >
>
>
>

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu