[Bioperl-l] Classifying SNPs

Tue Jul 14 18:59:51 EDT 2009

Hello Abhishek

Ensembl has a module for calculate SNP consequences in a transcript.

The script that they use to create their consequences is located in:

ensembl-55/ensembl-variation/scripts/import/parallel_transcript_variation.pl

The important bit is to  convert your snp 
coordenates and the variation_allele into a ConsequenceType object

  $consequence_type = 
Bio::EnsEMBL::Variation::ConsequenceType->new($tr->dbID,$chr,$start,$end,$strand,\@alleles);

and pass this and a transcript to the  type_variation
Bio::EnsEMBL::Utils::TranscriptAlleles exported method

   $consequences = type_variation($tr, "", $consequence_type);

in the module

ensembl-55/ensembl/modules/Bio/EnsEMBL/Utils/TranscriptAlleles.pm

The other important bit in this script is that now the functional_genomics 
consequences are calculated in this script instead in the type_variation()

The only drawback is that it return only the ensembl classes of 
consequences , but you can extend that later if you need more specific 
consequences (I have done that in the past for different projects).

This ensembl aproach will save you a lot of problems with the mapping from 
gene to protein and with multiple snps in a codon.

If you have experience with ensembl then is easy to follow the code. If 
not you can always ask for help in the ensembl-dev mailing list 
(ensembl-dev at ebi.ac.uk)

If you want to read the code without checking out the whole api:

http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl-variation/scripts/import/parallel_transcript_variation.pl?revision=1.27&root=ensembl&view=markup
http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl/modules/Bio/EnsEMBL/Utils/TranscriptAlleles.pm?root=ensembl&view=log

hope this helps

   - Pablo

Abhishek Pratap <abhishek.vit <at> gmail.com> writes:

>
> Hi Jason
>
> Thanks for a detailed insight. I would definitely go the ensembl way 
first
> and try to see if it can do exactly what we want.
>
>  In case it does/'nt I will report back on this same thread. I think 
having
> something like this in the Bioperl will help the NGS community. Lot of
> people are predicting SNPs from NGS.oops(next generation sequencing ) 
data
> and looking for ways to better annotate/classify their predictions.
>
> Thanks guys .. It is a pleasure to interact with you all. Just 
overwhelmed
> to see the responses.
>
> best,
> -Abhi
>
> On Mon, Jul 13, 2009 at 12:54 PM, Jason Stajich <jason <at> bioperl.org> 
wrote:
>
> > Ensembl would be best place to go if you are working with human SNPs 
but
> > for those who aren't so data lucky...
> >
> > Aspects of this also relates to the dn/dS code in the
> > Bio::Align::DNAStatistics -- thought it does the classification and
> > comparison all at once so you'd have to dig code out.
> >
> > And the mcdonald_kreitman code in Bio::PopGen::Statistics which 
computes a
> > synonymous or nonsynonymous via lookup table that is stored in
> > Bio::MolEvol::CodonModel which compares the edit path which is encoded 
as
> > the two codons concatenated together -- i.ee
> >
> > use Bio::MolEvol::CodonModel;
> >  my $codon_path = Bio::MolEvol::CodonModel->codon_path;
> >  my ($ns, $syn) = $codon_path->{'AATAAC'};
> >  print "AAT -> AAC: $ns ns mutations, $syn syn mutations\n";
> >
> >
> > It all kind of depends on how you have the data organized, if it is 
just
> > SNPs and you are trying to figure out if they are syn or non-syn then 
you
> > kind of need a good database to do this since you'll have to know what 
gene
> > they are in, CDS of the gene, etc.  It is possible to do with 
something as
> > basic as GFF3 for your genome and the SNP locations and
> > Bio::DB::SeqFeature::Store.  While I can think of a way to code it up 
from
> > those bare-bones - maybe you should report back if you can just use 
the
> > Ensembl classification of the SNPs?
> >
> > -jason
> >
> >
> > On Jul 13, 2009, at 9:33 AM, Chris Fields wrote:
> >
> >  Bio::Coordinate might help with coordinate conversion.  However, much 
of
> >> this sounds very Ensembl-like.  Have you looked at the Ensembl perl 
API?  It
> >> can do #1 (coordinate conversion), and I'm sure something could be 
written
> >> up to do the second.
> >>
> >> chris
> >>
> >> On Jul 13, 2009, at 10:43 AM, Mark A. Jensen wrote:
> >>
> >>  Thanks Abhi-- I had a feeling there was more (or "less") to it-- 
this
> >>> would be a nice feature to have, don't think it exists. Will think 
about
> >>> it-- cheers
> >>> ----- Original Message ----- From: "Abhishek Pratap" <
> >>> abhishek.vit <at> gmail.com>
> >>> To: "Mark A. Jensen" <maj <at> fortinbras.us>
> >>> Cc: <bioperl-l <at> lists.open-bio.org>
> >>> Sent: Monday, July 13, 2009 11:10 AM
> >>> Subject: Re: [Bioperl-l] Classifying SNPs
> >>>
> >>>
> >>>  Dear Mark
> >>>> Sorry I was not able to reply earlier. Many Thanks for your 
detailed
> >>>> explanation. However this is not exactly what I am looking for. May 
be
> >>>> my
> >>>> initial mail was not well articulated or I am not able to infer 
your
> >>>> reply
> >>>> fully. My bad.
> >>>>
> >>>> Well as an input what we have is the just the genomic coordinates 
for
> >>>> SNP's
> >>>> predicted by Illumina propriety software CASAVA. What we would like 
to
> >>>> do is
> >>>> to further classify these predicted SNP's  . If they fall into 
Coding
> >>>> region
> >>>> then whether they are synonymous/non-syn SNPs.
> >>>>
> >>>> So I guess something which translates
> >>>> 1. SNP genomic coordinate into mRNA offset
> >>>> 2. Then identify the ORF and target codon and check whether the SNP
> >>>> substitution will be syn/non-syn.
> >>>>
> >>>> Thanks,
> >>>> -Abhi
> >>>>
> >>>> On Wed, Jul 8, 2009 at 11:23 AM, Mark A. Jensen <maj <at> 
fortinbras.us>
> >>>> wrote:
> >>>>
> >>>>  Hey Abhishek-
> >>>>> You might root around in Bio::PopGen. Here's a script to get stuff 
from
> >>>>> raw fasta data--see comments within.
> >>>>> cheers
> >>>>> Mark
> >>>>>
> >>>>> use Bio::AlignIO;
> >>>>> use Bio::PopGen::Utilities;
> >>>>>
> >>>>> $file = "your_raw_file.fas";
> >>>>>
> >>>>>
> >>>>> my $aln = Bio::AlignIO->new(-format=>'fasta', 
-file=>$file)->next_aln;
> >>>>> # get the alignment into a Bio::PopGen::Population format, with 
codons
> >>>>> # as the marker sites
> >>>>> my $pop = 
Bio::PopGen::Utilities->aln_to_population(-alignment=>$aln,
> >>>>> -site_model=>'cod');
> >>>>> # here are your variable codons...
> >>>>> my @cdnpos = $pop->get_marker_names;
> >>>>> # here are your individuals represented in the alignment
> >>>>> my @inds = $pop->get_Individuals;
> >>>>> # which have names like "Codon-3-9", "Codon-4-12", etc
> >>>>> foreach my $cdn (@cdnpos) {
> >>>>> # calculate the unique codons represented at this codon position
> >>>>> my (%ucdns, @ucdns);
> >>>>> @genos = $pop->get_Genotypes(-marker=>$cdn);
> >>>>> $ucdns{$_->get_Alleles}++ for @genos;
> >>>>> @ucdns = sort keys %ucdns;
> >>>>> #
> >>>>> # here, use translate or something faster to identify syn/non-syn
> >>>>> # check out code in Bio::Align::DNAStatistics for various methods
> >>>>>
> >>>>> }
> >>>>> # relate back to individuals with this
> >>>>> foreach my $ind (@inds) {
> >>>>> print "Individual ".$ind->unique_id."\n";
> >>>>> print "Site\tAllele\n";
> >>>>> foreach my $cdn (@cdnpos) {
> >>>>> print $cdn, "\t", $ind->get_Genotypes($cdn)->get_Alleles, "\n";
> >>>>> }
> >>>>> }
> >>>>>
> >>>>>
> >>>>> 1;
> >>>>>
> >>>>> ----- Original Message ----- From: "Abhishek Pratap" <
> >>>>> abhishek.vit <at> gmail.com>
> >>>>> To: <bioperl-l <at> lists.open-bio.org>
> >>>>> Sent: Wednesday, July 08, 2009 10:24 AM
> >>>>> Subject: [Bioperl-l] Classifying SNPs
> >>>>>
> >>>>>
> >>>>>
> >>>>> Hi All
> >>>>>
> >>>>> This might seem to be an old track question. However I was not 
able to
> >>>>> find a good answer in the many diff mailing list archives.
> >>>>>
> >>>>> For all our SNP predictions we would like to know whether they are
> >>>>> synonymous / non-synonymous. If Non-synonymous/Exonic then find 
the
> >>>>> position on the gene where amino acid is getting changed and to 
what
> >>>>> ...Also info about indels will help.
> >>>>>
> >>>>> I am not sure if something like this already exists. If not even 
some
> >>>>> pointers on how to move forward will help.
> >>>>>
> >>>>> Thanks,
> >>>>> -Abhi
> >>>>>
> >>>>> _______________________________________________
> >>>>> Bioperl-l mailing list
> >>>>> Bioperl-l <at> lists.open-bio.org
> >>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>>>>
> >>>>>
> >>>>>
> >>>>>  _______________________________________________
> >>>> Bioperl-l mailing list
> >>>> Bioperl-l <at> lists.open-bio.org
> >>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>>>
> >>>>
> >>> _______________________________________________
> >>> Bioperl-l mailing list
> >>> Bioperl-l <at> lists.open-bio.org
> >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>>
> >>
> >> _______________________________________________
> >> Bioperl-l mailing list
> >> Bioperl-l <at> lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>
> >
> > --
> > Jason Stajich
> > jason <at> bioperl.org
> >
> >

=====================================================================
                      Pablo Marin-Garcia, PhD

                     \\//          (Argiope bruennichi
                \/\/`(||>O:'\/\/   with stabilimentum)
                     //\\

Sanger Institute                |  PostDoc / Computer Biologist
Wellcome Trust Genome Campus    |  team : 128/108 (Human Genetics)
Hinxton, Cambridge CB10 1HH     |  room : N333
United Kingdom                  |  email: pablo.marin at sanger.ac.uk
====================================================================

-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE.