[Bioperl-l] reformating dna sequence containing ambiguity symbols
Chris Fields
cjfields at illinois.edu
Mon Mar 16 08:48:18 EDT 2009
I wouldn't rely completely on that functionality for anything in 1.7
and beyond (and that may include trunk). It's very likely that
particular global will change to an instance variable in the future to
deal with seqs coming from same alphabet but different symbol tables;
it causes all sorts of subtle scoping problems within code.
See here for some reasoning re: that and other Bio::Align issues (the
page is a stub for now):
http://www.bioperl.org/wiki/Align_Refactor
chris
On Mar 16, 2009, at 4:57 AM, Bernd Web wrote:
> Hi Russel,
>
> Sometimes I have a non-standard symbol too (though these relate to
> residue letters generally and do not change sequence length). I add
> these letters to the variable used for matching:
>
> $Bio::PrimarySeq::MATCHPATTERN = 'A-Za-z\-\.\*\?';
>
>
> Bernd
>
> On Mon, Mar 16, 2009 at 4:30 AM, Smithies, Russell
> <Russell.Smithies at agresearch.co.nz> wrote:
>> I've just been reformatting some fasta from dbSNP that contains
>> ambiguity symbols and had to come up with a non-standard solution
>> as I needed to turn validation off in Bio::Seq but couldn't see an
>> obvious way to do it.
>>
>> dbSNP format the fasta for their rs* SNPs like this:
>>
>>> gnl|dbSNP|rs29025902 rs=29025902|pos=251|len=501|taxid=9913|
>>> mol="genomic"|class=1|alleles="A/G"|build=125
>> AGGCTACCAA TAGGACATCA CTGACTGTGA GGCTGGGAAG AAAGACCGAG AAGCACCCCA
>> GGACACAGAA TCTCCTTCAC
>> ATACAGAGGC AGTGGACACA TAGAGTACAG GCAGCGGTAA AATGGAGTAA AAAATTAGAT
>> AGTGGTGAGT TTTGAGTATG
>> AATTGCCTTT GTTTTTAAAT TAGTTCTAAG TTTATAAGAC AAGTTTTATT TTTTTATTTT
>> ATTTATTTGC CACATGGCTT
>> GTGGGTTTGC
>> R
>> GGATCTTAGT TCCCTGACCA GGGATTGAAC CTGTGCCCTC AGCAGTGAAA ACATGGAGTC
>> CTAACAACTG GACCACCAGG
>> GAATTCCCTA TATGACTTAA TTTTTAATAA TATTTGTAGC TAACAATTGA CATGCAGAGT
>> TCCTGAAGAT TTAAGCATGG
>> GCTCCCATGA ACCAGTATGA ACCAGCTCCA GCACAGCACA GGTTTTGTTT TACTTTTGGA
>> GGGGAGGTTT GATTGTGTCT
>> ACATGCTAAT
>>
>> But I needed it (for the Sequenom) with the ambiguity symbol on
>> brackets like this:
>>
>>> gnl|dbSNP|rs29025902
>> AGGCTACCAATAGGACATCACTGACTGTGAGGCTGGGAAGAAAGACCGAGAAGCACCCCAGGACACAGAATCTCCTTCACATACAGAGGCAGTGGACACATAGAGTACAGGCAGCGGTAAAATGGAGTAAAAAATTAGATAGTGGTGAGTTTTGAGTATGAATTGCCTTTGTTTTTAAATTAGTTCTAAGTTTATAAGACAAGTTTTATTTTTTTATTTTATTTATTTGCCACATGGCTTGTGGGTTTGC[A/G]GGATCTTAGTTCCCTGACCAGGGATTGAACCTGTGCCCTCAGCAGTGAAAACATGGAGTCCTAACAACTGGACCACCAGGGAATTCCCTATATGACTTAATTTTTAATAATATTTGTAGCTAACAATTGACATGCAGAGTTCCTGAAGATTTAAGCATGGGCTCCCATGAACCAGTATGAACCAGCTCCAGCACAGCACAGGTTTTGTTTTACTTTTGGAGGGGAGGTTTGATTGTGTCTACATGCTAAT
>>
>> I came up with a 50% BioPerl solution using Bio::SeqIO and
>> Bio::Tools::IUPAC but the final printing of the fasta is dome
>> 'manually'.
>> It's a bit hacky but I'm particularly proud of the obscurity I
>> managed in my switch statement :-)
>>
>> ###############################
>> #!perl -w
>>
>> use Bio::SeqIO;
>> use Bio::Tools::IUPAC;
>> use Switch;
>>
>> my $seq_in = Bio::SeqIO->new(-file=>$ARGV[0], -format=>"fasta" ) or
>> die $!;
>>
>> while (my $seqobj = $seq_in->next_seq) {
>> my $seq = sprintf ">%s\n", $seqobj->display_id;
>> my $iupac_seq = new Bio::Tools::IUPAC(-seq => $seqobj);
>> foreach (@{$iupac_seq->{_alpha}}){
>> switch($#{@{$_}}){
>> case 0{$seq.= @{$_}[0]}
>> case 1{$seq .= sprintf
>> "[%s]",join("/",@{$_})}
>> else {$seq .= 'N'}
>> }
>> }
>> print "$seq\n";
>> }
>> #######################
>>
>> :-)
>>
>> --Russell
>>
>>
>> Russell Smithies
>> Bioinformatics Applications Developer
>> T +64 3 489 9085
>> E russell.smithies at agresearch.co.nz
>> Invermay Research Centre
>> Puddle Alley,
>> Mosgiel,
>> New Zealand
>> T +64 3 489 3809
>> F +64 3 489 9174
>> www.agresearch.co.nz
>>
>> Toitu te whenua, Toitu te tangata
>> Sustain the land, Sustain the people
>>
>>
>>
>>
>>
>>
>> =
>> =
>> =====================================================================
>> Attention: The information contained in this message and/or
>> attachments
>> from AgResearch Limited is intended only for the persons or entities
>> to which it is addressed and may contain confidential and/or
>> privileged
>> material. Any review, retransmission, dissemination or other use
>> of, or
>> taking of any action in reliance upon, this information by persons or
>> entities other than the intended recipients is prohibited by
>> AgResearch
>> Limited. If you have received this message in error, please notify
>> the
>> sender immediately.
>> =
>> =
>> =====================================================================
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
More information about the Bioperl-l
mailing list