[Biopython] cleaning sequences

Wed Jul 15 08:35:40 EDT 2009

Hi Liam;
That makes sense. It's a good suggestion, and I've added it to the
Project Ideas area of the wiki so that hopefully it'll get picked up
in the future:

http://biopython.org/wiki/Active_projects#Project_ideas

For your specific problem, you should be able to do something along
the lines of:

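from Bio.Seq import Seq

def convert_ambiguous(orig_seq):
    """Return an upper-case copy of orig_seq with non-GATC bases set to N."""
    new_bases = []
    for base in str(orig_seq).upper():
        if base in ["G", "A", "T", "C"]:
            new_bases.append(base)
        else:
            # Anything that is not an unambiguous base becomes N.
            new_bases.append("N")
    # Reuse the alphabet of the input (Seq objects carry one in Biopython
    # releases before 1.78; later releases dropped alphabets).
    return Seq("".join(new_bases), orig_seq.alphabet)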

which would switch all non-GATC characters to the N ambiguity character
(and upper-case the sequence along the way), assuming your downstream
program accepts that.
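
For the original question (telling whether a sequence is ambiguous at
all), a minimal sketch that works on plain strings, and so does not
depend on any particular Biopython version, might look like the
following. The names is_unambiguous and mask_ambiguous are made up for
illustration and are not Biopython functions; it assumes that only G, A,
T and C count as unambiguous, and that you pass in something like
str(record.seq).

```python
import re

# Assumption: only G, A, T and C (any case) count as unambiguous DNA.
_NON_GATC = re.compile(r"[^GATC]")


def is_unambiguous(seq):
    """Return True if seq contains only G, A, T or C (case-insensitive)."""
    return _NON_GATC.search(str(seq).upper()) is None


def mask_ambiguous(seq, replacement="N"):
    """Return seq as an upper-case string with non-GATC characters replaced."""
    return _NON_GATC.sub(replacement, str(seq).upper())
```

Because every character is replaced one-for-one, the length of the
sequence, and so the reading frame, is left unchanged.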

Hope this helps,
Brad

> 
> Yes, I remember the posts, rereading them now. I think my problem is a little
> less complicated than sequence data, seeing as my sequences are GenBank
> entries, so they just need to be read, even if they're bad quality. I
> suppose changing the letter would be a better option for me, especially as
> the reading frame is important for aligning based on peptide sequence.
> 
> As for implementation, I am a complete greenhorn at Python, never mind
> programming, so I wouldn't even know where to start with suggestions,
> sorry about that.
> 
> Regards
> Liam
> 
> 
> 
> 
> On Tue, Jul 14, 2009 at 2:45 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> 
> > Hi Liam;
> > I don't believe there is built-in functionality for doing this. The
> > problem itself is hard because it is a bit underspecified: what
> > should be done when encountering ambiguous characters? Depending on
> > your situation this can be a couple of different things:
> >
> > - Trim the sequence to remove the bases. This might be a
> >  post-sequencing step, and there was some discussion between Peter
> >  and Giles about the parameters of doing this earlier this month:
> >
> >  http://lists.open-bio.org/pipermail/biopython/2009-July/005342.html
> >
> > - Replace the bases with an accepted ambiguity character (say, N or
> >  x)
> >
> > So it's a bit hard to generalize. Saying that, we'd be happy for
> > thoughts on an implementation that would tackle these sorts of
> > issues.
> >
> > Brad
> >
> > > I was wondering if there was a built-in method for determining whether a
> > > sequence (GenBank or FASTA) is an ambiguous or unambiguous sequence. The
> > > reason I ask is I am trying to subtype a couple hundred viral DNA sequences,
> > > and due to bad sequencing, the sequences often have ambiguous characters in
> > > them, which the algorithm used to subtype doesn't like. I realise I can
> > > compare each letter of each genome in a loop with GATC to determine
> > > ambiguity, but it might be easier if there was a built-in function.
> > >
> > > Thanks
> > > Liam
> > >
> > >
> > >
> > > --
> > > -----------------------------------------------------------
> > > Antiviral Gene Therapy Research Unit
> > > University of the Witwatersrand
> > > Faculty of Health Sciences, Room 7Q07
> > > 7 York Road, Parktown
> > > 2193
> > >
> > > Tel: 2711 717 2465/7
> > > Fax: 2711 717 2395
> > > Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com
> > > _______________________________________________
> > > Biopython mailing list  -  Biopython at lists.open-bio.org
> > > http://lists.open-bio.org/mailman/listinfo/biopython
> >