[Biopython] cleaning sequences

Tue Jul 14 08:45:21 EDT 2009

Hi Liam;
I don't believe there is built-in functionality for doing this. The
problem itself is hard because it is a bit underspecified: what
should be done when ambiguous characters are encountered? Depending
on your situation, there are a couple of different options:

- Trim the sequence to remove the ambiguous bases. This might be a
  post-sequencing step; Peter and Giles discussed the parameters for
  doing this earlier this month:

  http://lists.open-bio.org/pipermail/biopython/2009-July/005342.html

- Replace the ambiguous bases with a single accepted ambiguity
  character (say, N or x).
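
As a rough sketch of those two options in plain Python (no Biopython
calls involved): the helper names trim_ambiguous_ends and
mask_ambiguous are made up for illustration, anything other than A, C,
G or T is assumed to count as ambiguous, and "trimming" is assumed to
mean stripping ambiguous characters from the two ends only.

```python
# Assumption: only these four letters count as unambiguous DNA.
UNAMBIGUOUS = frozenset("ACGT")


def trim_ambiguous_ends(seq):
    """Strip ambiguous characters from both ends of the sequence.

    Ambiguous characters in the interior are left untouched.
    """
    s = str(seq).upper()
    start, end = 0, len(s)
    while start < end and s[start] not in UNAMBIGUOUS:
        start += 1
    while end > start and s[end - 1] not in UNAMBIGUOUS:
        end -= 1
    return s[start:end]


def mask_ambiguous(seq, replacement="N"):
    """Replace every ambiguous character with a single placeholder."""
    return "".join(
        c if c in UNAMBIGUOUS else replacement for c in str(seq).upper()
    )
```

Both helpers call str() on their input, so they should accept either a
plain string or a Seq object. Which of the two is appropriate depends
on what the downstream subtyping algorithm tolerates.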

So it's a bit hard to generalize. That said, we'd be happy to hear
thoughts on an implementation that would tackle these sorts of
issues.
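
As a starting point, the check you describe can be sketched in a few
lines of plain Python. This is only an illustration, not an existing
Biopython function: the names ambiguous_positions and is_unambiguous
are hypothetical, and it assumes that only A, C, G and T (in either
case) count as unambiguous.

```python
# Assumption: only these four letters count as unambiguous DNA.
UNAMBIGUOUS = frozenset("ACGT")


def ambiguous_positions(seq):
    """Return (index, character) pairs for every ambiguous character."""
    return [
        (i, c)
        for i, c in enumerate(str(seq).upper())
        if c not in UNAMBIGUOUS
    ]


def is_unambiguous(seq):
    """True if the sequence contains only A, C, G and T."""
    return not ambiguous_positions(seq)
```

For sequences read from a GenBank or FASTA file, passing
str(record.seq) for each record returned by Bio.SeqIO.parse should
work, since the helpers only need the sequence as text. Reporting the
positions rather than just True/False makes it easier to decide
afterwards whether to trim or replace.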

Brad

> I was wondering if there was a built-in method for determining whether a
> sequence (Genbank or FASTA) is an Ambiguous or Unambiguous sequence. The
> reason I ask is I am trying to subtype a couple hundred viral DNA sequences,
> and due to bad sequencing, the sequences often have ambiguous characters in
> them, which the algorithm used to subtype doesn't like. I realise I can
> compare each letter of each genome in a loop with GATC to determine
> ambiguity, but it might be easier if there was a built-in function.
> 
> Thanks
> Liam
> 
> 
> 
> -- 
> -----------------------------------------------------------
> Antiviral Gene Therapy Research Unit
> University of the Witwatersrand
> Faculty of Health Sciences, Room 7Q07
> 7 York Road, Parktown
> 2193
> 
> Tel: 2711 717 2465/7
> Fax: 2711 717 2395
> Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com