[Bioperl-l] [BiO BB] B, Z, N, X in refseq

Mon May 30 22:46:20 EDT 2005

> I am using the RefSeq sequences from GenBank. There are some strange
> characters such as B, Z, N, X in the protein sequences.

These are standard ambiguity codes, see for example
< http://www.ncbi.nlm.nih.gov/blast/html/search.html >

(except for "N", which is simply asparagine)

> Can anybody tell me what these "bad" characters mean?

This most likely means that the particular sequence was derived by 
chemical sequencing of polypeptides, not by translation of nucleic 
acids; thus it may be hard to distinguish between N/D (reported as B) 
or Q/E (reported as Z). X marks a residue that could not be identified 
at all.
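
To see where such codes occur in a sequence, a minimal sketch along these 
lines may help. It assumes the usual IUPAC meanings (B = D or N, Z = E or 
Q, X = any residue); the function name classify and the example string 
are made up for illustration and are not part of BioPerl or any other 
library.

```python
# Hypothetical helper: report every letter that is not one of the
# 20 standard amino acids, together with what it may stand for.
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")

# Assumed IUPAC ambiguity codes for proteins.
AMBIGUITY = {
    "B": ("D", "N"),  # Asx: aspartate or asparagine
    "Z": ("E", "Q"),  # Glx: glutamate or glutamine
}


def classify(seq):
    """Return (position, code, possible residues) for each non-standard letter.

    Positions are 1-based. An empty tuple means the letter is unknown here.
    """
    report = []
    for pos, aa in enumerate(seq.upper(), start=1):
        if aa in STANDARD:
            continue  # includes N, which is plain asparagine
        if aa in AMBIGUITY:
            report.append((pos, aa, AMBIGUITY[aa]))
        elif aa == "X":
            report.append((pos, aa, tuple(sorted(STANDARD))))
        else:
            report.append((pos, aa, ()))
    return report


if __name__ == "__main__":
    for pos, code, options in classify("MNBZXK"):
        print(pos, code, "/".join(options))
```

Note that N never shows up in the report, since it is an ordinary 
residue in a protein sequence.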

> What
> should I do if my program complains about these bad characters? Remove
> them or replace them with some specific amino acid?

That depends on what you want to do. For database sequence search the 
BLAST server accepts these codes and they are correctly represented in 
standard mutation-data matrices for alignment scores, so you don't need 
to worry. For molecular weight calculations you could use an average or 
randomly choose one or the other. I can't imagine an application where 
this level of detail would make much of a difference. However: removing 
them is always a bad choice, e.g. for sequence alignments you would be 
introducing a gap (bad!).
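
For the molecular weight case, the "use an average" option could be 
sketched roughly as follows. The function name approx_mass is made up, 
and the residue masses are approximate average values typed in from 
memory of standard tables, so please check them against a reference 
before relying on the numbers. X is rejected rather than guessed, since 
no sensible average exists without further assumptions.

```python
# Approximate average residue masses in Da (amino acid minus one water).
# Assumption: values are illustrative; verify against a reference table.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water is added for the free N- and C-termini

# Ambiguity codes and the residues they may stand for.
AMBIGUOUS = {"B": ("D", "N"), "Z": ("E", "Q")}


def approx_mass(seq):
    """Approximate average mass of a peptide, averaging over B and Z."""
    total = WATER
    for aa in seq.upper():
        if aa in AMBIGUOUS:
            options = AMBIGUOUS[aa]
            # Use the mean of the two candidate residues.
            total += sum(RESIDUE_MASS[o] for o in options) / len(options)
        elif aa in RESIDUE_MASS:
            total += RESIDUE_MASS[aa]
        else:
            raise ValueError(f"cannot assign a mass to {aa!r}")
    return total


if __name__ == "__main__":
    # The B variant lies between the D and N variants, about 0.5 Da from each.
    for peptide in ("ADA", "ABA", "ANA"):
        print(peptide, round(approx_mass(peptide), 3))
```

The difference between the D and N variants is about 1 Da, so the 
averaging changes the total by at most about 0.5 Da per ambiguous 
position.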

Hope this helps, it's pretty standard textbook knowledge though, and 
maybe it would be worthwhile to read up on the Net before you post to 
several groups at once  :-)

B.

>
> Thanks
>
> Frank
>
>
> -- 
> Do not guess who  I am.  I am not Bush in BlackHouse