[BioPython] accessing "data quality" Phrap records in Genbank

Wed Aug 1 21:05:33 UTC 2007

Emanuel Hey wrote:
> for some sequence records, NCBI has a record of the
> Phrap scores corresponding to the sequence  (i.e. one
> score for each base). 
> 
> These are typically records containing draft sequences
> from genome projects
> 
> to see an example, try this link
> 
> http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&qty=1&c_start=1&list_uids=153792835&uids=&dopt=qual&dispmax=5&sendto=&fmt_mask=0&from=begin&to=end&extrafeatpresent=1&ef_CDD=8&ef_MGC=16&ef_HPRD=32&ef_STS=64&ef_tRNA=128&ef_microRNA=256&ef_Exon=512
> 
> How could I go about downloading these sequence
> quality scores?  

One option for getting the data would be to construct the URL and then 
download it using standard Python tools, e.g. the urllib.urlretrieve 
function. Alternatively, Biopython has some NCBI/Entrez code you might be 
able to use...
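
As a rough illustration of the first option, here is a minimal sketch for 
Python 3, where urllib.urlretrieve lives at urllib.request.urlretrieve. 
The helper names build_qual_url and download_qual are made up for this 
example, and the query parameters kept (db, list_uids, dopt=qual) are just 
the ones that look essential in the URL quoted above. Whether the server 
needs any of the others, or still answers at this address, is not 
something this sketch checks.

```python
from urllib.parse import urlencode
from urllib.request import urlretrieve

# Base address copied from the URL in the original question.
BASE_URL = "http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi"


def build_qual_url(uid, db="nuccore"):
    """Build a URL asking for the quality-score view of one record.

    Only a subset of the parameters from the quoted URL is kept.
    That is an assumption; the server may require more of them.
    """
    params = [("db", db), ("list_uids", str(uid)), ("dopt", "qual")]
    return BASE_URL + "?" + urlencode(params)


def download_qual(uid, filename):
    """Fetch the quality view for uid and save it to filename.

    Needs network access, so it is defined here but not called.
    """
    urlretrieve(build_qual_url(uid), filename)
    return filename
```

Calling download_qual(153792835, "scores.qual") would then try to save 
the example record from the question to a local file.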

The second step is actually parsing the data file into a usable form. 
The "Base Quality" format looks very easy to parse, with a FASTA-like 
header followed by space-separated decimal scores.  Their XML format 
also looks fairly simple - the core data looks like it's held as a string 
where each pair of characters represents one score in hex.  As far as I 
could see based on the URL you gave, none of the other format options 
actually contain the "data quality" information.
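
Both layouts described above can be handled in a few lines of plain 
Python. The sketch below is only that: parse_qual and decode_hex_scores 
are hypothetical helper names, not Biopython functions, and the input 
layout is assumed from the description (a ">" header line followed by 
whitespace-separated decimal scores, or a string of two-character hex 
values), not taken from any NCBI specification.

```python
def parse_qual(handle):
    """Yield (title, scores) pairs from "Base Quality" style text.

    Assumes each record starts with a '>' header line and that the
    following lines hold whitespace-separated decimal scores.
    """
    title = None
    scores = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if title is not None:
                yield title, scores  # emit the previous record
            title = line[1:]
            scores = []
        elif title is not None:
            scores.extend(int(value) for value in line.split())
    if title is not None:
        yield title, scores  # emit the last record


def decode_hex_scores(text):
    """Turn a string of two-character hex values into a list of ints.

    Whitespace is dropped first; an odd number of hex digits is an error.
    """
    digits = "".join(text.split())
    if len(digits) % 2:
        raise ValueError("expected an even number of hex digits")
    return [int(digits[i:i + 2], 16) for i in range(0, len(digits), 2)]
```

With the scores in a list, finding the positions that reach some 
threshold is a one-liner, for example 
[i for i, q in enumerate(scores) if q >= 20].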

I'm not aware of any code in Biopython to cope with either of these file 
formats.

> I need to filter the data by a certain score

Are you trying to select parts of the associated sequence?

Peter