[BioPython] accessing "data quality" Phrap records in Genbank
Peter
biopython at maubp.freeserve.co.uk
Wed Aug 1 21:05:33 UTC 2007
Emanuel Hey wrote:
> for some sequence records, NCBI has a record of the
> Phrap scores corresponding to the sequence (i.e. one
> score for each base).
>
> These are typically records containing draft sequences
> from genome projects
>
> to see an example, try this link
>
> http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&qty=1&c_start=1&list_uids=153792835&uids=&dopt=qual&dispmax=5&sendto=&fmt_mask=0&from=begin&to=end&extrafeatpresent=1&ef_CDD=8&ef_MGC=16&ef_HPRD=32&ef_STS=64&ef_tRNA=128&ef_microRNA=256&ef_Exon=512
>
> How could I go about downloading these sequence
> quality scores?
One option for getting the data would be to construct the URL and then
download it using standard Python tools, e.g. the urllib.urlretrieve
function. Alternatively, Biopython has some NCBI/Entrez code you might be
able to use...
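
As a rough sketch of the first option (untested against the live NCBI
server): the base URL and the parameter names db, list_uids and dopt are
copied from your link, but the assumption that the remaining parameters
can be dropped is mine, and the function names are just for illustration.
In Python 3, urlretrieve lives in urllib.request rather than urllib.

```python
from urllib.parse import urlencode
from urllib.request import urlretrieve

# Base URL copied from the link quoted above; NCBI may change or retire it.
BASE_URL = "http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi"


def build_qual_url(gi_number):
    """Build a viewer URL asking for the quality scores of one record.

    Assumption: only db, list_uids and dopt are needed; the other
    parameters in the original link are left out.
    """
    params = {
        "db": "nuccore",
        "list_uids": str(gi_number),
        "dopt": "qual",
    }
    return BASE_URL + "?" + urlencode(params)


def download_qual(gi_number, filename):
    """Download the quality data for one record to a local file."""
    urlretrieve(build_qual_url(gi_number), filename)
    return filename


# Building the URL needs no network access, so it is safe to try first.
url = build_qual_url(153792835)
print(url)
```

Calling download_qual(153792835, "scores.qual") would then fetch the file,
assuming the server still answers at that address.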
The second step is actually parsing the data file into a usable form.
The "Base Quality" format looks very easy to parse, with a FASTA-like
header followed by space-separated decimal scores. Their XML format
also looks fairly simple - the core data looks like it's held as a string
where each pair of characters represents one score in hex. As far as I
could see based on the URL you gave, none of the other format options
actually contain the "data quality" information.
I'm not aware of any code in Biopython to cope with either of these file
formats.
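
In the absence of an existing parser, something along these lines might
do as a starting point. It is only a sketch based on how the two formats
appear to be laid out: I am assuming a ">" header line followed by
whitespace-separated integers for the "Base Quality" format, and a plain
string of two-character hex values for the XML data. The function names
and the example record are made up.

```python
def parse_qual(lines):
    """Yield (header, scores) for each record in FASTA-like quality text.

    Assumes header lines start with ">" and every other non-blank line
    holds whitespace-separated decimal scores.
    """
    header = None
    scores = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, scores
            header = line[1:]
            scores = []
        else:
            scores.extend(int(token) for token in line.split())
    if header is not None:
        yield header, scores


def decode_hex_scores(text):
    """Decode a string in which each pair of characters is one hex score."""
    text = "".join(text.split())  # drop any embedded whitespace
    if len(text) % 2:
        raise ValueError("expected an even number of hex characters")
    return [int(text[i:i + 2], 16) for i in range(0, len(text), 2)]


# Made-up record, only to show the calling convention.
example = [">hypothetical_record some description", "10 20 30 40", "50 9"]
records = list(parse_qual(example))
header, scores = records[0]

# Positions (0-based) whose score falls below a chosen threshold.
low_quality = [i for i, score in enumerate(scores) if score < 20]
print(header, scores, low_quality)
```

parse_qual works on any iterable of lines, so an open file handle can be
passed in directly. The threshold of 20 is arbitrary.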
> I need to filter the data by a certain score
Are you trying to select parts of the associated sequence?
Peter