[Biopython] More on Assembly db

Mark Nadel mark.e.nadel at gmail.com
Wed Sep 21 16:01:36 UTC 2016


I’m trying to mine the Assembly db for some basic statistics about the included assemblies. Here is a sample Biopython script to help explain the issue:

from statistics import mean, stdev

from Bio import Entrez

Entrez.email = "mark_nadel at hotmail.com"

# Search the Assembly db and collect up to 5000 matching IDs
handle4 = Entrez.esearch(db="assembly", term="eukaryotes", retmax=5000)
record4 = Entrez.read(handle4)
handle4.close()
print(record4["Count"])
print(len(record4["IdList"]))

# Only the first 10 IDs for now; use record4["IdList"] to process them all
ids = record4["IdList"][:10]
holder = []
with open('/Users/marknadel/Mark_New/python_stuff/assembly_eukaryote.txt', 'w') as outputfile:
    for assembly_id in ids:
        # Fetch the document summary for one assembly
        handle5 = Entrez.esummary(db="assembly", id=assembly_id)
        record5 = Entrez.read(handle5)
        handle5.close()
        info = record5['DocumentSummarySet']['DocumentSummary'][0]
        holder.append(float(info['Coverage']))
        print(info['Organism'], info['SpeciesName'], info['Coverage'], sep='\t', file=outputfile)
print('DONE')
print(mean(holder), stdev(holder))
***********************

Most of the information I’m really looking for is a level down. While I can use 

record5['DocumentSummarySet']['DocumentSummary'][0]['Meta'] to get what I assume is an XML string (see below),

' <Stats> <Stat category="alt_loci_count" sequence_tag="all">0</Stat> <Stat category="chromosome_count" sequence_tag="all">11</Stat> <Stat category="contig_count" sequence_tag="all">32588</Stat> <Stat category="contig_l50" sequence_tag="all">4339</Stat> <Stat category="contig_n50" sequence_tag="all">26637</Stat> <Stat category="non_chromosome_replicon_count" sequence_tag="all">0</Stat> <Stat category="replicon_count" sequence_tag="all">11</Stat> <Stat category="scaffold_count" sequence_tag="all">3387</Stat> <Stat category="scaffold_count" sequence_tag="placed">11</Stat> <Stat category="scaffold_count" sequence_tag="unlocalized">0</Stat> <Stat category="scaffold_count" sequence_tag="unplaced">3376</Stat> <Stat category="scaffold_l50" sequence_tag="all">11</Stat> <Stat category="scaffold_n50" sequence_tag="all">8174047</Stat> <Stat category="total_length" sequence_tag="all">444438822</Stat> <Stat category="ungapped_length" sequence_tag="all">399138682</Stat> </Stats> <assembly-level>3</assembly-level> <assembly-status>Chromosome</assembly-status> <representative-status>na</representative-status> <submitter-organization>Seoul National University</submitter-organization>    '

I cannot find any direct way to access fields such as "scaffold_count", the way I could access "Coverage" in the code sample above. It would be great to have that direct access.
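
To make the request concrete, here is a rough sketch of the kind of access I mean. It assumes the Meta value is always a well-formed XML fragment with several top-level elements, so it wraps the string in a dummy root element and parses it with the standard library's xml.etree.ElementTree. The function name parse_assembly_meta and the shortened sample_meta string are made up for illustration; this is not a Biopython call, and I am assuming every Stat value is an integer, as in the output above.

```python
import xml.etree.ElementTree as ET


def parse_assembly_meta(meta):
    """Return {(category, sequence_tag): int} built from a Meta string.

    Hypothetical helper, not part of Biopython.
    """
    # The fragment has no single root element, so add a dummy one
    root = ET.fromstring("<root>" + meta + "</root>")
    stats = {}
    for stat in root.iter("Stat"):
        key = (stat.get("category"), stat.get("sequence_tag"))
        # Assumption: every Stat value is an integer
        stats[key] = int(stat.text)
    return stats


# Shortened copy of the Meta string shown above, for illustration only
sample_meta = (
    ' <Stats> <Stat category="contig_count" sequence_tag="all">32588</Stat>'
    ' <Stat category="scaffold_count" sequence_tag="all">3387</Stat>'
    ' <Stat category="scaffold_count" sequence_tag="placed">11</Stat>'
    ' </Stats> <assembly-status>Chromosome</assembly-status> '
)

stats = parse_assembly_meta(sample_meta)
print(stats[("scaffold_count", "all")])  # prints 3387
```

The key is a (category, sequence_tag) pair because the same category, e.g. scaffold_count, appears several times with different sequence_tag values.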

Any help would be greatly appreciated.

Thanks,

Mark



