[Biopython] More on Assembly db

Wed Sep 21 16:26:00 UTC 2016

Hi Mark,

On Wed, Sep 21, 2016 at 5:01 PM, Mark Nadel <mark.e.nadel at gmail.com> wrote:
> I’m trying to mine the Assembly db for some basic statistics about the included assemblies. Here is a sample BioPython script to help explain the issue:

Thank you for providing a self contained example - that made it
really easy to reproduce locally (other than changing the hard
coded absolute path for the output file).

>
> from Bio import Entrez
> Entrez.email = "mark_nadel at hotmail.com"
> handle4= Entrez.esearch(db="assembly", term="eukaryotes",retmax=5000)
> record4 = Entrez.read(handle4)
> print(record4["Count"])
> print(len(record4["IdList"]))
> holder=[-1]*10
> #holder=[-1]*len(record4["IdList"])
> outputfile= open('/Users/marknadel/Mark_New/python_stuff/assembly_eukaryote.txt', 'w')
> #for i in range(0,len(record4['IdList'])):
> for i in range(0,10):
>     handle5=Entrez.esummary(db="assembly",id=record4['IdList'][i])
>     ...

At this point I would suggest you save the esummary XML directly
to a file on disk, which you can then inspect at your leisure - and
parse with Entrez.read if you want. e.g.

    with open("chunk_%i.xml" % i, "w") as out_handle:
        out_handle.write(handle5.read())
    with open("chunk_%i.xml" % i) as handle6:
        record5=Entrez.read(handle6)

> Most of the information I’m really looking for is a level down. While I can use
>
> record5['DocumentSummarySet']['DocumentSummary'][0]['Meta’] to get what I assume is an XML file (see below),
>
> ' <Stats> <Stat category="alt_loci_count" sequence_tag="all">0</Stat> <Stat category="chromosome_count" sequence_tag="all">11</Stat> <Stat category="contig_count" sequence_tag="all">32588</Stat> <Stat category="contig_l50" sequence_tag="all">4339</Stat> <Stat category="contig_n50" sequence_tag="all">26637</Stat> <Stat category="non_chromosome_replicon_count" sequence_tag="all">0</Stat> <Stat category="replicon_count" sequence_tag="all">11</Stat> <Stat category="scaffold_count" sequence_tag="all">3387</Stat> <Stat category="scaffold_count" sequence_tag="placed">11</Stat> <Stat category="scaffold_count" sequence_tag="unlocalized">0</Stat> <Stat category="scaffold_count" sequence_tag="unplaced">3376</Stat> <Stat category="scaffold_l50" sequence_tag="all">11</Stat> <Stat category="scaffold_n50" sequence_tag="all">8174047</Stat> <Stat category="total_length" sequence_tag="all">444438822</Stat> <Stat category="ungapped_length" sequence_tag="all">399138682</Stat> </Stats> <assembly-level>3</assembly-level> <assembly-status>Chromosome</assembly-status> <representative-status>na</representative-status> <submitter-organization>Seoul National University</submitter-organization>    ‘
>
> I cannot find any direct access to the fields such as “scaffold count” in the way there was to “Coverage” in the code sample above.  It would be great to have that direct access.
>
> Any help would be greatly appreciated.

That is odd - but yes, the NCBI have recorded a chunk of XML as
text embedded within the main XML results from esummary:

...
        <SortOrder>5C90017355859899</SortOrder>
        <Meta><![CDATA[ <Stats> <Stat category="alt_loci_count"
sequence_tag="all">0</Stat> <Stat category="chromosome_count"
sequence_tag="all">0</Stat> <Stat category="contig_count"
sequence_tag="all">275340</Stat> <Stat category="contig_l50"
sequence_tag="all">25198</Stat> <Stat category="contig_n50"
sequence_tag="all">5106</Stat> <Stat
category="non_chromosome_replicon_count" sequence_tag="all">0</Stat>
<Stat category="replicon_count" sequence_tag="all">0</Stat> <Stat
category="scaffold_count" sequence_tag="all">187166</Stat> <Stat
category="scaffold_count" sequence_tag="placed">0</Stat> <Stat
category="scaffold_count" sequence_tag="unlocalized">0</Stat> <Stat
category="scaffold_count" sequence_tag="unplaced">0</Stat> <Stat
category="scaffold_l50" sequence_tag="all">17324</Stat> <Stat
category="scaffold_n50" sequence_tag="all">7184</Stat> <Stat
category="total_length" sequence_tag="all">532827421</Stat> <Stat
category="ungapped_length" sequence_tag="all">523151953</Stat>
</Stats> <assembly-level>8</assembly-level>
<assembly-status>Scaffold</assembly-status>
<representative-status>na</representative-status>
<submitter-organization>Indiana University-Purdue University
Indianapolis (IUPUI)</submitter-organization>    ]]></Meta>
</DocumentSummary>
...

The CDATA trick is a way to embed unescaped text in XML.

You can turn this string into a handle (e.g. with StringIO)
and give it to any of the Python XML parsers.

Peter