[Biopython] gbwithparts not working on NCBI RefSeq?

Ivan Erill ivan.erill at gmail.com
Thu Sep 22 16:04:22 UTC 2016


Hi all,

I am trying to download a full genome record from NCBI Entrez, using
'gbwithparts' to get the full record. However, when I run my code, I get
only the 'header' portion of the record, without either the features or the
sequence at the bottom, even though simple browser access to the record
(without requesting "GenBank (full)") will at least provide the annotation.

If I try the same with the equivalent GenBank accession for the record, I
get the full record (features and sequence).

This is reproducible at least for several other bacterial genomes.

I had previously downloaded RefSeq records using the same type of call, so
I was wondering whether this might be related to NCBI transitioning to
HTTPS, the phasing-out of GI numbers, or both. Before pestering the NCBI
staff, however, I thought I would ask whether there have been any changes
to the Biopython parser that might explain the effect.

Here is the code:

#******************************************************************************
# -*- coding: utf-8 -*-
from Bio import Entrez
Entrez.email ="ivan.erill at gmail.com"

#RefSeq accession for Acetobacterium woodii DSM 1030, complete genome
#NC_016894 / 379009891
ncbi_handle = Entrez.efetch(db='nuccore', id='379009891',
                            retmode='gbwithparts', rettype='gb')
ncbi_record = ncbi_handle.read()
print 'End of RefSeq retrieved record: '
print ncbi_record[-44:]
#this gives me:
#--> End of RefSeq retrieved record:
#--> CONTIG      join(CP002987.1:1..4044777)
#--> //
#showing that the record ends with a contig join statement
#using NC_016894 as 'id' gives same behavior

#GenBank accession for Acetobacterium woodii DSM 1030, complete genome
#CP002987 / 375300680
ncbi_handle = Entrez.efetch(db='nuccore', id='375300680',
                            retmode='gbwithparts', rettype='gb')
ncbi_record = ncbi_handle.read()
print 'End of GenBank retrieved record: '
print ncbi_record[-77:]
#this gives me:
#--> End of GenBank retrieved record:
#-->   4044721 ttttacctgg taatgttttt ttatattatc aacatttatt cttataaatt acttgat
#--> //
#showing that the record ends with the complete sequence
#using CP002987 as 'id' gives same behavior
#******************************************************************************
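
The comparison above is done by eye on the last few characters of each
record. As a rough sketch only, the same check could be made on the whole
record text with a small helper. The name record_tail_kind is made up for
illustration and is not part of Biopython; it assumes the text is a GenBank
flat file in which the sequence block starts with an ORIGIN line and a
contig-only record carries a CONTIG line instead.

```python
def record_tail_kind(record_text):
    """Classify how a GenBank flat-file record ends.

    Hypothetical helper, not a Biopython function.
    Returns 'sequence' if an ORIGIN line is present,
    'contig' if there is a CONTIG line but no ORIGIN line,
    and 'unknown' otherwise.
    """
    has_origin = False
    has_contig = False
    for line in record_text.splitlines():
        # Top-level GenBank keywords start in the first column
        if line.startswith('ORIGIN'):
            has_origin = True
        elif line.startswith('CONTIG'):
            has_contig = True
    if has_origin:
        return 'sequence'
    if has_contig:
        return 'contig'
    return 'unknown'
```

Given the output shown above, calling record_tail_kind(ncbi_record) would be
expected to return 'contig' for the RefSeq record and 'sequence' for the
GenBank one.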


Any insights will be greatly appreciated. Thanks,

Ivan