[BioPython] Adding NCBI XML sequence formats to Bio.SeqIO

Thu Jun 19 16:13:16 UTC 2008

Dear all,

I've realised that as a bonus from Michiel's work on Bio.Entrez,
Biopython should be able to parse several of the XML sequence file
formats used by the NCBI - and ideally we should be able to do this
via Bio.SeqIO and get SeqRecord objects.  I am thinking about adding a
new module to Bio.SeqIO which will map the Python list/dictionary
structures from Bio.Entrez into SeqRecord object(s).

What I wanted to ask the list is which XML sequence files are of
interest - and are there any strong views on which format names I
should use?

I've looked at the BioPerl list, since I try to re-use the same format
names, but could only spot one NCBI XML format listed here:
http://www.bioperl.org/wiki/HOWTO:SeqIO#Formats

NCBI TinySeq XML format
http://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.dtd
BioPerl call this "tinyseq", which seems like a good choice of name.
http://www.bioperl.org/wiki/Tinyseq_sequence_format
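
To illustrate the kind of mapping described at the top of this message,
here is a minimal sketch for TinySeq. It assumes the parsed structure is
a list of dictionaries keyed by the DTD element names (TSeq_accver,
TSeq_gi, TSeq_defline, TSeq_sequence), which may not match exactly what
Bio.Entrez returns. SimpleRecord and tinyseq_to_records are hypothetical
names, with SimpleRecord standing in for SeqRecord so that the sketch
runs without Biopython, and the example data is made up.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass
class SimpleRecord:
    """Hypothetical stand-in for Bio.SeqRecord.SeqRecord (not the real class)."""
    id: str
    description: str
    seq: str


def tinyseq_to_records(parsed: List[Dict[str, str]]) -> Iterator[SimpleRecord]:
    """Yield one record per TSeq entry in an already-parsed TSeqSet.

    Assumption: each entry is a dict keyed by TinySeq element names.
    """
    for entry in parsed:
        # Assumption: the accession may be absent, so fall back to the GI.
        identifier = entry.get("TSeq_accver") or entry.get("TSeq_gi", "<unknown id>")
        yield SimpleRecord(
            id=identifier,
            description=entry.get("TSeq_defline", ""),
            seq=entry["TSeq_sequence"],
        )


# Made-up data shaped like a parsed TSeqSet with two entries.
example = [
    {"TSeq_accver": "X00001.1",
     "TSeq_defline": "Example sequence one",
     "TSeq_sequence": "ACGTACGT"},
    {"TSeq_gi": "12345",
     "TSeq_defline": "Example sequence two",
     "TSeq_sequence": "TTGACA"},
]

records = list(tinyseq_to_records(example))
for record in records:
    print(record.id, len(record.seq), record.description)
```

In a real Bio.SeqIO module the generator would presumably build
SeqRecord objects instead, so that callers see the same interface as for
the other formats.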

Also potentially of interest are:

NCBI INSDSeq XML format
http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd

NCBI Seq-entry XML format
http://www.ncbi.nlm.nih.gov/dtd/NCBI_Seqset.dtd

NCBI Entrezgene XML format (BioPerl uses "entrezgene" to refer to the
ASN.1 variant of this file format).
http://www.ncbi.nlm.nih.gov/dtd/NCBI_Entrezgene.dtd

(I haven't actually sat down and looked at the details of the
implementation yet, so no promises on the timing!)

Peter