[BioPython] Adding NCBI XML sequence formats to Bio.SeqIO
Peter
biopython at maubp.freeserve.co.uk
Thu Jun 19 12:13:16 EDT 2008
Dear all,
I've realised that as a bonus from Michiel's work on Bio.Entrez,
Biopython should be able to parse several of the XML sequence file
formats used by the NCBI - and ideally we should be able to do this
via Bio.SeqIO and get SeqRecord objects. I am thinking about adding a
new module to Bio.SeqIO which will map the Python list/dictionary
structures from Bio.Entrez into SeqRecord object(s).
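
A minimal sketch of the kind of mapping described above, not the eventual
Bio.SeqIO implementation. It assumes each parsed TinySeq entry arrives as a
dictionary keyed by the element names from the NCBI_TSeq DTD (TSeq_accver,
TSeq_gi, TSeq_defline, TSeq_orgname, TSeq_sequence). The helper name
tseq_to_fields and the example data are hypothetical, and the sketch uses only
plain dictionaries so that it runs without Biopython installed.

```python
def tseq_to_fields(tseq):
    """Map one TinySeq-style dictionary to SeqRecord-style fields.

    Hypothetical helper: the input layout is an assumption based on the
    element names in the NCBI_TSeq DTD, not a documented Biopython API.
    """
    # Prefer the versioned accession, then the GI number, then a placeholder.
    identifier = tseq.get("TSeq_accver") or tseq.get("TSeq_gi") or "<unknown id>"

    annotations = {}
    if "TSeq_orgname" in tseq:
        annotations["organism"] = str(tseq["TSeq_orgname"])

    return {
        "id": str(identifier),
        "description": str(tseq.get("TSeq_defline", "")),
        "seq": str(tseq["TSeq_sequence"]),
        "annotations": annotations,
    }


# Made-up example entry, shaped like one parsed <TSeq> element.
example = {
    "TSeq_accver": "X00001.1",
    "TSeq_defline": "Example sequence for illustration",
    "TSeq_orgname": "Example organism",
    "TSeq_sequence": "ACGTACGT",
}

fields = tseq_to_fields(example)
print(fields["id"], fields["description"], len(fields["seq"]))
```

With Biopython available, these fields could then be handed to the SeqRecord
constructor, e.g. SeqRecord(Seq(fields["seq"]), id=fields["id"],
description=fields["description"]).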
What I wanted to ask the list is which XML sequence files are
of interest, and whether there are any strong views on which format
names I should use.
I've looked at the BioPerl list, since I try to re-use the same format
names, but could only spot one NCBI XML format listed here:
http://www.bioperl.org/wiki/HOWTO:SeqIO#Formats
NCBI TinySeq XML format
http://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.dtd
BioPerl call this "tinyseq", which seems like a good choice of name.
http://www.bioperl.org/wiki/Tinyseq_sequence_format
Also potentially of interest are:
NCBI INSDSeq XML format
http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd
NCBI Seq-entry XML format
http://www.ncbi.nlm.nih.gov/dtd/NCBI_Seqset.dtd
NCBI Entrezgene XML format (BioPerl uses "entrezgene" to refer to the
ASN.1 variant of this file format).
http://www.ncbi.nlm.nih.gov/dtd/NCBI_Entrezgene.dtd
(I haven't actually sat down and looked at the details of the
implementation yet, so no promises on the timing!)
Peter