[Biopython-dev] [Bug 2825] Parsing whole genome sequencing (WGS) Genbank records

bugzilla-daemon at portal.open-bio.org
Fri May 8 18:37:47 EDT 2009


http://bugzilla.open-bio.org/show_bug.cgi?id=2825


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
            Summary|SeqIO does not successfully |Parsing whole genome
                   |parse Genbank records       |sequencing (WGS) Genbank
                   |related to whole genome     |records
                   |sequencing deposits, as Did |
                   |not recognise the LOCUS line|
                   |layout                      |




------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2009-05-08 18:37 EST -------
Hi David,

This is not expected to work in Bio.SeqIO or Bio.GenBank at the moment.  For
the LOCUS line we expect things like "123 aa" for proteins, or "123 bp" for
nucleotides.  Here you have "353 rc" (rc for record count), which as our error
message says, is unexpected.  At the end of the record, there are also WGS
and/or WGS_SCAFLD lines to worry about:

http://www.ncbi.nlm.nih.gov/Genbank/wgs.html
http://www.bio.net/bionet/mm/genbankb/2009-January/000299.html
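
As a rough illustration of the LOCUS line issue described above, here is a
minimal sketch that pulls the length and its unit out of a LOCUS line and flags
the "rc" case.  The helper name and the regular expression are assumptions made
for this example; this is not how Bio.GenBank actually parses the line.

```python
import re

# Hypothetical helper, not part of Biopython: find "<number> <unit>" on a
# LOCUS line, where the unit is bp (nucleotide), aa (protein) or rc (WGS).
LOCUS_LENGTH = re.compile(r"\s(\d+)\s+(bp|aa|rc)\s")


def locus_length_and_unit(line):
    """Return (length, unit) from a GenBank LOCUS line, or raise ValueError."""
    if not line.startswith("LOCUS"):
        raise ValueError("Not a LOCUS line")
    match = LOCUS_LENGTH.search(line)
    if match is None:
        raise ValueError("Did not recognise the LOCUS line layout")
    return int(match.group(1)), match.group(2)


# The LOCUS line from the record attached to this bug.
line = "LOCUS       ABIN01000000             353 rc    DNA     linear   BCT 10-DEC-2007"
length, unit = locus_length_and_unit(line)
is_wgs_master = unit == "rc"  # "rc" is a record count, not a sequence length
```

A caller could use the "rc" unit to skip such records, or to route them to
separate handling, before trying to read any sequence.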

Given these WGS files have no sequence, and no real sequence-associated
features either, it strikes me that supporting this in Bio.SeqIO is a stretch
(these records are not really sequences, nor are they about a sequence).

However, Bio.GenBank should perhaps be updated to cope... so I'll leave this
bug open for that as a possible enhancement.  Note I have changed the bug title
from "SeqIO does not successfully parse Genbank records related to whole genome
sequencing deposits, as Did not recognise the LOCUS line layout", to "Parsing
whole genome sequencing (WGS) Genbank records", and changed the bug severity to
enhancement.

What information do you want from this file?  In the meantime, I suggest you
fetch the record as XML, which you can parse using Bio.Entrez.read() or your
XML parser of choice.
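
For the "XML parser of choice" route, a minimal sketch using only the standard
library might look like this.  The XML fragment is hand-written for the example
(it is an assumption, not real efetch output): it uses element names from
NCBI's GBSeq XML format and values copied from the flat file.  In practice the
text would come from something like Entrez.efetch(db="nucleotide",
rettype="gb", retmode="xml", id="162285818").  The function name is
hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample (an assumption, not real efetch output), using GBSeq
# element names and values taken from the flat file for this record.
SAMPLE_XML = """<GBSet>
  <GBSeq>
    <GBSeq_locus>ABIN01000000</GBSeq_locus>
    <GBSeq_length>353</GBSeq_length>
    <GBSeq_primary-accession>ABIN00000000</GBSeq_primary-accession>
    <GBSeq_organism>Mycobacterium intracellulare ATCC 13950</GBSeq_organism>
  </GBSeq>
</GBSet>"""


def summarise_gbseq(xml_text):
    """Return one small dict per GBSeq record found in the XML text."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter("GBSeq"):
        records.append({
            "locus": rec.findtext("GBSeq_locus"),
            "length": int(rec.findtext("GBSeq_length")),
            "accession": rec.findtext("GBSeq_primary-accession"),
            "organism": rec.findtext("GBSeq_organism"),
        })
    return records


summary = summarise_gbseq(SAMPLE_XML)
```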

Peter

P.S. This is a shorter way to dump the file to screen in Python:

>>> from Bio import Entrez
>>> handle = Entrez.efetch(db="nucleotide", rettype="gb", id="162285818")
>>> print(handle.read())
LOCUS       ABIN01000000             353 rc    DNA     linear   BCT 10-DEC-2007
DEFINITION  Mycobacterium intracellulare ATCC 13950, whole genome shotgun
            sequencing project.
ACCESSION   ABIN00000000
VERSION     ABIN00000000.1  GI:162285818
DBLINK      Project:27955
KEYWORDS    WGS.
SOURCE      Mycobacterium intracellulare ATCC 13950
  ORGANISM  Mycobacterium intracellulare ATCC 13950
            Bacteria; Actinobacteria; Actinobacteridae; Actinomycetales;
            Corynebacterineae; Mycobacteriaceae; Mycobacterium; Mycobacterium
            avium complex (MAC).
REFERENCE   1  (bases 1 to 353)
  AUTHORS   Turenne,C., Alexander,D., Nagy,C., Leveque,G., Dias,J.,
            Forgetta,V., Barreau,G., Lepage,P., Dewar,K. and Behr,M.
  TITLE     Mycobacterium intracellulare Genome Project
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 353)
  AUTHORS   Turenne,C., Alexander,D., Nagy,C., Leveque,G., Dias,J.,
            Forgetta,V., Barreau,G., Lepage,P., Dewar,K. and Behr,M.
  TITLE     Direct Submission
  JOURNAL   Submitted (30-NOV-2007) McGill University and Genome Quebec
            Innovation Centre, 740 Avenue Docteur Penfield, Montreal, Quebec
            H3A 1A4, Canada
COMMENT     The Mycobacterium intracellulare ATCC 13950 whole genome shotgun
            (WGS) project has the project accession ABIN00000000.  This version
            of the project (01) has the accession number ABIN01000000, and
            consists of sequences ABIN01000001-ABIN01000353.
            The whole genome shotgun sequence was generated by the McGill
            University and Genome Quebec Innovation Centre using the GS De Novo
            Assembler from GS-FLX reads.  This strain is available from the
            American Type Culture Collection (www.atcc.org).
FEATURES             Location/Qualifiers
     source          1..353
                     /organism="Mycobacterium intracellulare ATCC 13950"
                     /mol_type="genomic DNA"
                     /strain="ATCC 13950"
                     /serovar="16"
                     /isolation_source="human lymph node"
                     /db_xref="taxon:487521"
                     /note="type strain of Mycobacterium intracellulare ATCC
                     13950
                     associated with disease"
WGS         ABIN01000001-ABIN01000353
//
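
If all that is wanted from a record like this is the list of contig accessions,
the WGS line at the end can be expanded directly.  The following is only a
sketch under the assumption that the line holds a single PREFIXnnn-PREFIXnnn
range, as it does here; the function name is hypothetical and WGS_SCAFLD lines
are not handled.

```python
import re


def expand_wgs_range(wgs_line):
    """Expand a 'WGS  FIRST-LAST' line into a list of accession strings."""
    match = re.match(r"WGS\s+([A-Z]+)(\d+)-([A-Z]+)(\d+)\s*$", wgs_line)
    if match is None:
        raise ValueError("Unexpected WGS line: %r" % wgs_line)
    prefix, first, last_prefix, last = match.groups()
    if prefix != last_prefix or len(first) != len(last):
        raise ValueError("Range endpoints do not match: %r" % wgs_line)
    width = len(first)  # keep the zero padding of the numeric part
    return [
        f"{prefix}{number:0{width}d}"
        for number in range(int(first), int(last) + 1)
    ]


# The WGS line from the record shown above.
accessions = expand_wgs_range("WGS         ABIN01000001-ABIN01000353")
```

Each of the resulting accessions could then be fetched as an ordinary GenBank
record, which Bio.SeqIO should be able to parse.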

