[Biopython-dev] [Bug 2825] Parsing whole genome sequencing (WGS) Genbank records
bugzilla-daemon at portal.open-bio.org
Fri May 8 18:37:47 EDT 2009
http://bugzilla.open-bio.org/show_bug.cgi?id=2825
biopython-bugzilla at maubp.freeserve.co.uk changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
            Summary|SeqIO does not successfully |Parsing whole genome
                   |parse Genbank records       |sequencing (WGS) Genbank
                   |related to whole genome     |records
                   |sequencing deposits, as Did |
                   |not recognise the LOCUS line|
                   |layout                      |
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-05-08 18:37 EST -------
Hi David,
This is not expected to work in Bio.SeqIO or Bio.GenBank at the moment. For
the LOCUS line we expect things like "123 aa" for proteins, or "123 bp" for
nucleotides. Here you have "353 rc" (rc for record count), which as our error
message says, is unexpected. At the end of the record, there are also WGS
and/or WGS_SCAFLD lines to worry about:
http://www.ncbi.nlm.nih.gov/Genbank/wgs.html
http://www.bio.net/bionet/mm/genbankb/2009-January/000299.html
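To make the LOCUS and WGS line issue concrete, here is a rough sketch in plain
Python. It is not how Bio.GenBank does things; the function names and the
regular expressions are made up for this example, and it assumes the WGS line
always gives a "first-last" accession range with a shared letter prefix.

```python
import re


def locus_size_and_unit(locus_line):
    """Return (size, unit) from a LOCUS line, e.g. (353, "rc").

    "bp" and "aa" are the usual units; "rc" (record count) is what the
    WGS master record in this bug uses.
    """
    match = re.search(r"\s(\d+)\s+(bp|aa|rc)\b", locus_line)
    if not locus_line.startswith("LOCUS") or match is None:
        raise ValueError("Did not recognise the LOCUS line layout")
    return int(match.group(1)), match.group(2)


def expand_wgs_range(wgs_line):
    """Expand a WGS or WGS_SCAFLD range line into individual accessions."""
    key, span = wgs_line.split(None, 1)
    if key not in ("WGS", "WGS_SCAFLD"):
        raise ValueError("Not a WGS line: %r" % wgs_line)
    first, last = span.strip().split("-")
    m_first = re.match(r"([A-Z]+)(\d+)$", first)
    m_last = re.match(r"([A-Z]+)(\d+)$", last)
    if not m_first or not m_last or m_first.group(1) != m_last.group(1):
        raise ValueError("Unexpected accession range: %r" % span)
    prefix = m_first.group(1)
    width = len(m_first.group(2))  # keep the zero padding of the digits
    start, stop = int(m_first.group(2)), int(m_last.group(2))
    return ["%s%0*d" % (prefix, width, n) for n in range(start, stop + 1)]
```

With the record from this bug, the LOCUS line would give a size of 353 with
unit "rc", and the WGS line would expand to the 353 contig accessions
ABIN01000001 through ABIN01000353.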
Given these WGS files have no sequence, and no real sequence associated
features either, it strikes me that supporting this in Bio.SeqIO is a stretch
(these records are not really sequences, nor are they about a sequence).
However, Bio.GenBank should perhaps be updated to cope... so I'll leave this
bug open for that as a possible enhancement. Note I have changed the bug title
from "SeqIO does not successfully parse Genbank records related to whole genome
sequencing deposits, as Did not recognise the LOCUS line layout", to "Parsing
whole genome sequencing (WGS) Genbank records", and changed the bug severity to
enhancement.
What information do you want from this file? In the meantime, I suggest you
fetch the record as XML, which you can parse using Bio.Entrez.read() or your
XML parser of choice.
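A minimal sketch of the XML route, assuming Biopython is installed and NCBI is
reachable. The two function names are hypothetical, and the GBSeq_* keys are
the usual ones in GenBank XML; I have not checked which of them are filled in
for a WGS master record, so treat the field selection as an assumption.

```python
def fetch_genbank_xml(identifier, email):
    """Fetch a nucleotide record as GenBank XML and parse it (needs network)."""
    # Imported here so summarise_gbseq() below can be used without Biopython.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address
    handle = Entrez.efetch(db="nucleotide", id=identifier,
                           rettype="gb", retmode="xml")
    try:
        return Entrez.read(handle)  # a list with one dict-like entry per record
    finally:
        handle.close()


def summarise_gbseq(record):
    """Pull a few top-level fields out of one parsed GBSeq record."""
    return {
        "locus": record.get("GBSeq_locus"),
        "definition": record.get("GBSeq_definition"),
        "length": record.get("GBSeq_length"),
    }
```

You would call fetch_genbank_xml("162285818", "you@example.com") with your own
email address and pass the first element of the result to summarise_gbseq().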
Peter
P.S. This is a shorter way to dump the file to screen in Python:
>>> from Bio import Entrez
>>> handle = Entrez.efetch(db="nucleotide", rettype="gb", id="162285818")
>>> print(handle.read())
LOCUS       ABIN01000000             353 rc    DNA     linear   BCT 10-DEC-2007
DEFINITION  Mycobacterium intracellulare ATCC 13950, whole genome shotgun
            sequencing project.
ACCESSION   ABIN00000000
VERSION     ABIN00000000.1  GI:162285818
DBLINK      Project:27955
KEYWORDS    WGS.
SOURCE      Mycobacterium intracellulare ATCC 13950
  ORGANISM  Mycobacterium intracellulare ATCC 13950
            Bacteria; Actinobacteria; Actinobacteridae; Actinomycetales;
            Corynebacterineae; Mycobacteriaceae; Mycobacterium; Mycobacterium
            avium complex (MAC).
REFERENCE   1  (bases 1 to 353)
  AUTHORS   Turenne,C., Alexander,D., Nagy,C., Leveque,G., Dias,J.,
            Forgetta,V., Barreau,G., Lepage,P., Dewar,K. and Behr,M.
  TITLE     Mycobacterium intracellulare Genome Project
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 353)
  AUTHORS   Turenne,C., Alexander,D., Nagy,C., Leveque,G., Dias,J.,
            Forgetta,V., Barreau,G., Lepage,P., Dewar,K. and Behr,M.
  TITLE     Direct Submission
  JOURNAL   Submitted (30-NOV-2007) McGill University and Genome Quebec
            Innovation Centre, 740 Avenue Docteur Penfield, Montreal, Quebec
            H3A 1A4, Canada
COMMENT     The Mycobacterium intracellulare ATCC 13950 whole genome shotgun
            (WGS) project has the project accession ABIN00000000. This version
            of the project (01) has the accession number ABIN01000000, and
            consists of sequences ABIN01000001-ABIN01000353.
            The whole genome shotgun sequence was generated by the McGill
            University and Genome Quebec Innovation Centre using the GS De Novo
            Assembler from GS-FLX reads. This strain is available from the
            American Type Culture Collection (www.atcc.org).
FEATURES             Location/Qualifiers
     source          1..353
                     /organism="Mycobacterium intracellulare ATCC 13950"
                     /mol_type="genomic DNA"
                     /strain="ATCC 13950"
                     /serovar="16"
                     /isolation_source="human lymph node"
                     /db_xref="taxon:487521"
                     /note="type strain of Mycobacterium intracellulare ATCC
                     13950
                     associated with disease"
WGS         ABIN01000001-ABIN01000353
//
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.