[Biopython-dev] [Bug 2825] Parsing whole genome sequencing (WGS) Genbank records
bugzilla-daemon at portal.open-bio.org
Fri May 8 23:12:43 UTC 2009
http://bugzilla.open-bio.org/show_bug.cgi?id=2825
------- Comment #2 from david.wyllie at ndm.ox.ac.uk 2009-05-08 19:12 EST -------
Thank you for your help.
I just wanted to extract the WGS line, which I'm able to do.
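A minimal sketch of one way to do this on the plain-text record, without relying on Bio.SeqIO or Bio.GenBank, is below. The function name extract_wgs_lines and the abbreviated sample record are illustrative assumptions, not part of Biopython. The sketch also assumes the usual GenBank flat-file layout, where top-level keywords start in column 0 and continuation lines are indented.

```python
def extract_wgs_lines(genbank_text):
    """Collect the values of top-level WGS / WGS_SCAFLD lines.

    Hypothetical helper (not a Biopython function). Assumes keywords
    start in column 0 and continuation lines are indented.
    """
    wanted = ("WGS", "WGS_SCAFLD")
    found = {}
    for line in genbank_text.splitlines():
        # Skip blank lines and indented continuation lines.
        if not line or line[0].isspace():
            continue
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] in wanted:
            found.setdefault(parts[0], []).append(parts[1].strip())
    return found


# Abbreviated copy of the record quoted below; the column spacing is assumed.
sample = (
    "LOCUS       ABIN01000000             353 rc    DNA     linear   BCT 10-DEC-2007\n"
    "KEYWORDS    WGS.\n"
    "COMMENT     The Mycobacterium intracellulare ATCC 13950 whole genome shotgun\n"
    "            (WGS) project has the project accession ABIN00000000.\n"
    "WGS         ABIN01000001-ABIN01000353\n"
    "//\n"
)

wgs = extract_wgs_lines(sample)
print(wgs)
```

Matching on the first whitespace-separated token keeps "KEYWORDS WGS." and the "(WGS)" mention in the COMMENT block from being picked up by mistake.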
(In reply to comment #1)
> Hi David,
>
> This is not expected to work in Bio.SeqIO or Bio.GenBank at the moment. For
> the LOCUS line we expect things like "123 aa" for proteins, or "123 bp" for
> nucleotides. Here you have "353 rc" (rc for record count), which as our error
> message says, is unexpected. At the end of the record, there are also WGS
> and/or WGS_SCAFLD lines to worry about:
>
> http://www.ncbi.nlm.nih.gov/Genbank/wgs.html
> http://www.bio.net/bionet/mm/genbankb/2009-January/000299.html
>
> Given these WGS files have no sequence, and no real sequence-associated
> features either, it strikes me that supporting this in Bio.SeqIO is a stretch
> (these records are not really sequences, nor are they about a sequence).
>
> However, Bio.GenBank should perhaps be updated to cope... so I'll leave this
> bug open for that as a possible enhancement. Note I have changed the bug title
> from "SeqIO does not successfully parse Genbank records related to whole genome
> sequencing deposits, as Did not recognise the LOCUS line layout", to "Parsing
> whole genome sequencing (WGS) Genbank records", and changed the bug priority to
> an enhancement.
>
> What information do you want from this file? In the meantime, I suggest you
> fetch the record as XML, which you can parse using Bio.Entrez.read() or your
> XML parser of choice.
>
> Peter
>
> P.S. This is a shorter way to dump the file to screen in python:
>
> >>> from Bio import Entrez
> >>> handle = Entrez.efetch(db="nucleotide", rettype="gb", id="162285818")
> >>> print(handle.read())
> LOCUS ABIN01000000 353 rc DNA linear BCT 10-DEC-2007
> DEFINITION Mycobacterium intracellulare ATCC 13950, whole genome shotgun
> sequencing project.
> ACCESSION ABIN00000000
> VERSION ABIN00000000.1 GI:162285818
> DBLINK Project:27955
> KEYWORDS WGS.
> SOURCE Mycobacterium intracellulare ATCC 13950
> ORGANISM Mycobacterium intracellulare ATCC 13950
> Bacteria; Actinobacteria; Actinobacteridae; Actinomycetales;
> Corynebacterineae; Mycobacteriaceae; Mycobacterium; Mycobacterium
> avium complex (MAC).
> REFERENCE 1 (bases 1 to 353)
> AUTHORS Turenne,C., Alexander,D., Nagy,C., Leveque,G., Dias,J.,
> Forgetta,V., Barreau,G., Lepage,P., Dewar,K. and Behr,M.
> TITLE Mycobacterium intracellulare Genome Project
> JOURNAL Unpublished
> REFERENCE 2 (bases 1 to 353)
> AUTHORS Turenne,C., Alexander,D., Nagy,C., Leveque,G., Dias,J.,
> Forgetta,V., Barreau,G., Lepage,P., Dewar,K. and Behr,M.
> TITLE Direct Submission
> JOURNAL Submitted (30-NOV-2007) McGill University and Genome Quebec
> Innovation Centre, 740 Avenue Docteur Penfield, Montreal, Quebec
> H3A 1A4, Canada
> COMMENT The Mycobacterium intracellulare ATCC 13950 whole genome shotgun
> (WGS) project has the project accession ABIN00000000. This version
> of the project (01) has the accession number ABIN01000000, and
> consists of sequences ABIN01000001-ABIN01000353.
> The whole genome shotgun sequence was generated by the McGill
> University and Genome Quebec Innovation Centre using the GS De Novo
> Assembler from GS-FLX reads. This strain is available from the
> American Type Culture Collection (www.atcc.org).
> FEATURES Location/Qualifiers
> source 1..353
> /organism="Mycobacterium intracellulare ATCC 13950"
> /mol_type="genomic DNA"
> /strain="ATCC 13950"
> /serovar="16"
> /isolation_source="human lymph node"
> /db_xref="taxon:487521"
> /note="type strain of Mycobacterium intracellulare ATCC
> 13950
> associated with disease"
> WGS ABIN01000001-ABIN01000353
> //
>
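The LOCUS line layouts described in the quoted reply ("123 aa", "123 bp", "353 rc") could be told apart along the following lines. This is only a sketch: locus_size_and_unit is a hypothetical helper, not a Biopython function, and the regular expression is an assumption about the line layout rather than the rule Bio.GenBank actually applies.

```python
import re

# Meanings as given in the reply: aa = protein, bp = nucleotide,
# rc = record count (WGS master record).
UNIT_MEANING = {"aa": "protein", "bp": "nucleotide", "rc": "record count"}

# Assumed layout: a number followed by a two-letter unit, both set off
# by whitespace, somewhere after the locus name.
_SIZE_RE = re.compile(r"\s(\d+)\s+(aa|bp|rc)\b")


def locus_size_and_unit(locus_line):
    """Return (size, unit) from a LOCUS line, e.g. (353, 'rc').

    Hypothetical helper for illustration only.
    """
    if not locus_line.startswith("LOCUS"):
        raise ValueError("Not a LOCUS line: %r" % locus_line)
    match = _SIZE_RE.search(locus_line)
    if match is None:
        raise ValueError("Did not recognise the LOCUS line layout")
    return int(match.group(1)), match.group(2)


line = "LOCUS       ABIN01000000             353 rc    DNA     linear   BCT 10-DEC-2007"
size, unit = locus_size_and_unit(line)
print(size, unit, UNIT_MEANING[unit])
```

A caller could use the "rc" unit as a cue to skip sequence parsing and only read the annotation and the WGS lines.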