[Biopython-dev] [Bug 2825] Parsing whole genome sequencing (WGS) Genbank records

bugzilla-daemon at portal.open-bio.org bugzilla-daemon at portal.open-bio.org
Fri May 8 19:12:43 EDT 2009


http://bugzilla.open-bio.org/show_bug.cgi?id=2825





------- Comment #2 from david.wyllie at ndm.ox.ac.uk  2009-05-08 19:12 EST -------
Thank you for your help.  
I just wanted to extract the WGS line, which I'm able to do.
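
For anyone else needing the same thing, here is a minimal sketch of one way to pull the WGS line out of the flat-file text without a full GenBank parser. The helper name extract_wgs_lines is hypothetical and not part of Biopython. It assumes the WGS and WGS_SCAFLD keywords start in column one, as they do in the record quoted below.

```python
import re

# Hypothetical helper, not a Biopython function. It scans plain GenBank
# flat-file text for lines whose keyword is WGS or WGS_SCAFLD.
def extract_wgs_lines(record_text):
    """Return a dict mapping 'WGS' / 'WGS_SCAFLD' to a list of their values."""
    found = {}
    for line in record_text.splitlines():
        # Keyword must start in column one, so "KEYWORDS    WGS." and the
        # indented COMMENT text mentioning "(WGS)" are not matched.
        match = re.match(r"^(WGS(?:_SCAFLD)?)\s+(\S.*)$", line)
        if match:
            found.setdefault(match.group(1), []).append(match.group(2).strip())
    return found

# A cut-down copy of the record from this bug report.
sample = """LOCUS       ABIN01000000             353 rc    DNA     linear   BCT 10-DEC-2007
KEYWORDS    WGS.
COMMENT     The Mycobacterium intracellulare ATCC 13950 whole genome shotgun
            (WGS) project has the project accession ABIN00000000.
WGS         ABIN01000001-ABIN01000353
//
"""

wgs = extract_wgs_lines(sample)
print(wgs)
```

The same function could be fed the text returned by handle.read() from an Entrez.efetch call, though that is untested here.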


(In reply to comment #1)
> Hi David,
> 
> This is not expected to work in Bio.SeqIO or Bio.GenBank at the moment.  For
> the LOCUS line we expect things like "123 aa" for proteins, or "123 bp" for
> nucleotides.  Here you have "353 rc" (rc for record count), which as our error
> message says, is unexpected.  At the end of the record, there are also WGS
> and/or WGS_SCAFLD lines to worry about:
> 
> http://www.ncbi.nlm.nih.gov/Genbank/wgs.html
> http://www.bio.net/bionet/mm/genbankb/2009-January/000299.html
> 
> Given these WGS files have no sequence, and no real sequence-associated
> features either, it strikes me that supporting this in Bio.SeqIO is a stretch
> (these records are not really sequences, nor are they about a sequence).
> 
> However, Bio.GenBank should perhaps be updated to cope... so I'll leave this
> bug open for that as a possible enhancement.  Note I have changed the bug title
> from "SeqIO does not successfully parse Genbank records related to whole genome
> sequencing deposits, as Did not recognise the LOCUS line layout", to "Parsing
> whole genome sequencing (WGS) Genbank records", and changed the bug priority to
> an enhancement.
> 
> What information do you want from this file?  In the meantime, I suggest you
> fetch the record as XML, which you can parse using Bio.Entrez.read() or your
> XML parser of choice.
> 
> Peter
> 
> P.S. This is a shorter way to dump the file to screen in Python:
> 
> >>> from Bio import Entrez
> >>> handle = Entrez.efetch(db="nucleotide", rettype="gb", id="162285818")
> >>> print(handle.read())
> LOCUS       ABIN01000000             353 rc    DNA     linear   BCT 10-DEC-2007
> DEFINITION  Mycobacterium intracellulare ATCC 13950, whole genome shotgun
>             sequencing project.
> ACCESSION   ABIN00000000
> VERSION     ABIN00000000.1  GI:162285818
> DBLINK      Project:27955
> KEYWORDS    WGS.
> SOURCE      Mycobacterium intracellulare ATCC 13950
>   ORGANISM  Mycobacterium intracellulare ATCC 13950
>             Bacteria; Actinobacteria; Actinobacteridae; Actinomycetales;
>             Corynebacterineae; Mycobacteriaceae; Mycobacterium; Mycobacterium
>             avium complex (MAC).
> REFERENCE   1  (bases 1 to 353)
>   AUTHORS   Turenne,C., Alexander,D., Nagy,C., Leveque,G., Dias,J.,
>             Forgetta,V., Barreau,G., Lepage,P., Dewar,K. and Behr,M.
>   TITLE     Mycobacterium intracellulare Genome Project
>   JOURNAL   Unpublished
> REFERENCE   2  (bases 1 to 353)
>   AUTHORS   Turenne,C., Alexander,D., Nagy,C., Leveque,G., Dias,J.,
>             Forgetta,V., Barreau,G., Lepage,P., Dewar,K. and Behr,M.
>   TITLE     Direct Submission
>   JOURNAL   Submitted (30-NOV-2007) McGill University and Genome Quebec
>             Innovation Centre, 740 Avenue Docteur Penfield, Montreal, Quebec
>             H3A 1A4, Canada
> COMMENT     The Mycobacterium intracellulare ATCC 13950 whole genome shotgun
>             (WGS) project has the project accession ABIN00000000.  This version
>             of the project (01) has the accession number ABIN01000000, and
>             consists of sequences ABIN01000001-ABIN01000353.
>             The whole genome shotgun sequence was generated by the McGill
>             University and Genome Quebec Innovation Centre using the GS De Novo
>             Assembler from GS-FLX reads.  This strain is available from the
>             American Type Culture Collection (www.atcc.org).
> FEATURES             Location/Qualifiers
>      source          1..353
>                      /organism="Mycobacterium intracellulare ATCC 13950"
>                      /mol_type="genomic DNA"
>                      /strain="ATCC 13950"
>                      /serovar="16"
>                      /isolation_source="human lymph node"
>                      /db_xref="taxon:487521"
>                      /note="type strain of Mycobacterium intracellulare ATCC
>                      13950
>                      associated with disease"
> WGS         ABIN01000001-ABIN01000353
> //
> 
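
The LOCUS line point in the quoted reply can be illustrated with a short sketch. It assumes the size field is a number followed by one of the unit tokens mentioned above ("bp", "aa", or "rc"). The pattern and the function name locus_size_and_unit are illustrative only and do not reflect how Bio.GenBank actually parses the line.

```python
import re

# Assumed layout: a whitespace-delimited integer followed by a unit token.
# This is a simplification for illustration, not the Bio.GenBank logic.
LOCUS_SIZE = re.compile(r"\s(\d+)\s+(bp|aa|rc)\s")

def locus_size_and_unit(locus_line):
    """Return (count, unit) from a LOCUS line, or None if no match."""
    match = LOCUS_SIZE.search(locus_line)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)

locus = "LOCUS       ABIN01000000             353 rc    DNA     linear   BCT 10-DEC-2007"
size, unit = locus_size_and_unit(locus)

# "rc" (record count) marks a WGS project record that carries no sequence.
is_wgs_master = unit == "rc"
print(size, unit, is_wgs_master)
```

A check like this could be used to skip or special-case WGS master records before handing a file to a sequence parser.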




